SOFTWARE Riss-util Module

The riss_util module contains a variety of small programs and scripts developed by RISS staff to perform various bioinformatics tasks. The module is available on the lab cluster, Itasca, and Mesabi. To load the module run:

$ module load riss_util

Most of the scripts are written in Perl. After loading the module you can view the source code for the scripts at /soft/riss_util/1.0/bin/. If necessary, you can copy a script to your home directory and modify it to suit your needs.

profile.pl

NAME

profile.pl - profile the cpu and memory usage of the computer

SYNOPSIS

profile.pl [-s seconds] [-h] [-i] [-b bins] [-o logfile]

DESCRIPTION

This script collects total memory and CPU usage information for the computer/node it is running on. When the script is killed, it prints ASCII plots to standard output summarizing memory and CPU usage over time. The plots are followed by a list showing the most active process in each bin of the plots.

Options:

`-s seconds`	The number of seconds between polling cpu and memory usage
`-b bins`	The number of bins in the summary histograms
`-i`	Interactive mode: print update to screen after every poll
`-h`	Display usage information
`-o file`	Print output to file instead of STDOUT

EXAMPLE

Start profile.pl at the beginning of your pbs script (after loading the riss_util module) and put it in the background using “&”. Check the standard output file (jobname.oXXXXX) for the results:

$ profile.pl &
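When you want the summary printed as soon as your workload finishes, rather than when the job ends, you can stop the profiler yourself. The following is a minimal sketch, not part of riss_util: `run_profiled` is a hypothetical helper, and it assumes that profile.pl prints its plots when it receives SIGTERM (the description above says only that it does so when killed).

```shell
# Sketch: run a workload while a profiler collects data in the background.
# run_profiled is a hypothetical helper, not part of riss_util.
run_profiled() {
    local profiler=$1
    shift
    $profiler &                 # unquoted on purpose so "profile.pl -s 10" splits into words
    local profiler_pid=$!
    "$@"                        # run the workload in the foreground
    local rc=$?
    # Assumption: the profiler prints its summary when it receives SIGTERM.
    kill -TERM "$profiler_pid" 2>/dev/null
    wait "$profiler_pid" 2>/dev/null || true
    return "$rc"                # pass the workload's exit status back to the caller
}
```

In a PBS script this might be called as `run_profiled "profile.pl -s 10" ./my_analysis.sh`, where my_analysis.sh stands in for your own commands.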

profiles.pl

NAME

profiles.pl - Run profile.pl on all nodes allocated to a job

SYNOPSIS

profiles.pl [-s seconds] [-h] [-i] [-b bins]

DESCRIPTION

Generates memory and cpu usage information for multiple nodes. One nodeXXXX.log file is created for each node allocated to the current job.

Options:

`-s seconds`	The number of seconds between polling cpu and memory usage
`-b bins`	The number of bins in the summary histograms
`-h`	Display usage information

EXAMPLE

Start profiles.pl at the beginning of your pbs script (after loading the riss_util module) and put it in the background using “&”:

$ profiles.pl &

multi-profile.pl

NAME

multi-profile.pl - profile the cpu and memory usage of a multi-node job

SYNOPSIS

multi-profile.pl [-s seconds] [-h] [-i] [-b bins] [-o logfile]

DESCRIPTION

Generates one plot summarizing memory and cpu usage across all nodes in a multi-node job

Options:

`-s seconds`	The number of seconds between polling cpu and memory usage
`-b bins`	The number of bins in the summary histograms
`-i`	Interactive mode: print update to screen after every poll
`-h`	Display usage information
`-o file`	Print output to file instead of STDOUT

EXAMPLE

Start multi-profile.pl at the beginning of your PBS script (after loading the riss_util module) and put it in the background using “&”:

$ multi-profile.pl &

View the profile.png image after the job finishes. The plots use boxplots to show the distribution of memory and CPU usage across all nodes at each time bin. The top plot shows CPU load percentage, which is the number of threads running or ready to run divided by the number of cores, so the load can exceed 100%.
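To make the CPU load percentage concrete, here is a small sketch of the arithmetic. `load_percent` is a hypothetical helper, and multi-profile.pl may compute the value differently. The Linux 1-minute load average is used as a stand-in for the number of threads running or ready to run; note that it also counts tasks in uninterruptible sleep.

```shell
# Sketch: CPU load percentage = load / number of cores * 100.
# load_percent is a hypothetical helper, not part of riss_util.
load_percent() {
    # $1 = load (threads running or ready to run), $2 = number of cores
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", 100 * load / cores }'
}

# On Linux, report the value for the current node using the 1-minute load average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    load_percent "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
fi
```

For example, `load_percent 12 8` prints 150.0: twelve runnable threads on an eight-core node.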

cleanup

NAME

cleanup - delete all but the most recent pbs.e and pbs.o output files

SYNOPSIS

cleanup [-d]

DESCRIPTION

Submitting the same PBS script to a queue multiple times produces many standard error and standard output files. This script deletes the old files for you, leaving the most recent pair. It finds all files ending in .pbs.eXXXXX and .pbs.oXXXXX (where XXXXX is the job number) and removes all but the most recent .e and .o file for each .pbs file, as determined by the job number, not the file modification date. Run without any options, the script lists which files would be deleted and which would be kept. Run with the -d option, the script actually deletes the files.

Options:

`-d`	Delete old .e and .o files
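The selection rule described above (for each .pbs file, keep the .e and .o file with the highest job number) can be sketched as follows. This is an illustration only, not the cleanup script's code: `old_pbs_outputs` is a hypothetical helper, and it only lists files and never deletes anything.

```shell
# Sketch: list PBS output files that are older than the newest one of their kind.
# old_pbs_outputs is a hypothetical helper, not part of riss_util.
old_pbs_outputs() {
    dir=${1:-.}                                   # directory to scan (default: current)
    for f in "$dir"/*.pbs.[eo]*; do
        printf '%s\n' "${f##*/}"                  # file name without the directory
    done | awk '
        match($0, /\.pbs\.[eo][0-9]+$/) {
            key = substr($0, 1, RSTART + 5)       # e.g. "run.pbs.e"
            job = substr($0, RSTART + 6) + 0      # job number, compared numerically
            names[NR] = $0; keys[NR] = key; jobs[NR] = job
            if (!(key in newest) || job > newest[key]) newest[key] = job
        }
        END {
            for (i = 1; i <= NR; i++)
                if ((i in names) && jobs[i] < newest[keys[i]]) print names[i]
        }'
}
```

Comparing job numbers as numbers matters: a plain text sort would rank job 99 as newer than job 100.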

fastqqualityplot.pl

NAME

fastqqualityplot.pl - Generate per-base quality plot for multiple fastq files

SYNOPSIS

fastqqualityplot.pl -f /fastq/folder [-m mappingfile]

DESCRIPTION

Generate a per-base quality plot for multiple fastq files.

Options:

`-f folder`	A folder containing fastq files to process
`-m file`	Full path to a mapping file
`-o file`	Name of the output image file (fastqqualityplot.png)
`-p integer`	Number of processors to use (number of threads to run) (this doesn’t work yet...)
`-s integer`	Subsample the specified number of reads from each fastq file. 0 = no subsampling
`-h`	Print usage instructions and exit
`-v`	Print more information while running (verbose)
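To illustrate what a per-base quality summary contains, the sketch below prints the mean quality score at each read position of a single fastq file. It is not how fastqqualityplot.pl works internally: `per_base_quality` is a hypothetical helper, and it assumes four-line fastq records with Phred+33 quality encoding.

```shell
# Sketch: mean quality score at each read position of one fastq file.
# Assumptions: four-line fastq records and Phred+33 quality encoding.
# per_base_quality is a hypothetical helper, not part of riss_util.
per_base_quality() {
    awk '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }   # character -> ASCII code
        NR % 4 == 0 {                                                     # every fourth line holds qualities
            n = length($0)
            if (n > maxlen) maxlen = n
            for (i = 1; i <= n; i++) {
                sum[i] += ord[substr($0, i, 1)] - 33
                count[i]++
            }
        }
        END {
            for (i = 1; i <= maxlen; i++)
                printf "%d\t%.1f\n", i, sum[i] / count[i]
        }' "$1"
}
```

The output has one line per read position: the position, a tab, and the mean score.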

EXAMPLE

Run the script:

$ fastqqualityplot.pl -f /home/msistaff/public/garbe/sampledata/RNAseq/Hsap/fastq/ -s 4000 -o fastqqualityplot-sample2

insertsize.pl

NAME

insertsize.pl - Calculate the insert size mean and standard deviation of a paired-end dataset

SYNOPSIS

insertsize.pl [-m 1] bowtieindex R1.fastq R2.fastq

DESCRIPTION

Calculate the insert size mean and standard deviation by aligning some reads from a pair of fastq files to a Bowtie2 index.

Options:

`-b bowtieindex`	A Bowtie2 index
`-m integer`	The first N million reads from the fastq files will be aligned (default 1)
`-p integer`	Number of threads to use (default $PBS_NUM_PPN or 1)
`-h`	Print usage instructions and exit
`-v`	Print more information while running (verbose)
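Taking the first N million reads of a fastq file amounts to taking the first 4 x N million lines, since each read occupies four lines. A minimal sketch of that step, assuming four-line records; `first_million_reads` is a hypothetical helper, and insertsize.pl may select reads differently:

```shell
# Sketch: print the first N million reads of a fastq file (four lines per read).
# first_million_reads is a hypothetical helper, not part of riss_util.
first_million_reads() {
    # $1 = N in millions of reads (may be fractional, such as .1), $2 = fastq file
    lines=$(awk -v m="$1" 'BEGIN { printf "%d", 4 * int(m * 1000000 + 0.5) }')
    head -n "$lines" "$2"
}
```

For example, `first_million_reads .1 R1.fastq` prints the first 100,000 reads (400,000 lines).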

EXAMPLE

Run the script:

$ insertsize.pl bowtieindex R1.fastq R2.fastq

Runtime: 15 seconds using “-m .1 -p 8” on Itasca, 102 seconds using “-m 1 -p 8” on Itasca

insertplot.pl

NAME

insertplot.pl - generate a fragment-length plot from Picard output

SYNOPSIS

insertplot.pl insert_summary1.txt [insert_summary2.txt ...]
insertplot.pl -f filelist.txt

DESCRIPTION

Generate a plot summarizing multiple Picard-tools insert-size-metrics output files. R is required, as well as the R package ggplot2.

Options:

`-f filelist.txt`	Provide a file with a list of Picard insert-size-metrics output files, one per line. A second tab-delimited column may be included containing sample names
`-h`	Print usage instructions and exit
`-v`	Print more information while running (verbose)
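A file list with sample names can be generated rather than typed by hand. The sketch below is one way to do it, assuming each sample has its own directory (for example heart.1/insertmetrics.txt) so that the directory name can serve as the sample name; `make_filelist` is a hypothetical helper, not part of riss_util.

```shell
# Sketch: build a two-column, tab-delimited file list (path, sample name).
# Assumes one directory per sample; the directory name becomes the sample name.
# make_filelist is a hypothetical helper, not part of riss_util.
make_filelist() {
    for f in "$@"; do
        sample=$(basename "$(dirname "$f")")
        printf '%s\t%s\n' "$f" "$sample"
    done
}
```

A possible use is `make_filelist */insertmetrics.txt > filelist.txt`, followed by `insertplot.pl -f filelist.txt`.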

EXAMPLE

Generate a plot from six different picard output files:

$ cd /home/msistaff/public/garbe/sampledata/RNAseq/Hsap/analysis
$ insertplot.pl heart.1/insertmetrics.txt heart.2/insertmetrics.txt heart.3/insertmetrics.txt skeletal.1/insertmetrics.txt skeletal.2/insertmetrics.txt skeletal.3/insertmetrics.txt

fastq-species-blast.pl

NAME

fastq-species-blast.pl - Given a fastq file, blast a sample of the sequences and count how many hits there are to each species.

SYNOPSIS

fastq-species-blast.pl [-n number_of_sequences_to_blast] [-t num_threads] [-d blast_database(s)] input.fastq

DESCRIPTION

fastq-species-blast.pl blasts a small number of fastq reads against a BLAST database to determine which species the fastq file contains and whether it includes significant amounts of contaminating sequence from other species. The -n option specifies how many reads from input.fastq should be BLASTed (default 10). The -t option specifies how many processor cores to use (default 1; this script cannot run across multiple nodes). The -d option specifies which BLAST database to use (default htgs). Any database installed with the local NCBI BLAST installation can be used (the taxdb must be installed). To blast against multiple databases, quote a space-separated list: fastq-species-blast.pl input.fastq -d "human_genomic vector"

EXAMPLE

Blast 10 fastq sequences (the default) against the htgs database (the default):

$ fastq-species-blast.pl /home/msistaff/public/garbe/sampledata/RNAseq/Hsap/fastq/heart-1_R1.fastq
6 out of 10 sequences (60%) have a hit in the htgs blast database
   Common name               Scientific name      # of sequences
   grivet                    Chlorocebus aethiops 1
   cattle                    Bos taurus           1
   white-tufted-ear marmoset Callithrix jacchus   1
   human                     Homo sapiens         3

fastq-cat.pl

NAME

fastq-cat.pl - Concatenate fastq files

SYNOPSIS

fastq-cat.pl /fastq/folder

DESCRIPTION

This script identifies samples spread across multiple fastq files and generates cat commands to concatenate them together. Symlink commands are generated for single-file samples. This script only generates the commands to concatenate and link files. Run “fastq-cat.pl FOLDER | bash” to generate the concatenated and linked files.

Options:

-f FOLDER Folder containing fastq files

EXAMPLE

Create a directory to contain the concatenated files:

$ mkdir fastq-cat
$ cd fastq-cat

Generate the concatenation commands:

$ fastq-cat.pl ~/fastq-files > fastq-commands.txt

Execute the concatenation commands:

$ bash fastq-commands.txt

redup.pl

NAME

redup.pl - Remove exact duplicate reads from paired-end fastq files

SYNOPSIS

redup.pl [-n N] sample1_R1.fastq sample1_R2.fastq > topdups.fasta

Options:

`-n integer`	Print out this many of the most duplicated sequences
`-h`	Display usage information

DESCRIPTION

This script removes duplicate paired-end reads from the input files sample1_R1.fastq and sample1_R2.fastq and prints the unique reads to the files sample1_R1.fastq.unique and sample1_R2.fastq.unique. Reads must have exactly the same sequence to be called duplicates; quality scores are ignored. The top N (default 20) most duplicated sequences are printed in fasta format, which makes it convenient to identify them with BLAST.
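The duplicate definition above (identical sequence in both mates, quality scores ignored) can be illustrated with standard tools. This sketch only counts duplicated pairs; it does not write .unique files or fasta output, and it is not the redup.pl implementation. `top_duplicate_pairs` is a hypothetical helper that assumes four-line fastq records with mates in the same order in both files.

```shell
# Sketch: report the most duplicated read pairs, comparing sequences only.
# Assumes four-line fastq records with mates in the same order in both files.
# top_duplicate_pairs is a hypothetical helper, not part of riss_util.
top_duplicate_pairs() {
    # $1 = R1 fastq, $2 = R2 fastq, $3 = how many pairs to report (default 20)
    paste "$1" "$2" |
    awk -F '\t' 'NR % 4 == 2 { print $1 "\t" $2 }' |   # keep only the sequence lines
    sort | uniq -c | sort -k1,1nr |                    # count identical pairs, most frequent first
    awk '$1 > 1' |                                     # keep pairs seen more than once
    head -n "${3:-20}"
}
```

Each output line holds a count followed by the R1 and R2 sequences of the duplicated pair.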

resync.pl

NAME

resync.pl - Resynchronize a pair of paired-end fastq files.

SYNOPSIS

resync.pl sample1_R1.fastq sample1_R2.fastq [sample1_R1_synced.fastq sample1_R2_synced.fastq]

DESCRIPTION

Programs that process paired-end fastq files usually require that the Nth read in the R1 fastq file and the Nth read in the R2 fastq file are mates. Using trimming or filtering programs that aren’t paired-end aware often results in reads being removed from one paired-end fastq file but not the other, resulting in “unsynchronized” files. This program reads in two unsynchronized fastq files and writes out two synchronized fastq files. The synchronized files have properly paired reads, with singleton reads removed. Casava 1.7 and 1.8 read ID formats are supported. This program shouldn’t use much memory (<1GB), but in a worst-case scenario maximum memory use could be equivalent to the size of one input file.

Options:

`-h`	Display usage information
`-s`	Save singletons to .singleton files
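The idea of resynchronization can be sketched with awk: collect the read IDs of one file, then keep only the records of the other file whose ID was seen. This is an illustration, not the resync.pl implementation. `fastq_common` and `resync_sketch` are hypothetical helpers that assume four-line records, non-empty inputs, and that surviving reads keep the same relative order in both files. Read IDs are normalized by dropping anything after the first space (Casava 1.8 style) and a trailing /1 or /2 (Casava 1.7 style). Unlike the memory figure quoted above, this version holds every read ID of one file in memory.

```shell
# Sketch: keep only reads whose ID occurs in both fastq files.
# fastq_common and resync_sketch are hypothetical helpers, not part of riss_util.
fastq_common() {
    # Print the records of fastq file $2 whose read ID also occurs in fastq file $1.
    awk '
        function read_id(line,   parts, id) {
            split(line, parts, " ")          # drop the Casava 1.8 comment after the space
            id = parts[1]
            sub(/\/[12]$/, "", id)           # drop a Casava 1.7 /1 or /2 suffix
            return id
        }
        NR == FNR { if (FNR % 4 == 1) seen[read_id($0)] = 1; next }   # first file: collect IDs
        FNR % 4 == 1 { keep = (read_id($0) in seen) }                 # second file: decide per record
        keep
    ' "$1" "$2"
}

resync_sketch() {
    # $1 = R1 input, $2 = R2 input, $3 = R1 output, $4 = R2 output
    fastq_common "$2" "$1" > "$3"
    fastq_common "$1" "$2" > "$4"
}
```

A call such as `resync_sketch sample1_R1.fastq sample1_R2.fastq sample1_R1_synced.fastq sample1_R2_synced.fastq` mirrors the argument order shown in the synopsis.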

fasterqc.pl

NAME

fasterqc.pl - Combine FastQC output images

SYNOPSIS

fasterqc.pl [-s 100] [-o fasterqc.png]

DESCRIPTION

This script combines FastQC output images into one large png image to make it easy to quickly assess the FastQC output from many samples. When FastQC is run it generates a zip file named SAMPLENAME_fastqc.zip. Run this script in a folder containing one or more of these SAMPLENAME_fastqc.zip files and it will generate a single image containing all of the FastQC images from all samples. It also prints out the “overrepresented sequences” for each sample to the file fasterqc.overrep.txt. Recommended maximum number of fastqc folders is 50. This script works with older and newer versions of FastQC, but won’t work with a mix of old and new version FastQC output files.

Options:

`-s percent`	Scale the final image by the specified percent (valid range 5-100, default 100). Files larger than 5000 pixels wide are automatically scaled to 5000 pixels wide
`-o file`	Save the final image in the specified file (default fasterqc.png)

EXAMPLE

Consolidate the results from 12 FastQC runs into one tiny image:

$ cd /home/msistaff/public/garbe/sampledata/RNAseq/Hsap/fastq/fastqc
$ fasterqc.pl -s 10 -o fasterqc-sample.png

tophatplot.pl

NAME

tophatplot.pl - Generate plots from tophat align_summary.txt output files

SYNOPSIS

tophatplot.pl align_summary1.txt [align_summary2.txt ...]
tophatplot.pl -f filelist.txt

DESCRIPTION

Generate a plot summarizing mapping percentage for multiple samples

Options:

`-f file`	Provide a file with a list of align_summary.txt files, one per line. A second tab-delimited column may be included containing sample names. A third column may be included containing bam files from mapping unmapped reads against a spike-in control reference
`-h`	Display usage information

EXAMPLE

expressiontableplot.pl

NAME

expressiontableplot.pl - Given a table of expression data, generate a series of summary plots including:

- MDS plot
- Dendrogram
- Expression distribution violin plots
- Expressed genes plot

SYNOPSIS

expressiontableplot.pl data.txt

DESCRIPTION

Generate a series of plots summarizing a table of expression data. The input file should be tab delimited with a header. There should be a row for each feature (gene, transcript, exon, etc), and a column for each sample. The first row should contain sample names and the first column feature IDs.
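Before plotting, it can help to confirm that the table is rectangular. The sketch below checks that every row has as many tab-delimited columns as the header. It assumes the header row includes a label for the feature ID column, which the description above does not state explicitly; `check_expression_table` is a hypothetical helper, not part of riss_util.

```shell
# Sketch: check that every row of a tab-delimited table matches the header width.
# Assumption: the header row has a label for the feature ID column.
# check_expression_table is a hypothetical helper, not part of riss_util.
check_expression_table() {
    awk -F '\t' '
        NR == 1 { ncol = NF; next }          # remember the number of header columns
        NF != ncol {
            printf "line %d: expected %d columns, found %d\n", NR, ncol, NF
            bad = 1
        }
        END { exit bad + 0 }                 # non-zero exit status if any row was malformed
    ' "$1"
}
```

The function prints one message per malformed row and returns a non-zero status if any are found, so it can guard a later call, for example `check_expression_table data.txt && expressiontableplot.pl data.txt`.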

Options:

`-n`	Normalize expression values using 75th-percentile (upper-quartile) normalization
`-m integer`	Minimum expression value
`-t string`	Feature type (gene, transcript, exon, etc)
`-h`	Display usage information
`-v`	Verbose output

EXAMPLE

Deprecated scripts

These scripts are no longer supported:

tophatstatsPE.pl: Tophat now produces a file named align_summary.txt containing alignment statistics. Use tophatplot.pl to summarize multiple align_summary.txt files.

cuffplot.pl: Use cuffdiffplot.pl instead; it generates more plots and uses ggplot2 instead of gnuplot.

cuffdiff2_mds_plot.pl: Use cuffdiffplot.pl instead; it generates an MDS plot as well as several other useful plots.

Support

There is a discussion thread for the riss_util module in the MSI Google group: https://groups.google.com/a/umn.edu/forum/#!categories/msi-user-questions/software

Updates and changes to programs in the riss_util module are posted to the thread, and you may post feature requests or bug reports there. You may also email RISS at help@msi.umn.edu.