SOFTWARE Riss-util Module

The riss_util module contains a variety of small programs and scripts developed by RISS staff to perform various bioinformatics tasks. The module is available on the lab cluster, Itasca, and Mesabi. To load the module run:

$ module load riss_util

Most of the scripts are written in perl. After loading the module you can view the source code for the scripts at /soft/riss_util/1.0/bin/. If neccessary you can copy a script to your home directory and modify it to suite your needs.

profile.pl

NAME
profile.pl - profile the cpu and memory usage of the computer
SYNOPSIS
profile.pl [-s seconds] [-h] [-i] [-b bins] [-o logfile]
DESCRIPTION

This script collects total memory and cpu usage information for the computer/node it is running on, and when the script is killed it prints ASCII plots to standard output summarizing memory and cpu usage across time. After the plots is a list showing the most active process in each bin in the plots.

Options:
-s seconds The number of seconds between polling cpu and memory usage
-b bins The number of bins in the summary histograms
-i Interactive mode: print update to screen after every poll
-h Display usage information
-o file Print output to file instead of STDOUT
EXAMPLE

Start profile.pl at the beginning of your pbs script (after loading the riss_util module) and put it in the background using “&”. Check the standard output file (jobname.oXXXXX) for the results:

$ profile.pl &
../_images/profile-sample.jpeg

profiles.pl

NAME
profiles.pl - Run profile.pl on all nodes allocated to a job
SYNOPSIS
profiles.pl [-s seconds] [-h] [-i] [-b bins]
DESCRIPTION

Generates memory and cpu usage information for multiple nodes. One nodeXXXX.log file is created for each node allocated to the current job.

Options:
-s seconds The number of seconds between polling cpu and memory usage
-b bins The number of bins in the summary histograms
-h Display usage information
EXAMPLE

Start profiles.pl at the beginning of your pbs script (after loading the riss_util module) and put it in the background using “&”:

$ profiles.pl &

multi-profile.pl

NAME
multi-profile.pl - profile the cpu and memory usage of a multi-node job
SYNOPSIS
multi-profile.pl [-s seconds] [-h] [-i] [-b bins] [-o logfile]
DESCRIPTION

Generates one plot summarizing memory and cpu usage across all nodes in a multi-node job

Options:
-s seconds The number of seconds between polling cpu and memory usage
-b bins The number of bins in the summary histograms
-i Interactive mode: print update to screen after every poll
-h Display usage information
-o file Print output to file instead of STDOUT

EXAMPLE Start multi-profile.pl at the beginning of your pbs script (after loading the riss_util module) and put it in the background using “&”. View the profile.png image after the job finishes. The plots use boxplots to show the distribution of memory and cpu usage across all nodes at each timepoint bin. The top plot shows CPU load percentage, which is the number of threads running or ready to run, divided by the number of cores (thus the load can be higher than 100%):

$ multi-profile.pl &
../_images/multi-profile-sample.png

cleanup

NAME
cleanup - delete all but the most recent pbs.e and pbs.o output files
SYNOPSIS
cleanup [-d]
DESCRIPTION

Submitting the same pbs script to a queue multiple times results in many different standard error and standard out files. This script will delete all of the old files for you, leaving the most recent pair of files. This script finds all files ending in .pbs.e00000 and .pbs.o0000 and removes all but the most recent (as determined by the job number, not the file modification dates) .e and .o file for each .pbs file. Run without any options the script lists which files should be deleted and which should be kept. Run with the -d option the script will actually delete files.

Options:
-d Delete old .e and .o files

fastqqualityplot.pl

NAME
fastqqualityplot.pl - Generate per-base quality plot for multiple fastq files
SYNOPSIS
fastqqualityplot.pl -f /fastq/folder [-m mappingfile]
DESCRIPTION

Generate per-base quality plot for multiple fastq files

-f folder A folder containing fastq files to process
-m file Full path to a mapping file
-o file Name of the output image file (fastqqualityplot.png)
-p integer Number of processors to use (number of threads to run) (this doesn’t work yet...)
-s integer Subsample the specified number of reads from each fastq file. 0 = no subsampling
-h Print usage instructions and exit
-v Print more information while running (verbose)
EXAMPLE

Run the script:

$ fastqqualityplot.pl -f /home/msistaff/public/garbe/sampledata/RNAseq/Hsap/fastq/ -s 4000 -o fastqqualityplot-sample2
../_images/fastqqualityplot-sample.png

insertsize.pl

NAME
insertsize.pl - Calculate the insert size mean and standard deviation of a paired-end dataset
SYNOPSIS
insertsize.pl [-m 1] bowtieindex R1.fastq R2.fastq
DESCRIPTION

Calculate the insert size mean and standard deviation by aligning some reads from a pair of fastq files to a bowtie2 index

-b bowtieindex A Bowtie2 index
-m integer The first N million reads from the fastq files will be aligned (Default 1)
-p integers Number of threads to use (Default $PBS_NUM_PPN or 1);
-h Print usage instructions and exit
-v Print more information while running (verbose)
EXAMPLE

Run the script:

$ insertsize.pl bowtieindex R1.fastq R2.fastq

Runtime: 15 seconds using “-m .1 -p 8” on Itasca, 102 seconds using “-m 1 -p 8” on Itasca

insertplot.pl

NAME
insertplot.pl - generate a fragment-length plot from Picard output
SYNOPSIS
insertplot.pl insert_summary1.txt [insert_summary2.txt ...] insertplot.pl -f filelist.txt
DESCRIPTION

Generate a plot summarizing multiple Picard-tools insert-size-metrics output files. R is required, as well as the R package ggplot2.

Options:
-f filelist.txt : provide a file with a list of picard insert-size-metrics output files, one per line. A
second tab-delimited column may be included containing sample names
-h : Print usage instructions and exit -v : Print more information whie running (verbose)
EXAMPLE

Generate a plot from six different picard output files:

$ cd /home/msistaff/public/garbe/sampledata/RNAseq/Hsap/analysis
$ insertplot.pl heart.1/insertmetrics.txt heart.2/insertmetrics.txt heart.3/insertmetrics.txt skeletal.1/insertmetrics.txt heart.2/insertmetrics.txt heart.3/insertmetrics.txt
../_images/insertplot-sample.png

fastq-species-blast.pl

NAME
fastq-species-blast.pl - Given a fastq file, blast a sample of the sequences and count how many hits there are to each species.
SYNOPSIS
fastq-species-blast.pl [-n number_of_sequences_to_blast] [-t num_threads] [-d blast_database(s)] input.fastq
DESCRIPTION
fastq-species-blast.pl can be used to blast a small number of fastq reads against a BLAST database in order to determine what species the fastq file contains, and if there are significant amounts of contaminating sequence from other species. The -n option is used to specify how many reads from the input.fastq file shoule be BLASTed (default is 10). The -t option specifies how many processor cores to use (default is 1, this script cannot run across multiple nodes). The -d option specifies which BLAST database to use (default is htgs). Any database installed with the local NCBI Blast installation can be used (the taxdb must be installed). Multiple databases can be blasted against: fastq-species-blast.pl input.fastq -d “human_genomic vector”
EXAMPLE

Blast 10 fastq sequences (the default) against the htgs database (the default):

$ fastq-species-blast.pl
/home/msistaff/public/garbe/sampledata/RNAseq/Hsap/fastq/heart-1_R1.fastq
6 out of 10 sequences (60%) have a hit in the htgs blast database
   Common name               Scientific name      # of sequences
   grivet                    Chlorocebus aethiops 1
   cattle                    Bos taurus           1
   white-tufted-ear marmoset Callithrix jacchus   1
   human                     Homo sapiens         3

fastq-cat.pl

NAME
fastq-cat.pl - Concatenate FastQC files
SYNOPSIS
fastq-cat.pl /fastq/folder
DESCRIPTION

This script identifies samples spread across multiple fastq files and generates cat commands to concatenate them together. Symlink commands are generated for single-file samples. This script only generates the commands to concatenate and link files. Run “fastq-cat.pl FOLDER | bash” to generate the concatenated and linked files.

Options:

-f FOLDER Folder containing fastq files
EXAMPLE

Create a directory to contain the concatenated files:

$ mkdir fastq-cat
$ cd fastq-cat

Generate the concatenation commands:

$ fastq-cat.pl ~/fastq-files > fastq-commands.txt

Execute the concatenation commands:

$ bash fastq-commands.txt

redup.pl

NAME
redup.pl - Remove exact duplicate reads from paired-end fastq files
SYNOPSIS

redup.pl [-n N] sample1_R1.fastq sample1_R2.fastq > topdups.fasta

Options:
-n integer Print out this many of the most duplicated sequences
-h Display usage information
DESCRIPTION
This script removes duplicate paired-end reads from the input files sample1_R1.fastq and sample1_R2.fastq and prints out unique reads to the files sample1_R1.fastq.unique and sample2_R2.fastq.unique. Reads must have the exact same sequence to be called duplicates, quality scores are ignored. The top N (default 20) most duplicated sequences are printed out in fasta format, making it convenient for using BLAST to identify them.

resync.pl

NAME
resync.pl - Resynchronize a pair of paired-end fastq files.
SYNOPSIS
resync.pl sample1_R1.fastq sample1_R2.fastq [sample1_R1_synced.fastq sample1_R2_synced.fastq]
DESCRIPTION

Programs that process paired-end fastq files usually require that the Nth read in the R1 fastq file and the Nth read in the R2 fastq file are mates. Using trimming or filtering programs that aren’t paired-end aware often results in reads being removed from one paired-end fastq file but not the other, resulting in “unsyncronized” files. This program reads in two unsynchronized fastq files and writes out two synchronized fastq files. The synchronized files have properly paired reads, with singleton reads removed. Casava 1.7 and 1.8 read ID formats are supported. This program shouldn’t use much memory (<1GB), but maximum memory use could be equivalent to the size of one input file in a worst-case scenario.

Options:
-h : Display usage information -s : Save singletons to .singleton files

fasterqc.pl

NAME
fasterqc.pl - Combine FastQC output images
SYNOPSIS
fasterqc.pl [-s 100] [-o fasterqc.png]
DESCRIPTION

This script combines FastQC output images into one large png image to make it easy to quickly assess the FastQC output from many samples. When FastQC is run it generates a zip file named SAMPLENAME_fastqc.zip. Run this script in a folder containing one or more of these SAMPLENAME_fastqc.zip files and it will generate a single image containing all of the FastQC images from all samples. It also prints out the “overrepresented sequences” for each sample to the file fasterqc.overrep.txt. Recommended maximum number of fastqc folders is 50. This script works with older and newer versions of FastQC, but won’t work with a mix of old and new version FastQC output files.

Options:
-s percent Scale the final image by the specified percent (valid range 5-100, default 100). Files larger than 5000 pixels wide are automatically scaled to 5000 pixels wide
-o file Save the final image in the specified file (default fasterqc.png)
EXAMPLE

Consolidate the results from 12 FastQC runs into one tiny image:

$ cd /home/msistaff/public/garbe/sampledata/RNAseq/Hsap/fastq/fastqc
$ fasterqc.pl -s 10 -o fasterqc-sample.png
../_images/fasterqc-sample.png

tophatplot.pl

NAME
tophatplot.pl - Generate plots from tophat align_summary.txt output files
SYNOPSIS
tophatplot.pl align_summary1.txt [align_summary2.txt ...] tophatplot.pl -f filelist.txt
DESCRIPTION

Generate a plot summarizing mapping percentage for multiple samples

Options:
-f file Provide a file with a list of align_summary.txt files, one per line. A second tab-delimited column may be included containing sample names. A third column may be included containing bam files from mapping unmapping reads against a spike-in control reference
-h Display usage information

EXAMPLE

software/alignments-pairs.png software/alignments-pairs-pct.png

expressiontableplot.pl

NAME

expressiontableplot.pl - Given a table of expression data, generate a series of summary plots including:

-MDS plot -Dendogram -Expression distribution violin plots -Expressed genes plot
SYNOPSIS
expressiontableplot.pl data.txt
DESCRIPTION

Generate a series of plots summarizing a table of expression data. The input file should be tab delimited with a header. There should be a row for each feature (gene, transcript, exon, etc), and a column for each sample. The first row should contain sample names and the first column feature IDs.

Options:
-n Normalize expression values: 75% quartile normalization
-m integer Minimum expression value
-t string Feature type (gene, transcript, exon, etc)
-h Display usage information
-v Verbose output

EXAMPLE

software/expressedplot.png software/expressiondistributionplot.png software/dendogramplot.png software/mdsplot.png

Deprecated scripts

These scripts are no longer supported:

tophatstatsPE.pl:
 Tophat now produces a file name align_summary.txt containing alignment statistics. Use tophatplot.pl to summarize multiple align_summary.txt files
cuffplot.pl:Use cuffdiffplot.pl instead, it genreates more plots and uses ggplot2 instead of gnuplot
cuffdiff2_mds_plot.pl:
 Use cuffdiffplot.pl instead, it generates an mds plot as well as several other useful plots

Support

There is a discussion thread for the riss_util module in the MSI google group: https://groups.google.com/a/umn.edu/forum/#!categories/msi-user-questions/software Updates and changes to programs in the riss_util module are posted to the thread, and you may post feature requests or bug reports to the thread. You may also email RISS at help@msi.umn.edu