latest news

06.29.2019

VisCello; for visualization of single cell data.

access info ...

06.27.2018

Sample data provenance from 1,347 RNAseq samples.

access info ...

06.07.2018

ORNASEQ: Ontology for RNA sequencing.

access info ...

Penn SCAP-T Pipeline: Documentation

Back ↩ License (pdf)

Module: BLAST

Randomly samples the raw reads file 5,000 times and then blasts those reads against the neucleotide database. The resulting species mappings are summarized in the file 'sampleID.blast.stats.txt'.

Usage:
    ngs.sh blast [-r numReads] [-k kmer] -p numProc -s species sampleID
Input:
    sampleID/init/unaligned_1.fq
Output:
    sampleID/blast/blast.txt (blast output)
    sampleID/blast/sampleID.blast.stats.txt (species hit counts)
Requires:
    blastn ( ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ )
    randomSample.py ( https://github.com/safisher/ngs )
    parseBlast.py ( https://github.com/safisher/ngs )
Options:
    -r numReads - number of reads to randomly select (default = 5000)
    -k kmer - check for presence of k-mer in reads that failed to align
    -p numProc - number of cpu to use
    -s species - expected species

Run blast on 5000 reads randomly sampled from init/unaligned_1.fq. The output is put in a directory called 'blast'. The species.txt file contains number of reads mapping to each species (mouse, rat, human, bacteria).

The blast mappings are split into two files: reads.counted.txt and reads.notCounted.txt. Reads.notCounted.txt contains all reads that mapped to something that wasn’t one of the species we track (i.e. mouse, rat, human, bacteria, fly, zebra fish, yeast, ERCC). Browsing this file should elucidate the species of the uncounted mappings.

Blast paramters used:

num_descriptions: 10 
num_alignments: 10 
word_size: 15 
gapopen: 3 
gapextend: 1 
evalue: 1e-15

parseBlast.py

Blast hits are counted using parseBlast.py as follows:

Up to 10 mappings are reported per read ("num_descriptions: 10"). If any one of those 10 mappings is to the target species then the read is counted at mapping to the target. If the read does not map to the target species then the read is counted as mapping to all other species that were reported. For example if a read maps to human and mouse while human is the target species, then that read is only counted as mapping to human. If a read maps to mouse and rat with human being the target species then that read is counted both as a mouse hit and a rat hit.