latest news

06.29.2019

VisCello; for visualization of single cell data.

access info ...

06.27.2018

Sample data provenance from 1,347 RNAseq samples.

access info ...

06.07.2018

ORNASEQ: Ontology for RNA sequencing.

access info ...

Penn SCAP-T Pipeline: Documentation

Back ↩ License (pdf)

Module: TRIM

Trim reads using contaminants listed in the 'contaminants.fa' file.

Usage: ngs.sh trim [-i inputDir] [-p] [-c contaminantsFile] [-m minLen] [-q phredThreshold] [-rN] [-rAT numBases] [-se] sampleID
Input:
    /lab/repo/resources/trim/contaminants.fa (file containing contaminants)
    sampleID/inputDir/unaligned_1.fq
    sampleID/inputDir/unaligned_2.fq (paired-end reads)
Output:
    sampleID/trim/unaligned_1.fq
    sampleID/trim/unaligned_2.fq (paired-end reads)
    sampleID/trim/sampleID.trim.stats.txt
    sampleID/trim/contaminants.fa (contaminants file)
Requires:
    trimReads.py ( https://github.com/safisher/ngs )
    Python version 2.7 or later
Options:
    -i inputDir - location of source files (default: orig).
    -p - Pad paired reads so that they are the same length after all trimming has occured. N's will be added to the 3' end with '#' added to the quality score for each N that is added. This will not do anything for single-end reads. (default: no padding).
    -c contaminantsFile - file containing contaminants to be trimmed.
    -m minLen - Minimum size of trimmed read. If trimmed beyond minLen, then read is discarded. If read is paired then read is replaced with N's, unless both reads in pair are smaller than minLen in which case the pair is discarded. (default: 0).
    -q phredThreshold - replace base with 'N' if Phred score less than phredThrehold (default: 0). Quality scores are NOT scaled based on encoding scheme. So threshold value should be unscaled. For example, use a threshold of 53 to filter Illumina 1.8 fastq files (Phred+33) based on a Phred score of 20. The quality scores are not changed when low-quality bases are replaced.
    -rN - trim N's from both ends of reads (default: do not trim).
    -rAT numBases - number of polyA/T bases to trim (default: 0).
    -se - single-end reads (default: paired-end)

Trimmed data is placed in 'sampleID/trim'. The contaminants file that was used is copied into the trim directory for future reference.

The TRIM module uses the trimReads.py program to perform the trimming. trimReads.py is available from GitHub. Since the running of trimReads.py is handled by the TRIM module, the user only needs to be aware of the formatting of the contaminants file and the ordering of the trimming.


trimReads.py

Trimming will happen in the following order depending on which options are selected:

  1. Replace any base with N if the quality score is below the Phred threshold. Quality score is not changed.
  2. Remove N's from both ends.
  3. Process contaminants file, removing contaminants based on their order in the contaminants file.
  4. Single-end: discard read if shorter than the minimum length.
  5. Paired-end: if only one of the paired reads is shorter than the minimum length, then replace that read's sequence with N's and replace that read's quality scores with #. If both paired reads are shorter than the minimum length, then discard the read pair.
  6. Pad paired reads with N's so that they are both the same length. For every N that is added, also add a # to the quality score.
  7. Add "L:X" to the read header with X being the new length of the sequence (including any N's that were added to the sequence).

The contaminants file (fasta-like file) should be formatted as follows:

  1. Sequences must be on a single line (ie can't contain line breaks).
  2. No blank lines in file.
  3. Sequence header is space delimited list of options that begins with a '>' (similar to fasta files). Option names and values should be separated by a colon.
     Example header "> name:oligo end:3 size:10 windows:5"
2. Options:
  • name: this option is required for every sequence
  • method: there are three trimming methods (0 = full contaminant, 1 = mapped contaminant, 2 = identity based). (default 2)
    • 0) Full Contaminant: Full contaminant trimming means that when a k-mer is mapped then it is expected that the entire contaminant mapped and the read is trimmed accordingly. For example lets assume we have a k-mer that is located 4 bases from the 5' end of a contaminant. If that k-mer maps then we would shift where we trim the read by 4 bases in the direction of the 5' end of the read. We would then remove all bases from that position to the 3' end of the read, regardless of the additional bases mapped to the contaminant.
    • 1) Mapped Contaminant: Mapped contaminant trimming means that when a k-mer is mapped then we extend the mapping and trimmed accord to the mapping. For example lets assume we have a k-mer that is located 4 bases from the 5' end of a contaminant. If that k-mer maps then we would extend the mapped region one base at a time, in the 5' direction until we found a base that didn't map. We would then trim from that postion to the 3' end of the read.
    • 2) Identity Based: If a k-mer maps to the read then the location of the mapping is used to anchor the contaminant to the read. The percent and total identity between the contaminant and the read is computed. If both the percent and total identity are above a user-defined threshold then the read is trimmed from the beginning of the contaminant to the 3' end of the read. If not then the read is not trimmed.
  • size: size of k-mer (default 7)
  • windows: how many k-mers to seek. can not be larger than (contaminant length - k-mer). (default 6)
  • percentIdentity: percent identity threshold for trimming method 2 (0.0 < percentIdentity <= 1.0). (default 0.9)
  • totalIdentity: total identity threshold for trimming method 2. If this is less than the k-mer size then it will have no impact on trimming. (default 16)

Here is an example of the contents of a contaminants file:

> name:indexAdapter method:2 percentIdentity:.9 totalIdentity:16 size:10
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
> name:univAdapter method:2 percentIdentity:.9 totalIdentity:16 size:10
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT