Kim Lab Website

Software Repository
Kim Laboratory
Computational Evolutionary Biology

latest news

06.29.2019

VisCello; for visualization of single cell data.

access info ...

06.27.2018

Sample data provenance from 1,347 RNAseq samples.

access info ...

06.07.2018

ORNASEQ: Ontology for RNA sequencing.

access info ...

contact us

Biology Department
School of Arts and Sciences
University of Pennsylvania
301a Lynch Laboratory
433 S University Avenue
Philadelphia, PA 19104 USA

off: (215) 746-5187
lab: (215) 898-8395
fax: (215) 898-8780

email: junhyong@sas.upenn.edu

additional links

Sniper: SNP identification using Probability of Every Read

Maintained by Daniel F. Simola

License (pdf)

About Sniper

Sniper is a Bayesian probabilistic model that enables SNP discovery in both unique and repetitive regions of a genome by utilizing the information from multiply-mapped sequence reads.

Citation

Daniel F. Simola and Junhyong Kim
Sniper: Improved SNP discovery by multiply mapping deep sequenced reads
Genome Biology 2011, 12:R55 doi:10.1186/gb-2011-12-6-r55
Link to article

Downloads

Latest Sniper.py script and demo

Version history

Sniper.py Python script (version 1.6.4) (Tue, 4 December 2012)

Added --path flag to specify absolute path information for distributed calls. Added check for proper sam header in unique.map files before distributed genotyping (avoids looping through entire file). Included Python.h file in sniper_pkg to build snipercore.c, when Python.h is not in your default path; this Python.h file is for Python v2.7. NB, this file should be located in your python directory, e.g., /path/to/include/python2.x/Python.h.
Sniper.py Python script (version 1.6.3) (Wed, 14 November 2012)

Changed the default fastq nucleotide quality encoding to phred33 for bowtie: --phred33-quals.
Sniper.py Python script (version 1.6.2) (Tue, 13 November 2012)

Changed the default maxdepth parameter to 500; i.e., if a locus has greater than 500 alignments, sniper will randomly subsample 500 reads for SNP calling at this locus. Unix sort (used for sorting unique sam files) now uses the sniper_results/tmp/ directory for temporary files, rather than the default $TMPDIR environment variable.
Sniper.py Python script (version 1.6.1) (Tue, 4 October 2011)

Added new 1-parameter prior probability model that provides a uniform prior on non-reference alleles: --prior uniform: h0:1-3*theta, t1:theta, h2:theta, t2:theta. The preivous uniform prior was renamed as "naive". Added --integer-quals flag. Added --btcustom <STR> flag for specifying arbitrary CLI options to bowtie. Fixed a bug when manually specifying PE read insertion size. Fixed a bug where only the first sample ID in a map file would be genotyped under default settings.
Sniper.py Python script (version 1.6.0) (Wed, 24 August 2011)

Added --snponly/--so <Q> flag, where only non-reference genotypes having stringency at least Q are printed, rather than genotypes for every nucleotide locus. Added the --groupchrom flag, which forces sniper to group jobs for multiple chromosomes into a single job. This is default for -x 1 only. Fixed a bug where multiple SAM headers could get saved to a single file.
Sniper.py Python script (version 1.5.9) (Tue, 19 July 2011)

Added --bychrom flag, which generates a separate file for each chromsoome/scaffold in the reference genome. This allows you to add additional processors to an ongoing run to decrease runtime. This should also result in the fastest overall runtime when used with multiple processors (-x <INT>), so by default setting -x > 1 activates --bychrom.
Sniper.py Python script (version 1.5.8) (Wed, 25 May 2011)

Minor bug fix.
Sniper.py Python script (version 1.5.7) (Wed, 25 May 2011)

No longer need to provide a map file for genotyping if preexisting read maps exist. Simply specify the read map directory and sample IDs.
e.g. sniper.py --snp -i <readmapdir> -ids myfirstgenome
Sniper.py Python script (version 1.5.6) (Tue, 24 May 2011)

Added License statement to version information.
Sniper.py Python script (version 1.5.5) (Thu, 19 May 2011)

Changed the default read map organization to --combo. That is, by default only 2 map files (samplename.unique.map and samplename.degenerate.map) will be created by -O/-rO. To split SE and PE portions, use --splitmap. Otherwise improved clarity of CLI help.
Sniper.py Python script (version 1.5.4) (Tue, 17 May 2011)
- The read likelihood computation has been implemented in C, resulting in ~3x speed improvement. Note, this requires that you build the included snipercore.c script, generating a shared object module for python. Run python setup.py build in the sniper directory, then make sure the snipercore.so file is in your PYTHONPATH. This code will be used automatically.
- By default SE and PE maps are created separately for unique and degenerate reads. Instead, SE and PE maps can now be merged during map organization using the -O -combo flags. For genotyping, this results in fewer files that need random access, thus improving disk I/O performance, especially when running on a networked filesystem.
- An optional --maxdepth INT is provided to cap the number of reads used to evaluate genotype at a single locus. This is set to 200 reads by default. If the read pileup at a locus exceeds 200, a subset of 200 will be selected randomly.
- If -x > 1 and only one sample is selected for processing, genotyping will be divide each reference chromosome/scaffold into x pieces.
- Bug fix: resolved a high-memory situation during SNP filtering (--sig).
Sniper.py Python script (version 1.5.3)

Minor bug fix where infinite looping could occur while loading the degenerate map index.
Sniper.py Python script (version 1.5.2)

Minor performance improvements.
Sniper.py Python script (version 1.5.1)

Improves the way degenerate maps are indexed, resulting in smaller index files.
Sniper.py Python script (version 1.5)

This is a very stable build and should be used over all others. Aside from minor bug fixes, this version includes several useful features:
- Easy to perform genotyping only if you have a preexisting map file (either unique or multiply mapped) using the --link or --exO flags (see help documentation for more information).
- Indexing of read map files using --index, resulting in substantial memory improvements. Notably the unique and degenerate maps are sorted and indexed for stream-processing, decreasing memory usage. For degenerate maps. a sorted file suitable for stream-processing is created. However this file can be rather large since for each position in the genome it lists all alignments (up to user specified -d) related to reads that have one alignment covering the position of interest. If you do not want to generate the deg map index, use the --indexuni flag instead. By default the portion of the deg index relevant to the entire chromosome is loaded into memory. If memory usage is top priority, you can stream-process this file using the --stream flag.
Sniper.py Python script (version 1.4)

Unique reads are low processed in a stream, just like other SNP callers. Following read map organization (-O) unique maps are sorted and indexed (or using --index). During genotyping the unique maps no longer need to be scanned, reducing memory usage and runtime.
Sniper.py Python script (version 1.3.5)

Added --mapqual flag which uses read quality values during bowtie mapping.
Sniper.py Python script (version 1.3.4)

Minor bug fixes.
Sniper.py Python script (version 1.3.3)

Mon, 14 Mar 2011: Fixed a big where using the best-noguess mapping style (--bestno) would result in a unique map.
Sniper.py Python script (version 1.3.2)

Mon, 14 Mar 2011: It turns out that linecache loads the entire read map into memory, thus negating any actual memory savings for large data sets. So, I implemented a custom file indexing and read caching procedure. A partial or full index of the read map is generated (determined by --lip <INT> flag), which is used for random access of particular reads from disk.

For example, a 1.5GB index in memory will cover about 10 million reads. So a large data set, say 100x coverage on the human genome, will require about 3e9 bp x 100-fold / 200 bp reads ~ 1e9 reads, and thus 1e9/1e7 = 100 x 1.5 GB = 150 GB of RAM to index the entire data set. In this case setting --lip 10 will reduce this to 15GB RAM in turn for 10 scans of the read map. This can be reduced further if multiprocessing is used (-x <INT>) in which case roughly only a 1/x fraction of the read map must be indexed per process. Of course the size of the degenerate map will inflate the index size.
Sniper.py Python script (version 1.3)

Wed, 09 Mar 2011: Drastically reduced memory requirements using linecache module. Corrected a situation where degenerate paired end alignments could falsely get incorporated into the read pileup at the wrong position.
Sniper.py Python script (version 1.2)

Tue, 08 Mar 2011: Fixed a bug in post-mapping read organization where only 1 of 2 PE reads would be written to PE.unique.map files.
Sniper.py Python script (version 1.1)

Thu, 03 Mar 2011: Improved/integrated multi-CPU and multi-processor parallelization using Sun Grid Engine. Included --fwer flag for family-wise error rate multiple test correction of per-locus posterior probability cutoff required for significance.
Sniper.py Python script (version 1.0)

Mon, 31 Jan 2011: Initial release

Requirements

Requires Unix/Mac OS X/CYGWIN with Python 2.5 or 2.6 (and preferably 64-bit for memory allocation > 4GB).

Demo instructions

Expand zip file.
unzip sniper_pkg.zip
Move to demo directory.
cd sniper_pkg
Follow instructions in enclosed README.txt file

Help using `sniper.py`

Standard execution:

sniper.py -m 'map_file.txt' -r 'reference_genome.fasta' -z

Two-step execution for paired-end read data:

sniper.py -m 'map_file.txt' -r 'reference_genome.fasta' --ins

sniper.py -m 'map_file.txt' -r 'reference_genome.fasta' -z

This will align raw NGS reads, perform genotyping, and provide diagnostics and new assemblies for all samples according to the map file. Results will be stored in a new directory 'sniper_results' in your current working directory.

Standard arguments:

-m: map file (required)
-r: reference genome (required, fasta format)
-o: primary directory to save results directory (default: current working directory, './')
--name: name of the results directory (default: 'sniper_results')
-z: perform/redo all procedures (same as -a -O -s --diagnostics --saveassmebly)
-a: align reads (generate read map)
-O: organize read map into singly-mapped and multiply-mapped reads
-exO: given existing read map, organize read map into singly-mapped and multiply-mapped reads
-s: SNP calling
-lip: number of separate pieces to load into memory; typically ≤ number of chromosomes in the reference genome (larger values are slower but use less memory)
--diagnostics: generate text file summarizing SNP calls
--saveassembly: create a new fasta file of the reference genome modified by significant SNP calls. (NB it is recommended to use in conjunction with --Q to specify desired phred stringency. Ex: --saveassembly -Q 40 or --saveassembly -Q 20,30,40.)

Important parameters:

Example:

sniper.py -m 'map_file.txt' -r 'reference_genome.fasta' -z -k 2 -d 10 -e 30 --lambda 0.5 --readlength 38 --prior resequence --theta 0.01

Arguments:

-k: maximum number of alignment mismatches (default: 1)
-d: maximum number of alignments per read (default: 25)
-e: global sequencing error rate (specified as phred value; E.g. Q=30 == P<0.001) (default: 35)
--lambda: reweight the likelihood model towards read-specific or global binomial probability (default: 0.67 towards binomial)
--readlength: length of each sequence read (default: automatic detection)
--prior: prior probability distribution (default: resequence)

Options:
1. resequence: h0:1-theta-theta^2, t1:theta/2, h2:theta/2, t2:theta^2
2. resequence-het: h0:1-theta-theta^2, t1:theta, h2:theta^2/2, t2:theta^2/2
3. maq: h0:(1-theta-theta^2)/2, t1:theta, h2:(1-theta-theta^2)/2, t2:theta^2
4. uniform: h0:0.25, t1:0.25, h2:0.25, t2:0.25
5. haploid: h0:1-theta, t1:theta
--theta: expected divergence (heterozygous and homozygous) from reference genome (default: 0.001)

Other notable arguments:

--peonly: only use paired end reads for SNP identification
--uniq: only use uniquely mapping reads for SNP identification
--colorspace: input files are in ABI SOLiD colorspace format (suffix .csfasta and .qual). NB, currently only works with single-end reads
--colorspacetheta: Haploid proportion of divergence between sample and reference genomes (default: 3/2 theta)
--loadinpieces/--lip: Integer number of groups into which unique reads will be broken (default: 10). Larger values will decrease memory requirements.
--stringency/-Q: Comma-delimited list of phred-quality scores used to filter SNPs
(default: 0,5,10,13,20,30,40,50,60,70,80,90,100,110,120,130,140,150,155,160)
--restrict: Specify a region or file containing a list of regions over which to perform genotyping (default: all)
EX: --restrict chrom01,0,1000,+ will genotype loci 0 to 1000 on chromosome 1, inclusive
EX: --restrict chrom01,0,1000,- will genotype all of chromosome 1 excluding loci 0 to 1000
A boundary file should be a tab-delimited list of one or more regions similarly described, replacing commas with tabs.

To estimate the insert size range for paired-end reads, use --ins
To estimate the expected per-nucleotide sequencing error rate, use --globalerror
To restrict genotyping to a subset of samples specified in the map file, use --sampleids sid1,sid2,...,sidx

Map file format:

The map file is a standard tab-delimited text file indicating how each raw sequence reads file should be processed by sniper. The format is organized as one row per sequenced lane and requires 7 columns of information row, as follows (you can copy as paste):

Run	Lane	SampleID	Alias	RefGenome	PairedEnd	Path
1	1	myfirstgenome	foo	/path/to/ref_egenome	0	/path/to/reads/run1/
2	5	mysecondgenome	bar	/path/to/ref_egenome	1	/path/to/reads/run2/

In this example, the first sample was run on lane 1 and contains single-end reads. The second sample was sequenced on lane 5 and contains paired-end reads. The standard mapper used with sniper.py is bowtie, so the specified reference is the name of the respective index file, excluding file suffix.

The first line shows the headers but is treated as a comment; comments are discarded by Sniper.

Raw sequence read files should contain as a substring the label indicated in the run column and placed in the /path/to/reads directory.

For example the first sample file may be named s_1_sequence.txt or testing1.txt. The second sample should have 2 files since it is paired-end, named s_5_1_sequence.txt and s_5_2_sequence.txt. Colorspace files should take the format s_1_sequence.csfasta and s_1_sequence.qual for the fasta and quality files, respectively.
Additional fields may be added to the map file after the required columns (e.g. concentration, date sequenced, author, etc.).

Estimating insert size distribution for paired-end data

By default paired-end reads are mapped using an insert size range of 0 to 350 nt. Users may specify custom min and max bounds using

--minins: Specify the minimium insert size for all samples
--maxins: Specify the maximum insert size for all samples

Alternatively use --ins to estimate the empirical insert size distribution for each sample specified in the map file:

--ins: Estimate insert size distributions for all samples specified in the map file.
--insstat : Use the percentile pct of the insert size distribution for alignment (default: 0.99)

Note that alignments must first be generated using -a. A file PE_insert_size.txt will be created in the specified output directory that Sniper will use by default if samples are mapped again (using --realign/--ra).

Other CLI arguments

--bowtie/-btd: Specify the base directory of your bowtie installation (e.g. --btd /usr/bin/bowtie-0.12.5) (default: '')
--keepbowtie/--kb: Do not delete the original bowtie SAM file following read organization
--policy: Specify Bowtie alignment policy (default: bestk). Options:
- basic: bowtie -n max_num_mismatches -m 1
  (Maq-like single best alignment)
- all: bowtie -v max_num_mismatches -a
  (Min-mismatches, all alignments)
- bestk: bowtie -v max_num_mismatches -k max_num_alignments --best
  (Min-mismatches, top k out of m total alignments)
- bestm: bowtie -v max_num_mismatches -a -m max_num_alignments --best
  (Min-mismatches, top m if read has at most m alignments)
--peonly: utilize paired-end reads only for SNP calling
--kpe: Permit a maximum number of alignment mismatches for paired-end reads
--kse: Permit a maximum number of alignment mismatches for single-end reads
--qual33: Read qualities are on phred 33 scale instead of default phred 64 scale
--trimreads n5,n3: For alignment, consider a read substring from 1+n5...|r|-n3
--colorspace/-C: NGS data are in ABI SOLiD colorspace
--realign/--ra: perform a new Bowtie alignment, replacing existing alignments
--reorganize/--rO: generate new organized read maps, replacing existing ones
--redosnp/--rs: redo SNP calling, replacing existing files in the snps/ subdirectory

Software RepositoryKim LaboratoryComputational Evolutionary Biology

latest news

06.29.2019

06.27.2018

06.07.2018

contact us

additional links

Sniper: SNP identification using Probability of Every Read

About Sniper

Citation

Downloads

Version history

Requirements

Demo instructions

Help using sniper.py

Standard arguments:

Important parameters:

Other notable arguments:

Map file format:

Estimating insert size distribution for paired-end data

Other CLI arguments

Software Repository
Kim Laboratory
Computational Evolutionary Biology

Help using `sniper.py`