Statistical Assignment Package (SAP)

Main
Research
Publications
Software
CV
Teaching
Contact
Blogs
Index

Introduction

This page contains the documentation for SAP. The package is the presented in two papers (currently in press). If any of the technical documentation is unclear or if you come across dead links please send me an email. If you need to cite SAP use the appropriate of these two references. The package has matured considerably since these papers were submitted. Versions of code used in the publications are available OnRequest.

The package is continually under development. Planned features see the News and Downloads Page.

You can subscribe to this RSS feed for news about SAP such as new versions, bug fixes, possible issues with untested platforms etc.

Go to Installing SAP for download and install instructions.

Contents

Introduction
Contents
What the program does
Theoretical outline
Running the program
Expected run time
Terminal output
HTML output
Caching and restarting analyses.
Using a custom local database.
Running SAP in parallel on many computers.
Program options
General options
NetBlast search
Homologue set compilation
Alignment
Output
Parallel computing
Caching
Caveats and recommended use
The minimum identity parameter and the incomplete database problem.
Neighbour joining vs. Markov chain monte carlo

What the program does

The program has five stages:

  1. A set of relevant homologues is compiled using NetBlast searches against GenBank.
  2. The set of homologues is aligned using ClustalW2.
  3. Markov Chain Monte Carlo (MCMC) or neighbour-joining + bootstrapping is used to sample a large number of phylogenetic trees of the sample-sequence and the set of homologues.
  4. Assignment statistics are calculated from the sampled phylogenies.
  5. HTML output is generated with summary information for the entire analysis as well as detailed information for each assignment.

In parallel mode the stages two, three, and four are distributed to other machines.

Theoretical outline

The assignment of sequences can be done using a number of different statistical frameworks. SAP uses a Bayesian approach where the probability that an environmental sequence belongs to specific taxonomic groups is assigned a probability. We will make the assumption that the species to which the sequence is assigned is represented in a reference database. If this is not the case, the method will calculate the probability that the sequence belongs to any of the taxonomic groups actually represented this database. It is important to emphasize that this method, like any other comparable methods, can only assign sequences to taxonomic groups that are actually represented in the database. SAP makes no attempt to model the structure and sampling representation of the databases to evaluate the probability that the sequence truly belongs to some other taxon not represented in the database.

The probability that the sample-sequence falls as a monophyletic group with database sequences belonging to a given taxon can be formulated as a posterior probability

latex2png equation

where latex2png equation is the sample-sequence, latex2png equation is taxon latex2png equation, and latex2png equation is the set of data base sequences. Including only database sequences for which latex2png equation reduces latex2png equation to a set of relevant homologues that can be aligned to the sample-sequence. Based on a compiled set of homologues a large number of phylogenetic trees can be sampled from a Markov chain Monte Carlo simulation. Since the probability of sampling a tree where the sample-sequence forms a monophyletic group with a given taxon, latex2png equation, is proportional the likelihood, latex2png equation the posterior probability can be calculated as the fraction of sampled trees where latex2png equation.

SAP now also offers a fast alternative to approximate the probability of assignment using neighbour-joining + bootstrapping. The neighbour-joining approach should only be used when the MCMC analysis is not practically possible. The statistical interpretation of the generated bootstrap proportions is not the same as that of the posterior probability generated by the MCMC. For large values (>80), however, the two measures seem to be in agreement. For smaller values the bootstrap proportions do not constitute valid measures of confidence.

Running the program

As of now, SAP is a command line tool. So you need a terminal window or command prompt to run it. The syntax is this:

sap [OPTIONS] <fasta file> [<fasta file> ...]

Do not use spaces in your file names. Use underscore characters instead like in this example:

sap --project myproject -x 90 input_sequences.fata and_some_more.fasta

Expected run time

The time to complete an analysis depends on the speed of your Internet for downloading sequences and annotation and on the way phylogenies are sampled. For each sample-sequence the MCMC analysis takes about an hour. The neighbour-joining + bootstrap analysis takes a few minutes.

Terminal output

SAP prints progress information in the terminal window or to a file if standard output collected in a file. Below is an example of the output for one sample-sequence.

myFastaFile -> myFastaID:
        Retrieval of homologs:
                Entry status: (c)=cached, (d)=downloaded,
                Error types:
                              (!M)=Memory error, (!D)=Download error,
                              (!T)=Annotation error, (!?)=Unknown error

                Searching database... done.
                        66644230(c)* 171854887(c)* 158934074(c)* 157922406(c)* 157922404(c)*
                        157922400(c)* 157922399(c)* 157922398(c)* 157922397(c)*
                        157922396(c) 157922395(c)* 157922394(c) 157922392(c)* 66644281(c)
                        66644273(c)* 66644233(c)* 66644232(c)* 66644231(c)* 66644229(c)*
                        21667367(c) 57639439(c) 57639438(c) 57639437(c) 9279868(c)
                        9279867(c) 9279866(c) 9279859(c) 9279858(c) 9279857(c) 9279856(c)
                        9279855(c) 9279854(c) 9279853(c) 9279852(c) 9279851(c) 9279850(c)
                        9279849(c) 9279848(c) 9279847(c) 7620584(c)* 157922408(c)*
                        157922407(c)* 157922405(c)* 157922401(c)* 66644283(c)* 66644282(c)
                        66644280(c)* 66644279(c)* 66644278(c)* 66644277(c)* 66644276(c)*
                        66644275(c)* 66644274(c) 66644272(c)* 66644271(c)* 66644269(c)
                        66644256(c) 66644252(c) 66644251(c) 66644250(c) 66644248(c)
                        66644247(c) 66644246(c) 66644245(c) 66644244(c) 66644243(c)
                        66644240(c) 66644239(c) 66644235(c) 66644234(c) 66644228(c)*
                        21667360(c) 7620588(c) 7620587(c) 7620586(c) 7620585(c) 7620583(c)
                        7620582(c) 7620581(c) 7620578(c) 7620577(c) 7620576(c) 7620575(c)
                        7620574(c) 7620573(c) 7620572(c) 7620571(c) 7620570(c) 9279907(c)
                        9279863(c) 9279862(c) 52551076(c) 52551075(c) 52551074(c)
                        52551073(c) 157922403(c) 9279865(c) 9279864(c) 66644227(c)
                        7620579(c) 9279905(c)* 9279911(d)* 34329954(c)* 56182384(c)*
                        21667358(c) 18025252(c)* 25071158(c) 25071157(c) 22796303(d)*
                        148790857(d)* 148790856(d)* 126256023(d)* 112253730(d)* 66644286(c)*
                        66644270(c) 66644268(c) 66644267(c) 66644266(c) 66644265(c)
                        66644264(c) 66644263(c) 66644262(c) 66644261(c) 66644260(c)
                        66644259(c) 66644258(c) 66644257(c) 66644255(c) 66644254(c)
                        66644253(c) 66644249(c) 66644242(c) 66644241(c) 66644238(c)
                        66644237(c) 66644236(c) 78057750(d)* 78057749(d) 78057748(d)*
                        78057747(d)* 78057746(d)* 78057745(d)* 78057744(d)* 78057743(d)
                        78057742(d)* 78057741(d)*
        50 significant homologues found.
        50 homologues in set:
                1 phyla: Streptophyta
                1 classes: Coniferopsida
                4 orders: Coniferales Lamiales Piperales Solanales
                8 families: Scrophulariaceae Podocarpaceae Sciadopityaceae Araucariaceae
                        Montiniaceae Cupressaceae Piperaceae Gesneriaceae
                26 genera: Grevea Athrotaxis Thujopsis Fokienia Piper Nemesia Calocedrus
                        Cupressus Callitropsis Sinningia Sequoia Glyptostrobus
                        Platycladus Cryptomeria Metasequoia Tetraclinis Taxodium
                        Sequoiadendron Chamaecyparis Araucaria Microbiota Podocarpus
                        Juniperus Sciadopitys Agathis Taiwania
        Last accepted E-value is 2.402550e-06
        Ratio of lowest to highest bit score is: 0.593898074532
        WARNING: Diversity goal not reached.
        WARNING: Relative bit-score cut-off (0.50) not reached.

        Writing homologues to fasta file... done.
        Issuing sub-tasks:
                ClustalW alignment
                Barcoder tree sampling
                Tree statstics computation

myFastaFile_myFastaID: Alignment:  Computing... done
myFastaFile_myFastaID: Sampling trees:  Computing... done.
myFastaFile_myFastaID: Calculating tree statistics:  Computing... done.

Computing tree statistics summary...
myFastaFile_myFastaID: Calculating tree statistics:  Using cached results.
done

        Generating HTML output...
100.00% myFastaFile_myFastaID
done

Each section of output is identified by the fasta file name and the sample-sequence id. After that progress on compiling the homologue set is printed. Here only one Blast search is performed but several rounds of blast searches may be required. The id of each downloaded GenBank entry is printed along with a entry status and possibly an error indicator as explained in the output. A star indicates that the entry is included in the set. After compilation of the homologue set some statistics are printed. In this case two warnings are issued. "Diversity goal not reached" notifies the user that the standard taxonomic diversity goal of two phyla, three classes, five orders, six families and ten genera have not been reached. This is usually not a problem. However, you do need to check that the diversity of the in the set is not too small. "Relative bit-score cut-off (0.50) not reached" warns the user of the risk that the set of homologues does not adequately exhaust the database. This need not be a concern either. By the end of the analysis it is checked if the set exhausts the database (see the section on HTML output below).

An alignment is then computed, trees are sampled using MCMC, and summary statistics is calculated for the assignment. Finally a grand summary is computed and HTML pages are generated.

HTML output

The tables on the main result page show the posterior probabilities for each taxonomic level exceeding a assignment probability cut-off. Click the sequence name to get details on the assignment. Clicking a taxonomic name takes you to the NCBI TaxBrowser browser. If you want to look at a particular assignment you can also find it in the list of input sequences by following the link at the top of this page. If you click a name you are taken to a summary taxonomy, in the form of a taxonomy tree, showing the posterior probabilities of assigning the sequence to the different taxa. At each tip in the tree is the name of the taxon with the probability of assignment and the number of sampled phylogenies supporting the assignment. Below this tree the multiple alignment of the sample-sequence and homologues is shown. Clicking the names of homologues takes you to the GenBank entry.

Caching and restarting analyses.

SAP caches sub-results in the analysis for two reasons. Sequences and taxonomic annotation downloaded from GenBank may be of use in the assignment of more than one sample-sequence. Caching saves download time. The other reason is that in the unlikely event that you abort the analysis before it is done you can restart the analysis loosing only the step in the analysis where the program was aborted.

The Blast cache stores the blast results. The GenBank cache stores the sequences and taxonomic annotation downloaded from GenBank. The homologue cache stores the compiled set of homologues retrieved from GenBank. The alignment cache contains the alignment files generated by ClustalW. The trees cache stores all output files for the sampled trees. The tree statistics cache stores the inference based on the sampled trees. The stats cache stores miscellaneous files as well as information on pairwise differences between sample-sequences.

The caching of sub-results can amount to quite a lot on the hard disk if you have many sequences. The cache for each sample-sequence will take up about 5-6 Mb on your hard disk. If you sample more than the default number of trees by increasing the number of MCMC generations I think you can roughly calculate the required hard disk space this way: Bytes 15 * #homologs * #MCMCgenerations=

NOTE: If you restart the analysis using different parameters the program will delete the parts of the cache that conflicts with the changed options.

Using a custom local database.

The database file needs to be in Fasta format. The taxonomic annotation for each sequence can be supplied either by using the NCBI formatted header line like this:

>gi|222142915|gb|FJ606797.1| Bombus ignitus vitellogenin mRNA, complete cds
AGCGATAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATAT
TAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATATCCTAA

or just the GI number as sequence id like this:

>222142915
AGCGATAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATAT
TAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATATCCTAA

in which case the taxonomic annotation is retrieved from NCBI. You can also supply the taxonomic annotation in the header line in following format. Note the user of colons and semicolons:

>mySequenceID ; family: Hominidae, genus: Homo, species: Homo sapiens ; H. sapiens (myself)
AGCGATAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATAT
TAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATATCCTAA

The following taxonomic levels are recognized: superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, superclass, class, subclass, infraclass, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, supertribe, tribe, subtribe, supergenus, genus, subgenus, species group, species subgroup, species, subspecies. The name after the last semicolon is the organism name. This is sometimes different from the species or subspecies name.

To run the program against a local database specify the database fasta file as argument to the database option:

sap --database local_database.fasta --project myproject input_sequences.fasta

The first time the program is run against the fasta file it will build a database and write to the directory of the fasta file for future use.

Running SAP in parallel on many computers.

Apart from running normally on one machine the program can also run in two alternative parallel modes. For this to work each machine used must mount the same file system as the one where the analysis is started from.

Use the —hostfile option to specify a file with a newline separated list of host names. The program will then use this pool of machines to distribute the assignment of individual sequences between the machines.

Use the —sge option to run parallel assignments on a Sun Grid Engine cluster with a shared file system. The program uses the SGE queue to manage its sub-tasks. This means that sub-tasks are submitted to the queue individually.

Program options

The options described below are supplied to modify the behaviour of the program for use in situations out of the ordinary. Before you head out trying them all out remember that they all have been assigned their default values for good reason.

Some of the options take an argument. -str-: string, -int-: integer, -float-: floating point number. If you specify a string argument containing spaces or starting with '-' you must enclose it in quotes.

General options

--version
show program's version number and exit.
--onlinehelp
Open online manual in default browser.
-d, --project -str-
Output directory for this project. Results will be written to this directory and cached results will be read from and written to this directory. Defaults to project-hour.minute-month.day.year.
-D, --database -str-
Name of database plugin to use. Default is online GenBank. You can also specify the path to a file in FASTA format to serve as database.
-S, --assignment -str-
Name of sampling plugin to use. Default is Barcoder. The alternative is ConstrainedNJ.
-A, --alignment -str-
Name of alignment plugin to use. Default is ClustalW2.

NetBlast search

-t -float-, --minsignificance -float-
Lower E-value bound for accepting homologue sequences.
-s -float-, --significance -float
E-value cutoff for significant hits.
-n -int-, ---nrsignificant -int-
Minimum number of accepted significant homologues required.
-l -str-, --limitquery -str-
Entrez query to limit blast database. Eg. "COI[GENE] AND Hominidae[ORGANISM]".
--minidentity -float-
Minimum global alignment similarity of best blast hit.
-r -float-, --relbitscore -float-
Ratio of best to worst length normalized bit score accepted.
-m -int-, --maxblasthits -int-
Number of blast hits to retrieve.
--nolowcomplexfilter
Disable low-complexity filtering of sample-sequences.

Homologue set compilation

-q, --quickcompile
Make pairwise alignments to sample-sequence to further ensure that the homologue maches up.
-b BESTHITS, --besthits BESTHITS, --softalignmentlimit BESTHITS
Maximal number of homologues to add to alignment purely based on significant E-value.
-a ALIGNMENTLIMIT, --alignmentlimit ALIGNMENTLIMIT
Maximal number of sequences allowed in the homologue set.
--quick
Do not make pairwise alignments to sample-sequence to ensure that match has exactly same length.
-i INDIVIDUALS, --individuals INDIVIDUALS
Number of best matching individuals from a species to include in homologue set.
--subspecieslevel
Include one of each relevant subspecies (and not only species) if possible.
-p PHYLA, --phyla PHYLA
Number of phyla to try to get into homology set.
-c CLASSES, --classes CLASSES
Number of classes to try to get into homology set.
-o ORDERS, --orders ORDERS
Number of orders to try to get into homology set.
-f FAMILIES, --families FAMILIES
Number of families to try to get into homology set.
-g GENERA, --genera GENERA
Number of genera to try to get into homology set.
--minimaltaxonomy -int-
Minimal number of annotated taxonomic levels in accepted taxonomy annotation.
--harddiversity
Discard homologue set if diversity specifications are not met.
--forceexcludegilist -str-
Space separated list (in quotes) of gi numbers to force exclude in each homology set. Takes priority over forceincludegilist.
--forceincludegilist -str-
Space separated list (in quotes) of gi numbers to force include in each homology set.
--forceidentity -float-
Minimal accepted match identity of forced included GIs to sample-sequence.
--forceincludefile -str-
File name of entries to force include in each homology set.
--nofillin
Do not Fill in more individuals up to alignment limit even if nr. of individuals can be kept even.
--fillinall
Fill in more individuals up to alignment limit disrespecting nr. of individuals.
--fillintomatch -str-
Used with the force include options to fill in individuals up to the number of of homologues for the specified species or subspecies.
--flanks FLANKS
Extra flanks to add to the homologue before aligning the set.

Alignment

--alignmentoption -str-
Options passed to ClustalW2. Specify once for every option. Defaults to "-gapopen=50".

Output

-x -int-, --ppcutoff -int-
Posterior probability cut-off (in percent) for assignments reported on the main summary page. Default is 95. Specify the option more than once to get reports for more than one cut-off.
--nocopycache
Do not copy cache for duplicate sample-sequences.
--warnongaps
Issue a warning if the sample-sequence has internal gaps in the alignment.
--svg
Generate publication grade SVG of the taxonomy summaries instead of ASCI trees. The SVG pictures are easily converted to EPS, PS or PSF format using the free program Incscape.

Parallel computing

--hostfile -str-
Use this option to specify a file with a newline separated list of host names. The program will then use this pool of machines to distribute the assignment of individual sequences between the machines using ssh.
--sge -int-
Use this option to run parallel assignments on a Sun Grid Engine cluster with a shared file system. Use the qrun.sh script to do this. The program uses the SGE queue to manage its sub-tasks. This means that sub-tasks are submitted to the queue individually. The sge option is given two arguments. The first is a job name for the sub-jobs. The second is the maximal number of queued sub-jobs allowed.

Caching

--blastcache -str-
Where to put cached blast files.
--genbankcache -str-
Where to put cached GenBank files.
--resultdir -str-
Where to put cached result HTML files.
--datadir -str-
Where to put the fixed input files.

Caveats and recommended use

SAP goes a long way to automate the task of assigning unknown sequences to taxa. The values of the default parameters have been tested in a variety of different scenarios and have their default values for good reason. They should be changed with caution, but sometimes they must be. In the following I will try to outline some of the non standard scenarios.

The minimum identity parameter and the incomplete database problem.

As you know the GenBank does not have all species. In the cases where the species or genus the sample-sequence is from is not represented in GenBank SAP assigns to the most likely of the taxa that are in GenBank. This is always a risk, and often it is not possible to tell if this has happened. In other situations, however, it is more clear. These are the cases where it is not possible to find any homologs with a sequence similarity within what is expected from a match to the correct species.

This is where the "minidentity" (minimum identity) parameter of SAP comes in. The reasoning behind setting the "minidentity" value is that it should describe an upper bound of the diversity expected (lower bound on similarity) in the taxonomic units you want to assign to. If you want to assign the species 0.95 would often be a good choice. If you only want to assign to genus because the species is not included in the database a lower value would be appropriate. What this lower values depends on the expected diversity within the genera in question.

So the role of the minidentiry parameter is to act as a primitive safeguard to keep SAP from assigning to something crazy (e.g. with similarity as low as 70%) just because there are no better matches in GenBank.

Neighbour joining vs. Markov chain monte carlo

SAP currently has two alternative methods of doing the statistical assignment: ConstrainedNJ which is a neighbour joining + bootstrapping algorithm and Barcoder that is a Bayesian approach very much like MrBayes. Both allow topological constraints to be imposed on the sampled trees so that known taxonomic relationships are enforced. Barcoder produces a posterior probability of assignment whereas ConstrainedNJ produces a bootstrap proportion what can be thought of as an approximation to this. Barcoder is default and is the more reliable of the two. However, ConstrainedNJ is a lot faster and has a valid use in cases where the number of unknown sequence makes the time consuming MCMC approach infeasible.


Kasper Munch - kaspermunch@berkeley.edu