
This page contains the documentation for SAP. The package is the presented in two papers (currently in press). If any of the technical documentation is unclear or if you come across dead links please send me an email. If you need to cite SAP use the appropriate of these two references. The package has matured considerably since these papers were submitted. Versions of code used in the publications are available OnRequest.
The package is continually under development. Planned features see the News and Downloads Page.
You can subscribe to this RSS feed for news about SAP such as new versions, bug fixes, possible issues with untested platforms etc.
Go to Installing SAP for download and install instructions.
The program has five stages:
In parallel mode the stages two, three, and four are distributed to other machines.
The assignment of sequences can be done using a number of different statistical frameworks. SAP uses a Bayesian approach where the probability that an environmental sequence belongs to specific taxonomic groups is assigned a probability. We will make the assumption that the species to which the sequence is assigned is represented in a reference database. If this is not the case, the method will calculate the probability that the sequence belongs to any of the taxonomic groups actually represented this database. It is important to emphasize that this method, like any other comparable methods, can only assign sequences to taxonomic groups that are actually represented in the database. SAP makes no attempt to model the structure and sampling representation of the databases to evaluate the probability that the sequence truly belongs to some other taxon not represented in the database.
The probability that the sample-sequence falls as a monophyletic group with database sequences belonging to a given taxon can be formulated as a posterior probability
where
is the sample-sequence,
is
taxon
, and
is the set of data
base sequences. Including only database sequences for which
reduces
to a set of
relevant homologues that can be aligned to the sample-sequence. Based on a compiled set of
homologues a large number of phylogenetic trees can be sampled from a Markov chain Monte
Carlo simulation. Since the probability of sampling a tree where the sample-sequence forms
a monophyletic group with a given taxon,
, is
proportional the likelihood,
the
posterior probability can be calculated as the fraction of sampled trees where
.
SAP now also offers a fast alternative to approximate the probability of assignment using neighbour-joining + bootstrapping. The neighbour-joining approach should only be used when the MCMC analysis is not practically possible. The statistical interpretation of the generated bootstrap proportions is not the same as that of the posterior probability generated by the MCMC. For large values (>80), however, the two measures seem to be in agreement. For smaller values the bootstrap proportions do not constitute valid measures of confidence.
As of now, SAP is a command line tool. So you need a terminal window or command prompt to run it. The syntax is this:
sap [OPTIONS] <fasta file> [<fasta file> ...]
Do not use spaces in your file names. Use underscore characters instead like in this example:
sap --project myproject -x 90 input_sequences.fata and_some_more.fasta
The time to complete an analysis depends on the speed of your Internet for downloading sequences and annotation and on the way phylogenies are sampled. For each sample-sequence the MCMC analysis takes about an hour. The neighbour-joining + bootstrap analysis takes a few minutes.
SAP prints progress information in the terminal window or to a file if standard output collected in a file. Below is an example of the output for one sample-sequence.
myFastaFile -> myFastaID:
Retrieval of homologs:
Entry status: (c)=cached, (d)=downloaded,
Error types:
(!M)=Memory error, (!D)=Download error,
(!T)=Annotation error, (!?)=Unknown error
Searching database... done.
66644230(c)* 171854887(c)* 158934074(c)* 157922406(c)* 157922404(c)*
157922400(c)* 157922399(c)* 157922398(c)* 157922397(c)*
157922396(c) 157922395(c)* 157922394(c) 157922392(c)* 66644281(c)
66644273(c)* 66644233(c)* 66644232(c)* 66644231(c)* 66644229(c)*
21667367(c) 57639439(c) 57639438(c) 57639437(c) 9279868(c)
9279867(c) 9279866(c) 9279859(c) 9279858(c) 9279857(c) 9279856(c)
9279855(c) 9279854(c) 9279853(c) 9279852(c) 9279851(c) 9279850(c)
9279849(c) 9279848(c) 9279847(c) 7620584(c)* 157922408(c)*
157922407(c)* 157922405(c)* 157922401(c)* 66644283(c)* 66644282(c)
66644280(c)* 66644279(c)* 66644278(c)* 66644277(c)* 66644276(c)*
66644275(c)* 66644274(c) 66644272(c)* 66644271(c)* 66644269(c)
66644256(c) 66644252(c) 66644251(c) 66644250(c) 66644248(c)
66644247(c) 66644246(c) 66644245(c) 66644244(c) 66644243(c)
66644240(c) 66644239(c) 66644235(c) 66644234(c) 66644228(c)*
21667360(c) 7620588(c) 7620587(c) 7620586(c) 7620585(c) 7620583(c)
7620582(c) 7620581(c) 7620578(c) 7620577(c) 7620576(c) 7620575(c)
7620574(c) 7620573(c) 7620572(c) 7620571(c) 7620570(c) 9279907(c)
9279863(c) 9279862(c) 52551076(c) 52551075(c) 52551074(c)
52551073(c) 157922403(c) 9279865(c) 9279864(c) 66644227(c)
7620579(c) 9279905(c)* 9279911(d)* 34329954(c)* 56182384(c)*
21667358(c) 18025252(c)* 25071158(c) 25071157(c) 22796303(d)*
148790857(d)* 148790856(d)* 126256023(d)* 112253730(d)* 66644286(c)*
66644270(c) 66644268(c) 66644267(c) 66644266(c) 66644265(c)
66644264(c) 66644263(c) 66644262(c) 66644261(c) 66644260(c)
66644259(c) 66644258(c) 66644257(c) 66644255(c) 66644254(c)
66644253(c) 66644249(c) 66644242(c) 66644241(c) 66644238(c)
66644237(c) 66644236(c) 78057750(d)* 78057749(d) 78057748(d)*
78057747(d)* 78057746(d)* 78057745(d)* 78057744(d)* 78057743(d)
78057742(d)* 78057741(d)*
50 significant homologues found.
50 homologues in set:
1 phyla: Streptophyta
1 classes: Coniferopsida
4 orders: Coniferales Lamiales Piperales Solanales
8 families: Scrophulariaceae Podocarpaceae Sciadopityaceae Araucariaceae
Montiniaceae Cupressaceae Piperaceae Gesneriaceae
26 genera: Grevea Athrotaxis Thujopsis Fokienia Piper Nemesia Calocedrus
Cupressus Callitropsis Sinningia Sequoia Glyptostrobus
Platycladus Cryptomeria Metasequoia Tetraclinis Taxodium
Sequoiadendron Chamaecyparis Araucaria Microbiota Podocarpus
Juniperus Sciadopitys Agathis Taiwania
Last accepted E-value is 2.402550e-06
Ratio of lowest to highest bit score is: 0.593898074532
WARNING: Diversity goal not reached.
WARNING: Relative bit-score cut-off (0.50) not reached.
Writing homologues to fasta file... done.
Issuing sub-tasks:
ClustalW alignment
Barcoder tree sampling
Tree statstics computation
myFastaFile_myFastaID: Alignment: Computing... done
myFastaFile_myFastaID: Sampling trees: Computing... done.
myFastaFile_myFastaID: Calculating tree statistics: Computing... done.
Computing tree statistics summary...
myFastaFile_myFastaID: Calculating tree statistics: Using cached results.
done
Generating HTML output...
100.00% myFastaFile_myFastaID
done
Each section of output is identified by the fasta file name and the sample-sequence id. After that progress on compiling the homologue set is printed. Here only one Blast search is performed but several rounds of blast searches may be required. The id of each downloaded GenBank entry is printed along with a entry status and possibly an error indicator as explained in the output. A star indicates that the entry is included in the set. After compilation of the homologue set some statistics are printed. In this case two warnings are issued. "Diversity goal not reached" notifies the user that the standard taxonomic diversity goal of two phyla, three classes, five orders, six families and ten genera have not been reached. This is usually not a problem. However, you do need to check that the diversity of the in the set is not too small. "Relative bit-score cut-off (0.50) not reached" warns the user of the risk that the set of homologues does not adequately exhaust the database. This need not be a concern either. By the end of the analysis it is checked if the set exhausts the database (see the section on HTML output below).
An alignment is then computed, trees are sampled using MCMC, and summary statistics is calculated for the assignment. Finally a grand summary is computed and HTML pages are generated.
The tables on the main result page show the posterior probabilities for each taxonomic level exceeding a assignment probability cut-off. Click the sequence name to get details on the assignment. Clicking a taxonomic name takes you to the NCBI TaxBrowser browser. If you want to look at a particular assignment you can also find it in the list of input sequences by following the link at the top of this page. If you click a name you are taken to a summary taxonomy, in the form of a taxonomy tree, showing the posterior probabilities of assigning the sequence to the different taxa. At each tip in the tree is the name of the taxon with the probability of assignment and the number of sampled phylogenies supporting the assignment. Below this tree the multiple alignment of the sample-sequence and homologues is shown. Clicking the names of homologues takes you to the GenBank entry.
SAP caches sub-results in the analysis for two reasons. Sequences and taxonomic annotation downloaded from GenBank may be of use in the assignment of more than one sample-sequence. Caching saves download time. The other reason is that in the unlikely event that you abort the analysis before it is done you can restart the analysis loosing only the step in the analysis where the program was aborted.
The Blast cache stores the blast results. The GenBank cache stores the sequences and taxonomic annotation downloaded from GenBank. The homologue cache stores the compiled set of homologues retrieved from GenBank. The alignment cache contains the alignment files generated by ClustalW. The trees cache stores all output files for the sampled trees. The tree statistics cache stores the inference based on the sampled trees. The stats cache stores miscellaneous files as well as information on pairwise differences between sample-sequences.
The caching of sub-results can amount to quite a lot on the hard disk if you have many
sequences. The cache for each sample-sequence will take up about 5-6 Mb on your
hard disk. If you sample more than the default number of trees by increasing the number of
MCMC generations I think you can roughly calculate the required hard disk space this way:
Bytes 15 * #homologs * #MCMCgenerations=
NOTE: If you restart the analysis using different parameters the program will delete the parts of the cache that conflicts with the changed options.
The database file needs to be in Fasta format. The taxonomic annotation for each sequence can be supplied either by using the NCBI formatted header line like this:
>gi|222142915|gb|FJ606797.1| Bombus ignitus vitellogenin mRNA, complete cds AGCGATAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATAT TAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATATCCTAA
or just the GI number as sequence id like this:
>222142915 AGCGATAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATAT TAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATATCCTAA
in which case the taxonomic annotation is retrieved from NCBI. You can also supply the taxonomic annotation in the header line in following format. Note the user of colons and semicolons:
>mySequenceID ; family: Hominidae, genus: Homo, species: Homo sapiens ; H. sapiens (myself) AGCGATAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATAT TAGGATCACGTAGACCACAGATATAGCGATAGGATCACGTAGACCACAGATATCCTAA
The following taxonomic levels are recognized: superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, superclass, class, subclass, infraclass, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, supertribe, tribe, subtribe, supergenus, genus, subgenus, species group, species subgroup, species, subspecies. The name after the last semicolon is the organism name. This is sometimes different from the species or subspecies name.
To run the program against a local database specify the database fasta file as argument to the database option:
sap --database local_database.fasta --project myproject input_sequences.fasta
The first time the program is run against the fasta file it will build a database and write to the directory of the fasta file for future use.
Apart from running normally on one machine the program can also run in two alternative parallel modes. For this to work each machine used must mount the same file system as the one where the analysis is started from.
Use the —hostfile option to specify a file with a newline separated list of host names. The program will then use this pool of machines to distribute the assignment of individual sequences between the machines.
Use the —sge option to run parallel assignments on a Sun Grid Engine cluster with a shared file system. The program uses the SGE queue to manage its sub-tasks. This means that sub-tasks are submitted to the queue individually.
The options described below are supplied to modify the behaviour of the program for use in situations out of the ordinary. Before you head out trying them all out remember that they all have been assigned their default values for good reason.
Some of the options take an argument. -str-: string, -int-: integer, -float-: floating point number. If you specify a string argument containing spaces or starting with '-' you must enclose it in quotes.
--version--onlinehelp-d, --project -str--D, --database -str--S, --assignment -str--A, --alignment -str--t -float-, --minsignificance -float--s -float-, --significance -float-n -int-, ---nrsignificant -int--l -str-, --limitquery -str---minidentity -float--r -float-, --relbitscore -float--m -int-, --maxblasthits -int---nolowcomplexfilter-q, --quickcompile-b BESTHITS, --besthits BESTHITS, --softalignmentlimit BESTHITS-a ALIGNMENTLIMIT, --alignmentlimit ALIGNMENTLIMIT--quick-i INDIVIDUALS, --individuals INDIVIDUALS--subspecieslevel-p PHYLA, --phyla PHYLA-c CLASSES, --classes CLASSES-o ORDERS, --orders ORDERS-f FAMILIES, --families FAMILIES-g GENERA, --genera GENERA--minimaltaxonomy -int---harddiversity--forceexcludegilist -str---forceincludegilist -str---forceidentity -float---forceincludefile -str---nofillin--fillinall--fillintomatch -str---flanks FLANKS--alignmentoption -str--x -int-, --ppcutoff -int---nocopycache--warnongaps--svg--hostfile -str---sge -int---blastcache -str---genbankcache -str---resultdir -str---datadir -str-SAP goes a long way to automate the task of assigning unknown sequences to taxa. The values of the default parameters have been tested in a variety of different scenarios and have their default values for good reason. They should be changed with caution, but sometimes they must be. In the following I will try to outline some of the non standard scenarios.
As you know the GenBank does not have all species. In the cases where the species or genus the sample-sequence is from is not represented in GenBank SAP assigns to the most likely of the taxa that are in GenBank. This is always a risk, and often it is not possible to tell if this has happened. In other situations, however, it is more clear. These are the cases where it is not possible to find any homologs with a sequence similarity within what is expected from a match to the correct species.
This is where the "minidentity" (minimum identity) parameter of SAP comes in. The reasoning behind setting the "minidentity" value is that it should describe an upper bound of the diversity expected (lower bound on similarity) in the taxonomic units you want to assign to. If you want to assign the species 0.95 would often be a good choice. If you only want to assign to genus because the species is not included in the database a lower value would be appropriate. What this lower values depends on the expected diversity within the genera in question.
So the role of the minidentiry parameter is to act as a primitive safeguard to keep SAP from assigning to something crazy (e.g. with similarity as low as 70%) just because there are no better matches in GenBank.
SAP currently has two alternative methods of doing the statistical assignment: ConstrainedNJ which is a neighbour joining + bootstrapping algorithm and Barcoder that is a Bayesian approach very much like MrBayes. Both allow topological constraints to be imposed on the sampled trees so that known taxonomic relationships are enforced. Barcoder produces a posterior probability of assignment whereas ConstrainedNJ produces a bootstrap proportion what can be thought of as an approximation to this. Barcoder is default and is the more reliable of the two. However, ConstrainedNJ is a lot faster and has a valid use in cases where the number of unknown sequence makes the time consuming MCMC approach infeasible.