Script List


About the script list

This is my collection of bioinformatics-related scripts. They do very boring but necessary stuff, so this page serves to save other people the agony. Many of the scripts rely on BioPerl, so without it installed they are of no use. A few of the scripts depend on other programs.

The list

About the script list
The list
Working with sequence files
Working with GFF files
Handling table files
Working with trainingsets and gene predictions

Working with sequence files
Cuts out Fasta entries from a sequence file based on GFF lines of features in the sequence entries.
Dumps the prediction track from a prediction in labeled Fasta format.
Extracts sub-sequences from sequences on STDIN based on a (Perl) regular expression given on the command line. Input sequences are in labeled Fasta format. By default the labels are searched using the regexp. Note that the IDs on the output are made unique by adding an incrementing suffix for each match in an entry. This can be avoided by using the keepid option.
Takes a labeled Fasta from STDIN and transforms labels to GFF entries. The label-to-feature translations are given as e.g. C=coding I=intron etc. Only specified labels are turned into GFF lines.
This script scrambles the bases in the sequence specified by labels. When more than one label is specified, each type of sequence is scrambled separately.
This script splits LFasta entries into subentries of at most some specified length. The ids of the subentries get a '-OFFSET:<int>' tag appended to show where the subentry fits in the original entry.
Takes one or more seqfiles (STDIN is default) and masks them using gff lines from an external file and/or named features from the sequence annotation.
This script counts n-mers in a set of sequences. NOTE: Does not handle Ns in the entries.
Prunes a seqfile of open reading frames, either by masking or by removing sequence entries (removal is the default).
Picks out sequence entries from a sequence file based on id/accession. It takes a newline-separated list of ids/accessions to pick, either as first argument or from STDIN. Genbank and EMBL entries are identified by their accession, Fasta entries by id. Indexing is only supported for Fasta files. Note that the whole description line is not returned when using indexes. Only the first non-white-space string after the '>' is returned, which is also what the index uses as unique lookup id.
This script does reformatting between sequence formats. It handles Genbank, EMBL, Fasta and all the other formats supported by BioPerl. In addition it formats to labeled Fasta (lfa), which is a handy extension of the Fasta format developed by Anders Krogh for use in HMM training. The labeling is generated from the sequence features in a manner directed by the --labelkey option. The information surplus or deficit when formatting between rich formats like EMBL and plain ones like Fasta can be handled by using the gff option. This specifies a gff file that is read from or written to, depending on which way the formatting goes.
This script generates gff from seqfiles.
This script creates a sql table from one or more sequence files in a pre-created SQL database.
Breaks a Fasta file into smaller files. The size of each small file is checked only after a new sequence entry has been written to it, so the specified limit may be exceeded by at most the size of one sequence entry.
Reports the differences between two sequence files, and prints either the sequences unique to file1 or the sequences that the files have in common.
Gets or skips a number of the first entries in a sequence file.
This script tests whether a created sequence file can be read and written by the Bio::SeqIO object, that is, whether it is syntactically correct (from BioPerl's point of view).
Extracts all splice sites from an EMBL, Genbank or Swissprot file, as given in the CDS feature information. Each splice site is only reported once if it appears on several CDS features (provided the features are ordered from the start of the sequence). The output is Fasta. The description line contains an id and a gff string specifying the location of the splice site sequence on the parent sequence and whether the splice site is 5' or 3'.
This script splits EMBL/Genbank/Swissprot entries in a file/stream into subfiles for each organism.
This script untangles Labeled Fasta as it comes out if you treat it as ordinary Fasta in a Seq or SeqIO object.
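
To illustrate the label-to-GFF conversion described in this section, here is a minimal sketch in Python. It is not the script itself (the scripts are Perl and rely on BioPerl). The function name labels_to_gff, the source value "lfa", and the choice to leave score, strand, frame and attributes as "." are all assumptions made for the example. It takes the label track as a plain string, one label character per base, and turns each run of a specified label into one GFF line with 1-based inclusive coordinates.

```python
from itertools import groupby


def labels_to_gff(seq_id, labels, label_map, source="lfa"):
    """Turn runs of identical labels into tab-separated GFF lines.

    labels    -- label track as a string, one character per base
    label_map -- label-to-feature translations, e.g. {"C": "coding"}
    Labels missing from label_map are skipped (hypothetical helper,
    not the interface of the original Perl script).
    """
    lines = []
    pos = 1  # GFF coordinates are 1-based and inclusive
    for label, run in groupby(labels):
        length = len(list(run))
        start, end = pos, pos + length - 1
        pos += length
        if label in label_map:
            # score, strand, frame and attributes are unknown here: "."
            fields = [seq_id, source, label_map[label],
                      str(start), str(end), ".", ".", ".", "."]
            lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    for line in labels_to_gff("seq1", "NNCCCIICCN",
                              {"C": "coding", "I": "intron"}):
        print(line)
```

For the label string NNCCCIICCN this gives three lines: coding 3-5, intron 6-7 and coding 8-9. The N positions produce no output because N is not among the specified labels.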

Working with GFF files
Filters a GFF stream based on given options.
This script makes a tab-separated table out of gff lines with lots of messy stuff in the last column.
Finds intersection of two GFF files. Prints lines from file2 that intersect (or do not intersect) with file1.
This script converts GFF lines with first entry in the format: <ID>-OFFSET:<OFFSET> into just <ID> and adjusts the start and end coordinates according to <OFFSET>.
This script finds the intersection between gff lines in file1 and those in file2. It can also be used to find the best matching gff lines in file2. NOTE: If you use --presorted with --features the files have to be sorted by strand, id, start, and end. With --strands they have to be sorted by feature, id, start, and end. With both --features and --strands they have to be sorted by strand, feature, id, start, and end.
This script sorts gff lines from standard input by first sequence name, then start and then end.
This script splits a sorted stream of GFF/GTF lines into files each containing only lines for one sequence.
This script sums the lengths of the gff lines in a file or stream.
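
The '-OFFSET' conversion described in this section can be sketched in Python as below. This is only an illustration, not the Perl script itself. The function name remove_offset is made up, and the sketch assumes that <OFFSET> is the number of bases that precede the subentry in the original entry, so that coordinates are shifted by simply adding it. Lines whose first field carries no '-OFFSET:<int>' tag are passed through unchanged.

```python
import re

# First GFF field of the form <ID>-OFFSET:<OFFSET>
OFFSET_TAG = re.compile(r"^(?P<id>.+)-OFFSET:(?P<offset>\d+)$")


def remove_offset(gff_line):
    """Map a GFF line on a subentry back to the original entry.

    Assumption: <OFFSET> counts the bases before the subentry, so the
    original coordinate is the subentry coordinate plus <OFFSET>.
    """
    fields = gff_line.rstrip("\n").split("\t")
    match = OFFSET_TAG.match(fields[0])
    if match is None:
        return "\t".join(fields)  # no tag: nothing to adjust
    offset = int(match.group("offset"))
    fields[0] = match.group("id")
    fields[3] = str(int(fields[3]) + offset)  # start
    fields[4] = str(int(fields[4]) + offset)  # end
    return "\t".join(fields)


if __name__ == "__main__":
    line = "chr1-OFFSET:1000\tpred\texon\t5\t20\t.\t+\t.\t."
    print(remove_offset(line))
```

With the example line, the sequence name becomes chr1 and the coordinates 5-20 become 1005-1020.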

Handling table files
Makes a 2D or 3D histogram data-set for gnuplot from data in specified column(s) of an input file. Lines not starting with a real number are ignored. It also understands "framed" tables dumped from databases.
This script adds a column, specified by an arithmetic string, to a column file. E.g. '$2 - $1 / 2'
This script filters a table based on a specified string of conditions. Quote the string in '' so that the shell does not interpolate the $ vars. See examples.
This script calculates the average etc. for the numbers in column <col>. Lines starting with '#' are ignored.
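
As an illustration of the column statistics described above, here is a small Python sketch. It is not the script itself: the function name column_stats is hypothetical, and since the description only says "average etc.", the choice of count, mean, minimum and maximum is an assumption. Columns are numbered from 1 and split on whitespace, and lines starting with '#' are ignored.

```python
import statistics


def column_stats(lines, col):
    """Summarise the numbers in column `col` (1-based) of a table.

    Lines starting with '#' and empty lines are skipped. Which
    statistics the original script reports is not stated, so the
    selection here is only an assumption.
    """
    values = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        values.append(float(fields[col - 1]))
    if not values:
        raise ValueError("no data lines found")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    table = ["# id score", "a 1.0", "b 3.0", "c 5.0"]
    print(column_stats(table, 2))
```

In real use the lines would come from a file or STDIN, e.g. column_stats(sys.stdin, 2) after importing sys.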

Blasts a file against a BLAST database, parses the BLAST reports and outputs GFF lines for query, hits and HSPs.
This script creates a database from a BLAST report.

Working with trainingsets and gene predictions
This script adds a prediction track to labeled Fasta entries as specified by a gff file. This is useful for comparing predictions. You can add more than one track of predictions from one gff file. In this case the prefix will be the first letter (capitalised) of the source field. Unless you use the onlydefined option you must use the prefix option. Otherwise the script will not know which prefix goes with the tracks in entries that are not part of a prediction.
Reads a file with three columns (id1 id2 score) from STDIN. Self scores are assumed to be included. Applies Hobohm 1 or 2 and outputs the reduced list of IDs.
This script changes all I labeling (intron) to 0, 1, or 2. The input is an LFasta file with coding sequence labeled 'C' and introns labeled 'I'.
This script plots plp output from the decodeanhmm program if run with the following flags: -ps -pl -plp
This script makes statistics for one or more predictions in a LFasta file. The LFasta file has to hold both the label track and the prediction track(s) that you want to compare to it. NOTE that it makes the assumption that each LFasta entry holds one and only one annotated gene. So you may get strange results for the gene stats if there is more than one gene in the annotation or prediction.
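
The relabeling of introns to 0, 1 or 2 described in this section can be sketched in Python as follows. This is an illustration only. The description does not say what the three values mean, so the sketch assumes the common intron phase convention: the number of coding bases before the intron, modulo 3. It also assumes a single forward-strand gene per label string whose first 'C' starts a codon. The function name label_intron_phase is made up for the example.

```python
def label_intron_phase(labels):
    """Replace each 'I' label by the phase of the intron it belongs to.

    Assumed convention (not confirmed by the script description):
    phase = number of coding ('C') bases seen so far, modulo 3.
    All other labels are copied unchanged.
    """
    out = []
    coding_bases = 0
    for label in labels:
        if label == "C":
            coding_bases += 1
            out.append(label)
        elif label == "I":
            out.append(str(coding_bases % 3))
        else:
            out.append(label)
    return "".join(out)


if __name__ == "__main__":
    print(label_intron_phase("NCCCCIIICCNN"))
```

In the example, four coding bases precede the intron, so its three I labels become 1 and the result is NCCCC111CCNN.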

Kasper Munch -