**Downloading MDBlocks:**

MDBlocks is distributed as source code in a gzipped tar file. The archive also contains the source code for ranlib and a makefile.

**Data Format For MDBlocks:**

To use MDBlocks, you must have phased haplotype data. The format is simple. The file must begin with information on the number of haplotypes (Seqs) and the number of Markers in each. The first two lines must be:

Seqs: XXX

Markers: YYY

Here XXX is the number of haplotypes in the data set and YYY is the number of SNPs in each haplotype. The format is quite strict: the two lines must appear in this order, a colon must immediately follow both "Seqs" and "Markers", and there must be a space between the colon and the number (XXX or YYY above).

After that, each line is a haplotype. The first entry of each row must be a string with no white space in it, for example a name given to the haplotype. The remaining columns are white-space delimited (tabs or spaces) indicators of the SNP alleles carried at each of the loci. SNP alleles must be denoted as either 0 or 1. "-1" denotes missing data. "-2" denotes "ambiguous heterozygosity." This case arises when phasing is done using trios, and the two parents and the child in the trio are all heterozygous at the locus. With regard to the "-2"s, the haplotypes are considered to be paired. Hence, counting haplotypes from 1, if a "-2" appears in haplotype n, where n is an odd number, then there must be a "-2" at the same locus in haplotype (n+1). A small example data set would look like:

Seqs: 10

Markers: 15

Ex1 1 1 1 0 0 -2 0 1 0 1 1 0 0 0 0

Ex2 1 -1 1 0 0 -2 1 0 0 0 0 1 1 1 0

Ex3 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0

Ex4 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0

Ex5 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0

Ex6 0 0 0 0 1 0 1 0 0 0 0 1 1 0 -1

Ex7 0 0 0 0 1 0 0 0 0 -1 0 1 1 0 0

Ex8 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0

Ex9 0 0 -2 0 1 0 0 0 0 0 0 1 1 0 1

E10 1 1 -2 0 0 0 0 0 0 0 0 1 1 0 1
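The format rules above can be checked before running the program. The following is a minimal Python sketch, not part of the MDBlocks distribution; the function name `parse_mdblocks` is hypothetical. It assumes that blank lines can be ignored, which is a convenience of this sketch and may be more lenient than MDBlocks itself.

```python
def parse_mdblocks(text):
    """Parse haplotype data in the format described above.

    Returns (names, rows), where rows[i] is the list of integer allele
    codes for haplotype i. Raises ValueError on a format violation.
    Hypothetical helper; not part of MDBlocks.
    """
    # Assumption of this sketch: blank lines are skipped.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Header: "Seqs: N" then "Markers: M", colon followed by a space.
    if (len(lines) < 2 or not lines[0].startswith("Seqs: ")
            or not lines[1].startswith("Markers: ")):
        raise ValueError("first two lines must be 'Seqs: N' and 'Markers: M'")
    n_seqs = int(lines[0][len("Seqs: "):])
    n_markers = int(lines[1][len("Markers: "):])

    body = lines[2:]
    if len(body) != n_seqs:
        raise ValueError(f"expected {n_seqs} haplotypes, found {len(body)}")

    names, rows = [], []
    for ln in body:
        fields = ln.split()  # tabs or spaces
        alleles = [int(x) for x in fields[1:]]
        if len(alleles) != n_markers:
            raise ValueError(f"{fields[0]}: expected {n_markers} markers")
        if any(a not in (0, 1, -1, -2) for a in alleles):
            raise ValueError(f"{fields[0]}: alleles must be 0, 1, -1 or -2")
        names.append(fields[0])
        rows.append(alleles)

    # Haplotypes are paired (1st with 2nd, 3rd with 4th, ...): a -2 in one
    # member of a pair must be matched by a -2 at the same locus in the other.
    for i in range(0, n_seqs, 2):
        partner = rows[i + 1] if i + 1 < n_seqs else [None] * n_markers
        for j in range(n_markers):
            if (rows[i][j] == -2) != (partner[j] == -2):
                raise ValueError(f"unpaired -2 at locus {j} in pair starting at {names[i]}")
    return names, rows


sample = "Seqs: 2\nMarkers: 3\nA 0 -2 1\nB 1 -2 -1\n"
names, rows = parse_mdblocks(sample)
print(names, rows)
```

Loci are reported with 0-based indices in the error messages, matching the SNP numbering MDBlocks uses in its output.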

**Command Line Options:**

MDBlocks is invoked by the command "MDBlocks". It takes a single required argument, the data file name. The file name may be followed by either or both of two options, "-g" and "-s". The "-g" option makes the program use IADP (the iterated approximate dynamic programming algorithm), which is typically much faster than IDP (the iterated dynamic programming algorithm), the default when "-g" is not given. The "-s" option lets you specify two seeds for the random number generator that is used in dealing with missing data. The two seeds must be positive integers and must immediately follow the "-s". Three example command lines are:

MDBlocks myfile.txt

MDBlocks myfile.txt -g

MDBlocks myfile.txt -g -s 23786 98733
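When MDBlocks is run from a script, the argument list can be assembled programmatically. The sketch below is illustrative only; the function `mdblocks_command` and its parameter names are hypothetical. It uses only the options described above and places "-g" before "-s", as in the third example.

```python
def mdblocks_command(data_file, approximate=False, seeds=None):
    """Build an argument list for MDBlocks (hypothetical helper).

    approximate=True adds "-g" (IADP instead of the default IDP).
    seeds, if given, must be a pair of positive integers for "-s".
    """
    argv = ["MDBlocks", data_file]
    if approximate:
        argv.append("-g")
    if seeds is not None:
        first, second = seeds
        for s in (first, second):
            if not isinstance(s, int) or s <= 0:
                raise ValueError("seeds must be positive integers")
        argv += ["-s", str(first), str(second)]
    return argv


cmd = mdblocks_command("myfile.txt", approximate=True, seeds=(23786, 98733))
print(" ".join(cmd))
```

The returned list could be passed to `subprocess.run`, assuming the compiled MDBlocks executable is on your PATH.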

**Program Standard Output:**

MDBlocks directs output both to standard output and to various output files. Running the example data set above with the command "MDBlocks myfile.txt -g" prints the following to standard output:

Program MDBlocks

Version 1.0 Released 8 MAY 03

written by Eric C. Anderson (dr_eriq@uclink.berkeley.edu) and John A. Novembre (novembre@socrates.berkeley.edu)

Copyright (c) by The Regents of the University of California

Please see user documentation for full software agreement.

Seeds: 1269384 6471

Ex1 Ex2 Ex3 Ex4 Ex5 Ex6 Ex7 Ex8 Ex9 E10

Filling 4 Unresolved Heterozygosity Holes:

Filling 3 Missing Data Holes:

Computing Matrices, a = 0

Computing Matrices, a = 1

Computing Matrices, a = 2

Computing Matrices, a = 3

Computing Matrices, a = 4

Computing Matrices, a = 5

Computing Matrices, a = 6

Computing Matrices, a = 7

Computing Matrices, a = 8

Computing Matrices, a = 9

Computing Matrices, a = 10

Computing Matrices, a = 11

Computing Matrices, a = 12

Computing Matrices, a = 13

Computing Matrices, a = 14

Assumed R = 1, Calculated R = 2 Continuing iterations...

Assumed R = 2, Calculated R = 3 Continuing iterations...

Assumed and Calculated R values are converged to 3

Outputting detailed score summary to CodeLengths.tex

The following line is designed to be easy to grep (using the @) out of the stdout.

@146.499746,4,0 5 13 15 ,0.000 0.000 , 2 2 2,648177177,1603874186

The synopsis of this is as follows:

- The first few lines announce the program and the copyright
- The next line gives the random number seeds used
- The next line lists the names of all the rows in the data set
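- The next two lines report how many data entries are being filled in: first the number of unresolved heterozygosity ("-2") entries, then the number of missing data ("-1") entries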
- Then, while essential quantities are being computed, the program reports on progress with the "Computing Matrices" lines
- Then, it goes into iterations of the dynamic programming algorithm, and briefly reports progress on that
- Finally it reports that a detailed summary has been output to the file CodeLengths.tex
- Then the results are printed on a single line that begins with an "@". This line is comma-separated and has the following fields:
  - 146.499746 --- this is the overall description length, in bits
  - 4 --- this is the number of blocks found plus 1 (for historical reasons the program reports R+1)
  - 0 5 13 15 --- this reports the block boundaries. SNPs are numbered from 0 to NumberOfMarkers - 1. The field 0 5 13 15 means that there is a block that includes SNPs 0 to 4, one that includes SNPs 5 to 12, and one that includes SNPs 13 to 14. The 15 on the end is a tag: it is always the number of SNPs in the data set.
  - 0.000 0.000 --- this field gives the Kullback-Leibler distances between the matrix P* and a matrix in which each row is determined simply by the marginal proportions of the types in the next block. (The first number is for the change between blocks 1 and 2, the second between blocks 2 and 3, etc.) These are not as useful as we had first imagined they might be.
  - 2 2 2 --- this lists the type of distribution used to compute the code length for describing the haplotype frequencies within each block. The numbers here correspond to the numbers used in the manuscript. In particular, a 1, 2, or 3 in the program output corresponds to I_q^(k) = 1, 2, or 3, respectively.
  - 648177177,1603874186 --- these last two fields are the two seeds that the random number generator finished with
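Because the "@" line is meant to be picked out of the output mechanically, it is easy to turn into named values. The following Python sketch is one way to do so, assuming the seven comma-separated fields described above; the function name `parse_result_line` and the dictionary keys are hypothetical and not part of MDBlocks.

```python
def parse_result_line(line):
    """Split the '@' summary line into named fields (hypothetical helper)."""
    if not line.startswith("@"):
        raise ValueError("result line must begin with '@'")
    fields = line[1:].strip().split(",")
    if len(fields) != 7:
        raise ValueError(f"expected 7 comma-separated fields, got {len(fields)}")

    boundaries = [int(x) for x in fields[2].split()]
    return {
        "description_length_bits": float(fields[0]),
        "r_plus_one": int(fields[1]),
        "boundaries": boundaries,
        # Consecutive boundaries a, b describe a block covering SNPs a..b-1;
        # the last boundary is the total number of SNPs.
        "blocks": [(a, b - 1) for a, b in zip(boundaries, boundaries[1:])],
        "kl_distances": [float(x) for x in fields[3].split()],
        "distribution_types": [int(x) for x in fields[4].split()],
        "final_seeds": (int(fields[5]), int(fields[6])),
    }


result = parse_result_line(
    "@146.499746,4,0 5 13 15 ,0.000 0.000 , 2 2 2,648177177,1603874186"
)
print(result["blocks"])
```

For the example line this yields the blocks (0, 4), (5, 12) and (13, 14), given as inclusive 0-based SNP ranges.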

**Program File Output:**

There are two files that the program generates:

- "CodeLengths.tex"---This is a LaTeX file that gives a detailed breakdown of the description length for the block designation found by MDBlocks. Run LaTeX on it to get an easy-to-read table.
- "FileName.log" --- for any data file name, a file is created and then appended to each time MDBlocks is run on that particular data file. It records the starting time, starting seeds, ending seeds and ending time of each program run.

**Copyright Notice:**

Copyright (c). The Regents of the University of California (Regents). All Rights Reserved. Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement, is hereby granted, provided that the above copyright notice, this paragraph and the following two paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, for commercial licensing opportunities. Created by Eric C. Anderson and John A. Novembre, Department of Integrative Biology, University of California, Berkeley.

IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.