5q31 data:

In our analyses of the 5q31 data, we used haplotype data kindly provided to us by Mark J. Daly. This data set was a list of 516 haplotypes from 129 trios. The first 258 were the haplotypes transmitted to the child in each trio (two per trio, one from each parent), and the last 258 were the haplotypes not transmitted to the child. The haplotypes had been reconstructed with respect to missing data in such a way as to avoid bias in the TDT.

Briefly, if both parents and the child in a trio have no missing genotypes at a SNP, then the alleles carried on the transmitted and untransmitted haplotypes are easily determined unless all three members of the trio are heterozygous, in which case the alleles carried on the transmitted and untransmitted chromosomes are termed "ambiguously heterozygous" and denoted in the data by an "h".
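The complete-data case described above can be sketched in Python. This is only an illustration, not the procedure actually used to build the data set: the function name split_parent_alleles and the representation of a genotype as a pair of allele codes are assumptions made for the example, and the trio is assumed to be free of Mendelian errors.

```python
def is_het(genotype):
    """A genotype is a pair of allele codes, e.g. (1, 3)."""
    return genotype[0] != genotype[1]


def split_parent_alleles(parent, other_parent, child):
    """Return (transmitted, untransmitted) alleles of `parent` at one SNP.

    Assumes no genotype is missing and the trio is Mendelian-consistent.
    Returns ("h", "h") when all three members are heterozygous.
    """
    if not is_het(parent):
        # A homozygous parent transmits and retains the same allele.
        return parent[0], parent[0]
    if not is_het(child):
        # A homozygous child received its single allele from both parents.
        transmitted = child[0]
    elif not is_het(other_parent):
        # The other parent can only have given one allele, so the child's
        # remaining allele came from this parent.
        remaining = list(child)
        remaining.remove(other_parent[0])
        transmitted = remaining[0]
    else:
        # All three heterozygous: ambiguously heterozygous.
        return "h", "h"
    untransmitted = parent[0] if parent[1] == transmitted else parent[1]
    return transmitted, untransmitted
```

For example, with parent A = (1, 3), parent B = (1, 1) and child C = (1, 3), parent B must have transmitted allele 1, so parent A transmitted 3 and did not transmit 1.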

However, if genotypes are missing in some family members, the assignment of alleles at a given SNP locus to the transmitted and untransmitted haplotypes of parent A (for example) is done according to the following rules. (Let A and B be the two parents and let C denote the child.)

The data set that results from this procedure is in the file dalyphased.txt, in which each row is a haplotype: the first column in each row is the Family ID, and the remaining columns are the alleles carried on the haplotype at the 103 SNPs. The first 129 pairs of rows are the haplotypes transmitted to the child in each trio, and the last 129 pairs of rows are the haplotypes not transmitted to the child in each trio. The alleles are coded as follows: 1=A, 2=C, 3=G, 4=T, "0" denotes missing genotype data, and "h" denotes ambiguous heterozygosity as above. (This file can be recreated from the original trio data in raw_data.txt by applying the awk script phaselikedaly.txt to the data, i.e., on a Unix or Linux machine, download raw_data.txt, save the script as myscript.sh, and then run the command "myscript.sh raw_data.txt".)
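As a rough illustration of the layout just described, the following Python sketch decodes one row and splits a list of rows into its transmitted and untransmitted halves. It assumes that the columns of dalyphased.txt are separated by whitespace, which is not stated above; the function names are hypothetical, and the choice to represent missing data as None is made only for this example.

```python
# Allele coding as described in the text; None for missing is an assumption.
ALLELE_CODES = {"1": "A", "2": "C", "3": "G", "4": "T", "0": None, "h": "h"}


def parse_haplotype_row(line):
    """Split one row into (family_id, list of decoded alleles).

    Assumes whitespace-separated columns (an assumption, not documented).
    """
    fields = line.split()
    family_id, codes = fields[0], fields[1:]
    return family_id, [ALLELE_CODES[code] for code in codes]


def split_transmitted(rows):
    """First half of the rows is transmitted, second half untransmitted."""
    half = len(rows) // 2
    return rows[:half], rows[half:]
```

With the real file, each parsed row would carry 103 alleles, and split_transmitted would return two lists of 258 rows each.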

To run these data in our program MDBlocks, we format them so that the alleles are coded as 0's and 1's at each SNP, missing genotype data are denoted by "-1", and ambiguously heterozygous sites are recorded as "-2". Such a formatted file is available for download as daly4mdblocks.txt.
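The recoding can be sketched as follows. The text does not say which of the two nucleotides at a SNP becomes 0 and which becomes 1, so this example simply assumes that the allele with the smaller original code is mapped to 0; that rule, and the function names, are assumptions and may not match how daly4mdblocks.txt was actually produced.

```python
MISSING, AMBIGUOUS = -1, -2


def recode_snp_column(column):
    """Recode one SNP (a list of codes such as "1", "3", "0", "h") to 0/1/-1/-2.

    Assumption: the smaller observed allele code becomes 0, the larger 1.
    """
    observed = sorted({code for code in column if code not in ("0", "h")})
    if len(observed) > 2:
        raise ValueError("more than two alleles observed at a SNP")
    index = {allele: i for i, allele in enumerate(observed)}
    recoded = []
    for code in column:
        if code == "0":
            recoded.append(MISSING)
        elif code == "h":
            recoded.append(AMBIGUOUS)
        else:
            recoded.append(index[code])
    return recoded


def recode_haplotypes(haplotypes):
    """Apply the recoding SNP by SNP to a list of equal-length haplotypes."""
    columns = [recode_snp_column(list(col)) for col in zip(*haplotypes)]
    return [list(row) for row in zip(*columns)]
```

Recoding is done per SNP rather than per haplotype because the 0/1 labels only have meaning relative to the two alleles seen at that SNP across all haplotypes.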