/* Founder_test.c is the C program that performs the test of the founder effect hypothesis, as described in the paper "A population genetic test of founder effects and implications for Ashkenazi Jewish diseases" by Montgomery Slatkin, American Journal of Human Genetics, in press (2004). The program can be compiled using the gcc compiler available on Linux and most Unix systems. The input is read from two files. One contains the genetic parameters (defined below) and has to have .gen extention (e. g. file1.gen). The other contains the demographic parameters (defined below) and has to have .dem extention (e. g. file2.dem). This separation allows the analysis of several diseases using several demographic scenarios, as was done in analysis of Ashkenazi Jewish diseases. Once the program is compiled, different cases are run from the command line. If the program is named Founder_test, then type Founder_test file1 file2 tF maxrep where file1 contains the genetic parameters, file2 contains the demographic parameter, tF is the hypothetical time of the founder event, and maxrep is the number of replicates. tF is in generations and is not necessarily an integer. maxrep must be an integer. The program will create an output file file2.file1.tF=XX.out, where XX will be replaced by tF on the command line. This file will contain the genetic, demographic and other parameters and the output from the run. The program can also create a tab-separated file, file file2.file1.tF=XX.Fm that contains the histogram of the number of lineages at tF, F(m) in the paper, suitable for plotting. To obtain the histogram file, add a 1 to the end of the command line, e. g. Founder_test file1 file2 tF maxrep 1 The format for both input files is the same, the parameter name is followed by the parameter value. Parameters can be in any order. The only exception is that in the .dem file, the number of transitions must be read before the vector of population sizes or population growth rates. .dem file (demographic parameters) transitions (the number of changes in past population growth rate) tt (times at which effective size is specified): a list containing one time for each transition. ne (effective population size at the beginning of each interval), a list containing (transitions+1) values. The first number is the current effective size (t=0) and the rest are the sizes at the values of tt listed. t2max (maximum of t2 used in creating the histogram of of the posterior distribution of t2, which is used to compute the 95% credible set): default is 1000 which should be adequate for most purposes. r_last (rate of exponential growth before the last transition); default is 0. Comments on .dem file To model a population of constant size 1000 use transitions 0 ne 1000 To model a population a population with current size 1 million and past growth rate 0.05 use transitions 0 ne 1e6 r_last 0.05 The demographic model used in the paper cited (with effective size 1000 at t=77.2 and effective size 2000 at t=26.08) was generated by tmax 1000 transitions 9 ne 3500000 2000000 250000 53333 66666 5000 2000 50000 1000 500000 tt 4 9.4 13.32 14.08 20 26.08 36.16 77.2 78 r_last 0.1 Other combinations of effective sizes are obtained by changing the second from last and fourth from last values on the list for ne. .gen file (genetic parameters) i (number of mutant chromosomes in sample) xT (mutant allele frequency in population) j0 (number of non-recombinant chromosomes in sample) theta (recombination rate between mutant and marker loci) pN (marker allele frequency on non-mutant chromosomes) For example, data for the I1307K allele at the APC locus and the D5S135 marker locus is modeled by i 87 xT 0.06 j0 83 theta 0.0046 pN 0.343 initseed 12345 meaning that 87 chromosomes carrying I1307K at the APC locus were sampled, 83 of those carried ancestral allele. The population frequency of I1307K is 0.06. The recombination rate is 0.0046 and the frequency of the ancestral allele on non-I1370K chromosomes is 0.343 Comments on .gen file If initseed is specified, then that is the initial seed of the random number generator. If the same seed is used with the same parameter values, identical results should be obtained. If initseed is not in the .gen file, then the initial seed is obtained from the time function and will be reported. If file1.dem and file2.gen contains the parameters given above, and Founder_test file1 file2 77.2 1000 is typed on the command line, the result should be tF = 77.2 maxrep = 1000 histogram = 0 transitions = 9 ne: 3.5e+06 2e+06 250000 53333 66666 5000 2000 50000 1000 500000 tt: 4 9.4 13.32 14.08 20 26.08 36.16 77.2 78 r_last = 0.1 Vector of growth rates is computed r: 0.139904 0.385082 0.394109 -0.293605 0.437543 0.150706 -0.319333 0.0953222 -7.76826 0.1 i = 87 xT = 0.06 j0 = 83 theta = 0.0046 pN = 0.343 initseed = 12345 n = 1450 Neutrality test: P(j>=j0) = 0.00172464 t2ave = 163.1: (142, 184), N(t2) = 101.104 Pr[m=5] = 0.0488049 Pr[m=6] = 0.0314293 Pr[m=7] = 0.121639 Pr[m=8] = 0.0787667 Pr[m=9] = 0.0845917 Pr[m=10] = 0.188396 Pr[m=11] = 0.112154 Pr[m=12] = 0.0876747 Pr[m=13] = 0.081472 Pr[m=14] = 0.0455076 Pr[m=15] = 0.0606965 Pr[m=16] = 0.0265407 Pr[m=17] = 0.016146 Pr[m=18] = 0.00464355 Pr[m=19] = 0.00372894 Pr[m=20] = 0.00293029 Pr[m=21] = 0.00337731 Pr[m=22] = 0.000873594 Pr[m=24] = 0.000627533 Founder lineages test: Pr(m<=1) = 0, Ave(m) = 10.4693 Program took 1.05 minutes should be printed to the screen and printed to a new file, file1.file2.tF77.2.out Comment The runnning time (for the 1000 replicates specified on the command line) was for a 700 mhz Dell system running Linux and will probably be different for other systems. If instead, Founder_test file1 file2 77.2 1000 1 is typed on the command line (i. e. a 1 is added to the end) the same results will be obtained with an additional file, file1.file2.tF77.2.Fm containing m Pr(m) 5 0.04880 6 0.03143 7 0.12164 8 0.07877 9 0.08459 10 0.18840 11 0.11215 12 0.08767 13 0.08147 14 0.04551 15 0.06070 16 0.02654 17 0.01615 18 0.00464 19 0.00373 20 0.00293 21 0.00338 22 0.00087 24 0.00063 will be created. This program may be freely copied and modified. I would appreciate being told of any problems with it. Montgomery Slatkin June 17, 2004 */ // BEGINNING OF C PROGRAM #include #include #include #include #include #include #define min(x, y) (((x) < (y)) ? x : y) #define max(x, y) (((x) > (y)) ? x : y) #define NODELIMIT 1300000 #define BIGTIME 1e9 struct nodetype { int ident; int terminal; int descendents; double coaltime; struct nodetype *left, *right; }; struct nodetype node[NODELIMIT]; int main(int argc, char *argv[]) { FILE *ipt, *opt, *cpt; long start_time; int repno, maxrep, nodeno, j; int nextnode, lineages, *popvec; char name[60], dummy[99]; double tau, t, deltau; double probsum, weightsum; double *times; struct nodetype *npt, *lpt, *rpt; double u, v, P; int nodecount, phaseno, nF; double *taustar; double kFsum, *Fmvec; double nFsum, kFave; double t2ave, t2size, *t2histogram; double tF; int histogram; double kF01; double t2sum; int t2low, t2high; double *r=NULL; double r_last=0.0; double *tt; double *ne=NULL; int transitions=0; int t2max=1000; int initseed=-1; int i; int n; int j0; double theta=0.0; double pN; double xT; void processtree(struct nodetype *npt, double u, double v, int i, int j, double *ppt, double *wpt, int *countpt, double tF, double *times, double *Fmvec, double *t2pt, double *t2hist); double readdouble(char *c, FILE *ipt, FILE *opt); int readint(char *c, FILE *ipt, FILE *opt); void crash(char *text); double size(double t, double *ne, double *r, double *tt, int transitions); void readdvector(char *name, double *v, int first, int last, FILE *ipt, FILE *opt); void zerovec(double *v, int low, int high); double sumvec(double *v, int i, int j); double size(double t, double *ne, double *r, double *tt, int transitions); void write_time(FILE *pt, long net_time, long start_time); void randomize(int *v, int m, int n); double unif(void); double expdist(double a); int randint(int max); void gsrand(int s); int *ivector(long nl, long nh); double *vector(long nl, long nh); if (argc < 5) crash("2 filenames, tF and maxrep must be on command line"); tF = strtod(argv[3], NULL); printf("tF = %g\n", tF); maxrep = atoi(argv[4]); printf("maxrep = %d\n", maxrep); start_time = time(NULL); sprintf(name, "%s.%s.tF%g.out", argv[1], argv[2], tF); opt = fopen(name, "w"); fprintf(opt, "maxrep = %d\n", maxrep); fprintf(opt, "tF = %g\n", tF); if (argc == 5) histogram = 0; else histogram = atoi(argv[5]); printf("histogram = %d\n", histogram); if (histogram) { printf("Histogram of m values will be printed.\n"); fprintf(opt, "Histogram of m values will be printed.\n"); sprintf(name, "%s.%s.tF%g.Fm", argv[1], argv[2], tF); cpt = fopen(name, "w"); fprintf(cpt, "m\tPr(m)\n"); } sprintf(name, "%s.dem", argv[2]); ipt = fopen(name, "r"); while (1) { fscanf(ipt, "%s", dummy); if (feof(ipt) || !strcmp(dummy, "end")) break; else if (!strcmp(dummy, "t2max")) t2max = readint(dummy, ipt, opt); else if (!strcmp(dummy, "transitions")) { transitions = readint(dummy, ipt, opt); tt = vector(0, transitions); r = vector(0, transitions); ne = vector(0, transitions); } else if (!strcmp(dummy, "ne")) { if (ne) readdvector(dummy, ne, 0, transitions, ipt, opt); else crash("transitions must be read before ne."); } else if (!strcmp(dummy, "tt")) { if (ne) { tt[0] = 0; readdvector(dummy, tt, 1, transitions, ipt, opt); } else crash("transitions must be read before tt."); } else if (!strcmp(dummy, "r_last")) r_last = readdouble(dummy, ipt, opt); } // end, while if (transitions) { printf("Vector of growth rates is computed\n"); fprintf(opt, "Vector of growth rates is computed\n"); for (phaseno=0; phaseno 1e-6) taustar[phaseno+1] = taustar[phaseno] + (exp(r[phaseno]*(tt[phaseno+1]-tt[phaseno]))-1) / (2*ne[phaseno]*r[phaseno]); else taustar[phaseno+1] = taustar[phaseno] + (tt[phaseno+1]-tt[phaseno]) / (2 * ne[phaseno]); n = i / xT; if (n >= NODELIMIT / 2) crash("n is too large for NODELIMIT"); else { printf("n = %d\n", n); fprintf(opt, "n = %d\n", n); } u = theta * (1 - pN); v = theta * pN; gsrand(initseed); popvec = ivector(1, n); times = vector(0, i+1); times[0] = times[1] = BIGTIME; times[i+1] = 0.0; // so qsort can be used, they should not change for (nodeno=1, npt=node+1; nodeno<2*n; nodeno++, npt++) { npt->ident = nodeno; if (nodeno <= n) { npt->terminal = 1; npt->descendents = 1; npt->coaltime = 0.0; npt->left = npt->right = 0; } else npt->terminal = 0; } Fmvec = vector(1, i); zerovec(Fmvec, 0, i); nFsum = t2ave = 0.0; t2histogram = vector(1, t2max); zerovec(t2histogram, 0, t2max); probsum = weightsum = 0.0; nodecount = 0; // number of nodes with i descendents for (repno=1; repno<=maxrep; repno++) { nextnode = n + 1; for (nodeno=1; nodeno<=n; nodeno++) popvec[nodeno] = nodeno; randomize(popvec, 1, n); phaseno = 0; for (nF=0, tau=0.0, lineages=n; lineages>1; lineages--) { deltau = expdist(lineages * (lineages - 1.0) / 2); tau += deltau; while (tau > taustar[phaseno+1] && phaseno < transitions) phaseno++; if (fabs(r[phaseno] > 1e-6)) t = tt[phaseno] + log(1 + 2 * ne[phaseno] * r[phaseno] * (tau - taustar[phaseno])) / r[phaseno]; else t = tt[phaseno] + 2 * ne[phaseno] * (tau - taustar[phaseno]); if (nF == 0 && t >= tF) nFsum += nF = lineages; npt = &node[nextnode]; lpt = &node[popvec[lineages-1]]; rpt = &node[popvec[lineages]]; npt->coaltime = t; npt->left = lpt; npt->right = rpt; npt->descendents = lpt->descendents + rpt->descendents; popvec[lineages-1] = nextnode++; randomize(popvec, lineages-1, lineages-1); } npt = &node[nextnode-1]; // point to the root processtree(npt, u, v, i, j0, &probsum, &weightsum, &nodecount, tF, times, Fmvec, &t2ave, t2histogram); } P = probsum / weightsum; printf("Neutrality test: P(j>=j0) = %g\n", P); fprintf(opt, "Neutrality test: P(j>=j0) = %g\n", P); t2ave /= probsum; t2size = size(t2ave, ne, r, tt, transitions); t2sum = 0.0; t2low = 0; while (t2sum < 0.025) t2sum += t2histogram[++t2low]; t2sum = 0.0; t2high = t2max + 1; while (t2sum < 0.025) t2sum += t2histogram[--t2high]; printf("t2ave = %.1f: (%d, %d), N(t2) = %g\n", t2ave, t2low, t2high, t2size); fprintf(opt, "t2ave = %.1f: (%d, %d), N(t2) = %g\n", t2ave, t2low, t2high, t2size); kFsum = sumvec(Fmvec, 0, i); for (kFave=0.0, j=0; j<=i; j++) if (Fmvec[j] > 0.0) { Fmvec[j] /= kFsum; kFave += j * Fmvec[j]; printf("Pr[m=%d] = %g\n", j, Fmvec[j]); fprintf(opt, "Pr[m=%d] = %g\n", j, Fmvec[j]); } nFsum /= maxrep; kF01 = sumvec(Fmvec, 0, 1); // 1 is correct printf("Founder lineages test: Pr(m<=1) = %g, Ave(m) = %g\n", kF01, kFave); fprintf(opt, "Founder lineages test: Pr(m<=1) = %g, Ave(m) = %g\n", kF01, kFave); if (histogram) for (j=0; j<=i; j++) if (Fmvec[j] >= 5e-6) fprintf(cpt, "%d\t%.5f\n", j, Fmvec[j]); write_time(opt, time(NULL) - start_time, start_time); exit(1); } double size(double t, double *ne, double *r, double *tt, int transitions) { int phaseno; void crash(char *text); phaseno = 0; while (t > tt[phaseno+1] && phaseno < transitions) phaseno++; return ne[phaseno] * exp(-r[phaseno] * (t - tt[phaseno])); } void readdvector(char *name, double *v, int first, int last, FILE *ipt, FILE *opt) { int i; printf("%s: ", name); fprintf(opt, "%s: ", name); for (i=first; i<=last; i++) { fscanf(ipt, "%lf", &v[i]); printf("%g ", v[i]); fprintf(opt, "%g ", v[i]); } printf("\n"); fprintf(opt, "\n"); } void processtree(struct nodetype *npt, double u, double v, int i, int j, double *ppt, double *wpt, int *countpt, double tF, double *times, double *Fmvec, double *t1pt, double *t1hist) { double weight; int tcount; void findtimes(struct nodetype *npt, double *times, int *cpt); void processtree(struct nodetype *npt, double u, double v, int i, int j, double *ppt, double *wpt, int *countpt, double tF, double *times, double *Fmvec, double *t1pt, double *t1hist); void updatesums(int i, int j, double u, double v, double weight, double tF, double *times, double *ppt, double *Fmvec, double *t1pt, double *t1hist); if (npt->left->descendents == i) { (*countpt)++; weight = npt->coaltime - npt->left->coaltime; *wpt += weight; tcount = 2; findtimes(npt->left, times, &tcount); updatesums(i, j, u, v, weight, tF, times, ppt, Fmvec, t1pt, t1hist); } else if (npt->left->descendents > i) processtree(npt->left, u, v, i, j, ppt, wpt, countpt, tF, times, Fmvec, t1pt, t1hist); if (npt->right->descendents == i) { (*countpt)++; weight = npt->coaltime - npt->right->coaltime; *wpt += weight; tcount = 2; findtimes(npt->right, times, &tcount); updatesums(i, j, u, v, weight, tF, times, ppt, Fmvec, t1pt, t1hist); } else if (npt->right->descendents > i) processtree(npt->right, u, v, i, j, ppt, wpt, countpt, tF, times, Fmvec, t1pt, t1hist); } void updatesums(int i, int j, double u, double v, double weight, double tF, double *times, double *ppt, double *Fmvec, double *t1pt, double *t1hist) { double dataprob; int kF; double t1; double recomcalc(int j0, double u, double v, double *t, int i, double *j0pt); double mutchance, j0prob, unif(void); dataprob = recomcalc(j, u, v, times, i, &j0prob); *ppt += weight * dataprob; t1 = times[2] + weight * unif(); *t1pt += weight * dataprob * t1; t1hist[(int) t1] += weight * dataprob; if (times[2] > tF) { kF = 2; while (times[kF+1] > tF) kF++; Fmvec[kF] += weight * j0prob; } else if (times[2] + weight > tF) { mutchance = tF - times[2]; Fmvec[0] += j0prob * mutchance; Fmvec[1] += j0prob * (weight - mutchance); } else Fmvec[0] += j0prob * weight; } double recomcalc(int j0, double u, double v, double *times, int i, double *j0pt) { int j, k; double sum, delt, p12, p21; double **transmat; double *pv1, *pv2; // probabilities of j As void fillbinomial(double p, int n, double *v); void matfill(double **m, int k, double p21, double p12); void matrixvector(double *v1, double **m, double *v2, int n); int compare(const void *e1, const void *e2); double *vector(long nl, long nh); void freevector(double *v, long nl, long nh); double **matrix(long nrl, long nrh, long ncl, long nch); void freematrix(double **m, long nrl, long nrh, long ncl, long nch); qsort((void *) times, (size_t) i+2, sizeof(double), compare); pv1 = vector(0, i); pv2 = vector(0, i); pv2[0] = 0.0; pv2[1] = 1.0; for (k=2; k<=i; k++) { pv1[0] = pv2[0]; pv1[k] = pv2[k-1]; for (j=1; j=j0; j--) sum += pv2[j]; freevector(pv1, 0, i); freevector(pv2, 0, i); return sum; } void findtimes(struct nodetype *npt, double *times, int *cpt) { if (npt->terminal == 0) { times[*cpt] = npt->coaltime; (*cpt)++; findtimes(npt->left, times, cpt); findtimes(npt->right, times, cpt); } } void matfill(double **m, int k, double p21, double p12) { int i, j, ii; double sum, *plusvec, *minusvec; double *vector(long nl, long nh); double **matrix(long nrl, long nrh, long ncl, long nch); void fillbinomial(double p, int n, double *v); void freevector(double *v, long nl, long nh); for (j=0; j<=k; j++) { plusvec = vector(0, k-j); fillbinomial(p21, k-j, plusvec); minusvec= vector(0, j); fillbinomial(p12, j, minusvec); for (i=0; i<=k; i++) { if (i<=j) for (sum=0.0, ii=0; ii<=min(k-j,i); ii++) sum += plusvec[ii] * minusvec[ii+j-i]; else for (sum=0.0, ii=0; ii<=min(k-i,j); ii++) sum += minusvec[ii] * plusvec[ii+i-j]; m[k-i][k-j] = sum; } freevector(plusvec, 0, k-j); freevector(minusvec, 0, j); } } void fillbinomial(double p, int n, double *v) { int i; if (p == 0.0) { v[0] = 1.0; for (i=1; i<=n; i++) v[i] = 0.0; } else if (p == 1.0) { v[n] = 1.0; for (i=0; i 0.0 && p <= 0.5) { v[0] = pow(1-p, n); for (i=1; i<=n; i++) v[i] = v[i-1] * (n-i+1.0) * p / i / (1-p); } else if (p < 1.0 && p > 0.5) { v[n] = pow(p, n); for (i=n-1; i>=0; i--) v[i] = v[i+1] * (i+1) * (1-p) / (n-i) / p; } } void matrixvector(double *v1, double **m, double *v2, int n) { int i, j; for (i=0; i<=n; i++) for (v1[i]=0.0, j=0; j<=n; j++) v1[i] += m[i][j] * v2[j]; } int compare(const void *e1, const void *e2) { const double *x1, *x2; x1 = e1; x2 = e2; return (*x1 > *x2) ? -1 : (*x1 < *x2) ? 1: 0; } #include #define A 16807 #define M 2147483647 #define Q 127773 #define R 2836 static int seed; int grand(void) { int test; test = A * (seed % Q) - R * (seed / Q); if (test > 0) seed = test; else seed = test + M; return(seed); } /* SRAND - Seed the random number generator */ void gsrand(int s) { seed = s; } /* DRAND - Double uniform random number generator */ #define RM 2147483647.0 double unif(void) /* This is drand renamed to be consistent with my usage */ { int grand(void); return ((double) grand() / RM); } int randint(int max) /* Returns a random integer between 1 and max */ { int i; double unif(); i = unif() * max + 1; if (i <= max) return(i); else return(max); } double expdist(double a) // this was modified to return a double { double unif(void); return -log(unif()) / a; } /* end, expdist */ // randomly swaps elements m through n of v void randomize(int *v, int m, int n) { int randint(int max), i, j, temp; for (i=m; i<=n; i++) { j = randint(n); temp = v[i]; v[i] = v[j]; v[j] = temp; } } void crash(char *s) { printf("%s\n", s); exit(0); } double readdouble(char *c, FILE *ipt, FILE *opt) { double param; fscanf(ipt, "%lf", ¶m); printf("%s = %g\n", c, param); if (opt) fprintf(opt, "%s = %g\n", c, param); return param; } int readint(char *c, FILE *ipt, FILE *opt) { int param; fscanf(ipt, "%d", ¶m); printf("%s = %d\n", c, param); if (opt) fprintf(opt, "%s = %d\n", c, param); return param; } double *vector(int nl, int nh) { double *v; v=(double *)malloc((size_t) ((nh-nl+1+1)*sizeof(double))); return v-nl+1; } int *ivector(long nl, long nh) { int *v; v=(int *)malloc((size_t) ((nh-nl+1+1)*sizeof(int))); return v-nl+1; } void freevector(double *v, int nl, int nh) { free((char*) (v+nl-1)); } void freematrix(double **m, long nrl, long nrh, long ncl, long nch) { free((char*) (m[nrl]+ncl-1)); free((char*) (m+nrl-1)); } void zerovec(double *v, int low, int high) { int i; for(i=low; i<=high; i++) v[i] = 0.0; } double sumvec(double *v, int low, int high) { int i; double sum; for (sum=0.0, i=low; i<=high; i++) sum += v[i]; return sum; } void write_time(FILE *pt, long net_time) { if (net_time <= 60) { printf("Program took %ld seconds\n", net_time); if (pt) fprintf(pt, "Program took %ld seconds\n", net_time); } else if (net_time > 60 && net_time <= 3600) { printf("Program took %4.2f minutes\n", net_time / 60.0); if (pt) fprintf(pt, "Program took %4.2f minutes\n", net_time / 60.0); } else { printf("Program took %4.2f hours\n", net_time / 3600.0); if (pt) fprintf(pt, "Program took %4.2f hours\n", net_time / 3600.0); } } double **matrix(long nrl, long nrh, long ncl, long nch) { long i, nrow=nrh-nrl+1,ncol=nch-ncl+1; double **m; m=(double **) malloc((size_t)((nrow+1)*sizeof(double*))); m += 1; m -= nrl; m[nrl]=(double *) malloc((size_t)((nrow*ncol+1)*sizeof(double))); m[nrl] += 1; m[nrl] -= ncl; for(i=nrl+1;i<=nrh;i++) m[i]=m[i-1]+ncol; return m; }