Information for Slatkin's program to estimate pairwise levels of gene flow and to regress the estimated level of gene flow, M^, on geographic distance. What follows this text is a program I have written to estimate levels of gene flow between pairs of sampling locations. There is also a sample input and two sample output files that I will describe later. This is based on the program I used to produce the figures in my paper "Isolation by distance in equilibrium and non-equilibrium populations" which appeared in Evolution (47(1): 264-279; 1993). This version corrects some minor errors that were found in earlier versions, but those errors do not change the conclusions in that paper. I thank Michael Hellberg, Michel Raymond, Francois Rousset, and Neal Taylor for looking at the code in detail and bringing errors to my attention. This program may be copied and distributed. Questions should be sent to me (Montgomery Slatkin, Department of Integrative Biology, University of California, Berkeley 94720-3140; slatkin@garnet.berkeley.edu). The program was also revised to make use of dynamic memory allocation and can be run on an IBM PC or on a UNIX system. On the IBM, I used the Turbo C compiler but there is no reason that it should not work with other compilers and on other computers. I have tried to conform to the ANSI C as enforced by the Turbo C compiler. With that compiler, I get two warning messages but they do not affect the functioning of the program. When you compile, use the Large Memory Model. Some compilers, such as the gnu C complier on the NeXT, do not want the statement #include . Other compilers, such as the Turbo C compiler on the PC need that statement. I have put it in as a comment. You can remove it or use it as required. To run the program you need to put your data in the same format as in the sample input file. --The first line contains the number of locations sampled. Only the number is needed. Anything that follows is for your information. --After that are the sample sizes, one per line. Again, anything that follows the number is not used by the program. --Then the number of loci. The first line for each locus is not used but you must have at least one character on that line, e.g. the locus name or other information about the locus. --Then the number of alleles at that locus. If that number is greater than 20, you will get an error message and the program will quit. If you really have more than 20 alleles at a locus, increase the value of ALLELELIMIT accordingly and recompile. --The allele frequencies are for the locations in the order specified in the list of sample sizes. As far as the program is concerned, the first location is location 0, the second is location 1, etc. The frequencies can be separated by any number of spaces, carriage returns or tabs. The format is strictly for your convenience and ease of reading. However, some systems have a limit to the number of characters in a line. --After the last allele frequency is the matrix of geographic distances. The program assumes that there are d(d-1)/2 distances, where d is the number of locations sampled, in the order 0-1, 0-2,..., 0-(d-1), 1-2,...,1-(d-1), 2-3, ..., (d-2)-(d-1) [Remember, the locations are numbered 0, 1, ..., (d-1)]. The distances can be separated by any number of spaces, tabs or carriage returns. The matrix format I used just helps you read the input easily. Using Turbo C, you should be able to just use the "run" command. The program will then ask you for the name of your input file, which you presumably have already chosen, and the names of two output files. The first file, is the output file that contains all the information from the run, including all the input and matrices of the M^ (pronounced "M hat" and written correctly with the carat over the M) values computed using both the Weir and Cockerham theta statistic and Nei's GST statistic. [As a reminder, M^ = (1/FST - 1)/4, where FST is estimated by one of these two methods]. At the end are the two sets of regression coefficients, a and b, which result from regressing log10(M^) against the log10(distance) [i.e. log10(M^) = a + b log10(distance)]. The second output file, the graphics output, contains only the values of distances, M^ computed from each method, and the logarithms of those three numbers for each pair of locations. They are separated by tabs and can be read directly by Cricket Graph and other graphics programs. Other comments. --It is possible that some of the M^ values obtained using theta are negative. That is not an error. It just means that the true FST values for those pairs of locations are extremely small and are probably not significantly different from zero. The a and b values do not include the negative values. You should not get negative M^ values using Gst. However, I prefer theta because Weir and Cockerham (Evolution, 47(3):855-863) have shown that thete (which they call beta in that paper) provides an estimator of M^ which has a low bias. --There is no sophisticated error checking for the input. The program assumes that the number of locations, the number of loci and the number of alleles at each locus are what you say. If your input file is inconsistent in some way, the program will crash and not tell you why, and I will not be able to tell you why either. If you do get a mysterious crash, check the output file. It will probably tell you that the program thinks the number of loci, sampling locations or allele frequencies are not what you think they are. That means that you have a format error in your input file. --When values of M^ are relatively small, the values from theta are roughly one half of the values from GST. That is not an error and in fact conforms to the analytic theory. When only two locations are sampled, there is a factor of 2 difference in what theta and GST estimate. For larger values of M^, the differences between the two ways of computing M^ are not consistent. I would not place much faith in estimates of M^ greater than 10 or so, other than that they indicate very high levels of gene flow or effective panmixia. Montgomery Slatkin September 28, 1994 ---------------------- C Program #include #include #include /* #include YOUR COMPILER WILL TELL YOU IF YOU NEED THIS */ #define ALLELELIMIT 20 /* maximum number of allelic types at any locus */ struct infotype { /* contains information about demes sampled */ int samsize; double x, y; /* location of deme sampled */ }; struct infotype *samples; int maxlocus, maxsam, *maxallele, s1, s2; double ***table; double **theta, /* pairwise estimates of Nm using Weir and Cockerham */ **gst, /* pairwise estimates of Nm using Nei, 1973 */ **distmat; /* geographic distances */ FILE *ipt, *opt, *grp; void main() { int s1, s2; double weir(int s1, int s2, int n1, int n2); double nei(int s1, int s2); char testchar, inname[20], outname[20], graphname[20]; void readdata(), readdist(), writemat(double **m, int max); void writecorr(double **v); printf("Name of input file? "); scanf("%s", inname); printf("Name of output file? "); scanf("%s", outname); printf("Name of Cricket graph file? "); scanf("%s", graphname); ipt = fopen(inname, "r"); /* Allele frequencies and parameters */ opt = fopen(outname, "w"); /* Output matrix of Mhat values */ grp = fopen(graphname, "w"); /* File for Cricket Graph input */ readdata(); readdist(); for (s1=0; s1 0.0) fprintf(grp, "%6.3f\t", log10(theta[s1][s2])); else fprintf(grp, "\t"); fprintf(grp, "%6.3f\n", log10(gst[s1][s2])); } } /* end, main */ void readdata() { int samno, alleleno, locno; int *ivector(int nl, int nh); struct infotype *create_info(int n); double **dmatrix(int nrl, int nrh, int ncl, int nch); void readln(void), printdata(char *c), zerotable(void); double ***d3tensor(int n1l, int n1h, int n2l, int n2h, int n3l, int n3h); fscanf(ipt, "%d", &maxsam); readln(); fprintf(opt, "%1d locations sampled\n", maxsam); maxallele = ivector(0, maxlocus - 1); samples = create_info(maxsam); for (samno=0; samno= ALLELELIMIT) { fprintf(opt, "There are too many alleles (%1d) at locus %1d\n", maxallele[locno], locno); fprintf(opt, "Increase the value of ALLELELIMIT and recomplile the program\n"); printf("There is a problem with the input file, see your output file\n"); exit(1); } for (alleleno=0; alleleno 0.0) return(dst / ht); else return(-1.0); } /* end, Nei */ double dsqr(double x) { return(x * x); } void writemat(double **m, int max) { int s1, s2; for (s1=0; s1 0.0) && (mhat[s1][s2] < 10000.0)) { x = log10(distmat[s1][s2]); y = log10(mhat[s1][s2]); xcount++; xsum += x; ysum += y; x2sum += dsqr(x); xysum += x * y; } xbar = xsum / xcount; ybar = ysum / xcount; varx = x2sum / xcount - dsqr(xbar); covxy = xysum / xcount - xbar * ybar; b = covxy / varx; a = ybar - b * xbar; fprintf(opt, "\na = %g, b = %g\n\n", a, b); } /* end, writecorr */ double **dmatrix(int nrl, int nrh, int ncl, int nch) { int i; double **m; void nrerror(char error_text[]); m=(double **) malloc((unsigned) (nrh-nrl+1)*sizeof(double*)); if (!m) nrerror("allocation failure 1 in dmatrix()"); m -= nrl; for (i=nrl; i<=nrh; i++) { m[i]=(double *) malloc((unsigned) (nch-ncl+1)*sizeof(double)); if (!m[i]) nrerror("allocation failure 2 in dmatrix()"); m[i] -= ncl; } return m; } /* end dmatrix */ double ***d3tensor(int n1l, int n1h, int n2l, int n2h, int n3l, int n3h) { int i, j; double ***m; void nrerror(char error_text[]); m=(double ***) malloc((unsigned) (n1h-n1l+1)*sizeof(double**)); if (!m) nrerror("allocation failure 1 in d3tensor()"); m -= n1l; for (i=n1l; i<=n1h; i++) { m[i]=(double **) malloc((unsigned) (n2h-n2l+1)*sizeof(double*)); if (!m[i]) nrerror("allocation failure 2 in d3tensor()"); m[i] -= n2l; for (j=n2l; j<=n2h; j++) { m[i][j]=(double *) malloc((unsigned) (n3h-n3l+1)*sizeof(double)); if (!m[i][j]) nrerror("allocation failure 3 in d3tensor()"); m[i][j] -= n3l; } } return m; } int *ivector(int nl, int nh) { int *v; void nrerror(char error_text[]); v = (int *)malloc((unsigned) (nh-nl+1)*sizeof(int)); if (!v) nrerror("allocation failure in ivector()"); return v-nl; } struct infotype *create_info(int n) { int i; struct infotype *tpt; void nrerror(char error_text[]); tpt = (struct infotype *)malloc((unsigned) n*sizeof(struct infotype)); if (!tpt) nrerror("Allocation failure in creat_info()"); return tpt; } void nrerror(char error_text[]) { fprintf(stderr, "You have probably run out of memory...\n"); fprintf(stderr, "%s\n", error_text); fprintf(stderr, "...the program must terminate...\n"); exit(1); } __________________________ Sample input file 7 DEMES SAMPLED 39 N11D.......11D 1 37 GENESIS....13D 2 42 PARIGO.....13D 3 56 CLAM.ACRES.21D 4 24 HANG.GARD..21D 5 40 ROSE.GARD..GAL 6 28 NEW.VENT...GAL 7 12 LOCI HK 2 ALLELES 0.000 0.000 0.000 0.000 0.038 1.000 0.967 1.000 1.000 1.000 1.000 0.962 0.000 0.033 IDH-1 3 ALLELES 0.176 0.163 0.250 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.011 0.021 0.000 0.000 0.824 0.837 0.750 0.989 0.979 1.000 1.000 IDH-2 3 ALLELES 1.000 1.000 1.000 1.000 1.000 0.459 0.434 0.000 0.000 0.000 0.000 0.000 0.010 0.000 0.000 0.000 0.000 0.000 0.000 0.531 0.566 PGM 5 ALLELES 0.051 0.028 0.037 0.018 0.000 0.000 0.000 0.346 0.375 0.481 0.709 0.652 0.097 0.083 0.590 0.583 0.481 0.273 0.348 0.411 0.472 0.013 0.014 0.000 0.000 0.000 0.468 0.444 0.000 0.000 0.000 0.000 0.000 0.024 0.000 GPI 4 ALLELES 0.029 0.016 0.000 0.011 0.059 0.031 0.032 0.066 0.056 0.040 0.000 0.000 0.016 0.011 0.882 0.927 0.934 0.983 0.926 0.938 0.936 0.022 0.000 0.025 0.006 0.015 0.016 0.021 AAT-1 4 ALLELES 0.000 0.000 0.000 0.000 0.000 0.000 0.010 0.000 0.024 0.000 0.000 0.000 0.661 0.813 0.982 0.952 0.973 0.929 0.926 0.339 0.167 0.018 0.024 0.027 0.071 0.074 0.000 0.010 AAT-2 2 ALLELES 0.000 0.023 0.006 0.000 0.000 0.981 0.946 1.000 0.977 0.994 1.000 1.000 0.019 0.054 PEP-2 4 ALLELES 0.025 0.000 0.000 0.149 0.000 0.000 0.000 0.975 1.000 1.000 0.851 0.972 0.197 0.235 0.000 0.000 0.000 0.000 0.000 0.803 0.765 0.000 0.000 0.000 0.000 0.028 0.000 0.000 PEP-4 2 ALLELES 0.138 0.000 0.000 0.029 0.000 0.000 0.000 0.863 1.000 1.000 0.971 1.000 1.000 1.000 MPI 5 ALLELES 0.000 0.000 0.053 0.000 0.000 0.016 0.000 0.094 0.050 0.158 0.007 0.094 0.000 0.000 0.604 0.650 0.632 0.787 0.750 0.242 0.500 0.292 0.300 0.158 0.206 0.156 0.629 0.265 0.010 0.000 0.000 0.000 0.000 0.113 0.235 EST-2 3 ALLELES 0.027 0.000 0.000 0.026 0.125 1.000 1.000 0.973 0.980 0.971 0.974 0.875 0.000 0.000 0.000 0.020 0.029 0.000 0.000 0.000 0.000 EST-3 2 ALLELES 1.000 0.929 0.971 1.000 1.000 0.985 1.000 0.000 0.071 0.029 0.000 0.000 0.015 0.000 158 158 1215 1213 2300 2300 2 1057 1055 2525 2525 1057 1055 2525 2525 8 3582 3582 3582 3582 1 _______________________________ Sample output file 7 locations sampled Location 0, sample size = 39 Location 1, sample size = 37 Location 2, sample size = 42 Location 3, sample size = 56 Location 4, sample size = 24 Location 5, sample size = 40 Location 6, sample size = 28 12 loci sampled Input data set Locus 0, 2 alleles 0.000 0.000 0.000 0.000 0.038 1.000 0.967 1.000 1.000 1.000 1.000 0.962 0.000 0.033 Locus 1, 3 alleles 0.176 0.163 0.250 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.011 0.021 0.000 0.000 0.824 0.837 0.750 0.989 0.979 1.000 1.000 Locus 2, 3 alleles 1.000 1.000 1.000 1.000 1.000 0.459 0.434 0.000 0.000 0.000 0.000 0.000 0.010 0.000 0.000 0.000 0.000 0.000 0.000 0.531 0.566 Locus 3, 5 alleles 0.051 0.028 0.037 0.018 0.000 0.000 0.000 0.346 0.375 0.481 0.709 0.652 0.097 0.083 0.590 0.583 0.481 0.273 0.348 0.411 0.472 0.013 0.014 0.000 0.000 0.000 0.468 0.444 0.000 0.000 0.000 0.000 0.000 0.024 0.000 Locus 4, 4 alleles 0.029 0.016 0.000 0.011 0.059 0.031 0.032 0.066 0.056 0.040 0.000 0.000 0.016 0.011 0.882 0.927 0.934 0.983 0.926 0.938 0.936 0.022 0.000 0.025 0.006 0.015 0.016 0.021 Locus 5, 4 alleles 0.000 0.000 0.000 0.000 0.000 0.000 0.010 0.000 0.024 0.000 0.000 0.000 0.661 0.813 0.982 0.952 0.973 0.929 0.926 0.339 0.167 0.018 0.024 0.027 0.071 0.074 0.000 0.010 Locus 6, 2 alleles 0.000 0.023 0.006 0.000 0.000 0.981 0.946 1.000 0.977 0.994 1.000 1.000 0.019 0.054 Locus 7, 4 alleles 0.025 0.000 0.000 0.149 0.000 0.000 0.000 0.975 1.000 1.000 0.851 0.972 0.197 0.235 0.000 0.000 0.000 0.000 0.000 0.803 0.765 0.000 0.000 0.000 0.000 0.028 0.000 0.000 Locus 8, 2 alleles 0.138 0.000 0.000 0.029 0.000 0.000 0.000 0.863 1.000 1.000 0.971 1.000 1.000 1.000 Locus 9, 5 alleles 0.000 0.000 0.053 0.000 0.000 0.016 0.000 0.094 0.050 0.158 0.007 0.094 0.000 0.000 0.604 0.650 0.632 0.787 0.750 0.242 0.500 0.292 0.300 0.158 0.206 0.156 0.629 0.265 0.010 0.000 0.000 0.000 0.000 0.113 0.235 Locus 10, 3 alleles 0.027 0.000 0.000 0.026 0.125 1.000 1.000 0.973 0.980 0.971 0.974 0.875 0.000 0.000 0.000 0.020 0.029 0.000 0.000 0.000 0.000 Locus 11, 2 alleles 1.000 0.929 0.971 1.000 1.000 0.985 1.000 0.000 0.071 0.029 0.000 0.000 0.015 0.000 Theta 74.01 13.87 2.09 3.44 0.13 0.13 26.22 2.31 4.03 0.12 0.12 2.59 4.98 0.12 0.12 14.81 0.10 0.10 0.13 0.13 6.58 GST 29.78 16.28 3.92 5.34 0.25 0.25 22.19 4.27 6.13 0.24 0.24 4.72 7.26 0.23 0.24 16.03 0.21 0.21 0.23 0.24 9.29 a = 1.840079, b = -0.660922 a = 1.764167, b = -0.563035 __________________________ 158.000 74.014 29.783 2.199 1.869 1.474 158.000 13.865 16.283 2.199 1.142 1.212 1215.000 2.085 3.921 3.085 0.319 0.593 1213.000 3.437 5.337 3.084 0.536 0.727 2300.000 0.126 0.248 3.362 -0.898 -0.606 2300.000 0.126 0.253 3.362 -0.900 -0.598 2.000 26.216 22.189 0.301 1.419 1.346 1057.000 2.313 4.266 3.024 0.364 0.630 1055.000 4.033 6.134 3.023 0.606 0.788 2525.000 0.122 0.237 3.402 -0.915 -0.626 2525.000 0.120 0.242 3.402 -0.920 -0.616 1057.000 2.586 4.722 3.024 0.413 0.674 1055.000 4.985 7.260 3.023 0.698 0.861 2525.000 0.117 0.230 3.402 -0.933 -0.638 2525.000 0.116 0.237 3.402 -0.935 -0.625 8.000 14.807 16.026 0.903 1.170 1.205 3582.000 0.100 0.208 3.554 -1.002 -0.683 3582.000 0.097 0.215 3.554 -1.013 -0.668 3582.000 0.125 0.230 3.554 -0.902 -0.639 3582.000 0.125 0.239 3.554 -0.902 -0.622 1.000 6.585 9.287 0.000 0.819 0.968