##############################################################
# ============================================================
# Lab 5: Phylogenetically Independent Contrasts using APE in R
# ============================================================
# by Nick Matzke and Nat Hallinan (and whoever else adds to this PhyloWiki page)
# Copyright 2011-infinity
# matzkeATberkeley.edu
# February 2011
#
# Please link/cite if you use this; email me if you have
# thoughts/improvements/corrections.
#
# Much of the text of this lab was derived from Nat
# Hallinan's 2009 lab for IB200b on PIC in R.
#
##############################################################
#
# Free to use/redistribute under:
# Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the above license, linked here:
#
# http://creativecommons.org/licenses/by-nc-sa/3.0/
#
# Summary:
#
# You are free:
#
#   * to Share -- to copy, distribute and transmit the work
#   * to Remix -- to adapt the work
#
# Under the following conditions:
#
#   * Attribution -- You must attribute the work in the manner
#     specified by the author or licensor (but not in any way that
#     suggests that they endorse you or your use of the work).
#
#   * Noncommercial -- You may not use this work for commercial purposes.
#
#   * Share Alike -- If you alter, transform, or build upon this work,
#     you may distribute the resulting work only under the same or
#     similar license to this one.
#
# http://creativecommons.org/licenses/by-nc-sa/3.0/
#
###################################################################
#
# Assignment for Thursday, Feb. 10
#
# See the questions under "Your Assignment" near the end of the lab.
#
# Purpose: Learn about the general strategies one uses to look at
# correlation between variables, and apply these to the special
# situation where the data are character data from species that
# are related by a phylogeny.
#
###################################################################

######################################################
# Stuff that is handy at the beginning of a script
######################################################

# Close all open graphics devices / windows / quartz windows
graphics.off()

# Turn off all the stringsAsFactors silliness
# (by default, R tends to read character strings in as factors, which is rarely what you want)
options(stringsAsFactors = FALSE)

######################################################
# Working directories:
# One of the first things you want to do, usually, is
# decide on your working directory

# On my Mac, this is the directory
wd = "/Users/nick/Desktop/_ib200a/ib200b_sp2011/lab04"

# On a PC, you might have to specify paths like this:
#wd = "c:\\Users\\nick\\Desktop\\_ib200a\\ib200b_sp2011\\lab03"

#desc: set working directory
setwd(wd)

#desc: get working directory
getwd()

#desc: list the files in the directory
list.files()

# Sourcing scripts
#
# Local sourcing:
# source a script (Windows machines require \\ instead of /)
#sourcedir = '/_njm/'
#source3 = '_genericR_v1.R'
#source(paste(sourcedir, source3, sep=""))

# You need the APE package to source the _genericR_v1.R scripts
# install.packages("ape")
library(ape)

# Remote sourcing:
source("http://ib.berkeley.edu/courses/ib200b/scripts/_genericR_v1.R")
source("http://ib.berkeley.edu/courses/ib200b/scripts/_R_tree_functions_v1.R")


#########################################################
# Lab 5: Independent Contrasts
#########################################################
#
# Today we're going to use R to derive and analyze independent contrasts for continuous
# data.
# First we're going to learn about some statistical models in R by analyzing the raw
# data from our last lab.  Then we'll generate some phylogenetically independent contrasts
# and see how they affect our analyses.  Finally, in the last section you will be given a
# second data set to analyze and explain.
#
# Several other programs can do PIC, including CAIC, PDAP and the PDAP:PDTree module in
# Mesquite.  We will stick with R for now, but you may want to explore some other options
# on your own.
#
# A word to the wise: I will use two abbreviations in this lab, "PC" (Principal Component)
# and "PIC" (Phylogenetically Independent Contrast).  These are very different things and
# you should be sure to keep them straight.
#

##########################################################
# Statistical Analysis in R
##########################################################
#
# First open your workspace from the last lab (or re-run that script to re-generate
# everything).  Type "ls()" to see all your objects.
#
# Statistical Distributions
#
# R has a number of statistical distributions automatically embedded in the software.
# These can be very useful for exploring and analyzing data.  To see a list of
# distributions look at the Intro to R documentation online.  I will use an example of the
# normal distribution, which we are all familiar with.  Later we will use the bivariate
# normal distribution and the binomial distribution.
#
# For any distribution you can add a prefix before the name of the distribution to create
# different functions:
#
# r --- (n, parameters) will return n random draws from the distribution.
# q --- (p, parameters) will return the value of the distribution that has the given
#       cumulative probability (p-value).
# p --- (q, parameters) will return the cumulative probability (p-value) for a given
#       value from the distribution.
# d --- (x, parameters) will return the probability density for a given value.
#
# For example, type:

#desc: generate 10 random numbers, uniformly distributed between 0 and 10
(random_numbers = runif(10, min=0, max=10))

#desc: show the probability density at the value 0 under
#desc: two different uniform distributions
dunif(0, min=0, max=1)
dunif(0, min=0, max=10)

#desc: show the CDF (cumulative distribution function, i.e. the probability of
#desc: getting that number or lower) for each random number, under these
#desc: uniform distributions
punif(random_numbers, min=0, max=1)
punif(random_numbers, min=0, max=10)

#desc: generate a sequence of probabilities from 0 to 1
(probabilities = seq(0, 1, 0.1))

#desc: the values corresponding to the input
#desc: probabilities under uniform distributions
qunif(probabilities, min=0, max=1)
qunif(probabilities, min=0, max=10)

# Let's compare our snout-vent length data to a normal distribution.  One assumption of
# linear regression is that the variables are normally distributed.
hist(Anole.ordered[,1])
mn = mean(Anole.ordered[,1])
stand = sd(Anole.ordered[,1])

# That looks pretty crappy.  It looks like it has tons of positive skew.  Let's see what
# the p-values look like for some of our biggest numbers:
quant = pnorm(sort(Anole.ordered[,1]), mn, stand)
quant[28:30]

# Considering that we only have 30 values here, the fact that 3 of them fall at or above
# the 98.7th percentile is more than a little suspicious.  (What percentile would you
# expect them to be in?)  The p-value of a data point can be thought of as the expected
# percentage of data points that should be less than that value.
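# (A quick aside, not part of the original lab: for any continuous distribution, the
# expected percentile of the k-th smallest of n sampled points is roughly k/(n+1), so
# the top three of our 30 points should sit near these percentiles:)
(28:30) / 31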
# Furthermore, the cumulative probabilities of the sampled data points represent the
# percentage of data points that should be less than each value under the actual
# distribution.  If our data really were normal, then our p-values should match up with
# the cumulative probabilities of our data points, which are distributed evenly between 0
# and 1.  Let's plot our p-values against some evenly distributed fractions and draw a
# straight line with a slope of 1 to see if they are the same:
plot((1:30)/31, quant)
abline(0,1)

# No, definitely not.
#
# On the flip side, let's see what value we expect at the 90th percentile:
qnorm(0.90, mn, stand)

# While we actually got:
sort(Anole.ordered[,1], decreasing=TRUE)[3]

# For the normal distribution you can actually make plots of these comparisons
# automatically using qqnorm, but you could make these plots yourself for any
# distribution.
#
# Finally let's look at what this distribution would look like if it had the same
# parameters, but was normal:
hist(rnorm(30, mn, stand))
hist(rnorm(3000, mn, stand))


########################################
# Parametric Correlation of Variables
########################################
#
# Now let's plot all our data against each other to see what it looks like:
pairs(Anole.ordered)

# As you can see, all these variables seem to be positively correlated.  This is not
# surprising, since they are all measures of body size.  For our initial analysis we are
# going to look at mass as a function of snout-vent length, so plot them against each
# other:
plot(Anole.ordered[,1], Anole.ordered[,2])

# To test for this correlation we will use the cor.test function, based on Pearson's
# product-moment correlation.
cor.test(Anole.ordered[,1], Anole.ordered[,2], alternative="g")

# alternative="g" makes the test one-tailed for a positive correlation between the
# variables.  We can make this test one-tailed because we can a priori assume a positive
# relationship between measures of size.  If you could not make this assumption, then you
# should keep the test two-tailed.
#
# A bunch of information will appear on the screen.  At the bottom you are given the
# calculated correlation (the covariance divided by the product of the standard
# deviations) and a 95% confidence interval.  Above that you see statistics testing the
# hypothesis that there is no correlation between these data.  The t-statistic is used
# with 28 degrees of freedom and the null hypothesis is completely rejected.
#
# Assumptions of Pearson's Correlation
#
# That looks great, right?  Low p-value, high R2?  No, it sucks ass.  The problem is that
# the data do not match the assumptions of the model; in particular, the data are not from
# a bivariate normal distribution.  Bivariate normality means not only that each variable
# is normally distributed, but that for every value of one variable the other is normally
# distributed.  First let's test the assumption that each variable is normally distributed.
qqnorm(Anole.ordered[,1])
qqline(Anole.ordered[,1])

# This is a plot of the quantiles of your data against the expected quantiles of a normal
# distribution.  qqline puts a line through the first and third quartiles.  If this
# distribution were close to normal, then all the dots would be close to the line.  As you
# can see they are not: the terminal values fall far from the line.  Repeat this for your
# mass as well.
#
# Since neither data set is normally distributed, you know that together they do not fit a
# bivariate normal distribution, but let's check anyway just for fun.
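# (An optional aside, not part of the original lab: base R's shapiro.test() gives a quick
# numerical test of univariate normality for each variable; a significant p-value means
# normality is rejected.)
shapiro.test(Anole.ordered[,1])
shapiro.test(Anole.ordered[,2])
# Now back to the bivariate check.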
# This is a little tricky, so I wrote a function that will plot the cumulative probability
# of our data against their p-values from the multivariate normal distribution.  The
# values from these distributions should match exactly, so we'll add a line of slope 1
# through the origin for comparison.  Load the mnormt package...
#install.packages("mnormt")
library(mnormt)

# ...then source the functions in "test.mvn.R" (written by previous GSI Nat Hallinan,
# I believe)...
# Remote sourcing:
source("http://ib.berkeley.edu/courses/ib200b/scripts/test.mvn.R")

# ...and then:
qqmvn(Anole.ordered[,1:2])
abline(0,1)

# Wow, that really sucks.


#################################
# Transforming Data
################################
#
# Luckily we do have some recourse: transforming the data.  Transforming data is a science
# and an art; we are not going to spend any time on it here.  We will just rely on the
# simple log-log transform that we already did.  Not only is this justified in that you
# may expect this type of data to be log-normal, it also makes sense because these data
# should pass through the origin, so a log-log transform will show a linear relationship
# between any two variables related by y = b*x^a, which is a more general form of a linear
# relationship passing through the origin.
#
# Let's check all our assumptions for the log-transformed data:
qqnorm(Anole.log[,1])
qqline(Anole.log[,1])
qqnorm(Anole.log[,2])
qqline(Anole.log[,2])
qqmvn(Anole.log[,1:2])
abline(0,1)

# Still not perfect, but overall pretty good, and a lot better than what we saw before.
# Let's test for this correlation:
cor.test(Anole.log[,1], Anole.log[,2], alternative="g")

# That looks great, still highly correlated.
#

##################################################
# Nonparametric Tests
#################################################
#
# So what do you do if you can't get your data to fit the assumptions of your model?  You
# use a nonparametric test, which does not rely on distributional assumptions.  These
# tests are more robust than parametric tests, but less powerful.  There are a nearly
# infinite number of nonparametric tests; which one you use depends on what hypothesis
# you're testing.  In this case we want to see if one variable is larger when the other
# variable is larger.  We're going to try two test statistics, Kendall's tau and
# Spearman's rho.  Both tests look to see if higher ranks of one variable are associated
# with higher or lower ranks of another.
#
# For the Kendall test:
cor.test(Anole.log[,1], Anole.log[,2], alternative="g", method="k")

# This gives you the test statistic (z), an estimated p-value and tau.  The p-value is
# one-tailed as before.  As you can see, the results are still highly significant,
# although the p-value is a little larger.  Check out this test for the data set that has
# not been log-transformed:
cor.test(Anole.ordered[,1], Anole.ordered[,2], alternative="g", method="k")

# The results are identical, because taking the log of a data set does not change the
# ranks of the data points.  If y > x, then ln(y) > ln(x).
#
# We can also do the Spearman test, which is very similar:
cor.test(Anole.log[,1], Anole.log[,2], alternative="g", method="s")

# Obviously these two variables are highly correlated.
#
# Testing Multiple Correlations at Once
#
# We are not limited to doing the tests for correlations between variables 2 at a time.
#
# The R package "picante" has a function cor.table, which can do every pairwise comparison
# at once.
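# (A point of comparison, not part of the original lab: base R's cor() already returns the
# matrix of pairwise correlation coefficients; what cor.table adds is a matching matrix of
# p-values.)
cor(Anole.log)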
install.packages("picante") library(picante) # ...then type: cor.table(Anole.log) # You will get two tables. The first table shows the Pearson's correlation coefficient for # each of the pairwise comparisons; the second shows the p-values for these correlations. # Unlike cor.test, cor.table can only do two-tailed tests, since our test is one tailed, we # can divide all these p-values by 2. As you can see, all these correlations are highly # significant. However, do they match the assumptions of multivariate normality? qqbvn(Anole.log) # This is another little function I wrote to test these assumptions. The little plots # along the diagonal are comparisons of the distribution of each variable to the normal # distribution; the other little plots to the top right are comparisons of each pair of # variables to their expectation under a bivariate normal distribution, just like the plot # produced by qqmvn. The big plot on the bottom left is a comparison of all the data # together to their expectation under a multivariate normal distribution. The numbers # represent p-values for the Shapiro-Wilks test of normality; significant values indicate # rejection of the normal distribution. Simulating data showed that both these tests may # be a little harsh, but these results clearly indicate that very little of this data # really fits a normal distribution. # # Luckily cor.table can do the same nonparametric tests as cor.test can. cor.table(Anole.log, cor.method="k") # and cor.table(Anole.log, cor.method="s") # Obviously, all these results are significant, no matter what test you use. ############################### # Principle Components Analysis ############################## # # When we originally looked at the data it was clear that all the characters were highly # correlated with size. What if we wanted to look at changes in theses quantitative # characters independently of that one big factor? Well, we can look at the principle # components of this same set of characters. Principle components are the same data points # but on a different set of axes. The originally set of axes that our characters were # measured in are rotated, so that the total variance is minimized. This new set of axes # separates out the parts of the individual characters that are linearly correlated with # each other into different components. The idea is that each of the new axes explain how # the data varies together, rather than leaving all the data separated as they were # measured. # # Why would you want to do this? Let's take as an example a data set with only two columns # hind leg length and fore leg length. One thing you could learn from a PCA is how # correlated these characters are. If they are totally uncorrelated (that would be a crazy # looking animal), then each of the PCs would explain 50% of the variation. One PC would # be fore leg length and the other hind leg. On the other hand, if they are highly # correlated, then the first PC would explain most of the variation in both fore and hind # limbs, and would be called something like limb length. The second PC would explain only # a little of the variance and would correspond to something like fore leg to hind leg # ratio. The idea is that measurements along the PCs are somehow more informative than # those along the original axes. You learn more talking about limb length and fore # leg/hind leg ratio than you do by talking about fore leg and hind leg length # independently. 
# For data sets with many variables, PCA can be quite useful for picking out highly
# informative groupings of variables.
PCs = prcomp(Anole.log)

# First let's look at some basic information about our new axes:
summary(PCs)
screeplot(PCs)

# This will show three rows describing the amount of variance explained by each PC.  Look
# at the Proportion of Variance row.  The plot shows the same thing.  As you can see, the
# first component explains 96.7% of the variance.  That's a lot, and we can assume that
# this component is essentially size.  The standard is to say that all the components that
# cumulatively explain 95% of the variance are significant, and the rest are just a bunch
# of noise.  In this case that is just the first component.  We could just as well set it
# at 99% or whatever other arbitrarily chosen number.
#
# Now let's look at the vectors these components define:
PCs$rotation

# The first component shows positive loadings for all of our variables; this is what we
# would expect for size.  The loading is particularly heavy for mass, as we might expect,
# since mass should scale as the cube of length in a species of uniform shape.  On the
# other hand the lamellae are relatively unaffected.  The second component is
# characterized mostly by an increase in tail length with a slight increase in the hind
# legs, and interestingly a decrease in mass and in the fore legs.  Thus this represents
# some shift in overall body shape.  To see how the different taxa line up on these new
# axes, type:
biplot(PCs)

# The x-axis is PC1, so taxa to your left are smaller.  The y-axis is PC2, so divergence
# along this axis shows divergence in this first shape parameter; taxa to the top have
# relatively long tails and hind legs, and relatively low mass and short fore legs for
# their size.  The vectors in the middle show how much each of our original variables
# contributes to each component.  As you can see, these two components are completely
# uncorrelated; that is a consequence of PCA and tells you nothing.
#
# To compare other PCs (like, say, the 2nd and 3rd) use the choices argument:
biplot(PCs, choices=c(2,3))

# If for some reason you want the values of each of your taxa for each of the PCs, you can
# find them in PCs$x.


################################################
# Phylogenetically Independent Contrasts
################################################
#
# Finally, we're going to get into the meat of this lab.  We've already discussed the
# theory behind PICs in lecture, so let's just get right down to it.  To calculate PICs
# for the logs of our traits on our tree:
pic(Anole.log[,1], Anole.ultra, var.contrasts=TRUE)

# This will return a matrix of 29 rows and 2 columns.  Each row refers to one internal
# node, and the name of the row gives the name of that node.  Column 1 shows the contrast
# across that node and column 2 shows its expected variance.  Let's make an array
# containing the contrasts for all our characters and all their variances:

# create a 29x2x6 array
Anole.PIC = array(dim=c(29, 2, 6))

# fill it in
for (i in 1:6) {
	Anole.PIC[ , , i] = pic(Anole.log[,i], Anole.ultra, var.contrasts=TRUE)
}

# "array" creates an array with the dimensions given.  "for (i in 1:6)" repeats the
# commands inside the braces once for each element of the vector 1:6, with i taking that
# element's value.  I bet you can figure out the rest.
#
# Confirming Assumptions of PIC
#
# So, we have our contrasts, but just like with our linear correlations, we need to make
# sure that our data fit the assumptions of the model.
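# (A quick eyeball check you can do yourself, as an aside: under Brownian motion the
# un-positivized contrasts for a trait should look roughly normal and centered on zero.)
hist(Anole.PIC[,1,1], main="PICs for trait 1")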
# Luckily David Ackerly has provided us with a function that makes this all very easy.
#
# First make sure the picante package is loaded.  Source the R script diagnostics_v3.R
# (originally written by David Ackerly; modified/improved slightly by Nick Matzke) and run
# it; if it doesn't work, try diagnostics2.R.
source("http://ib.berkeley.edu/courses/ib200b/scripts/diagnostics_v3.R")

# ...then type:
diagnostics(Anole.log[,1], Anole.ultra)

# Six plots will appear.  The two on the left test whether the positivized PICs fit a
# half-normal distribution centered at 0.  The top left is just a histogram of the PICs,
# and it should look like half of a normal distribution.  The plot on the bottom left is a
# QQ-plot of the PICs against a half-normal; the points should fall approximately on a
# line.  Both of those plots look good.
#
# You want the next three plots to show no significant relationship.  The top middle is
# the most important; it plots the standardized contrasts against their branch lengths.
# If there is a significant negative relationship here, then that implies your branch
# lengths are bad.  Because contrasts are standardized by dividing by branch lengths, long
# arbitrary branch lengths can make the contrasts too small.  This problem can be
# alleviated by replacing all your branch lengths with their logs.  There is a negative
# slope here; it's not significant, so we wouldn't normally worry about it, but just for
# fun let's log-convert the branches:
Anole.tree.log = Anole.ultra
Anole.tree.log$edge.length = log(Anole.ultra$edge.length + 1)
diagnostics(Anole.log[,1], Anole.tree.log)

# The p-value is now even higher for that relationship.  The plot on the top right shows
# the relationship between the contrasts and the age of the node; 0 is the time of the
# root and the present is to the right.  A positive relationship indicates extremely low
# signal and a negative one extremely high signal.  The plot in the bottom middle shows
# the relationship between the reconstructed value at a node and the contrasts; a positive
# relationship implies that the data should be log-transformed before the contrast is
# taken.  To see an example of this, compare the untransformed mass contrasts to the
# transformed ones:
diagnostics(Anole.ordered[,2], Anole.ultra)
diagnostics(Anole.log[,2], Anole.ultra)

# See, more justification for a log transformation.  The plot on the bottom right is of
# course the plot of k that we learned about during lecture on Thursday.  Values of k
# close to 1 show the expected amount of signal under Brownian Motion.  If you have a lot
# of taxa, you may want to suppress the calculation of k with "calcK=FALSE", as it can be
# cumbersome.
#
# Investigate the contrasts for our other traits.  Are you satisfied that they fit the
# assumptions?  Great, then let's move on.
#
# Parametric Significance of PICs
#
# The first thing that we want to do is test for a significant relationship between the
# PICs.  It is important to recognize the difference between the relationship between the
# raw data and the relationship between the PICs.  In the first case you are asking
# whether taxa with greater length have larger mass; in the second you are asking whether
# evolutionary increases in length correspond with increases in mass.
#
# As always, let's start by plotting all our contrasts against each other:
pairs(Anole.PIC[,1,])

# They all look very correlated.  First we'll make sure that our contrasts match the
# assumptions of Pearson's test.
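# (A preview, as an aside -- this is not the lab's cor.table approach, just the classic
# base-R equivalent: once the assumptions check out, the significance test for a single
# pair of PICs is a regression through the origin on the contrasts.  "pic.svl" and
# "pic.mass" are just names I'm making up here for the snout-vent length and mass
# contrasts.)
pic.svl  = Anole.PIC[,1,1]
pic.mass = Anole.PIC[,1,2]
summary(lm(pic.mass ~ pic.svl - 1))	# "- 1" forces the fit through the origin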
# We've already concluded that our data are normal from the half-normal plots in the tests
# of PIC assumptions.  So all we need to test is that our data fit a bivariate normal
# distribution.  Remember that PICs must have a mean of 0, so we're going to have to force
# the fit through the origin by setting the means ourselves:
qqbvn(Anole.PIC[,1,], means=rep(0,6))

# That looks pretty good.  Why would you expect the PICs to fit a multivariate normal
# distribution better than the raw data?  The p-values for the Shapiro-Wilk test are not
# displayed, because its assumptions are not met by PICs, so we're going to have to go off
# of fit alone.  I might be concerned about the fits of all the comparisons with the
# lamellae or the tail lengths, but the rest look great.
#
# Now we can look for a linear correlation.  We also have to force our fit through the
# origin here.  cor.test can't do that, but cor.table can, if we use the cor.type="c" (for
# contrast) argument:
cor.table(Anole.PIC[,1,], cor.type="c")

# As you can see, all these results are very significant.  Keep in mind that we made 15
# separate comparisons; even if no true correlations existed, with 15 comparisons you
# would expect to see at least one p-value below 0.05 far more often than 1 time in 20.
# As a general rule you should multiply your p-values by the number of comparisons you
# make.  The number of comparisons can easily be calculated as n(n-1)/2, where n is the
# number of variables that we have.  Let's do that to see the effects on our results.
results.cor = cor.table(Anole.PIC[,1,], cor.type="c")
results.cor$P * 15/2

# We divided by two, because these p-values are for a two-tailed test and our test is
# one-tailed.  Of course our results are still all significant.  They were so significant
# before that we wouldn't expect multiplying by 7.5 to have much of an effect.  For a
# quick view of which comparisons are significant, try:
(results.cor$P * 7.5) < 0.05

# Nonparametric Significance of PICs
#
# As with raw data, PICs may not always fit the assumptions of linear regression.  In that
# case you need to do a nonparametric test.  The first and most basic test is to ask
# whether one character increases when the other increases.  This is a very
# straightforward test.  The first thing to do is create a logical vector that is TRUE
# when neither PIC is zero, so that contrasts for which either character shows no change
# are excluded; then, for the remaining contrasts, we ask whether the two PICs have the
# same sign.
ref = (Anole.PIC[,1,1] != 0) & (Anole.PIC[,1,2] != 0)
both.pos = (Anole.PIC[ref,1,1] * Anole.PIC[ref,1,2]) > 0
length(both.pos)
sum(both.pos)

# For 26 out of 29 contrasts, both characters change in the same direction (either both
# increase or both decrease).  We can test the significance of this result by comparing it
# to the binomial distribution.  The binomial distribution describes the number of
# "successes" in a set of trials that each take one of two values; the probability of a
# success is one of the parameters of the model.  We will assume that there is a 50%
# chance of getting each result.  In that way we will try to reject the null hypothesis
# that it is equally likely for the contrasts to change in the same direction as in the
# opposite direction.
pbinom(3, 29, 0.5)

# We used 29-26=3, instead of 26, because this formula evaluates the probability of
# getting that many results or fewer, so for a one-tailed test of positive correlation, we
# want to know the probability of getting that many negative results or fewer.  As you can
# see this relationship is highly significant.
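# (An equivalent shortcut, as an aside: base R's binom.test() wraps the same calculation,
# using the counts from above -- 26 of 29 contrasts in the same direction -- and reports
# the one-tailed p-value directly.)
binom.test(26, 29, p=0.5, alternative="greater")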
#
# cor.table has also been programmed to do non-parametric rank correlation tests for
# independent contrasts.  As with the parametric tests, you must use the cor.type="c"
# argument:
cor.table(Anole.PIC[,1,], cor.method="k", cor.type="c")

# and
cor.table(Anole.PIC[,1,], cor.method="s", cor.type="c")

# Are all those results significant?  Don't forget to multiply by 7.5.  (Would you like to
# make a chart of all the possible comparisons between PICs using the binomial
# distribution?  If so, see the appendix.)
#
# PCA on PICs
#
# You can also generate PCs from PICs.  Why would you want to do that?  Why is that any
# better than doing it on the raw data?  The idea is that the PCs of the PICs represent
# the principal axes of evolutionary change -- in other words, the ways that things
# actually change when they evolve.  This can obviously be a very interesting bit of
# information.  We have to do one little trick to do PCA on PICs to force them through the
# origin.  We are going to create a mirror image of the PICs and stick that together with
# the other PICs.  This will double the amount of data, but since PCA does not do any
# statistical tests, degrees of freedom don't matter, so we'll be fine.
PIC.mirror = rbind(Anole.PIC[,1,], (Anole.PIC[,1,] * (-1)) )
PCs.PIC = prcomp(PIC.mirror)

# Now let's check them out:
summary(PCs.PIC)

# As you can see, the first PC still describes the vast majority of the variance, and it
# is probably change in size.  Let's look at the vectors:
PCs.PIC$rotation

# These look very much like the original PCs that we generated.  For some data sets there
# can be large differences.  When would you expect to see such a difference?
#
# Finally, let's plot them:
biplot(PCs.PIC)

# or
biplot(PCs.PIC, choices=c(2,3))

# You will see your contrasts spread out over the PCs.  This tells you how much change of
# each type happened along each contrast.  Remember you mirrored your contrasts, so each
# one will appear twice, on opposite sides of the plot.
#
# Your Assignment
#
# Here is your assignment for this week.  I'm really going to grade this one, not just
# mark off that you did it, so do a good job.  Turn it in to me on paper in class next
# Tuesday, the 24th of February.  Please try to be concise.
#
# You should already have downloaded the nexus file PIC.assignment from the web site.
# This file has a tree with 49 taxa and a data matrix with 3 continuous characters.  This
# is a modified version of a file from the Mesquite examples.
#
# 1. Are these characters correlated without considering the phylogeny?  What are the
# p-values of the correlations?  Is the correlation positive or negative?  What
# statistical test did you use?  Why did you choose that test?  Was the test two-tailed or
# one-tailed?  Why?
#
# 2. Are these characters correlated when you do consider the phylogeny?  Does the data
# fit the assumptions of independent contrasts?  What are the p-values of the
# correlations?  Is the correlation positive or negative?  What statistical test did you
# use?  Why did you choose that test?  Was the test two-tailed or one-tailed?
#
# 3. Why is there a difference in result between these two tests for significance?  Be
# specific; you should both look over the contrast plots and plot reconstructions of all
# these characters on a tree (in R or Mesquite; a sketch for doing this in R follows
# below).  Does this say anything interesting about the evolution of these taxa?
#
# You should always ask yourself these questions for any analysis involving PICs.
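# (A sketch for plotting reconstructions in R, using the Anole objects from this lab as a
# stand-in -- for the assignment you would substitute the PIC.assignment tree and
# characters: ape's ace() estimates ancestral states for a continuous character, which you
# can then write onto the nodes of a tree plot.)
anc = ace(Anole.log[,1], Anole.ultra, type="continuous")
plot(Anole.ultra, cex=0.7)
nodelabels(round(anc$ace, 2), cex=0.6)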
#######################################################################
# Appendix: A Brief Digression on Writing Functions
#######################################################################
#
# Up until now I've avoided the subject of writing functions.  Let's write a function
# that will produce a table of results from a two-tailed sign test for every possible
# comparison between traits.  I just need to explain a few things first.
#
# In R, functions are treated as objects, just like variables.  When you write a function,
# you write "function" followed by parentheses, which contain the possible input
# parameters.  You can define defaults for those parameters at this time.  You also assign
# that function to a named object like any other object in R.  The function will then
# appear in your list of objects.
#
# "for (num in vector)" will run the commands that follow it for every element in the
# vector, replacing num with that element.
#
# "if" is followed by some conditional statement in parentheses.  The subsequent commands
# are only carried out if the conditional statement is true.  If it is not true, the
# commands following "else" are executed.
#
# All of these execute a series of commands that appear after them; by after I mean either
# on the same line, or surrounded by "{}".

# Create a function object named sign.test
# with one input parameter, x
sign.test = function(x) {
	# How many columns in x
	n = dim(x)[2]
	# Create two n by n matrices
	counts = ps = array(0, dim=c(n,n))
	# logical matrix indicating which changes are NOT 0
	not0 = x != 0
	# start a loop that will run variable i from 1 to n-1
	for (i in 1:(n-1)) {
		for (j in (i+1):n) {
			# keep only the rows where neither contrast is 0
			ref = not0[,i] & not0[,j]
			both.pos = (x[ref,i] * x[ref,j]) > 0
			# The number of positive results
			pos = sum(both.pos)
			# Add those above the diagonal
			counts[i,j] = pos
			# The number of negative results
			neg = length(both.pos) - pos
			# Add those below the diagonal
			counts[j,i] = neg
			# For a 2-tailed test, we want either a positive or a negative correlation
			times = min(pos, neg)
			# do the test; multiply by 2 because it is two-tailed
			ps[i,j] = 2 * pbinom(times, pos+neg, 0.5)
			# Below the diagonal matches the top
			ps[j,i] = ps[i,j]
		}
	}
	# return your results as a list
	list_to_return = list(counts=counts, ps=ps)
	return(list_to_return)
}

# Now we can run the test on all our data:
sign.test(Anole.PIC[,1,])

# There, don't you feel accomplished?  The first table shows the number of positive
# results above the diagonal and the number of negative results below.  The second shows
# the p-values.  As you can see, many of our correlations are significant, but not all.
# Don't forget to multiply by 7.5.  To see exactly which are significant:
(sign.test(Anole.PIC[,1,])$ps * 7.5) < 0.05

# How does that compare to the other non-parametric tests?