Linear regression is a least-squares method for examining data for trends. As the name implies, it is used to find “linear” relationships. To begin our discussion, let’s turn back to the “sum of squares”:
$$SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2,$$ where each $x_i$ is a data point for variable $x$, with a total of $n$ data points.
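As a quick illustration, here is a minimal sketch in plain Python of this definition, using a small invented dataset rather than the assignment data (the variable names are my own):

```python
# Sum of squares: add up the squared deviations of each point from the mean.
# These example data are made up purely for illustration.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(x)
x_bar = sum(x) / n                          # mean of x
ss_x = sum((xi - x_bar) ** 2 for xi in x)   # sum of squares

print(ss_x)  # 32.0 for this example
```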
The sum of squares is used in a variety of ways. One common way is in the calculation of variance, which is a measure of the variability in a dataset (though a different term also called variance is used to measure uncertainty – we will not be dealing with this other term here):
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$
Note that this equation looks very similar to the equation for the mean, given as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
The only major difference is that instead of taking the mean of the data points, we’re taking the mean of the squared distances between the data points and their mean. (Note: the n − 1 in the denominator of the variance corrects for a bias in estimation that arises when working with samples; its explanation is beyond the scope of this course.) Please note, also, that the standard deviation of a dataset is just the square root of the variance.
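A minimal sketch of these quantities in Python, again with invented data; note the n − 1 denominator for the sample variance:

```python
import math

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up example data
n = len(x)

x_bar = sum(x) / n                          # mean: divide the sum by n
ss_x = sum((xi - x_bar) ** 2 for xi in x)   # sum of squares
var_x = ss_x / (n - 1)                      # sample variance (n - 1 denominator)
sd_x = math.sqrt(var_x)                     # standard deviation = sqrt(variance)

print(x_bar, var_x, sd_x)  # 5.0, ~4.571, ~2.138
```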
Another common way that the sum of squares is used is in the calculation of covariance. The covariance is calculated from a form of the sum of squares called the “sum of cross-products”. This term is calculated using two variables, and is given as the following:
$$SP_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$
This term is very similar to the sum of squares, but it looks at the relationship between two variables. We use this term when we have data pairs, or coordinates $((x_1, y_1), (x_2, y_2), \ldots)$, and wish to examine whether a relationship exists between the two. We take the x term in one set of coordinates, subtract the mean of x from it, and multiply the difference by the analogous difference in the y term. Since there are n coordinates, we sum the n cross-products together.
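A minimal sketch of the sum of cross-products for a handful of invented (x, y) pairs:

```python
# Paired data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum of cross-products: pair up each deviation in x with the
# corresponding deviation in y, multiply, and add them all up.
sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

print(sp_xy)  # 6.0 for this example
```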
The covariance is just the mean cross-product, given as:
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}.$$
This term is exactly analogous to the variance term, only it deals with two variables and their relationship, rather than just one variable. One other interesting difference is that the covariance term may be negative.
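Continuing the same invented pairs, the covariance is just the sum of cross-products divided by n − 1:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # same invented pairs as above
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
cov_xy = sp_xy / (n - 1)  # covariance: the "mean" cross-product

print(cov_xy)  # 1.5 for this example
```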
One major property of sum of squares and sum of cross-products that is useful to biologists is inherent in the name. In both cases, we are dealing with sums. In other words, variance and covariance may be treated as additive. Hence, we can divide up the variance by meaningful biological categories, such as variance due to inheritance, environment, etc.
Back to Regression
To ascertain whether there is a linear, dependent relationship between two variables, we first need a data set to work from. In this data set, data are paired; that is, one value of x is associated with one value of y. One variable must be independent, while the other must be directly or indirectly dependent on it (though this dependence can be tested with regression). As an example, we may work with the heights of parents and their children. In this case, children’s heights depend on their parents’ heights, but not vice versa. So parental height is the independent variable, x, while offspring height is the dependent variable, y.
Next, we need to estimate a line. A line is given by the following equation:
$$y = a + bx.$$
Here, a is the y-intercept and b is the slope. To calculate b, we use the following equation:
$$b = \frac{\mathrm{cov}(x, y)}{s_x^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
Remember that the slope is the change in the dependent variable divided by the change in the independent variable. Here, we are dealing with variation in data rather than absolute differences between coordinates. That is because we cannot expect nature to produce perfect linear relationships for us, nor can we measure variables so accurately that no error enters the data set. Hence, we define the slope as the covariance of x and y relative to the variance of the independent variable: the portion of the variation in x that covaries with y. This is the slope of the linear relationship hidden within the cloud of data we have. As it happens, if our data actually look like a circular cloud when we plot them, our slope will be approximately 0, suggesting no linear relationship between variation in x and y.
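A minimal sketch of the slope calculation for the same invented pairs; the slope is the sum of cross-products over the sum of squares of x (equivalently, the covariance over the variance of x, since the n − 1 terms cancel):

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # invented example pairs
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)

b = sp_xy / ss_x  # slope: covariance of x and y divided by variance of x
print(b)          # 0.6 for this example
```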
To solve for the y-intercept, we use the following equation:
$$a = \bar{y} - b\bar{x}.$$
Easy as cheese.
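And the intercept, continuing the same sketch with the same invented pairs:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # same invented example pairs as above
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)

b = sp_xy / ss_x        # slope, as above
a = y_bar - b * x_bar   # intercept: the fitted line passes through (x_bar, y_bar)

print(a, b)  # 2.2 and 0.6 for this example
```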
Next, we are interested in determining the proportion of the variability in y that is explained by variability in x. This is called the coefficient of determination, $r^2$, and it is different from calculating the slope. Rather, it measures how closely the data cluster around the predicted line, assuming that there is a relationship to begin with (i.e., the slope is not 0). To do this, we calculate:
$$r^2 = \frac{\mathrm{cov}(x, y)^2}{s_x^2 \, s_y^2} = \frac{\left[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}.$$
If this term equals 1, then y is completely predictable from x, and all of the variation we see in the dependent variable is explained by associated variability in the independent variable. For an ecologist, this situation would be a dream come true! If this term is 0, then none of the variation in y is explainable by x.
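A minimal sketch of $r^2$ for the same invented pairs; it is the squared sum of cross-products divided by the product of the two sums of squares:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # invented example pairs
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)

r2 = sp_xy ** 2 / (ss_x * ss_y)  # coefficient of determination
print(r2)  # 0.6 for this example
```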
Okay. That’s enough of my babbling for now. Here’s an assignment:
Parent Ht | Child Ht
166 | 157
183 | 174
178 | 168
180 | 181
168 | 169
171 | 176
173 | 165
178 | 169
172 | 179
177 | 174
Parent Ht | Child Ht
64.9 | 61.36
71.31 | 68.14
69.46 | 65.43
70.31 | 70.75
65.48 | 66.02
66.63 | 68.71
67.63 | 64.46
69.68 | 66.04
67.13 | 70.01
68.97 | 67.84
72.89 | 71.96
71.62 | 74.05
71.99 | 71.76
70.73 | 66.43
66.86 | 69.71
64.37 | 64.18
69.16 | 74.58
68.84 | 64.63
68.02 | 63.96
68.93 | 63.91
68.53 | 64
67.04 | 71.12
69.92 | 66.64
69.28 | 66.49
69.91 | 66.36
68.43 | 70.54
68.23 | 69.93
68.52 | 69.92
68.58 | 69.49
68.08 | 69.44
Note: The data used for these problems actually come from Francis Galton’s data on human height. Be careful that you don’t make the same misinterpretations that he did!