JOHN HUELSENBECK

Huelsenbeck's Lab: http://ib.berkeley.edu/people/lab_detail.php?lab=54

What is the probability that a Berkeley undergraduate who majored in Integrative Biology would return to be a professor in the same department? The probability is higher than you might think. IB has three faculty who were former undergraduates, out of a total of 37 faculty, but only one of those three alumni/professors is as well versed in probability as Dr. John Huelsenbeck. Probably.

Huelsenbeck is a computational and evolutionary biologist. He is interested in how to reconstruct the phylogenetic history of life--the phylogeny problem--by comparing DNA sequences sampled from different species. The result of a phylogenetic analysis is a tree, representing the relationships of the species. Much of his research concerns a realm of statistics called Bayesian inference, which allows scientists to account for uncertainty in their analyses.

But his interests don't stop here. "I am also interested in what you can do with the phylogenies once you have them, what types of evolutionary questions you can address with phylogenies."

When Huelsenbeck was an undergraduate at Berkeley, he became interested in paleontology. He took courses from Carol Hickman and did field work with David Lindberg, professors who are now his colleagues.

As a graduate student in paleontology at the University of Texas, Austin, Huelsenbeck was trying to figure out whether one should include fossils in a phylogenetic analysis. "Fossils pose a problem for phylogenetic analysis. A fossil is incomplete compared to living organisms where one can sequence the genome or compare the soft tissue to other species. On the other hand, fossils may be closer in form to the ancestor," and so the fossil may provide important information not available in any living species. The question was simple: Does inclusion of a fossil in a phylogenetic analysis help resolve phylogeny, despite the relative incompleteness of that fossil? Huelsenbeck did computer simulations to explore this problem, and it changed the trajectory of his research: "I really enjoyed coding and programming and addressing questions from a theoretical viewpoint."

Huelsenbeck has continued to work on the phylogeny problem. Phylogenies are usually built using the DNA sequences of several organisms. However, one dataset can produce many possible evolutionary trees, depending on the assumptions the researcher makes about the process of evolution. Each tree has a probability of representing the true relationships, given the data that is available. Huelsenbeck became interested in comparing the probabilities of different phylogenies, using a statistical method called Bayesian inference.

Bayesian inference is named for a theorem first introduced by Thomas Bayes, a minister and mathematician living in 18th-century England. Bayes' Theorem describes how one can update beliefs about a hypothesis in the light of new data. "I think of Bayesian inference as a model for how science works, or should work," says Huelsenbeck. "Scientists start off with some set of beliefs about the world, that they test through experiment. A Bayesian asserts that those beliefs about the world can be expressed as probabilities." The scientist then makes some experimental observations. In light of those new observations, says Huelsenbeck, "If you're a scientist you should modify those beliefs in some way. Bayes' Theorem tells you how you should modify those beliefs, how you should change your probabilities about different hypotheses."

Huelsenbeck explains this using an example. Imagine a newborn baby, "a very logical baby well-versed in probability," he says, who after observing the sun setting and rising is concerned about whether this cycle will repeat. This genius baby also has access to a lifetime supply of black and white marbles. Thinking both events--a repetition of the sunset/sunrise cycle and the grim alternative--are equally likely, he puts both a black and a white marble in his bag. The black marble represents a repetition of the cycle, whereas the white marble represents a continual day (or night). Every day thereafter, after having observed another sunset/sunrise cycle, he puts a black marble into the bag, representing another repetition of sunset/sunrise. "After a lifetime of experience, and having witnessed thousands of sunsets, the baby's bag is going to have many, many black marbles in it, representing all of the experimental evidence he has accumulated. The single white marble represents the baby's initial uncertainty."
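
For readers who like to see the arithmetic, the baby's bookkeeping can be sketched in a few lines of Python. This is only an illustration of the marble story, not code from Huelsenbeck's work; the function name `probability_of_sunrise` is made up for the example. The bag starts with one black and one white marble, each observed cycle adds a black marble, and the fraction of black marbles is read as the baby's current probability that the cycle will repeat.

```python
from fractions import Fraction


def probability_of_sunrise(cycles_observed: int) -> Fraction:
    """Fraction of black marbles in the bag after `cycles_observed` sunset/sunrise cycles.

    Hypothetical helper for illustration: the bag starts with one black and
    one white marble, and every observed cycle adds one black marble.
    """
    if cycles_observed < 0:
        raise ValueError("cycles_observed must be non-negative")
    black = 1 + cycles_observed  # initial black marble plus one per observed cycle
    white = 1                    # the single marble of initial uncertainty
    return Fraction(black, black + white)


# The belief starts at 1/2 and climbs toward 1 without ever reaching it.
for days in (0, 1, 10, 10_000):
    p = probability_of_sunrise(days)
    print(f"after {days} cycles: {p} (about {float(p):.4f})")
```

The single white marble never leaves the bag, so the probability approaches 1 but never gets there, which matches the baby's lingering initial uncertainty.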

But what do all these marbles have to do with evolutionary trees? Basically, you want to calculate the probability that a particular tree is correct, conditional on your data--your bag of marbles, or in a phylogenetic analysis, a collection of DNA sequences. Bayesian inference allows you to compare each tree to all the other possible trees that could be created using your data. It is actually even more complicated--each tree is created using a model of evolution, and each model has a lot of parameters associated with it. And, says Huelsenbeck, "the most reasonable way of dealing with the large number of parameters is in a Bayesian framework."

Ultimately, Bayesian inference allows you to determine the probability that a given tree is the correct tree, which lets you compare possible evolutionary histories and pick the one that most likely happened.
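
The comparison of trees can be sketched for a toy case. The example below is a hedged illustration, not a description of how MrBayes is implemented: the three tree labels, the log-likelihood values, and the function name `posterior_probabilities` are all hypothetical, and the likelihood of the data under each tree is simply assumed to be known. With only three candidate trees, Bayes' Theorem can be applied by direct enumeration: multiply each tree's prior probability by the likelihood of the data under that tree, then rescale so the results sum to one.

```python
import math


def posterior_probabilities(log_likelihoods, priors):
    """Apply Bayes' Theorem over a small, finite set of candidate trees.

    log_likelihoods: tree label -> log P(data | tree)   (assumed known here)
    priors:          tree label -> P(tree)
    Returns:         tree label -> P(tree | data)
    """
    # Work in log space: log P(data | tree) + log P(tree).
    log_joint = {t: log_likelihoods[t] + math.log(priors[t]) for t in log_likelihoods}
    # Subtract the largest value before exponentiating to avoid underflow.
    biggest = max(log_joint.values())
    weights = {t: math.exp(v - biggest) for t, v in log_joint.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Hypothetical numbers for three possible trees relating species A, B and C.
log_likelihoods = {"((A,B),C)": -1000.0, "((A,C),B)": -1002.0, "((B,C),A)": -1005.0}
priors = {tree: 1 / 3 for tree in log_likelihoods}  # each tree equally likely at the start

posterior = posterior_probabilities(log_likelihoods, priors)
best = max(posterior, key=posterior.get)
for tree, p in posterior.items():
    print(f"{tree}: {p:.3f}")
print("most probable tree:", best)
```

Enumeration like this only works for a handful of trees. The number of possible trees grows very quickly with the number of species, and a real analysis must also account for the model parameters mentioned above, which this sketch leaves out.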

In 2000, Huelsenbeck wrote a computer program that uses Bayesian inference to compare phylogenetic trees, in order to figure out which tree has the highest probability of being correct. He wrote the program for his own research, and he named it MrBayes, he explains, as a "lame, inside joke to myself." He shared this code with people who wanted to use it for their research--and in their publications they cited it as MrBayes. In 2001, Huelsenbeck gave the program a proper user interface. "After that, people really started using it. It now has over 6,000 citations."

So if you hear people refer to Dr. Huelsenbeck as Mr. Bayes, it's not because he has a dangerous alter ego, like Dr. Jekyll and Mr. Hyde. It's because Huelsenbeck wrote what is now a hugely popular program for phylogenetic analyses using Bayesian inference.

Recently, Huelsenbeck has turned his attention to the problem of alignment uncertainty. To make evolutionary trees, phylogeneticists start out with DNA sequences from each of the species of interest. These sequences need to be aligned--stacked on top of each other. Then their evolutionary relationships are determined, based on the similarities and differences in the sequences. The sequences never align perfectly; after all, it is the differences that allow scientists to reconstruct evolutionary relationships.

Usually, scientists intentionally choose sections of the genome that are easy to align. But it is becoming more and more common to build phylogenies based on whole genomes. In this case, scientists shouldn't pick and choose which parts they use, since all parts of the genome can provide important information.

There are many different computer programs that will align DNA sequences, and these programs use different methods. Huelsenbeck and his colleagues did a study to see if using different alignment methods produces different alignments, and thus different phylogenetic trees. Indeed it does. "There is a great deal of uncertainty in the alignment, and your results can change depending on what alignment method you use." Huelsenbeck thinks that scientists should keep this in mind: "We suggest that they treat alignments as a random variable, as something that is uncertain, and that they accommodate the uncertainty in the phylogenetic analysis."

Huelsenbeck uses a Bayesian framework to examine uncertainty in other aspects of evolution. In one of his projects, he looks at uncertainty in selecting a model of evolution. There are several models of evolution, which describe the likelihood of different types of changes occurring along a DNA sequence. Usually, scientists will use one model of evolution in their analysis, though they're not necessarily sure if it's the correct model for each gene (one gene might follow one model, while the gene next door follows another). Huelsenbeck urges people to use a class of models, instead of just one--so the model itself has some uncertainty, which is factored into the analysis.

While Huelsenbeck's work deals with a lot of uncertainty, there is one thing he's sure of, now that he's back in Berkeley: "I'm a Cal fan."

Courses:

Huelsenbeck will teach a course in statistical phylogenetics for graduate students, IB 206. He also teaches an introductory programming course, designed for biologists. In a few years, he will teach the evolution section of Introductory Biology (Bio 1B), which, says Huelsenbeck, "would be great, because I took that course as an undergraduate. I've got some respect for it."