Because biological organization is fundamentally based on a bifurcating descent-with-modification process, solving the problem of phylogenetic estimation is important to a wide variety of basic and applied biological problems. Recently, the introduction of molecular techniques has made a tremendous amount of data available for phylogenetic analysis, and several NSF initiatives have been launched to elucidate the phylogenetic tree of all life. In this area we are concentrating on the following problems:
Investigating the statistical properties of phylogenetic estimators
We are interested in understanding the large-sample and finite-sample behavior of phylogenetic estimators, especially how they scale with problem size (the number of taxa) and data size (the number of characters). We have developed several geometrical analysis techniques based on the idea of an embedded joint probability space. We are also working on developing new estimation principles. For example, we have pioneered the use of computational algebraic geometry techniques for estimating evolutionary trees. These techniques can extract invariant functions of the joint probability distribution of character states, and those invariants can serve as efficient direct estimators of phylogeny. More recently we have been experimenting with penalized likelihood methods to connect several distinct classes of phylogenetic estimators into a single family.
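As a rough illustration of the invariant idea (a generic textbook-style sketch, not the specific estimators developed in this work), consider four taxa and a two-state symmetric model. For the true split 12|34, the joint probability tensor, flattened into a 4x4 matrix with rows indexed by the states of taxa 1 and 2 and columns by the states of taxa 3 and 4, has rank at most 2, so all of its 3x3 minors vanish. The sketch below scores each of the three possible splits by how far its flattening is from rank 2. The model, the parameter values, and the names `quartet_joint`, `flattening_score`, and `best_split` are assumptions made for this example.

```python
import itertools
import numpy as np

def transition(p):
    """Two-state symmetric transition matrix with change probability p."""
    return np.array([[1.0 - p, p], [p, 1.0 - p]])

def quartet_joint(p_leaf, p_internal):
    """Joint distribution P[a, b, c, d] of leaf states on the tree ((1,2),(3,4))."""
    root = np.array([0.5, 0.5])
    M = [transition(p) for p in p_leaf]
    Mi = transition(p_internal)
    P = np.zeros((2, 2, 2, 2))
    # Sum over the states (x, y) of the two internal nodes.
    for x, y in itertools.product(range(2), repeat=2):
        weight = root[x] * Mi[x, y]
        P += weight * np.einsum("a,b,c,d->abcd", M[0][x], M[1][x], M[2][y], M[3][y])
    return P

# Axis orders that place each candidate split's two sides on rows and columns.
SPLITS = {"12|34": (0, 1, 2, 3), "13|24": (0, 2, 1, 3), "14|23": (0, 3, 1, 2)}

def flattening_score(P, order):
    """Distance of the 4x4 flattening from the set of rank-2 matrices."""
    F = np.transpose(P, order).reshape(4, 4)
    s = np.linalg.svd(F, compute_uv=False)
    return float(np.sqrt(np.sum(s[2:] ** 2)))

def best_split(P):
    """Return the split with the smallest score, plus all three scores."""
    scores = {name: flattening_score(P, order) for name, order in SPLITS.items()}
    return min(scores, key=scores.get), scores

# Exact distribution, and an empirical one from simulated site patterns.
exact = quartet_joint([0.1, 0.1, 0.1, 0.1], 0.1)
rng = np.random.default_rng(0)
counts = rng.multinomial(100_000, exact.ravel()).reshape(2, 2, 2, 2)
empirical = counts / counts.sum()

print(best_split(exact))
print(best_split(empirical))
```

With the exact distribution, the score of the true split is zero up to rounding error, while the two incorrect splits score strictly higher. With finite data the score of the true split is only approximately zero, which is where the finite-sample behavior discussed above becomes relevant.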
Establishing a simulated data set for validating phylogenetic estimators
One of the fundamental obstacles in developing phylogenetic estimation algorithms is the lack of an agreed standard for performance evaluation. Typical evaluation studies are ad hoc and use procedures that are difficult to replicate. We are part of an NSF-funded project called the National Resource for Phyloinformatics and Computational Phylogenomics. As part of this group, we are developing a suite of simulated data sets using a range of models of evolution, from simple models of mutational change to whole-genome evolution. We are also developing a suite of statistical performance evaluation tools. Together, the simulated data sets and the evaluation methods are intended to support a unified assessment of phylogenetic estimation algorithms.
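At the simple end of that range of models, a simulated data set can be produced by evolving a random root sequence down a known tree. The following is a minimal sketch under the Jukes-Cantor (JC69) substitution model, with a fixed random seed so that the data can be regenerated exactly. The tree encoding, the branch lengths, and the function names are assumptions chosen for this illustration and are not taken from the project's actual tools.

```python
import math
import random

BASES = "ACGT"

def jc69_evolve(seq, t, rng):
    """Evolve seq along a branch of length t (expected substitutions per site)."""
    # Under JC69, a site differs from its ancestor with this probability.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # Each of the three other bases is equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(tree, seq, rng, leaves=None):
    """tree is a leaf name (str) or a list of (subtree, branch_length) pairs."""
    if leaves is None:
        leaves = {}
    if isinstance(tree, str):
        leaves[tree] = seq
        return leaves
    for child, t in tree:
        simulate(child, jc69_evolve(seq, t, rng), rng, leaves)
    return leaves

# Hypothetical four-taxon tree ((A,B),(C,D)) with illustrative branch lengths.
tree = [([("A", 0.1), ("B", 0.1)], 0.05), ([("C", 0.1), ("D", 0.1)], 0.05)]

rng = random.Random(42)  # fixed seed makes the data set reproducible
root_seq = "".join(rng.choice(BASES) for _ in range(200))
alignment = simulate(tree, root_seq, rng)

for name in sorted(alignment):
    print(name, alignment[name][:40])
```

Because the generating tree is known, any estimator run on `alignment` can be scored against the truth, which is the property that makes simulated data useful for replicable evaluation.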
Mining the existing database for phylogenetic information
Current databases such as GenBank contain a wealth of information for phylogenetic estimation. For example, in 2002, GenBank contained information from ~125,000 organisms. However, this data has been collected in a haphazard manner, which raises the interesting problem of how to use it most efficiently. The computational problems here include obtaining maximal subsets of informative data, using fragmented data sets to generate a "super" tree that maximally covers the available organisms, and identifying bottlenecks in the available data. We are part of an NSF-funded "Assembling the Tree of Life" project that will address these computational problems.
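One way to make the "maximal subsets of informative data" problem concrete is to view the database as a taxon-by-gene coverage table and to look for the largest block in which every selected taxon has every selected gene. The sketch below does this by exhaustive search on a toy table. It is only an illustration of the problem statement: the gene and taxon labels are made up, the function name `best_complete_block` is hypothetical, and brute force of this kind is feasible only for very small inputs.

```python
from itertools import combinations

def best_complete_block(coverage, min_genes=1):
    """Find the gene subset whose shared taxa give the largest complete block.

    coverage maps each gene name to the set of taxa sequenced for that gene.
    Returns (area, genes, taxa), where area = len(genes) * len(taxa).
    """
    genes = sorted(coverage)
    best = (0, (), frozenset())
    for k in range(min_genes, len(genes) + 1):
        for combo in combinations(genes, k):
            # Taxa that have data for every gene in this combination.
            taxa = frozenset.intersection(*(frozenset(coverage[g]) for g in combo))
            area = k * len(taxa)
            if area > best[0]:
                best = (area, combo, taxa)
    return best

# Toy coverage table with made-up labels.
coverage = {
    "g1": {"t1", "t2", "t3", "t4"},
    "g2": {"t1", "t2"},
    "g3": {"t2", "t3", "t4", "t5"},
    "g4": {"t1", "t5"},
}

area, genes, taxa = best_complete_block(coverage)
print(area, genes, sorted(taxa))
```

On this toy table the largest complete block uses genes `g1` and `g3` with taxa `t2`, `t3`, and `t4`. Data sets on the scale of GenBank would need heuristics or approximation in place of enumeration.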