Understanding the effects of taxon sampling on phylogenetic estimation is a problem that grows more urgent as our ability to gather and analyze large-scale datasets expands. Recent attention to this question has led to several studies examining how phylogenetic estimators behave when taxa are added to or deleted from datasets, yielding the puzzling finding that estimator performance improves sometimes, but not always, as taxa are added. Results from Kim (1998) hint at an explanation for this phenomenon, showing that for some kinds of clade structure, more intensive taxon sampling can lead to better phylogenetic estimates.
Accordingly, there has been a call for broader systematic studies of how scale affects the performance of parsimony and other frequently used phylogenetic estimators, across a range of models, rates of character evolution, and tree structures. A full systematic survey becomes possible with the novel approach of finding extreme bounds on the performance of phylogenetic estimation as taxa are added. We have proposed to provide such a study by calculating performance indicators for phylogenies estimated across a variety of model trees and sampling strategies, and by delineating a profile of those clades and datasets for which estimator performance improves with increased sampling.
The project aims to answer the following questions:
- What kinds of clade structure lead estimators to perform better with more taxon sampling?
- Can we detect the presence of such structures from the data?
- Given the phylogenetic structure of some large clade and a phylogenetic estimation problem involving a fixed subset of this clade, do phylogenetic estimators perform better or worse when additional taxa are sampled from the clade, the tree is estimated from the enlarged sample, and the estimate is then pruned back to the original subset?
- Given the phylogenetic structure of some large clade and a phylogenetic estimation problem involving a fixed subset of this clade, is there an optimal way to sample additional taxa so as to improve our estimate?
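The "estimate on more taxa, then prune back" comparison in the third question can be made concrete with a small sketch. Everything here is an assumption made for illustration: trees are written as nested Python tuples, the two "estimates" are hand-written toy trees rather than the output of any estimator, and the score (the number of clades present in exactly one of two rooted trees) is a simple stand-in, not necessarily the performance indicator used in the project.

```python
from typing import Optional, Union

# A rooted tree is a leaf name (str) or a tuple of subtrees (assumed encoding).
Tree = Union[str, tuple]


def prune(tree: Tree, keep: set) -> Optional[Tree]:
    """Restrict a rooted tree to the taxa in `keep`, collapsing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None          # no retained taxa below this node
    if len(children) == 1:
        return children[0]   # a single surviving child replaces its parent
    return tuple(children)


def clades(tree: Tree) -> set:
    """Return the leaf sets of all internal nodes of a rooted tree."""
    found = set()

    def walk(node: Tree) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def clade_distance(a: Tree, b: Tree) -> int:
    """Count clades that occur in exactly one of the two rooted trees."""
    return len(clades(a) ^ clades(b))


# Hypothetical toy data: a model tree on six taxa and a focal subset of four.
focal = {"A", "B", "C", "D"}
model_tree = ((("A", "X"), "B"), ("C", ("D", "Y")))
reference = prune(model_tree, focal)

# Hypothetical estimates, written by hand for illustration only.
estimate_small = (("A", "C"), ("B", "D"))                   # focal taxa only
estimate_large = ((("A", "X"), "B"), (("C", "Y"), "D"))     # with X and Y added

print(clade_distance(reference, estimate_small))                 # 4
print(clade_distance(reference, prune(estimate_large, focal)))   # 0
```

In this toy case the estimate built with the extra taxa and then pruned matches the pruned model tree, while the estimate built from the focal taxa alone does not. The example shows only how such a comparison can be scored, not which outcome to expect in general.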
The ultimate reward of this analysis will be the development of practical guidelines for finding optimal taxon sampling strategies for phylogenetic estimations.
A little background reading:
- Chase, M. W., Soltis, D. E., Olmstead, R. G., Morgan, D., Les, D. H., Mishler, B. D., Duvall, M. R., Price, R. A., Hills, H. G., Qiu, Y. L., Kron, K. A., Rettig, J. H., Conti, E., Palmer, J. D., Manhart, J. R., Sytsma, K. J., Michaels, H. J., Kress, W. J., Karol, K. G., Clark, W. D., Hedren, M., Gaut, B. S., Jansen, R. K., Kim, K. J., Wimpee, C. F., Smith, J. F., Furnier, G. R., Strauss, S. H., Xiang, Q. Y., Plunkett, G. M., Soltis, P. S., Swensen, S. M., Williams, S. E., Gadek, P. A., Quinn, C. J., Eguiarte, L. E., Golenberg, E., Learn, G. H., Graham, S. W., Barrett, S. C. H., Dayanandan, S., and Albert, V. A. (1993). Phylogenetics of Seed Plants: An Analysis of Nucleotide Sequences from the Plastid Gene rbcL. Annals of the Missouri Botanical Garden 80, 528-580.
- Hillis, D. (1998). Taxonomic Sampling, Phylogenetic Accuracy, and Investigator Bias. Systematic Biology 47, 3-8.
- Kim, J. (1998). Large-Scale Phylogenies and Measuring the Performance of Phylogenetic Estimators. Systematic Biology 47, 43-60.
- Rice, K. A., Donoghue, M. J., and Olmstead, R. G. (1997). Analyzing Large Data Sets: rbcL 500 Revisited. Systematic Biology 46, 554-563.
- Swofford, D. L., and Olsen, G. J. (1990). Phylogeny Reconstruction. Pages 411-501 in Molecular Systematics (D. M. Hillis and C. Moritz, eds.). Sinauer, Sunderland, Massachusetts.