The analysis of large-scale data sets is rapidly becoming a central problem in phylogenetic biology and molecular evolution. The algorithms for phylogenetic inference are extremely difficult (NP-hard) for many popular estimation methods. Large datasets force the use of heuristic algorithms that are sensitive to the structure of how the function value relates to the function arguments. The relationship between the objective function value and function argument is called the objective function landscape. By examining landscapes constructed from real genetic data for the Maximum Parsimony function we characterized the variables relevant to the success or failure of heuristic algorithms. As expected, the structure of the objective function landscape became more complex and more difficult with increasing numbers of taxa for all search graphs. There were increasing numbers of local optima, decreasing size of the attraction basins, and decreasing correlation between the depth of the optima and their attraction basins. In particular, we found Nearest-Neighbor Interchange search strategy to be inappropriate for heuristic searches. The trees at different local optima can be quite different. We found a topological difference of 40-50% among local optima differing by only 0.5-1.0% in tree length. This indicates a need for different heuristic strategies for large-scale phylogenies such as stochastic search strategies, divide-and-conquer, or direct incremental estimation.

Related pages: Publications