Kim01

The completion of the Drosophila melanogaster genome marks another significant milestone in the growth of sequence information. But it also contributes to the ever-widening gap between sequence information and biological knowledge. One important approach to reducing this gap is theoretical inference through computational technologies. A multitude of computer programs have been designed to annotate genomic sequences with biologically relevant information. Here, I suggest that all of these methods share a common structure in which sequence fragments are "coordinatized" by some description method, such as Hidden Markov Models. The key to the algorithms lies in constructing the most efficient set of coordinates, one that allows extrapolation and interpolation from existing knowledge. Extrapolation and interpolation are efficient when the sequence fragments acquire a natural geometrical structure in the coordinatized description. Finding such a coordinate frame is an inductive problem with no algorithmic solution. The greater part of the problem of genomic annotation therefore lies in biological modeling of the data rather than in algorithmic improvements.
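
As a rough illustration of what "coordinatizing" sequence fragments could mean, the sketch below maps each fragment to a point given by its normalized dinucleotide frequencies and annotates a new fragment by copying the label of its nearest annotated neighbor. The k-mer description is a deliberately simple stand-in for richer description methods such as Hidden Markov Models, not the method proposed here; the names `coordinatize`, `distance`, and `annotate`, the word length `K`, and the toy fragments and labels are all hypothetical and chosen only for the example.

```python
from collections import Counter
from itertools import product
import math

K = 2  # illustrative word length (an assumption, not taken from the text)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # the coordinate axes


def coordinatize(seq):
    """Map a sequence fragment to a point: its normalized k-mer frequencies."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS) or 1  # avoid division by zero
    return [counts[k] / total for k in KMERS]


def distance(a, b):
    """Euclidean distance between two points in the coordinate frame."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def annotate(query, labeled):
    """Interpolate from existing knowledge: copy the nearest fragment's label."""
    q = coordinatize(query)
    nearest = min(labeled, key=lambda item: distance(q, coordinatize(item[0])))
    return nearest[1]


# Toy, made-up fragments and labels, used only to exercise the functions.
labeled = [
    ("ATATATATATATATAT", "AT-rich"),
    ("GCGCGCGCGGCCGCGC", "GC-rich"),
]
print(annotate("ATATTATAATATATTA", labeled))  # prints "AT-rich"
```

Whether such a nearest-neighbor transfer is biologically meaningful depends entirely on the choice of coordinates: the same procedure run in a poorly chosen frame would group unrelated fragments, which is why the choice of description is a modeling question rather than an algorithmic one.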