Maintained by Junhyong Kim License (pdf)

Phylogenetically Informative Substring Extraction

Publication: S. Angelov, B. Harb, S. Kannan, S. Khanna, and J. Kim. 2007. "Efficient Enumeration of Phylogenetically Informative Substrings." Journal of Computational Biology, 14(6): 701-723.

A phylogeny is a tree graph depicting the genealogical history of the objects represented by its vertices. The vertices represent biological objects, which may be whole organisms, whole genomes, genes, and so on; all vertices of a single tree represent the same type of object. The leaf vertices are the degree-one vertices; they represent present-day objects for which measurement data are available. We therefore assume that each leaf vertex has associated data consisting of a (genomic) string. See tutorial on phylogenies.

The root of a phylogeny is a special vertex that represents the common ancestor of all other vertices. Any non-leaf vertex is called an ancestral vertex. A rooted phylogenetic tree has directed edges, each directed along the path from the root to the leaves. We call an edge directed out from an ancestral vertex a daughter edge, and the vertex it leads to a daughter vertex. For each leaf vertex there is a unique path from the root to that leaf, which we implicitly refer to as “the path”. In a typical phylogenetic tree each ancestral vertex has two daughter edges, which we call the Left and Right daughters.
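The definitions above can be illustrated with a small data structure. The following Python sketch is only an illustration under assumed names: the `Vertex` class, the helpers `leaves_below` and `path_to`, and the example sequences are all hypothetical and are not part of TagD.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vertex:
    """A vertex of a rooted phylogeny (hypothetical class, for illustration)."""
    name: str
    sequence: Optional[str] = None  # only leaf vertices carry a string
    daughters: List["Vertex"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A leaf has no outgoing (daughter) edges.
        return not self.daughters


def leaves_below(v: Vertex) -> List[Vertex]:
    """Return all leaf vertices reachable from v along directed edges."""
    if v.is_leaf():
        return [v]
    result: List[Vertex] = []
    for d in v.daughters:
        result.extend(leaves_below(d))
    return result


def path_to(v: Vertex, leaf_name: str) -> Optional[List[str]]:
    """Return the unique path from v down to the named leaf, or None."""
    if v.is_leaf():
        return [v.name] if v.name == leaf_name else None
    for d in v.daughters:
        sub = path_to(d, leaf_name)
        if sub is not None:
            return [v.name] + sub
    return None


# A small binary example with made-up sequences: ((A, B), C)
leaf_a = Vertex("A", "ACGTAC")
leaf_b = Vertex("B", "ACGTTT")
leaf_c = Vertex("C", "GGGTTT")
left = Vertex("P1", daughters=[leaf_a, leaf_b])   # Left daughter of the root
root = Vertex("root", daughters=[left, leaf_c])   # leaf C is the Right daughter
```

Here `path_to(root, "B")` walks the directed edges from the root to leaf B, and `leaves_below(left)` collects the leaves that descend from the Left daughter edge of the root.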

A phylogenetic tag, or tag for short, is a substring associated with a daughter edge, d1, of an ancestral vertex, P, such that the substring occurs in the strings of all leaf vertices whose paths pass through d1 and NOT in the string of any leaf vertex whose path passes through one of the other daughter edges, d2, ..., dk, of P. That is, it is a substring present in every leaf descending from one daughter edge and absent from every leaf descending from the other daughter edges of the same ancestral vertex.
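To make the definition concrete, here is a brute-force Python sketch that enumerates tags for one daughter edge. It is not the efficient algorithm of the publication and not TagD's implementation; the function names `substrings` and `tags`, the `min_len` parameter, and the example sequences are assumptions made for illustration. The leaves below d1 are passed as `inside`, and the leaves below the other daughter edges as `outside`.

```python
from typing import Iterable, Set


def substrings(s: str, min_len: int = 1) -> Set[str]:
    """All distinct substrings of s with length at least min_len."""
    return {
        s[i:j]
        for i in range(len(s))
        for j in range(i + min_len, len(s) + 1)
    }


def tags(inside: Iterable[str], outside: Iterable[str], min_len: int = 1) -> Set[str]:
    """Brute-force tag enumeration (illustrative only).

    inside  -- strings of the leaves below daughter edge d1
    outside -- strings of the leaves below the other daughter edges
    """
    inside = list(inside)
    outside = list(outside)
    if not inside:
        return set()
    # A tag must occur in every inside string, so it is enough to take
    # candidates from the shortest one.
    candidates = substrings(min(inside, key=len), min_len)
    return {
        t
        for t in candidates
        if all(t in s for s in inside) and not any(t in s for s in outside)
    }


# Made-up example: leaves A and B lie below d1, leaf C below d2.
left_tags = tags(["ACGTAC", "ACGTTT"], ["GGGTTT"], min_len=3)
print(sorted(left_tags))  # ['ACG', 'ACGT', 'CGT']
```

The substring "GTT" is not a tag for d1 in this example, because it is missing from "ACGTAC"; "GT" would be excluded as well, because it also occurs in the outside leaf "GGGTTT". This approach examines a quadratic number of candidate substrings per string and is meant only to clarify the definition.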

TagD is implemented as a command-line application that generates tags for a specific branch of a tree, given the tree's structure and the sequences of all its taxa. A user interface has also been developed to provide interpretive and visualization tools; it is provided as a plug-in for Mesquite.