Ortholog metrics and "high confidence" orthologs

Ortholog metrics have been calculated for a subset of metazoans and used to classify a "high confidence" set of orthologs. The methodology uses two orthogonal sources of information: gene order conservation (GOC) and whole genome alignments (WGA).

The "GOC score" metric for a pair of orthologs measures whether the two genes upstream and the two genes downstream of each gene in the pair are also orthologous; it allows for inversions and gene insertions. The "WGA coverage" metric measures the extent to which the orthologous regions are covered by pairwise LASTz alignments; it is based primarily on exonic coverage, with a small contribution from intronic coverage. Both metrics take values between 0 and 100.
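The idea behind the GOC score can be illustrated with a deliberately simplified sketch. It is not the production algorithm: the function name, the input layout, and the scoring of 25 points per matched neighbour are assumptions made for illustration. Neighbour order is ignored, which tolerates inversions, but gene insertions are not modelled.

```python
def goc_score(neighbours_a, neighbours_b, orthologs):
    """Toy gene order conservation score for one ortholog pair.

    neighbours_a / neighbours_b: ids of up to four flanking genes
        (two upstream, two downstream) of each gene in the pair.
    orthologs: set of frozenset({gene_a, gene_b}) ortholog pairs.
    Scoring (25 points per matched neighbour) is an assumption.
    """
    remaining = list(neighbours_b)
    matched = 0
    for gene in neighbours_a:
        for candidate in remaining:
            if frozenset((gene, candidate)) in orthologs:
                matched += 1
                # Each neighbour of B can be matched only once.
                remaining.remove(candidate)
                break
    return 100 * matched // 4


# Hypothetical gene ids: three of the four neighbours are orthologous,
# and their order is reversed in the second species (an inversion).
orthologs = {
    frozenset(("a1", "b1")),
    frozenset(("a2", "b2")),
    frozenset(("a3", "b3")),
}
score = goc_score(["a1", "a2", "a3", "a4"], ["b3", "b2", "b1", "b9"], orthologs)
```

Comparing the neighbours as an unordered collection is what makes this toy version insensitive to inversions.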

Gene order conservation is expected only between species that are evolutionarily close; thus the GOC score is only calculated within Diptera, Hymenoptera, and Nematoda. Similarly, pairwise WGAs, and thus the WGA coverage metric, are only available for a subset of fairly closely related species.

To classify orthologs as "high confidence", thresholds are applied to the ortholog metrics according to the evolutionary distance between the species. Within Aculeata, Caenorhabditis, Drosophila, and Onchocercidae, the GOC threshold is 50 and the WGA threshold is 25; no thresholds are applied beyond these clades. Where GOC and WGA metrics are not available, a "tree-compliance" metric is used to identify orthologs inferred from dubious tree topologies. Finally, the orthologous proteins must have a percentage identity above a threshold, currently set at 25% for all species.
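The classification described above can be sketched as a small decision function. This is a minimal sketch, not the pipeline's code, and it rests on several assumptions: the function and parameter names are hypothetical; a pair passes if either available metric meets its threshold (the text does not say whether both are required); "above" is read as "greater than or equal to"; tree compliance is treated as a boolean; and both proteins must meet the identity threshold. It covers only pairs within the clades where the thresholds apply.

```python
from typing import Optional

IDENTITY_THRESHOLD = 25.0  # percent, applied to all species
GOC_THRESHOLD = 50.0       # within the clades listed above
WGA_THRESHOLD = 25.0       # within the clades listed above


def is_high_confidence(
    identity_a: float,
    identity_b: float,
    goc: Optional[float] = None,
    wga: Optional[float] = None,
    tree_compliant: bool = False,
) -> bool:
    """Return True if an ortholog pair would count as high confidence."""
    # Both proteins must meet the percentage identity threshold (assumption).
    if min(identity_a, identity_b) < IDENTITY_THRESHOLD:
        return False
    # Fall back to tree compliance when neither metric is available.
    if goc is None and wga is None:
        return tree_compliant
    # Assumption: meeting either available threshold is sufficient.
    goc_ok = goc is not None and goc >= GOC_THRESHOLD
    wga_ok = wga is not None and wga >= WGA_THRESHOLD
    return goc_ok or wga_ok


# Hypothetical pair: good gene order conservation, poor alignment coverage.
result = is_high_confidence(60.0, 40.0, goc=75.0, wga=10.0)
```

Under a stricter reading in which both metrics must pass, the final line of the function would use `and` instead of `or`.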

The metrics are displayed in the genome browser in the ortholog table (example below), and are available in BioMart.

Figure: Ortholog metrics in the ortholog table.

