WormBase annotation

Manual curation of gene models in WormBase

WormBase undertakes targeted manual curation of the gene models in Caenorhabditis elegans. Most of the initial C. elegans coding gene structures were determined in 1998 by the ab initio gene prediction program Genefinder (Green, P., unpublished data). Curators have spent a number of years editing and improving these initial gene models, using supporting data from large-scale transcriptome projects such as Yuji Kohara’s EST libraries (Kohara, Y., unpublished data) and, more recently, many RNA-Seq projects, including modENCODE. Genes with low levels of expression present particular problems because they typically have sparse transcript data to support the predicted structure. In these cases, curators rely heavily on paralogous and orthologous gene structures (via protein alignments), or on in-depth inspection of aligned, conserved genomic regions.

Historically, targets for curation were selected using an automated system that looked for discrepancies between the current gene models and the supporting transcriptomic evidence. More recently, WormBase has embarked on a theme-based curation strategy, with groups of targets being selected according to a particular theme of interest, e.g. members of a specific gene family, or genes involved in a specific pathway.
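As an illustration only, the kind of discrepancy check described above can be sketched as a comparison between the introns implied by a gene model and the introns supported by aligned transcripts. This is not WormBase's actual pipeline: the function names, the coordinate convention (1-based, inclusive) and the two report categories are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[int, int]  # (start, end), assumed 1-based inclusive coordinates


def introns_from_exons(exons: Iterable[Interval]) -> Set[Interval]:
    """Derive the intron intervals lying between consecutive exons."""
    ordered = sorted(exons)
    return {
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    }


def find_discrepancies(
    model_exons: List[Interval],
    evidence_introns: Iterable[Interval],
) -> Dict[str, List[Interval]]:
    """Compare a gene model's introns with transcript-supported introns.

    Hypothetical helper: only evidence introns lying within the span of
    the gene model are considered.
    """
    model_introns = introns_from_exons(model_exons)
    gene_start = min(start for start, _ in model_exons)
    gene_end = max(end for _, end in model_exons)
    local_evidence = {
        intron
        for intron in evidence_introns
        if intron[0] >= gene_start and intron[1] <= gene_end
    }
    return {
        # Introns in the model that no transcript confirms
        "unsupported_in_model": sorted(model_introns - local_evidence),
        # Introns seen in transcripts that the model lacks
        "missing_from_model": sorted(local_evidence - model_introns),
    }


if __name__ == "__main__":
    exons = [(100, 200), (300, 400), (500, 600)]
    evidence = {(201, 299), (401, 449), (1000, 1100)}
    print(find_discrepancies(exons, evidence))
```

A gene model for which either list is non-empty would be a candidate for curator review. A real system would also need to weigh the depth of supporting evidence, strand, and alternative isoforms, none of which this sketch models.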

WormBase also curates gene models in seven additional species (C. briggsae, C. brenneri, C. japonica, C. remanei, Onchocerca volvulus, Pristionchus pacificus and Brugia malayi). This is mainly driven by:

  1. the curation of a specific group of genes in C. elegans as part of a curation theme (see above);
  2. the availability of high-profile data sets (for example a new version of the reference genome, or a new large-scale transcriptomics study); and
  3. direct user requests.

For species outside the above core set, WormBase imports the predicted gene models submitted by the original genome project for that species. No manual curation is performed on these gene models.

WormBase genomes and gene models in Ensembl Metazoa

The eight core species above, together with selected other species, have their genome and gene set represented in Ensembl Metazoa. Data for a species is always drawn from a specific, named release of WormBase. Updates are performed principally according to demand, with the highest demand being for C. elegans (which is updated in Ensembl Metazoa every fifth WormBase release).