Many genomes made available through Ensembl Genomes are imported from annotated records in the INSDC archives (ENA, GenBank and DDBJ). This document gives an overview of the steps involved in loading these data. The import pipeline is currently for internal EBI use only, because it depends on resources available only within the European Bioinformatics Institute; the data integrated by the pipeline are nevertheless made freely available. For more information on how this pipeline is used for Ensembl Bacteria, see Ensembl Bacteria import pipeline.
Genome identification
Genomes imported from INSDC are uniquely identified by an INSDC assembly accession corresponding to an entry in the INSDC Genome Assembly Database, which is used as an authoritative source of assemblies. Each assembly record comprises basic metadata about the organism plus INSDC accessions for replicons, unassembled scaffolds and WGS sets.
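To make the shape of an assembly record concrete, the following is a minimal Python sketch, not the pipeline's own data model. The class name `AssemblyRecord`, its field names and the example values are hypothetical. The accession pattern assumes the usual INSDC assembly accession layout of a `GCA_` prefix, nine digits and a version suffix.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed layout of an INSDC assembly accession: GCA_ + 9 digits + "." + version.
ASSEMBLY_ACCESSION = re.compile(r"^GCA_(\d{9})\.(\d+)$")


@dataclass
class AssemblyRecord:
    """Hypothetical container for one assembly record (illustrative names only)."""
    accession: str                                       # INSDC assembly accession
    organism: str                                        # basic organism metadata
    replicons: List[str] = field(default_factory=list)   # replicon accessions
    scaffolds: List[str] = field(default_factory=list)   # unassembled scaffold accessions
    wgs_sets: List[str] = field(default_factory=list)    # WGS set accessions

    def __post_init__(self) -> None:
        # Reject anything that does not look like an assembly accession.
        if not ASSEMBLY_ACCESSION.match(self.accession):
            raise ValueError(f"not an assembly accession: {self.accession!r}")

    @property
    def version(self) -> int:
        """Return the numeric version suffix of the accession."""
        return int(ASSEMBLY_ACCESSION.match(self.accession).group(2))

    def sequence_accessions(self) -> List[str]:
        """Return every sequence accession that would need to be retrieved."""
        return [*self.replicons, *self.scaffolds, *self.wgs_sets]


# Placeholder values, not a real assembly.
record = AssemblyRecord(
    accession="GCA_000000000.1",
    organism="Example organism",
    replicons=["XX000001.1"],
    scaffolds=["XX000002.1", "XX000003.1"],
)
print(record.version, record.sequence_accessions())
```

Keeping the accession as the single key and validating it on construction mirrors its role above as the unique identifier for an imported genome.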
Gene model generation
For each genome, the import pipeline runs the following steps:
- retrieve and parse each entry from the ENA REST interface to generate a basic genome model:
  - parse out metadata from the entry.
  - parse out sequence from the entry (for CON entries, the sequences and assembly of component entries are also retrieved).
  - parse out features and process each individually:
    - use locus_tag feature qualifiers to combine CDS, gene, 5'UTR, 3'UTR and mRNA features to produce protein-coding gene models.
    - use locus_tag feature qualifiers to combine tRNA, rRNA and ncRNA features (and corresponding gene features) to produce non-coding RNA gene models. Note that the INSDC data model may include genes containing both protein-coding and non-coding RNA transcripts, which are reflected in the finished genome model.
    - use mat_peptide, sig_peptide and transit_peptide features to add further protein features to protein-coding gene models.
    - generate repeat models from repeat_region features.
    - generate simple feature models from all other feature types that cannot be processed as above.
    - add external database references based on xref qualifiers.
- process each translation in the generated model to:
  - find the corresponding entry in UniParc based on INSDC protein_id cross-references
  - use UniParc identifiers to retrieve matches from InterPro to create protein features
  - use UniProtKB identifiers to retrieve annotation from GOA or, if no UniProtKB mapping is found, use InterPro signatures to retrieve annotation from GOA using the InterPro2GO mapping
  - use UniProtKB identifiers to retrieve selected cross-references to other resources from the corresponding UniProtKB record
- load imported genomes into Ensembl core MySQL databases.
- additional non-coding RNA gene models are generated from available alignments of Rfam family models to INSDC genomic sequences.
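The locus_tag grouping described in the steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline's code: `Feature`, `GeneModel` and `build_gene_models` are hypothetical names, features are assumed to be already parsed into a key, a location string and a qualifier dictionary, and any feature without a locus_tag is simply passed on for the other steps (repeats, peptide features, simple features).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# INSDC feature keys handled by the two grouping steps described above.
CODING_KEYS = {"CDS", "mRNA", "5'UTR", "3'UTR"}
NONCODING_KEYS = {"tRNA", "rRNA", "ncRNA"}


@dataclass
class Feature:
    """Hypothetical stand-in for one parsed INSDC feature."""
    key: str                                  # feature key, e.g. "CDS" or "tRNA"
    location: str                             # location string, e.g. "1..300"
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class GeneModel:
    """Hypothetical gene model: all grouped features sharing one locus_tag."""
    locus_tag: str
    gene_features: List[Feature] = field(default_factory=list)
    coding: List[Feature] = field(default_factory=list)
    noncoding: List[Feature] = field(default_factory=list)

    @property
    def kinds(self) -> List[str]:
        """List the transcript types present; a gene may carry both."""
        kinds = []
        if self.coding:
            kinds.append("protein_coding")
        if self.noncoding:
            kinds.append("non_coding_rna")
        return kinds


def build_gene_models(features: List[Feature]) -> Tuple[Dict[str, GeneModel], List[Feature]]:
    """Group features by locus_tag; return the models and the untouched rest."""
    models: Dict[str, GeneModel] = {}
    remaining: List[Feature] = []
    for feature in features:
        tag = feature.qualifiers.get("locus_tag")
        grouped = feature.key == "gene" or feature.key in CODING_KEYS | NONCODING_KEYS
        if tag is None or not grouped:
            # Assumption: anything else is left for the later processing steps.
            remaining.append(feature)
            continue
        model = models.setdefault(tag, GeneModel(locus_tag=tag))
        if feature.key == "gene":
            model.gene_features.append(feature)
        elif feature.key in CODING_KEYS:
            model.coding.append(feature)
        else:
            model.noncoding.append(feature)
    return models, remaining
```

Because both coding and non-coding features are collected on the same model, a locus_tag that carries a CDS and an ncRNA yields a single gene reporting both kinds, which matches the note above about mixed genes.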
Gene identifier assignment
For genomes imported from INSDC, the names and identifiers used within the Ensembl databases for genes and gene products are derived as follows:
- names
  - genes are named from the gene feature qualifier (e.g. yjdO)
  - transcripts are named from a composite of the gene name and the protein_id qualifier (e.g. yjdO/ABD18711)
- stable identifiers
  - the gene stable identifier is derived from the locus_tag qualifier, or from the protein_id qualifier if no locus_tag is specified (e.g. b4559)
  - the transcript stable identifier is derived either from the protein_id qualifier of the corresponding CDS feature, or the locus_tag qualifier if no protein_id is specified (e.g. ABD18711)
  - the translation stable identifier is derived from the protein_id qualifier of the corresponding CDS feature (e.g. ABD18711)
- internal references
  - each feature derived from an ENA feature has a generated identifier of the form contig_acc:feature:location (e.g. BX072543.1:CDS:868063..869034)
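The derivation rules above can be expressed as a short Python sketch. The function names and the qualifier dictionary are hypothetical and chosen for illustration; where the rules do not say what happens when a qualifier is missing (for example a transcript with no gene name), the sketch returns `None`, which is an assumption rather than documented behaviour.

```python
from typing import Dict, Optional


def derive_identifiers(qualifiers: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply the naming rules to the qualifiers of one CDS feature."""
    gene = qualifiers.get("gene")
    locus_tag = qualifiers.get("locus_tag")
    protein_id = qualifiers.get("protein_id")

    # Transcript name is a composite of gene name and protein_id.
    # Assumption: no name is produced if either part is missing.
    transcript_name = f"{gene}/{protein_id}" if gene and protein_id else None

    return {
        "gene_name": gene,
        "transcript_name": transcript_name,
        # locus_tag first, protein_id as the fallback.
        "gene_stable_id": locus_tag or protein_id,
        # protein_id first, locus_tag as the fallback.
        "transcript_stable_id": protein_id or locus_tag,
        # Translations always take the protein_id.
        "translation_stable_id": protein_id,
    }


def internal_reference(contig_acc: str, feature_key: str, location: str) -> str:
    """Build the generated identifier contig_acc:feature:location."""
    return f"{contig_acc}:{feature_key}:{location}"


# Qualifier values taken from the examples in the list above.
ids = derive_identifiers({"gene": "yjdO", "locus_tag": "b4559", "protein_id": "ABD18711"})
print(ids["transcript_name"], ids["gene_stable_id"])
print(internal_reference("BX072543.1", "CDS", "868063..869034"))
```

The two fallbacks run in opposite directions: a gene prefers its locus_tag, while a transcript prefers the protein_id of its CDS.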