Aedes aegypti Assembly and Gene Annotation

The Aedes aegypti data and its display on Ensembl Genomes are made possible through a joint effort by the Ensembl Genomes group and VectorBase, a NIAID Bioinformatics Resource Center. The source data for this genome and other resources can be accessed via the VectorBase browser.

About Aedes aegypti

Aedes aegypti exists in at least two forms (treated as subspecies or as separate species, depending on the author): Ae. aegypti formosus, the original wild type found in Africa, and Ae. aegypti aegypti, the urban form found worldwide. The yellow fever mosquito, Ae. aegypti aegypti, is distributed throughout the tropics and subtropics, where it is the main vector of both dengue and yellow fever viruses.

Picture credit (public domain): James Gathany (CDC) 2006

Assembly

The Aedes aegypti Liverpool LVP strain genome sequence is a joint effort between the Broad Institute and The Institute for Genomic Research (TIGR) [1]. Assembly of 8x shotgun coverage was performed using the Broad's whole genome assembly package ARACHNE. The assembly presented here (AaegL3) consists of 4,756 supercontigs and a mitochondrial chromosome, totalling 1.3 gigabases.

Annotation

The initial annotation of the Aedes aegypti genome is a collaboration between VectorBase and TIGR. Each group generated a set of gene predictions, and the two sets were merged into a single canonical set (AaegL1.1). The geneset presented here (AaegL3.1, April 2014) combines the original set with several rounds of community annotations; additional gene predictions that were excluded from the AaegL1.1 set but were subsequently found to have supporting evidence (transcriptome data, mapped protein domains); mitochondrial genes; and non-coding RNA genes from the Ensembl Genomes pipeline.

Variation

Starting with Ensembl Genomes release 23, SNP data is available for Aedes aegypti, derived from the Bonizzoni et al. study [2] and imported via VectorBase. The study used RNA-seq to characterise sequence variation in three laboratory strains of Aedes aegypti. It analysed the transcriptomes of the Liverpool (LVP) reference strain and of two strains that exhibit differential susceptibility to Dengue-2 infection (Chetumal and Rexville D-Puerto Rico), identifying many novel transcriptional units and polymorphisms in immunity-related genes.

EST and Protein Alignments

Aedes aegypti ESTs were mapped onto the genome using Exonerate (Example: supercont1.174:147000-580000).

WU-BlastX was used to map UniProtKB proteins onto the Aedes aegypti genome. The datasets used were: Aedes, mosquito, drosophilid, arthropod, metazoan, eukaryotic, and non-eukaryotic proteins. The wider taxonomic groups exclude any of the more specific groups, e.g. the arthropod dataset excludes mosquito and drosophilid proteins. (Example: supercont1.174:147000-580000).
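The exclusion rule described above means that each protein ends up in exactly one dataset: the most specific group it belongs to. The sketch below illustrates that rule only; it is not the pipeline's actual code. The function name `assign_group`, the representation of a protein as a list of lineage names, and the mapping of each dataset to a taxon name (for example "mosquito" to Culicidae) are assumptions made for illustration.

```python
# Illustrative sketch: assign a protein to the most specific dataset.
# The dataset-to-taxon mapping below is an assumption, not taken from
# the annotation pipeline.
GROUP_TAXA = [
    ("Aedes", "Aedes"),
    ("mosquito", "Culicidae"),
    ("drosophilid", "Drosophilidae"),
    ("arthropod", "Arthropoda"),
    ("metazoan", "Metazoa"),
    ("eukaryotic", "Eukaryota"),
]


def assign_group(lineage):
    """Return the dataset name for a protein, given its taxonomic lineage.

    `lineage` is an iterable of taxon names (hypothetical input format).
    Groups are checked from most specific to most general, so a wider
    group never receives a protein that fits a narrower one.
    """
    taxa = set(lineage)
    for group, taxon in GROUP_TAXA:
        if taxon in taxa:
            return group
    # Anything without a eukaryotic lineage falls into the last dataset.
    return "non-eukaryotic"


anopheles = ["Eukaryota", "Metazoa", "Arthropoda", "Culicidae", "Anopheles"]
print(assign_group(anopheles))
```

Because the list is ordered from narrowest to widest group, the first match is the most specific one, which reproduces the "arthropod excludes mosquito and drosophilid" behaviour described in the text.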

Approximately 1,300 community annotations are mapped to the genome (Example: supercont1.174:147000-580000).
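The example locations in this section follow the pattern name:start-end. As a minimal sketch, such a string can be split into its parts as shown below. The names `Region`, `parse_region` and `region_length` are hypothetical helpers, and the length calculation assumes coordinates are 1-based and inclusive.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic location: sequence name plus start and end coordinates."""
    name: str
    start: int
    end: int


# name:start-end, where the name may itself contain dots (supercont1.174)
_REGION_RE = re.compile(r"^(?P<name>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text):
    """Parse a 'name:start-end' location string into a Region."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a name:start-end location: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return Region(match["name"], start, end)


def region_length(region):
    """Length in bases, assuming 1-based inclusive coordinates."""
    return region.end - region.start + 1


region = parse_region("supercont1.174:147000-580000")
print(region.name, region_length(region))
```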

References

  1. Genome sequence of Aedes aegypti, a major arbovirus vector.
    Nene V, Wortman JR, Lawson D, Haas B, Kodira C, Tu ZJ, Loftus B, Xi Z, Megy K, Grabherr M et al. 2007. Science. 316:1718-1723.
  2. Probing functional polymorphisms in the dengue vector, Aedes aegypti.
    Bonizzoni M, Britton M, Marinotti O, Dunn WA, Fass J, James AA. 2013. BMC Genomics. 14:739.

Statistics

Summary

Assembly: AaegL3, INSDC Assembly GCA_000004015.1, Dec 2013
Database version: 76.3
Base Pairs: 1,310,106,999
Golden Path Length: 1,383,974,186
Genebuild method: Full genebuild
Table of top 500 InterPro hits
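The two length figures in the summary differ, and a short calculation shows by how much. The sketch below assumes that Base Pairs counts sequenced bases only, while Golden Path Length is the total length of the top-level sequences including gaps; under that assumption the difference is the amount of gap sequence in the assembly. The variable names are illustrative.

```python
# Figures from the summary above.
base_pairs = 1_310_106_999          # assumed: sequenced bases only
golden_path_length = 1_383_974_186  # assumed: top-level length incl. gaps

# Under the stated assumption, the difference is gap sequence.
gap_bases = golden_path_length - base_pairs
gap_fraction = gap_bases / golden_path_length

print(f"gap bases: {gap_bases:,}")
print(f"gap fraction: {gap_fraction:.1%}")
```

Under this reading, roughly 74 million bases, a little over 5% of the golden path, would be gaps rather than sequenced bases.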

Gene counts

Coding genes: 15,797
Genes and/or transcripts that contain an open reading frame (ORF).

Small non coding genes: 1,660
Short non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as short non coding genes: miRNA, miscRNA, rRNA, tRNA, ncRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic forms of these biotypes. The majority of the short non coding genes in Ensembl are annotated automatically by our ncRNA pipeline.

Long non coding genes: 3
Long non coding genes are usually longer than 200 bases. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as long non coding genes: 3prime_overlapping_ncrna, ambiguous_orf, antisense, antisense_RNA, lincRNA, ncrna_host, non_coding, non_stop_decay, processed_transcript, retained_intron, sense_intronic, sense_overlapping. The majority of the long non coding genes in Ensembl are annotated manually by HAVANA.

Pseudogenes: 19
A pseudogene shares an evolutionary history with a functional protein-coding gene, but has accumulated frameshift mutations and/or stop codons that disrupt the open reading frame.

Gene transcripts: 18,838
A transcript is the nucleotide sequence resulting from the transcription of genomic DNA to mRNA. One gene can have several transcripts, or splice variants, resulting from the alternative splicing of different exons.

Coordinate Systems

chromosome: 1 sequence

  Sequence  Length (bp)
  Mt        16,655

supercontig: 4,756 sequences
contig: 36,207 sequences