Anopheles gambiae Assembly and Gene Annotation

The Anopheles gambiae data and its display on Ensembl Genomes are made possible through a joint effort by the Ensembl Genomes group and VectorBase, a NIAID Bioinformatics Resource Center. The source data for this genome and other resources can be accessed via the VectorBase browser.

About Anopheles gambiae

Anopheles gambiae sensu stricto is the primary mosquito vector responsible for the transmission of malaria in most of sub-Saharan Africa. It is a member of a species complex that includes at least seven morphologically indistinguishable species in the Series Pyretophorus in the Anopheles subgenus Cellia. An. gambiae feeds preferentially on humans and is one of the most efficient malaria vectors known. Anopheles gambiae sensu stricto is now known to consist of two genetically distinct forms or incipient species, known formally as the An. gambiae M and An. gambiae S forms.

Picture credit (public domain): James Gathany (CDC) 1994

Assembly

The genome assembly presented here (AgamP3, February 2006) is a revised assembly based on the whole genome shotgun assembly of the PEST strain of Anopheles gambiae produced by the International Anopheles Genome Project and described in [1], with revisions as described in [2]. More details can be found at VectorBase.

Annotation

Annotation of the AgamP3 assembly was carried out by VectorBase. The set of gene models presented (genebuild AgamP3.7, released October 2012) combines manual annotation, data provided by the research community, and gene prediction using the Ensembl system. Prediction utilised alignments of dipteran and other protein sets to the genome and generation of GeneWise models, alignment and gene prediction based on Anopheles ESTs, and selected ab initio predictions. More details can be found at VectorBase.

Functional Genomics

The functional genomics database for An. gambiae contains mappings to probes from many microarray designs. More details can be found at the VectorBase microarray page.

Variation

Variation data for Anopheles gambiae was imported from NCBI dbSNP, and from other studies involving the M and S molecular forms [3, 4, 5].

Variation data is also available for the Anopheles gambiae MR4 reference colonies 4ARR, Kisumu, Akron, L3-5 and G3. These samples were sequenced by the Kwiatkowski group at the Wellcome Trust Sanger Institute, as part of the Malaria Programme's Anopheles gambiae Genome Variation Project. These variants should be considered preliminary, pending further analysis and quality control filtering.

EST and Protein Alignments

Anopheles gambiae ESTs were mapped onto the genome using Exonerate (Example: 2L:155000-225000).

WU-BlastX was used to map UniProtKB proteins onto the Anopheles gambiae genome. The datasets used were: Aedes, mosquito, drosophilid, arthropod, metazoan, eukaryotic, and non-eukaryotic proteins. The wider taxonomic groups exclude any of the more specific groups, e.g. the arthropod dataset excludes mosquito and drosophilid proteins. (Example: 2L:39300000-39320000).
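The exclusion rule described above amounts to assigning each protein to the most specific dataset it qualifies for. The following is a minimal sketch of that idea only, not the pipeline actually used; the `TIERS` ordering, the `assign_tier` function, and the representation of a protein as a set of taxonomic labels are all assumptions made for illustration.

```python
# Hypothetical illustration: tiers listed from most specific to widest.
# The ordering is an assumption based on the dataset names in the text.
TIERS = ["Aedes", "mosquito", "drosophilid", "arthropod", "metazoan", "eukaryotic"]


def assign_tier(lineage):
    """Return the most specific tier found in a protein's taxonomic labels.

    `lineage` is a set of labels (hypothetical input format). Because the
    first matching tier wins, a protein never appears in a wider dataset
    once it has matched a more specific one.
    """
    for tier in TIERS:
        if tier in lineage:
            return tier
    # Nothing matched, so the protein falls into the residual group.
    return "non-eukaryotic"


aedes_protein = {"Aedes", "mosquito", "arthropod", "metazoan", "eukaryotic"}
fly_protein = {"drosophilid", "arthropod", "metazoan", "eukaryotic"}
other_arthropod = {"arthropod", "metazoan", "eukaryotic"}

print(assign_tier(aedes_protein))    # most specific match: Aedes
print(assign_tier(fly_protein))      # drosophilid, not arthropod
print(assign_tier(other_arthropod))  # arthropod
```

Under this scheme the arthropod dataset contains no mosquito or drosophilid proteins, matching the example given in the text.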

GeneWise was used to map proteins from non-redundant taxonomic levels onto the Anopheles gambiae genome. The protein datasets used were: Anopheles (UniProtKB and community annotation), Aedes (VectorBase), drosophilid (FlyBase), and all UniProtKB (Example: 2L:39308000-39320000).

Approximately 8500 community annotations are mapped to the genome (Example: 2L:155000-225000).
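The example locations in this section use the form sequence:start-end. As a rough sketch, such a string can be split into its parts as shown below; the `parse_region` helper and its regular expression are hypothetical, and the length calculation assumes the 1-based, inclusive coordinates conventionally used by Ensembl.

```python
import re

# Hypothetical pattern for "name:start-end" region strings such as 2L:155000-225000.
REGION_RE = re.compile(r"^(?P<seq>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split a region string into (sequence name, start, end, length in bp)."""
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"not a region string: {region!r}")
    start = int(match["start"])
    end = int(match["end"])
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    # Assumes 1-based inclusive coordinates, so both endpoints are counted.
    return match["seq"], start, end, end - start + 1


for example in ("2L:155000-225000", "2L:39300000-39320000", "2L:39308000-39320000"):
    print(parse_region(example))
```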

References

  1. The genome sequence of the malaria mosquito Anopheles gambiae.
    Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, Nusskern DR, Wincker P, Clark AG, Ribeiro JM, Wides R et al. 2002. Science. 298:129-149.
  2. Update of the Anopheles gambiae PEST genome assembly.
    Sharakhova MV, Hammond MP, Lobo NF, Krzywinski J, Unger MF, Hillenmeyer ME, Bruggner RV, Birney E, Collins FH. 2007. Genome Biology. 8:R5.
  3. SNP genotyping defines complex gene-flow boundaries among African malaria vector mosquitoes.
    Neafsey DE, Lawniczak MK, Park DJ, Redmond SN, Coulibaly MB, Traoré SF, Sagnon N, Costantini C, Johnson C, Wiegand RC et al. 2010. Science. 330:514-517.
  4. Association mapping of insecticide resistance in wild Anopheles gambiae populations: major variants identified in a low-linkage disequilbrium genome.
    Weetman D, Wilding CS, Steen K, Morgan JC, Simard F, Donnelly MJ. 2010. PLoS ONE. 5:e13140.
  5. Gene flow-dependent genomic divergence between Anopheles gambiae M and S forms.
    Weetman D, Wilding CS, Steen K, Pinto J, Donnelly MJ. 2012. Molecular Biology and Evolution. 29:279-291.

Statistics

Summary

Assembly: AgamP3, INSDC Assembly GCA_000005575.1, Feb 2006
Database version: 75.3
Base Pairs: 278,253,050
Golden Path Length: 273,093,681
Genebuild method: Full genebuild
Table of top 500 InterPro hits

Gene counts

Coding genes: 12,810

Genes and/or transcripts that contain an open reading frame (ORF).

Short non coding genes: 650

Short non coding genes are usually fewer than 200 bases long. They may be transcribed but are not translated. In Ensembl, genes with the following biotypes are classed as short non coding genes: miRNA, miscRNA, rRNA, tRNA, ncRNA, scRNA, snlRNA, snoRNA, snRNA, and also the pseudogenic forms of these biotypes. The majority of the short non coding genes in Ensembl are annotated automatically by our ncRNA pipeline.

Pseudogenes: 5

A pseudogene shares an evolutionary history with a functional protein-coding gene but has accumulated frameshift mutations and/or stop codons that disrupt the open reading frame.

Coordinate Systems

chromosome: 7 sequences

Sequence      Length (bp)
UNKN          42389979
Y_unplaced    237045
2L            49364325
2R            61545105
3L            41963435
3R            53200684
X             24393108

scaffold: 8987 sequences
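
As a consistency check, the lengths of the seven chromosome-level sequences listed above add up to the Golden Path Length given in the Summary. The short sketch below reproduces that arithmetic; the variable names are illustrative only.

```python
# Lengths (bp) of the seven chromosome-level sequences, copied from the table above.
CHROMOSOME_LENGTHS = {
    "UNKN": 42_389_979,
    "Y_unplaced": 237_045,
    "2L": 49_364_325,
    "2R": 61_545_105,
    "3L": 41_963_435,
    "3R": 53_200_684,
    "X": 24_393_108,
}

# Golden Path Length as stated in the Summary section.
GOLDEN_PATH_LENGTH = 273_093_681

total = sum(CHROMOSOME_LENGTHS.values())
print(f"Sum of chromosome lengths: {total:,}")
print(f"Matches Golden Path Length: {total == GOLDEN_PATH_LENGTH}")
```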

Other

Snap gene prediction: 24,679
Short Variants: 9,088,941