Multiple genome alignments
Multiple alignments are calculated between groups of genomes.
Alignments available
Name | Genomes | Method used |
---|---|---|
wormbase-ws269 cactus-hal | Brugia malayi (Nematode, FR3), Caenorhabditis brenneri (Nematode), Caenorhabditis briggsae (Nematode), Caenorhabditis elegans (Nematode, N2), Caenorhabditis japonica (Nematode), Caenorhabditis remanei (Nematode), Onchocerca volvulus (Nematode, O. volvulus Cameroon isolate), Pristionchus pacificus (Nematode, PS312), Strongyloides ratti (Threadworm) | Cactus |
46 pangenome_drosophila | Bactrocera neohumeralis (Lesser Queensland fruit fly, Rockhampton), Coremacera marginata (Sieve-winged Snailkiller, reference), Drosophila albomicans (Fruit fly, 15112-1751.03), Drosophila ananassae (Fruit fly, 14024-0371.13), Drosophila arizonae (Fruit fly, ariz_Son04), Drosophila biarmipes (Fruit fly, raj3), Drosophila bipectinata (Pomace flies, 14024-0381.07), Drosophila busckii (Fruit fly, San Diego stock center stock number 13000-0081.31), Drosophila elegans (Pomace flies, 14027-0461.03), Drosophila erecta (Fruit fly, 14021-0224.00,06,07), Drosophila eugracilis (Fruit fly, 14026-0451.02), Drosophila ficusphila (Fruit fly, 14025-0441.05), Drosophila grimshawi (Fruit fly, 15287-2541.00), Drosophila guanche (Fruit fly), Drosophila gunungcola (Fruit fly, Sukarami), Drosophila hydei (Fruit fly, 15085-1641.00,03,60), Drosophila innubila (Fruit fly, TH190305), Drosophila kikkawai (Pomace flies, 14028-0561.14), Drosophila mauritiana (Fruit fly, mau12), Drosophila melanogaster - (Fruit fly), Drosophila miranda (Fruit fly, MSH22), Drosophila mojavensis (Fruit fly, 15081-1352.22), Drosophila montana (Pomace flies, Dmon_TW-CO22-7), Drosophila nasuta (Pomace flies, 15112-1781.00), Drosophila navojoa (Fruit fly, navoj_Jal97), Drosophila novamexicana (Pomace flies, 15010-1031.04,08,12), Drosophila obscura (Fruit fly, BZ-5 IFL), Drosophila persimilis (Fruit fly, 14011-0111.01,24,50), Drosophila pseudoobscura (Fruit fly, MV2-25), Drosophila rhopaloa (Fruit fly, 14029-0021.01), Drosophila santomea (Fruit fly, STO CAGO 1482), Drosophila sechellia (Fruit fly, sech25), Drosophila serrata (Pomace flies, Fors4), Drosophila simulans (Fruit fly, w501), Drosophila subobscura (Fruit fly, 14011-0131.10), Drosophila subpulchrella (Fruit fly, 33 F10 #4), Drosophila sulfurigaster albostrigata (Flies, 15112-1811.04), Drosophila suzukii (Pomace flies, WT10), Drosophila takahashii (Pomace flies, IR98-3 E-12201), Drosophila teissieri (Fruit fly, GT53w), Drosophila tropicalis (Pomace flies, 14030-0801.00), Drosophila virilis (Pomace flies, 15010-1051.87), Drosophila willistoni (Fruit fly, 14030-0811.24), Drosophila yakuba (Fruit fly, Tai18E2), Machimus atricapillus (Kite-tailed Robberfly, reference), Myopa tessellatipennis - (flies, reference) | Cactus |
Alignment methods
Progressive Cactus
Progressive Cactus [3] is a whole-genome aligner that stores alignments in a graph structure. Because genomes can be added incrementally, it scales to hundreds of genomes.
The Ensembl Compara Perl API provides access to Cactus alignment data in one of two ways: via a HAL file (CACTUS_HAL) or via a database (CACTUS_DB).
Cactus alignment via HAL file
Alignments of type CACTUS_HAL are accessed via a HAL file [4]. For performance reasons, alignments are filtered to remove blocks whose length is below a threshold set to approximately one thousandth the size of the genomic region being accessed. Within each alignment block, aligned sequences are deduplicated per genome, keeping only the aligned sequence with the greatest number of nucleotides for the given genome.
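The two filters described above can be illustrated with a minimal Python sketch. This is not the Compara Perl API implementation. The `AlignedSeq` class, the function names, the use of the column count as the block length, and the tie-breaking rule (the first row wins) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlignedSeq:
    genome: str  # hypothetical genome label
    text: str    # aligned row, with '-' marking gaps


def nucleotide_count(seq: AlignedSeq) -> int:
    """Number of non-gap characters in an aligned row."""
    return sum(1 for ch in seq.text if ch != "-")


def dedupe_per_genome(block: list[AlignedSeq]) -> list[AlignedSeq]:
    """Keep, for each genome, only the row with the most nucleotides.

    Ties keep the row seen first (an assumption, not from the source).
    """
    best: dict[str, AlignedSeq] = {}
    for seq in block:
        current = best.get(seq.genome)
        if current is None or nucleotide_count(seq) > nucleotide_count(current):
            best[seq.genome] = seq
    return list(best.values())


def filter_blocks(blocks, region_length, fraction=1 / 1000):
    """Drop blocks shorter than fraction * region_length, then deduplicate.

    Block length is taken here as the number of alignment columns
    (an assumption for this sketch).
    """
    min_length = region_length * fraction
    kept = []
    for block in blocks:
        block_length = len(block[0].text) if block else 0
        if block_length >= min_length:
            kept.append(dedupe_per_genome(block))
    return kept
```

With a 500-kilobase region and the default fraction, this sketch would discard blocks shorter than 500 columns.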
Cactus alignment via database
Alignments of type CACTUS_DB are preloaded from a HAL file into a MySQL database following an approach similar to that used by cactus-hal2maf [3] (version 2.9.7).
- Dump a MAF alignment file for a given reference genome (e.g. Drosophila melanogaster) and sequence region (typically 500 kilobases in length) using hal2maf [4] (version 2.2) with command-line options: --noAncestors --unique
- Filter out aligned sequences with fewer than 5 nucleotides, and filter out alignment blocks with fewer than 20 alignment columns.
- Normalise the alignment to merge smaller alignment blocks using taffy (commit 5221c50) with command-line options: --filterGapCausingDupes --maximumBlockLengthToMerge 8000 --maximumGapLength 1200
- Deduplicate alignments per genome within each MAF block using the mafDuplicateFilter command of mafTools [5] (commit 259e5b4 of ComparativeGenomicsToolkit version) with command-line option: --keep-first
- Load MAF alignment blocks into the output MySQL database.
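The length filters in the second step above (at least 5 nucleotides per aligned sequence, at least 20 columns per block) can be sketched in Python as follows. This is an illustration only, not the Ensembl pipeline code. The function name, the tuple representation of a block, and the order in which the two filters are applied are assumptions.

```python
MIN_NUCLEOTIDES = 5  # from the text: drop aligned sequences below this
MIN_COLUMNS = 20     # from the text: drop blocks with fewer columns


def filter_block(rows):
    """Apply both length filters to one alignment block.

    rows: list of (genome, aligned_text) tuples, with '-' as the gap symbol.
    Returns the surviving rows, or None if the whole block is dropped.
    Filtering rows before checking the column count is an assumption.
    """
    kept = [
        (genome, text)
        for genome, text in rows
        if len(text.replace("-", "")) >= MIN_NUCLEOTIDES
    ]
    if not kept:
        return None
    # All rows in a block share the same number of alignment columns.
    if len(kept[0][1]) < MIN_COLUMNS:
        return None
    return kept
```

In this sketch, a 20-column block keeps only the rows that contribute at least 5 nucleotides, and a block with fewer than 20 columns is discarded whatever its rows contain.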
CACTUS_DB alignments are also filtered by the Compara Perl API at access time, with the minimum block length set to one hundredth the size of the accessed region.
References
1. Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. "Enredo and Pecan: Genome-wide mammalian consistency-based multiple alignment with paralogs." Genome Res. 2008 Nov;18(11):1814-28.
2. Slater GS, Birney E. "Automated generation of heuristics for biological sequence comparison." BMC Bioinformatics. 2005 Feb;6:31.
3. Armstrong J, Hickey G, Diekhans M, et al. "Progressive Cactus is a multiple-genome aligner for the thousand-genome era." Nature. 2020 Nov;587(7833):246-251.
4. Hickey G, Paten B, Earl D, Zerbino D, Haussler D. "HAL: a hierarchical format for storing and analyzing multiple genome alignments." Bioinformatics. 2013 May;29(10):1341-1342.
5. Earl D, Nguyen N, Hickey G, et al. "Alignathon: a competitive assessment of whole-genome alignment methods." Genome Res. 2014 Dec;24(12):2077-2089.