Ensembl Metadata Perl API
Whilst all Ensembl Genomes databases can be accessed using the standard Ensembl API, Ensembl Bacteria loads up to 250 genomes into a single database, which presents some barriers to easy access. To overcome this, a metadata API is provided to make accessing the data easier.
The Bio::EnsEMBL::LookUp object provides an interface for loading the EnsEMBL Registry with the large numbers of genomes from multiple databases, and allows individual DBAdaptor objects to be retrieved for genomes that match criteria including ENA accession (or other seq_region name), Genome Assembly accession, species name and taxonomy ID.
Once Bio::EnsEMBL::DBAdaptor objects have been returned, they can be used as for any Ensembl species. Alternatively, the now-loaded Bio::EnsEMBL::Registry object can be accessed directly.
Installing the API
First, the standard Ensembl API and its dependencies should be installed. This should be the same version of Ensembl that is used by Ensembl Bacteria.
Secondly, the ensembl-metadata API should be installed from the GitHub repository. You will need to add the modules directory from this package to your PERL5LIB. You may also need to install the CPAN package JSON. Please consult your local systems administrator if you are unsure about how to do this.
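Before going further, it can be useful to confirm that Perl can actually find the modules mentioned above. The following is a minimal sketch rather than part of the API: check_modules() is a hypothetical helper name, and the list of modules checked is an assumption based on the packages named in this section.

```perl
use strict;
use warnings;

# Try to load each named module and record whether it was found.
# check_modules() is a hypothetical helper, not part of any Ensembl package.
sub check_modules {
    my @modules = @_;
    my %found;
    for my $module (@modules) {
        # string eval so that the module name is resolved at run time
        $found{$module} = eval("require $module; 1") ? 1 : 0;
    }
    return \%found;
}

# Modules assumed to be needed, based on the installation steps above
my $found = check_modules(qw(Bio::EnsEMBL::Registry Bio::EnsEMBL::LookUp JSON));
for my $module (sort keys %{$found}) {
    print $module, ': ', ($found->{$module} ? 'found' : 'MISSING'), "\n";
}
```

Any module reported as MISSING usually points to a directory that has not yet been added to PERL5LIB, or to a CPAN package that still needs installing.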
Basic use
The default mode for using the API is to use a specialised lookup database on the public MySQL server.
Building a helper
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;

    my $lookup = Bio::EnsEMBL::LookUp->new();
Once instantiated, the helper can be queried to retrieve Ensembl DBAdaptors in various ways:
Getting DBAdaptors by name
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;

    my $lookup = Bio::EnsEMBL::LookUp->new();
    my $dba = $lookup->get_by_name_exact('escherichia_coli_str_k_12_substr_mg1655');
    my @dbas = @{$lookup->get_all_by_name_pattern('escherichia_coli_.*')};
Getting DBAs by taxonomy
To get all genomes for an organism identified by a given taxonomic node, supply the NCBI taxonomy ID to get_all_by_taxon_id(). For instance, for Streptococcus sanguinis (strain SK36):
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;

    my $lookup = Bio::EnsEMBL::LookUp->new();
    my @dbas = @{$lookup->get_all_by_taxon_id(388919)};
To get all genomes belonging to a branch of the taxonomy, supply the NCBI taxonomy ID of the root node for that branch to get_all_by_taxon_branch(). For instance, to find all genomes from the genus Escherichia:
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;

    my $lookup = Bio::EnsEMBL::LookUp->new();
    my @dbas = @{$lookup->get_all_by_taxon_branch(561)};
Getting DBAs by genomic INSDC accession
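    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;

    my $lookup = Bio::EnsEMBL::LookUp->new();
    # like the other get_all_* methods, this returns an array reference,
    # so dereference it before taking the first DBAdaptor
    my ($dba) = @{$lookup->get_all_by_accession("U00096")};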
Getting DBAs by Genome Assembly accession
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;

    my $lookup = Bio::EnsEMBL::LookUp->new();
    my $dba = $lookup->get_by_assembly_accession("GCA_000005845.1");
Once obtained, DBAdaptor objects can be used as normal for an Ensembl species, e.g.:
    my $genes = $dba->get_GeneAdaptor()->fetch_all();
    print "Found " . scalar @$genes . " genes for " . $dba->species() . "\n";
Important: once you have finished with a DBAdaptor object for the time being, disconnect it with the following method to avoid running out of connections on the MySQL server being used:
    $dba->dbc()->disconnect_if_idle();
Disconnected DBAdaptor objects can be used again without manually reconnecting.
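When working through many genomes in a loop, it is easy to forget the disconnect step. One way to make it routine is to wrap the loop in a small helper. The following is a minimal sketch only: for_each_dba() is a hypothetical name and not part of the ensembl-metadata API. It assumes nothing beyond the dbc() and disconnect_if_idle() calls shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the API): run $callback on each DBAdaptor
# in turn and release its MySQL connection afterwards, even if the callback
# dies. Returns an array reference holding one result per DBAdaptor.
sub for_each_dba {
    my ($dbas, $callback) = @_;
    my @results;
    for my $dba (@{$dbas}) {
        my $result = eval { $callback->($dba) };
        my $error  = $@;
        # release the connection before moving on to the next genome
        $dba->dbc()->disconnect_if_idle();
        # re-throw any error only after the connection has been released
        die $error if $error;
        push @results, $result;
    }
    return \@results;
}

# Example use, with $lookup built as in the earlier examples:
# my $counts = for_each_dba(
#     $lookup->get_all_by_taxon_branch(561),
#     sub { my ($dba) = @_; return scalar @{$dba->get_GeneAdaptor()->fetch_all()}; }
# );
```

Because disconnected DBAdaptor objects reconnect on demand, the same objects can be passed to the helper again later without any extra handling.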
Advanced use
The registry helper can be instantiated and used in a variety of ways. For instantiation from a local database set, the following code can be used (substitute your own details):
    register_dbs(
        "mysql.mydomain.com", 3306, "myuser", "mypass",
        "bacteria_[0-9]+_collection_core_17_70_1"
    );
    my $lookup = Bio::EnsEMBL::LookUp::LocalLookUp->new(-CLEAR_CACHE => 1);
    # use as required
The ensembl-metadata API can be used in conjunction with the Compara Perl API to access gene family data for Ensembl Bacteria. To find which families a gene belongs to:
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;
    use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

    print "Building helper\n";
    my $helper = Bio::EnsEMBL::LookUp->new();

    my $nom = 'escherichia_coli_str_k_12_substr_mg1655';
    print "Getting DBA for $nom\n";
    # get_by_name_exact() returns a single DBAdaptor
    my $dba = $helper->get_by_name_exact($nom);

    my $gene = $dba->get_GeneAdaptor()->fetch_by_stable_id('b0344');
    print "Found gene " . $gene->external_name() . "\n";

    # load compara adaptor
    my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(
        -HOST   => 'mysql-eg-publicsql.ebi.ac.uk',
        -USER   => 'anonymous',
        -PORT   => '4157',
        -DBNAME => 'ensembl_compara_bacteria_24_77'
    );

    # find the corresponding member
    my $member = $compara_dba->get_GeneMemberAdaptor()
        ->fetch_by_source_stable_id('ENSEMBLGENE', $gene->stable_id());

    # find families involving this member
    for my $family (@{$compara_dba->get_FamilyAdaptor()->fetch_all_by_Member($member)}) {
        print "Family " . $family->stable_id() . "\n";
    }
To retrieve the genes belonging to a given family:
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;
    use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

    print "Building helper\n";
    my $helper = Bio::EnsEMBL::LookUp->new();

    # load compara adaptor
    my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(
        -HOST   => 'mysql-eg-publicsql.ebi.ac.uk',
        -USER   => 'anonymous',
        -PORT   => '4157',
        -DBNAME => 'ensembl_compara_bacteria_24_77'
    );

    # fetch the family
    my $family = $compara_dba->get_FamilyAdaptor()->fetch_by_stable_id('MF_00395');
    print "Family " . $family->stable_id() . "\n";

    for my $member (@{$family->get_all_Members()}) {
        my $genome_db = $member->genome_db();
        print $genome_db->name() . "\n";
        # get_by_name_exact() returns a single DBAdaptor
        my $member_dba = $helper->get_by_name_exact($genome_db->name());
        if (defined $member_dba) {
            my $gene = $member_dba->get_GeneAdaptor()
                ->fetch_by_stable_id($member->gene_member()->stable_id());
            print $member_dba->species() . " " . $gene->external_name . "\n";
            $member_dba->dbc()->disconnect_if_idle();
        }
    }
To retrieve the genes belonging to a given family (in this case the HAMAP family for the cytochrome b6-f complex subunit 8), filtering to a specific branch of the taxonomy (in this case from the species Prochlorococcus marinus):
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;
    use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

    print "Building helper\n";
    my $helper = Bio::EnsEMBL::LookUp->new();

    # find all genomes that are descendants of a specified node to use as a filter
    my $taxid = 1219;    # Prochlorococcus marinus
    print "Finding genomes for " . $taxid . "\n";
    my %target_species = map { $_->species() => $_ } @{$helper->get_all_by_taxon_branch($taxid)};

    # load compara adaptor
    my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(
        -HOST   => 'mysql-eg-publicsql.ebi.ac.uk',
        -USER   => 'anonymous',
        -PORT   => '4157',
        -DBNAME => 'ensembl_compara_bacteria_24_77'
    );

    # fetch the family
    my $family = $compara_dba->get_FamilyAdaptor()->fetch_by_stable_id('MF_00395');
    print "Family " . $family->stable_id() . "\n";

    for my $member (@{$family->get_all_Members()}) {
        my $genome_db = $member->genome_db();
        # filter by taxon from the calculated list
        my $member_dba = $target_species{$genome_db->name()};
        if (defined $member_dba) {
            my $gene = $member_dba->get_GeneAdaptor()
                ->fetch_by_stable_id($member->gene_member()->stable_id());
            print $member_dba->species() . " " . $gene->external_name . "\n";
            $member_dba->dbc()->disconnect_if_idle();
        }
    }
To retrieve the canonical peptides from genes belonging to a given family:
    use strict;
    use warnings;
    use Bio::EnsEMBL::LookUp;
    use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;
    use Bio::SeqIO;

    print "Building helper\n";
    my $helper = Bio::EnsEMBL::LookUp->new();

    # load compara adaptor
    my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(
        -HOST   => 'mysql-eg-publicsql.ebi.ac.uk',
        -USER   => 'anonymous',
        -PORT   => '4157',
        -DBNAME => 'ensembl_compara_bacteria_24_77'
    );

    # fetch the family
    my $family = $compara_dba->get_FamilyAdaptor()->fetch_by_stable_id('MF_00395');

    # create a file to write to (the leading ">" opens it for writing)
    my $outfile = ">" . $family->stable_id . ".fa";
    my $seq_out = Bio::SeqIO->new(-file => $outfile, -format => "fasta");
    print "Writing family " . $family->stable_id() . " to $outfile\n";

    # loop over members
    for my $member (@{$family->get_all_Members()}) {
        my $genome_db = $member->genome_db();
        # get_by_name_exact() returns a single DBAdaptor
        my $member_dba = $helper->get_by_name_exact($genome_db->name());
        if (defined $member_dba) {
            my $gene = $member_dba->get_GeneAdaptor()
                ->fetch_by_stable_id($member->gene_member()->stable_id());
            print "Writing sequence for " . $member->stable_id() . "\n";
            # translate the canonical transcript to get the peptide sequence
            my $s = $gene->canonical_transcript()->translate();
            $seq_out->write_seq($s);
            $member_dba->dbc()->disconnect_if_idle();
        }
    }