General instructions for the Perl API
Introduction
This tutorial demonstrates general API concepts, applicable across all parts of the Ensembl API.
The Perl API provides a level of abstraction over the Ensembl databases and is used by the Ensembl web interface, pipeline, and internal annotation systems. To external users the API may be useful to automate the extraction of particular data, to customise Ensembl to fulfil a particular purpose, or to store additional data in Ensembl.
The Perl API is only one of many ways of accessing the data stored in Ensembl. Additionally there is a genome browser web interface, and the BioMart system. BioMart may be a more appropriate tool for certain types of data mining.
API Documentation
The Ensembl Perl APIs have easy-to-use web-browsable documentation, which provides access to all modules, listing the possible Objects and Adaptors, the functions that can be called on them and the type of output. There is also standard Perl POD (Plain Old Documentation) mixed in with the actual code, which can be automatically extracted and formatted using tools such as perldoc.
The first step for working with the Perl APIs is to install the APIs and ensure your PERL5LIB environment variable is set up correctly (see the Perl API Installation instructions).
For additional information you can contact ensembl-dev, the Ensembl development mailing list.
Connecting to the Database: The Registry
All data used and created by Ensembl is stored in MySQL relational databases. To access a database, the first thing you have to do is connect to it. Ensembl does this behind the scenes using the standard Perl DBI module. If your computer is behind a firewall, you need to allow outgoing connections to the corresponding ports. You will need to know two things before you start:
- host: the name of the host where the Ensembl database lives
- user: the user name used to access the database
First, we need to import all Perl modules that we will be using. Since we need a connection to an Ensembl database through the Registry we first have to import the Registry module which we use to establish this connection. Almost every Ensembl script that you will write will contain a use statement like the following:
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',    # alternatively 'useastdb.ensembl.org'
        -user => 'anonymous'
    );
We've made a connection to an Ensembl Registry and passed parameters using the -attribute => 'somevalue' syntax present in many of the Ensembl object constructors. Formatted correctly, this syntax lets you see exactly what arguments and values you are passing.
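The -attribute => 'somevalue' style is ordinary Perl: the argument list is a flat list of key/value pairs whose keys happen to start with a dash. The sketch below illustrates the idea in plain Perl and is not how the Ensembl API parses its arguments. The function name connection_summary and its default port are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): accepts named
# arguments in the -key => value style and fills in defaults.
sub connection_summary {
    my %args = (
        -user => 'anonymous',    # assumed default, for illustration only
        -port => 3306,           # assumed default, for illustration only
        @_,                      # caller-supplied pairs override the defaults
    );
    die "-host is required\n" unless defined $args{-host};
    return sprintf( '%s@%s:%d', $args{-user}, $args{-host}, $args{-port} );
}

# Each argument is labelled, so the order of the pairs does not matter.
print connection_summary(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
), "\n";
```

Because every value is labelled at the call site, optional arguments can simply be left out, which is why the parameters can be given in any order in the registry calls.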
Connecting to the database for non-vertebrate genomes
If you're working with non-vertebrate genomes, you need to use a different host and port in your registry:
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'mysql-eg-publicsql.ebi.ac.uk',
        -port => 4157,
        -user => 'anonymous'
    );
To use both Ensembl and Ensembl Genomes data in parallel, multiple servers can be specified, for example:
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_multiple_dbs(
        {
            -host => 'mysql-eg-publicsql.ebi.ac.uk',
            -port => 4157,
            -user => 'anonymous'
        },
        {
            -host => 'ensembldb.ensembl.org',
            -port => 5306,
            -user => 'anonymous'
        }
    );
Connecting to the database for GRCh37
The dedicated GRCh37 human database is served on port 3337. To work with it, load the registry with:
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',    # alternatively 'useastdb.ensembl.org'
        -user => 'anonymous',
        -port => 3337
    );
In addition to the parameters provided above the optional port and pass parameters can be used to specify the TCP port to connect via and the password to use respectively. These values have sensible defaults and can often be omitted.
Using the registry to get information about the database
The registry may be used to, for example, get a list of all Ensembl databases installed on a given database host:
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',    # alternatively 'useastdb.ensembl.org'
        -user => 'anonymous'
    );

    my @db_adaptors = @{ $registry->get_all_DBAdaptors() };

    foreach my $db_adaptor (@db_adaptors) {
        my $db_connection = $db_adaptor->dbc();

        printf(
            "species/group\t%s/%s\ndatabase\t%s\nhost:port\t%s:%s\n\n",
            $db_adaptor->species(),   $db_adaptor->group(),
            $db_connection->dbname(), $db_connection->host(),
            $db_connection->port()
        );
    }
Object Adaptors
Before we launch into the ways the API can be used to retrieve and process data from the Ensembl databases it is best to mention the fundamental relationships the Ensembl objects have with the database.
The Ensembl Perl API works through a system of objects and adaptors. Objects represent biological entities in the database, such as Gene, Exon and Slice (a genomic region). The adaptors are used to retrieve these, using internal knowledge of the underlying database schema. This way you can write code and use the Ensembl Core API without having to know anything about the underlying databases you are using.
Object adaptors are obtained from the Registry via a method named get_adaptor(). It takes the species, the API (database group) and the type of adaptor you're looking for as arguments, like:
    my $something_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Something' );
For example, to obtain a Slice adaptor or a Variation adaptor (which retrieve Slice and Variation objects respectively) for Human, do the following after having loaded the Registry, here called $registry, as above:
    my $slice_adaptor     = $registry->get_adaptor( 'Human', 'Core',      'Slice' );
    my $variation_adaptor = $registry->get_adaptor( 'Human', 'Variation', 'Variation' );
You can find the exact phrases needed to specify that adaptor in the documentation page for that adaptor. For example, the documentation page for the GeneAdaptor gives the code to get a gene adaptor.
Don't worry if you don't immediately see how useful this could be. Just remember that you don't need to know anything about how the database is structured, but you can retrieve the necessary data (neatly packaged in objects) by asking for it from the correct adaptor. Throughout the rest of this document we are going to work through the ways the Ensembl objects can be used to derive the information you want.
You can use the method get_available_adaptors() on a DBAdaptor to list the adaptors available for a given species and database type:
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );

    my $species = 'human';
    my $group   = 'variation';    # database type

    my $dba = $registry->get_DBAdaptor( $species, $group );
    my $available_adaptors = $dba->get_available_adaptors;

    # Display the list of adaptors available in the Ensembl Variation API
    foreach my $adaptor ( sort( keys(%$available_adaptors) ) ) {
        print $available_adaptors->{$adaptor} . "\n";
    }
Most Objects in the API are hashes, consisting of a number of key-value pairs. You can then call a number of methods on the Objects which fetch the values. Some of these will return other Objects, some will return arrays (lists) of other Objects and some will return strings. The documentation for each Object type (for example, the Gene documentation) lists the methods available, including how to call the method, any optional parameters and what type of thing will be returned.
To see what an Object looks like, you can use the Perl Data::Dumper module (which is standard for Perl installations) to print the hash structure. For example:
    use Bio::EnsEMBL::Registry;
    use Data::Dumper;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );

    my $gene_adaptor = $registry->get_adaptor( 'human', 'core', 'gene' );
    my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000139618');

    # Set the Dumper settings to show only one level and use indents on the hash
    $Data::Dumper::Maxdepth = 1;
    $Data::Dumper::Indent   = 3;

    # print the hash using Dumper
    warn Dumper($gene);
If you run this code, you will see that some values are strings, for example:
    'biotype' => 'protein_coding',
Other values are references to other Ensembl Object hashes:
    'slice' => 'Bio::EnsEMBL::Slice=HASH(0x7f94c40f4218)',
This tells you that the Object is a Slice and that it is a hash.
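To make the idea of an Object as a hash with methods concrete without needing a database connection, here is a small self-contained sketch. The Toy::Gene and Toy::Slice classes and their fields are hypothetical stand-ins, not Ensembl classes. They only mimic the general pattern of a blessed hash whose values are read through methods.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical toy classes, for illustration only.
package Toy::Slice;

sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub seq_region_name { my ($self) = @_; return $self->{seq_region_name}; }

package Toy::Gene;

sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub biotype { my ($self) = @_; return $self->{biotype}; }    # returns a string
sub slice   { my ($self) = @_; return $self->{slice}; }      # returns another object

package main;

my $gene = Toy::Gene->new(
    biotype => 'protein_coding',
    slice   => Toy::Slice->new( seq_region_name => '13' ),
);

# Methods, rather than direct hash access, are used to read the values.
print $gene->biotype(), "\n";
print $gene->slice()->seq_region_name(), "\n";

# With Maxdepth set to 1, nested objects are not expanded; Dumper shows
# only their class name and reference address.
$Data::Dumper::Maxdepth = 1;
print Dumper($gene);
```

Reading values through methods rather than through the hash keys keeps your code independent of how the object happens to store its data internally.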
Using the LookUp module to get Adaptors
For some non-vertebrates, it can be easier to get Adaptors using the LookUp module. This is part of ensembl-metadata, which will need to be installed alongside the rest of the API.
The LookUp module allows you to find species more easily, without necessarily knowing the alias used by the API. For example, you can search for species that have an alias matching a regular expression (such as part of a species name), species that are derived from a specific ENA/INSDC accession, or species that belong to a particular part of the taxonomy, using NCBI taxon IDs.
First you need to import and invoke the LookUp module:
    use Bio::EnsEMBL::LookUp;

    my $lookup = Bio::EnsEMBL::LookUp->new();
You can then use it to get all the database adaptors matching different criteria. The calls below are alternatives, so use only one of them:
    # Get Adaptors for all species which match a regular expression in the species name
    my @dbas = @{ $lookup->get_by_name_pattern("Escherichia.*") };

    # Get Adaptors for all species that match a node of NCBI taxonomy
    my @dbas = @{ $lookup->get_all_by_taxon_id(388919) };

    # Get Adaptors for all species that are descendants of an NCBI taxonomy node
    my @dbas = @{ $lookup->get_all_by_taxon_branch(511145) };
Each of these queries will give you an array of DBAdaptors, one for each species retrieved, so you will need to iterate over these.
You can then get the object adaptors you need by invoking methods on each DBAdaptor. Since the species has already been selected, you only need to specify which Adaptor you want. For example, to get the GeneAdaptor for the first DBAdaptor in the array:
    my $dba          = $dbas[0];
    my $gene_adaptor = $dba->get_GeneAdaptor;
You can now use the GeneAdaptor just as you would for any other query.
Code Conventions
Several naming conventions are used throughout the API. Learning these conventions will aid in your understanding of the code.
- Variable names are underscore-separated all lower-case words. As always with Perl, the punctuation mark (sigil) indicates the type of the variable: $scalar (or variable), @array (or list), %hash (or dictionary, key/value pairs).
  $slice, @exons, %exon_hash, $gene_adaptor
- Class and package names are mixed-case words that begin with capital letters.
  Bio::EnsEMBL::Gene, Bio::EnsEMBL::Exon, Bio::EnsEMBL::Slice, Bio::EnsEMBL::DBSQL::GeneAdaptor
- Method names are entirely lower-case, underscore separated words. Class names in the method are an exception to this convention; these words begin with an upper-case letter and are not underscore separated. The word dbID is another exception which denotes the unique database identifier of an object. No method names begin with a capital letter, even if they refer to a class.
  fetch_all_by_Slice(), get_all_Genes(), translation(), fetch_by_dbID()
- Method names that begin with an underscore '_' are intended to be private and should not be called externally from the class in which they are defined.
- Object adaptors are responsible for the creation of various objects. The adaptor is named after the object it creates, and the methods responsible for the retrieval of these objects all start with the word fetch. All of the fetch methods return only objects of the type that the adaptor creates. Therefore the object name is not required in the method name. For example, all fetch methods in the Gene adaptor return Gene objects. Non-adaptor methods generally avoid the use of the word fetch.
  fetch_all_by_Slice(), fetch_by_dbID(), fetch_by_region()
- Methods which begin with get_all or fetch_all return references to lists. Many methods in Ensembl pass lists by reference, rather than by value, for efficiency. This might take some getting used to, but it results in more efficient code, especially when very large lists are passed around (as they often are in Ensembl).
  get_all_Transcripts(), fetch_all_by_Slice(), get_all_Exons()
The following examples demonstrate some of Perl's list reference syntax. You do not need to understand the API concepts in this example. The important thing to note is the language syntax; the concepts will be described later.
    # get a slice adaptor for the human core database
    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

    # Fetch all clones from a slice adaptor (returns a list reference)
    my $clones_ref = $slice_adaptor->fetch_all('clone');

    # If you want a copy of the contents of the list referenced by
    # the $clones_ref reference...
    my @clones = @{ $clones_ref };

    # Get the first clone from the list via the reference:
    my $first_clone = $clones_ref->[0];

    # Iterate through all of the genes on a clone
    foreach my $gene ( @{ $first_clone->get_all_Genes() } ) {
        print $gene->stable_id(), "\n";
    }

    # More memory efficient way of doing the same thing
    my $genes = $first_clone->get_all_Genes();
    while ( my $gene = shift @{$genes} ) {
        print $gene->stable_id(), "\n";
    }

    # Retrieve a single Slice object (not a list reference)
    my $chromosome = $slice_adaptor->fetch_by_region( 'chromosome', '13' );

    # No dereferencing needed:
    print $chromosome->seq_region_name(), "\n";
A note about lazy loading and memory usage
Some of the data that makes up the objects returned from the Ensembl API is lazy loaded. By using lazy loading, we are able to minimise the number of database queries and only "fill in" the data in the object that the program actually asked for. This makes the code faster and its memory footprint smaller, but it also means that the more data the program requests from an object, the larger that object becomes. As a consequence, looping over a large number of these objects can in some cases grow the memory footprint of the program considerably. Lazy loading can also be counter-productive when an object does not load enough information up front.
By using a while-shift loop rather than a foreach loop, the growth of the memory footprint due to lazy loading of data is more likely to stay small. This is why the comment on the last loop above says that it is a "more memory efficient way", and this is also why we use this convention for most similar loop constructs in the remainder of this API tutorial.
NB: This strategy obviously won't work if the contents of the list being iterated over are needed at some later point after the end of the loop, because the loop empties the list.
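The interaction between lazy loading and the while-shift loop can be sketched without a database. In the example below, Toy::Feature is a hypothetical class, not part of the Ensembl API. Its sequence() method only fills in its data on first access, as a stand-in for a lazily loaded attribute, and a counter records how many simulated fetches take place.

```perl
use strict;
use warnings;

# Hypothetical toy class, for illustration only.
package Toy::Feature;

my $loads = 0;    # counts simulated database fetches

sub new {
    my ( $class, $name ) = @_;
    my $self = { name => $name };
    return bless $self, $class;
}

sub name { my ($self) = @_; return $self->{name}; }

sub sequence {
    my ($self) = @_;
    if ( !exists $self->{sequence} ) {
        $loads++;                             # simulated query
        $self->{sequence} = 'ACGT' x 1000;    # the object grows here
    }
    return $self->{sequence};
}

sub load_count { return $loads; }

package main;

my $features = [ map { Toy::Feature->new("feature_$_") } 1 .. 3 ];

# while-shift: each object is removed from the list as it is processed,
# so nothing keeps it (and its lazily loaded data) alive afterwards.
while ( my $feature = shift @{$features} ) {
    printf "%s: %d bases\n", $feature->name(), length( $feature->sequence() );
}

printf "fetches: %d, objects left in list: %d\n",
    Toy::Feature::load_count(), scalar @{$features};
```

With a foreach loop over the same list, all three objects would still be referenced by the list after the loop, each carrying its loaded sequence. With while-shift the list ends up empty, which is exactly the trade-off the NB describes.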
Coordinates
Ensembl, and many other bioinformatics applications, use inclusive coordinates which start at 1. The first nucleotide of a DNA sequence is 1 and the first amino acid of a protein sequence is also 1. The length of a sequence is defined as end - start + 1.
In some rare cases insertions are specified with a start which is one greater than the end. For example a feature with a start of 10 and an end of 9 would be a zero length feature between base pairs 9 and 10.
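The length rule and the insertion convention can be checked with a few lines of Perl. This is a minimal sketch; feature_length is a hypothetical helper written for this example, not an Ensembl API method.

```perl
use strict;
use warnings;

# Length of a feature in 1-based, inclusive coordinates.
sub feature_length {
    my ( $start, $end ) = @_;
    return $end - $start + 1;
}

print feature_length( 1,  1 ),  "\n";    # a single base
print feature_length( 5,  14 ), "\n";    # bases 5 to 14 inclusive
print feature_length( 10, 9 ),  "\n";    # insertion between bases 9 and 10
```

The last call shows why an insertion is written with a start one greater than its end: the same formula then gives it a length of zero.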
Slice coordinates are relative to the start of the underlying DNA sequence region. The strand of the slice represents its orientation relative to the default orientation of the sequence region. By convention the start of the slice is always less than the end, and does not vary with its strandedness. Most slices you will encounter will have a strand of 1, and this is what we will consider in our examples. It is legal to create a slice which extends past the boundaries of a sequence region. Sequence retrieved from regions where the sequence is not defined will consist of Ns.
All features retrieved from the database have an associated slice (accessible via the slice() method). A feature's coordinates are always relative to this associated slice, i.e. the start and end attributes define a feature's position relative to the start of the slice the feature is on (or the end of the slice if it is a negative strand slice). The strand attribute of a feature is relative to the strand of the slice. By convention the start of a feature is always less than or equal to the end of the feature regardless of its strand (except in the case of an insert). It is legal to have features with coordinates which are less than one or greater than the length of the slice. Such cases are common when features that partially overlap a slice are retrieved from the database.
Consider, for example, the following figure of two features associated with a slice:
  [-----]                                (Feature A)
     |================================|  (Slice)
           [--------]                    (Feature B)
  A  C  T  A  A  A  T  C  T  T  G        (Sequence)
  1  2  3  4  5  6  7  8  9 10 11 12 13
The slice itself has a start of 2, an end of 13, and a length of 12 even though the underlying sequence region only has a length of 11. Retrieving the sequence of such a slice would give the string CTAAATCTTGNN — the undefined region of sequence is represented by Ns. Feature A has a start of 0, an end of 2, and a strand of 1. Feature B has a start of 3, an end of 6, and a strand of -1.
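The numbers in this example can be reproduced with plain string arithmetic. The sketch below does not use the Ensembl API: slice_seq and to_seq_region are hypothetical helpers, and the conversion assumes a forward-strand slice, as in the figure.

```perl
use strict;
use warnings;

# Sequence region from the figure (11 bases, 1-based coordinates).
my $seq_region = 'ACTAAATCTTG';

# Sequence of a forward-strand slice; positions that fall outside the
# sequence region are filled with Ns.
sub slice_seq {
    my ( $region, $start, $end ) = @_;
    my $seq = '';
    for my $pos ( $start .. $end ) {
        if ( $pos >= 1 && $pos <= length($region) ) {
            $seq .= substr( $region, $pos - 1, 1 );
        }
        else {
            $seq .= 'N';
        }
    }
    return $seq;
}

# Convert slice-relative feature coordinates to sequence-region
# coordinates (forward-strand slice only).
sub to_seq_region {
    my ( $slice_start, $feature_start, $feature_end ) = @_;
    return (
        $slice_start + $feature_start - 1,
        $slice_start + $feature_end - 1
    );
}

print slice_seq( $seq_region, 2, 13 ), "\n";          # CTAAATCTTGNN
print join( '-', to_seq_region( 2, 0, 2 ) ), "\n";    # Feature A: 1-3
print join( '-', to_seq_region( 2, 3, 6 ) ), "\n";    # Feature B: 4-7
```

Feature A starts at 0 in slice coordinates because it begins one base before the slice does, at position 1 of the sequence region.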
Further help
For additional information or help mail the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.