DAS Workshop 2010 ProServer Tutorial

Part 3

Andy Jenkinson, EMBL-EBI, 7th April 2010

Overview

Now that you have completed Part 1 and Part 2 of the tutorial, let us imagine that instead of a file, your data are in a relational database -- specifically in our example, the public Ensembl MySQL database. You will now modify your myplugin SourceAdaptor to fetch its data from there.

Transport helpers

ProServer includes various Transport modules. These optional modules make accessing your data easier by reducing the boilerplate code you need to write. There are transport modules for various flat files, SRS and the BioPerl API, for example. Like SourceAdaptor modules, Transports are Perl packages with their own namespace, Bio::Das::ProServer::SourceAdaptor::Transport (e.g. Bio::Das::ProServer::SourceAdaptor::Transport::file). The transport to use is configured in each data source's INI section:

[mysource]
state        = on
adaptor      = myplugin
transport    = file

Using the above INI configuration, ProServer will create an object of the Bio::Das::ProServer::SourceAdaptor::Transport::file package at runtime, and make it accessible to a SourceAdaptor object via:

my $transport = $self->transport();

The dbi transport

Of particular interest to us in this tutorial is the dbi transport. An object of this package uses the Perl DBI framework to abstract the creation of database connections, the execution of statements and the return of result sets. This functionality is exposed to an adaptor object via the query method. The details of the database to connect to (hostname, username, etc.) are confined to the source's INI configuration, leaving the plugin only to specify the SQL query to execute and to iterate over the results.

Connection details may be specified as follows:

[mysource]
state        = on
adaptor      = myplugin
transport    = dbi
dbhost       = host.company.com
dbport       = 3306
dbname       = mydata
dbuser       = read_only_user
dbpass       = secret

You can now execute SQL statements against the mydata database and process the results like this:

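# 'mytable' and its columns are placeholders for your own schema.
my $sql = 'select col2, col3, col4 from mytable where col1 = ? and col2 >= ? and col3 <= ?';
# The values after the SQL are bound to the ? placeholders, in order.
my $rows_arrayref = $self->transport()->query($sql, $arg1, $arg2, $arg3);
# Each row is a hashref keyed by column name.
for my $row ( @{ $rows_arrayref } ) {
  my $col2 = $row->{'col2'};
  my $col3 = $row->{'col3'};
  my $col4 = $row->{'col4'};
}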

Modify your DAS source

You will now use the dbi transport to help you in your task. Modify your mysource.ini INI file and myplugin.pm SourceAdaptor to connect to the Ensembl database and extract the same data as in your file, and build annotation hashrefs from the rows returned.
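
As a starting point, the INI section for the public Ensembl server might look like the sketch below. The host, port and anonymous read-only account are those of the public Ensembl MySQL server; the database name is an assumption (a human core database from around the time of this workshop), so substitute whichever core database you are working with. The anonymous account needs no password, so dbpass is left out here.

```ini
[mysource]
state        = on
adaptor      = myplugin
transport    = dbi
dbhost       = ensembldb.ensembl.org
dbport       = 5306
; Assumed database name: replace with the core database you want to query.
dbname       = homo_sapiens_core_57_37b
dbuser       = anonymous
```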

This is the SQL query you will need to run to extract the same information that is in exons.txt:

select sr.name AS chromosome,
       gsi.stable_id AS g_id, g.seq_region_start AS g_start, g.seq_region_end AS g_end,
       tsi.stable_id AS t_id, t.seq_region_start AS t_start, t.seq_region_end AS t_end,
       esi.stable_id AS e_id, e.seq_region_start AS e_start, e.seq_region_end AS e_end
from   seq_region sr,
       gene_stable_id gsi, gene g,
       transcript t, transcript_stable_id tsi,
       exon_transcript et, exon e, exon_stable_id esi
where  gsi.gene_id = g.gene_id
and    g.gene_id = t.gene_id
and    t.transcript_id = tsi.transcript_id
and    t.transcript_id = et.transcript_id
and    et.exon_id = e.exon_id
and    e.exon_id = esi.exon_id
and    g.seq_region_id = sr.seq_region_id
and    sr.coord_system_id = 2
limit  1000
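
The step of building annotation hashrefs from the returned rows could look something like the following sketch. It is a standalone helper, not ProServer code: the name rows_to_features is hypothetical, the feature keys (id, label, start, end, type, method, note) are assumed to match those you used in Parts 1 and 2, and the sample rows are made up so that the example runs without a database.

```perl
use strict;
use warnings;

# Hypothetical helper: turn rows from the exon query into feature hashrefs,
# keeping only exons on the requested segment that overlap the requested range.
sub rows_to_features {
  my ($rows, $segment, $start, $end) = @_;
  my @features;
  for my $row ( @{ $rows } ) {
    next if $row->{'chromosome'} ne $segment;
    # Skip exons lying wholly outside the range, if a range was given.
    next if defined $end   && $row->{'e_start'} > $end;
    next if defined $start && $row->{'e_end'}   < $start;
    push @features, {
      'id'     => $row->{'e_id'},
      'label'  => $row->{'e_id'},
      'start'  => $row->{'e_start'},
      'end'    => $row->{'e_end'},
      'type'   => 'exon',
      'method' => 'Ensembl',
      'note'   => "transcript $row->{'t_id'}, gene $row->{'g_id'}",
    };
  }
  return @features;
}

# Made-up rows, shaped like the arrayref of hashrefs returned by the query method.
my $rows = [
  { chromosome => '5', g_id => 'GENE_A', t_id => 'TRANSCRIPT_A',
    e_id => 'EXON_A', e_start => 145000, e_end => 145200 },
  { chromosome => '5', g_id => 'GENE_A', t_id => 'TRANSCRIPT_A',
    e_id => 'EXON_B', e_start => 200000, e_end => 200100 },
  { chromosome => '6', g_id => 'GENE_B', t_id => 'TRANSCRIPT_B',
    e_id => 'EXON_C', e_start => 145000, e_end => 145200 },
];

my @features = rows_to_features($rows, '5', 144942, 155558);
printf "%s %d-%d\n", @{$_}{qw(id start end)} for @features;
```

Inside myplugin.pm, the same loop would sit in your build_features method, with $rows coming from $self->transport()->query($sql) and the segment, start and end taken from the arguments ProServer passes in.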

Rebuild and run the server

Now rebuild ProServer, and run it with the new configuration:

./Build
eg/proserver -x -c eg/mysource.ini

And see if it works:

http://localhost:9000/das/mysource/features?segment=5:144942,155558

Further Tasks

Although the above SQL query allows you to change the source of the data without changing too much code, it is not particularly efficient: it fetches the same 1000 rows regardless of the region requested. It would be far better to use the input segment, start and end parameters to construct SQL queries dynamically, extracting only the exons, transcripts and genes that overlap the requested region. In a real-world, whole-genome scenario this is the only approach that makes sense.
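
One possible way to construct such a query is sketched below for exons only. The helper name build_exon_query is hypothetical, and the sketch assumes that the exon table carries its own seq_region_id column and that coord_system_id 2 is still the chromosome coordinate system, as in the query above. The segment and range are passed as bind values rather than interpolated into the SQL.

```perl
use strict;
use warnings;

# Hypothetical helper: build SQL and bind values restricted to one segment
# and, when a range is given, to exons overlapping that range.
sub build_exon_query {
  my ($segment, $start, $end) = @_;
  my $sql = q{
    select esi.stable_id AS e_id, e.seq_region_start AS e_start, e.seq_region_end AS e_end
    from   seq_region sr, exon e, exon_stable_id esi
    where  e.exon_id = esi.exon_id
    and    e.seq_region_id = sr.seq_region_id
    and    sr.coord_system_id = 2
    and    sr.name = ?};
  my @bind = ($segment);
  if (defined $start && defined $end) {
    # Two ranges overlap when each one starts before the other ends.
    $sql .= q{
    and    e.seq_region_start <= ?
    and    e.seq_region_end >= ?};
    push @bind, $end, $start;
  }
  return ($sql, @bind);
}

my ($sql, @bind) = build_exon_query('5', 144942, 155558);
print "$sql\n";
print "bind values: @bind\n";
```

The returned SQL and bind values can then be handed to the transport in the same way as before, i.e. $self->transport()->query($sql, @bind). Transcripts and genes can be handled with analogous queries against their own tables.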

You should also provide metadata in the INI file, which ProServer reports in its response to the DAS sources command. In particular:

[mysource]
coordinates = GRCh_37,Chromosome,Homo sapiens -> 5:144942,155558
title       = Ensembl transcripts
doc_href    = http://mycompany.com/moreinfo
description = Some information about the data in the source and where it came from.
maintainer  = me@mycompany.com