Attaching phyloxml annotations

Fabian Schreiber posted to the mailing list that he was having trouble parsing a newick string and then attaching annotations to the resulting objects so that they show up when serialized to phyloxml. How this is done is not immediately obvious, so here is an example.

The general underlying logic is that the phyloxml annotations that have to do with taxonomy (e.g. species codes, scientific names) are attached to OTUs, not nodes, and that these annotations must be scoped with the namespace for phyloxml annotations, which is available as the constant _NS_PHYLOXML_. The implementation then becomes something like this:

use strict;
use warnings;
use Bio::Phylo::Factory;
use Bio::Phylo::IO qw'parse unparse';
use Bio::Phylo::Util::CONSTANT qw':objecttypes :namespaces';

# let's say we want to attach species codes such
# as the ones that archaeopteryx uses for gene tree /
# species tree reconciliation. This means that every
# tip in the tree needs to link to an OTU to which we
# attach the code.
my %codes = (
    Homo_sapiens    => 'HS',
    Pan_troglodytes => 'PT',
    Gorilla_gorilla => 'GG',
);

# we parse the newick as a project so that we end
# up with an object that has both the tree and the
# annotated OTUs
my $proj = parse(
    '-format' => 'newick',
    '-handle' => \*DATA,
    '-as_project' => 1,
);

# here we make the OTUs
my ($forest) = @{ $proj->get_items(_FOREST_) };
my $taxa = $forest->make_taxa;

# it's easier to make a factory object for creating the annotations
my $fac = Bio::Phylo::Factory->new;

# here we annotate the OTUs
$taxa->visit(sub {
    my $taxon = shift;
    my $name = $taxon->get_name;
    my $code = $codes{$name};
    $taxon->add_meta(
        $fac->create_meta(
            '-namespaces' => { 'pxml'  => _NS_PHYLOXML_ },
            '-triple' => { 'pxml:code' => $code },
        )
    );
});

# now write the output
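print unparse( '-format' => 'phyloxml', '-phylo' => $proj );

# example input for the DATA handle; tip names must match the keys of %codes
__DATA__
((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla);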

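One thing the script quietly assumes is that every tip name has an entry in %codes; a tip without one would end up with an undefined code. A minimal sketch of a check you could run first is below. The helper names_without_code and the extra tip Pongo_abelii are made up for illustration and are not part of Bio::Phylo.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Phylo): returns the names
# that have no entry in the code table.
sub names_without_code {
    my ( $codes, @names ) = @_;
    return grep { !exists $codes->{$_} } @names;
}

my %codes = (
    Homo_sapiens    => 'HS',
    Pan_troglodytes => 'PT',
    Gorilla_gorilla => 'GG',
);

# Pongo_abelii is a made-up extra tip, included to show a miss
my @missing = names_without_code( \%codes, qw(Homo_sapiens Pongo_abelii) );
warn "no species code for: @missing\n" if @missing;
```

In the real script the list of names would presumably come from the OTUs themselves, for example via map { $_->get_name } @{ $taxa->get_entities }.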

And the resulting phyloxml looks like this:


Edit: In a follow-up email, Fabian points out he was still having trouble attaching the domain architecture annotations. I actually never got around to implementing those in Bio::Phylo, but there is a dirty and hacky solution (and we are nothing if not dirty, dirty hackers). Assuming the preceding code, we can annotate tree nodes with the domain annotations thusly, using raw XML:

# here we annotate domain architecture
my $tree = $forest->first;
$tree->visit(sub {
    my $node = shift;
    my $arch = _create_dummy_architecture();
    $node->add_meta(
        $fac->create_meta(
            '-namespaces' => { 'pxml' => _NS_PHYLOXML_ },
            '-triple' => { 'pxml:sequence' => $arch },
        )
    );
});

# returns hardcoded, raw XML. Actual architectures
# left as an exercise for the reader.
sub _create_dummy_architecture {
    ...    # return a string of raw phyloxml here
}

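For the "exercise for the reader" part, here is one possible starting point: a small pure-Perl sketch that builds a domain_architecture fragment from plain data. The element and attribute names (domain_architecture with length; domain with from, to and confidence) follow the phyloxml schema, but the helper build_architecture_xml, the domain names and all the numbers are made up for illustration. Whether the fragment should be the direct content of pxml:sequence is an assumption based on the snippet above, so check the serialized output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Phylo): builds a phyloxml
# <domain_architecture> fragment from a sequence length and a list
# of domain descriptions.
sub build_architecture_xml {
    my ( $length, @domains ) = @_;
    my $xml = sprintf '<domain_architecture length="%d">', $length;
    for my $d (@domains) {
        $xml .= sprintf '<domain from="%d" to="%d" confidence="%s">%s</domain>',
            $d->{from}, $d->{to}, $d->{confidence}, _escape( $d->{name} );
    }
    return $xml . '</domain_architecture>';
}

# escape the characters that are special in XML text content
sub _escape {
    my $text = shift;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# made-up example values
my $arch = build_architecture_xml(
    120,
    { name => 'Domain_A', from => 5,  to => 60,  confidence => '1e-10' },
    { name => 'Domain_B', from => 70, to => 115, confidence => '0.002' },
);
print "$arch\n";
```

Building the string from a data structure rather than hardcoding it keeps the per-node architectures in one place, which helps once different nodes need different domains.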
1 comment:

  1. I want to parse a phyloxml file, get the taxonomy code (or the scientific_name in the sequence block), and replace each node name with it. Is there an example that covers this? Thanks so much!