Reading and writing complex character annotations

I have just uploaded Bio::Phylo version 0.48, so it should become available shortly on http://search.cpan.org/dist/Bio-Phylo. Also, the mailing list is now open to anyone: you can sign up without my approval, at least until the spam gets out of hand. Finally, Laszlo Nagy sent me a question privately that is about the topic in this post's title and is interesting enough to discuss here publicly.

The question is as follows:

The bottleneck in my work is that characters preferably have to be written and later accessed one by one, which makes several methods that handle bigger alignment-like chunks of characters infeasible. The other point is that each character has a bunch of information associated with it, which should be retrievable after analysis. I would not really feel comfortable with storing all characters in a hash and writing them all together (partly due to the size of the final matrix). Do you think nexml has significant advantages with regard to these two aspects as compared to nexus?

The answer, of course, is yes. NeXML is designed specifically for cases such as this. (Note: the pretty PDF that describes NeXML is now available for open access.) Here's a simple example script that reads in a NEXUS file just to have some source data, though the data could equally be read from PHYLIP, FASTA, tab-delimited files, etc. The script then iterates over each character in the matrix that was read, attaches some trivial RDFa annotations to it, and prints the result as NeXML to the terminal:

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Phylo::Factory;
use Bio::Phylo::IO 'parse';

# I always use the factory object to create new objects so I don't have
# to add that many 'use' statements at the top or type out the fully
# qualified package names every time I create something new
my $fac = Bio::Phylo::Factory->new;

# parse the nexus data below the __DATA__ token into a Bio::Phylo::Project
my $project = parse(
    '-format' => 'nexus',
    '-handle' => \*DATA,
    '-as_project' => 1,
);

# get the (first and only) matrix object from the project
my ($matrix) = @{ $project->get_matrices };

# all annotations need to be in a namespace, let's use your URL
my $ns = 'http://www.clarku.edu/faculty/dhibbett/people_Laszlo.html#';

# each annotation should have a prefix that's bound to the namespace
my $prefix = 'ln'; # Laszlo Nagy

# let's add some annotations to each character
for my $i ( 0 .. $matrix->get_nchar - 1 ) {
    my $char = $matrix->get_characters->get_by_index($i);
    
    # this attaches an xsd:string
    $char->add_meta(
        $fac->create_meta(
            '-namespaces' => { $prefix => $ns },
            '-triple'     => { "${prefix}:trivialString" => "Char$i" },
        ),      
    );
    
    # this attaches an xsd:integer
    $char->add_meta(    
        $fac->create_meta(
            '-namespaces' => { $prefix => $ns },
            '-triple'     => { "${prefix}:trivialInt" => $i },
        ),
    );
    
    # this attaches an xsd:float
    $char->add_meta(    
        $fac->create_meta(
            '-namespaces' => { $prefix => $ns },
            '-triple'     => { "${prefix}:trivialFloat" => $i + 0.1 },
        ),
    );  
}

# write output
print $project->to_xml;

__DATA__
#NEXUS
BEGIN TAXA;
    TITLE Taxa;
    DIMENSIONS NTAX=5;
    TAXLABELS
        taxon_1 taxon_2 taxon_3 taxon_4 taxon_5 
    ;
END;
BEGIN CHARACTERS;
    DIMENSIONS  NCHAR=5;
    FORMAT DATATYPE = STANDARD GAP = - MISSING = ? SYMBOLS = "  0 1";
    MATRIX
        taxon_1  11101
        taxon_2  11101
        taxon_3  01100
        taxon_4  00010
        taxon_5  00010
    ;
END;

The result should be something like this, though I ran it through an indenter to make it slightly more readable:

[NeXML output not preserved]

In that listing, the annotations for the first character are on lines 40, 41 and 42, and so on for the remaining characters. The script below reads in the NeXML we've just produced and prints out the values of the annotations:


#!/usr/bin/perl
use strict;
use warnings;
use Bio::Phylo::IO 'parse';

# let's assume we provide 'output.xml' on the command line
my $infile = shift @ARGV or die "Usage: $0 output.xml\n";

# parse the xml into a project object
my $project = parse(
    '-format'     => 'nexml',
    '-file'       => $infile,
    '-as_project' => 1,
);

# fetch the (one and only) matrix from the project
my ($matrix) = @{ $project->get_matrices };

# now fetch the annotations again
print "char\tstring\tint\tfloat\n";
for my $i ( 0 .. $matrix->get_nchar - 1 ) {
    
    # this gets the character object, i.e. 
    # something to represent a matrix column
    my $char = $matrix->get_characters->get_by_index($i);
    
    # get_meta_object assumes there is exactly one annotation
    # for the provided predicate. otherwise use get_meta and
    # iterate over the results
    my $string = $char->get_meta_object('ln:trivialString');
    my $int    = $char->get_meta_object('ln:trivialInt');
    my $float  = $char->get_meta_object('ln:trivialFloat');
    
    # print output
    print $i, "\t", $string, "\t", $int, "\t", $float, "\n";
}

Which should print out the following:

char    string  int     float
0       Char0   0       0.1
1       Char1   1       1.1
2       Char2   2       2.1
3       Char3   3       3.1
4       Char4   4       4.1
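
The comments in the reading script note that get_meta_object expects exactly one annotation per predicate, and that get_meta should be used otherwise. Here is a minimal sketch of that second route, under a few assumptions: the predicate 'ln:note' is made up for illustration, the small NEXUS matrix is passed with the '-string' argument rather than a file handle, and the returned annotation objects are assumed to offer get_predicate and get_object, as the Bio::Phylo documentation describes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Phylo::Factory;
use Bio::Phylo::IO 'parse';

my $fac = Bio::Phylo::Factory->new;

# a tiny matrix, just to have a character object to annotate
my $nexus = <<'NEXUS';
#NEXUS
BEGIN TAXA;
    DIMENSIONS NTAX=2;
    TAXLABELS taxon_1 taxon_2;
END;
BEGIN CHARACTERS;
    DIMENSIONS NCHAR=2;
    FORMAT DATATYPE = STANDARD GAP = - MISSING = ? SYMBOLS = "0 1";
    MATRIX
        taxon_1  10
        taxon_2  01
    ;
END;
NEXUS

my $project = parse(
    '-format'     => 'nexus',
    '-string'     => $nexus,
    '-as_project' => 1,
);
my ($matrix) = @{ $project->get_matrices };
my $char = $matrix->get_characters->get_by_index(0);

# attach two annotations that share one (hypothetical) predicate
my %ns = ( 'ln' => 'http://www.clarku.edu/faculty/dhibbett/people_Laszlo.html#' );
for my $note ( 'first note', 'second note' ) {
    $char->add_meta(
        $fac->create_meta(
            '-namespaces' => \%ns,
            '-triple'     => { 'ln:note' => $note },
        ),
    );
}

# get_meta returns an array reference of annotation objects; passing
# a predicate restricts the result to annotations with that predicate
my @notes;
for my $meta ( @{ $char->get_meta('ln:note') } ) {
    push @notes, $meta->get_object;
    print $meta->get_predicate, "\t", $meta->get_object, "\n";
}
```

Collecting the values in @notes as well as printing them makes it easy to post-process a character's annotations after the loop, for instance to write them to a tab-delimited file.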
