Compact, simple tree representation

The NeXML file format has a rather verbose way of representing trees. For example, this tree:

Can be represented as a Newick string like so:

(((A:1.00,B:2.00)n1:3.00,(C:4.00,D:5.00)n2:6.00)n3:7.00,E:8.00)n4:0.00;

But in NeXML it becomes this:


    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    

For small trees - or small sets of trees - the advantages of NeXML outweigh this verbosity, but for big trees or large sets (e.g. from a Markov chain) this becomes a problem.

People interested in XML representations have long discussed ways to represent trees concisely, but if such a representation is to follow XML idioms the solution is not obvious.

For example, simply embedding a Newick string inside an XML document introduces an additional syntax that can not be validated with XML schema or transformed with XSL transforms.

Also, the Newick format has its own problems in that nodes and edges cannot easily be annotated other than by embedding them inside "hot comments".

In previous NeXML design discussions, Mark Holder suggested the idea of an integer array whose element values are the index of the focal element's parent. To represent the example tree we might expand that idea to end up with something like this:

-101221550
0.007.003.001.002.006.004.005.008.00
n4n3n1ABn2CDE

The first row shows Mark's suggestion (with the root having the magic number -1), the second has the branch lengths, the third row the node and tip labels.

For example, the fourth element in each array is for tip A, whose branch length is 1.00, and whose parent is at index 2. I'm using zero-based indices, so the element with index 2 is the third element, i.e. node n1 (branch length 3.00, parent at index 1).

If we place these arrays within separate XML tags (e.g. nodes, lengths and labels), this suggestion has some virtues:

  1. Each array can be syntactically validated as lists of XML simple types with predefined number and string formats
  2. Annotations can be added as additional arrays in parallel with these.
  3. It is concise.

Playing around with it a little bit, I also noticed it is extremely easy to parse. Here is a code sample that demonstrates this in Perl, but the basic idea is the same for every language. The only string processing that is needed is splitting on whitespace (which can even be done easily in XSLT), and if the arrays were constructed by a pre-order traversal, the encoded tree can be reconstructed very quickly:

#!/usr/bin/perl

# this supplies a factory object that
# can instantiate node and tree objects
use Bio::Phylo::Factory;

# this is a standard library that tests
# whether a string scalar looks like a
# number in any of Perl's number formats
use Scalar::Util 'looks_like_number';

# read the input file as an array of lines
my @lines = <>;

# split the lines on whitespace and assign
# them to arrays
my @nodes   = split /\s/, $lines[0];
my @lengths = split /\s/, $lines[1];
my @labels  = split /\s/, $lines[2];

# create the factory and use it to create
# a tree
my $factory = Bio::Phylo::Factory->new;
my $tree = $factory->create_tree;

# create an index for the @nodes array
for my $i ( 0 .. $#nodes ) {

    # for each element, create a node
    my $node = $factory->create_node;

    # assign branch length from @lengths, if it looks like a number        
    $node->set_branch_length( $lengths[$i] ) if looks_like_number $lengths[$i];

    # assign name from @labels, if one was defined
    $node->set_name( $labels[$i] ) if defined $labels[$i];

    # assign the parent, if index >= 0
    $node->set_parent( $tree->get_by_index( $nodes[$i] ) ) if $nodes[$i] >= 0;

    # a tree object can also function as an array of nodes, in which
    # new ones are inserted, and in which lookups by index can be done
    $tree->insert($node);
}

# print out the result, with internal node labels
print $tree->to_newick( '-nodelabels' => 1 );

Creating the encoding is also very easy, as this example shows:

#!/usr/bin/perl

# this is a library for parsing common
# phylogenetic file formats
use Bio::Phylo::IO 'parse';

# the @arrays will hold parent indices,
# branch lengths and node names; the %hash
# will hold a lookup table to find the 
# index of a given node by its ID
my ( @nodes, @lengths, @labels, %index_of );

# counter to generate node indices
my $counter = 0;

# this parses the first tree out of a
# newick file that is provided as a 
# command line argument
my $tree = parse(
    '-format' => 'newick',
    '-file'   => $ARGV[0],
)->first;

# does a depth first traversal, nodes
# are processed in pre-order
$tree->visit_depth_first(
    '-pre' => sub {
        my $node = shift;
        my $parent = $node->get_parent;

        # because this is pre-order, an index for the parent
        # has already been generated. In the case of the root,
        # use index -1
        push @nodes, $parent ? $index_of{$parent->get_id} : -1;
        push @lengths, $node->get_branch_length;
        push @labels, $node->get_internal_name; 

        # generate an index for the focal node and store it
        # in the lookup table
        $index_of{$node->get_id} = $counter++;  
    }
);

# prints results as space-separated lines
print join(' ', @nodes), "\n";
print join(' ', @lengths), "\n";
print join(' ', @labels), "\n";

So, what do you think? Is it bad to come up with yet another tree description syntax? Are there other concise encodings that can sensibly be embedded in XML? One downside is that any XML schema-based validated of this format can only check the syntax (i.e. are the correct numbers for indices and branch lengths used, are the labels valid strings?) but not whether an actual tree shape is encoded.

27 comments:

  1. Why not split the difference and get rid of the edge "edge" field completely. As the simple format above shows, every node can be described as a node name, a pointer to an ancestral node, and a branch length. Thus, why not simply use



    That captures all of the necessary information without any of the edge fields, and still easily allows for adding additional annotation. You could also get rid of the "root" tag by simply defining the root node as having an ancestor of "root" or -1 or "n/a" or something. Strictly speaking, is it even necessary to have the id and label as separate entities? (it might be, I don't really know the ins and outs of xml that well).

    It will be shorter than the current schema, although not as short as the proposed scheme. However, it may be easier to validate for proper tree shapes than the new schema (just guessing).

    ReplyDelete
    Replies
    1. Thanks for your reply! The id and label attributes do need to be separate, because we want to be able to name things with more freedom than is allowed in XML identifiers. We also want to keep the edges in the "verbose" version, because we can then annotate them separately from nodes. I suppose they could be omitted in a "compact" version, though this won't really shave off that many bytes. You're right, though, that that is easier to validate (at least to some extent) because you can constrain id references to point to something existing.

      Delete
  2. Gah. Stupid formatting. The middle part is supposed to say

    <node label="n3" id="ne4" length="7.0" ancestor="ne3"></node>

    ReplyDelete
  3. One problem with this is that it doesn't allow nodes to have multiple parents, while NeXML and Newick do. Granted, that isn't problem for most people, but it would be nice in a lot of cases.

    Also, if you want something approaching the richness and universality of XML but much more compact, isn't JSON the natural choice?

    ReplyDelete
    Replies
    1. You're right that this won't work for anastomosing trees, and it's not obvious to me how the parallel arrays idea can be made to support reticulation without multidimensional arrays (but maybe I'm missing something?).

      The idea here, by the way, is that this syntax is to be embedded in XML, so replacing it with JSON isn't really the solution - at least at this level of design decisions. I have played around with XML2JSON mappings for NeXML, though, and although that didn't yield nicely idiomatic JSON, it is more concise so it's an additional thing we could do if large data sets need to be transmitted (without zipping them).

      Delete
    2. Mike, on an entirely different note: I would like to list Names on Nodes on the NeXML website in the list of programs that use NeXML. Would you be interested in that? Do you have an icon I might put on there?

      Delete
    3. Ah, I see. Well, you could optionally have each element in the row be an array. If something has two parents, the element could be "[1, 2]" instead of "1". (You'd need to have two branch lengths as well in that case.) Making it optional means that it doesn't mess with the basic idea (i.e., your example is still valid).

      Please do add a link! I never got around to making an icon, so I just whipped something up here: http://farm8.staticflickr.com/7033/6686924333_fe8fda084c_o.png
      Useable?

      Delete
    4. If we are going to tunnel additional syntax inside strings like your "[1, 2]" suggestion we might as well have Newick strings in there. Mmmmm.

      Do you have a smaller icon? It's for the front page (halfway down), they're all 32px wide. I tried rescaling but it becomes unreadable/unrecognizable.

      Delete
    5. This one should shrink down easier. (Don't have image-editing software here at work, unfortunately.)
      http://farm8.staticflickr.com/7011/6687047667_3fd5348f19_o.png

      Delete
    6. I've added the link to the staging server (http://nexml-dev.nescent.org/), should be live soon. I've placed it next to jsPhyloSVG and PhyloBox, all being web-based, visual tools.

      Delete
  4. This is basically reinventing the "ancestor function", which the more ancient of us may recall from the days of PAUP, FORTRAN, and MS DOS.

    The Newick format has its limits but is much easier to "grok" than XML. I can check syntax using "balance brackets" command in BBEdit, extract a subtree simply by copying and pasting that selection into a tree viewer. XML has none of these features. I get why XML is useful (e.g., highly structured documents), but it seems that having ancestor functions inside XML raises the same issues as having Newick strings (by he way, I'm pretty sure one could write a simple Newick parser in XSLT if one really wanted to).

    The world is moving towards JSON, so I can't get excited about this. If you're going to have XML, have proper XML and live with it. Otherwise, surely the simple way forward is Newick strings as part of a JSON string?

    ReplyDelete
    Replies
    1. I think ancestor functions inside XML raise fewer issues than Newick strings unless you bless a limited subset of allowable Newick syntax and hacks, certainly for validation and XSLT. I agree that a Newick parser could be implemented in XSLT, though I would hate to have to deal with [] comments that way and it would be more complicated than with an ancestor function.

      "If you're going to have XML, have proper XML and live with it." I couldn't agree more!

      A construct like <nodes>-1 0 1 2 2 1 5 5 0</nodes> is perfectly proper XML, the allowable values within the array can be defined as an XML schema type.

      But even if the world is moving towards JSON, do you really want to have Newick strings in there? Is that nice to deal with, say in chouchdb? In JSON I would also rather have an ancestor function.

      Delete
    2. Or just do the Newick string as JSON? Change the parentheses to brackets, put all the labels in quotes? Although I guess that doesn't work for ancestor labels, hmmm....

      Delete
    3. Also, it's nice to have arrays of nodes so that you can have arrays of annotations parallel to them. Mesquite uses this to a great extent; you can't do that with Newick string. How do you refer to a node in a Newick string externally?

      Delete
    4. I suppose the only options are: 1) label every node, or 2) allow formula-based references to nodes (e.g., "A|B" => the MRCA of the terminal nodes labelled "A" and "B").

      Delete
    5. Phyloreferencing is certainly an interesting idea, I'd love to see a workable spec.

      Delete
    6. I envisage simply having the Newick string as a string. I wouldn't convert the string to a JSON object (cf Mike Keesey) because that imposes a data structure on the tree, and that's always helpful (I want to choose the data structure). I'm biased, but the ease of creating and consuming JSON (think AJAX), storing it in document-oriented databases (think CouchDB), and its small size (bandwidth matters, think mobile devices) means I have little time for XML-based solutions to shipping data around.

      Regarding phyloreferencing, to my mind there are obvious parallels with georeferencing. We can recast phylogenetic queries (e.g., "find me all trees with frogs") as spatial queries (given a global ordering of taxa). Indeed, I think we could quite easily adopt some conventions from GeoJSON to describe phylogenetic queries. I need to flesh this stuff out somewhere, but the underlying motive is to keep things simple and therefore tractable.

      Delete
  5. "One downside is that any XML schema-based validated of this format can only check the syntax..." -- that seems like a major downside, but is not the current NeXML not much better? For example, if you changed the last bit to "" so that ne10 is orphaned and ne4 is a meaningless loop, would an XML schema validator flag this as corrupt? It strikes me that PhyloXML's approach is the closest to providing true tree-graph integrity validation (by virtue of the hierarchal design of XML itself), even if it doesn't allow for multiple parents.

    ReplyDelete
  6. whoops... there must be some way to prevent blogspot from munching up the greater-than-less-than bits...

    ReplyDelete
  7. It occurs to me that when nodes have multiple parents, a MRCA function doesn't always resolve to a single node. Might not be workable in those situations.

    I made a huge phyloreferencing spec for my work on Names on Nodes: http://namesonnodes.org/ns/math/2009/

    Names on Nodes uses MathML which is great for structure but incredibly verbose. (Quite similar to your present dilemma!) I started to work on a more compact, legible phyloreferencing script here:

    http://namesonnodes.org/ns/math/2009/query.html

    ... but I never really finished it.

    ReplyDelete
  8. I'm not a great fan of formats that aren't readable, and I'm not sure the index scheme is very obvious. (Then again, you could argue that XML formats aren't for humans. (And then again, again, if you argue that, the verbosity of node-edge tags isn't an issue.))

    One point: does the index scheme mean that changing the tree may cause changes across a lot of nodes because it changes the indices and triggers renumbering? This seems a little awkward.

    ReplyDelete
    Replies
    1. And, the index scheme prevents us from attaching properties to edges/branches, only to descendant nodes. Fine for a rooted tree, but for unrooted, you have to pretend the tree is rooted and then extrapolate the unroooted structure. We've been doing that for years (see Newick) but this might be an opportunity to get it right.

      Apologies if this comes across negatively - I realise it's a hard problem and am just thining through the implications.

      (Actually from Paul Agapow - because I can't get my OpenID to work this morning ... :^(

      Delete
    2. The verbosity of the node-edge tags isn't (just) an issue in terms of its readability, but really in file sizes. If the output from a MrBayes run as a NEXUS file is 100Kb it's not so great if the equivalent in NeXML is 100Mb (or whatever). Some size increase is probably fine, but not by many orders of magnitude if we are going to share data "in the cloud".

      The index scheme doesn't necessarily cause many changes: if it's just a branch swap, the only thing that has to change is one index. The consequence, though, is that the arrays might subsequently need more scanning than if they're inserted in a neat pre-ordered fashion. For an example of how this works out in practice: mesquite has some internal methods for periodically checking the integrity of its ancestor function arrays if they've been "dirtied", and annotations that are associated with it by 3rd party modules are flagged as untrustworthy if the topology changes.

      Good point about not having separate edge properties this way and having to fake things with (not really) "unrooted" trees: this should be an opportunity to get it right. Maybe something like Felsenstein's ring pointers can be done in XML?

      Delete
  9. Great: I composed my comment before permitting javascript on this page.. Sigh. It's all gone. Here goes to try to recover it:

    Personally I hate XML for its verbosity and hugeness. While I was attempting to view this post on my slightly-smart-phone, the (ok, image of the) XML version of the tree above didn't load for so long that I thought it was the cruel joke part of the story. Nice irony, I thought.

    Now basically I see XML as a good excuse for not wanting to write a new parser. It's not designed for humans to read, and it's certainly not designed for compactness. There's a nice hierarchical structure in it, which could be useful to represent trees, but it's clearly a very bulky way of doing it and isn't always appropriate.
    I'm with Rod here: if you want to use XML (I don't, for reasons like those above), do so: if not, don't. Just because there's a hierarchical structure inherent in XML doesn't mean we should use that structure for trees, which aren't the only phylogenetic graph of interest anyway. If there's a good format like Newick or the integer-array based method Mark Holder suggested (I think I first saw that in Phylip: wonderfully compact), then use that. If you are desperate to have some XML-like format and maintain a general DAG structure then maybe you could label nodes (constructing the node labels: bad idea) and even have <node>, <subtree> and even <newicknode> but for my money, leave it in a compact format and parse it specially. XML isn't designed for this.

    ReplyDelete
    Replies
    1. I strongly disagree that we should have to write a new parser every time we want a different type of file format. Time is limited enough as it is. On one of my projects, I spent weeks trying to write a NEXUS parser before throwing up my hands in despair at ever being able to cover all (or even most) variants. It took me a couple of days to implement a NeXML parser.

      That said, yes, XML is incredibly verbose. But that's why we have JSON.

      (Note: my comments don't apply to simple formats, like Newick, where it's stupidly easy to write a parser. But anything that has the full expressive power of NeXML isn't going to be that simple.)

      Delete
  10. Hi Rutger,

    Just my $0.02 ...

    I think this parent array design is great, with lots of desirable properties from a parsing/representation/maintenance perspective. While not essential, it also works well as a data model, and in fact that is how trees are stored under the hood in GINKGO.

    However, at the same time, basically I agree with the others in that I do not see what it brings to the table over XML apart from compactness, and I do not think compactness alone is sufficient to warrant incorporation of this into NeXML. A long time ago, we committed toward expression over economy as priorities in this data format. I still think that this is a right decision. But it does have consequences, and cumbersome representation is one of them. I think we should stick to our guns.

    What does a more compact representation buy us? For many, many, many datasets, nothing. The verbosity of other data structures, some of which are irreducible, undermine the savings in tree representation. The cost here is yet another parallel, independent way of representing the same thing in our data standard: something that needs to be documented, maintained, re-implemented in every new language, requiring new lines of code (and thus opportunity for new bugs) etc. etc. I think it might be problematical if a data standard itself proliferates multiple equivalent ways of doing something with minimal differences in benefit between them and lots of possibility of extra error due to the unnecessary redundancy.

    We all agree that we are no longer concerned with humans directly working on the source. Savings in disk space? When does this become an issue? A 1.5 G file becomes 1.2G on a 1TB storage system?

    I'm not saying compactness is not desirable. But I think NeXML should not be the target of such priorities. Maybe an alternate format, designed purely for compactness, parsing efficiency, and speed, and for direct/straightforward exchange with NeXML (i.e., a clear mapping of entities). Probably has to be lossy wrt to NeXML due to some fundamental information-theoretic equivalent of the laws of thermodynamics, but maybe sufficiently expressive to capture most data required by analyses (trees, sites). Additional information provided through a supplemental NeXML file (annotations/metadata).

    ReplyDelete
  11. I'm afraid I agree with the previous comments. Don't mangle a format if it doesn't meet your needs, instead choose one that does. And why not use compression tools for compressing a verbose format. (And I'm not sure why Google calls me unknown now.)

    ReplyDelete