Algorithm for distinguishing polyphyly from paraphyly

For a project that involves large trees as produced by BOLD I've had to come up with a way to assess whether species are monophyletic, paraphyletic or polyphyletic. Or, perhaps more accurately, whether all species in the tree had undergone complete lineage sorting for the COI locus, and if not, what other species they are tangled up with. Monophyly is easy enough, you just walk the tree and check that for each set of tips that is somehow lumped together there is only one MRCA. I had a harder time distinguishing para and poly, perhaps because I think (probably wrongly) that they are kinda the same depending on how you trace the lineages over the tree. So here is what I came up with.

Let's say there is such a thing as a stopnode*, which is locally the most recent interior node that subtends more than one species, and that this property is reset and re-applied every additional instance a clade of more than one species is formed in a tip-to-root traversal.

For example, consider a tree that has multiple individuals for species A and B in it: ((((A1,A2)n1,B1)n2,B2)n3,A3)n4. The node n2 is the first stopnode that subtends both species A and species B. Once we've found n2, we forget about this assemblage of A and B (for which we saw A1, A2 and B1) and keep moving closer to the root. The next node, n3, then only subtends species B (as instance B2) so it's not a stopnode. When we get to n4 we've found the next one. The rules for assigning monophyly, paraphyly and polyphyly then become:

  • If a species participates in only one stopnode, it is monophyletic.
  • If a species participates in multiple stopnodes that are all nested, it's paraphyletic.
  • If a species participates in multiple stopnodes some of which are not nested, it's polyphyletic.

Assuming we can come up with a way to make sets of stopnodes in a post-order traversal and index them on the participating species (the code nearly writes itself, obviously) we then have to come up with a way to figure out whether the stopnodes for a given species are nested.

One way to figure out whether nodes in a tree are nested is by applying the value of a counter both in pre-order and in post-order:

my $counter = 0;
traverse( $tree->get_root );
sub traverse {
    my $node = shift;
    $node->set_generic( 'pre' => $counter++ );
    traverse( $_ ) for @{ $node->get_children };
    $node->set_generic( 'post' => $counter++ );
}

Both collecting the stopnodes and labeling all nodes like this can be done in a single recursion. For each species, we then sort the stopnodes on the value of the 'pre' annotation in ascending order. When we iterate over this list, the value of the 'post' annotation for each next node in the list must be smaller than that of the current one, or else: polyphyly.

Here's a little project that applies this as both a command-line script and a little web app that accepts BOLD's horribly garbled output format.

*This name sucks. Come up with something that sounds graph-theoretical before you click "Publish". Oh.

2 comments:

  1. Thanks for sharing Rutger.
    One earlier account on the subject came to mind:
    Farris, J.S. 1974. Formal Definitions of Paraphyly and Polyphyly. Syst. Zool. 23:548-554.
    http://www.jstor.org/stable/2412474?seq=4
    Cheers
    Johan

    ReplyDelete
  2. Hi Johan - apologies for my late response! Thank you very much for that interesting, ancient reference - plus ça change, etc. :)

    ReplyDelete