Assessing ancient DNA fragmentation using the SAMTools API

At Naturalis we are now Illumina-sequencing genomes from herbarium specimens that are 150+ years old. One feature of DNA this old is that it fragments in predictable patterns: strands tend to break just after a purine. This means that when we sequence these genomes and map the short reads against a reference, we should see a compositional bias (an excess of purines) in the reference one base upstream of where each read starts. Tools exist to compute and visualize this bias across an entire chromosome, but not across an interval (which is what we'd like), so I took the opportunity to play around with the SAMTools Perl API.
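To make the idea concrete, here is a minimal sketch in plain Python rather than the SAMTools Perl API. It assumes the alignments have already been pulled out of a BAM file as hypothetical `(start, end, is_reverse)` tuples (0-based, half-open) and that the reference is a plain string. The function names and the interval filter on the upstream position are illustrative choices, not anything taken from the post.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def upstream_base_counts(reference, reads, region_start, region_end):
    """Count the reference base one position upstream of each read's 5' end.

    reads: iterable of (start, end, is_reverse), 0-based half-open coordinates.
    Only upstream positions inside [region_start, region_end) are counted.
    """
    counts = Counter()
    for start, end, is_reverse in reads:
        # On the reverse strand the read's 5' end is at the alignment end,
        # so "upstream" is the next reference position, read as its complement.
        pos = end if is_reverse else start - 1
        if pos < 0 or pos >= len(reference):
            continue
        if not (region_start <= pos < region_end):
            continue
        base = reference[pos].upper()
        if is_reverse:
            base = COMPLEMENT.get(base, "N")
        if base in COMPLEMENT:  # skip N and other ambiguity codes
            counts[base] += 1
    return counts


def purine_fraction(counts):
    """Fraction of counted upstream bases that are A or G."""
    total = sum(counts.values())
    return (counts["A"] + counts["G"]) / total if total else 0.0


reference = "GACGTTAGCA"
reads = [(1, 4, False), (4, 7, False), (2, 6, True), (0, 3, False)]
counts = upstream_base_counts(reference, reads, 0, len(reference))
fraction = purine_fraction(counts)
```

A purine fraction well above the background purine content of the interval would be the signature of the fragmentation pattern described above. With real data the tuples would come from whatever BAM reader is at hand.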

Algorithm for distinguishing polyphyly from paraphyly

For a project that involves large trees as produced by BOLD, I've had to come up with a way to assess whether species are monophyletic, paraphyletic or polyphyletic. Or, perhaps more accurately, whether all species in the tree have undergone complete lineage sorting for the COI locus and, if not, which other species they are tangled up with. Monophyly is easy enough: you walk the tree and check that the MRCA of each set of tips that is lumped together as a species has no tips from any other species below it. I had a harder time distinguishing para and poly, perhaps because I think (probably wrongly) that they are kinda the same depending on how you trace the lineages over the tree. So here is what I came up with.
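The post's own algorithm is not reproduced in this excerpt. As a rough illustration only, the sketch below uses one simple convention, which is an assumption and not necessarily the author's: a species is monophyletic if its MRCA subtends nothing but its own tips, paraphyletic if the "intruding" tips under that MRCA themselves form a single clade, and polyphyletic otherwise. Trees are represented as nested tuples with string tip labels, which is a hypothetical stand-in for a real tree object.

```python
def tips(node):
    """Return the set of tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= tips(child)
    return result


def mrca(node, targets):
    """Return the smallest subtree of `node` that contains all `targets`."""
    if not isinstance(node, str):
        for child in node:
            if targets <= tips(child):
                return mrca(child, targets)
    return node


def classify(tree, members):
    """Classify a set of tips as mono-, para- or polyphyletic (assumed convention)."""
    members = set(members)
    clade = mrca(tree, members)
    intruders = tips(clade) - members
    if not intruders:
        return "monophyletic"
    # Assumption: one nested clade of intruders counts as paraphyly.
    if tips(mrca(clade, intruders)) == intruders:
        return "paraphyletic"
    return "polyphyletic"


sorted_tree = (("a1", "a2"), ("b1", "b2"))
nested_tree = ((("a1", "b1"), "a2"), "c1")
mixed_tree = (("a1", "b1"), ("a2", "c1"))

results = [
    classify(sorted_tree, {"a1", "a2"}),
    classify(nested_tree, {"a1", "a2"}),
    classify(mixed_tree, {"a1", "a2"}),
]
```

The intruder set also answers the "tangled up with" question, since it lists the tips of other species that sit inside the focal species' clade. This version recomputes tip sets on every call, so for BOLD-sized trees the sets would need to be cached per node.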

Creating simple JSON using recursion

Fabian (TreeFam's fearless leader) had another email request: how do I generate simple JSON by traversing a tree object? The end result needed to be a nested pseudo-object structure where each object has a 'name' field and, optionally, a 'children' field that holds a list of similar pseudo-objects. This is the input format for a rather attractive tree viewer widget that is explained here and that might be adopted for future releases of TreeFam.
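A minimal sketch of the recursion in Python, assuming a hypothetical `Node` class with a `name` and a list of `children` (the real tree object and the viewer's exact requirements may differ). Each node becomes a dictionary with a 'name' field, and a 'children' field is added only when the node actually has children.

```python
import json


class Node:
    """Hypothetical tree node: a name plus zero or more child nodes."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def to_pseudo_object(node):
    """Recursively convert a node into a nested dict ready for json.dumps."""
    obj = {"name": node.name}
    if node.children:  # tips get no 'children' field at all
        obj["children"] = [to_pseudo_object(child) for child in node.children]
    return obj


tree = Node("root", [Node("A"), Node("n1", [Node("B"), Node("C")])])
text = json.dumps(to_pseudo_object(tree), indent=2)
```

Building plain dictionaries first and serializing once at the end leaves quoting and escaping of names to the `json` module, which is less error-prone than concatenating strings during the traversal.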