Assessing ancient DNA fragmentation using the SAMTools API

At Naturalis we are now Illumina sequencing genomes from 150+ years old herbarium specimens. One of the things about this old DNA is that it fragments along predictable patterns, i.e. the strand breaks just after a purine. This means that when we do NGS of these genomes and we map them agains a reference we should see compositional bias one base upstream from where the short reads map against the reference. There exist tools to compute and visualize this bias across an entire chromosome, but not across an interval (which is what we'd like), so I took the opportunity to play around with the SAMTools perl API.

Algorithm for distinguishing polyphyly from paraphyly

For a project that involves large trees as produced by BOLD I've had to come up with a way to assess whether species are monophyletic, paraphyletic or polyphyletic. Or, perhaps more accurately, whether all species in the tree had undergone complete lineage sorting for the COI locus, and if not, what other species they are tangled up with. Monophyly is easy enough, you just walk the tree and check that for each set of tips that is somehow lumped together there is only one MRCA. I had a harder time distinguishing para and poly, perhaps because I think (probably wrongly) that they are kinda the same depending on how you trace the lineages over the tree. So here is what I came up with.

Creating simple JSON using recursion

Fabian (TreeFam's fearless leader) had another email request: how do I generate simple JSON by traversing a tree object? The end result needed to be a nested pseudo-object structure where each object has a 'name' field and, optionally, a 'children' field that holds a list of similar pseudo-objects. This is the input format for a rather attractive tree viewer widget that is explained here and that might be adopted for future releases of TreeFam.

Attaching phyloxml annotations

Fabian Schreiber posted to the mailing list that he was having trouble parsing a newick string, then attaching annotations to the resulting objects such that they show up when serialized to phyloxml. How this is done is not immediately obvious so here is an example.

Reading and writing complex character annotations

I have just uploaded Bio::Phylo version 0.48 so it should become available shortly on http://search.cpan.org/dist/Bio-Phylo. Also, the mailing list has now opened up to anyone without my approving your signing up, at least until the spam gets out of hand. And there was a question from Laszlo Nagy, sent to me privately, which is about the post's title and is interesting enough to discuss here publicly.

Bio::Phylo::Beagle wrapper around Beagle-lib

The BEAGLE library has been published in Systematic Biology and is under development on Google code. BEAGLE is a very exciting project: it has been designed to make the most of various chip architectures (GPUs, OpenMP and SSE) in computing likelihoods. Several popular programs (MrBayes, BEAST and Garli) interface with BEAGLE, and the author list is a who's who in computational biology. In the BEAGLE source tree there is an example that demonstrates how python can interface with BEAGLE. Here, by way of a test script, I demonstrate a perl library that does the same, but with the data handling of Bio::Phylo.

Visualizing RDFa semantic annotations using SVG and NeXML/JSON

TreeBASE emits richly RDFa-annotated NeXML. An example of this is the annotation of taxa/OTUs, whose names TreeBASE tries to match to the NCBI taxonomy. The annotations that indicate such a match use skos:closeMatch as predicate and the URIs as provided by UniProt as the object. Below is a tree that uses the same logic and code, more or less, as a previous post, but embeds these UniProt URIs as xlink:href clickable links (in bold face) in the generated SVG.