Skip to main content

plotting - How to work with phylogenetic trees?


A phylogenetic tree can be represented as a comma-bracket string by the Newick format (see the Newick Standard, or this site for syntax details; here is an online tree visualizer). A simple Newick string example is resolved as follows:


"(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"

enter image description here


Note, that a (b1, b2, ...)node represents a tree rooted at node, with branches in the parenthesis; any node can have a lable l and a distance d to its parent in the form l:d; multiple trees can be separated by a ;. The format is rather flexible, as is apparent from the wiki. Both the label and the distance can be omitted, hence ((,),,); is a valid tree.



  • How can one parse a Newick tree to a Mathematica-usable object? (The other way around was already covered.)

  • How to convert a tree to a Graph object?

  • How to work with trees? E.g. how to measure distances of leafs/nodes?


  • How to display a tree in standard cladogram format (orthogonal edges, correct branching coordinates, vertex and edge labels, etc.)?

  • What is the relationship between a phylogenetic tree, a TreeGraph, a Dendrogram, and a DendrogramPlot of the "HierarchicalClustering`" package?



Answer



Download the Phylogenetics` package here: Phylogenetics on GitHub (direct link to latest release: source zip), available on PackageData.net here. Documentation: notebook / pdf / dark themed pdf (for the eyestrained).




Version 1.1.0. is out (2017 01. 31.)



  • added: CommonAncestor, DivergenceTime, AncestorIndex, TreeDistanceMatrix, TreeTop, $PhylogeneticsVersionNumber.

  • fixed: Usage message corrections.


  • feature: Straight Cladogram edges, branching according to angle.

  • modified: LeafQ and NodeQ now can be applied to a single argument representing a node association.

  • renamed: SubnodeIndex to DescendantIndex.

  • modified: Code was made compatible down to Mathematica 10.0. No compatibility before that due to heavy usage of associations.




I had to deal with some phylogenetic trees in papers that did not publish exact data for all nodes (e.g. see this paper). I ended up coding a full package to handle Newick and other phylogenetic trees. Hope it could save some time for others.


The Phylogenetics` package can:



  • parse a Newick tree string to a Mathematica-usable Tree object and write a Tree object to a Newick string;


  • parse a Cluster object of the HierarchicalClustering` package to a Tree and convert a Tree to Cluster representation;

  • test properties of Tree objects;

  • extract vertices, edges, internal nodes, leafs, sub- and supertrees of Tree objects;

  • directly calculate various properties of Tree objects like paths and distances between vertices;

  • convert a Tree object to Graph, and a tree Graph to Tree; any tree graph (TreeGraph, ClusteringTree, etc.) for which TreeGraphQ returns True can be converted to Tree;

  • convert a Tree object or a tree Graph object to Cladogram graph,

  • provide some example trees in Newick and Cluster format.


The Tree representation follows closely the Newick format, but is a full representation where nothing is omitted. A tree is technically a directed, acyclic graph with an explicit metric defined along the branching dimension (instead of between leaf nodes as with e.g. Dendrogram). Each vertex is represented as an association of basic properties minimally required to describe a tree:




  • "Name" specifies the unique name of the vertex, just like in Graph; can be any expression.

  • "Label" specifies the label of the vertex that might appear when the tree is displayed as a graph; can be any expression.

  • "BranchLength" specifies the numerical distance of the vertex to its parent vertex.

  • "Type" specifies the type of the vertex, either as "Node" for internal nodes or "Leaf" for terminal nodes.


Nodes are formatted by default to appear simplified as grey boxes not to increase clutter; the tree structure is best visible in this form.


The package lives on GitHub (link in first line above), and I will try to maintain and develop it as my time allows. Full documentation and examples are in the Phylogenetics documentation.nb notebook. Please report issues and bugs here. Collaborations are very welcome!




Examples


Load the package and convert a Newick string to a Tree.



Needs["Phylogenetics`"];

tree = NewickToTree["(A:0.1,B:0.2,((G:.1,H:0.2,I:0.3)C:0.3,D:0.4):0.5)F;"]

Mathematica graphics


TreeToGraph[tree, ImageSize -> Small]

Mathematica graphics


There are many example trees provided through PhylogeneticData. Cladogram offers visualization specific for cladistics.


tree = NewickToTree[PhylogeneticData["ExampleNewick"]];

Cladogram[tree, EdgeLabels -> Automatic, Frame -> True,
EdgeLabelStyle -> Orange, LayerSizeFunction -> (1 # &),
ImageSize -> Medium, VertexSize -> 0.01]

Mathematica graphics


Note, that unnamed nodes are automatically assigned a unique index to avoid name collision. Vertices, edges, leafs, internal nodes, subtrees and supertrees can be extracted easily:


tree = NewickToTree["(A:0.1,B:0.2,((G:.1,H:0.2,I:0.3)C:0.3,D:0.4)E:0.5)F;"]
VertexList[tree]
LeafList[tree]
Subtree[tree, "C"]

Supertree[tree, {"H", "D"}]

Mathematica graphics


Cladogram[tree, GraphHighlight -> Subtree[tree, #], PlotLabel -> #, 
VertexLabels -> "Name"] & /@ {"F", "E", "C", "A"}

Mathematica graphics


Graphs can be converted directly to cladograms:


graphs = {KaryTree[10, ImageSize -> Tiny], 
CompleteKaryTree[3, 3, ImageSize -> Tiny],

StarGraph[6, ImageSize -> Tiny],
TreeGraph[{1 -> 2, 1 -> 3, 1 -> 4}, ImageSize -> Tiny]}
Cladogram[#, Orientation -> Top, ImageSize -> Tiny] & /@ graphs

Mathematica graphics


Hierarchical clusters can be also converted, with similarity values indicating branching distance.


cluster = PhylogeneticData["ExampleCluster1"]
tree = ClusterToTree[cluster]

Mathematica graphics



{TreeToGraph[tree, 
GraphLayout -> {"LayeredEmbedding", Orientation -> Top}],
Cladogram[tree, Orientation -> Top, ImageSize -> Small,
ImagePadding -> 10],
DendrogramPlot[cluster, Orientation -> Top, LeafLabels -> (# &)]}

Mathematica graphics


Extract paths and measure distances:


tree = NewickToTree["(A:0.1,B:0.2,((G:.1,H:0.2,I:0.3)C:0.3,D:0.4)E:0.5)F;"];
list = {{"I", "F"}, {"F", "I"}, {"H", "H"},{"H", "B"}};

path = TreePath[tree, #] & /@ list;
dist = TreeDistance[tree, #] & /@ list;

MapThread[
Cladogram[tree, GraphHighlight -> #1, ImageSize -> Small,
PlotLabel -> Row@{#1, ": ", #2}, VertexLabels -> "Name"] &, {path,
dist}]

Mathematica graphics


A serious phylogenetic tree (from Parfrey et al. 2011):



Cladogram[NewickToTree[PhylogeneticData["NewickEukarya"]], 
ImageSize -> 500, ImagePadding -> {{1, 100}, {1, 1}},
LayerSizeFunction -> (1/5 # &), VertexSize -> None,
VertexLabelStyle -> Directive[5, Italic], ImageSize -> Medium]

Mathematica graphics


Comments

Popular posts from this blog

mathematical optimization - Minimizing using indices, error: Part::pkspec1: The expression cannot be used as a part specification

I want to use Minimize where the variables to minimize are indices pointing into an array. Here a MWE that hopefully shows what my problem is. vars = u@# & /@ Range[3]; cons = Flatten@ { Table[(u[j] != #) & /@ vars[[j + 1 ;; -1]], {j, 1, 3 - 1}], 1 vec1 = {1, 2, 3}; vec2 = {1, 2, 3}; Minimize[{Total@((vec1[[#]] - vec2[[u[#]]])^2 & /@ Range[1, 3]), cons}, vars, Integers] The error I get: Part::pkspec1: The expression u[1] cannot be used as a part specification. >> Answer Ok, it seems that one can get around Mathematica trying to evaluate vec2[[u[1]]] too early by using the function Indexed[vec2,u[1]] . The working MWE would then look like the following: vars = u@# & /@ Range[3]; cons = Flatten@{ Table[(u[j] != #) & /@ vars[[j + 1 ;; -1]], {j, 1, 3 - 1}], 1 vec1 = {1, 2, 3}; vec2 = {1, 2, 3}; NMinimize[ {Total@((vec1[[#]] - Indexed[vec2, u[#]])^2 & /@ R...

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...

What is and isn't a valid variable specification for Manipulate?

I have an expression whose terms have arguments (representing subscripts), like this: myExpr = A[0] + V[1,T] I would like to put it inside a Manipulate to see its value as I move around the parameters. (The goal is eventually to plot it wrt one of the variables inside.) However, Mathematica complains when I set V[1,T] as a manipulated variable: Manipulate[Evaluate[myExpr], {A[0], 0, 1}, {V[1, T], 0, 1}] (*Manipulate::vsform: Manipulate argument {V[1,T],0,1} does not have the correct form for a variable specification. >> *) As a workaround, if I get rid of the symbol T inside the argument, it works fine: Manipulate[ Evaluate[myExpr /. T -> 15], {A[0], 0, 1}, {V[1, 15], 0, 1}] Why this behavior? Can anyone point me to the documentation that says what counts as a valid variable? And is there a way to get Manpiulate to accept an expression with a symbolic argument as a variable? Investigations I've done so far: I tried using variableQ from this answer , but it says V[1...