4003-532/4005-736 Parallel Computing II
Module 3 Lecture Notes -- Phylogenetic Trees
Building Blocks
Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science
Fitch Algorithm for Parsimony Scoring of a Tree
-
W. Fitch.
Toward defining the course of evolution: minimum change for a specific tree topology.
Systematic Zoology,
20(4):406-416, December 1971.
- Terminology
- DNA sequence -- a sequence of the bases A, C, G, T
- Site -- a particular position in the sequence
- State -- the contents of a site (i.e., a base)
- If the state of a particular site in one species is different from the state of the same site in another species, then there was a state change at that site somewhere in the evolutionary history of those species
- Input: A rooted bifurcated tree of equal-length (aligned) DNA sequences
- S = Number of species = Number of DNA sequences
- N = Number of sites in each DNA sequence
- Output: The tree's parsimony score
- Score = Number of state changes needed to account for the differences in the states of all sites across all species
- We want to find the tree(s) with the smallest score
- Representation of sites
- Each node in the tree has an associated sequence of sites
- Each site is represented as a set of bases
- At a tip node, each site's set of bases has a single member corresponding to the species' state
- At an interior node, each site's set of bases might have one or more than one member
- Algorithm to compute the sequence of sites and the score for an interior node
Score = 0
For i = 0 to N-1:
Child1 = First child node's set of bases at site i
Child2 = Second child node's set of bases at site i
If intersection (Child1, Child2) is not empty:
Interior node's set of bases at site i = intersection (Child1, Child2)
Else:
Interior node's set of bases at site i = union (Child1, Child2)
Score = Score + 1
- Algorithm to compute the score for the whole tree
- Starting at the tip nodes and working up to the root node, compute the score for each interior node as above
- Tree score = Sum of interior nodes' scores
- Example (J. Felsenstein, Inferring Phylogenies (Sinauer Associates, 2004), page 12)
- Example
- Note that the interior nodes' sequences of sets of bases do not represent the DNA sequences of the inferred ancestor species, they are only used to compute the parsimony score
DNA Sequence
- To compute the Fitch parsimony score, we must represent a DNA sequence as a sequence of sets of bases
- The set of bases requires highly efficient implementations of these operations:
- Intersection
- Union
- Compare to empty
- We also want to:
- Minimize the amount of storage used
- Minimize the number of bytes needed to serialize a DNA sequence
- Thinking ahead to sending DNA sequences in messages in a cluster parallel program
- DNA sequence class:
DNA Sequence File Format
- Quasi-official standard file format for phylogeny programs: interleaved PHYLIP format
- Example from Prof. Larry Buckley, RIT Department of Biology
- Example from Prof. Larry Buckley, RIT Department of Biology
- DNA sequence list class, including methods to read and write interleaved PHYLIP format files:
Phylogenetic Tree Textual Format
- Quasi-official standard textual format for phylogenetic trees: Newick Standard Format
- Example: (Gibbon,(Orangutan,(Gorilla,(Chimp,Human))))
- Example: ((AACT,AGCT),(CGCT,CAGT))
- Program to make a drawing of a tree from its Newick Standard Format representation:
Informative and Uninformative Sites
- Consider a site where all S species have the same state
- This site will add 0 to the Fitch parsimony score regardless of the tree topology
- Consider a site where S-1 species have the same state and one species has a different state
- Consider a site where S-K species have the same state and each of the remaining K species has a different state from all the others
- This site will add K to the Fitch parsimony score regardless of the tree topology
- Consider a site where two different states appear, and each state appears in two different species
- This site will add a different amount to the Fitch parsimony score, depending on the tree topology
- Example for 4 species W, X, Y, and Z with states {A}, {A}, {C}, and {C}, respectively
- One state change for this tree:
+-----{A} = W
|
+-----{A}
| |
| +-----{A} = X
------{AC}*
| +-----{C} = Y
| |
+-----{C}
|
+-----{C} = Z
- Two state changes for this tree:
+-----{A} = W
|
+-----{AC}*
| |
| +-----{C} = Y
------{AC}
| +-----{A} = X
| |
+-----{AC}*
|
+-----{C} = Z
- Informative site: A site with at least two different states, each state appearing in at least two different species
- Uninformative site: A site that is not an informative site
- When comparing two trees to see which has the better score, the uninformative sites do not affect the outcome
- Because the uninformative sites contribute the same amount to both trees' scores
- Therefore, the Fitch parsimony scoring algorithm only needs to consider the informative sites
- Why is this important?
- When computing a phylogenetic tree for related species, typically most of the sites are uninformative
- But the Fitch algorithm's running time is O(N), N = number of sites
- Therefore, considering only the informative sites can greatly reduce the program's running time while yielding the same answer
- DNA sequence list class, including a method to find and excise the uninformative sites:
Tree Generation Practicalities
- Incremental generation of bifurcated trees
- Each tree node has a reference to its parent, its first child, and its second child
- Given a current tree with N tip nodes and 2N-1 total nodes . . .
- Algorithm to generate all possible trees with N+1 tip nodes:
For i = 0 to 2N-2:
New tree = copy of current tree
Add a new tip node
Add a new interior node
New interior node's parent = node i's parent
New interior node's first child = node i
New interior node's second child = new tip node
Node i's parent's child (first or second as appropriate) = new interior node
Node i's parent = new interior node
- Starting from a tree with a single tip node, you can construct all possible trees with two nodes, then three nodes, and so on
- Incremental update of Fitch parsimony scores
- Each tree node has a reference to a sequence of sets of bases
- Each interior tree node remembers its own contribution to the parsimony score
- The tree itself remembers the total parsimony score
- When a new tip node is added, you only have to run the Fitch algorithm on the interior nodes from the new tip up to the root
- Also, you can stop if the Fitch algorithm generates a sequence of sets of bases that is the same as what's already stored at that node
- DNA sequence tree class
|
Parallel Computing II
|
|
•
|
|
4003-532-70/4005-736-70
|
|
•
|
|
Spring Quarter 2007
|
|
Course Page
|
|
Alan Kaminsky
|
|
•
|
|
Department of Computer Science
|
|
•
|
|
Rochester Institute of Technology
|
|
•
|
|
4486 +
2220 =
6706
|
|
Home Page
|
Copyright © 2007 Alan Kaminsky.
All rights reserved.
Last updated 23-Apr-2007.
Please send comments to ark@cs.rit.edu.
|