4003-532/4005-736 Parallel Computing II
Module 3 Lecture Notes -- Phylogenetic Trees
Concepts
Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science
History of DNA
- Deoxyribonucleic acid (DNA), the molecule, was discovered in 1869 by Friedrich Miescher
- He called it "nuclein" because it was found in cell nuclei
- The stereochemical structure of DNA was discovered in 1953 by Francis Crick and James Watson
- Experimental evidence (X-ray diffraction data) for the structure of DNA was discovered by Rosalind Franklin and Maurice Wilkins and their students
|
|
Friedrich Miescher
1844-1895
|
- The original papers announcing the DNA structure and X-ray data:
- J. Watson and F. Crick. A structure for deoxyribose nucleic acid. Nature, 171:737-738, April 25, 1953
- M. Wilkins, A. Stokes, and H. Wilson. Molecular structure of deoxypentose nucleic acids. Nature, 171:738-740, April 25, 1953
- R. Franklin and R. Gosling. Molecular configuration in sodium thymonucleate. Nature, 171:740-741, April 25, 1953
- Available at http://www.nature.com/nature/dna50/archive.html
- Crick, Watson, and Wilkins won the 1962 Nobel Prize in Medicine "for their discoveries concerning the molecular structure of nucleic acids and its significance for information transfer in living material"
 |
|
 |
|
 |
|
 |
| Francis Crick |
|
Rosalind Franklin |
|
James Watson |
|
Maurice Wilkins |
| 1916-2004 |
|
1920-1958 |
|
1928- |
|
1916-2004 |
Structure of DNA
- A DNA molecule takes the form of a double helix
- Two intertwined helical sugar-phosphate backbones on the outside
- A sequence of bases on the inside
- The four bases:
- Adenine (A)
- Cytosine (C)
- Guanine (G)
- Thymine (T)
- A and G are purines
- C and T are pyrimidines
 |
|
 |
|
 |
|
 |
| Adenine |
|
Cytosine |
|
Guanine |
|
Thymine |
- Base pairs
- Each base binds to one backbone strand
- Each base also binds to a complementary base on the other backbone strand
- Each such base pair consists of a purine and a pyrimidine
- A pairs with T
- C pairs with G
- We will be concerned with the informational structure of DNA, rather than the chemical structure
Structure of Proteins
- A protein molecule consists of a sequence of amino acids
- The twenty amino acids:
- Name (Abbreviation/Character)
- Alanine (Ala/A)
- Arginine (Arg/R)
- Asparagine (Asn/N)
- Aspartic acid (Asp/D)
- Cysteine (Cys/C)
- Glutamic acid (Glu/E)
- Glutamine (Gln/Q)
- Glycine (Gly/G)
- Histidine (His/H)
- Isoleucine (Ile/I)
- Leucine (Leu/L)
- Lysine (Lys/K)
- Methionine (Met/M)
- Phenylalanine (Phe/F)
- Proline (Pro/P)
- Serine (Ser/S)
- Threonine (Thr/T)
- Tryptophan (Trp/W)
- Tyrosine (Tyr/Y)
- Valine (Val/V)
- The amino acid sequence determines the protein's 3-D shape
- The protein's 3-D shape determines the protein's function
Genetic Sequences
- Protein sequence -- a sequence of characters from the 20-character alphabet {ARNDCEQGHILKMFPSTWYV} (amino acids)
- DNA sequence -- a sequence of characters from the 4-character alphabet {ACGT} (bases)
- Gene -- a DNA sequence that encodes the sequence of amino acids for one protein
- Genome -- the complete DNA sequence for one organism
- Contains genes, plus "junk DNA" whose function is unknown
- About 97% of the human genome is junk DNA
- Sequence databases
- National Center for Biotechnology Information (NCBI) -- http://www.ncbi.nlm.nih.gov/
- Collects data from many other sequence databases
- Allows many kinds of sequence database searches
- Example of a gene found on the NCBI web site
- Organism: Desert iguana, Dipsosaurus dorsalis
- Protein: Muscle-type (A4) lactate dehydrogenase
LOCUS DQ793218 944 bp mRNA linear VRT 30-JUL-2006
DEFINITION Dipsosaurus dorsalis muscle-type (A4) lactate dehydrogenase mRNA,
partial cds.
ACCESSION DQ793218
VERSION DQ793218.1 GI:110677269
KEYWORDS .
SOURCE Dipsosaurus dorsalis
ORGANISM Dipsosaurus dorsalis
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Lepidosauria; Squamata; Iguania; Iguanidae; Iguaninae; Dipsosaurus.
REFERENCE 1 (bases 1 to 944)
AUTHORS Fields,P.A.
TITLE Muscle-type lactate dehydrogenase of the desert iguana, Dipsosaurus
dorsalis
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 944)
AUTHORS Fields,P.A.
TITLE Direct Submission
JOURNAL Submitted (22-JUN-2006) Biology, Franklin and Marshall College,
P.O. Box 3003, Lancaster, PA 17604, USA
FEATURES Location/Qualifiers
source 1..944
/organism="Dipsosaurus dorsalis"
/mol_type="mRNA"
/db_xref="taxon:51217"
CDS <1..>944
/note="oxidoreductase"
/codon_start=1
/product="muscle-type (A4) lactate dehydrogenase"
/protein_id="ABG85222.1"
/db_xref="GI:110677270"
/translation="EKLIANVHKEEHPHAHNKITVVGVGAVGMACAISILMKDLADEL
ALVDVVEDKLKGEMLDLQHGSLFLRTPKIVSGKDYAVTAHSKLVIITAGARQQEGESR
LNLVQRNVNIFKFIIPNVVKYSPDCKLLVVSNPVDILTYVAWKISGFPKHRVIGSGCN
LDSARFRHLMGERLGIHPLSCHGWIVGEHGDSSVPVWSGVNVAGVSLKGLHPDMGTDA
DKEKWKEVHKQVVDSAYEVIKLKGYTSWAIGLSVADLAETIMKNLRRVHPVSTMVKGM
HGINDDVFLSVPCVLGYSGITDVVKMTLKSEEEDKLRK"
ORIGIN
1 gagaagctga tcgcaaatgt tcacaaagaa gaacatcccc atgcccataa caagatcact
61 gtggttgggg ttggtgcagt tggcatggct tgtgccataa gtattttgat gaaggacttg
121 gctgatgaac ttgcccttgt tgatgtcgtg gaggacaagc tgaagggaga gatgttggac
181 ctccagcatg gcagcctctt tctcaggaca ccaaagattg tctctggcaa agactatgct
241 gtgactgcac actccaagct ggtcatcatc actgctggag cccgccagca agaaggcgag
301 tctcgcctca acctggttca acgcaatgtg aacatcttca aattcatcat tcccaatgta
361 gtcaagtaca gccccgactg caagctgctt gttgtttcca acccagtgga tattctgacc
421 tacgtggcat ggaagatcag cggcttccct aagcaccgtg ttattggaag tggttgcaat
481 ctggattctg cccgtttccg ccacctgatg ggtgaaagac tgggcatcca ccctctcagc
541 tgccatggct ggattgttgg ggagcacgga gactccagtg tgcccgtgtg gagtggtgtg
601 aatgttgctg gtgtctctct gaaaggctta catcctgaca tggggactga tgctgacaag
661 gagaaatgga aagaggtcca caagcaggtg gtagacagtg cttatgaggt tatcaagctg
721 aagggataca catcttgggc aattggtctc tccgtggctg atctagctga aacaatcatg
781 aagaatttac gaagagtgca tccagtttct acaatggtga agggcatgca tggaattaat
841 gatgatgtct tcctcagtgt tccttgtgtc ctgggctact caggaattac agatgttgtg
901 aaaatgactc tgaagagtga agaggaagat aaattgagga agag
//
- Information about species
- State of our knowledge
- For some organisms, we know individual gene sequences, but not the complete genome
- Most sequence data is for individual genes in isolation
- For some organisms, we know the complete genome, but not (all of) the individual gene sequences
- E.g., the Human Genome Project
- For some organisms, we know the complete genome, and we know the location of each gene within the genome
- E.g., Pyrococcus furiosus, NCBI accession number NC_003413
- A single-celled organism that likes to live in boiling hot water (90-100 C)
Sequence Alignment
- Sequence alignment
- Find the closest possible match between two sequences
- Can be done for DNA sequences or protein sequences
- For now we will focus on DNA sequences
- Two example DNA sequences: the muscle-type lactate dehydrogenase genes of --
 |
|
 |
| Marine iguana, A. cristatus |
|
Desert iguana, D. dorsalis |
| Copyright © Misty McPhee |
|
Copyright © Tanya Dewey |
| Images from the University of Michigan Museum of Zoology Animal Diversity Web |
- Global alignment
- Find the closest possible match between the entire first sequence and the entire second sequence
- Needleman-Wunsch algorithm (1970), a dynamic programming algorithm
- Example: Global alignment between A. cristatus and D. dorsalis muscle-type lactate dehydrogenase genes
- Local alignment
- Find the closest possible match between a subsequence of one and a subsequence of the other
- Smith-Waterman algorithm (1981), another dynamic programming algorithm
- Example: Local alignment between A. cristatus and D. dorsalis muscle-type lactate dehydrogenase genes
- Gapped vs. ungapped alignment
- Gapped alignment
- Can insert gaps (-) between bases to improve the overall alignment score
- Gaps might indicate deletion of genetic material during the course of evolution
- Gaps might also indicate a nonsensical alignment
- Ungapped alignment
- Will never insert gaps between bases
- Pairwise vs. multiple alignment
- Pairwise alignment -- alignment of just two sequences
- Multiple alignment -- alignment of more than two sequences
- Multiple alignment is much harder
- Sequence alignment algorithms is a whole field of study in its own right
- Here we are only interested in aligning sequences as a precursor to constructing phylogenetic trees
Phylogeny
- Phylogeny (from the Greek: "tribe-origin") of a set of organisms --
- Shows how related organisms are grouped together
- Shows how those relationships came about during the course of evolution
- Phylogenetic tree -- depicts a phylogeny as a tree
- "Tree" in the graph theoretic sense
- A small phylogenetic tree:
- Chimpanzees and humans evolved from a common ancestor sometime in the past
- Gorillas and the chimp-human ancestor in turn evolved from a common ancestor even earlier
- And so on
|
|
 |
- A large phylogenetic tree (click image for further information):
O. Bininda-Emonds, M. Cardillo, K. Jones, R. MacPhee, R. Beck, R. Grenyer, S. Price, R. Vos, J. Gittleman, and A. Purvis.
The delayed rise of present-day mammals.
Nature,
446:507-512, March 29, 2007.
- This is a "supertree" of 4,510 of the 4,554 extant mammalian species
- Only the top part of the tree is shown, down to the level of families rather than species
- A supertree is formed by combining smaller trees together
- Tree of Life Web Project -- http://tolweb.org/tree/
- Bifurcating vs. multifurcating trees
- Bifurcating tree: Each interior node has exactly two descendents (a "binary tree" in CS parlance)
- Multifurcating tree: Each interior node has two or more descendents
- Most phylogeny methods generate bifurcating trees
- Rooted vs. unrooted trees
- Rooted tree: One node is designated as the ultimate ancestor of all the others
- Unrooted tree: There is no such designated node
- The above examples are all rooted bifurcating trees
Phylogenetic Tree Construction
- Biologists used to construct phylogenies based on the organisms' physical characteristics (phenotypes)
- Now, biologists construct phylogenies based on the organisms' DNA and protein sequences (genotypes)
- Input to the tree construction process:
- Multiple alignment of the organisms' DNA sequences
- Output from the tree construction process:
- Phylogenetic tree
- Additional information depending on the algorithm
- Tree construction methods:
- Parsimony methods
- Each interior node in the tree represents a point where the two descendent sequences diverged from each other
- A certain number of "state changes" of the characters in the sequences have to occur when going from the root of the tree to the branches
- A parsimony method finds the tree with the fewest total number of state changes
- Minimum state changes = Maximum parsimony
- Distance methods
- Assign a distance between each pair of sequences
- For example, based on the number of differing bases
- For example, based on the amount of time that would have to elapse to account for the differences between the sequences under some model for the rate at which mutations occur ("molecular clock")
- Assign a "branch length" to each tree branch representing the distance from the ancestor to the descendents
- Find the tree topology and branch lengths that most closely (in a least-squares sense) reproduce the pairwise inter-sequence distances
- Likelihood methods
- Assume some model for the probability of one base changing into another base
- Then given a tree, you can compute the probability that the actually observed sequences would occur
- Define the "likelihood" of the tree to be this probability
- Find the tree with the maximum likelihood
Maximum Parsimony Method
- Input: A list of N sequences
- Algorithm:
For every possible rooted bifurcating tree of N sequences:
Compute the number of state changes
Output the tree with the fewest state changes
- Example: Number of state changes needed for a tree with four sequences
- Four state changes are needed
- A possible evolutionary history:
- Root ancestor starts as AACT
- Root splits into AACT and CACT, one state change
- AACT splits again into AACT and AGCT, one state change
- CACT splits into CGCT and CAGT, two state changes
- The maximum parsimony method depends only on the number of state changes, not on the detailed evolutionary history
- The problem with this deceptively simple algorithm:
- The number of possible rooted bifurcating trees explodes as N increases
- There is one possible tree for N = 2:
- There are three possible trees for N = 3:
- There are 15 possible trees for N = 4:
- In general, for N sequences, there are (2N - 3)!! possible trees
- !! is the "double factorial" function
- (2N - 3)!! = 1 x 3 x 5 x 7 x . . . x (2N - 3)
- The obligatory table of double-factorial values
| N | | (2N - 3)!! | | Base 2 |
| 2 | | 1.00x100 | | 1.00x20 |
| 3 | | 3.00x100 | | 1.50x21 |
| 4 | | 1.50x101 | | 1.88x23 |
| 5 | | 1.05x102 | | 1.64x26 |
| 6 | | 9.45x102 | | 1.85x29 |
| 7 | | 1.04x104 | | 1.27x213 |
| 8 | | 1.35x105 | | 1.03x217 |
| 9 | | 2.03x106 | | 1.93x220 |
| 10 | | 3.45x107 | | 1.03x225 |
| 11 | | 6.55x108 | | 1.22x229 |
| 12 | | 1.37x1010 | | 1.60x233 |
| 13 | | 3.16x1011 | | 1.15x238 |
| 14 | | 7.91x1012 | | 1.80x242 |
| 15 | | 2.13x1014 | | 1.52x247 |
| 16 | | 6.19x1015 | | 1.37x252 |
| 17 | | 1.92x1017 | | 1.33x257 |
| 18 | | 6.33x1018 | | 1.37x262 |
| 19 | | 2.22x1020 | | 1.50x267 |
| 20 | | 8.20x1021 | | 1.74x272 |
Parsimony Algorithms
- Exhaustive search (sequential)
- Examine every possible tree
- Easy to program
- Takes too long as N increases
- Branch-and-bound search (sequential)
- Don't examine every possible tree
- Keep track of the best (lowest-scoring) tree so far
- Skip over portions of the search space where none of the trees are better than the best so far
- Harder to program
- But can yield a major reduction in the running time, depending on the input data and the order in which the search examines trees
- Parallel branch-and-bound search
- Same as branch-and-bound, except examine multiple portions of the search space simultaneously
- Can find better bounds faster, again depending on the input data and the order in which the search examines trees
- Still harder to program
|
Parallel Computing II
|
|
•
|
|
4003-532-70/4005-736-70
|
|
•
|
|
Spring Quarter 2007
|
|
Course Page
|
|
Alan Kaminsky
|
|
•
|
|
Department of Computer Science
|
|
•
|
|
Rochester Institute of Technology
|
|
•
|
|
4486 +
2220 =
6706
|
|
Home Page
|
Copyright © 2007 Alan Kaminsky.
All rights reserved.
Last updated 22-Mar-2007.
Please send comments to ark@cs.rit.edu.
|