Alan Kaminsky Department of Computer Science Rochester Institute of Technology 4486 + 2220 = 6706
Home Page
Parallel Computing II 4003-532-70/4005-736-70 Spring Quarter 2007
Course Page

4003-532/4005-736 Parallel Computing II
Module 3 Lecture Notes -- Phylogenetic Trees
Branch-and-Bound Search

Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science


Combinatorial Search

  • Configuration
    • A partial or complete solution to the problem being solved
    • Here: A partial or complete phylogenetic tree
       
  • Search space
    • The directed graph of all possible configurations (graph vertices)
    • Here: A digraph of all possible phylogenetic trees
    • An edge from configuration X to configuration Y if Y is derived from X
    • Here: An edge from phylogenetic tree X to phylogenetic tree Y if adding a species to phylogenetic tree X in a certain spot yields phylogenetic tree Y
    • Branching factor: The number of edges coming out of a configuration
       
  • The search space may be a search tree
    • Only one path from any configuration to any other configuration
    • Here: The phylogeny search space is a tree
       
  • Or the search space may be a digraph but not a tree
    • Multiple paths from some configuration to some other configuration
    • May be acyclic or cyclic
       
  • Search space traversal
    • Beginning with some initial configuration, visit every configuration in the search space
    • Visit every configuration just once
       
  • Goal of the search space traversal
    • Determine whether a specified configuration exists in the search space
    • Or, find the configuration(s) that are "best" according to a specified criterion
    • Here: Find the phylogenetic tree(s) with the smallest parsimony score
       
  • Traversal technique: Breadth-first search
    • Revolves around a queue of configurations
    • Pro: A simple iterative algorithm, easy to program
    • Con: If there is a large branching factor (often the case), the queue length explodes and the program runs out of memory
       
  • Traversal technique: Depth-first search
    • Revolves around a stack of configurations
    • Pro: Provided the search space depth is not large (often the case), the stack length (proportional to the search space depth) stays manageable
    • Con: A recursive algorithm, somewhat harder to program
       
  • Non-tree search space
    • If the search space is not a tree, we must detect if we have visited a certain configuration before, so as not to process it again
    • Requires keeping a collection of already-seen configurations
    • Con: This eats up memory
    • Con: This eats up CPU time, although a collection searchable in O(1) time, such as a hash table, speeds it up
    • Here: The phylogeny search space is a tree, so we don't have to worry about this issue
       
  • Search strategy: Exhaustive search
    • Always visit every configuration in the search space
    • Pro: Easy to program
    • Pro: Guaranteed to find the solution
    • Con: Takes too long as the problem size increases
       
  • Search strategy: Branch-and-bound search
    • A variant of exhaustive search applicable when searching for a "best" solution
    • Do not search parts of the search space that we know cannot possibly contain the solution
    • Pro: Still guaranteed to find the solution
    • Pro: Can reduce the running time to find the solution
    • Con: Does not necessarily reduce the running time
    • Con: A bit harder to program
       
  • Search strategy: Approximate search or heuristic search
    • Do not visit every configuration in the search space
    • Try to visit portions of the search space such that the solution found in those portions is close to the true best solution
    • Examples: Genetic algorithm, simulated annealing, tabu search
    • Pro: Usually takes much less time than exhaustive search or branch-and-bound search
    • Con: Not guaranteed to find the true best solution
    • Con: Often you have no idea how far the solution it finds is from the true best solution (sometimes you can prove an upper bound)
    • Pro: But the solution you find may be "good enough" in a practical application
       
  • Biologists insist on finding the guaranteed most parsimonious phylogenetic tree(s), they will not tolerate an approximation


Exhaustive Search Program

  • Approach
    • Input S DNA sequences
    • Excise uninformative sites
    • Generate all (2S - 3)!! phylogenetic trees
    • Compute parsimony score (number of state changes) of each tree using the Fitch algorithm
    • Output tree(s) with the best (smallest) score
       
  • Tree generation algorithm
    • Depth-first search of the space of phylogenetic trees
       
  • Package edu.rit.phyl.pars
  • Running time measurements
    • Input: 18-species iguana data set -- iguana18.phy
    • Running time for the first S species on the tardis.cs.rit.edu machine (one run)
          Exhaust.
       S  T (msec)
      --  --------
       3       148
       4       141
       5       153
       6       184
       7       344
       8       688
       9      4645
      10     82676
      11   1926761
      


Branch-and-Bound Search Program

  • Approach
    • Input S DNA sequences
    • Excise uninformative sites
    • Shuffle sequences (more on this later)
    • Generate phylogenetic trees using a recursive depth-first search algorithm
    • Compute parsimony score (number of state changes) of each tree using the Fitch algorithm
    • If a phylogenetic tree's parsimony score is greater than the best (smallest) score found so far, prune the search
    • Keep a list of the tree(s) with the best (smallest) score
    • When finished searching, output the list
       
  • Package edu.rit.phyl.pars
  • Running time measurements
    • Input: 18-species iguana data set -- iguana18.phy
    • Running time for the first S species on the tardis.cs.rit.edu machine (one run)
             Exhaust.   B-and-B
       S     T (msec)   T (msec)
      --     --------   --------
       3          148        169
       4          141        157
       5          153        187
       6          184        182
       7          344        271
       8          688        332
       9         4645        526
      10        82676       1276
      11      1926761      12711
      12     40461981*   2693930
      13    930625563*   3017496
      14  23265639075*  63596079
      
      *predicted


Branch-and-Bound Search Order

  • The order in which configurations are visited does not affect the solution the branch-and-bound algorithm finds
     
  • The order in which configurations are visited can affect the branch-and-bound algorithm's running time
     
  • You want to find a close approximation to the best solution early on
    • Then branch-and-bound will prune more of the search space, reducing the running time
       
  • There is no theoretical guidance on what a good search order is
     
  • For parsimony, here's a strategy that has been found to work well:
    • Shuffle the input DNA sequences so that far-apart sequences are added to the tree first, then nearer-together sequences are added
    • Distance between two sequences = Hamming distance = number of sites that differ
    • With DNA sequences in this order, the first complete tree generated by the depth-first search makes the sequences' tree distances mirror the sequences' Hamming distances
    • This is a good approximation to the true most parsimonious tree (one hopes)
       
  • Class DnaSequenceList, method shuffleDescendingOrder()
     
  • Package edu.rit.phyl.pars
  • Running time measurements
    • Input: 18-species iguana data set -- iguana18.phy
    • Running time for the first S species on the tardis.cs.rit.edu machine (one run)
    • Compare running time with shuffling the input sequences to running time without shuffling the input sequences
                        B-and-B    B-and-B
             Exhaust.   Shuffled   Unshuff.
       S     T (msec)   T (msec)   T (msec)
      --     --------   --------   --------
       3          148        169        173
       4          141        157        141
       5          153        187        146
       6          184        182        158
       7          344        271        233
       8          688        332        318
       9         4645        526        413
      10        82676       1276       6866
      11      1926761      12711      98692
      12     40461981*   2693930    2672954
      13    930625563*   3017496   15710305
      14  23265639075*  63596079
      
      *predicted


SMP Parallel Branch-and-Bound Program

  • Design
    • Basically the same as the sequential B-and-B program
    • Shared global variable for best score
    • Each thread traverses the complete search tree down to a threshold depth, D
    • For each configuration at level D:
    • Only the first thread to reach that configuration traverses below that configuration
    • The other threads backtrack when they reach that configuration
       
  • Rationale
    • This is designed mainly to achieve load balancing
    • The amount of work below a certain configuration varies, because of the B-and-B pruning
    • Therefore, a static division of the search space among the threads will not achieve a balanced load
    • Instead, the threads iterate over the configurations at level D using what amounts to a dynamic schedule
    • Choose D small enough so that traversing the complete search tree down to level D only takes a fraction of a second
    • Choose D large enough for there to be plenty of "chunks" (configurations) to balance the load
    • D = 6 or 7 works well -- there are 945 or 10,395 configurations at level 6 or 7
       
  • Package edu.rit.phyl.pars
  • Running time measurements
    • Input: 18-species iguana data set -- iguana18.phy
    • Running time for the first S = 11 and 12 species on the tardis.cs.rit.edu machine (median of 7 runs)
    • Parallel search threshold D = 6 and 7
       S   D    K  T (msec)  Spdup  Effi.   EDSF
      --  --  ---  --------  -----  -----  -----
      11   6  seq     11976
      11   6    1     12226  0.980  0.980
      11   6    2      6610  1.812  0.906  0.081
      11   6    3      4655  2.573  0.858  0.071
      11   6    4      3965  3.020  0.755  0.099
      
      11   7  seq     12180
      11   7    1     12455  0.978  0.978
      11   7    2      7121  1.710  0.855  0.143
      11   7    3      5121  2.378  0.793  0.117
      11   7    4      4143  2.940  0.735  0.110
      
      12   6  seq   2610182
      12   6    1   2604162  1.002  1.002
      12   6    2   1468328  1.778  0.889  0.128
      12   6    3    926611  2.817  0.939  0.034
      12   6    4    708720  3.683  0.921  0.030
      
      12   7  seq   2613847
      12   7    1   2687077  0.973  0.973
      12   7    2   1444752  1.809  0.905  0.075
      12   7    3    959342  2.725  0.908  0.036
      12   7    4    691847  3.778  0.945  0.010
      

Parallel Computing II 4003-532-70/4005-736-70 Spring Quarter 2007
Course Page
Alan Kaminsky Department of Computer Science Rochester Institute of Technology 4486 + 2220 = 6706
Home Page
Copyright © 2007 Alan Kaminsky. All rights reserved. Last updated 13-May-2007. Please send comments to ark­@­cs.rit.edu.