4003-532/4005-736 Parallel Computing II
Module 3 Lecture Notes -- Phylogenetic Trees
Branch-and-Bound Search
Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science
Combinatorial Search
- Configuration
- A partial or complete solution to the problem being solved
- Here: A partial or complete phylogenetic tree
- Search space
- The directed graph of all possible configurations (graph vertices)
- Here: A digraph of all possible phylogenetic trees
- An edge from configuration X to configuration Y if Y is derived from X
- Here: An edge from phylogenetic tree X to phylogenetic tree Y if adding a species to phylogenetic tree X in a certain spot yields phylogenetic tree Y
- Branching factor: The number of edges coming out of a configuration
- The search space may be a search tree
- Only one path from any configuration to any other configuration
- Here: The phylogeny search space is a tree
- Or the search space may be a digraph but not a tree
- Multiple paths from some configuration to some other configuration
- May be acyclic or cyclic
- Search space traversal
- Beginning with some initial configuration, visit every configuration in the search space
- Visit every configuration just once
- Goal of the search space traversal
- Determine whether a specified configuration exists in the search space
- Or, find the configuration(s) that are "best" according to a specified criterion
- Here: Find the phylogenetic tree(s) with the smallest parsimony score
- Traversal technique: Breadth-first search
- Revolves around a queue of configurations
- Pro: A simple iterative algorithm, easy to program
- Con: If there is a large branching factor (often the case), the queue length explodes and the program runs out of memory
- Traversal technique: Depth-first search
- Revolves around a stack of configurations
- Pro: Provided the search space depth is not large (often the case), the stack length (proportional to the search space depth) stays manageable
- Con: A recursive algorithm, somewhat harder to program
- Non-tree search space
- If the search space is not a tree, we must detect if we have visited a certain configuration before, so as not to process it again
- Requires keeping a collection of already-seen configurations
- Con: This eats up memory
- Con: This eats up CPU time, although a collection searchable in O(1) time, such as a hash table, speeds it up
- Here: The phylogeny search space is a tree, so we don't have to worry about this issue
- Search strategy: Exhaustive search
- Always visit every configuration in the search space
- Pro: Easy to program
- Pro: Guaranteed to find the solution
- Con: Takes too long as the problem size increases
- Search strategy: Branch-and-bound search
- A variant of exhaustive search applicable when searching for a "best" solution
- Do not search parts of the search space that we know cannot possibly contain the solution
- Pro: Still guaranteed to find the solution
- Pro: Can reduce the running time to find the solution
- Con: Does not necessarily reduce the running time
- Con: A bit harder to program
- Search strategy: Approximate search or heuristic search
- Do not visit every configuration in the search space
- Try to visit portions of the search space such that the solution found in those portions is close to the true best solution
- Examples: Genetic algorithm, simulated annealing, tabu search
- Pro: Usually takes much less time than exhaustive search or branch-and-bound search
- Con: Not guaranteed to find the true best solution
- Con: Often you have no idea how far the solution it finds is from the true best solution (sometimes you can prove an upper bound)
- Pro: But the solution you find may be "good enough" in a practical application
- Biologists insist on finding the guaranteed most parsimonious phylogenetic tree(s), they will not tolerate an approximation
Exhaustive Search Program
- Approach
- Input S DNA sequences
- Excise uninformative sites
- Generate all (2S - 3)!! phylogenetic trees
- Compute parsimony score (number of state changes) of each tree using the Fitch algorithm
- Output tree(s) with the best (smallest) score
- Tree generation algorithm
- Depth-first search of the space of phylogenetic trees
- Package edu.rit.phyl.pars
- Running time measurements
Branch-and-Bound Search Program
- Approach
- Input S DNA sequences
- Excise uninformative sites
- Shuffle sequences (more on this later)
- Generate phylogenetic trees using a recursive depth-first search algorithm
- Compute parsimony score (number of state changes) of each tree using the Fitch algorithm
- If a phylogenetic tree's parsimony score is greater than the best (smallest) score found so far, prune the search
- Keep a list of the tree(s) with the best (smallest) score
- When finished searching, output the list
- Package edu.rit.phyl.pars
- Running time measurements
Branch-and-Bound Search Order
- The order in which configurations are visited does not affect the solution the branch-and-bound algorithm finds
- The order in which configurations are visited can affect the branch-and-bound algorithm's running time
- You want to find a close approximation to the best solution early on
- Then branch-and-bound will prune more of the search space, reducing the running time
- There is no theoretical guidance on what a good search order is
- For parsimony, here's a strategy that has been found to work well:
- Shuffle the input DNA sequences so that far-apart sequences are added to the tree first, then nearer-together sequences are added
- Distance between two sequences = Hamming distance = number of sites that differ
- With DNA sequences in this order, the first complete tree generated by the depth-first search makes the sequences' tree distances mirror the sequences' Hamming distances
- This is a good approximation to the true most parsimonious tree (one hopes)
- Class DnaSequenceList, method shuffleDescendingOrder()
- Package edu.rit.phyl.pars
- Running time measurements
SMP Parallel Branch-and-Bound Program
- Design
- Basically the same as the sequential B-and-B program
- Shared global variable for best score
- Each thread traverses the complete search tree down to a threshold depth, D
- For each configuration at level D:
- Only the first thread to reach that configuration traverses below that configuration
- The other threads backtrack when they reach that configuration
- Rationale
- This is designed mainly to achieve load balancing
- The amount of work below a certain configuration varies, because of the B-and-B pruning
- Therefore, a static division of the search space among the threads will not achieve a balanced load
- Instead, the threads iterate over the configurations at level D using what amounts to a dynamic schedule
- Choose D small enough so that traversing the complete search tree down to level D only takes a fraction of a second
- Choose D large enough for there to be plenty of "chunks" (configurations) to balance the load
- D = 6 or 7 works well -- there are 945 or 10,395 configurations at level 6 or 7
- Package edu.rit.phyl.pars
- Running time measurements
|
Parallel Computing II
|
|
•
|
|
4003-532-70/4005-736-70
|
|
•
|
|
Spring Quarter 2007
|
|
Course Page
|
|
Alan Kaminsky
|
|
•
|
|
Department of Computer Science
|
|
•
|
|
Rochester Institute of Technology
|
|
•
|
|
4486 +
2220 =
6706
|
|
Home Page
|
Copyright © 2007 Alan Kaminsky.
All rights reserved.
Last updated 13-May-2007.
Please send comments to ark@cs.rit.edu.
|