Experimental Project
Common Subsequence Algorithms
VCSG 800,
Spring 2013
The goal of this project is twofold.
Firstly, to observe empirically the complexities of different
implementations of algorithms for the same problem: finding the longest
common subsequence (LCS) of two sequences.
Secondly, to find out how accurate the theoretical estimates of
complexity are when compared to practical execution times.
The project can be completed individually, or by a team of two students.
In the latter case, please let me know by email (with a cc to your partner)
the composition of two-person teams by April 4, 2013.
Submit the source code (hardcopy) of at least one algorithm with
time-measuring routines, a sample profiler run, and at least one
higher-level script, by April 18, 2013. This will be the skeleton
of your full experiment.
The hardcopy of the full report, containing a description and analysis of the
results, and the final source code are due May 9, 2013. The project will be
graded according to the criteria listed in the
gradesheet.
Implement (at least) the following algorithms for the longest
common subsequence problem:
- Naive recursive algorithm (as implied by Theorem 15.1, page 392)
- Recursive algorithm with memoization
- Dynamic programming version of the algorithm
- Quadratic-time linear-space algorithm
In each case the task is to find the length of the longest common
subsequence of two sequences, and generate some or all of them.
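For orientation, here is a minimal sketch in C of the dynamic-programming
version computing only the length of the LCS; the function name, the
row-major table layout, and the toy input in main are illustrative choices,
not a required interface.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Length of the LCS of x and y, using the full (m+1) x (n+1) table c,
       stored row-major in a single block. */
    int lcs_length(const char *x, const char *y)
    {
        int m = (int)strlen(x), n = (int)strlen(y);
        int *c = calloc((size_t)(m + 1) * (n + 1), sizeof *c);
        if (!c) { perror("calloc"); exit(EXIT_FAILURE); }
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++) {
                if (x[i - 1] == y[j - 1])
                    c[i * (n + 1) + j] = c[(i - 1) * (n + 1) + (j - 1)] + 1;
                else {
                    int up   = c[(i - 1) * (n + 1) + j];
                    int left = c[i * (n + 1) + (j - 1)];
                    c[i * (n + 1) + j] = up > left ? up : left;
                }
            }
        int len = c[m * (n + 1) + n];
        free(c);
        return len;
    }

    int main(void)
    {
        printf("%d\n", lcs_length("ACCGGTC", "GTCGTTC"));   /* toy example */
        return 0;
    }

When only the length is needed, the recurrence touches just the current and
the previous row of the table, which is the observation behind the
quadratic-time, linear-space variant listed above.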
Use C, C++, Java, or another programming language for
this project, as long as you can perform fine CPU time measurements
of your experiments.
Make your code simple: minimize user interface and I/O operations,
and concentrate on the comparison of execution times and
memory requirements.
Use small toy examples first, to ensure that all your implementations
are logically correct. Then, the most important inputs for the timing
experiments should be pairs of randomly generated sequences.
For at least one of the algorithms, some pairs of input sequences
must be of length at least 40000 for the case of computing only
the length of the LCS.
- Use alphabets {0,1} and {A,C,G,T}.
- Vary the length and structure of the input sequences;
use some randomly generated sequences.
- Count the number of recursive calls and relate it to CPU time
(a counting sketch appears below, after this list).
- Find the largest inputs you can process in 10 CPU seconds,
for each of the algorithms.
- Address at least one of the exercises 15.4* pages 396/397.
- Compare the performance of your algorithms for small alphabets
versus large alphabets (20 or more characters).
- Use some realistic cases motivated by DNA sequences.
- Solve more exercises from pages 396/397.
- Measure separately the CPU time spent on the computation of the length,
and on the reconstruction of common subsequences.
- Implement a quadratic-time, linear-space nonrecursive algorithm.
- Implement a subquadratic algorithm for LCS by Hirschberg, or another
found in the literature.
- Use Condor to parallelize LCS computations with different weighting
schemes for VERY long sequences.
- Implement and analyze the performance of the dynamic-programming
algorithm computing the edit distance of two sequences, and the edit
operations transforming one into the other. Follow the
suggestions described in Problem 15-5, pages 406-408.
The last two features may easily become the subject of
your final MS project, thesis, and beyond.
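As one possible way to approach the call-counting item in the list above,
the following hedged sketch shows a memoized recursive length computation
with a global counter of recursive invocations; the global variables, the
memo layout, and the toy input are illustrative choices, not a prescribed
design.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static long calls;          /* number of recursive invocations */
    static int *memo;           /* (M+1) x (N+1) table, -1 = not yet computed */
    static int M, N;
    static const char *X, *Y;

    static int lcs_rec(int i, int j)
    {
        calls++;
        if (i == 0 || j == 0)
            return 0;
        int *cell = &memo[i * (N + 1) + j];
        if (*cell >= 0)
            return *cell;                        /* already computed */
        if (X[i - 1] == Y[j - 1])
            *cell = lcs_rec(i - 1, j - 1) + 1;
        else {
            int a = lcs_rec(i - 1, j);
            int b = lcs_rec(i, j - 1);
            *cell = a > b ? a : b;
        }
        return *cell;
    }

    int main(void)
    {
        X = "10010101"; Y = "010110110";         /* toy {0,1} example */
        M = (int)strlen(X); N = (int)strlen(Y);
        memo = malloc((size_t)(M + 1) * (N + 1) * sizeof *memo);
        if (!memo) { perror("malloc"); return EXIT_FAILURE; }
        for (int k = 0; k < (M + 1) * (N + 1); k++)
            memo[k] = -1;
        int len = lcs_rec(M, N);                 /* compute before printing calls */
        printf("length = %d, recursive calls = %ld\n", len, calls);
        free(memo);
        return 0;
    }

Removing the memo lookup and store turns the same routine into the naive
recursive version, so the two call counts can be compared directly on the
same inputs.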
Different algorithms should use the same input sequences
for CPU time comparison.
Design the experiments so that they are informative.
Vary the values of parameters.
You should use scripts to organize your experiments.
Finer time measurements can be obtained by using time system calls
in C, C++, or Java, from inside the program.
Very fine time analysis can be done
with the help of gprof (see the man entries for time
and gprof).
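For instance (a sketch only: the workload below is a stand-in for one of
your LCS routines, and other timers such as getrusage or clock_gettime are
equally reasonable), the standard clock() function from <time.h> reports the
processor time used by the program and can bracket the computation:

    #include <stdio.h>
    #include <time.h>

    /* Stand-in workload; replace with a call to one of your LCS implementations. */
    static long busy_work(void)
    {
        long s = 0;
        for (long i = 0; i < 50000000L; i++)
            s += i % 7;
        return s;
    }

    int main(void)
    {
        clock_t start = clock();                  /* CPU time used so far */
        long result = busy_work();                /* the computation being timed */
        clock_t stop = clock();
        double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
        printf("result = %ld, cpu time = %.6f s\n", result, seconds);
        return 0;
    }

Dividing the tick difference by CLOCKS_PER_SEC converts it to seconds.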
Describe how you organized your experiments.
Comments should be embedded into the source code.
I may ask you for a demonstration in one of our labs.
Tabulate CPU times for the same data for the different
algorithms. Compare them to the theoretical complexity of each
algorithm, and between the algorithms. You can divide the tabulated
times by the values of the complexity function, thus approximating
the "constant" hidden in the O-notation. For example, for a
quadratic-time algorithm the ratio T(n)/n^2 should stay roughly
constant as n grows.
- Some CPU times can be small. In order to observe more accurately
the time spent on the actual execution of the algorithm, generate (or read)
the input once, and then run the algorithm, say, 100 times (to make
the CPU time meaningful, but still small), without any I/O operations;
see the sketch after this list for one way to set this up.
- Be sure to use machines of the same speed for each experiment:
run a simple timing test, and/or check the processor type and speed.
The CS department has machines which, while looking very
similar, run at different speeds.
- For a variety of random number generators see man drand48, and
a simple program showing the usage of
drand48 and time system calls; a similar usage is sketched after this list.
- Different algorithms should produce the same lengths and counts
for the same data, in all implementations.
- Avoid interactive features. Your program does not have to be user friendly.
- Make all your implementations equally
sophisticated; fancy and complicated tricks in only one algorithm
can hide the true result of the algorithm comparison.
- Be creative. You are encouraged to use experiments and reasoning
other than those suggested here.
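To make the random-input and repeated-run points above concrete, here is a
hedged sketch; the sequence length, the number of repetitions, and dummy_lcs
(a placeholder for one of your implementations) are all illustrative. It
seeds drand48 once, generates the input once, and then times 100 runs
without any I/O inside the timed loop.

    #define _XOPEN_SOURCE 500     /* for drand48/srand48 on strict compilers */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Generate a random sequence over the alphabet {A,C,G,T}. */
    static void random_sequence(char *buf, int n)
    {
        const char alphabet[] = "ACGT";
        for (int i = 0; i < n; i++)
            buf[i] = alphabet[(int)(drand48() * 4.0)];
        buf[n] = '\0';
    }

    /* Stand-in for one of your LCS routines (illustrative only). */
    static int dummy_lcs(const char *x, const char *y, int n)
    {
        int same = 0;
        for (int i = 0; i < n; i++)
            same += (x[i] == y[i]);
        return same;
    }

    int main(void)
    {
        enum { LEN = 1000, REPS = 100 };
        char x[LEN + 1], y[LEN + 1];

        srand48((long)time(NULL));            /* seed the generator once */
        random_sequence(x, LEN);              /* generate the input once ...   */
        random_sequence(y, LEN);

        clock_t start = clock();
        int result = 0;
        for (int r = 0; r < REPS; r++)        /* ... then run the algorithm 100 times */
            result = dummy_lcs(x, y, LEN);
        clock_t stop = clock();

        printf("result = %d, cpu time per run = %.6f s\n",
               result, (double)(stop - start) / CLOCKS_PER_SEC / REPS);
        return 0;
    }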
References
- Our textbook, Chapter 15.
- D. S. Hirschberg, A Linear Space Algorithm for Computing Maximal
Common Subsequences, Communications of the ACM 18 (June 1975), 341-343.
(hardcopy distributed to everybody)
- D. S. Hirschberg, Algorithms for the Longest Common Subsequence
Problem, Journal of the ACM 24 (October 1977), 664-675.
(if you wish, ask for a hardcopy from spr)
- Website by Christian Charras and Thierry Lecroq on sequence comparison.
(if you wish, ask for a hardcopy from spr)