Common Subsequence Algorithms

CSCI-665, Spring 2017

The goal of this project is twofold. First, to observe empirically the complexity of different implementations of algorithms for the same problem: finding the longest common subsequence of two sequences. Second, to find out how accurate the theoretical estimates of complexity are when compared to practical execution times.

The project can be completed individually or by a team of two students.
In the latter case, please let me know by email (with a cc to your partner)
the composition of your two-person team by Thursday, **March 9**, 2017.

Submit the source code (hardcopy) of at least one algorithm with time
measuring routines, a sample profiler run, and at least one
higher-level script, by Tuesday, **April 11**, 2017. This will be the skeleton
of your full experiment.

The hardcopy of the full report, containing a description and analysis of the results,
and the final source code are due Tuesday, **May 9**, 2017. Note that in order to
write a meaningful report, the experiments should be completed, say, a week
earlier. The project will be graded according to the criteria listed in the
gradesheet.

Implement (at least) the following algorithms for the longest common subsequence problem:

- Naive recursive algorithm (as implied by Theorem 15.1, page 392)
- Recursive algorithm with memoization
- Dynamic programming version of the algorithm
- Quadratic-time linear-space algorithm
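
As a rough illustration of the first two items, here is a minimal sketch of the naive recursion and its memoized variant, computing only the LCS length. It is written in Python for brevity; the function names `lcs_naive` and `lcs_memo` are made up for this example, and your own implementation may look quite different.

```python
def lcs_naive(x, y, i=None, j=None):
    """Length of an LCS of x[:i] and y[:j] by direct recursion (exponential time)."""
    if i is None:
        i, j = len(x), len(y)
    if i == 0 or j == 0:
        return 0
    if x[i - 1] == y[j - 1]:
        return lcs_naive(x, y, i - 1, j - 1) + 1
    return max(lcs_naive(x, y, i - 1, j), lcs_naive(x, y, i, j - 1))


def lcs_memo(x, y):
    """Same recursion, but each (i, j) subproblem is solved at most once."""
    memo = {}

    def go(i, j):
        if i == 0 or j == 0:
            return 0
        key = (i, j)
        if key not in memo:
            if x[i - 1] == y[j - 1]:
                memo[key] = go(i - 1, j - 1) + 1
            else:
                memo[key] = max(go(i - 1, j), go(i, j - 1))
        return memo[key]

    # Recursion depth can reach len(x) + len(y), so this sketch is only
    # suitable for short inputs under Python's default recursion limit.
    return go(len(x), len(y))


if __name__ == "__main__":
    print(lcs_naive("ABCBDAB", "BDCABA"), lcs_memo("ABCBDAB", "BDCABA"))
```

Both functions should agree on every input; only the number of recursive calls differs.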

In each case the task is to find the length of the longest common
subsequence of two sequences, and to generate some or all of the
longest common subsequences.
Use *C*, *C++*, Java or another programming language for
this project, as long as you can perform fine-grained CPU time measurements
of your experiments.
Keep your code simple and minimize the user interface and i/o operations;
concentrate on the comparison of execution times and
memory requirements.
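
To make "find the length and generate some of them" concrete, the following is a minimal sketch of the table-based dynamic programming version that also walks the table backwards to recover one LCS. Python is used only for illustration, and the names `lcs_table` and `one_lcs` are hypothetical.

```python
def lcs_table(x, y):
    """Fill the (len(x)+1) x (len(y)+1) table of LCS lengths of prefixes."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def one_lcs(x, y):
    """Recover one longest common subsequence by tracing the table backwards."""
    c = lcs_table(x, y)
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # ties are broken by moving up; other choices give other LCSs
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(one_lcs("ABCBDAB", "BDCABA"))
```

The table needs quadratic space, which is what limits this version on long inputs.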

Use small toy examples first, to ensure that all your implementations are logically correct. Then, the most important input for running timing experiments should be pairs of randomly generated sequences. Some pairs of input sequences, for at least one of the algorithms, must be of length at least 40000 for the case of computing only the length of the LCS.
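
One possible way to produce reproducible random input pairs is sketched below. It assumes Python's standard `random` module with an explicit seed, so that every algorithm can be run on exactly the same data; the helper names `random_sequence` and `random_pair` are invented for this example.

```python
import random


def random_sequence(length, alphabet, rng):
    """Draw `length` symbols uniformly and independently from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def random_pair(length, alphabet, seed):
    """Return two random sequences; the same seed always gives the same pair."""
    rng = random.Random(seed)
    return (random_sequence(length, alphabet, rng),
            random_sequence(length, alphabet, rng))


if __name__ == "__main__":
    x, y = random_pair(20, "ACGT", seed=2017)
    print(x)
    print(y)
```

Recording the seed together with each timing result makes an experiment repeatable later.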

- Use alphabets {0,1} and {A,C,G,T}.
- Vary the length and structure of input sequences; use some randomly generated sequences.
- Count the number of recursive calls and relate it to CPU time.
- Find the largest inputs you can process in 10 CPU seconds, for each of the algorithms.
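
Counting recursive calls alongside CPU time could be organized roughly as follows. This is only a sketch: it assumes Python's `time.process_time` as the CPU clock, and `lcs_naive_counted` and `measure` are hypothetical names.

```python
import time


def lcs_naive_counted(x, y):
    """Naive recursive LCS length; also returns the number of recursive calls."""
    calls = 0

    def go(i, j):
        nonlocal calls
        calls += 1
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return go(i - 1, j - 1) + 1
        return max(go(i - 1, j), go(i, j - 1))

    length = go(len(x), len(y))
    return length, calls


def measure(x, y):
    """Return (length, calls, cpu_seconds) for one run of the naive algorithm."""
    start = time.process_time()
    length, calls = lcs_naive_counted(x, y)
    elapsed = time.process_time() - start
    return length, calls, elapsed


if __name__ == "__main__":
    print(measure("ABCBDAB", "BDCABA"))
```

Plotting `calls` against `cpu_seconds` over many inputs is one way to relate the two quantities.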

- Compare the performance of your algorithms for small alphabets versus large alphabets (20 or more characters).
- Use some realistic cases motivated by DNA sequences.
- Solve more exercises from pages 396/397.
- Measure separately CPU time spent on the computation of length, and the reconstruction of common subsequences.
- Implement a quadratic-time, linear-space, nonrecursive algorithm.
- Implement a subquadratic algorithm for LCS by Hirschberg, or another one found in the literature.
- Implement and analyze the performance of the dynamic programming algorithm computing the edit distance of two sequences, and the edit operations transforming one into the other. Follow the suggestions described in Problem 15-5, pages 406-408.
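
For the quadratic-time, linear-space, nonrecursive item, a minimal sketch is given below. It keeps only two rows of the dynamic programming table and therefore returns only the length; it is not Hirschberg's full algorithm, which also recovers an LCS in linear space. The name `lcs_length_linear_space` is invented for this example.

```python
def lcs_length_linear_space(x, y):
    """LCS length in O(len(x) * len(y)) time and O(min(len(x), len(y))) space."""
    if len(y) > len(x):
        x, y = y, x  # make y the shorter sequence so the rows stay short
    prev = [0] * (len(y) + 1)  # row for the previous prefix of x
    for xi in x:
        curr = [0]  # column 0: LCS with the empty prefix of y
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


if __name__ == "__main__":
    print(lcs_length_linear_space("ABCBDAB", "BDCABA"))
```

Because no full table is stored, this variant is a natural candidate for the longest inputs, although an interpreted language may be too slow for them.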

The last two features may easily become the subject of your final MS project, thesis, and beyond.

Different algorithms should use the same input sequences for CPU time comparison. Design the experiments so that they are informative. Vary the values of parameters.

You should use scripts to organize your experiments.
Finer time measurements can be obtained by using *time system calls*
in *C*, *C++*, or Java from inside the program.
Very fine time analysis can be done with the help of *gprof*
(see the *man* entries for *time* and *gprof*).
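
Measuring CPU time from inside the program might look like the sketch below. It assumes Python's `time.process_time`, which reports CPU time of the current process rather than wall-clock time; in *C* the corresponding calls are different. The helper `cpu_time_per_run` is a made-up name.

```python
import time


def cpu_time_per_run(func, args, repeats=1):
    """Average CPU seconds per call of func(*args) over `repeats` calls."""
    start = time.process_time()
    for _ in range(repeats):
        func(*args)
    return (time.process_time() - start) / repeats


if __name__ == "__main__":
    # Toy workload, used only to show how the helper is called.
    print(cpu_time_per_run(sorted, (list(range(100000, 0, -1)),), repeats=20))
```

Input generation is kept outside the timed region, so only the algorithm itself is measured.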

Describe how you organized your experiments. Comments should be embedded into the source code. I may ask you for a demonstration in one of our labs.

Tabulate CPU times for the same data for different algorithms. Compare them to the theoretical complexity of each algorithm, and between the algorithms. You can divide the tabulated times by the values of the complexity function, and thus approximate the "constant" hidden in the O-notation.
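
Dividing measured times by the complexity function can be sketched as follows, here for a quadratic-time algorithm on two sequences of equal length n, so the divisor is n*n. The code measures its own times at run time rather than using any prepared numbers; `lcs_length_dp` and `estimate_constants` are hypothetical names, and Python is used only for illustration.

```python
import random
import time


def lcs_length_dp(x, y):
    """Quadratic-time LCS length (two-row dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def estimate_constants(sizes, alphabet="ACGT", seed=1):
    """For each n in sizes, return (n, cpu_seconds, cpu_seconds / n**2)."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        x = "".join(rng.choice(alphabet) for _ in range(n))
        y = "".join(rng.choice(alphabet) for _ in range(n))
        start = time.process_time()
        lcs_length_dp(x, y)
        t = time.process_time() - start
        rows.append((n, t, t / (n * n)))
    return rows


if __name__ == "__main__":
    for n, t, c in estimate_constants([200, 400, 800]):
        print(n, t, c)
```

If the theoretical estimate is accurate, the last column should settle near a constant as n grows; a clear drift suggests that lower-order terms or machine effects matter.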

- Some CPU times can be small. In order to observe more accurately the time spent on the actual execution of the algorithm, generate (or read) input once, and then run the algorithm, say, 100 times (to make the CPU time meaningful, but still small), without any i/o operations.
- Be sure to use machines of the same speed for each experiment: run a simple timing test, and/or check the processor type and speed. The CS department has machines that, while looking very similar, run at different speeds.
- For a variety of random number generators see *man drand48*, and a simple program showing the usage of *drand48* and *time system calls*.
- Different algorithms should produce the same lengths and counts for the same data, in all implementations.
- Avoid interactive features. Your program does not have to be user friendly.
- Make all your implementations equally sophisticated; fancy and complicated tricks in only one algorithm can hide the true result of algorithm comparison.
- Be creative. You are encouraged to use other experiments/reasonings than those suggested here.
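
The requirement that different algorithms produce the same lengths on the same data can be checked automatically before any timing is done. The sketch below compares a memoized recursion against a two-row dynamic programming version on small random inputs; all names are invented for this example, and the same idea carries over to whatever implementations you write.

```python
import random


def lcs_memo(x, y):
    """Memoized recursive LCS length (short inputs only, recursion-limited)."""
    memo = {}

    def go(i, j):
        if i == 0 or j == 0:
            return 0
        if (i, j) not in memo:
            if x[i - 1] == y[j - 1]:
                memo[(i, j)] = go(i - 1, j - 1) + 1
            else:
                memo[(i, j)] = max(go(i - 1, j), go(i, j - 1))
        return memo[(i, j)]

    return go(len(x), len(y))


def lcs_dp(x, y):
    """Two-row dynamic programming LCS length."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def find_mismatches(trials, max_len, alphabet, seed):
    """Return the input pairs on which the two implementations disagree."""
    rng = random.Random(seed)
    bad = []
    for _ in range(trials):
        x = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
        y = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
        if lcs_memo(x, y) != lcs_dp(x, y):
            bad.append((x, y))
    return bad


if __name__ == "__main__":
    print(find_mismatches(trials=200, max_len=30, alphabet="01", seed=7))
```

An empty list means the two versions agreed on every generated pair; any reported pair is a small counterexample to debug with.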

- Our textbook, chapter 15.
- D. S. Hirschberg, A Linear Space Algorithm for Computing Maximal Common Subsequences, Communications of the ACM 18 (June 1975) 341-343. (hardcopy distributed to everybody)
- D. S. Hirschberg, Algorithms for the Longest Common Subsequence Problem, Journal of the ACM, 24 (October 1977) 664-675. (if you wish, ask for a hardcopy from spr)
- Website by Christian Charras and Thierry Lecroq on sequence comparison.