Research Programming Guidelines

Richard Zanibbi, November 2011 (first draft)

This document summarizes my views on how to write code for research projects not concerned with large-scale software development. It is based on my experience writing code for my own as well as other people's research projects for more than a decade, in a number of different programming languages, development environments, and computing platforms. Comments are quite welcome (send email to: rxzvcs@rit.edu).

Summary

Correct is more important than fast.
Complete is more important than fast.
Clear and concise is more important than fast.
Get it right, then make it fast.

(Jim Cordy used to make the last point regularly. In cases with heavy iteration and/or large data one needs to optimize as a program is constructed, but this should be done incrementally, with 'right' modules/functions being constructed before they are modified and made 'fast.')

A year in the lab can save you a day in the library.

(Paraphrased from Mike Kalish. Other people have probably worked on your problem or closely related ones. They will have written papers on their findings, and probably even made code available. Study their papers and programs, as it will save you time. Also, there is no substitute for studying a program whose purpose is clearly understood by its author, with the structure of the problem and the corresponding solution reflected in the program design, comments, and I/O.)

I think it is very definitely worth the struggle to try and do first-class work because the truth is, the value is in the struggle more than it is in the result.

(from Richard Hamming)

Guidelines

Always determine whether an implementation exists that you can use or adapt. If the answer is yes, then take the time to properly learn the tool/language, rather than try to implement the same functionality in a familiar tool or language, because:
- While challenging, it will probably take much less time than having to debug your own implementation, leaving more time to do new things.
- It will let you benefit from the experience and expertise of other people working in the area, and learn from their ideas, algorithms, scripts, and the organization of their programs.
- It adds a tool that you can include on your CV. A willingness and ability to learn new tools is appealing to both industrial and academic organizations.
Write the I/O for your program first. This will allow you to make progress early on, by developing powerful and simple-to-use I/O functions. If you are unable to define the inputs and outputs for your program, this provides an opportunity for you to find the tools needed, or to refine your understanding of your project implementation/experiment. If you are unsure what the inputs and outputs are, you do not understand either the purpose or operations of your algorithm.
- Use a simple I/O format that is easily human-readable. Often this is a very simple text file format that can be translated to/from other formats (e.g. YAML or XML) using libraries, tools, or your own functions.
- Make use of standard input, standard output and file redirection. The ability to redirect output to a file, and use files for input will make the execution of test scripts and experiments much easier.
- Always save parameters that are 'learned' from data, e.g. weights in a neural net. Languages with built-in object serialization (e.g. Java serialization, or 'pickling' in Python) make it very easy to save and restore data structures and object state.
- Outputs include plots, graphs, and statistical tests. Write functions that will allow you to easily generate tables, figures, plots (e.g. of time series data, or bar graphs) and statistical tests (e.g. t-tests or ANOVA), using mock data if necessary. This allows you to test your approach to analyzing experimental outcomes early, and to identify metrics that are uninteresting, or that are important but missing from the experiment design. It also saves your energy for thinking about results, rather than coding visualizations for them, near the end of development, when time tends to be very short.
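The I/O points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-record-per-line 'label value' format, the function names, and the use of a mean as the 'learned' parameter are all hypothetical choices made for the example, not part of any particular project.

```python
import pickle
import sys


def read_samples(stream):
    """Parse 'label value' lines; blank lines and '#' comments are skipped."""
    samples = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        label, value = line.split()
        samples.append((label, float(value)))
    return samples


def write_samples(samples, stream):
    """Write samples back out in the same one-record-per-line format."""
    for label, value in samples:
        stream.write(f"{label} {value:.4f}\n")


def save_parameters(params, path):
    """Pickle 'learned' parameters so that a run can be restored later."""
    with open(path, "wb") as f:
        pickle.dump(params, f)


def load_parameters(path):
    """Restore parameters written by save_parameters."""
    with open(path, "rb") as f:
        return pickle.load(f)


def main(in_stream=sys.stdin, out_stream=sys.stdout):
    """Read samples, subtract their mean (the 'learned' value), write results."""
    samples = read_samples(in_stream)
    if not samples:
        return {}
    mean = sum(value for _, value in samples) / len(samples)
    write_samples([(label, value - mean) for label, value in samples], out_stream)
    return {"mean": mean}
```

Because main() defaults to standard input and standard output, calling it with no arguments at the bottom of a script gives a filter that works with shell redirection, while tests can pass in-memory streams instead of files.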
Automation: Implement tests and experiments using scripts or programs that you can later run again, in case you need to re-generate results, check for bugs, or adapt.
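As a rough sketch of what such a script might look like, the following Python fragment runs every combination of settings and logs one row per run. The function run_trial is a hypothetical stand-in for a real experiment, and the column names are assumptions made for the example; the fixed seeds are what make the results reproducible when the script is run again.

```python
import csv
import random


def run_trial(noise_level, seed):
    """Hypothetical stand-in for one experiment run; returns an error score."""
    rng = random.Random(seed)  # a fixed seed makes the trial repeatable
    return noise_level + rng.random() * 0.01


def run_experiment(noise_levels, seeds, out_stream):
    """Run every (noise_level, seed) combination and log one CSV row each."""
    writer = csv.writer(out_stream, lineterminator="\n")
    writer.writerow(["noise_level", "seed", "error"])
    for noise_level in noise_levels:
        for seed in seeds:
            error = run_trial(noise_level, seed)
            writer.writerow([noise_level, seed, f"{error:.6f}"])
```

Passing sys.stdout as out_stream lets the results be redirected to a file, so the same command regenerates the same table whenever it is needed.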
Test-driven development: whether developing bottom-up, or top-down, stop and test regularly, and use scripts or programs to automate tests. Don't do this because it's good; do it because it gives confidence in your programs, encourages lean designs, and will allow you to trust functions/program modules/classes as you construct larger programs, making debugging more manageable.
- Format debugging and test output simply and carefully. Make sure that what is output will be easy to understand correctly, for you or anyone else who might use the program (your 'audience').
- Design tests to make it easy to spot errors in test output if needed. Include trivial cases whose correct output is 'obvious'; often this helps spot bugs earlier in development.
- When confused about the behavior of a program, write little 'experiment' programs to test your understanding of the program behavior. This is a rare case where more code is better, provided that the additional code can be easily separated and later removed if necessary.
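A minimal sketch of these testing points in Python follows. The median function and the table of cases are hypothetical examples chosen for illustration; the idea is that trivial cases come first, and that each case prints one simply formatted line so that a failure is easy to spot.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Each case: (description, input, expected). Trivial cases come first,
# because their correct output can be checked by eye.
TEST_CASES = [
    ("single value", [5], 5),
    ("two values", [1, 3], 2.0),
    ("odd count, unsorted", [3, 1, 2], 2),
    ("even count, unsorted", [4, 1, 3, 2], 2.5),
    ("repeated values", [2, 2, 2], 2),
]


def run_tests():
    """Print one aligned line per case and return the number of failures."""
    failures = 0
    for description, values, expected in TEST_CASES:
        actual = median(values)
        passed = actual == expected
        if not passed:
            failures += 1
        status = "ok  " if passed else "FAIL"
        print(f"{status} {description:<22} input={values} "
              f"expected={expected} got={actual}")
    return failures
```

Returning the failure count lets a script or shell wrapper decide whether to continue, which fits the automation guideline above.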
Write code for your audience: researchers in your area. If others cannot easily work out the organization or details of your code when they need to, they will not use it, reducing the impact of your work. Remember that this audience may include yourself if you choose to use your code again, even two weeks into the future.
- Research programs are not mature applications. Source code for a fully developed application needs to be used by potentially hundreds or thousands of other developers; early research programs are prototypes, which need to be used within a research lab, and possibly by others working in the research area. Style and features should be dictated by the needs of the research project and the knowledge base of researchers in the area, which is normally very specific.
  If a research program is useful, it is very likely to be re-designed and re-implemented.
- Commenting: Write comments that will help you remember the structure and subtle details of the program when reading the code. Rather than comment on every input and output of a function, use descriptive names, and comments/operations in the code that make their types clear.
- Document how to run your program and tests. Minimally, you should write a README file that describes how to invoke the program, and what the command line arguments are. It is also a good idea to document sample test cases from your test suite here, so that you can remember how to run them/where to find them in your code base.
- Try to keep code small, and where possible remove any dead or incorrect code. Avoid large sections of unused, commented-out code; they reduce the likelihood that you will see ways to simplify (and debug) your code.
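One possible way to combine descriptive names, visible types, and documented command-line arguments is sketched below in Python. The program, its argument names, and its default values are all hypothetical; the sketch assumes only the standard argparse module, whose generated --help text can be pasted into a README.

```python
import argparse


def build_parser():
    """Command-line interface; the help text doubles as usage documentation."""
    parser = argparse.ArgumentParser(
        description="Train a classifier on labelled samples (illustrative only)."
    )
    parser.add_argument(
        "training_file",
        help="text file with one 'label value' pair per line",
    )
    parser.add_argument(
        "--learning-rate", type=float, default=0.1,
        help="step size for weight updates (default: 0.1)",
    )
    parser.add_argument(
        "--max-epochs", type=int, default=50,
        help="maximum passes over the training data (default: 50)",
    )
    return parser


def describe_run(training_file: str, learning_rate: float, max_epochs: int) -> str:
    """Summarize the settings of a run in one line, e.g. for a log file."""
    return (f"training_file={training_file} "
            f"learning_rate={learning_rate} max_epochs={max_epochs}")
```

Here the parameter names and type annotations carry most of the information that a per-argument comment would otherwise repeat.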
Use development tools sparingly. Some tools such as debuggers, profilers and version control systems are very helpful, but learning about a vast array of tools that you cannot profitably use and/or constraining your programs to provide hooks for tools will slow you down.