Research Programming Guidelines
Richard Zanibbi, November 2011 (first draft)
This document summarizes my views on how to write code for research projects not concerned with large-scale software development. It is based on my experience writing code for my own as well as other people's research projects for more than a decade, in a number of different programming languages, development environments, and computing platforms. Comments are quite welcome (send email to: rxzvcs@rit.edu).
Summary
Correct is more important than fast.
Complete is more important than fast.
Clear and concise is more important than fast.
Get it right, then make it fast.
(Jim Cordy used to make the last point regularly. In cases with heavy iteration and/or large data one needs to optimize as a program is constructed, but this should be done incrementally, with 'right' modules/functions being constructed before they are modified and made 'fast.')
A year in the lab can save you a day in the library.
(paraphrased from Mike Kalish. Other people have probably worked on your problem or closely related ones. They will have written papers on their findings, and probably even made code available. Study their papers and programs, as it will save you time. Also, there is no substitute for learning through studying a program that has its purpose clearly understood by its author, with the structure of a problem and corresponding solution reflected in the program design, comments, and I/O.)
I think it is very definitely worth the struggle to try and do first-class work because the truth is, the value is in the struggle more than it is in the result.
(from Richard Hamming)
Guidelines
- Always determine whether an implementation exists that you can use or adapt. If the answer is yes, then take the time to properly learn the tool/language, rather than try to implement the same functionality in a familiar tool or language, because:
- While challenging, it will probably take much less time than having to debug your own implementation, leaving more time to do new things.
- It will let you benefit from the experience and expertise of other people working in the area, and learn from their ideas, algorithms, scripts, and the organization of their programs.
- It adds a tool that you can include on your CV. A willingness and ability to learn new tools is appealing to both industrial and academic organizations.
- Write the I/O for your program first. This will allow you to make progress early on, by developing powerful and simple-to-use I/O functions. If you are unable to define the inputs and outputs for your program, this is an opportunity to find the tools you need, or to refine your understanding of your project implementation/experiment: if you are unsure what the inputs and outputs are, you do not yet understand either the purpose or the operations of your algorithm.
- Use a simple I/O format that is easily human-readable. Often this is a very simple text file format that your functions can translate to/from other formats (e.g. YAML or XML) using libraries or tools.
- Make use of standard input, standard output and file redirection. The ability to redirect output to a file, and to use files for input, will make the execution of test scripts and experiments much easier.
- Always save parameters that are 'learned' from data, e.g. weights in a neural net. Modern languages that support object serialization (e.g. serialization in Java, or 'pickling' in Python) make it very easy to save and restore data structures and object state.
- Outputs include plots, graphs, and statistical tests. Write functions that will allow you to easily generate tables, figures, plots (e.g. of time series data, or bar graphs) and statistical tests (e.g. t-tests or ANOVA), using mock data if necessary. This allows you to test your approach to analyzing experimental outcomes early, and to identify metrics that are uninteresting, or that are important but missing from the experiment design. It also saves your energy near the end of development, when time tends to be very short, for considering results rather than coding their visualization.
- Automation: Implement tests and experiments using scripts or programs that you can later run again, in case you need to re-generate results, check for bugs, or adapt them.
- Test-driven development: whether developing bottom-up or top-down, stop and test regularly, and use scripts or programs to automate tests. Don't do this because it's good; do it because it gives confidence in your programs, encourages lean designs, and will allow you to trust functions/program modules/classes as you construct larger programs, making debugging more manageable.
- Format debugging and test output simply, and carefully. Make sure that what is output will be easy to understand correctly, for you or anyone else that might use the program (your 'audience').
- Design tests so that errors are easy to spot in the test output. Include trivial cases whose correct output is 'obvious': often this helps spot bugs earlier in development.
- When confused about the behavior of a program, write little 'experiment' programs to test your understanding of the program behavior. This is a rare case where more code is better, provided that the additional code can be easily separated and later removed if necessary.
- Write code for your audience: researchers in your area. If others cannot easily identify the organization or details of your code as needed, they will not use it, reducing the impact of your work. Remember that this audience may include yourself, if you choose to use your code again even two weeks into the future.
- Use development tools sparingly. Some tools such as debuggers, profilers and version control systems are very helpful, but learning about a vast array of tools that you cannot profitably use and/or constraining your programs to provide hooks for tools will slow you down.
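As a rough illustration of several guidelines above (a simple human-readable text format, standard input/output, saving 'learned' parameters, and tests with trivial cases), here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the 'label value' line format, the idea that the 'learned' parameters are per-label means, and all function names are hypothetical, not taken from any particular project.

```python
import io
import pickle
import sys


def read_samples(stream):
    """Parse 'label value' lines; blank lines and '#' comments are skipped."""
    samples = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        label, value = line.split()
        samples.append((label, float(value)))
    return samples


def learn_means(samples):
    """'Learn' one parameter per label: the mean of its values (a stand-in)."""
    totals = {}
    for label, value in samples:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}


def write_means(means, stream):
    """Write one 'label mean' line per label, sorted so output is stable."""
    for label in sorted(means):
        stream.write(f"{label} {means[label]:.3f}\n")


def save_params(params, path):
    """Pickle the learned parameters so they can be restored later."""
    with open(path, "wb") as f:
        pickle.dump(params, f)


def load_params(path):
    """Restore parameters previously written by save_params."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run_trivial_tests():
    """Trivial cases whose correct output can be checked by inspection."""
    assert learn_means([]) == {}
    assert learn_means([("a", 2.0)]) == {"a": 2.0}
    text = io.StringIO("# comment\na 1\na 3\n\nb 5\n")
    out = io.StringIO()
    write_means(learn_means(read_samples(text)), out)
    assert out.getvalue() == "a 2.000\nb 5.000\n"


if __name__ == "__main__":
    run_trivial_tests()              # cheap, so run them on every start
    if len(sys.argv) == 2:           # usage: prog PARAM_FILE < samples > means
        means = learn_means(read_samples(sys.stdin))
        write_means(means, sys.stdout)
        save_params(means, sys.argv[1])
```

Saved under a hypothetical name such as learn.py, this could be run as: python learn.py params.pkl < samples.txt > means.txt. Because the I/O functions take streams rather than file names, the same code serves both the redirected command line and the in-memory tests.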