Few researchers were using computers 30 years ago. This quickly changed with the release of several commercially viable personal computers in the 1980s. Since then, processing power has increased and the cost of computers has decreased at an exponential rate (see Moore’s Law).
It’s no surprise that computers are now pivotal in chemistry research. We use them in a wide range of calculations – from determining the 40th decimal place of the absolute energy of He to modeling the release and distribution of toxic chemicals in river basins. The software used to address these complex problems is also becoming increasingly accessible and easy to use. There are already a variety of cell phone apps for chemistry-related problem solving.
Yet, while the prevalence of software and computer-based research continues to grow, the rules for publishing results and sharing software lag behind. The seemingly miraculous nature of black-box calculations is disconcerting to individuals who want to know how the answers were obtained (see Sidney Harris cartoon). A palpable concern is growing in the scientific community around the sharing of software – and the underlying source code – necessary to reproduce published results. Two recent opinion pieces, one in Science titled “Shining Light into Black Boxes” and the other in Nature titled “The case for open computer programs,” are trying to bring attention to this issue. The articles discuss the advantages of and apprehensions about sharing, and suggest possible changes. Below is a summary of the points raised by the authors of the two articles, as well as the thoughts of others (including myself).
Advantages to sharing software and source code:
- Reproducibility: As stated by Ince et al., “The vagaries of hardware, software and natural-language will always ensure that exact reproducibility remains uncertain…” without the release of source code in its entirety.
- Catching errors: A simple mistake, such as an incorrect unit conversion, a missing value recorded as zero, a rounding error, or a misplaced decimal point, can wildly skew outcomes (see Office Space). We can only find and correct errors if we can see the source code.
- Facilitating progress: All publications require that data, equations, materials, methods, and instrumentation be disclosed so that the results can be tested and furthered by others. We are all better served when source code is disseminated in a similar manner so that programs can be studied and repurposed in future research.
- Teaching tools: Real, applied examples – that are relevant to research – are useful for new students and researchers learning to program and develop code.
- Openness: Despite the competition to acquire funding and to publish first, we are all joined in the endeavor of understanding the rules that govern the universe. The open sharing of information has been and will continue to be the foundation of scientific progress.
- Relying on faith: No matter how prolific or respected you are as a researcher, the implicit assertion, “Trust me, the program works the way I say it does,” is not an acceptable means of justifying your results. On a fundamental philosophical level, black-box justifications like that should be socially unacceptable in the sciences.
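To make the “catching errors” point concrete, here is a minimal sketch in Python. The readings and function names are hypothetical, made up purely for illustration. It shows how quietly treating missing values as zero changes an average, a mistake that is nearly invisible in a published number but obvious once the code can be read.

```python
import math
from statistics import mean

# Hypothetical measurements; None marks a missing reading.
readings = [4.2, 3.9, None, 4.1, None, 4.0]


def mean_missing_as_zero(values):
    """Buggy version: silently counts each missing reading as 0.0."""
    return mean([0.0 if v is None else v for v in values])


def mean_ignoring_missing(values):
    """Drops missing readings before averaging."""
    return mean([v for v in values if v is not None])


buggy = mean_missing_as_zero(readings)     # averages over all 6 slots
correct = mean_ignoring_missing(readings)  # averages over the 4 real readings

print(f"missing treated as zero: {buggy:.2f}")
print(f"missing ignored:         {correct:.2f}")
```

Both functions look reasonable in a methods section that only says “the mean was computed,” yet they give different answers. Only the source code tells a reader which one was actually used.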
Apprehensions about sharing software and source code:
- It’s not required: With the constant push to publish early and often, no one wants to put in unnecessary time and effort during the submission process.
- Embarrassment: Many computer programmers take pride in how clean, efficient, and elegant they make their code. For researchers, on the other hand, programming is often a means to an end. The idea of someone else looking at your “messy” code with a critical lens could be intimidating or even embarrassing.
- Citations: There is not a shared mechanism or expectation for citing the authors of source code. Without citations, sharing source code does not help your career and may even help your “competitors.”
- Formatting: Currently, there is not a standard format for sharing code.
- Intellectual Property: Obviously, if a program is commercially available or has potential to be, the release of source code would allow anyone to reproduce the program without purchasing it.
- Source code in the wrong hands can be dangerous: This is a concern among many theoreticians. In a day and age where anyone can calculate energies, structures, and spectra of molecules with prepackaged software (Gaussian, Spartan, GAMESS, etc.), the rationalization of results based on black-box answers is common. Dissemination of more software and source code is likely to compound the problem.
- I did the work, so should you: Recently, a colleague of mine contacted an author of a paper to request the source code behind results that were not reproducible without it. She was basically told, “we used a Monte Carlo method. You can find something similar in Matlab.” My colleague was understandably disappointed with the answer. Taking the time to write a program to double-check someone else’s results – even using Matlab as a starting point – is just not worth the time and effort when it might not even reproduce the results (see above).
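As an aside on that last anecdote, the sketch below suggests why “we used a Monte Carlo method” is not enough to reproduce a number. It is a toy example of my own (estimating pi from random points), not the method from the paper in question, and the function name, sample count, and seeds are all assumptions chosen for illustration.

```python
import random


def estimate_pi(n_samples, seed):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # private generator, so the seed fixes the whole run
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


run_a = estimate_pi(10_000, seed=1)
run_b = estimate_pi(10_000, seed=1)  # same code, same seed, same sample count
run_c = estimate_pi(10_000, seed=2)  # same method, different seed

print(run_a == run_b)  # True: identical inputs give an identical result
print(run_a, run_c)    # two estimates of pi from the "same method"
```

Even in this tiny case, the exact result depends on the seed, the number of samples, and the random number generator. Someone rewriting the program from a one-line description has no way to match those choices, so a different number says nothing about whether the original result was right.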
With all of that said, the question remains: what should be required for publication? Currently, depending on the journal, the requirement can range from a general description of the nature of the program – so individuals can write their own code – to the full release of the source code.
I have been, and will continue to be, a strong advocate of sharing all results and information, no matter how trivial or insignificant they may seem (see Journal of Failed Chemistry). Any information may be the key to a major research discovery that helps move an entire field forward. This information could include a program or source code that solves a lingering problem or reveals a new way to think about a solution. For taxpayer-funded research, I agree with Ince et al. that “anything less than the release of the source programs is intolerable for results that depend on computation.” For those of us who want to see the release of source code, changes clearly need to be made.
- Institutionally: Publicly funded research institutions should create and implement a quick, standard, open source licensing procedure. For most code, commercialization is not an issue. Usually, the questions the code addresses are so specific that only a handful of people would be interested, limiting its potential profitability. Yet, for source code that is potentially profitable, there needs to be a quick mechanism for protection.
- Funding agencies: Taxpayer-funded agencies should clearly state their preference for the open dissemination of software and source code. This includes requiring a code dissemination plan in the grant proposal.
- Journals: Publishers could enact a policy that requires all software and source code necessary to reproduce results be made available. This would be a condition for publication. They also need to develop simple mechanisms for sharing source code.
- Researchers: We should share our software/source code using any and every possible means. Sharing mechanisms can range from including it as a supplement to our publications to making it available for download online.
- Reviewers: Before publication, demand the full source code for any software that is not commercially available. For programs that are commercially available, or potentially could be, it may be necessary to arrange for an independent third-party tester who is given access to the code but has signed a non-disclosure agreement.
This post provided an overview of the debate around making non-commercial software and source code available with research publications. In my next post, I will discuss a specific example from the chemistry community that shows how a commercially available software package can further complicate the situation.