On any given day, medical researchers at Carnegie Mellon University (CMU) may be investigating new ways to thwart the development of epilepsy or designing an implantable biosensor to improve the early detection of diseases such as cancer and diabetes. As with any disciplined pursuit of science, such work is subject to rigorous rounds of peer review, in which documents revealing methodology, results, and other key details are examined.
But, assuming software was created for the research, should a complete disclosure of the computer code be included in the review process? This is a debate that doesn't arrive with any ready answers, not on the campus grounds of CMU or at many other institutions. Scott A. Hissam, a senior member of the technical staff at CMU's Software Engineering Institute, sees validity in both sides of the argument.
"From one perspective, revealing the code is the way it should be in a perfect world, especially if the project is taking public money," says Hissam, who, as a coauthor of Perspectives on Free and Open Source Software, has explored the topic. "But, in practice, there are questions. The academic community earns needed credentialing by producing original publications. Do you give up the software code immediately? Or do you wait until you've had a sufficient number of publications? If so, who determines what a sufficient number is?"
Another dynamic that adds complexity to the discussion is that scientific researchers are not software developers. They often write their own code, but generally don't follow the same practices, procedures, and standards as professional software programmers.
"Researchers who are trying to cure cancer or study tectonic plates will write software code to do a specific task in a lab," Hissam says. "They aren't concerned about the same things that computer programmers are, such as scalability and design patterns and software architecture. So imagine how daunting of a task it would be to review and try to understand how such a code was written."
This issue has gained considerable attention ever since Climategate, which involved the illegal hacking of researchers' email accounts last year at the Climate Research Unit at the University of East Anglia, one of the world's leading institutions on global climate change. More than 1,000 email messages and 2,000 documents were hacked, and source code was released. Global warming contrarians have contended the emails reveal that scientists manipulated data, among other charges. Climate Research Unit scientists have denied these allegations, and independent reviews conducted by both the university and the House of Commons' Science and Technology Select Committee have cleared the scientists of any wrongdoing.
Still, Darrel Ince, professor of computing at the U.K.'s Open University, cited the Climate Research Unit's work as part of his argument that code should be revealed. He wrote in the Manchester Guardian that the university's climate-research team depended on code that has been described as undocumented, baroque, and lacking the data needed to pass information from one program and research team to another.
Ince noted that Les Hatton, a professor at the Universities of Kent and Kingston, has conducted an analysis of several million lines of scientific code and found that the software possessed a high level of detectable inconsistencies. For instance, Hatton found that interface inconsistencies between software modules that pass data from one part of a program to another happen, on average, at the rate of one in every seven interfaces in Fortran and one in every 37 interfaces in C.
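To see what such an interface inconsistency looks like, consider a minimal, hypothetical C sketch (the file names and the mean() routine are invented for illustration): one file defines a routine with one signature, and a second file, written separately, declares and calls it with another.

/* stats.c -- one researcher's "module" */
double mean(const double *values, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += values[i];
    return sum / (double)n;
}

/* analysis.c -- a second file that guesses at the interface */
#include <stdio.h>

/* Mismatched declaration: the definition above takes (const double *, long),
   but this file declares (float *, int). */
double mean(float *values, int n);

int main(void)
{
    float readings[3] = {1.0f, 2.0f, 3.0f};
    printf("mean = %f\n", mean(readings, 3));
    return 0;
}

Because a C linker performs no cross-file type checking, the two files compile and link without complaint, yet the routine ends up reading doubles from an array of floats and returns a quietly wrong answer.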
"This is hugely worrying when you realize that one errorjust onewill usually invalidate a computer program," Ince wrote. Those posting comments on the Guardian Web site have been largely supportive of his arguments. "The quality of academic software code should absolutely be scrutinized and called out whenever needed," wrote one commenter. "It should be the de facto criteria for accepting papers," wrote another.
Still, not all were in agreement. "I work in scientific software," wrote one commenter. "The sort of good programming practices you talk about are things ... [that are] absolutely useless for one person wanting to do a calculation more quickly. That's all the computer models are, fancy calculators. I've seen plenty of Fortran and VB code to do modeling written by academics and it's mostly awful but it also nearly always does the job."
Efforts to encourage scientists to reveal software code stem from philosophies that began with the birth of computers. Because the early machines were big, clunky, and expensive, software was freely shared. "There wasn't much that people had written anyway," says John Locke, manager of Freelock Computing, an open-source business services firm. "Sharing code was like sharing scientific ideas, and was treated in the same way."
The U.S. Constitution provides patent and copyright protection to scientists and their sponsors so they can make their work public while still being able to profit, Locke argues. And this, he says, provides enough protection to open up the code.
"Not sharing your code basically adds an additional burden to others who may try to review and validate your work," Locke says. "If the code is instrumental in testing a hypothesis, keeping it closed can prevent adequate peer review from taking place. After all, source code is nothing more than a very specific set of steps to achieve a desired result. If those steps cannot be reviewed in detail, the whole test is suspect."
For these very reasons, however, there is often hesitancy: opening up the code throws "the books" open, peeling back the curtain to reveal exactly how the work was done. These days, scientists are wary of providing additional fodder that could impede their work or damage their reputations.
"There are downsides [to revealing code]," says Alan T. DeKok, a former physicist who now serves as CTO of Mancala Networks, a computer security company. "You may look like a fool for publishing something that's blatantly wrong. You may be unable to exploit new 'secret' knowledge and technology if you publish. You may have better-known people market your idea better than you can, and be credited with the work. But in order to be trusted, much of the work should be released. If they can't release key portions, then the rest is suspect."
"Not sharing your code basically adds an additional burden to others who may try to review and validate your work," says John Locke.
While ethical appeals and arguments in the greater interest of science are often used to encourage more information sharing, those same considerations can be used to make the case that some information needs to remain undisclosed. Had the Manhattan Project happened today, for instance, surely few people would call for an open dissection of its software DNA, says Mike Rozlog, developer tools product manager at Embarcadero Technologies.
Also, science is a highly competitive endeavor, and funding is often based on a track record of success. "If you're forced to release proprietary [code]," Rozlog says, "this could give a significant advantage to rogue organizations that don't follow the same rules."
For the past seven years, researchers at Purdue University have attempted to resolve this issue, especially in the study of nanotechnology. Funded by the National Science Foundation, nanoHUB.org has been established as a site where scientists and educators share simulation and modeling tools and run their code on high-performance computing resources, says software architect Michael McLennan, a senior research scientist at Purdue. A toolkit called Rappture standardizes the input and output for the tools and tracks details about each execution, such as which user ran which version of the code, on which computer, and when. Simulations run in a cloud of computing resources, and the most demanding computations are sent to national grid computing resources such as the TeraGrid. nanoHUB.org now has a core group of 110,000 users from more than 170 nations, who launch more than 340,000 online simulations each year.
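The pattern McLennan describes can be pictured with a short sketch. The C program below is not Rappture itself, only a hypothetical stand-in with invented file names, parameters, and log format; it shows the general idea of a standardized tool interface: read declared inputs from a common format, write outputs the same way, and record who ran which version of the code, where, and when.

/* run_tool.c -- hypothetical wrapper illustrating a standardized tool
   interface (not the actual Rappture toolkit). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TOOL_VERSION "1.4.2"   /* invented version string */

int main(void)
{
    /* 1. Standardized input: every tool reads the same key=value file. */
    FILE *in = fopen("input.txt", "r");
    if (!in) { perror("input.txt"); return 1; }
    char line[256];
    double temperature = 0.0;              /* example input parameter */
    while (fgets(line, sizeof line, in))
        sscanf(line, "temperature=%lf", &temperature);
    fclose(in);

    /* 2. Run the "simulation" (a stand-in for the researcher's real code). */
    double result = temperature * 1.8 + 32.0;   /* placeholder computation */

    /* 3. Standardized output, so other tools can consume the result. */
    FILE *out = fopen("output.txt", "w");
    if (!out) { perror("output.txt"); return 1; }
    fprintf(out, "result=%f\n", result);
    fclose(out);

    /* 4. Provenance log: user, tool version, host, and time of the run. */
    FILE *log = fopen("runs.log", "a");
    if (log) {
        time_t now = time(NULL);
        char stamp[64];
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", localtime(&now));
        const char *user = getenv("USER");
        const char *host = getenv("HOSTNAME");
        fprintf(log, "%s ran version %s on %s at %s\n",
                user ? user : "unknown", TOOL_VERSION,
                host ? host : "unknown-host", stamp);
        fclose(log);
    }
    return 0;
}

The real toolkit's contract is far richer, but the shape is the same: the wrapper, not each researcher's code, is what enforces the standard format and records the provenance.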
The project encourages users to release their work as open source or under a Creative Commons license, McLennan says. "But even if the codes are not open source, the unique middleware allows scientists to run the tools and test the behavior of the models," McLennan says. Since launching nanoHUB.org, Purdue has developed other hubs on the same software platform for cancer research and care, biofuels, environmental modeling, and pharmaceutical engineering, among other pursuits. It is now constructing a dozen more hubs as well, some for outside agencies such as the Environmental Protection Agency, and researchers at Notre Dame are using the software to build their own hub for biological adaptation to climate change.
"Having our software as open source allows these other sites to pick this up and create their own hubs in their own machines," McLennan says. "It shows that this kind of effort can go far beyond nanoHUB.org, and take hold across a wide variety of science and engineering disciplines."
Further Reading
Feller, J., Fitzgerald, B., Hissam, S.A., and Lakhani, K.R.
Perspectives on Free and Open Source Software. MIT Press, Cambridge, MA, 2005.
Ince, D.
If you're going to do good science, release the computer code too. Manchester Guardian, Feb. 5, 2010.
McLennan, M. and Kennell, R.
HUBzero: a platform for dissemination and collaboration in computational science and engineering. Computing in Science and Engineering 12, 2, March/April 2010.
PurdueRCAC
HUBzero Cyberinfrastructure for Scientific Collaboration.
http://www.youtube.com/watch?v=MrOGA_TluGY
The following letter was published in the Letters to the Editor in the December 2010 CACM (http://cacm.acm.org/magazines/2010/12/102133).
--CACM Administrator
About the software of science, Dennis McCafferty's news story (Oct. 2010) asked "Should Code Be Released?" In the case of climate science code, the Climate Code Foundation (http://climatecode.org/) answers with an emphatic yes. Rebuilding public trust in climate science and support for policy decisions require changes in the transparency and communication of the science. The Foundation works with climate scientists to encourage publication of all climate-science software.
In a Nature opinion piece "Publish Your Computer Code: It Is Good Enough" (Oct. 13, 2010, http://www.nature.com/news/2010/101013/full/467753a.html), I argued that there are powerful reasons to publish source code across all fields of science, and that software is an essential aspect of the scientific method despite failing to benefit from the system of competitive review that has driven science forward for the past 300 years. In the same way software is part of the scientific method, source code should be published as part of the method description.
As a reason for not publishing software, McCafferty quoted Alan T. DeKok, a former physicist, now CTO of Mancala Networks, saying it might be "blatantly wrong." Surely this is a reason, perhaps the main one, that software should be published to expose errors. Science progresses by testing new ideas and rejecting those that are wrong.
I'd also like to point out a glaring red herring in McCafferty's story: the suggestion that a policy in this area could undermine a modern-day Manhattan Project. All design and method descriptions from that project were top secret for years, many to this day. Such secrecy would naturally apply to any science software of similar military importance.
Nick Barnes
Staines, U.K.