What if there were a way to take the mind-boggling amount of existing computer code, organize it, and learn from it in a way that makes new code less error-prone and more secure?
That scenario is headed toward reality thanks to a project called Pliny, which takes its name from the Roman naturalist and philosopher who authored one of the first encyclopedias.
Pliny is being funded by a four-year, $11 million grant recently announced by the Defense Advanced Research Projects Agency (DARPA), part of the U.S. Department of Defense.
UW-Madison computer scientists will collaborate with their counterparts at Rice University in Houston (which will lead the project), the University of Texas at Austin, and the company GrammaTech.
Computer sciences Professors Ben Liblit and Tom Reps will spearhead the UW-Madison side of the project, which has been compared to an autocorrect or autocomplete system. Says Liblit, "Based on knowing how people use English, autocomplete tries to make a best guess about what you're going to type. Similarly, there's a vast amount of software out there in the world, and what you're writing [as a software engineer] probably looks similar to what other people have written."
When code diverges from common patterns, it could be the result of a mistake — or an intentional choice by a programmer. By drawing upon a huge repository of billions of lines of code, Pliny will help software developers spot such divergences to identify possible errors.
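The idea of spotting divergences from common patterns can be illustrated with a small Python sketch. It is not Pliny's actual method, which the project has yet to build; it simply counts how often one call follows another in a toy corpus and flags sequences in new code that the corpus rarely or never contains. The call names, the `threshold` value, and the functions `build_model` and `unusual_pairs` are all hypothetical.

```python
from collections import Counter


def call_pairs(calls):
    """Return consecutive (previous, next) pairs from a sequence of call names."""
    return list(zip(calls, calls[1:]))


def build_model(corpus):
    """Count how often each call follows another across the whole corpus."""
    pair_counts = Counter()
    first_counts = Counter()
    for calls in corpus:
        for prev, nxt in call_pairs(calls):
            pair_counts[(prev, nxt)] += 1
            first_counts[prev] += 1
    return pair_counts, first_counts


def unusual_pairs(calls, model, threshold=0.2):
    """Flag pairs whose follow-up call is rare given the preceding call."""
    pair_counts, first_counts = model
    flagged = []
    for prev, nxt in call_pairs(calls):
        seen = first_counts[prev]
        if seen == 0:
            continue  # the corpus offers no evidence either way
        share = pair_counts[(prev, nxt)] / seen
        if share < threshold:
            flagged.append((prev, nxt, share))
    return flagged


# Toy stand-in for a large repository of existing code (assumed data).
corpus = [
    ["open", "read", "close"],
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["open", "read", "close"],
    ["lock", "update", "unlock"],
    ["lock", "update", "unlock"],
]
model = build_model(corpus)

# New code that never unlocks and mixes two usage patterns.
new_code = ["open", "read", "lock", "update", "close"]
flagged = unusual_pairs(new_code, model)
for prev, nxt, share in flagged:
    print(f"unusual: {nxt!r} after {prev!r} (seen in {share:.0%} of cases)")
```

A flagged pair is only a hint. As noted above, a divergence may be a mistake or a deliberate choice, so a tool of this kind would surface it for the programmer to judge rather than reject it outright.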
Pliny is a massive undertaking, involving more than two dozen computer scientists from the four partner organizations. The project brings together two key areas within computer science: programming languages and big data. Pliny's backbone will be a data-mining engine that continuously scans the enormous repository of open-source code. In turn, "[the field of] programming languages knows how to look at code and treat it as data," Liblit says.
Once developed, Pliny could be a valuable tool for both professional software developers and students. It aims to make writing good code faster and less error-prone.
While some may perceive writing software as a rigid process, Liblit says a given problem often has multiple correct solutions. However, there are also approaches that are clearly incorrect and don't work. Pliny will help by suggesting corrections and giving feedback on what a developer has done wrong. And since bugs can create security vulnerabilities, software with fewer errors is generally more secure.
Together, Reps and Liblit bring strengths in static and dynamic analysis, respectively. Reps is a world leader in drawing out very deep, subtle properties about how a piece of software behaves, backed by strict mathematical proofs. Liblit focuses on drawing observations from extremely large, often messy bodies of code. "Tom drills deep; I fan out wide. Tom has a formal mathematician's rigor; I have a statistician's best-effort style that rolls with imperfection," Liblit says of their complementary approaches.
While there are four partner institutions working on Pliny, Wisconsin Badger connections are woven throughout the project. Reps also serves as president of GrammaTech, a developer of software-assurance and cybersecurity tools headquartered in Ithaca, N.Y.
UW-Madison computer sciences alumnus David Melski, once Reps' student, is now GrammaTech's vice president of research. Rice University Professor Swarat Chaudhuri, one of Pliny's principal investigators, interned at the company in 2005. And Rice's chair of computer science, Professor Vivek Sarkar, earned his master's at UW-Madison.
In Madison, computer sciences graduate students like David Bingham Brown, who chose UW-Madison specifically to work on a challenge of this nature, will gain significant research experience through Pliny. "There are Badger footprints all over this project," Liblit says.
Pliny is part of DARPA's Mining and Understanding Software Enclaves (MUSE) program, an initiative that seeks to gather publicly available, open-source code and to mine it to create a searchable database of properties, behaviors, and vulnerabilities.