Recent years have seen a steep increase in the use of artificial intelligence methods in software engineering (AI+SE) research. The combination of these two fields has unlocked remarkable new abilities: Lachaux et al.'s recent work on unsupervised machine translation of programming languages,15 for instance, learns to generate Java methods from C++ with over 80% accuracy, without any curated examples. This would surely have sounded like a vision of a distant future just a decade ago, and such rapid progress is indicative of the substantial and unique potential of deep learning for software engineering tasks and domains.
Yet these abilities come at a price. The "secret ingredient" is data, as epitomized by Lachaux et al.'s work, which draws on 163 billion tokens across three programming languages. For perspective, this is not only nearly 100 times the size of virtually all prior datasets in the AI+SE field; the estimated cost of training this model also runs to the tune of tens of thousands of dollars. And even that is a drop in the bucket compared with what comes next: training the new state of the art in language models, GPT-3,2 costs on the order of millions of dollars. This may be a small price to pay for Facebook, where Lachaux et al.'s research was conducted, or for OpenAI (GPT-3), but this exploding trend in the cost of achieving the state of the art has left the ability to train and test such models limited to a select few large technology companies, and well beyond the resources of virtually all academic labs. It is reasonable, then, to worry that a continuation of this trend will stifle some of the innovative capacity of academic labs and leave much of the future of AI-based SE research in the hands of elite industry labs. This Viewpoint is a call to action, in which we discuss the current trends, argue their importance for our field, and propose solutions.