Generative artificial intelligence (AI) has captured considerable popular attention recently. ChatGPT and DALL-E have given members of the general public opportunities to use AI systems to generate text and image outputs for fun and a wide range of other purposes. Google and Meta have announced their intentions to launch similar AI systems soon.
Generative AI has also caught the attention of lawyers who question the legality of ingesting in-copyright works as training data and producing outputs derived from copyrighted training data.
Lawyers representing four programmers (identified so far as John Does) have, for instance, sued GitHub, Microsoft, and OpenAI, alleging that GitHub's Copilot and OpenAI's Codex AI programs have violated laws by using publicly available open source code (including programs the Does developed) posted on GitHub's site as training data for their generative AI systems. Also illegal, say these Does, are Copilot and Codex outputs of code sequences in response to user prompts insofar as the sequences are substantially similar or virtually identical to open source code used as training data.
The Does claim to represent a class of programmers whose legal rights GitHub and OpenAI have violated. They want a federal court to issue an injunction against these generative AI systems and to award the class $9 billion in statutory damages.
This column focuses on the Doe v. GitHub lawsuit as the first of a two-part series on legal challenges to generative AI. A subsequent column will address two similar lawsuits brought against Stability AI for its use of images as training data and for producing outputs based on the training data.
GitHub is an Internet hosting service for software development and version control. It reports having more than 100 million registered developers and hosting 372 million code repositories, including 28 million public repositories. Microsoft acquired GitHub for $7.5 billion in 2018.
OpenAI developed Codex as a generative AI model trained on billions of lines of publicly available computer source code, including code available in GitHub's public repositories. Codex discerns statistical patterns in the structure of existing code. It infers these patterns based on a complex probabilistic analysis of the training data. In response to a user's prompt, Codex produces code to implement the desired function.
In June 2021, GitHub and OpenAI launched Copilot as a cloud-based AI technology that uses Codex to assist the development of software. GitHub users can install Copilot as an extension to various code editors. Copilot treats a user's input to a code editor as a prompt and generates suggested code that may be suitable for the developer's purposes. Copilot subscriptions are available to GitHub users for $10 per month or $100 per year.
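To make the prompt-and-completion workflow concrete, here is a minimal hypothetical sketch in Python (it is not an actual Copilot transcript, and the function shown is invented for illustration). The developer types only a signature and docstring; the assistant proposes a body that, statistically, tends to follow such prompts in its training data.

```python
# Hypothetical illustration of prompt-driven code suggestion.
# The developer types the signature and docstring (the "prompt");
# the lines after it are the kind of completion an assistant such
# as Copilot might suggest.

def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forward and backward,
    ignoring case and non-alphanumeric characters."""
    # --- suggested completion begins here ---
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]

print(is_palindrome("A man, a plan, a canal: Panama"))  # True
```

The legal questions arise when a suggested completion is substantially similar, or virtually identical, to licensed code in the training set.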
Although the complaint says Copilot and Codex are engaged in "software piracy on an unprecedented scale," it does not actually claim GitHub or OpenAI have infringed any copyrights. This is curious because open source software is generally protected by copyright, and copyright is the legal basis on which open source licenses are predicated.
The Does' most significant claim is that Copilot and Codex wrongfully removed copyright notices and other copyright-relevant information from open source programs ingested as training data.
The intentional removal or alteration of copyright management information (CMI) from copies of copyrighted works with knowledge that the removal or alteration of CMI is likely to induce, enable, facilitate, or conceal copyright infringements is illegal under § 1202 of Title 17 of the U.S. Code.
A second principal claim is that GitHub and OpenAI have breached open source license agreements by failing to respect license terms, such as requirements to give attribution to the open source developers whose code has been ingested and is being used to generate outputs and to include copyright notices in reused code.
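To see what these two claims mean in practice, here is a generic illustration (my example, not code from the complaint, with an invented author name) of the material at stake: a license header at the top of an open source file. The copyright notice, the author identification, and the terms of use are all CMI under § 1202, and the license typically requires that they travel with any reuse of the code.

```python
# Illustrative example (not from the complaint): a typical open
# source license header. Under Section 1202, the copyright notice,
# the author's name, and the terms of use below all count as
# copyright management information (CMI).
#
# Copyright (c) 2020 Jane Developer <jane@example.com>
# Licensed under the MIT License.
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.

def normalize_path(path: str) -> str:
    """Collapse redundant separators in a filesystem path."""
    import os.path
    return os.path.normpath(path)
```

The Does' theory is that when Copilot emits the function without the header, the CMI has been removed and the license's attribution and notice conditions have been breached.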
The Does charge GitHub and OpenAI with several other legal violations, including misrepresenting others' licensed code as their own, fraud, unjust enrichment, and violating California unfair competition and privacy laws. This column omits discussion of these subsidiary claims because the lawsuit will primarily be focused on the two principal claims.
Lawyers initiate lawsuits by filing complaints explaining the legal theories on which the lawsuits are based and core facts that support those legal theories. If courts uphold at least one theory in a case, the plaintiffs may be eligible for certain remedies, such as injunctions and damages.
After reviewing complaints, defendants' lawyers sometimes decide to file motions to dismiss complaints for failure to state claims on which courts could grant the remedies requested in the complaint.
When considering motions to dismiss complaints, courts assume that all of the facts stated in the complaint are true (even if the defendants' lawyers plan to contest their truthfulness if the court denies their motions).
Instead of filing an answer to the Does' complaint, which would typically admit some allegations, deny others, and raise defenses, GitHub and OpenAI filed motions to dismiss it for failure to state claims on which relief could be granted.
Among other things, GitHub and OpenAI point out the Does have not identified any code in which they claim rights. Nor have they specified any injury they suffered as a result of GitHub's or OpenAI's acts. Most of their claims are speculative and conclusory, not specific about the elements necessary to succeed on the merits.
OpenAI also moved to dismiss because the Does have not identified themselves. Courts do not usually allow plaintiffs to sue anonymously or pseudonymously absent special circumstances (for example, when there is a risk of retaliation). Procedural rules require plaintiffs to ask a court for permission to file lawsuits as Does. These Does failed to do this. It is, moreover, difficult for defendants to formulate adequate defenses if they do not know who is suing them.
The big money claim in the Doe v. GitHub lawsuit ($9 billion) asserts GitHub and OpenAI illegally removed CMI from source code used as training data.
Section 1202(c) defines CMI as including information identifying the work, its author and/or copyright owner, terms and conditions for use of the work, and/or identifying numbers or symbols representing identifying information.
To violate § 1202, a defendant must have intentionally removed CMI from copies of a work or must have distributed copies of a work knowing its CMI had been removed. In addition, a defendant must know or have reason to know that the CMI removal "will induce, enable, facilitate, or conceal an infringement" of copyright in the work.
Courts can award anywhere between $2,500 and $25,000 in statutory damages for each violation of this law. (When actual damages are difficult to prove, as with removal of CMI, legislatures sometimes decide to establish a statutory damage remedy to ensure some meaningful compensation is available to victims of a law's violation.)
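Some back-of-the-envelope arithmetic (mine, not the complaint's) shows the scale those numbers imply:

```python
# Illustrative arithmetic only: how many Section 1202 violations
# would a $9 billion award imply at the statutory minimum and
# maximum per-violation amounts?

demand = 9_000_000_000            # dollars sought for the class
statutory_min, statutory_max = 2_500, 25_000

print(f"{demand / statutory_max:,.0f}")  # 360,000 violations at $25,000 each
print(f"{demand / statutory_min:,.0f}")  # 3,600,000 violations at $2,500 each
```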
Section 1202's double knowledge requirement may be difficult to satisfy in the Doe v. GitHub lawsuit, as it was in Stevens v. Corelogic. Stevens is a photographer who specializes in taking digital photographs of houses on behalf of real estate agents. Stevens' photographs are typically posted on Multiple Listing Service (MLS) platforms.
Some metadata about Stevens' photographs is automatically created when his digital camera takes photographs. He can also add metadata to digital image files manually using photo-editing software. Metadata embedded in digital files may be invisible to anyone who looks at the image.
Corelogic provides software to MLS for displaying real estate photographs of houses for sale. Because image files can be very large, Corelogic resizes the images and saves the resized images so they occupy less storage space and load faster on MLS sites. In the process of resizing photographs, Corelogic's software did not preserve invisible metadata.
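This failure mode is easy to reproduce in ordinary image-processing code. The sketch below (my illustration, using the Pillow library and hypothetical file names) shows how saving a resized copy writes a fresh file that omits embedded EXIF metadata unless the metadata is explicitly carried over.

```python
# Minimal sketch of how routine resizing drops embedded metadata.
# Uses the Pillow imaging library; file names are hypothetical.
from PIL import Image

original = Image.open("listing_photo.jpg")
exif_bytes = original.info.get("exif")   # embedded metadata, if present

resized = original.resize((800, 600))

# Saved without metadata: the new JPEG contains no EXIF block,
# so any CMI stored there is gone.
resized.save("listing_photo_small.jpg")

# Preserved only if passed through explicitly.
if exif_bytes:
    resized.save("listing_photo_keep_exif.jpg", exif=exif_bytes)
```

As Stevens illustrates, metadata can vanish as an ordinary side effect of processing rather than through any deliberate act, which bears directly on § 1202's intent requirement.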
Stevens sued Corelogic for violating § 1202 because of its removal of CMI embedded in his photographs. The Ninth Circuit Court of Appeals affirmed a lower court ruling in Corelogic's favor, holding that Stevens had not shown that Corelogic intentionally removed the CMI, nor that the removal would facilitate copyright infringement.
GitHub and OpenAI argue the Stevens case supports their assertion that the Does have not stated a viable claim for violation of § 1202.
The Does' complaint identifies some open source licenses the Does themselves have used for software they developed that they claim GitHub and OpenAI have wrongfully included in Codex and Copilot. The Does say these and other class members' open source licenses require attribution and inclusion of copyright notices in any reuses of their software.
As a defense against the breach of license claims, GitHub relies both on its terms of service and on license rights developers give GitHub when they choose to make their program code part of a public repository.
GitHub requires all of its users to agree to its terms of service. Included in these terms is a license granting GitHub the right to "store, archive, parse, and display ... and make incidental copies" of the users' code, as well as to "parse it into a search index or otherwise analyze it" and "share" the resulting code in public repositories with other users.
And when GitHub users decide to make their code available on its site, they must choose whether to make their code repositories private or public. Users who decide to make their repositories public "grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality."
The biggest mystery in the Doe v. GitHub case is why there is no copyright claim in the complaint. One possibility is the Does are seeking copyright registration certificates for their programs. This is a necessary procedural requirement for U.S. copyright owners who want to sue someone for infringement.
Another possibility is the Does do not want to litigate fair use defenses that GitHub and OpenAI would almost certainly raise if sued for copyright infringement. (Fair use may not be a viable defense to the CMI removal or license breach claims.)
Fair uses are not infringements of copyrights. Courts consider four factors in making fair use determinations: the purpose of the challenged use; the nature of the copyrighted work; the amount and substantiality of the taking; and harms to the market for the work.
Existing U.S. precedents seem to support such a defense if the Does sue GitHub and OpenAI for infringement. The closest precedent is the Authors Guild v. Google case. The Second Circuit Court of Appeals held that Google had made fair use of millions of in-copyright books it scanned to enable computational analysis of a database of these books and for purposes of indexing their contents to serve up snippets of text in response to user search queries.
The court held that Google had made transformative uses of the in-copyright books because the corpus facilitated greater access to information. While Google copied the whole of each book, this was necessary to achieve its transformative purpose of indexing book contents for computational analysis and search. Because Google only served up three short snippets from each book, the snippets were unlikely to undercut the market for the books.
Under the Google decision, ingesting publicly available source code would seem to be as fair as the scanning of books to index their contents. And the snippets of code that Copilot provides in response to user prompts are analogous to the snippets of text from books that Google provides in response to user search queries. Because the court found both the scans and the snippets to be fair uses, GitHub and OpenAI would seem to have plausible fair use defenses.
Generative AI has raised some new technology issues courts have not yet addressed. While the Doe v. GitHub complaint raises some interesting theories of liability, it is far from clear courts will find Copilot or Codex to be unlawful. GitHub is arguing Copilot is socially beneficial because it "crystallizes the knowledge gained from billions of lines of public code, harnessing the collective power of open source software and putting it at every developer's fingertips." In May 2023, a trial court denied the GitHub and OpenAI motions to dismiss as to the removal of CMI and breach of license claims, so the lawsuit will now proceed to address the merits. It remains to be seen how receptive the court will be to GitHub's and OpenAI's defenses.
Reader comment: In my opinion, the plaintiffs will fail. The GitHub services are no different from humans studying code and developing variants suited to the purposes of the new programs being developed. Furthermore, most of the code in the world is like most of the other code in the world (over 90 percent, according to two MIT researchers, as I recall). At the "It was a dark and stormy night" level of small fragments (which is really the level at which automated assistance can be most valuable to a programmer), the computational notions those fragments reflect are likely (I submit) to appear elsewhere in the canon in tens of thousands of instances. In short, most of what programmers write is not truly original, but merely yet another idiosyncratic expression of the same concepts (differently named data structures change nothing). To conclude: if there are to be meaningful boundaries regarding program expressions, then let's rigorously define them (say, using BNF). But getting uppity about how the expressions were created is just narcissism and a needless waste of everyone's time.