Originally posted on The Road to AI We Can Trust
On November 30, 2022, OpenAI released ChatGPT, a chatbot powered by a large language model in the GPT-3.5 series, a fine-tuned descendant of GPT-3. Although it was not particularly more intelligent or powerful than its GPT-3 predecessors, it caught the attention of the public and the press to an unprecedented degree. (From what we have heard, the team that created it at OpenAI was quite surprised.) Within a few days of its release, more than a million users had run experiments of every kind on it.
Needless to say — or, rather, this should be needless to say, but this obvious fact can get buried beneath the ongoing avalanche of hype — ChatGPT made all the same kinds of mistakes that its predecessors did.
In our opinion, this was inevitable in a system that has been trained to optimize the goal of producing one plausible-sounding word after another rather than to actually engage with the meaning of language. ChatGPT couldn't, for example, reliably count to four (figure 1) or do one-digit arithmetic in the context of a simple word problem (figure 2). It couldn't figure out the order of events in a story (figure 3). It couldn't reason about the physical world (figure 4). It couldn't relate human thought processes to their character (figure 5). It made things up (figure 6). Its output, when released, exhibited sexist and racist biases (figure 7). It could sometimes produce outputs that were correct and acceptable in these regards, but not reliably. ChatGPT is a probabilistic program; if you rerun the experiments in figures 1-7 you may get the same result, the correct result, or a different wrong result.
What can we and should we do about all this?
Figure 7: Sexism and racism. Tweet by Steven Piantadosi, 12/4/22. This Twitter thread includes several further examples of this kind of problem (partly addressed in later iterations of ChatGPT).
In previous years, the two of us spent a fair amount of time devising, running, and publicizing tests to expose the limits of large language model (LLM) technology (e.g. here), but in so doing we could only scratch the surface.
This time, however, we were far from alone; in fact, we were quite surprised and gratified to find that people from all over the Internet were running similar kinds of experiments, often more creative than our own, and reporting their results on social media or by emailing them to us. The tests produced by this "hive mind" were often impressive probes of the system's abilities, and they were far more diverse in subject and approach than anything we could have come up with by ourselves; in many ways, they compare favorably to what is generally found in the systematically created and published benchmarks for AI systems. (A few weeks before, the same thing had happened with the Galactica system, a science-oriented LLM created by Meta AI; in that case, the immediate pushback from the internet was so intense that the public interface was withdrawn within forty-eight hours.)
What's more, some of these reports had immediate impact. The problems with bias and prejudice revealed in examples like figure 7 above were so unacceptable that OpenAI felt obliged to address them immediately. They therefore rapidly created a patch which answered prompts like the above with a largely canned message refusing to play along.
The patch works to some extent; if you now run the prompt in figure 7, ChatGPT will tell you emphatically that race and gender should not be used as criteria for judging whether someone is a good scientist. Still, the guardrails are imperfect, in some ways too loose and in others too strict. They are too loose in that some people have found ways to get around them by framing their requests in various indirect ways; they are too strict in that they often lead to outputs that are absurdly overcautious (figure 8).
The good news is that by now, some of these issues are starting to be well-known, at least in the Twitterverse. But reports about them are also wildly dispersed and pretty unsystematic, which is unfortunate.
For one thing, although many people seem to have heard of a few of these errors (e.g. from reports in the media), few realize how pervasive or broad in scope they are; indeed, a recent Twitter thread that falsely claimed these problems had been resolved got a surprising amount of traction.
For another, although OpenAI may be tracking errors like these for internal purposes, the scientific community has no access to whatever they have collected, which leaves it in the lurch. (It also leaves us puzzling about the word Open in the name OpenAI, but that's a story for another day.) Meanwhile, systems like GPT-3 and its derivatives pose at least three special problems for anyone trying to understand their nature:
For all these reasons, we recently decided to put together a community-facing corpus of ChatGPT errors (and to include other large language models as well). We enlisted a few friends — Jim Hendler, William Hsu, Evelina Leivada, Vered Shwartz, and Michael Witbrock — and the technical help of Michael Ma, and put together this site.
Importantly, we have structured it so that anyone can look at the data at any time.
There is an interface for adding examples and a separate interface for viewing the collection in database format. In adding an example, the model that generated it, a brief description of the error, and either a screenshot or the text of the example are required, and, for verification purposes, you must supply an email address, which is not published. A categorization of error type, a link to a relevant external site (e.g. a posting on social media) and additional comments are optional.
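For readers who prefer to think of a submission in terms of a record schema, here is a minimal sketch of what one entry amounts to, based purely on the description above; the field names are our own illustration, not the site's actual schema:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ErrorReport:
    # Required fields, per the description above
    model: str                    # which model produced the error, e.g. "ChatGPT"
    description: str              # brief description of the error
    evidence: str                 # screenshot reference or the text of the exchange
    submitter_email: str          # used only for verification; never published
    # Optional fields
    error_type: Optional[str] = None     # e.g. "arithmetic", "temporal reasoning", "other"
    external_link: Optional[str] = None  # e.g. a social-media post showing the error
    comments: Optional[str] = None       # any additional notes

# What a single submission might look like (illustrative values only)
example = ErrorReport(
    model="ChatGPT",
    description="Gets one-digit arithmetic wrong in a simple word problem",
    evidence="Prompt and response text pasted here...",
    submitter_email="reporter@example.com",
    error_type="arithmetic",
)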
If you care about large language models, we hope you will have a peek, and consider contributing intriguing errors as you note them; even better if you can share this article, along with this link, so that the repository becomes widely known.
§
We won't claim that what we have done is perfect. One could argue, for example, that we should include positive examples as well as negative examples (we haven't, since there is already a site, ShareGPT, for collecting GPT examples of all kinds). The taxonomy of error types in the database is certainly rough, arbitrary, and incomplete (we included a catchall "other"). We are also steering with a very light touch; not every example is, to our eyes, fully compelling, but our aim is to be inclusive rather than exclusionary. But as the saying goes, the perfect (in this case neither achievable nor definable) should not be the enemy of the good. In our view, it is past time to have something like this, given the inherent unpredictability and wide range of these systems.
We also see the collection as an important counterpoint to certain media narratives [e.g. this New York Times article by Kevin Roose] that have tended to emphasize the successes without a serious look at the scope of the errors.
Already, from the initial data collected in the first few days, a scientist, technologist, or lay reader can readily see many different types of errors and try for themselves to understand how replicable those errors are, in order to make informed judgements about scientific questions, such as the degree to which ChatGPT's output reflects a clear understanding of the world, and technological questions, such as the degree to which its output might be trustworthy for specific applications. Our own views on these matters are well known; now you can peek for yourself, no longer hindered by the limits of what can be expressed in a tweet. Have fun!
The form for entering data is here:
The form for inspecting extant data is here:
Gary Marcus is a scientist, best-selling author, and entrepreneur. Ernest Davis is Professor of Computer Science at New York University. Collaborators for many years, they are the authors of Rebooting AI: Building Artificial Intelligence We Can Trust, one of Forbes's 7 Must Read Books in AI.