Chat GPT — Fluently True, Fluently False, Here to Stay
Ælfgifu of York was the daughter of Ordgar, Earl of Devon, and she was also the sister of Ordgar, Earl of Devon.
Chat GPT, the AI system that produced the sentence above a couple of months ago, is a tremendous achievement, and the most impressive piece of language engineering I’ve interacted with (having worked in this field since the early 2000s). It’s also become the most talked-about piece of new technology, and has already led to many derived products and alternatives, so by the time you read this, it will already be out of date.
Chat GPT is a good conversationalist and has answered plenty of questions for me. It wrote a point-in-polygon test in Java (a program to say whether a given point is contained in a given shape; a sketch of such a test appears a little further down), and fixed some deliberately buggy Python code I’d written, explaining its reasoning, when asked, as follows:
The code you provided uses the print statement, which is a Python 2 construct that is not available in Python 3. In Python 2, the print statement is used to print the value of a variable or expression to the screen.
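To make the kind of fix concrete, here is a minimal sketch of my own (not the actual snippet I gave Chat GPT):

```python
# Python 2 style: print was a statement, so a line like the commented one
# below is a SyntaxError in Python 3:
# print "Result:", value

# Python 3 fix: print is a function and needs parentheses
value = 42  # illustrative value, not from my original snippet
print("Result:", value)
```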
Chat GPT’s explanation was spot on! That’s just one example of a topic where it explained detailed technical material clearly and correctly. It might come close to passing a technical interview; at least, it’s the first NLP system that has ever made me wonder about this. Its breadth of knowledge, grammatical mastery, robustness, and ability to refer back to previous messages would all deserve detailed praise in a more thorough technical review.
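For concreteness, here is a rough sketch of the kind of point-in-polygon test I mentioned above, written in Python rather than the Java that Chat GPT produced, and reconstructed by me rather than copied from its output. It uses the standard ray-casting rule: a point is inside a polygon if a horizontal ray from the point crosses the polygon’s edges an odd number of times.

```python
def point_in_polygon(x, y, polygon):
    """Return True if the point (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertex pairs in order. A horizontal ray
    is cast from the point; each edge it crosses flips the inside/outside
    state (the classic ray-casting test).
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's height, and does the crossing
        # point lie to the right of (x, y)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Example: the unit square contains (0.5, 0.5) but not (1.5, 0.5)
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon(0.5, 0.5, square))  # True
print(point_in_polygon(1.5, 0.5, square))  # False
```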
Then I asked Chat GPT about other topics, and it turned out to be similarly confident (and fun to play with). But one main problem emerged: it was often wrong, especially with more obscure material. I asked about various families related to Emma of Normandy, queen of England as wife first of King Aethelred and then of Canute. While Chat GPT knows the names of 11th-century Anglo-Saxon, Danish, and Norman aristocrats, it gets their relationships wrong most of the time, including silly things like claiming that someone active in the 1010s was the parent of someone born in the 930s. The single sentence at the top, about Ordgar being Ælfgifu’s father and brother at the same time, made it most obvious: Chat GPT didn’t “know” either of these as facts (and neither is true), but it made plausible sentences out of names encountered in similar contexts. It did this with the same convincing fluency as its true statements about Python versions.
Yes, I chose to split hairs about an obscure topic, but this matters. These are exactly the situations where we are not confident ourselves and defer to experts: if a computer told me that halofantrine hydrochloride has no harmful side effects, I’d be inclined to believe it, because I don’t know any better. Therein lies a big risk: modern brains-over-brawn values have trained us to assume that fluent, confident speakers who string together words we don’t fully understand are smart, and therefore trustworthy. But sometimes they aren’t.
Computer programs have already beaten experts at chess and Go. Now, in 2023, they have mastered much of language generation, and in a general-knowledge quiz Chat GPT would beat me most of the time. This is exciting and challenging for us. Chat GPT creates text whose fluency is no guarantee of truth. This technology will make it easier to write proposals, surveys, vision statements, even computer programs, all of which will look convincing but may be fictional. It’s going to be easier to build prototypes and harder to fix bugs.
Already we confuse confidence with accuracy too easily. With language fluency automated, separating truth from bluster will be harder than ever. Many thanks to Chat GPT for highlighting this — and for lots of fun conversations that we’ve only just begun!
***
Some Specific Considerations
As of March 2023, the tech news has been full of developments around Chat GPT, its incorporation into Bing search, corresponding efforts from Google and others, and responses to these technologies from other communities. I won’t try to survey these properly, but will just add a couple of points (and may add more later).
Should Chatbots be Banned?
Update already: Claims that AI-written text can be detected and should be banned have already faded. For example, this article on resume writing finds that using AI to help write a resume and cover letter is widespread and increases the chances of success; that it’s hard to argue that using a spellchecker or grammar checker is fine while using Chat GPT is wrong; and that candidates still need to realize that the responsibility for truthfulness is theirs. (Even from a purely selfish point of view, a resume can only get an application as far as an interview, and candidates need to make sure it leads to successful interviews.)
Back in January and February there was plenty of talk of banning Chat GPT and similar products from being used in schools, and announcements (sometimes received with much relief) that people are already working on “Chat GPT detectors”. This is a sudden defensive reaction that misses the longer incremental story: writing text in electronic form has been a collaboration between humans and machines for decades.
Computers started by flagging words that weren’t in their dictionary as potential misspellings, then moved on to proposing corrections, and then completions, and those completions have grown from individual words to whole sentences. Chat GPT does much more than complete my sentences for me, but if its use is to be regulated, what rules can possibly be made around this?
Potentially brilliant people used to be held back and demoralized just because they struggled with the haphazard mess that is English spelling. Computers help us to write, some of us need more help in some areas than others, and trying to legislate this help away won’t work and won’t make the world fairer. As with spelling correction, only bigger and better, we need to learn to use language models, accept their help when appropriate, be aware of the kinds of mistakes they make, and remember that the responsibility for any text we share rests with us.
This duty falls especially upon language engineers: for any of us working on automated dialog systems, if these systems generate text that is shared directly with end users, then the responsibility for publishing it is ours. The IEEE Code of Ethics reminds us “to hold paramount the safety, health, and welfare of the public.” Chatbots are by now easily powerful enough to be used for scamming the unsuspecting and for making harmful suggestions to the vulnerable. Machine learning systems are famously statistical and probabilistic — we don’t know exactly how they will behave in different situations — but that doesn’t mean that unpredictable harmful consequences are unavoidable or “not our fault”. If a civil engineer can’t guarantee the structural safety of a bridge, we close it to traffic. Software and AI engineering have very few hard safety standards like this, and we need to develop them, rather than using the lack of clear standards, combined with statistical uncertainty, as excuses for releasing harmful products.
Can Chatbots be Sentient?
A related discussion revolves around whether Chat GPT and similar systems are sentient. One fundamental claim is that such systems are just processing the symbols they’ve been given, and that this can’t possibly be considered intelligent or sentient behavior. Such arguments have been common since at least the late twentieth century, and as systems built on language models have become more impressive, the claims that they can’t possibly be intelligent have become more insistent. But as far as I can tell, the argument behind these claims hasn’t actually changed.
My thoughts on this topic are cautious and reserved at this point. If a system sometimes looks sentient to me, but it isn’t the kind of thing I believe can be sentient, then I should question what I mean by “looking sentient”. When we define sentience as “something we recognize as being like us”, we can assert that things different from us can’t be sentient, and then deny them any of the rights that go along with sentience in our ethical and legal systems. The science for determining intelligence is inconclusive, but in spite of this, history already has far too many examples of people like me deciding who counts and restricting rights accordingly. I don’t have a rigorous scientific definition of or test for “sentience”, and no amount of education or status qualifies me to judge that something not-like-me isn’t sentient.
What About the Environmental Costs?
Update on costs: Just in March 2023, the narrative about the cost of training LLMs (Large Language Models) has changed from “retraining GPT costs millions of dollars” to “researchers trained a new chatbot for $600”. Given this pace of change, the suggestion that we shouldn’t invest in LLMs because they’re too costly is obviously flawed.
It’s easy to find articles saying that “scientists are worried about the environmental costs of building large language models”. But when these costs are analyzed, it’s not the models themselves that are the problem.
Environmental costs are sometimes compared with those of airplane flights: according to https://aclanthology.org/P19-1355.pdf, training a single large transformer model emits around 192 lbs (87 kg) of CO2, while a single passenger’s round trip from New York to San Francisco is estimated at 1,984 lbs (899 kg). The Federal Aviation Administration handles some 45,000 flights per day, so the comparison with air travel demonstrates how small, not how large, this contribution is. The energy cost of building ten large language models may literally be comparable to the cost of traveling to a single conference to present a paper highlighting concerns about these energy costs!
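Spelling out the arithmetic behind that last sentence, using only the two figures quoted above from the cited paper:

```python
# CO2 estimates from https://aclanthology.org/P19-1355.pdf (quoted above)
lbs_per_large_transformer = 192   # training one large transformer model
lbs_per_ny_sf_roundtrip = 1984    # one passenger, New York <-> San Francisco round trip

models_per_trip = lbs_per_ny_sf_roundtrip / lbs_per_large_transformer
print(f"One conference round trip ~ training {models_per_trip:.1f} such models")
# -> One conference round trip ~ training 10.3 such models
```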
And that’s the cost of building a large language model vs. using an airplane. Once a model is built, it can be used to serve millions of users and billions of requests. By contrast, if we compared the cost of building a language model with that of building an airplane, this would demonstrate how cheap, not how expensive, it is to build most language models.
Already (in the couple of weeks since I first wrote this!), various news articles have estimated the cost of building a large language model on the scale of a new GPT model as being in the millions of dollars: these are much bigger models, I expect, than the ones behind the “192 lbs of CO2” estimate above. They are costly enough that only a few companies can build them, which raises many concerns other than energy usage. But if we compared the cost of building a GPT model (say $10M, and there are 4 so far) with the cost of building an airplane (around $100M to build a 737, and about 40 are built each month), we’d conclude that the cost of the language model is pretty insignificant, especially considering how many people get to use it every day.
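Again, the back-of-the-envelope version, using the ballpark figures above (my rough estimates, not audited numbers):

```python
# Ballpark figures quoted in the text above; treat them as rough estimates only
gpt_training_cost = 10_000_000     # ~$10M per GPT-scale model
gpt_models_so_far = 4              # GPT models to date, as of writing
cost_per_737 = 100_000_000         # ~$100M to build one Boeing 737
boeing_737s_per_month = 40         # approximate monthly production

all_gpt_models = gpt_training_cost * gpt_models_so_far      # $40M in total, ever
one_month_of_737s = cost_per_737 * boeing_737s_per_month    # $4B every month
print(f"All GPT models so far: ${all_gpt_models:,}; "
      f"one month of 737 production: ${one_month_of_737s:,}")
```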
When the costs of building more standard language models are analyzed carefully (as in https://aclanthology.org/P19-1355.pdf), it’s the cost of retraining models many times over that is alarming. The problem here isn’t the models themselves: it’s the incentive to train and retrain in search of the best possible set of parameters, sometimes with negligible gains over many easier-to-find second-bests. This incentive is felt much more keenly when trying to get a paper published than when trying to get a production system deployed and serving user requests. With research papers, we have to beat the best previously reported results. With production systems, we have to be good enough to meet user requirements. User requirements, analyzed properly, are often more stringent, not more relaxed: but not in ways that can be addressed by finding a percentage point of improvement somewhere, after training hundreds of models with minor differences.
I’ve spent many months, at various times, working on optimizing the efficiency of language models (for example, in this work at LivePerson). This was because the models needed to be leaner, faster, cheaper, more modular, and quicker to fix or retrain when problems were identified. These are really important issues in machine learning that directly affect customer service. But so far, I don’t see good reasons for arguing that the computation used in processing language models is especially environmentally concerning. At least, not when we use them responsibly.
(For context: since 2001 I’ve written scientific papers and a book, and written and managed open-source and commercial software, for building and using distributional vector language models. Few people have worked with semantic vector models for this long, since before they became famous as word embeddings last decade, and long before they became so sophisticated and disruptive. I’ve helped with their gradual development and adoption first-hand for years, so the opinions above are “expert opinions”. As often happens, they sometimes differ from other expert opinions.)