ATD Blog

Beyond the Hype: Understanding Language Model Hallucinations

Tuesday, August 29, 2023

We are living in an exciting time of rapidly growing success for large language models (LLMs). Previously, artificial intelligence (AI) was something only coders could use; ChatGPT’s release has made the technology accessible, and free, for everyone. Many of us have been using these tools for some time now, while others are only just starting to experiment with them—with sometimes amazing and sometimes poor outcomes.

Furthermore, we have heard of GPT-3.5’s and GPT-4’s successes, and OpenAI’s marketing team has successfully informed us that the GPT-4 model excels at the bar exam, cruises through biology olympiad questions, and can write, debug, and improve code. If ever “proof” was needed of the power of AI, we’re being presented with just that. Or are we?

To understand this better, let’s examine how such large language models operate. In simplistic terms, an LLM is trained on a vast amount of text—usually data scraped off the internet—called a training set. Through this process, the language model “learns” what words and sentences usually follow others. Basically, the system is predictive text on steroids. Then, when we prompt it, the LLM models what a good answer should look like, based on the prompt and its training input. It essentially mimics the training text, generating responses based on patterns it identified in the training set.
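To make the “predictive text on steroids” idea concrete, here is a deliberately tiny sketch in Python. A real LLM learns statistical patterns over tokens with a neural network, not a lookup table of word pairs, but the core idea is the same: predict a plausible continuation from patterns seen in the training text. The corpus and function names below are invented purely for illustration.

```python
from collections import defaultdict, Counter

def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequently seen next word, or None if never seen."""
    counts = model.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# A tiny "training set": the model only knows what it has seen.
corpus = "the cat sat on the mat and the cat slept on the mat"
model = train_bigram_model(corpus)

print(predict_next(model, "sat"))  # "on" -- the only follower seen in training
print(predict_next(model, "dog"))  # None -- "dog" never appeared, so no prediction
```

Notice that the model has no idea what a cat or a mat is; it only knows which words tend to follow which. Scale the corpus up to most of the internet and the predictions become remarkably fluent, but the mechanism is unchanged.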

Admittedly, given the vast amounts of training data these models use as their knowledge base, they can provide convincing responses, much better than what we would expect from predictive text. We have all been amazed at some of these models’ output—responses that would have been unthinkable for many of us even a year ago.

However, that is all an LLM is: impressive predictive text, parroting the training set back at us. The model doesn’t understand the content or know what it’s talking about. It doesn’t use logic or have any sense of truth or correctness. Convincing and even seemingly mind-blowing outputs from these models emerge simply from the vastness of their training sets. The better the training set, the better the parroting.

So, how can we explain how the model does so well on the bar exam, biology olympiad questions, or code?

The bar exam is based on factual knowledge rather than analytical skills. Laws, legal texts, court decisions, and precedents are all available on the internet, with every precedent and court decision explained in great detail—a fantastic data set to train an AI model on.


What about the biology olympiad? Of all the olympiads, biology relies the most on factual knowledge. That body of knowledge is also available on the web—another great knowledge engine for AI. In comparison, the model struggles considerably with the physics, chemistry, and math olympiads. Many of the challenges in these tests require building a mental model of a scenario and testing it against specific circumstances—something LLMs can’t do. Unless its training set included many similar challenges and solutions, an LLM generally can’t answer such questions correctly.

And finally, coding. How do these models perform so well on code? First, the internet is full of chatrooms and threads focused on code and coding challenges, with debates, questions, answers, different approaches, bugs, and debugging ideas all shared in writing: a fantastically vast knowledge base to train the system on. And second, code is simply a language. When we write code, we translate processes, decision making, and flowcharts from our own language into a computer language. And language is exactly what an LLM is trained on.

With all of this in mind, we can understand why LLMs return such good results on topics and information the internet is full of—often the same information high school essays are crafted around. When information available on the internet suffices as a knowledge engine, an LLM can deliver impressive results.

However, we have also seen how the engine, when asked to provide a research paper, can completely fabricate research that doesn’t exist. The model might use the names of people who work in the field as authors but completely make up a title and publication. Why? Because, based on the LLM’s training set, that’s what a good answer should look like. It’s predictive text, clever in some ways, but lacking any understanding of content or logic.


Is this a bug? Not in the traditional sense of a coding error. It’s more a natural consequence of the current state of the art in training large neural networks on vast data sets. The model has done exactly what it was designed to do—provide an answer mimicking the text it trained on.

As this field advances, researchers are looking for ways to make models more reliable, less prone to hallucination, and more interpretable. One such promising avenue is pairing a language engine with a logic engine. Doing so would complement the immense knowledge and pattern recognition abilities of the LLM with the structured reasoning of a logic system. This fusion has the potential to yield more accurate and trustworthy outputs, especially in domains that demand rigorous logical consistency.
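To hint at what such a pairing might look like, the sketch below combines a stand-in “language engine” (a canned, confidently worded answer, where a real system would call an LLM) with a tiny “logic engine” that symbolically checks an arithmetic claim instead of trusting the fluent text. All names and the canned answers are invented for this example; it is a minimal sketch of the verification pattern, not an actual neuro-symbolic system.

```python
import ast
import operator as op

# Stand-in "language engine": in a real system this would be an LLM call.
# Here it returns canned, fluent answers -- one of which is confidently wrong.
def language_engine(question):
    canned = {
        "what is 17 * 24?": "17 * 24 = 418",  # plausible-sounding, incorrect
        "what is 2 + 2?": "2 + 2 = 4",
    }
    return canned[question]

# Minimal "logic engine": evaluates the arithmetic itself, without eval().
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul}

def safe_eval(expr):
    """Evaluate a simple arithmetic expression by walking its syntax tree."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def verify_claim(claim):
    """Check that a claim of the form 'expression = value' actually holds."""
    expr, _, stated = claim.partition("=")
    return safe_eval(expr.strip()) == int(stated.strip())

for question in ("what is 17 * 24?", "what is 2 + 2?"):
    answer = language_engine(question)
    print(answer, "->", "verified" if verify_claim(answer) else "rejected")
# 17 * 24 = 418 -> rejected  (17 * 24 is actually 408)
# 2 + 2 = 4 -> verified
```

The language engine sounds equally confident either way; only the symbolic check distinguishes the true claim from the fabricated one, which is exactly the gap this line of research aims to close.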

However, creating this combination is not without its challenges. Bridging the gap between statistical, pattern-based reasoning and symbolic, rule-based logic requires careful design. Furthermore, ensuring that the resulting system remains transparent and understandable to users is vital. Such endeavors represent the next frontier in making AI tools both powerful and reliable.

If you’re interested in reading more on this topic, as well as on biases and ethical concerns, I highly recommend one of the most insightful papers in the field: “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” by Bender, Gebru, McMillan-Major, and Mitchell. As we stand on the cusp of a new AI era, the potential for LLMs to revolutionize communication, learning, and decision making is undeniable. With continued exploration and innovation, we might soon witness an AI evolution that exceeds our wildest imaginations, reshaping the fabric of how we interact with technology and the world around us.

Keep up with the latest AI insights and additional resources here: https://www.td.org/atd-resources/artificial-intelligence

About the Author

Markus Bernhardt leads Endeavor Intelligence, specializing in AI strategy consulting that blends technological expertise with strategic business applications. Markus supports a range of F500 companies and government organizations regarding AI strategy in his role as the AI strategy lead at The Learning Forum. In collaboration with Mike Vaughan, Markus has developed a comprehensive AI strategy framework through The Thinking Effect, a not-for-profit community for talent, learning, training, and performance professionals focused on AI tools, AI strategy, research, and thought leadership.

1 Comment
Great points made illustrating why LLMs can hallucinate. It’s an important topic for everyone to understand. It’s very similar to the early days of the internet, when we did not fully understand the need for and value of good security hygiene. Today, however, everyone inherently appreciates the need for complex passphrases and 2FA. Hopefully similar best practices will emerge that are easy to adopt and enable us to judge the trustworthiness of a generated response (based on the training models used).