The ongoing problem of hallucinations has tempered enterprise enthusiasm for generative artificial intelligence (AI). While large language models (LLMs), such as GPT-4, Llama 2, and Mixtral, can generate rapid, fluent responses to many user prompts, some of those responses are demonstrably false.
Hallucination refers to any model output that cannot be inferred from the underlying data set. It includes content that is nonsensical and demonstrably false; content that is untruthful but may be hard to detect as incorrect; and answers that are accurate but are not derived from source data.
So, just how challenging is the problem of hallucination? A recent study found that LLMs may hallucinate 3% to 27% of the time, depending on the model [1]. In specific contexts, the issue may be much worse: a second study found that LLMs provide false legal information 69% to 88% of the time [2].
Solving LLM hallucinations is a top priority for enterprise IT and data leaders. Doing so will enable leaders and teams to build stakeholder confidence in using the technology for key use cases, gain the investment they need to build generative AI capabilities, and scale solutions enterprise-wide.
Delving Deeper into the Problem of LLM Hallucination
LLMs with billions of parameters are trained on massive data sets. As such, they have a wealth of content from which to generate responses. So, why do they hallucinate so frequently?
Reasons for LLM hallucination include:
- Models lack context: It’s not enough to train LLMs on massive amounts of data; they must also be given the right depth of data and context. For example, an LLM developing a personalized medical plan for a patient needs to be able to compare symptoms against known diseases and produce a treatment plan that aligns with medical best practices and insurance coverage.
- Models struggle to incorporate new data: Overfitting is a common pitfall. A model fit too closely to its training data will generally excel during training but struggle to use new information when generating responses.
- Text isn’t properly encoded: Vector encoding minimizes ambiguity by representing words as distinct numeric vectors. LLMs can hallucinate if there are encoding errors, or if users deliberately engineer prompts to elicit a false response.
Most foundational models are built using a decoder-only transformer architecture. They operate as causal language models, meaning they are trained to predict the next token in a sequence. As these models make predictions, they also provide a measure of confidence in their predictions, typically through token probabilities derived from the probability distribution of potential next tokens.
Since model outputs are probabilistic, they rely on patterns in the data. Information that is more widely known or appears frequently in the training data is more confidently predicted. For instance, consider the following examples:
| LLM Can Confidently Predict the Output | LLM Can’t Confidently Predict the Output |
| --- | --- |
| “The Great Wall is located in the country of ____.” | “A major Walmart store is present in the U.S. city of ____.” |
| The model will confidently predict “China.” | The model has limited data cues and will likely offer a range of options or refuse to answer. |
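To make the token-probability idea concrete, here is a minimal sketch. It assumes the open-source GPT-2 model and the Hugging Face transformers library, chosen purely for illustration, and inspects the probability distribution a causal LM assigns to the next token:

```python
# Minimal sketch: inspect a causal LM's next-token probability distribution.
# GPT-2 and the transformers library are assumed purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The Great Wall is located in the country of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the next token, given the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([token_id])!r}: {prob:.3f}")
# For a widely known fact like this, most of the probability mass should land on " China".
```

For a prompt like the Walmart example, the same distribution would be far flatter, which is exactly the low-confidence signal that probability-based detection methods later in this article exploit.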
When LLMs Lie, Negative Impacts Increase
Solving the problem of hallucination is of utmost importance. Generative AI spending, while lower than spending on other technologies, is growing, and enterprise leaders see developing this competency as vital to their future success.
If LLM hallucinations persist, this issue could lead to:
- Lagging investment in generative AI initiatives: While enterprise interest in LLMs is currently high, continued issues with hallucinations could cause leaders to pull back on funding these initiatives, or business heads could be reluctant to pilot or deploy solutions.
- Diminishing trust in models: Business users will be less likely to use LLM tools, such as copilots or knowledge bases, in their daily work if they know that some of the output is false and could harm the quality of their work and expose them to censure.
- Causing harm to individuals: LLMs are being deployed for consumer-facing processes, such as healthcare applications, where copilots summarize notes and provide personalized treatment plans. While LLMs have governance and human quality checks, failure to detect hallucinations in sensitive use cases such as healthcare or law could cause physical, financial, or other harm to individuals.
- Causing financial losses: Users who rely on false LLM output to make strategic or operational decisions could cause revenue or profitability losses for their company. For example, a pharmaceutical company that develops a product based on incorrect LLM guidance could stand to lose significant sums and harm its market position.
Guiding Models to Deliver Better Outputs
Enterprise leaders and teams are aware of LLM hallucination challenges, which is why they’re taking their time to create effective governance, including automated and human-guided accuracy checks. Here are some tried and true strategies for reducing hallucinations:
- Using guardrails with prompts: Trained prompt engineers can guide LLMs to minimize the risk of false or inaccurate information. Examples include: “Think this answer through step-by-step before writing a response” and “Do not answer if the information is not in the context provided.”
- Providing examples of desired output: Prompt engineers can provide examples of content they’d like to emulate to ensure that LLMs adhere to specific formats and styles in their responses. An example is: “Review this blog post for evidence of our CEO’s style and generate a response in her tone and style.”
- Leveraging retrieval augmented generation (RAG): With RAG, teams enhance output quality by retrieving relevant external information and supplying it to the LLM when it generates a response. An example is: “Use Tredence, McKinsey, and Bain content to generate your response. Do not use content from companies X, Y, and Z.” (A minimal prompt sketch follows this list.)
- Fine-tuning data sets: Enterprises can tailor LLMs for specific tasks by fine-tuning them on relevant data sets. With greater context, LLMs provide more reliable output. An example is: “Review Medicaid coverage guidelines for the current year to determine whether this procedure will be approved and paid for under this program.”
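As referenced above, here is a minimal sketch of a guardrailed RAG prompt. The `retrieve_passages` and `llm_complete` helpers are hypothetical stand-ins for whatever retriever and LLM client your stack uses; only the prompt assembly is shown concretely.

```python
# Minimal sketch of a guardrailed RAG prompt.
# `retrieve_passages` and `llm_complete` are hypothetical stand-ins for your
# retriever and LLM client; only the prompt assembly is shown concretely.
GUARDRAIL_TEMPLATE = """Answer the question using ONLY the context below.
Think the answer through step by step before writing a response.
If the answer is not in the context provided, reply "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def build_guardrailed_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and guardrail instructions into one prompt."""
    context = "\n\n".join(passages)
    return GUARDRAIL_TEMPLATE.format(context=context, question=question)

# Usage (hypothetical helpers):
# passages = retrieve_passages(question, top_k=4)
# answer = llm_complete(build_guardrailed_prompt(question, passages))
```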
Retrieval Augmented Generation Shows Its Limits
Enterprise leaders want to achieve consistent LLM responses that aren’t solely dependent on human guidance and don’t require the effort and cost of fine-tuning. As a result, many are turning to RAG to improve output quality.
However, RAG poses the risk of the model incorporating irrelevant retrieved content into its responses. This can produce answers that are accurate but not directly derived from the source content, or answers that are completely inaccurate.
Consider a situation where a clinician does an intake with a new female patient and learns about her health conditions.
Doctor: Do you smoke?
Patient: No, I quit before I had my daughter.
Doctor: Are you currently pregnant?
Patient: No, I'm not.
Doctor: Did you have any complications with the birth of your daughter?
Patient: I had a C-section.
Doctor: Have you had any other surgeries in the past?
Patient: I got my appendix out a few years ago.
Doctor: Do you have any other issues, like high blood pressure or heart disease?
Patient: No.
Doctor: Do you have diabetes?
Patient: No.
Doctor: Are there any problems with your lungs, thyroid, kidney, or bladder?
Patient: No.
Doctor: So, how long ago did you hurt your lower back?
Patient: It was about four or five years ago now, when I was in a car crash.
Doctor: What kind of treatments were recommended?
Patient: I had PT and had no pain after that.
Now, the clinician wishes to interrogate the LLM about the patient’s health and receives the following answers:
As we can see from the examples above, the model produces a variety of outputs, only one of which is usable because it is both factually accurate and free of hallucination. The second answer is a classic hallucination, the third is correct but irrelevant, and the fourth is correct but not found in the source data.
As a result, RAG techniques alone are insufficient to prevent LLM hallucinations. Enterprise teams need other methodologies to layer on top of RAG to detect and address hallucinations before they reach users.
How to Detect Hallucinations
There are three principal methods that enterprises use to detect hallucinations today:
- White box approach: This method involves complete transparency, allowing enterprise IT teams to access the internal states and output distribution of models they control entirely. Example: Use the internal states of the model to classify sentences as hallucinated or not hallucinated.
- Grey box approach: This method can be used when IT teams can access the complete output distribution of tokens but not the model weights. Example: Use the probabilities of output tokens to develop a metric for classifying sentences (a minimal sketch follows this list).
- Black box approach: This method applies when IT teams cannot access internal states or output distribution, such as with LLMs like GPT-3.5 or GPT-4 that are accessed through APIs.
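As noted in the grey box item above, a simple metric can be built from token probabilities. The sketch below is a minimal, self-contained illustration; it assumes your serving API returns per-token log-probabilities (as some completion APIs do), and the numbers and threshold are illustrative only.

```python
# Minimal grey-box sketch: flag low-confidence generations using token log-probs.
# Assumes the serving API returns per-token log-probabilities; the values and
# the 0.5 threshold below are illustrative only.
import math

def mean_token_probability(token_logprobs: list[float]) -> float:
    """Geometric-mean probability of the generated tokens."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Example log-probabilities for two generated sentences (made-up values).
sentences = {
    "confident": [-0.05, -0.10, -0.02, -0.08],
    "uncertain": [-1.20, -2.50, -0.90, -3.10],
}

for name, logprobs in sentences.items():
    p = mean_token_probability(logprobs)
    verdict = "flag for review" if p < 0.5 else "likely fine"
    print(f"{name}: mean token probability {p:.2f} -> {verdict}")
```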
We will focus on black-box approaches due to the prevalence of LLM use in enterprise operations and the fact that they can also apply to grey and white-box models.
Boxed-Out: Detecting Hallucinations in Low-Transparency Situations
We’re indebted to SelfCheckGPT, a recent research paper on hallucination detection that discusses several methods for detecting hallucinations [3]. We have adapted these methods to make them suitable for a RAG setup and analysis.
1. BERTScore: Sentence Similarity and Hallucination Detection
BERTScore leverages pre-trained language models to assess the semantic similarity between sentences. It compares each sentence in the generated text to the reference document, highlighting potential inconsistencies or "hallucinations" – statements lacking textual support. The method’s strength lies in its ability to identify surface-level factual discrepancies using semantic analysis.
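A minimal sketch of this check, assuming the open-source bert-score package; the sentences and the 0.85 threshold are illustrative only:

```python
# Minimal sketch of a BERTScore-based hallucination check using the open-source
# bert-score package. Sentences and the 0.85 threshold are illustrative only.
from bert_score import score

reference = "The patient had a C-section and had her appendix removed a few years ago."
generated_sentences = [
    "The patient previously had a C-section.",        # supported by the reference
    "The patient has a history of heart disease.",    # likely unsupported
]

# Score each generated sentence against the same reference passage.
P, R, F1 = score(generated_sentences, [reference] * len(generated_sentences), lang="en")

for sentence, f1 in zip(generated_sentences, F1.tolist()):
    verdict = "possible hallucination" if f1 < 0.85 else "supported"
    print(f"{f1:.3f}  {verdict}: {sentence}")
```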
2. NLI: Leveraging Inference for Evidence-Based Evaluation
Natural language inference (NLI) models assess the relationship between a premise (sentence) and a hypothesis (reference document excerpt). Unlike traditional NLI tasks, this method focuses on the “entailment” and “contradiction” classes, indicating if a generated sentence is directly supported by the reference or contradicts it. This approach goes beyond surface-level similarity, delving into the logical coherence between LLM outputs and factual evidence.
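A minimal sketch of the NLI check, assuming an off-the-shelf MNLI-trained model such as roberta-large-mnli from the Hugging Face hub, used here purely as an example:

```python
# Minimal sketch of the NLI check with an off-the-shelf MNLI model.
# "roberta-large-mnli" is used purely as an example; the label names are read
# from model.config.id2label rather than hardcoded.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "Patient: I got my appendix out a few years ago."    # reference excerpt
hypothesis = "The patient has had a cholecystectomy."          # generated sentence

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

labels = [model.config.id2label[i] for i in range(probs.shape[0])]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
# A high contradiction (or low entailment) probability flags the sentence as unsupported.
```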
3. Prompt Method: Interactive Evaluation with an LLM Assistant
The prompt method uses an LLM itself as an evaluation tool. By providing the reference document and each generated sentence as prompts, the LLM is asked to assess if the sentence finds support within the reference. This approach leverages the LLM’s inherent understanding of language to evaluate factual consistency in a more nuanced and context-aware manner.
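A minimal sketch of the prompt-based check. The `llm_complete` helper is a hypothetical wrapper around whatever chat-completion API you use; only the evaluation prompt is concrete:

```python
# Minimal sketch of the prompt-based check. `llm_complete` is a hypothetical
# wrapper around your chat-completion API; only the evaluation prompt is concrete.
JUDGE_TEMPLATE = """Context:
{context}

Sentence:
{sentence}

Is the sentence supported by the context above? Answer "Yes" or "No" only."""

def is_supported(sentence: str, context: str) -> bool:
    """Ask the LLM whether a generated sentence is grounded in the reference text."""
    prompt = JUDGE_TEMPLATE.format(context=context, sentence=sentence)
    reply = llm_complete(prompt)  # hypothetical LLM call
    return reply.strip().lower().startswith("yes")

# Sentences judged "No" are flagged as potential hallucinations for review.
```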
NLI Knocks Down the Competition in Detecting Hallucinations
To test the merits of these three approaches, we built a RAG system based on the Llama 2-13B-chat model using a corpus of financial reports. We created 50 questions that the RAG system could answer from the documents in its corpus, generated answers with the RAG system, and evaluated hallucinations at the sentence level using the BERTScore, NLI, and prompt methods. The following table shows the results of this analysis [4].
| | BERTScore | NLI Score | Prompt |
| --- | --- | --- | --- |
| Ability to generate hallucination-free output | 76.29% | 80.63% | 81.44% |
We recommend using the NLI method since it is one of the best-performing methods, is less resource-intensive than other methods, and can be used with any foundational LLM without requiring changes to the underlying model.
Isolating Hallucinations in Responses
Being able to detect hallucinations in responses reliably is an essential first step. But what if teams could go further and isolate hallucinations in model output: detecting which portion of the answer is incorrect?
We recommend using the integrated gradients approach to accomplish this goal and improve explainability around hallucination. The method is based on two important principles: sensitivity (changes in important input features should have a noticeable impact on the output) and implementation invariance (the way you calculate the attribution shouldn't affect the result). Here's a simplified breakdown of how it works, followed by a minimal code sketch:
- Baseline: The method starts with a baseline input, which is like a neutral, blank starting point. Think of it as a blank canvas.
- Steps: It then takes a series of small, step-by-step changes from the baseline to your original input. Imagine adding paint strokes to the canvas, gradually building up the image.
- Gradients: At each step, it calculates the gradient, which tells users how much the model’s output changes in response to a small change in the input. Think of it as measuring how much each paint stroke affects the overall picture.
- Integration: Finally, it adds up the gradients from all the steps, giving users an integrated score representing how much each part of the original input contributed to the final output. This is like looking at the finished painting and understanding which brushstrokes were most important in creating the final image.
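Here is the minimal sketch referenced above: a self-contained PyTorch illustration of integrated gradients on a toy linear model. In practice you would attribute over the LLM's token embeddings (libraries such as Captum provide ready-made implementations); the toy model simply keeps the example easy to verify.

```python
# Minimal, self-contained integrated gradients sketch on a toy linear model.
# In a real setup the attributions would be computed over token embeddings.
import torch
import torch.nn as nn

def integrated_gradients(model, x, baseline, steps=50):
    """Average gradients along the straight-line path from baseline to input,
    then scale by (input - baseline) to get per-feature attributions."""
    total_grad = torch.zeros_like(x)
    for i in range(1, steps + 1):
        point = baseline + (i / steps) * (x - baseline)   # step along the path
        point.requires_grad_(True)
        model(point).sum().backward()                     # gradient at this step
        total_grad += point.grad
    return (x - baseline) * total_grad / steps

# Toy "model": for a linear layer the attributions are easy to sanity-check,
# since they reduce to (x - baseline) * weight for each input feature.
model = nn.Linear(4, 1)
x = torch.tensor([[1.0, 2.0, 0.0, -1.0]])
baseline = torch.zeros_like(x)   # the neutral "blank canvas" reference input
print(integrated_gradients(model, x, baseline))
```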
So, what does this look like in practice? Let’s return to our original example of the female patient talking to her doctor, who then asks the model to synopsize the conversation. The image below demonstrates the results of two queries.
- Answer one: This answer is riddled with errors, as it states that the patient has numerous medical conditions she denied having. Using RAG, NLI, and integrated gradient methodology, our model has a 99% confidence score that the highlighted text is a hallucination. The word “agrees” is also highlighted because it is responsible for the hallucinated response.
- Answer two: In this shorter text snippet, the RAG, NLI, and integrated gradient methodology has a 59% confidence score that there is a hallucination present. However, the methodology can identify the correct answers, “patient” and “C-section,” while isolating the hallucination to the incorrect procedure, “cholecystectomy.”
Benefits of Halting Hallucinations
Using this advanced, integrated methodology for combatting hallucinations enables enterprises to reap the following benefits from LLMs:
- Be fast to scale: While competitors struggle to move beyond pilot projects, enterprises that consistently generate high-quality output using RAG, NLI, and integrated gradient methods can scale models enterprise-wide.
- Extend to new use cases: With an easy-to-replicate strategy for mastering hallucinations, enterprises can leverage generative AI for more use cases and complex workflows, empowering employees with new insights and increasing ROI on their investments.
- Improve explainability: Users can see confidence scores and isolate problematic responses, increasing their trust in model output and their ability to use it to optimize productivity.
RAG + NLI + Integrated Gradient Methods
With the RAG, NLI, and integrated gradient methodology, data and IT teams finally have a winning strategy for halting hallucinations in their tracks. RAG uses external data to improve output quality. NLI is more reliable than other methods at detecting hallucinations while being easier to use and less costly. And the integrated gradient method isolates problematic responses, enabling data and IT teams to better tune models to produce high-quality output.
How to Deploy LLMOps to Speed New Generative AI Capabilities
Generative AI applications offer great promise, but only if you can reduce LLM hallucinations while containing process complexity and costs. As a first step, consult with a data science and analytics firm that provides the domain expertise, data science and engineering skills, and LLMOps capabilities enterprises need to deliver generative AI applications. Such a partner can help you speed and scale LLM processes while improving model performance and output quality and reducing costs.
You need to focus on:
- Gaining end-to-end services: Look for generative AI services that span enterprise strategy development, functional expertise, human-centric design, industrialized technology delivery, platform engineering and implementation, managed services, change management and adoption, and governance and compliance, all packaged as easily consumable LLMOps services. Proven engineering models, frameworks, and delivery practices will enable you to optimize performance and cost.
- Streamlining the building of proprietary models: Use LLMOps processes to build generative AI applications on top of LLMs. Use your intellectual property (IP) to train and fine-tune your models and to contextualize them for your key use cases, and learn which generative AI use cases matter most to your customers.
- Accessing domain-specific models: Use prebuilt, pre-trained AI/ML models that speed time to value. You can store these models in protected environments and train and fine-tune them further to suit your use cases.
- Using the best techniques to produce quality output: We believe that RAG, NLI, and integrated gradients, used together, minimize hallucinations and enable teams to detect and remediate remaining errors efficiently.
Contact Tredence to learn more about how to deploy LLMOps capabilities and quantify the value you’ll achieve by implementing standardized, automated processes that improve model deployment and response quality.
AUTHOR
Ankush Chopra
Director, AI Center of Excellence