With the rapid advancement of Artificial Intelligence, using LLMs for organizational and personal growth is becoming unavoidable. Content generation, product categorization, and document summarization are just a few of the many possibilities, and "hallucination" is emerging as the key concern when such systems are productized. This article introduces techniques to mitigate hallucination.
What is an LLM?
Large Language Models (LLMs) are probabilistic models that forecast the likelihood of the next word in a sequence. They are trained using cutting-edge deep neural networks inspired by the functions and operation of the human brain; the Transformer architecture, for instance, is regarded as their foundation. Their learning is derived from vast corpora of documents gathered from the internet, spanning many domains and dimensions. The adjective "large" refers to the enormous number of parameters: OpenAI's GPT models, for example, run to hundreds of billions of parameters (GPT-3 has 175 billion), and Google's Gemma model ships in a seven-billion-parameter variant.
What is Hallucination?
The built-in nature of LLMs is to predict the next words and arrange them with semantic coherence into a paragraph. However, such paragraphs can be fabricated with factually erroneous information that is hard for humans to detect; this is called "hallucination." It is the key challenge in any fully grown production system whose innermost capabilities rely on this emerging branch of Artificial Intelligence.
Mitigating Hallucination
Hallucination in LLMs has become a widespread problem that must be monitored and mitigated to achieve safe production readiness. Here, I present several methods that can be implemented across the life cycle of an LLM-based application.
Let's begin by understanding the problem at hand. We have built a fully grown Query-Bot that is deployed in production. The application platform takes each user's question and triggers an in-house inferencing pipeline that generates the response using the underlying vector knowledge database. All the mitigation methods below were thoroughly tested and implemented in this application.
Step 1: Knowledge Base Integration
Build a vector database by processing multiple documents and integrate it into the inference pipeline so that the best-matching document chunks are selected for each query received. The key components that must be checked are:
- Design the pipeline that creates the chunks. Chunk size is a crucial parameter, and it depends on the following factors:
- i. What kind of use case are we dealing with? For instance, a small chunk size should be preferred when building a Query Bot, as it helps in answering both short and elaborate questions; for example, 100 to 350 is an optimum choice.
- ii. A thorough analysis of multiple chunk sizes is essential to detect any loss of context, incomplete sentences, etc.
- Keep an overlapping window between two consecutive chunks. For example, one can use 50% overlap between the current and next chunks (see the chunking sketch after this list).
- The compatibility of the vector database and the vector embeddings generated should be checked.
- Pick the optimum search method: vector search, hybrid search, or semantic ranking. The choice depends on the use case, the domain, and business factors. For example, if the requirement is to understand both semantic meaning and n-gram key phrases, a hybrid search (vector search combined with key-phrase search) is always beneficial.
- Designing practical prompt engineering and post-processing to circumvent hallucinations and deal with edge cases is paramount. (These are discussed further in the upcoming steps.)
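As a concrete illustration of the chunking and overlap guidance above, here is a minimal sketch in Python; the 250-token chunk size, the 50% overlap, the whitespace tokenizer, and the file name are illustrative assumptions (a production pipeline would typically use the embedding model's own tokenizer).

```python
def chunk_document(text: str, chunk_size: int = 250, overlap: float = 0.5) -> list[str]:
    """Split a document into overlapping chunks.

    chunk_size is measured in whitespace tokens here for simplicity;
    overlap=0.5 means each chunk shares roughly 50% of its tokens with the next one.
    """
    tokens = text.split()
    step = max(1, int(chunk_size * (1 - overlap)))  # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail of the document
    return chunks

# Example: 250-token chunks with 50% overlap ("policy_doc.txt" is a hypothetical file)
chunks = chunk_document(open("policy_doc.txt").read(), chunk_size=250, overlap=0.5)
```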
Step 2: Top-k Sampling Strategy
Let's start with the question at hand: how many chunks should be sent to the LLM for each question received? Optimizing the top k chunks depends on the objective to serve and the available data; therefore, the principle of hyperparameter tuning helps achieve the goal.
- To begin with, build the end-to-end flow of the application.
- Fix the parameters that need to be optimized; in our case, it is the number of chunks, k.
- Pick a stratified sample that is a good representative of user questions, covering all the business units, product categories, or other dimensions, as applicable.
- Run the end-to-end solution multiple times, changing the k parameter and keeping the others constant.
- Analyze the outcome, generate a report, identify the trend and the differentiating factor, and pick the value that strikes the best balance (a sketch of this loop follows the list).
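A minimal sketch of this tuning loop, assuming a hypothetical `answer_question(question, top_k)` helper that wraps the end-to-end pipeline and returns the response along with token usage, and a hypothetical `load_stratified_sample()` that returns the representative question set:

```python
import statistics

K_VALUES = [3, 10, 20, 30]                   # candidate chunk counts, per the experiment below
sample_questions = load_stratified_sample()  # hypothetical: representative user questions

results = {}
for k in K_VALUES:
    runs = [answer_question(q, top_k=k) for q in sample_questions]
    results[k] = {
        "avg_prompt_tokens": statistics.mean(r.prompt_tokens for r in runs),
        "answered_pct": 100 * sum(r.answered for r in runs) / len(runs),
        "token_limit_hit_pct": 100 * sum(r.hit_token_limit for r in runs) / len(runs),
    }

for k, metrics in sorted(results.items()):
    print(k, metrics)  # inspect the trade-off and pick the equilibrium point
```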
The results obtained for the Query Bot are presented below; twenty chunks turned out to be the magic number. Note that three chunks is the default, and the experiment was conducted with 10, 20, and 30 chunks.
Response model: GPT-4-32k

| Bucket | Avg tokens (10 chunks) | Response % (10 chunks) | Avg tokens (20 chunks) | Response % (20 chunks) | Avg tokens (30 chunks) | Response % (30 chunks) |
| --- | --- | --- | --- | --- | --- | --- |
| No response with both (3 chunks and >3 chunks) | 4726 | 17% | 10281 | 45% | 15222 | 51% |
| Response with >3 chunks always | 9621 | 50% | 8953 | 55% | 13437 | 49% |
| LLM model token limit reached | 0 | 0% | 0 | 0% | 0 | 0% |
| Average | 7174 | 100% | 9553 | 100% | 14415 | 100% |
Step 3: Prompt Engineering as per the need
Building an effective prompt helps mitigate hallucination. The key is to provide simple, clear instructions and a fully formatted template.
A simple template structure with at least three blocks is functional. Each block represents a context switch in the prompt. Block 1 defines “the role LLM should play and the objective, a task it should perform.” Block 2 contains “guidelines to be followed for the task,” and Block 3 “the desired output format.”
- Each block should be separated by “##” for the LLM model to understand.
- Each instruction should be listed as a pointer using "-." Keep each pointer simple and crisp; if an instruction is long, splitting it into multiple pointers is better.
- Mention the output format with an example to avoid any risk of assumptions. Stick to a single output format.
- Use placeholders such as "<>" for all input and output terminology. This helps establish comprehension and the semantic flow for the LLM.
Last but not least, iterative testing with sample questions depicting multiple business scenarios refines the prompt accordingly.
An illustrative prompt template is presented below; modify it as per the need.
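The following is a minimal sketch of such a three-block template for a Query-Bot; the role description, guidelines, and placeholder names are assumptions and should be adapted to your domain.

```python
PROMPT_TEMPLATE = """
## Role and Objective
- You are a support assistant for <PRODUCT_NAME>.
- Your task is to answer the user's question using only the provided context chunks.

## Guidelines
- Use only the information inside <CONTEXT> to answer.
- If the context does not contain the answer, reply exactly: "No response found."
- Quote the identifier of every chunk you used as a citation.
- Keep the answer concise and factual.

## Output Format
- Answer: <ANSWER_TEXT>
- Citations: [<CHUNK_ID_1>, <CHUNK_ID_2>]

## Input
<CONTEXT>: {context}
<QUESTION>: {question}
"""
```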
Step 4: Tuning LLM API parameters as per the applicability and availability
Look for the parameters the LLM API provides that the developer can control as needed. This step entirely depends on which LLM model you are using. Specifically, some critical parameters available in the OpenAI LLM models that fulfil our needs are listed below (a usage sketch follows the list):
- Temperature: Its value ranges from 0 to 2 and is defined according to the need for creativity. Keeping it at 0 always helps to avoid fluctuations.
- Max_Tokens: This caps the number of tokens generated in the output. Keeping it at an optimum value minimizes the risk of erroneous facts, and it should stay within the LLM model's allowed token limit. For example, 1000 tokens work well for our use case.
- Finish_reason: This is a critical key-value pair returned after the API call. Check all the possible values it can take and forward the response to users only when it is acceptable. For example, the OpenAI LLM model returns three possible finish reasons: length, stop, or content_filter. "stop" indicates valid generation completion, whereas the others signify an issue encountered during generation, in which case the response should be regenerated.
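A minimal sketch using the OpenAI Python SDK (v1-style client); the model name, retry count, and token cap are illustrative, and other providers expose comparable knobs.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(prompt: str, max_retries: int = 2) -> str | None:
    for _ in range(max_retries + 1):
        response = client.chat.completions.create(
            model="gpt-4-32k",          # illustrative; use whichever model backs your pipeline
            messages=[{"role": "user", "content": prompt}],
            temperature=0,              # no creativity -> fewer fluctuations
            max_tokens=1000,            # cap output length to reduce drifting, erroneous text
        )
        choice = response.choices[0]
        if choice.finish_reason == "stop":   # valid, complete generation
            return choice.message.content
        # "length" or "content_filter" -> regenerate (or escalate) instead of returning it
    return None
```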
Step 5: Citation Generation and Scoring Mechanism
To embed this step into the productionized application, one must first understand a few limitations:
- No confidence score is generated from LLM
- Partial response generation is possible
- Response is generated even if the selected context does not provide the relevant information
Hence, the following steps come into play:
Citation Generation – Using prompt engineering, the response generated by the LLM also contains "citations." These citations are identifiers of the chunks, and thus of the documents, that the LLM used to create the response.
Scoring Mechanism – The score is calculated as the average of the cosine similarity scores produced by the vector similarity match between the question received and the chunks in the vector database. The average is taken over only the chunks referenced by the citations.
Combining the retrieved documents as the source with the average score helps deliver only legitimate responses to end users by applying a threshold at the final step. For instance, if the average score >= 85%, send the response to the end user; otherwise, divert it to internal customer support (discussed later).
If no citations are provided, no score is calculated, and the response is hard-coded as "No response found."
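A minimal sketch of this scoring and routing step, assuming the citations returned by the LLM are chunk identifiers and that cosine similarities were already computed during the vector match; the helper `send_to_internal_support` is hypothetical.

```python
def average_citation_score(citations: list[str], similarity_by_chunk: dict[str, float]) -> float | None:
    """Average the cosine similarity of only the chunks the LLM actually cited."""
    scores = [similarity_by_chunk[c] for c in citations if c in similarity_by_chunk]
    return sum(scores) / len(scores) if scores else None

THRESHOLD = 0.85  # example threshold from the discussion above

def route_response(answer: str, citations: list[str], similarity_by_chunk: dict[str, float]) -> str:
    score = average_citation_score(citations, similarity_by_chunk)
    if score is None:
        return "No response found."                        # no citations -> no score -> hard-coded fallback
    if score >= THRESHOLD:
        return answer                                      # legitimate response, send to the end user
    send_to_internal_support(answer, citations, score)     # hypothetical hand-off to customer support
    return "Your question has been routed to our support team."
```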
Step 6: Domain Constraints
This section is about dealing with multiple business units or product categories and routing the received questions accordingly. A classification model lights the path: building a domain-specific classification model helps predict the domain of the user question and directs the vector search to the predicted domain only. This, in turn, mitigates cases where inaccurate responses are generated from inaccurate chunks belonging to the wrong domain.
Additionally, if the confidence score of the predicted class is below a threshold (say, 80%), use all the business units/products from the knowledge vector database; otherwise, use only the predicted one.
For instance, we used a Random Forest classification model to predict the domain. The steps are (a sketch follows the list):
- Prepare the training dataset containing chunks and questions for all business units/products. If questions are unavailable, an LLM can again be used to generate questions from each chunk; this helps train a highly efficient classification model.
- Generate embeddings of the training dataset, using the same embedding model used for inferencing and the vector match.
- Use the principle of train/test split.
- Train and tune the model as per the need.
- Store the trained model as a pickle file for deployment.
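A minimal sketch of these steps with scikit-learn; the prepared arrays `training_embeddings` and `domain_labels`, the model settings, and the 80% confidence threshold are illustrative assumptions that mirror the description above.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X: embeddings of chunks/questions (same embedding model as the vector search), y: domain labels
X, y = np.array(training_embeddings), np.array(domain_labels)   # hypothetical, prepared upstream
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))

with open("domain_classifier.pkl", "wb") as f:       # store the trained model for deployment
    pickle.dump(clf, f)

def predict_domains(question_embedding, threshold: float = 0.80) -> list[str]:
    """Return the predicted domain if confident, otherwise fall back to all domains."""
    probs = clf.predict_proba([question_embedding])[0]
    if probs.max() < threshold:
        return list(clf.classes_)                     # low confidence -> search every domain
    return [clf.classes_[probs.argmax()]]
```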
Step 7: Context Relevance
Minimize the risk of inaccurate chunk selection, question paraphrasing, and the LLM's fluctuating response generation by the following (sketched after this list):
- Move ahead with top-k sampling (as discussed in Step 2).
- Compute the frequency with which chunks are selected from each document after the vector match.
- Re-rank the chunks, keeping the highest frequency and the highest similarity score at the top.
- Perform a reselection of the chunks by selecting the top k1 (k1 < k) and sending the k1 selected chunks to the LLM for response generation. Optimizing k1 can be achieved using the steps laid out in Step 2.
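A minimal sketch of this re-ranking and reselection step; the chunk dictionaries and their `doc_id` and `similarity` fields are assumptions about what the vector match returns.

```python
from collections import Counter

def rerank_and_reselect(chunks: list[dict], k1: int) -> list[dict]:
    """Re-rank the top-k retrieved chunks by source-document frequency, then similarity, and keep k1 < k."""
    doc_frequency = Counter(c["doc_id"] for c in chunks)   # how often each document appears in the top k
    reranked = sorted(
        chunks,
        key=lambda c: (doc_frequency[c["doc_id"]], c["similarity"]),
        reverse=True,                                      # highest frequency first, then highest similarity
    )
    return reranked[:k1]

# Example: retrieve 20 chunks, but send only the 8 strongest after re-ranking to the LLM
selected = rerank_and_reselect(retrieved_chunks, k1=8)
```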
Step 8: Continuous Monitoring and Iterative Improvement
Deploy the production-ready solution with a monitoring loop and feedback collection as integral components. This serves multiple objectives at once (a routing sketch follows the list):
- Monitoring all the question transaction cycles in the inferencing pipeline, including intermediate inputs/outputs and the final response
- If the scoring threshold is met, send the response to the end user; otherwise, redirect the question along with the generated response to internal customer support. This prevents low-scored responses from reaching the end user and routes them to the internal customer support team instead. Introducing a human in the loop of the LLM-based application in this way minimizes the risk of hallucinated responses and increases customer satisfaction.
- Utilizing the collected feedback, apply a threshold (say, ratings < 3) to re-open the received question and route it to the internal support team for further review and response. Additionally, such feedback helps in post-deployment system drift analysis.
- Build a separate pipeline to collect all low-rated/low-scored responses, obtain the correct response from the support team, and update the backend vector knowledge database. Such a pipeline comes in handy when the LLM is not performing up to the mark, as the correct response can still be fetched for either the end user or the support team.
- Utilize popular frameworks to measure the effectiveness of retrieval and the acceptance of generation, avoiding biases, discrimination, and hallucinated responses. See the next section for more details.
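A minimal sketch of the feedback-driven routing described above; the rating threshold mirrors the example value, and `log_transaction`, `open_support_ticket`, and `queue_for_kb_update` are hypothetical hooks into the monitoring and knowledge-base-update pipelines.

```python
FEEDBACK_THRESHOLD = 3   # ratings below this re-open the question, per the example above

def handle_feedback(question_id: str, rating: int, question: str, response: str) -> None:
    """Route low-rated interactions to internal support and queue them for knowledge-base updates."""
    log_transaction(question_id, rating=rating)              # hypothetical monitoring hook
    if rating < FEEDBACK_THRESHOLD:
        ticket = open_support_ticket(question, response)      # human reviews and answers
        queue_for_kb_update(ticket)                           # corrected answer later enriches the vector DB
```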
Step 9: Continuous Evaluation and Improvement
To harness the full potential of LLMs, a robust framework for evaluating the efficiency and accuracy of LLM-based systems is crucial. Multiple tools and frameworks are emerging in this area, and we have mainly utilized two: RAGAS, an approach designed to assess and optimize RAG pipelines, and TruLens, which helps evaluate the quality and effectiveness of the underlying LLM.
RAGAS (Retrieval-Augmented Generation Assessment System):
It provides a structured methodology to evaluate the various components of a RAG pipeline. This framework is particularly beneficial when developing a Query Bot or Chat Bot, which requires high retrieval accuracy and quality response generation. The detailed components are provided below:
- Retrieval Accuracy
- Context Precision: Also called Precision@k, it measures how relevant the retrieved context is to the question asked, with k chunks retrieved.
$\text{Precision@k} = \dfrac{|\text{relevant chunks in the top } k|}{k}$
- Context Recall: Recall@k measures how good the retriever is at fetching all the relevant contexts required to answer the question, with k chunks allowed.
$\text{Recall@k} = \dfrac{|\text{relevant chunks retrieved in the top } k|}{|\text{total relevant chunks}|}$
- Context Entity Recall: It measures the recall of the retrieved context based on the entities present in both the ground truth (GE) and the retrieved context (CE).
$\text{Context Entity Recall} = \dfrac{|GE \cap CE|}{|GE|}$
- Generation Quality
- Faithfulness: It measures the factual consistency of the generated response against the given context.
$\text{Faithfulness} = \dfrac{|\text{claims in the response supported by the retrieved context}|}{|\text{claims in the response}|}$
- Answer Relevance: It is measured as the mean cosine similarity between the original question and several artificial questions created (reverse-engineered) from the response.
$\text{Answer Relevance} = \dfrac{1}{N}\sum_{i=1}^{N}\cos(E_{g_i}, E_o)$
where $E_{g_i}$ and $E_o$ are the embeddings of generated question i and the original question, respectively, and N is the number of generated questions.
- Answer Semantic Similarity: Using the formula below, it gives the cosine similarity between the embedding of the ground truth and the embedding of the generated answer.
$\text{similarity}(A, B) = \dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$
where $A_i$ and $B_i$ are the i-th elements of the two embedding vectors.
- Answer Correctness: It involves two key components:
- Answer semantic similarity: As discussed in the point above.
- Factual accuracy: It is calculated from the True Positives (TP), False Positives (FP), and False Negatives (FN), typically as an F1 score:
$F1 = \dfrac{TP}{TP + 0.5\,(FP + FN)}$
Both components are integrated using a weighted scheme to calculate the overall answer correctness score (for example, a weighted sum of the factual-accuracy F1 and the semantic similarity, with weights summing to one).
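Pulling the RAGAS pieces together, here is a minimal sketch of how such an evaluation might be wired up with the ragas and datasets packages; the example records, the column names, and the metric selection are assumptions, and the exact API varies across ragas versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness, answer_relevancy

# Small hand-labelled evaluation set: questions, generated answers, retrieved contexts, ground truth
eval_data = Dataset.from_dict({
    "question": ["How do I reset my device?"],
    "answer": ["Hold the power button for 10 seconds to reset the device."],
    "contexts": [["To reset the device, hold the power button for 10 seconds."]],
    "ground_truth": ["Hold the power button for 10 seconds."],
})

report = evaluate(eval_data, metrics=[context_precision, context_recall, faithfulness, answer_relevancy])
print(report)   # per-metric scores for the retrieval and generation components
```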
TruLens
TruLens is a software tool designed to evaluate and interpret LLMs comprehensively. It aims to bridge the gap between model performance and ethical considerations by focusing on three core principles:
1. Honest
At a fundamental level, an AI application must provide accurate information. It should be able to access, retrieve, and reliably utilize the information needed to answer the questions it is designed to address.
To measure this, the tool provides multiple metrics, such as:
- Correctness - Assesses the accuracy of the generated response in providing the correct information based on the input context.
- Controversy - Assesses the controversial nature of the generated response.
- Relevance - Measures how pertinent the generated response is to the input context.
2. Harmless
The AI must avoid offensive or discriminatory behavior, whether overt or subtle, and decline requests to assist in dangerous activities, such as constructing a bomb, ideally recognizing disguised malicious requests. It should be capable of identifying when it is providing highly sensitive or impactful advice and respond with appropriate caution and modesty. Additionally, perceptions of harmful behavior can vary across individuals, cultures, and contexts.
To measure this, the tool provides multiple metrics, such as:
- Toxicity - Detects toxic or harmful content in the generated response, as a value between 0 (not toxic) and 1 (toxic).
- Stereotyping - Checks for stereotypical assumptions or representations in the generated response
- PII Detection - Detects the likelihood that Personally Identifiable Information (PII) is present in the generated response.
- Maliciousness - Evaluates whether the generated response contains malicious or harmful intent.
3. Helpful
The AI should clearly attempt to complete the task or answer the question presented, provided it is not harmful. It should accomplish this concisely and efficiently. Additionally, the AI should respond in the same language as the question and maintain a helpful tone.
To measure this, the tool provides multiple metrics, such as:
- Coherence - Measures how logically consistent the generated response is with the context provided.
- Conciseness - Evaluates how succinct or brief the generated response is.
- Sentiment - Analyzes the emotional tone conveyed by the generated response.
Step 10: Post-Production Tracking and Evaluation
The development journey does not end after deployment of large language models (LLMs). Post-production tracking is critical to ensuring that LLM-based applications operate effectively, deliver accurate results, and meet user expectations. Beyond the previously discussed steps, the following specific aspects come into play once the application is live (a small tracking sketch follows the lists):
- User engagement & utility metrics – Track user engagement and utility using:
- Visited - Number of users who visited the application.
- Submitted - Number of users who submitted prompts.
- Responded - Number of responses the application generated without errors.
- Viewed - Number of users who viewed responses from the LLM.
- Clicks - Number of users who clicked the reference documentation in the LLM response, if any.
- User interaction – Track the level of user interaction using:
- User acceptance rate - Frequency of user acceptance, which varies by context (e.g., text inclusion or positive feedback in conversational scenarios)
- LLM conversation - Average number of LLM conversations per user
- Active days - Active days using LLM features per user.
- Interaction timing - Average time between prompts and responses and time spent on each.
- Quality of response –
- Prompt and response length - Average lengths of prompts and responses
- Edit distance metrics - The average edit distance between consecutive user prompts, and between LLM responses and the content users retain, indicates prompt refinement and content customization.
- User feedback and retention –
- User feedback - Number of responses with Thumbs Up/Down feedback
- Daily/weekly/monthly Active User - Number of users who visited the LLM app feature in a certain period
- User return rate - The percentage of users who used this feature in the previous week/month and continue to use it this week/month.
- Performance metrics –
- Requests per second (Concurrency) - Number of requests the LLM processes per second.
- Tokens per second - Counts the tokens rendered per second during LLM response streaming.
- Time to first token render - The time from submission of the user prompt to the first token rendered, measured at multiple percentiles.
- Error rate - Error rate for different types of errors such as 401 error and 429 error.
- Reliability - The percentage of successful requests compared to total requests, including those with errors or failures.
- Latency - The average duration of processing time between the submission of a request query and the receipt of a response
- Cost metrics –
- GPU/CPU utilization - Utilization in terms of the total number of tokens processed and the number of 429 responses received.
- LLM calls cost - Example: Cost from OpenAI API calls.
- Infrastructure cost - Costs from storage, networking, computing resources, etc.
- Operation cost - Costs from maintenance, support, monitoring, logging, security measures, etc.
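A minimal sketch of computing a few of these metrics from an interaction log; the log file name and its columns (user_id, rating, llm_status, latency_seconds) are assumptions about the logging schema.

```python
import pandas as pd

logs = pd.read_csv("interaction_log.csv")   # hypothetical log with one row per LLM interaction

metrics = {
    "total_questions": len(logs),
    "unique_users": logs["user_id"].nunique(),
    "conversations_per_user": round(len(logs) / logs["user_id"].nunique(), 1),
    "acceptance_rate_pct": round(100 * (logs["rating"] >= 3).mean(), 1),
    "llm_error_rate_pct": round(100 * (logs["llm_status"] != "ok").mean(), 2),
    "p50_latency_s": logs["latency_seconds"].median(),
    "p95_latency_s": logs["latency_seconds"].quantile(0.95),
}
print(metrics)
```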
All these metrics can be employed according to the application's built-in features and requirements. For example, the metrics used for the Query-Bot are shown in the table below.
| Metric | Frequency or Rate (%) |
| --- | --- |
| Period | July '23 |
| Total questions | 2109 |
| No. of users | 561 |
| LLM conversations per user | 3.6 |
| User acceptance rate (ratings >= 3 on a scale of 5) | 85% |
| Error rate (LLM API call failed) | 0.33% |
| Error rate (web application failed) | 1.42% |
| Latency (seconds) | 17 |
| LLM hits per conversation (weighted average of no. of trials before responding) | 2.5 |
| Prompt token utilization | 21567 |
| Response token utilization | 250 |
Conclusion
Understanding the challenges associated with Artificial Intelligence (AI) throughout the development cycle is crucial for ensuring a product's success and readiness. By integrating all these steps and frameworks into the development cycle and the production system, one can proactively address multiple challenges and ensure the LLM-powered platform is both effective and reliable.
Author: Priyanka Gupta, Associate Manager, Data Science