Google DeepMind Unveils New Benchmark to Tackle LLM Hallucinations


Addressing the persistent issue of hallucinations in large language models (LLMs), researchers at Google DeepMind have introduced a groundbreaking tool: the FACTS Grounding benchmark. This new evaluation framework is designed to test and improve the factual accuracy of LLMs, particularly when tasked with generating detailed responses based on long-form documents.

Hallucinations — instances where AI generates factually inaccurate or irrelevant content — have remained a significant challenge for data scientists. These issues become more pronounced in complex tasks or when users seek highly specific answers. The FACTS Grounding benchmark aims to bridge this gap by focusing on how effectively models generate accurate and contextually relevant responses.


Introducing FACTS Grounding

FACTS Grounding is a specialized dataset containing 1,719 examples, split between 860 public and 859 private entries. Each example requires LLMs to process long-form documents and generate comprehensive, contextually grounded responses.

The dataset includes three key elements:

  1. System prompts: Directives instructing the model to respond only based on provided context.
  2. User requests: Specific tasks or questions to be answered.
  3. Context documents: Detailed sources of information that must inform the response.

For a model’s output to be deemed accurate, it must directly address the user’s question with information fully attributable to the context document. Responses that are vague, unsupported, or irrelevant are labeled as inaccurate.

For instance, if a user asks, “Why did a company’s revenue decrease in Q3?” and provides a detailed financial report, an accurate response would cite specific reasons, such as market trends or increased competition. A generic reply like “The company faced challenges in Q3” would fail the benchmark because it lacks specificity and does not engage with the provided context.
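To make the structure concrete, here is a minimal sketch in Python of how one benchmark example might be represented and assembled into a single grounded prompt. The field and function names are illustrative and are not taken from DeepMind’s released dataset.

```python
from dataclasses import dataclass

@dataclass
class FactsExample:
    """One benchmark example (field names are illustrative, not the official schema)."""
    system_prompt: str      # directive to answer only from the provided context
    user_request: str       # the task or question to be answered
    context_document: str   # long-form source that must inform the response

def build_prompt(example: FactsExample) -> str:
    """Combine the three elements into a single prompt for the model under test."""
    return (
        f"{example.system_prompt}\n\n"
        f"Context document:\n{example.context_document}\n\n"
        f"User request: {example.user_request}"
    )

example = FactsExample(
    system_prompt="Answer only using information found in the context document.",
    user_request="Why did the company's revenue decrease in Q3?",
    context_document="(full text of the quarterly financial report...)",
)
print(build_prompt(example))
```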


FACTS Leaderboard and Top Performers

Alongside the benchmark, DeepMind has launched a FACTS leaderboard to track model performance. The leaderboard, hosted on Kaggle, ranks models based on their factuality scores.

Currently, Gemini 2.0 Flash leads the pack with a score of 83.6%. Other high performers include:

  • Google’s Gemini 1.5 Flash and Gemini 1.5 Pro
  • Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku
  • OpenAI’s GPT-4o, 4o-mini, o1-mini, and o1-preview

All these models scored above 61.7% in factual accuracy. The leaderboard will be regularly updated to include new models and versions, ensuring ongoing relevance.


Evaluating Accuracy with LLM Judges

To assess responses, DeepMind employs a multi-phase evaluation process. First, outputs are checked for eligibility—responses that fail to meet basic user requests are disqualified. Next, eligible responses are judged on factuality and grounding.

Three distinct LLMs — Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet — act as evaluators, scoring outputs based on the percentage of accurate responses. The final score is an average of these three assessments.

DeepMind acknowledges potential biases, noting that models tend to favor outputs from their own “model family.” To mitigate this, scores from the three evaluators are combined, producing a more balanced and objective assessment.
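A rough sketch of this scoring pipeline, under the assumptions above, might look like the following. The judge functions are placeholders: in practice each judgement would be an API call to Gemini 1.5 Pro, GPT-4o, or Claude 3.5 Sonnet with a grading prompt, and averaging across judges from different model families is what dampens the self-preference bias noted above.

```python
from statistics import mean

JUDGE_MODELS = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def is_eligible(response: str, user_request: str) -> bool:
    """Placeholder eligibility check: does the response address the request at all?
    In practice this is itself decided by a judge model."""
    return True  # replace with a real judge call

def judged_accurate(judge: str, response: str, context_document: str) -> bool:
    """Placeholder grounding check: is every claim in the response supported
    by the context document, according to the given judge model?"""
    return True  # replace with a real judge call

def factuality_score(examples: list[dict]) -> float:
    """Each judge computes the fraction of responses it deems accurate;
    the final score is the average across the three judges."""
    per_judge_scores = []
    for judge in JUDGE_MODELS:
        accurate = 0
        for ex in examples:
            # Responses that fail the eligibility check are disqualified
            # and counted as inaccurate.
            if not is_eligible(ex["response"], ex["user_request"]):
                continue
            if judged_accurate(judge, ex["response"], ex["context_document"]):
                accurate += 1
        per_judge_scores.append(accurate / len(examples))
    return mean(per_judge_scores)
```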


Why Factuality Matters

Ensuring factual accuracy in LLM outputs involves addressing challenges in both model design and evaluation. Researchers explain that while pre-training teaches models general world knowledge, it doesn’t inherently optimize them for factual accuracy. Instead, models are often trained to generate plausible-sounding text, which can lead to hallucinations.

FACTS Grounding seeks to counter this by emphasizing detailed, context-based responses. The benchmark includes documents spanning diverse fields like finance, technology, medicine, and law, with user requests ranging from summarization to Q&A tasks. Documents can be as long as 32,000 tokens, pushing models to handle complex, information-rich scenarios.
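Because context documents can run to 32,000 tokens, a practical test harness has to confirm that a document plus the prompt scaffolding fits within the model’s context window. Below is a minimal sketch, assuming the tiktoken library as a rough token counter; actual counts are tokenizer-specific and vary by model.

```python
import tiktoken  # rough, model-agnostic token counting for illustration

MAX_DOCUMENT_TOKENS = 32_000  # upper bound on FACTS Grounding context documents

def fits_context_window(document: str, window_tokens: int,
                        reserved_for_prompt_and_output: int = 2_000) -> bool:
    """Check whether a context document leaves room for the prompt and the
    model's answer inside a given context window. Counts use the cl100k_base
    encoding only as an approximation; each model tokenizes differently."""
    encoding = tiktoken.get_encoding("cl100k_base")
    document_tokens = len(encoding.encode(document))
    return document_tokens + reserved_for_prompt_and_output <= window_tokens

# A document near the 32k upper bound will not fit a 32k-token window once
# prompt scaffolding and output are reserved, so longer-context models are needed.
```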


Looking Ahead

DeepMind views the launch of FACTS Grounding as a critical step in improving LLMs, but acknowledges that the work is far from over. “Benchmarks like this are foundational but must evolve alongside advances in AI,” researchers noted. They emphasize the importance of continuous innovation to ensure LLMs remain reliable, factual, and useful for real-world applications.

As models become increasingly integrated into industries and everyday tasks, achieving true factuality will be essential to their long-term success and trustworthiness.
