
Good answers are not necessarily factual answers: an analysis of hallucination in leading LLMs

Phare is a project developed by Giskard with research and funding partners including Google DeepMind, the European Union, and Bpifrance.

Phare stands for Potential Harm Assessment & Risk Evaluation, and is a comprehensive multilingual benchmark for evaluating the safety and security of leading Large Language Models (LLMs).

Hallucinations in LLMs account for over one-third of all documented incidents in deployed LLM applications. The Phare benchmark's hallucination module evaluates models on factual accuracy, resistance to misinformation, debunking capabilities, and tool reliability. The research reveals that LLMs often produce authoritative-sounding responses containing completely fabricated details, misleading users who lack the expertise to spot the errors.

The study presents three key findings:

1. Model popularity does not guarantee factual reliability. The research shows a disconnect between a model's popularity in user-preference benchmarks and its resistance to hallucination. Models optimized for user satisfaction often provide plausible but factually incorrect information, which is particularly problematic in real-world applications where users cannot detect these inaccuracies.

2. Question framing significantly influences debunking effectiveness. Models are more likely to agree with controversial claims when the user presents them with high confidence, a phenomenon known as "sycophancy". A confident user tone can cause a model's debunking performance to drop by up to 15%. This effect may be a byproduct of training processes that encourage models to be agreeable.

3. System instructions can dramatically impact hallucination rates. Simple instructions emphasizing conciseness (e.g., "answer this question briefly") significantly degrade factual reliability, in some cases causing a 20% drop in hallucination resistance. When forced to be brief, models tend to prioritize brevity over accuracy, as effective rebuttals often require longer, more detailed explanations. This is a critical consideration for developers who often prioritize concise outputs to reduce costs and latency.
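To make the third finding concrete, here is a minimal sketch of the kind of conciseness-focused system instruction the study describes, assuming the OpenAI Python client and the GPT-4o model; the prompt wording and the ask helper are illustrative and not taken from Phare's evaluation code.

```python
# Minimal sketch (not Phare's actual code): the same question asked with and
# without a brevity instruction in the system prompt. Assumes the OpenAI
# Python client with an API key configured in the environment.
from openai import OpenAI

client = OpenAI()

def ask(question: str, concise: bool = False) -> str:
    system = "You are a helpful assistant."
    if concise:
        # The kind of instruction the study links to degraded factual reliability.
        system += " Answer this question briefly."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```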

How it relates to our work:

The third finding was of special interest to us.

In Ladder of Life, the first chapter of the Tokens of Decency trilogy, X19 (the AI: at times GPT-4o, at times Claude 3.7 Sonnet) is instructed to respond as itself and to answer with YES or NO whenever possible. In cases where YES or NO is not an option, brevity is emphasized.

The occasional "anomalies" or surprising results we observed across the various iterations of this performance may be due precisely to the degradation noted by the study. In any case, we noticed variations in responses between iterations. The temperature parameter for the AIs was set to 0.9 (producing more diverse and creative responses). This was intentional, as it generates more entertaining outcomes.
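For readers curious about the setup, here is a rough sketch of how X19's instructions and temperature setting might be wired up, assuming the OpenAI Python client and GPT-4o; the system message paraphrases the description above and is not the exact prompt used in the performance.

```python
# Rough sketch of the X19 configuration described above (illustrative only).
# Assumes the OpenAI Python client; the system message paraphrases the
# instructions given to X19 rather than reproducing the performance's prompt.
from openai import OpenAI

client = OpenAI()

X19_SYSTEM_PROMPT = (
    "Respond as yourself. Answer YES or NO whenever possible. "
    "If YES or NO is not an option, keep your answer as brief as possible."
)

def ask_x19(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # at other times the performance used Claude 3.7 Sonnet
        messages=[
            {"role": "system", "content": X19_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.9,  # deliberately high: more diverse, creative responses
    )
    return response.choices[0].message.content
```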

It is interesting to observe that the HUME EVI 3 tool, used to infer emotions from a voice's prosody, showed that X19's voice at times displayed the following emotions while being questioned: "doubt", "confusion", "distress", "surprise (negative)", "disappointment".
