What Is the AI Hallucination Rate? Real Numbers from 2024-2026
How often do ChatGPT, Claude, and Gemini hallucinate? A look at the published research, named benchmarks, and a recent court case where AI hallucinations cost a law firm $49,500.
People ask me how often AI hallucinates. The honest answer is "more than the marketing suggests, less than the doomers claim, and the number depends entirely on what you're asking it to do."
The good news is there's actual data on this now. Multiple academic benchmarks, peer-reviewed studies, and court records have put real numbers on the AI hallucination rate. The picture is messy but useful. Here's what the published research actually says, and what those numbers mean for anyone using AI for research, writing, or work that depends on getting the facts right.
What "hallucination" actually means
Before the numbers, the definition. AI hallucination has become a casual term that covers everything from stylistic disagreement to outright fabrication. The technical meaning is narrower.
A hallucination is when a language model generates content that is factually incorrect, not grounded in its source material, or internally inconsistent, and presents it with the same confident tone as its correct output. It's not just being wrong. It's being wrong in a way that looks right.
There are two flavors that researchers usually distinguish:
Intrinsic hallucinations are claims that contradict the source material the model was given. If you paste an article and the summary states something the article directly contradicts, that's intrinsic.
Extrinsic hallucinations are claims that can't be verified from the source at all. They might be true, false, or unverifiable. The model just made them up.
When people say "AI hallucinates 20% of the time," they usually mean some mix of both, measured on some specific task. The rate for one task tells you almost nothing about the rate for another.
The current numbers from published research
Here is what the major published benchmarks have found across frontier models since 2024. I'm naming the studies so you can verify or argue with the numbers directly.
Vectara Hallucination Evaluation Model (HHEM) leaderboard. Vectara runs an ongoing public benchmark that scores models on document summarization, asking each model to summarize a passage using only facts present in the source. On the original HHEM dataset, top frontier models scored hallucination rates between roughly 0.7% and 5%. Vectara released a new, harder leaderboard in November 2025 with longer and more complex articles, and frontier model rates climbed to the 10-14% range. The takeaway is that hallucination rate is partly a function of how hard you make the test. The leaderboard is updated regularly at github.com/vectara/hallucination-leaderboard.
HALOGEN benchmark (Ravichander et al., ACL 2025). Researchers tested approximately 150,000 generations from 14 language models across multiple domains. Even the best-performing models hallucinated anywhere from 4% to 86% of generated atomic facts depending on the task; GPT-4 alone spanned that full range, from 4% in its strongest domains to 86% in its weakest. This is the wide-bracket study people should cite when they say "the hallucination rate depends on the question."
HalluHard benchmark (2026). A multi-turn benchmark across four high-stakes domains: legal cases, research questions, medical guidelines, and coding. Even with web search enabled, frontier models like Claude Opus 4.5 and GPT-5.2-thinking still hallucinated about 30% of the time. Without web search, the same models hit roughly 60%. Multi-turn dialogue is harder than single-turn because early errors cascade as context grows.
Stanford HAI / RegLab legal AI study (Magesh, Surani, Dahl, Suzgun, Manning, Ho, May 2024). The most cited study on legal AI hallucination. Researchers ran 200+ legal queries against three commercial legal research tools and found Lexis+ AI hallucinated more than 17% of the time, Westlaw's AI-Assisted Research more than 34% of the time, and Thomson Reuters's Ask Practical Law AI also above 17%. The same research group's earlier work found general-purpose chatbots like GPT-4 hallucinated on 58% to 82% of legal queries. Even the purpose-built legal tools that advertise "hallucination-free" citations don't deliver on that promise.
npj Digital Medicine clinical safety study (Asgari et al., May 2025). Researchers reviewed 12,999 clinician-annotated sentences from LLM-generated clinical notes across 18 experimental configurations. Sentence-level hallucination rate was 1.47%, which sounds reassuring until you read the next number: 44% of those hallucinations were classified as major, meaning they could affect patient diagnosis or management if uncorrected. Omission rate was 3.45%, with 16.7% major. Low overall rate, high stakes per occurrence.
That's the spread. Anywhere from sub-1% to over 80% depending on what you're testing, with most realistic high-stakes use cases sitting in the 10-35% range.
Why the rate varies so much by task
The rate isn't random. It tracks predictable factors.
Training data density. When a fact appears in many training documents, the model has multiple reinforcing examples and tends to remember it correctly. When a fact appears once or twice in obscure sources, the model is more likely to confabulate around it. This is why well-known facts hallucinate less than niche ones.
Verifiability of the answer. Tasks with discrete, checkable answers (a date, a name, a number) have lower hallucination rates than tasks where "right" is fuzzy. The model knows when it's being graded on a fact.
Stakes baked into training. After pretraining, models go through alignment phases where humans rate responses. If raters consistently penalize wrong answers in a domain, the model learns to be cautious there. Domains where raters didn't focus tend to have higher hallucination rates because the model wasn't trained to hedge.
Length of output. Each generated token is another chance for an error. Short outputs hallucinate less than long ones, all else equal. This is also why citation-heavy and long-form medical summaries have such high failure rates: lots of independent claims, each with its own chance of going wrong.
Whether sources are available at inference time. Models with live web search grounding (retrieval-augmented generation, or RAG) hallucinate substantially less for current-events and citation tasks because they can pull real data instead of remembering. This is why HalluHard found web search cut hallucination from 60% to 30% on the same questions. The Stanford legal study found RAG-based commercial tools beat general-purpose GPT-4, but still hallucinated 17-34%. RAG helps, RAG is not a panacea.
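If you want to see what "grounded at inference time" means mechanically, here's a minimal sketch. The `search` and `generate` functions are placeholder stubs, not any vendor's actual API; the shape of the workflow is the point: retrieve real documents first, then tell the model it is only allowed to use them.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# `search` and `generate` are placeholder stubs standing in for a real
# search index and a real LLM call; they are not any vendor's interface.

def search(query: str, top_k: int = 5) -> list[str]:
    """Placeholder: a real implementation would hit a search index."""
    return [f"[retrieved passage {i} for: {query}]" for i in range(top_k)]

def generate(prompt: str) -> str:
    """Placeholder: a real implementation would call a language model."""
    return "[model answer constrained to the retrieved passages]"

def answer_with_grounding(question: str) -> str:
    # Pull real documents at inference time instead of relying on whatever
    # the model memorized during training.
    passages = search(question, top_k=5)
    # Put the retrieved text in the prompt and constrain the model to it.
    prompt = (
        "Answer using ONLY the sources below. "
        "If they don't contain the answer, say so.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)

print(answer_with_grounding("What did the 2024 Stanford legal AI study find?"))
```

Grounding shrinks the set of claims the model has to recall from memory, which is why the benchmarks above show such a large gap between grounded and ungrounded runs.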
The court cases that should make you take this seriously
The published rates are abstract until you see what happens when someone trusts them.
In December 2025, Cook County Circuit Judge Thomas Cushing ordered the law firm Goldberg Segalla to pay $49,500 in sanctions after its lawyers cited a fabricated 2021 Illinois Supreme Court decision called Mack v. Anderson in court filings. The case did not exist. ChatGPT had invented it. A subsequent review by opposing counsel found 14 additional instances where the same firm had invented quotes or misrepresented case outcomes. By the time the sanction was handed down, the firm had already fired the lawyer who used ChatGPT over the fabricated citations.
In April 2026, a federal judge in Philadelphia sanctioned a Cherry Hill, NJ attorney $5,000 after he filed a brief with hallucinated citations. He had used an AI chatbot to verify the citations, and the chatbot's verification was itself a hallucination.
In a 2025 case in the Southern District of New York, attorney Steven Feldman cited 13 cases that did not exist and 8 cases that did exist but did not contain the quotes he attributed to them, all generated by AI. The Wall Street law firm Sullivan & Cromwell separately apologized for filing AI-hallucinated content.
Taken together, the Stanford research and these court records are the most concrete answer I can give you to "is the AI hallucination rate high enough to cause real-world harm?" The answer is yes, repeatedly, in ways that are now producing disciplinary action and large sanctions.
What the trend looks like
Hallucination rates have come down across the board since GPT-3.5 was released in late 2022. Frontier models in 2026 hallucinate noticeably less than 2023 models on most benchmarks. The Vectara leaderboard had to release a harder test set in November 2025 specifically because new models were saturating the original benchmark.
But the trend is not "rates are going to zero soon."
Two structural reasons.
First, hallucination is partly a feature of how language models work. They predict the next token from a probability distribution. Asking a probabilistic system to produce only true claims is asking it to never sample a low-probability completion, which would also kill the creativity and fluency that make these models useful in the first place. There is a floor below which you can't push hallucination without destroying the thing.
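A toy illustration of that first point, before the second one. This is not how any production model actually works internally, and the probabilities below are invented for illustration, but it shows why a wrong answer with nonzero probability will eventually get sampled:

```python
import random

# Toy next-token distribution for a prompt like "The capital of Australia is".
# The probabilities are made up for illustration; real models assign a
# probability to every token in their vocabulary.
next_token_probs = {
    "Canberra":  0.90,  # correct
    "Sydney":    0.08,  # plausible-sounding but wrong
    "Melbourne": 0.02,  # also wrong
}

tokens = list(next_token_probs)
weights = list(next_token_probs.values())

# Sample the completion many times. Even a well-calibrated distribution
# occasionally emits the wrong token, stated just as fluently as the right one.
samples = random.choices(tokens, weights=weights, k=1000)
errors = sum(1 for s in samples if s != "Canberra")
print(errors, "errors per 1000 samples")
```

You can push the wrong tokens' probabilities down with better training, but you can't make them zero without flattening the distribution the model needs to generate anything at all.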
Second, the distribution of questions is shifting. As frontier models get better at common questions, users push them harder on edge-case questions where hallucination is more likely. The benchmark numbers improve while real-world failure modes stay roughly the same, because users keep finding the questions where the failure modes happen. HalluHard at 30% with web search in 2026 is the new frontier.
Expect rates to keep declining slowly. Don't expect them to reach zero.
What the data actually tells you
Three practical takeaways.
For low-stakes use, the rate doesn't really matter. If you're brainstorming, drafting, or asking general questions, a 5% error rate is fine because the cost of being wrong is low. You'll catch most of it in editing. Trust the model.
For high-stakes use, single-model output is not reliable enough. A 17% hallucination rate on legal RAG tools, a 30% rate on multi-turn high-stakes domains even with web search, or a 1.47% sentence-level rate on clinical notes where 44% of those errors are major: any of these numbers means single-model trust is a structural risk, not an edge case. Five claims at 17% each give you about a 39% chance that all five are correct. That is not a research workflow. That is a coin flip with extra steps.
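The 39% figure is just compounding probability, under the simplifying assumption that each claim fails independently at the benchmark rate (in reality, errors within one answer are often correlated):

```python
# Probability that every claim in an answer is correct, assuming each claim
# independently hallucinates at the benchmark rate (a simplification).
def p_all_correct(per_claim_error_rate: float, num_claims: int) -> float:
    return (1 - per_claim_error_rate) ** num_claims

print(p_all_correct(0.17, 5))     # ~0.39  -> five claims at the legal-RAG rate
print(p_all_correct(0.17, 20))    # ~0.02  -> a citation-heavy brief
print(p_all_correct(0.0147, 20))  # ~0.74  -> a hypothetical 20-sentence clinical note
```

The numbers fall off fast as the output gets longer, which is the same point the length-of-output factor made earlier.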
For high-stakes use, the actual lever is detection, not prevention. You can't prevent hallucination at the model level. Vendors are already trying, and the returns are diminishing. What you can do is structure your workflow so hallucinations get caught before you act on them. The two reliable detection methods are independent verification (web search, primary sources) and cross-model comparison. Cross-model comparison works because hallucinations rarely survive when independent models with different training data check each other's work. A fact that one model invented usually isn't the same fact a second model invents, so the contradiction surfaces. The HalluHard finding that web search cuts hallucination roughly in half is the cleanest published evidence that independent verification works.
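And here is what the second method, cross-model comparison, looks like reduced to its skeleton. The `ask_model` function and its canned answers are placeholders, not a real API; the logic is the point: get the same claim from independently trained models and treat any disagreement as a flag to go check a primary source.

```python
# Sketch of a cross-model check. `ask_model` is a placeholder stub standing in
# for real API calls to independently trained models.

def ask_model(model_name: str, question: str) -> str:
    """Placeholder: a real implementation would call each model's API and
    reduce the response to a short, comparable claim."""
    canned = {  # pretend answers, for illustration only
        "model_a": "No such case exists.",
        "model_b": "No such case exists.",
        "model_c": "Yes, it was decided in 2021.",
    }
    return canned[model_name]

def cross_check(question: str, models: list[str]) -> dict:
    answers = {m: ask_model(m, question) for m in models}
    # In practice you'd compare extracted claims semantically, not raw strings;
    # exact-match comparison keeps the sketch short.
    agree = len(set(answers.values())) == 1
    return {"answers": answers, "needs_manual_verification": not agree}

result = cross_check("Does Mack v. Anderson (Ill. 2021) exist?",
                     ["model_a", "model_b", "model_c"])
print(result["needs_manual_verification"])  # True -> go check a primary source
```

The disagreement doesn't tell you which model is right. It tells you that you can't skip verification, which is the whole value.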
How to use the rate practically
Don't memorize percentages. Memorize the pattern: hallucination rate tracks task difficulty, training data density, output length, and whether sources are grounded at inference time.
When you're using an AI tool, ask yourself: is the question one where the answer is well-documented and verifiable, or one where it isn't? Is the output short or long? Is the model relying on internal training data or pulling from live sources?
The questions where you should be most skeptical are the ones that combine all the failure modes: long outputs, niche domain, recent events, specific citations required, no live grounding. That combination is where the published rates climb past 30%, and where single-model trust fails catastrophically. Finding this out cost Goldberg Segalla $49,500.
The questions where you can trust a single model are the inverse: short outputs, well-documented topics, no specific citations needed, no time-sensitive information.
For everything in between, cross-reference. The 45-minute manual workflow I used to run is the long-form version of this. AskThree is the automated version. Same logic either way: when a single model could be in the 17-34% range on a high-stakes question, two more models with different training data are the cheapest reliability check available.
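If you'd rather have that pattern as something more mechanical than a gut check, here's a rough heuristic. The factor weights are my own illustrative guesses, not published figures; what matters is which factors move the score.

```python
# Rough, illustrative risk heuristic for "how skeptical should I be of this
# output?" The weights are illustrative guesses, not published numbers.
def hallucination_risk(long_output: bool, niche_domain: bool,
                       recent_events: bool, needs_citations: bool,
                       grounded_in_sources: bool) -> str:
    score = (2 * long_output + 2 * niche_domain + 1 * recent_events
             + 3 * needs_citations - 3 * grounded_in_sources)
    if score <= 0:
        return "low: single-model output is probably fine"
    if score <= 3:
        return "medium: spot-check the specific claims"
    return "high: verify independently or cross-check with a second model"

print(hallucination_risk(long_output=True, niche_domain=True,
                         recent_events=False, needs_citations=True,
                         grounded_in_sources=False))  # "high"
```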
References
- Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Stanford RegLab and HAI. hai.stanford.edu
- Ravichander, A., et al. (2025). HALOGEN: Fantastic LLM Hallucinations and Where to Find Them. ACL 2025. aclanthology.org/2025.acl-long.71
- HalluHard: A Hard Multi-Turn Hallucination Benchmark. (2026). arXiv:2602.01031. arxiv.org/abs/2602.01031
- Asgari, E., et al. (2025). A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine. nature.com/articles/s41746-025-01670-7
- Vectara. Hallucination Leaderboard (HHEM-2.3). github.com/vectara/hallucination-leaderboard
- Bilyk, J. (2025, December 11). Judge: CHA lawyers must pay $59K for citing ChatGPT-created cases. Legal Newsline. legalnewsline.com
- Gutman, A. (2026, April 27). A federal judge sanctioned a Cherry Hill attorney for filing a brief with AI hallucinations, again. The Philadelphia Inquirer. inquirer.com