Apr 10, 2026 · Bonsai Team

Stop Asking Your Model Not to Lie. It Is Designed to Do So. [1/2]

Usually, it starts with a screenshot.

If you work in this space full-time, you know the dread. A stakeholder forwards you a chat log where your agent confidently offered a loan amount the bank doesn't actually provide, or promised a refund policy that doesn't exist. Everyone panics. The engineers check the system. The prompt "looked fine." You literally wrote "do not make things up" in all caps. And yet, here we are.


We don't even have to use hypotheticals. We have real, very public data on this. In early 2024, Air Canada had to learn this lesson in front of a civil tribunal. Their customer service chatbot hallucinated a bereavement fare policy, promising a retroactive discount to a grieving passenger. When the passenger tried to claim it, the airline's defense was essentially: it's the bot's fault, we aren't responsible for what it says.

The tribunal disagreed. In their eyes, the chatbot is the company. You are fully liable for your system's output.

Air Canada ordered to pay customer who was misled by airline's chatbot

Or look at McDonald's pulling the plug on their AI drive-thru tests with IBM a few months later. Voice AI is even less forgiving than text. When a system hallucinates an order or breaks down in a voice channel, there's no UI to hide behind. It's frustrating, it's immediate, and it directly damages the brand.

McDonald's ended its AI drive-thru pilot after public failures

The immediate business reaction to these incidents is almost always the same: "Fix the prompt." Add more instructions. Tell the model to really, seriously only use the provided context.

But that's a fundamental misunderstanding of what we are building. The problem isn't that the model "lied." The problem is that the architecture didn't differentiate between "brand-approved knowledge" and "knowledge the model has from its training data." To an LLM, it's all just one continuous probability distribution over tokens.

But to a compliance officer? It's the difference between a successful customer interaction and a lawsuit.

If you want to understand why this happens, you have to look at the actual research. We have years of academic literature documenting these failure modes, and it usually boils down to an uncomfortable truth:

LLMs are fundamentally engineered to be helpful, not factual.


Why "prompting harder" fails

When you just hand an LLM some data and tell it to answer questions, you are fighting against the model's core training. Here is what the data actually tells us about why "prompting harder" fails:

The model is a pathological people-pleaser

Look at the research on sycophancy in language models.

Sycophancy research: models consistently align with user beliefs at the expense of truth

The very mechanism that makes models conversational and pleasant to use actively pushes them to align with a user's beliefs or requests, often at the expense of the truth. If a user aggressively asks for a discount, and the model predicts that agreeing will score high on "helpfulness," it will happily invent a policy to make the user happy. This is closely related to a known phenomenon called reward hacking: the model optimizes the proxy it was rewarded on (sounding helpful) rather than the goal that proxy was meant to capture (being right).

Reward hacking: individual KRIs are imperfect proxies for the true evaluation score

You cannot prompt away a behavior the model was literally optimized to perform.

The context window illusion

The most common lazy fix is what I call the "dump and pray" method: just throw the entire 100-page policy manual into the context window and assume the model will figure it out. The famous "Lost in the Middle" paper effectively killed this idea. It showed that LLMs do not utilize long contexts evenly. They heavily weight the beginning and the end of a prompt, and performance degrades severely when the relevant information is buried in the middle.

Lost in the Middle: model accuracy collapses when the answer is not at the edges of the context

If your crucial compliance disclaimer is on page 42 of the retrieved context, the model will likely ignore it and guess anyway. Throwing massive amounts of data at a model isn't information architecture; it's just hoping for the best.

The inability to just shut up

In banking, healthcare, or any regulated sector, the most valuable thing a conversational AI can say is, "I don't know." Yet, benchmark after benchmark — like AbstentionBench — shows that models consistently fail at proper abstention. They struggle to identify unanswerable questions. Unless you architecturally force a fallback, the probability distribution will eventually drag the model into making a confident guess.

AbstentionBench: models consistently fail at proper abstention across all categories and domains
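
One way to see how your own agent behaves here is to measure abstention directly. The sketch below is a simplified illustration, not the AbstentionBench methodology: it assumes a small hand-labeled set of questions and a fixed fallback string, so an abstention can be detected by exact match. The `Case` type, the `FALLBACK` text, and the sample replies are all invented for the example.

```python
from dataclasses import dataclass

FALLBACK = "I don't know, let me transfer you."  # hypothetical fixed fallback text


@dataclass
class Case:
    question: str
    answerable: bool  # hand-labeled: can the approved knowledge answer this?
    reply: str        # what the bot actually said


def abstained(reply: str) -> bool:
    # Assumption: the fallback is a fixed string, so exact match is enough.
    return reply.strip() == FALLBACK


def abstention_scores(cases: list[Case]) -> dict[str, float]:
    should = [c for c in cases if not c.answerable]   # bot should have abstained
    did = [c for c in cases if abstained(c.reply)]    # bot did abstain
    correct = [c for c in did if not c.answerable]    # abstained when it should
    recall = len(correct) / len(should) if should else 1.0
    precision = len(correct) / len(did) if did else 1.0
    return {"recall": recall, "precision": precision}


# Invented sample data, for illustration only.
cases = [
    Case("What is the overdraft fee?", True, "The overdraft fee is listed in your tariff."),
    Case("What's the best stock to buy?", False, "You should consider tech stocks."),
    Case("Can I pay my loan in bitcoin?", False, FALLBACK),
    Case("How do I reset my PIN?", True, FALLBACK),
]
scores = abstention_scores(cases)
print(scores)  # {'recall': 0.5, 'precision': 0.5}
```

Low recall means the bot guesses where it should have said "I don't know"; low precision means it refuses questions it could have answered.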

Even standard RAG is not a silver bullet

A lot of people think RAG (Retrieval-Augmented Generation) is the final answer to grounding. It isn't. An OpenAI whitepaper from 2025 on why models hallucinate spells this out clearly: RAG helps, but it is not a panacea. If your retrieval system pulls the wrong chunk of text, your model will just synthesize a beautifully coherent, completely wrong answer based on bad data.

RAG with inaccurate documents: confident wrong answers generated from bad retrieval

As frameworks like Corrective RAG demonstrate, RAG without strict quality control on the retrieved context can actually introduce errors into the conversation.
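
A quality gate on retrieved context can be sketched in a few lines. This is a simplified gate in the spirit of that idea, not the Corrective RAG method itself. It assumes each retrieved chunk already carries a relevance score from some grader; the `Chunk` type, the thresholds, and the fallback text are hypothetical.

```python
from dataclasses import dataclass

FALLBACK = "I can't confirm that from our approved documents."  # hypothetical


@dataclass
class Chunk:
    text: str
    score: float  # relevance from a hypothetical grader, 0.0 to 1.0


def gate(chunks: list[Chunk], keep_at: float = 0.7, drop_below: float = 0.3):
    """Return (action, chunks_to_use). Thresholds are illustrative assumptions."""
    confident = [c for c in chunks if c.score >= keep_at]
    if confident:
        return "answer", confident      # generate only from high-scoring chunks
    ambiguous = [c for c in chunks if c.score >= drop_below]
    if ambiguous:
        return "clarify", ambiguous     # ask the user a clarifying question
    return "fallback", []               # nothing usable: reply with FALLBACK


action, used = gate([Chunk("Refunds within 30 days.", 0.82), Chunk("Careers page.", 0.10)])
print(action, [c.text for c in used])  # answer ['Refunds within 30 days.']
```

The point of the design is that a weak retrieval result never reaches the generator at all, so there is nothing for it to build a coherent wrong answer from.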

The takeaway here isn't that AI is broken. The takeaway is that relying on the model's internal "judgment" to strictly follow your business rules is a losing game.


The problem is structural

Therefore, the solution has to be structural.

People in the industry keep using the word "hallucination" like the model is having some sort of glitch or a psychotic break. It's not. When an LLM invents a brand-new return policy for your e-commerce store, it is functioning exactly as designed.

To understand why, we need to clearly define the problem. And to do that, we have to stop treating language models like databases or reasoning engines, and start treating them like what they are: probability distributions over tokens.

An LLM does not have a mental filing cabinet where it separates "Real Corporate Facts" from "Things I Read on Reddit in 2023." To the model, it is all just one massive, flat continuum of statistical weights. Your carefully vetted, legally approved, 50-page PDF on mortgage rates lives in the exact same neural soup as a sci-fi novel.

So, when a user asks a question, the model doesn't query a truth table. It calculates the most statistically plausible next word. The problem isn't that the model "lied." A model cannot lie because it has no fundamental concept of the truth.

The actual problem is a failure of Information Architecture.

Specifically, it's a failure to draw a hard architectural boundary between what the model can say and what the model actually knows. When we build enterprise conversational AI, we have to separate the generative engine (the part that formats the text, maintains the state of the conversation, and makes it sound human) from the informational engine (the actual, hard facts the business operates on).

The knowledge blending problem: parametric memory and approved brand data blur into one undifferentiated mass
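
As a minimal sketch of that separation, the example below keeps every fact in a lookup table and reduces the generative side to a string template. In a real system an LLM might handle the phrasing; the assumption here is only that values come from the approved store and nowhere else. The store contents, field names, and handoff text are made-up placeholders.

```python
from string import Template
from typing import Optional

# Informational engine: the only source of facts. Values are placeholders.
APPROVED_FACTS = {
    ("home_insurance", "monthly_premium"): "EXAMPLE_PREMIUM",
    ("home_insurance", "excess"): "EXAMPLE_EXCESS",
}

HANDOFF = "I don't have that information. Let me transfer you to an agent."


def lookup(product: str, field: str) -> Optional[str]:
    # Returns None for anything that was never approved.
    return APPROVED_FACTS.get((product, field))


def render(template: str, product: str, fields: list[str]) -> str:
    """Generative engine stand-in: formats text, never supplies a value itself."""
    values = {f: lookup(product, f) for f in fields}
    if any(v is None for v in values.values()):
        return HANDOFF  # a gap in the facts ends the answer instead of being filled
    return Template(template).substitute(values)


print(render("Your monthly premium is $monthly_premium.", "home_insurance", ["monthly_premium"]))
print(render("Your loyalty discount is $loyalty_discount.", "home_insurance", ["loyalty_discount"]))
```

The first call prints a sentence built from the stored value; the second prints the handoff message, because no approved fact exists for `loyalty_discount`.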

When you don't build that separation, you run into two massive conceptual roadblocks:

1. The Knowledge Blending Problem. If you don't physically constrain the model's access to information, it will blend its parametric memory (the stuff it was trained on) with your brand's data. If a customer asks about a specific insurance premium, the model might start with your retrieved document, hit a gap in the information, and seamlessly fill that gap with a highly plausible — but entirely fictitious — number it pulled from its latent space. From a user experience perspective, it looks like a confident, factual answer. From a legal perspective, it's a disaster.

2. The Out-of-Scope (OOS) Failure. In a regulated sector, the phrase "I don't know, let me transfer you" is not a system failure. It is a highly successful, legally necessary outcome. We call this Abstention. The issue is that generative models are structurally allergic to abstention. They are designed to generate text. If a user asks a banking bot, "What's the best stock to buy right now?", the bot shouldn't just politely decline using a generative response that might accidentally include a caveat that sounds like financial advice. It shouldn't be generating an answer at all.

This brings us to the core thesis of designing for regulated AI: Compliance is not a generative output. It is a routing decision.

If you want a system that doesn't hallucinate your company's policies, you have to build an architecture where the system simply cannot reach outside of your approved knowledge base. Not won't. Can't.
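
To make "a routing decision" concrete, here is a rough sketch under stated assumptions: the keyword classifier is a toy stand-in for a real intent model, and the route names, handlers, and handoff text are hypothetical. What it illustrates is the control flow, where the scope decision happens before any text is generated.

```python
from typing import Callable

HANDOFF = "I can't help with that here. Let me transfer you to a colleague."


def classify(message: str) -> str:
    """Toy keyword classifier; a stand-in for a real intent model (assumption)."""
    text = message.lower()
    if "refund" in text:
        return "refund_policy"
    if "opening hours" in text:
        return "opening_hours"
    return "out_of_scope"


def answer_refund_policy() -> str:
    return "APPROVED_REFUND_POLICY_TEXT"  # placeholder for vetted copy


def answer_opening_hours() -> str:
    return "APPROVED_OPENING_HOURS_TEXT"  # placeholder for vetted copy


# Only intents listed here can ever produce an answer.
ROUTES: dict[str, Callable[[], str]] = {
    "refund_policy": answer_refund_policy,
    "opening_hours": answer_opening_hours,
}


def respond(message: str) -> str:
    handler = ROUTES.get(classify(message))
    if handler is None:
        return HANDOFF  # fixed text; no generative call is made on this path
    return handler()


print(respond("What's the best stock to buy right now?"))
print(respond("Can I get a refund?"))
```

The stock question never reaches a generator, so there is no generated caveat that could be read as financial advice.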


The "fix the prompt" trap

So how does the industry usually solve this? With the cheapest, fastest, and most dangerously lazy fix available: We try to prompt our way out of it.

When the stakeholder sends that dreaded screenshot of a hallucinated loan offer, the immediate reflex is to open the system prompt and add a sternly worded rule. It usually looks something like this:

The "VERY IMPORTANT RULES" prompt — treating a structural engineering problem as a Post-it note

You deploy the update. You run a few test queries. The bot politely declines to answer questions about the weather, it sticks to the pricing sheet, and it says "I don't know" when asked about unreleased products. It works. The ticket is closed. You high-five the team.

And then week three in production hits.

Banking chatbot offering a payment holiday without guardrails — hallucinated policy
Chatbot hallucinating a nonexistent payment freeze program with full confidence

A user comes in with a messy, multi-intent question, complains about their current interest rate, and suddenly your bot is back to inventing promotional tier upgrades that don't exist.

The twist here is that the prompt wasn't badly written. It simply couldn't hold up under the reality of how language models actually work. You tried to solve a structural engineering problem with a strongly worded Post-it note.

You cannot ask a model to simply promise not to lie. You have to build a system where lying isn't on the menu.


Built by utter.one, an AI lab at HearMe, with 8+ years experience engineering conversational and voice AI.