Your head of customer success pings you: “ImportantCustomer is getting weird outputs for their answers. Not sure how common it is, but they mentioned it twice this week. OtherBigCustomer described something similar a few weeks back”.

You know this situation. It’s a special hell of LLM-based products: every quality issue arrives with perfect ambiguity. Is this a quick prompt tweak or a fundamental architecture problem? Should you drop everything to reproduce, or add it to the backlog? Will fixing it break what actually works well for your other customers?

Most teams handle this in one of two ways:

  1. Prioritize speed, test on a few synthetic examples, and deploy targeted patches
  2. Thoroughly gather examples, then build an eval and a comprehensive fix before shipping

Both approaches have drawbacks, and trying to split the difference between them misses the actual problem.

The real problem isn’t choosing between fast and thorough. It’s that you’re solving problems before defining them properly.

Here’s the key point: you should build the eval the moment you identify a quality issue, not when you decide to fix it.

Why We Get Trapped

Most teams treat evaluation creation and problem-solving as part of a single process. Customer reports an issue → gather examples → build comprehensive test suite → implement fix → ship when you hit 95% accuracy.

This bundles two completely different activities:

  1. Defining what’s broken (evaluation)
  2. Fixing what’s broken (solution)

When you do them together under customer pressure, you get stuck choosing between thorough evaluation (slow) and quick fixes (risky).

The Separation Principle

Just because a customer identifies a problem doesn’t mean you fix it immediately. And just because you decide not to fix a problem today doesn’t mean it won’t suddenly catch fire tomorrow.

That’s the beauty of separating eval creation from problem solving. By training your muscle memory to create an eval as soon as an issue is identified, you build up historical data for yourself.

When the day comes that you MUST fix issue_x, you’ll already have:

  • A clear definition of the failure mode
  • Baseline metrics on how often it occurs
  • Representative test cases ready to validate against
  • Historical context on whether it’s getting worse

You’re not scrambling to define the problem while customers wait for solutions.

The Immediate Eval Method

Step 1: Name the Beast

Transform complaints like “the LLM is acting weird” into “the LLM uses the wrong client name when answering from RAG over previous answers”. Specific, trackable, solvable.

Step 2: Build a Tiny, Targeted Eval (20 minutes max)

  • Take every example the customer provided
  • Spend 20 minutes searching logs for similar patterns (e.g., does this only happen when the context window is full?)
  • Create a simple pass/fail check (programmatic rule or LLM-judge, but binary is simplest); a sketch follows this list
  • Run it once to get baseline failure rate
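Here is a minimal sketch of the programmatic-rule version of such a binary check, using the wrong-client-name failure mode from Step 1 as the example. The record fields, the client names, and the check logic are illustrative assumptions, not a prescribed format; an LLM-judge could replace the rule where string matching isn’t enough.

```python
# Sketch: binary pass/fail eval for the "wrong client name" failure mode.
# Record format, client names, and check logic are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_client: str   # the client the answer should reference
    model_output: str      # the LLM's actual answer, pulled from logs


def passes(case: EvalCase) -> bool:
    """Pass if the answer mentions the expected client and no other known client."""
    other_clients = {"AcmeCo", "Globex", "Initech"} - {case.expected_client}
    mentions_expected = case.expected_client in case.model_output
    mentions_other = any(c in case.model_output for c in other_clients)
    return mentions_expected and not mentions_other


def baseline_failure_rate(cases: list[EvalCase]) -> float:
    failures = sum(1 for c in cases if not passes(c))
    return failures / len(cases)


# Run once over the customer-provided examples plus anything similar found in logs.
cases = [
    EvalCase("Summarize last week's tickets for AcmeCo", "AcmeCo",
             "Globex reported three outages last week..."),    # fails: wrong client
    EvalCase("What did Initech ask about pricing?", "Initech",
             "Initech asked whether volume discounts apply."),  # passes
]
print(f"Baseline failure rate: {baseline_failure_rate(cases):.0%}")
```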

Step 3: File It Away

Don’t fix it yet. Just add it to your quality dashboard. Let it collect data. You now have a named, measurable quality issue that you can track over time and prioritize systematically.
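One lightweight way to “let it collect data” is to append each eval run to a results file that your dashboard reads. The file path, schema, and the commented-out run hook below are assumptions for illustration, not a required setup.

```python
# Sketch: append each eval run to a JSONL file the quality dashboard can read.
# File path, schema, and the scheduled-run hook are illustrative assumptions.

import json
import os
from datetime import datetime, timezone

RESULTS_PATH = "quality_evals/wrong_client_name.jsonl"


def record_run(issue: str, failure_rate: float, n_cases: int) -> None:
    os.makedirs(os.path.dirname(RESULTS_PATH), exist_ok=True)
    entry = {
        "issue": issue,  # the named failure mode from Step 1
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "failure_rate": failure_rate,
        "n_cases": n_cases,
    }
    with open(RESULTS_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")


# Scheduled (e.g., nightly) so the issue accumulates history before anyone fixes it:
# rate = baseline_failure_rate(cases)
# record_run("wrong_client_name_in_rag_answers", rate, n_cases=len(cases))
```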

Step 4: Fix When Ready

When business priority, engineering capacity, and customer impact align, you’re ready to act immediately. No scrambling to define the problem or gather examples. You have months of data and a proven evaluation framework.

Why This Changes Everything

Traditional approach:

Problem discovered → Decide between fixing and ignoring → Once prioritized, rush to define AND solve simultaneously → Forced to choose between speed and thoroughness

Proactive eval approach:

Problem discovered → 20-minute eval creation → Systematic tracking → Confident fixes once prioritized

Train your product org’s reflexes:

Good habits are the cornerstone of strong product organizations. Evaluation frameworks are useless without data, and simple habits are the ones most likely to stick. A 20-minute eval per issue is small enough to become a reflex.

Getting Started

Don’t wait for your next crisis. Look at your last three customer quality complaints. Can you name each failure mode specifically? Do you have simple pass/fail checks for them? Do you know their baseline occurrence rates?

If not, spend 20 minutes on each one right now. Build the evals before you need them.

Three months of this practice will give you a comprehensive quality dashboard built from real customer pain points, not theoretical edge cases. When leadership asks “what should we prioritize?”, you have data, not opinions.

The Real Insight

Most teams think the problem is choosing between speed and thoroughness. The real problem is trying to define quality issues while solving them.

Separate the concerns. Build evals immediately. Fix when ready.

Your future self, and your customers, will thank you.