AI Evals Are Not So Different From the Tests You Already Write
If you have worked with LLMs in production, you have probably heard the term "evals" and wondered how they relate to the test suites you already maintain. The short answer: evals are a natural extension of automated testing into a world where outputs are non-deterministic. The longer answer is more nuanced, and understanding both the similarities and the differences will help you ship AI-powered features with confidence.
Why You Need Evals (Alongside Your Existing Tests)
Traditional tests verify deterministic behavior. Given input X, you expect output Y, every single time. Evals exist because LLM outputs do not work that way. A prompt that produces a great answer today might produce a subtly different one tomorrow, and you need a systematic way to measure whether "different" means "better," "worse," or "about the same."
This does not mean evals replace your existing test suite. They complement it. Your unit tests still cover your deterministic application logic. Evals cover the parts of your system where the output is probabilistic.
There are two broad scenarios where evals earn their keep.
Projects with LLM integrations. Any time you are calling a model API, you need a way to compare prompt variations, evaluate whether model A outperforms model B for your use case, and catch regressions when you update your prompts. Tools like Promptfoo formalize this process by letting you define test cases, providers, and assertions in a single config:
# promptfooconfig.yaml
prompts:
  - prompts/summarize-v1.txt
  - prompts/summarize-v2.txt
providers:
  - openai:gpt-4o
  - anthropic:messages:claude-sonnet-4-6
tests:
  - vars:
      article: file://fixtures/article-1.txt
    assert:
      - type: llm-rubric
        value: "Summary captures the three main arguments without hallucinating details"
      - type: javascript
        value: output.length < 500
  - vars:
      article: file://fixtures/article-2.txt
    assert:
      - type: llm-rubric
        value: "Summary is written in a neutral tone"
Developing custom agent skills. If you are building agent workflows or custom skills, evals let you compare a purpose-built skill against a bare prompt to a default model. You can also compare variations of the same skill to decide whether to optimize for speed versus depth, or to enforce a specific output format.
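As a sketch, that comparison fits the same config format as above. The prompt files and fixture below are hypothetical stand-ins for a skill-augmented prompt and a bare baseline:

# promptfooconfig.yaml (sketch; prompt files and fixtures are hypothetical)
prompts:
  - prompts/research-skill.txt    # purpose-built skill prompt
  - prompts/bare-baseline.txt     # plain prompt to the default model
providers:
  - anthropic:messages:claude-sonnet-4-6
tests:
  - vars:
      task: file://fixtures/research-task-1.txt
    assert:
      - type: llm-rubric
        value: "Cites at least three distinct sources and states a clear recommendation"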
Key Differences From Traditional Testing
Non-deterministic outputs require statistical thinking. This is the fundamental difference. A single test run is not enough. You need multiple runs of the same eval to compute an average score, because the same prompt can produce varying quality on successive calls. Where a traditional test either passes or fails, an eval might pass 85% of the time, and that number is the metric you actually track.
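Promptfoo has a repeat option for exactly this purpose (check the docs for the version you are running); a minimal excerpt might look like:

# promptfooconfig.yaml (excerpt)
evaluateOptions:
  repeat: 5   # run each test case five times so the score averages over sampling noise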
Assertions are more flexible. Traditional tests are binary: pass or fail. Evals open up a spectrum. There are three main types of assertions you can use:
- Programmatic checks validate exact outputs or structural formats. These are closest to traditional assertions. Does the response contain valid JSON? Is it under the token limit?
- LLM-as-judge uses one model to assess the output of another. You provide a rubric, and the judge model scores the response. This is powerful for evaluating subjective qualities like tone, completeness, or factual accuracy.
- Human judgment pauses the eval and waits for a person to score the output. This is slower but sometimes the only reliable option for nuanced quality assessments.
You can also use rubrics to weight different criteria. One prompt version might produce output that is well-structured but misses a key detail; another might nail the content but be poorly formatted. Rubrics let you express that content accuracy matters more than formatting, or vice versa.
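In Promptfoo, this is expressed with an optional weight on each assertion; the rubric text below is illustrative:

tests:
  - vars:
      article: file://fixtures/article-1.txt
    assert:
      - type: llm-rubric
        value: "Captures all three main arguments accurately"
        weight: 3   # content accuracy counts three times as much as formatting
      - type: llm-rubric
        value: "Uses the requested bullet-point structure"
        weight: 1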
Model choice affects results in ways hardware choice does not. With traditional tests, you can run your suite on weaker hardware and the only penalty is speed. The tests still pass or fail the same way. Evals do not have this property. Running an eval on a cheaper model, or with a smaller context window or token budget, can completely change whether assertions pass. Even different versions of the same model can shift results. This makes it important to pin your model version in eval configs and to run evals explicitly when upgrading to a newer version.
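Concretely, prefer dated snapshots over floating aliases in your provider list. The exact snapshot names available depend on your account and on when you are reading this:

providers:
  - openai:gpt-4o-2024-08-06                       # pinned snapshot, not the floating gpt-4o alias
  - anthropic:messages:claude-3-5-sonnet-20241022  # pinned rather than a "latest" alias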
Evals probably should not gate every commit in CI. Traditional tests are cheap, fast, and deterministic enough to run on every push. Evals are expensive (you are making real API calls) and inherently flakier due to non-determinism, so running them on every commit is wasteful and produces noisy results.
A more practical approach is to trigger evals selectively. For example, you could configure your CI to only run evals when files in a prompts/ directory change:
# .github/workflows/evals.yml
name: Run Evals
on:
  push:
    paths:
      - 'prompts/**'
      - 'evals/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptfoo
      - run: promptfoo eval --config evals/promptfooconfig.yaml
This keeps eval costs proportional to actual prompt changes rather than burning tokens on every unrelated commit.
Success metrics are tracked over time. Unlike a test suite where you care mainly about the current green/red status, eval results are most useful as a trend. Tracking scores across runs lets you catch gradual regressions and measure the impact of prompt improvements quantitatively. Think of it less like "did the build pass" and more like "is our prompt quality trending upward."
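One low-effort way to build that history is to have each CI run write its results to a file keyed by commit and keep it as a build artifact. The sketch below swaps the last step of the earlier workflow for one that uses Promptfoo's --output flag; the path layout is arbitrary:

      # Replaces the final step of the workflow above (sketch)
      - run: promptfoo eval --config evals/promptfooconfig.yaml --output eval-results/${{ github.sha }}.json
      - uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results/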
Lessons From Traditional Testing That Still Apply
Despite the differences, the decades of wisdom we have accumulated around automated testing transfer remarkably well to evals.
Eval quality determines result quality. Vague assertions produce vague signals. If your eval just checks "did the model return something," it is not going to catch meaningful regressions. The more specific and thorough your expectations, the more useful your eval results become. This is the same principle behind writing good unit tests: a test that asserts result != null is technically a test, but it is not a very useful one.
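The same contrast is easy to see in assertion form; the rubric wording here is illustrative:

# Technically an eval, but a weak signal
assert:
  - type: javascript
    value: output.length > 0

# A stronger signal: structure and content are both pinned down
assert:
  - type: is-json
  - type: javascript
    value: output.length < 500
  - type: llm-rubric
    value: "Names each of the article's three main arguments and attributes them correctly"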
Coverage gaps lead to behavior gaps. Just as untested code paths can harbor bugs, edge cases your evals never exercise can harbor bad model behavior. If you only eval the happy path, you will miss the cases where your prompt falls apart on unusual inputs.
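In practice, that means your test cases should include the inputs you are worried about, not just the clean ones. The fixture names below are hypothetical:

tests:
  - vars:
      article: file://fixtures/article-1.txt     # happy path
  - vars:
      article: file://fixtures/empty.txt          # nothing to summarize
  - vars:
      article: file://fixtures/non-english.txt    # input in another language
  - vars:
      article: file://fixtures/adversarial.txt    # text that tries to override the prompt's instructions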
Evals catch regressions. One of the most valuable properties of a traditional test suite is that it prevents new changes from breaking existing behavior. Evals serve the same purpose for AI features. When you add a new capability to your prompt or agent, your existing evals verify that prior behavior still meets expectations.
Maintain your eval fixtures. In traditional testing, you maintain test fixtures as your codebase evolves. The same discipline applies to evals. If your application domain changes, your "golden datasets" (the reference inputs and expected outputs you eval against) need to be updated to reflect the new reality. Stale fixtures lead to stale evals that either cry wolf or miss real problems.
Eval-Driven Development. Test-Driven Development works because it forces you to define expected behavior before writing implementation code. The same approach adapts naturally to AI integrations. Before writing or refining a prompt, define exactly what outputs you expect for a set of representative inputs. Encode those expectations as eval assertions. Then iterate on the prompt until the evals pass. This inverts the common pattern of tweaking prompts until the output "looks right" in a playground and replaces it with something measurable and repeatable.
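In config terms, the assertions get written first and act as the spec the prompt has to satisfy; the expectations below are illustrative:

# Written before the prompt is tuned: the definition of "a good summary"
tests:
  - vars:
      article: file://fixtures/article-1.txt
    assert:
      - type: llm-rubric
        value: "States the author's main claim in the first sentence"
      - type: not-contains
        value: "As an AI"     # no meta-commentary in the output
      - type: javascript
        value: output.length < 500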
Wrapping Up
Evals are not a replacement for your existing test suite. They are the testing strategy for the non-deterministic parts of your system. The mental models you have built around test quality, coverage, regression detection, and test-driven development all carry over. The main adjustments are statistical (run evals multiple times, track trends) and economic (be deliberate about when and where you spend API tokens on evaluation).
If you are shipping LLM-powered features without evals, you are essentially deploying untested code. The tools and patterns exist. The hardest part is the same as it has always been with testing: writing good assertions and maintaining them as your system evolves.



