AI Evals Are Not So Different From the Tests You Already Write
If you have worked with LLMs in production, you have probably heard the term "evals" and wondered how they relate to the test suites you already maintain. The short answer: evals are a natural extension of automated testing into a world where outputs are non-deterministic. The longer answer is more nuanced, and understanding both the similarities and the differences will help you ship AI-powered features with confidence.
Why You Need Evals (Alongside Your Existing Tests)
Traditional tests verify deterministic behavior. Given input X, you expect output Y, every single time. Evals exist because LLM outputs do not work that way. A prompt that produces a great answer today might produce a subtly different one tomorrow, and you need a systematic way to measure whether "different" means "better," "worse," or "about the same."
This does not mean evals replace your existing test suite. They complement it. Your unit tests still cover your deterministic application logic. Evals cover the parts of your system where the output is probabilistic.
There are two broad scenarios where evals earn their keep.
Projects with LLM integrations. Any time you are calling a model API, you need a way to compare prompt variations, evaluate whether model A outperforms model B for your use case, and catch regressions when you update your prompts. Tools like Promptfoo formalize this process by letting you define test cases, providers, and assertions in a single config:
# promptfooconfig.yaml
prompts:
  - prompts/summarize-v1.txt
  - prompts/summarize-v2.txt
providers:
  - openai:gpt-4o
  - anthropic:messages:claude-sonnet-4-6
tests:
  - vars:
      article: file://fixtures/article-1.txt
    assert:
      - type: llm-rubric
        value: "Summary captures the three main arguments without hallucinating details"
      - type: javascript
        value: output.length < 500
  - vars:
      article: file://fixtures/article-2.txt
    assert:
      - type: llm-rubric
        value: "Summary is written in a neutral tone"
Developing custom agent skills. If you are building agent workflows or custom skills, evals let you compare a purpose-built skill against a bare prompt to a default model. You can also compare variations of the same skill to decide whether to optimize for speed versus depth, or to enforce a specific output format.
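As a sketch, that comparison fits the same config format as above. The prompt files and fixture below are hypothetical stand-ins for a skill-augmented prompt and a bare baseline:

# promptfooconfig.yaml (sketch; prompt files and fixtures are hypothetical)
prompts:
  - prompts/research-skill.txt    # purpose-built skill prompt
  - prompts/bare-baseline.txt     # plain prompt to the default model
providers:
  - anthropic:messages:claude-sonnet-4-6
tests:
  - vars:
      task: file://fixtures/research-task-1.txt
    assert:
      - type: llm-rubric
        value: "Cites at least three distinct sources and states a clear recommendation"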
Key Differences From Traditional Testing
Non-deterministic outputs require statistical thinking. This is the fundamental difference. A single test run is not enough. You need multiple runs of the same eval to compute an average score, because the same prompt can produce varying quality on successive calls. Where a traditional test either passes or fails, an eval might pass 85% of the time, and that number is the metric you actually track.
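Promptfoo has a repeat option for exactly this purpose (check the docs for the version you are running); a minimal excerpt might look like:

# promptfooconfig.yaml (excerpt)
evaluateOptions:
  repeat: 5   # run each test case five times so the score averages over sampling noise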
Assertions are more flexible. Traditional tests are binary: pass or fail. Evals open up a spectrum. There are three main types of assertions you can use:
- Programmatic checks validate exact outputs or structural formats. These are closest to traditional assertions. Does the response contain valid JSON? Is it under the token limit?
- LLM-as-judge uses one model to assess the output of another. You provide a rubric, and the judge model scores the response. This is powerful for evaluating subjective qualities like tone, completeness, or factual accuracy.
- Human judgment pauses the eval and waits for a person to score the output. This is slower but sometimes the only reliable option for nuanced quality assessments.
You can also use rubrics to weight different criteria. One prompt version might produce output that is well-structured but misses a key detail; another might nail the content but be poorly formatted. Rubrics let you express that content accuracy matters more than formatting, or vice versa.
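In Promptfoo, this is expressed with an optional weight on each assertion; the rubric text below is illustrative:

tests:
  - vars:
      article: file://fixtures/article-1.txt
    assert:
      - type: llm-rubric
        value: "Captures all three main arguments accurately"
        weight: 3   # content accuracy counts three times as much as formatting
      - type: llm-rubric
        value: "Uses the requested bullet-point structure"
        weight: 1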
Model choice affects results in ways hardware choice does not. With traditional tests, you can run your suite on weaker hardware and the only penalty is speed. The tests still pass or fail the same way. Evals do not have this property. Running an eval on a cheaper model, or with a smaller context window or token budget, can completely change whether assertions pass. Even different versions of the same model can shift results. This makes it important to pin your model version in eval configs and to run evals explicitly when upgrading to a newer version.
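Concretely, prefer dated snapshots over floating aliases in your provider list. The exact snapshot names available depend on your account and on when you are reading this:

providers:
  - openai:gpt-4o-2024-08-06                       # pinned snapshot, not the floating gpt-4o alias
  - anthropic:messages:claude-3-5-sonnet-20241022  # pinned rather than a "latest" alias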
Evals probably should not gate every commit in CI. Traditional tests are cheap, fast, and deterministic enough to run on every push. Evals are expensive (you are making real API calls) and inherently flakier due to non-determinism, so running them on every commit is wasteful and produces noisy results.
A more practical approach is to trigger evals selectively. For example, you could configure your CI to only run evals when files in a prompts/ directory change:
# .github/workflows/evals.yml
name: Run Evals
on:
  push:
    paths:
      - 'prompts/**'
      - 'evals/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptfoo
      - run: promptfoo eval --config evals/promptfooconfig.yaml
This keeps eval costs proportional to actual prompt changes rather than burning tokens on every unrelated commit.
Success metrics are tracked over time. Unlike a test suite where you care mainly about the current green/red status, eval results are most useful as a trend. Tracking scores across runs lets you catch gradual regressions and measure the impact of prompt improvements quantitatively. Think of it less like "did the build pass" and more like "is our prompt quality trending upward."
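One low-effort way to build that history is to have each CI run write its results to a file keyed by commit and keep it as a build artifact. The sketch below swaps the last step of the earlier workflow for one that uses Promptfoo's --output flag; the path layout is arbitrary:

      # Replaces the final step of the workflow above (sketch)
      - run: promptfoo eval --config evals/promptfooconfig.yaml --output eval-results/${{ github.sha }}.json
      - uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results/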
Lessons From Traditional Testing That Still Apply
Despite the differences, the decades of wisdom we have accumulated around automated testing transfer remarkably well to evals.
Eval quality determines result quality. Vague assertions produce vague signals. If your eval just checks "did the model return something," it is not going to catch meaningful regressions. The more specific and thorough your expectations, the more useful your eval results become. This is the same principle behind writing good unit tests: a test that asserts result != null is technically a test, but it is not a very useful one.
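The same contrast is easy to see in assertion form; the rubric wording here is illustrative:

# Technically an eval, but a weak signal
assert:
  - type: javascript
    value: output.length > 0

# A stronger signal: structure and content are both pinned down
assert:
  - type: is-json
  - type: javascript
    value: output.length < 500
  - type: llm-rubric
    value: "Names each of the article's three main arguments and attributes them correctly"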
Coverage gaps lead to behavior gaps. Just as untested code paths can harbor bugs, edge cases your evals never exercise can harbor bad model behavior. If you only eval the happy path, you will miss the cases where your prompt falls apart on unusual inputs.
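In practice, that means your test cases should include the inputs you are worried about, not just the clean ones. The fixture names below are hypothetical:

tests:
  - vars:
      article: file://fixtures/article-1.txt     # happy path
  - vars:
      article: file://fixtures/empty.txt          # nothing to summarize
  - vars:
      article: file://fixtures/non-english.txt    # input in another language
  - vars:
      article: file://fixtures/adversarial.txt    # text that tries to override the prompt's instructions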
Evals catch regressions. One of the most valuable properties of a traditional test suite is that it prevents new changes from breaking existing behavior. Evals serve the same purpose for AI features. When you add a new capability to your prompt or agent, your existing evals verify that prior behavior still meets expectations.
Maintain your eval fixtures. In traditional testing, you maintain test fixtures as your codebase evolves. The same discipline applies to evals. If your application domain changes, your "golden datasets" (the reference inputs and expected outputs you eval against) need to be updated to reflect the new reality. Stale fixtures lead to stale evals that either cry wolf or miss real problems.
Eval-Driven Development. Test-Driven Development works because it forces you to define expected behavior before writing implementation code. The same approach adapts naturally to AI integrations. Before writing or refining a prompt, define exactly what outputs you expect for a set of representative inputs. Encode those expectations as eval assertions. Then iterate on the prompt until the evals pass. This inverts the common pattern of tweaking prompts until the output "looks right" in a playground and replaces it with something measurable and repeatable.
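In config terms, the assertions get written first and act as the spec the prompt has to satisfy; the expectations below are illustrative:

# Written before the prompt is tuned: the definition of "a good summary"
tests:
  - vars:
      article: file://fixtures/article-1.txt
    assert:
      - type: llm-rubric
        value: "States the author's main claim in the first sentence"
      - type: not-contains
        value: "As an AI"     # no meta-commentary in the output
      - type: javascript
        value: output.length < 500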
Wrapping Up
Evals are not a replacement for your existing test suite. They are the testing strategy for the non-deterministic parts of your system. The mental models you have built around test quality, coverage, regression detection, and test-driven development all carry over. The main adjustments are statistical (run evals multiple times, track trends) and economic (be deliberate about when and where you spend API tokens on evaluation).
If you are shipping LLM-powered features without evals, you are essentially deploying untested code. The tools and patterns exist. The hardest part is the same as it has always been with testing: writing good assertions and maintaining them as your system evolves.



