Evaluations
Building evals you'd actually show a board
"Accuracy" is the wrong metric for an executive agent. We walk through the evaluation harness we use for production CFO and procurement agents, the gold-set methodology behind it, and the kinds of failure modes generic benchmarks completely miss.
Conformal Engineering · 2 Mar 2026 · 14 min read
A board does not care that your agent scored 87 percent on a generic benchmark. It cares whether the system can answer the ten questions that change a meeting. It cares whether a wrong answer is detectable. It cares whether the product knows when to stop. That makes evaluation less like an exam and more like a control system.
The word accuracy hides too much. An executive agent can retrieve the correct number and still be wrong because it used the wrong fiscal period. It can write valid SQL and still be wrong because the business definition changed after a reorg. It can summarize correctly and still be dangerous because it omits the caveat that should have changed the decision.
Gold sets need owners
The first mistake is letting engineers invent the test set alone. Engineers are good at edge cases in code. Business owners are good at edge cases in meaning. A useful gold set is built with the person who owns the decision. For a finance agent, that means actual variance questions from recent reviews, including the awkward ones. For procurement, it means the supplier and commodity questions that expose whether the system understands category logic.
Each case needs an expected answer, the reasoning path a good analyst would take, acceptable variance, required caveats, and known traps. We also record the source systems and the date of extraction. Without that provenance, teams argue about whether the agent failed or whether the ground truth moved. Evaluation data is production data with a chain of custody.
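One way to make those fields concrete is to store each case as a typed record. The sketch below is illustrative only: the class name `GoldCase`, every field name, the relative-tolerance rule, and the sample values are assumptions made for the example, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GoldCase:
    """One gold-set case. All field names are hypothetical."""
    case_id: str
    owner: str                        # business owner who signed off on the case
    question: str
    expected_answer: float
    reasoning_path: tuple[str, ...]   # steps a good analyst would take
    relative_tolerance: float         # acceptable variance, as a fraction
    required_caveats: tuple[str, ...]
    known_traps: tuple[str, ...]
    source_systems: tuple[str, ...]
    extracted_on: date                # when the ground truth was pulled

    def within_tolerance(self, answer: float) -> bool:
        """True if the answer falls inside the acceptable variance."""
        allowed = abs(self.expected_answer) * self.relative_tolerance
        return abs(answer - self.expected_answer) <= allowed

    def is_stale(self, source_changed_on: date) -> bool:
        """True if the source changed after the ground truth was extracted."""
        return source_changed_on > self.extracted_on


# Made-up sample values, purely to show the shape of a case.
case = GoldCase(
    case_id="fin-017",
    owner="FP&A lead",
    question="What was Q4 freight spend in the north region?",
    expected_answer=1_250_000.0,
    reasoning_path=("map legacy region names", "filter to fiscal Q4", "sum freight accounts"),
    relative_tolerance=0.01,
    required_caveats=("excludes one-time freight costs",),
    known_traps=("calendar quarter used instead of fiscal quarter",),
    source_systems=("general_ledger",),
    extracted_on=date(2026, 1, 15),
)
```

With a record like this, a disputed result can be settled by checking `is_stale` against the last change to the source system before anyone debates the agent's behavior.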
Grade the trace, not just the answer
The trace reveals failures the final answer conceals. Did the agent choose the right tables? Did it filter the right period? Did it join at the right grain? Did it call the retrieval tool before summarizing a policy? Did it notice that a region changed names? A final-number comparison will miss many of these errors until the one time they matter.
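Some of these questions can be turned into mechanical checks on a recorded trace. The following is a minimal sketch under stated assumptions: the `TraceStep` shape, the tool names (`sql`, `retrieve_policy`, `summarize`), and the table name are hypothetical, and the substring check on SQL text is deliberately crude. A real harness would more likely parse the query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceStep:
    tool: str    # hypothetical tool name, e.g. "sql" or "retrieve_policy"
    detail: str  # hypothetical payload, e.g. the SQL text that was issued


def called_before(trace: list[TraceStep], first: str, then: str) -> bool:
    """True if `first` appears before the earliest `then` step.

    Returns True when `then` never occurs, since there is nothing to order.
    """
    tools = [step.tool for step in trace]
    if then not in tools:
        return True
    return first in tools[: tools.index(then)]


def sql_mentions(trace: list[TraceStep], required: list[str]) -> dict[str, bool]:
    """Report, per required term, whether any SQL step mentions it."""
    sql = " ".join(step.detail.lower() for step in trace if step.tool == "sql")
    return {term: term.lower() in sql for term in required}


# A made-up trace in which the agent summarizes before retrieving the policy.
trace = [
    TraceStep("sql", "SELECT SUM(amount) FROM fact_gl WHERE fiscal_period = '2025-Q4'"),
    TraceStep("summarize", "draft answer"),
    TraceStep("retrieve_policy", "freight accrual policy"),
]

retrieved_first = called_before(trace, "retrieve_policy", "summarize")
mentions = sql_mentions(trace, ["fact_gl", "2025-Q4", "dim_region"])
```

In this example `retrieved_first` is false and `mentions` shows that the region dimension was never touched, even though the final number could still look plausible.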
We grade answers in layers: intent, source selection, tool use, query validity, business definition, numerical result, narrative quality, and refusal behavior. This looks heavier than a single score, but it makes improvement faster because each failure points at the layer that needs work. If source selection is weak, tuning the prompt for the final summary is wasted effort. If refusal behavior is weak, higher answer accuracy can make the system more dangerous.
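The layered grading can be sketched as a per-case dictionary of pass/fail results plus two small helpers. This is an illustration, not a definitive implementation: the identifiers are hypothetical, treating the listed order as upstream-to-downstream is an assumption, and a missing layer is counted as a failure by choice.

```python
from typing import Optional

# Layer names follow the list above; the ordering is assumed to run upstream first.
LAYERS = (
    "intent",
    "source_selection",
    "tool_use",
    "query_validity",
    "business_definition",
    "numerical_result",
    "narrative_quality",
    "refusal_behavior",
)


def first_failure(result: dict[str, bool]) -> Optional[str]:
    """Return the earliest failing layer, or None if every layer passed.

    A layer missing from `result` is treated as a failure.
    """
    for layer in LAYERS:
        if not result.get(layer, False):
            return layer
    return None


def layer_pass_rates(graded: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of cases that passed each layer."""
    if not graded:
        raise ValueError("no graded cases supplied")
    total = len(graded)
    return {
        layer: sum(bool(r.get(layer, False)) for r in graded) / total
        for layer in LAYERS
    }


# Two made-up graded cases: one clean pass, one that went wrong at source selection.
all_pass = {layer: True for layer in LAYERS}
graded = [
    all_pass,
    {**all_pass, "source_selection": False, "numerical_result": False},
]
rates = layer_pass_rates(graded)
```

Reporting the earliest failing layer per case, and the pass rate per layer across the set, shows where to spend effort: here the wrong number is a downstream symptom of source selection.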
Use adversarial normal questions
The best eval cases are not trick prompts. They are normal executive questions with hidden ambiguity. "Why is EBITDA down in the north region?" might require excluding one-time freight costs, mapping two legacy region names, and comparing against budget rather than last year. A generic model benchmark will not contain that failure mode. Your company will.
We include stale-data cases, permission-boundary cases, missing-source cases, and cases where the right response is a clarifying question. We also include repeated questions phrased differently, because production users do not preserve prompt templates. The agent has to be robust to human language, not just to the one sentence used in a demo.
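Robustness to rephrasing can be checked by grouping paraphrases of the same question and flagging any group whose outcomes disagree. The sketch below assumes each run has already been reduced to a normalized outcome string. The group ids, phrasings, and outcome labels are invented for the example.

```python
from collections import defaultdict


def inconsistent_groups(
    results: list[tuple[str, str, str]],
) -> dict[str, set[str]]:
    """Find paraphrase groups whose runs did not all end the same way.

    Each result is (group_id, phrasing, normalized_outcome).
    """
    outcomes: dict[str, set[str]] = defaultdict(set)
    for group_id, _phrasing, outcome in results:
        outcomes[group_id].add(outcome)
    return {group: seen for group, seen in outcomes.items() if len(seen) > 1}


# Made-up runs: the same question asked three ways, plus a permission-boundary case.
results = [
    ("ebitda-north", "Why is EBITDA down in the north region?", "clarify"),
    ("ebitda-north", "What's dragging north EBITDA?", "clarify"),
    ("ebitda-north", "North region EBITDA drop, explain", "answer"),
    ("payroll-detail", "Show me individual salaries", "refuse"),
    ("payroll-detail", "List pay by employee", "refuse"),
]

flagged = inconsistent_groups(results)
```

Here only `ebitda-north` is flagged, because one phrasing produced a direct answer where the other two asked a clarifying question.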
A board-ready eval is explainable
The final artifact should be legible to a non-technical governance audience. It should show the case set, pass criteria, failure examples, severity bands, unresolved risks, and the human fallback. It should include traces for representative passes and failures. It should say where the system is allowed to operate and where it is not.
This creates a healthier conversation. Instead of asking whether AI is accurate, leaders can ask whether the product is accurate enough for a defined decision under defined controls. That is the standard real software has always had to meet. Agents do not deserve a looser one because they speak in complete sentences.
A good eval report also creates a maintenance contract. It says which cases must be rerun after schema changes, prompt changes, model upgrades, permission changes, and new data sources. Without that contract, quality silently decays. The board should not be shown a one-time score; it should be shown the machinery that keeps the score meaningful. In production, evaluation is not a launch artifact. It is the product's immune system.
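The maintenance contract can be expressed as data: each case declares which kinds of change force a rerun. The following is a minimal sketch in which the case ids, the trigger assignments, and the function name are hypothetical. The five change types mirror the list above.

```python
from typing import Iterable, Mapping

KNOWN_CHANGES = frozenset({
    "schema_change",
    "prompt_change",
    "model_upgrade",
    "permission_change",
    "new_data_source",
})

# Hypothetical contract: case id -> change types that require a rerun.
RERUN_TRIGGERS: dict[str, frozenset[str]] = {
    "fin-017": frozenset({"schema_change", "model_upgrade", "new_data_source"}),
    "fin-022": frozenset({"prompt_change", "model_upgrade"}),
    "proc-004": frozenset({"permission_change", "schema_change"}),
}


def cases_to_rerun(
    changes: Iterable[str],
    triggers: Mapping[str, frozenset[str]] = RERUN_TRIGGERS,
) -> list[str]:
    """Return the sorted ids of cases triggered by any of the given changes."""
    changed = set(changes)
    unknown = changed - KNOWN_CHANGES
    if unknown:
        # Fail loudly so a new kind of change cannot skip evaluation silently.
        raise ValueError(f"unknown change types: {sorted(unknown)}")
    return sorted(case_id for case_id, t in triggers.items() if t & changed)


after_schema_change = cases_to_rerun(["schema_change"])
after_model_upgrade = cases_to_rerun(["model_upgrade", "prompt_change"])
```

Rejecting unknown change types is a deliberate choice in this sketch: a release that introduces a new category of change should extend the contract before it ships.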
This is why evals belong in the operating rhythm, not in a research appendix. Every serious release should carry its own evidence packet. If the packet is thin, the release is not serious yet.
Boards understand that language because it looks like control, not optimism or theater.