Scored against an answer key
How accurate is the extraction?
A tool that reads insurance documents is only worth trusting if its accuracy is a number you can watch. Every sample below has a known-correct answer key. We run each sample through the model, compare every extracted field against the key, and report the score. When we improve the prompt, this number moves.
Across 7 synthetic ACORD-25 certificates, 233 individual fields were checked: limits, dates, insurers, and coverage flags. The model is told to return “not stated” rather than guess, so the errors that remain are honest disagreements instead of invented data.
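The field-by-field comparison described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual scorer: the function name, the field names, and the flat dictionary layout are all assumptions made for the example.

```python
NOT_STATED = "not stated"  # assumed sentinel the model returns instead of guessing


def score_certificate(extracted: dict, answer_key: dict) -> tuple[int, int, list[str]]:
    """Compare every field in the answer key against the model's output.

    Returns (correct_count, total_fields, names_of_missed_fields).
    """
    misses = []
    for field, expected in answer_key.items():
        # A field the model omitted entirely is treated as "not stated".
        actual = extracted.get(field, NOT_STATED)
        if actual != expected:
            misses.append(field)
    total = len(answer_key)
    return total - len(misses), total, misses


# Hypothetical field names and values, for illustration only.
answer_key = {
    "gl_each_occurrence": 1_000_000,
    "wc_waiver_of_subrogation": True,
    "wc_aggregate": None,  # recorded as null in the key
}
extracted = {
    "gl_each_occurrence": 1_000_000,
    "wc_waiver_of_subrogation": False,
    "wc_aggregate": None,
}

correct, total, misses = score_certificate(extracted, answer_key)
print(f"{correct}/{total} correct, missed: {misses}")
```

Exact equality is the strictest possible comparison. A real scorer would likely need to normalize dates and currency formatting before comparing, which this sketch leaves out.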
By certificate
By field
Sorted weakest first. This is the punch list for what to improve next.
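One way to produce a weakest-first ranking like this is to tally correct and total counts per field across all certificates, then sort by accuracy. The sketch below assumes the results arrive as `(field_name, is_correct)` pairs; that input shape and the function name are hypothetical.

```python
from collections import defaultdict


def accuracy_by_field(results):
    """Aggregate (field_name, is_correct) pairs into per-field accuracy rows.

    Returns a list of (field, accuracy, correct, total), weakest field first.
    """
    tally = defaultdict(lambda: [0, 0])  # field -> [correct, total]
    for field, ok in results:
        tally[field][0] += int(ok)
        tally[field][1] += 1
    rows = [(field, c / t, c, t) for field, (c, t) in tally.items()]
    # Sort by accuracy ascending; break ties by field name for stable output.
    return sorted(rows, key=lambda row: (row[1], row[0]))


# Hypothetical results pooled from several certificates.
results = [
    ("policy_expiration", True),
    ("policy_expiration", True),
    ("waiver_of_subrogation", True),
    ("waiver_of_subrogation", False),
]
rows = accuracy_by_field(results)
for field, acc, c, t in rows:
    print(f"{field}: {acc:.0%} ({c}/{t})")
```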
What’s still wrong
Every remaining miss, in the open. The point of an eval is to show exactly where the tool falls down, so the next fix is obvious.
Reading the misses
Both remaining misses are the model being careful, not reckless. On one Workers’ Comp line it read the waiver-of-subrogation box as unchecked when the answer key marks it granted. That is a conservative false negative: it flags a gap rather than inventing protection. On the professional liability line it returned “not stated” for the per-claim limit instead of assuming the “Each Claim” figure was the same thing as per-occurrence. Declining to guess an unfamiliar label is the behavior the whole tool is built on.
I could erase both by writing prompt rules aimed straight at these two fixtures. I have not, on purpose. Tuning the prompt to my own answer key until it reads 100% is the exact “confirmation over truth” failure this tool exists to catch. A real number with two honest misses is worth more than a perfect one I engineered.
Methodology and limits
A number is only as honest as the conditions that produced it. Here is what this score does and does not prove. None of this is buried, because a tool that hides its own limits has no business judging anyone else’s coverage.
Synthetic documents only
Every certificate here is a synthetic ACORD-25 I generated. No real customer documents and no real PII. They are clean digital text, not scans or faxes.
Small sample
Seven certificates, a few hundred fields. Read this as directional, not statistical. At this size there are no meaningful error bars.
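To illustrate why the number is directional, here is a sketch of a Wilson score interval for a proportion. It makes the generous assumption that every field is an independent trial, which is not true here: fields on the same certificate are produced together, so the effective sample is closer to seven documents and the real uncertainty is wider than this calculation suggests. The count of 231 correct is an assumption derived from 233 fields and two misses; the page's own score may be tallied differently.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Assumed counts: 233 fields checked, 2 misses.
low, high = wilson_interval(231, 233)
print(f"field-level interval (independence assumed): {low:.3f} to {high:.3f}")
```

Even this optimistic interval spans a few percentage points, and it says nothing about documents unlike the seven tested.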
I wrote the answer keys
The ground truth is my own reading of each form. Where I am wrong, the score rewards the model for agreeing with me. Independent answer keys would be stronger.
One call is a judgment, not a fact
Workers’ Comp aggregate is recorded as null because that line carries no general aggregate. That is a defensible industry convention, not objective truth. A reasonable broker could read the disease policy limit as an aggregate instead.
Real-world accuracy is unknown
This has not been tested on messy inputs: scans, faxes, unusual carrier layouts, handwriting. Accuracy on those is unmeasured and probably lower than the number above.
Errors are not yet split by direction
For a go/no-go gate, a false “covered” is far worse than a false “not covered.” This score treats every miss the same. Separating the two is the next thing to build.
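A possible shape for that split, shown only for boolean coverage flags, is sketched below. Nothing here exists in the tool yet; the function name, the category labels, and the "not stated" sentinel are assumptions for illustration.

```python
from collections import Counter

NOT_STATED = "not stated"  # assumed sentinel for fields the model declined to guess


def classify_miss(expected, actual) -> str:
    """Classify one mismatch on a boolean coverage flag by its direction."""
    if expected == actual:
        raise ValueError("not a miss: expected and actual agree")
    if actual is True:
        # Model claimed protection the answer key does not grant: the dangerous kind.
        return "false_covered"
    if expected is True:
        # Model missed protection the key grants: flags a gap that is not there.
        return "false_not_covered"
    # Neither side asserts coverage, e.g. key says False and model says "not stated".
    return "other"


# Hypothetical (expected, actual) pairs for missed flags.
misses = [
    (True, False),       # e.g. a waiver box read as unchecked
    (True, NOT_STATED),
    (False, True),
]
counts = Counter(classify_miss(e, a) for e, a in misses)
print(dict(counts))
```

Numeric limits would need their own rule, for example treating an extracted limit higher than the key as the dangerous direction. That choice is left open here.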
What would make it trustworthy
Answer keys written by someone other than me. Real certificates, including scans and the ugly carrier layouts. A bigger sample, so the number has error bars. And an error-direction breakdown that separates a false “covered” from a false “not covered,” because in an insurance gate those two mistakes are not equal. That is the roadmap, and it is honest work still left to do.