Scored against an answer key

How accurate is the extraction?

A tool that reads insurance documents is only worth trusting if its accuracy is a number you can watch. Every sample below has its own answer key. I run the model on each sample, compare every extracted field against the key, and report the score. When the prompt changes, this number moves.

99.1%

231 of 233 fields correct

Across 7 synthetic ACORD-25 certificates, 233 individual fields were checked: limits, dates, insurers, and coverage flags. The model is told to return “not stated” rather than guess, so the errors that remain are honest disagreements instead of invented data.
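The comparison described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the names `FieldResult`, `score_sample`, and `accuracy` are hypothetical, and so is the assumption that answer keys and model output are flat dictionaries keyed by field name.

```python
from dataclasses import dataclass

NOT_STATED = "not stated"  # assumed sentinel the model returns instead of guessing


@dataclass(frozen=True)
class FieldResult:
    name: str
    expected: object
    actual: object

    @property
    def correct(self) -> bool:
        return self.expected == self.actual


def score_sample(answer_key: dict, extracted: dict) -> list[FieldResult]:
    """Compare every field in the answer key with the model's output.

    A field the model omitted is treated as "not stated", so it only
    counts as correct when the answer key also says "not stated".
    """
    return [
        FieldResult(name, expected, extracted.get(name, NOT_STATED))
        for name, expected in answer_key.items()
    ]


def accuracy(results: list[FieldResult]) -> float:
    """Share of fields that match the answer key, as a percentage."""
    if not results:
        raise ValueError("no fields to score")
    return 100.0 * sum(r.correct for r in results) / len(results)


# Hypothetical three-field sample, not one of the seven fixtures.
key = {"namedInsured": "Acme LLC", "perOccurrence": 2_000_000, "waiverOfSubrogation": "yes"}
out = {"namedInsured": "Acme LLC", "perOccurrence": NOT_STATED, "waiverOfSubrogation": "yes"}
results = score_sample(key, out)
print(f"{accuracy(results):.1f}%")  # 66.7%
```

Exact equality is the strictest possible match. A real scorer would probably normalize dates and dollar amounts before comparing, which this sketch leaves out.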

By certificate

A clean certificate that meets every requirement.

100.0% (32/32 fields)

Coverage already lapsed. The dates have passed.

100.0% (32/32 fields)

Real coverage, but the dollar limits are too low.

100.0% (32/32 fields)

Good limits, but it doesn't list you as additional insured.

96.9% (31/32 fields)

No Workers' Compensation coverage on the certificate.

100.0% (23/23 fields)

Full tower: GL, auto, umbrella, workers' comp, and professional liability.

98.0% (49/50 fields)

Workers' Comp with mismatched employers' liability limits, a common extraction trap.

100.0% (32/32 fields)
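As a quick arithmetic check, the per-certificate counts roll up to the headline figure. The counts below are copied from the list above; the short sample labels are shorthand for this sketch only.

```python
# (correct, total) per certificate, copied from the list above.
by_certificate = {
    "clean": (32, 32),
    "lapsed": (32, 32),
    "low limits": (32, 32),
    "missing additional insured": (31, 32),
    "no workers' comp": (23, 23),
    "full tower": (49, 50),
    "mismatched employers' liability": (32, 32),
}

correct = sum(c for c, _ in by_certificate.values())
total = sum(t for _, t in by_certificate.values())

for name, (c, t) in by_certificate.items():
    print(f"{name}: {100 * c / t:.1f}% ({c}/{t})")

# Pooled over fields, not averaged over certificates.
print(f"overall: {100 * correct / total:.1f}% ({correct}/{total})")  # overall: 99.1% (231/233)
```

The headline pools all fields, so a certificate with more coverage lines, such as the full tower, weighs more than one with fewer.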

By field

Sorted weakest first. This is the punch list for what to improve next.

Per-occurrence limit

95.5% (21/22)

Waiver of subrogation

95.5% (21/22)

Document type

100.0% (7/7)

Named insured

100.0% (7/7)

Certificate holder

100.0% (7/7)

Producer

100.0% (7/7)

Issue date

100.0% (7/7)

Coverage present

100.0% (22/22)

Insurer

100.0% (22/22)

Policy number

100.0% (22/22)

Effective date

100.0% (22/22)

Expiration date

100.0% (22/22)

Aggregate limit

100.0% (22/22)

Additional insured

100.0% (22/22)
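A weakest-first list like this one could be produced by tallying results per field and sorting by accuracy. The sketch below assumes results arrive as `(field name, is correct)` pairs. The function name `by_field` and the sample data are hypothetical.

```python
from collections import defaultdict


def by_field(results):
    """Tally (correct, total) per field name, then sort weakest first."""
    tally = defaultdict(lambda: [0, 0])
    for field_name, is_correct in results:
        tally[field_name][0] += int(is_correct)
        tally[field_name][1] += 1
    rows = [(name, c, t, 100 * c / t) for name, (c, t) in tally.items()]
    # sorted() is stable, so fields with equal accuracy keep first-seen order.
    return sorted(rows, key=lambda row: row[3])


# Hypothetical results, not the real evaluation data.
sample = [
    ("perOccurrence", True),
    ("perOccurrence", False),
    ("insurer", True),
    ("insurer", True),
    ("waiverOfSubrogation", False),
]
for name, c, t, pct in by_field(sample):
    print(f"{name}: {pct:.1f}% ({c}/{t})")
```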

What’s still wrong

Every remaining miss, in the open. The point of an eval is to show exactly where the tool falls down, so the next fix is obvious.

Field (sample) | Answer key | Model said

Workers' Compensation · waiverOfSubrogation · missing-additional-insured | yes | no

Professional Liability (E&O) · perOccurrence · full-tower | $2,000,000 | not stated

Reading the misses

Both remaining misses are the model being careful, not reckless. On one Workers’ Comp line it read the waiver-of-subrogation box as unchecked when the answer key marks it granted. That is a conservative false negative: it flags a gap rather than inventing protection. On the professional liability line it returned “not stated” for the per-occurrence limit instead of assuming that the form's “Each Claim” figure was the same thing. Declining to guess an unfamiliar label is the behavior the whole tool is built on.

I could erase both by writing prompt rules aimed straight at these two fixtures. I have not, on purpose. Tuning the prompt to my own answer key until it reads 100% is the exact “confirmation over truth” failure this tool exists to catch. A real number with two honest misses is worth more than a perfect one I engineered.

Methodology and limits

A number is only as honest as the conditions that produced it. Here is what this score does and does not prove. None of this is buried, because a tool that hides its own limits has no business judging anyone else’s coverage.

Synthetic documents only

Every certificate here is a synthetic ACORD-25 I generated. No real customer documents and no real PII. They are clean digital text, not scans or faxes.

Small sample

Seven certificates, 233 fields. Read this as directional, not statistical. At this size there are no meaningful error bars.
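To get a feel for the uncertainty, one option is a Wilson score interval on the pooled count. This is a rough sketch under a strong assumption: it treats all 233 fields as independent draws, which they are not, since they come from only seven documents. The true uncertainty is therefore wider than whatever this returns. The function name is hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(231, 233)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

Because fields on the same certificate share a layout and a set of insurers, the effective sample size is closer to seven than to 233, so this interval is best read as a lower bound on the uncertainty.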

I wrote the answer keys

The ground truth is my own reading of each form. Where I am wrong, the score rewards the model for agreeing with me. Independent answer keys would be stronger.

One call is a judgment, not a fact

Workers' Comp aggregate is recorded as null because that line carries no general aggregate. That is a defensible industry convention, not objective truth. A reasonable broker could instead treat the employers' liability disease policy limit as the aggregate.

Real-world accuracy is unknown

This has not been tested on messy inputs: scans, faxes, unusual carrier layouts, handwriting. Accuracy on those is unmeasured and probably lower than the number above.

Errors are not yet split by direction

For a go/no-go gate, a false “covered” is far worse than a false “not covered”. This score treats every miss the same. Separating the two is the next thing to build.
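One possible shape for that split is sketched below. It is not something the tool does today. It assumes flags are recorded as "yes" or "no" and limits as numbers, and it ranks each value by how much protection it claims. The names `protection` and `miss_direction` are hypothetical, and fields with no obvious direction (names, policy numbers, and in this sketch dates) fall into "other".

```python
from collections import Counter

NOT_STATED = "not stated"


def protection(value):
    """Rank how much protection a value claims; None if the field has no direction."""
    if value in (None, "no", NOT_STATED):
        return 0
    if value == "yes":
        return 1
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value  # dollar limits: a higher limit claims more protection
    return None  # e.g. insurer names or policy numbers


def miss_direction(expected, actual) -> str:
    """Classify one miss by whether the model overstated or understated coverage."""
    if expected == actual:
        raise ValueError("not a miss")
    e, a = protection(expected), protection(actual)
    if e is None or a is None or e == a:
        return "other"
    return "false covered" if a > e else "false not covered"


# The two misses from the table above, as (answer key, model said).
misses = [("yes", "no"), (2_000_000, NOT_STATED)]
print(Counter(miss_direction(e, a) for e, a in misses))
# Counter({'false not covered': 2})
```

Under these assumptions both current misses land on the conservative side, which matches the reading given earlier. A gate could then weight or block on "false covered" separately.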

What would make it trustworthy

Answer keys written by someone other than me. Real certificates, including scans and the ugly carrier layouts. A bigger sample, so the number has error bars. And an error-direction breakdown that separates a false “covered” from a false “not covered,” because in an insurance gate those two mistakes are not equal. That is the roadmap, and it is honest work still left to do.