Our number, not a vendor number

Per-field extraction accuracy, published.

Marketing pages love to quote “95-99% accuracy.” That is a vendor benchmark, and your results may differ. These are our own per-field results, scored on a public, reproducible harness, including the tier where the model slips. The exact harness and the test set are in the public repo.

Headline

Metric	Value
Field-level accuracy	99.31%
Full PO correctness	99.2%
Line-item F1	1.00
Avg. response	9.6s

120 POs · 8,856 field decisions · OrderPier extraction pipeline · generated 20260609T002902Z

Accuracy by document tier

A single accuracy number hides where a model breaks. We segment the set by how the PO arrives: clean digital files, normal scans/faxes, and a deliberate stress tier of low-DPI faxes of tiny-font POs. That stress tier is where vision-OCR finally slips, and we publish it too.

Tier	POs	Field-level accuracy
Digital POs (native PDF text)	80	99.99%
Scanned / faxed POs (normal font)	20	100.00%
Rough fax of a small-font PO (stress)	20	94.11%
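
To make the segmentation concrete, here is a minimal sketch of how per-decision outcomes could roll up into per-tier and overall accuracy. It assumes each field decision is recorded as a (tier, is_correct) pair. The function name and the toy counts are illustrative only: they are not the published results and not necessarily how src/moa/eval.py does it.

```python
from collections import defaultdict


def accuracy_by_tier(decisions):
    """decisions: iterable of (tier, is_correct) pairs, one per field decision."""
    totals = defaultdict(lambda: [0, 0])  # tier -> [correct, total]
    for tier, ok in decisions:
        totals[tier][0] += int(ok)
        totals[tier][1] += 1
    per_tier = {tier: c / n for tier, (c, n) in totals.items()}
    # Overall accuracy is micro-averaged: every field decision counts once.
    correct = sum(c for c, _ in totals.values())
    total = sum(n for _, n in totals.values())
    return per_tier, correct / total


# Toy counts for illustration only; these are NOT the published numbers.
toy = (
    [("digital", True)] * 50
    + [("scan", True)] * 30
    + [("rough", True)] * 16
    + [("rough", False)] * 4
)
per_tier, overall = accuracy_by_tier(toy)
print(per_tier)
print(overall)
```

In the toy data the overall figure stays high while the rough tier is visibly lower, which is the effect the tier table is meant to expose.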

Per-field accuracy

po_number	99.2%
customer_name	100.0%
order_date	100.0%
due_date	100.0%
ship_to	100.0%

Line-item sub-field accuracy

sku	97.8%
description	99.9%
quantity	99.5%
unit	99.9%
unit_price	98.8%
line_total	97.7%

Methodology

The harness lives in src/moa/eval.py. Scoring rules:

IDs (PO #, dates): exact match after normalization (case and whitespace; dates parsed to ISO).
Customer name, ship-to: rapidfuzz ratio ≥ 0.85 for each.
Line items: SKU-exact pass first, then Hungarian assignment over description fuzzy similarity, with a floor of 0.70.
Per-sub-field: SKU exact; quantity/price numeric, rounded to 2 decimals; description fuzzy ≥ 0.80; unit exact.
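
The rules above can be sketched in a few functions. This is a standard-library approximation, not the code in src/moa/eval.py. The swapped-in parts are: difflib.SequenceMatcher stands in for rapidfuzz, a brute-force search over permutations stands in for Hungarian assignment (fine for a handful of leftover lines, too slow for long POs), and the date formats are assumed. All function names and the dictionary keys on each line item are hypothetical.

```python
from datetime import datetime
from difflib import SequenceMatcher
from itertools import permutations


def norm_text(s):
    """Normalize case and whitespace before comparing."""
    return " ".join(str(s).split()).casefold()


def to_iso(s, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Parse a date string to ISO 8601; the accepted formats are an assumption."""
    for fmt in formats:
        try:
            return datetime.strptime(s.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def fuzz(a, b):
    """Similarity in [0, 1]; difflib is a stand-in for rapidfuzz here."""
    return SequenceMatcher(None, norm_text(a), norm_text(b)).ratio()


def match_line_items(pred, gold, floor=0.70):
    """Pair predicted and gold lines: SKU-exact pass, then best description assignment."""
    pairs = []
    free_p, free_g = list(range(len(pred))), list(range(len(gold)))
    # Pass 1: lock in lines whose SKUs match exactly after normalization.
    for i in list(free_p):
        for j in free_g:
            if norm_text(pred[i]["sku"]) == norm_text(gold[j]["sku"]):
                pairs.append((i, j))
                free_p.remove(i)
                free_g.remove(j)
                break
    # Pass 2: optimal assignment of the leftovers by description similarity.
    if free_p and free_g:
        swap = len(free_p) > len(free_g)
        small, large = (free_g, free_p) if swap else (free_p, free_g)

        def sim(s, l):
            i, j = (l, s) if swap else (s, l)
            return fuzz(pred[i]["description"], gold[j]["description"])

        best = max(
            permutations(large, len(small)),
            key=lambda perm: sum(sim(s, l) for s, l in zip(small, perm)),
        )
        for s, l in zip(small, best):
            if sim(s, l) >= floor:  # pairs below the floor stay unmatched
                pairs.append((l, s) if swap else (s, l))
    return sorted(pairs)


def score_pair(p, g):
    """Per-sub-field pass/fail for one matched (predicted, gold) line."""
    return {
        "sku": norm_text(p["sku"]) == norm_text(g["sku"]),
        "quantity": round(float(p["quantity"]), 2) == round(float(g["quantity"]), 2),
        "unit_price": round(float(p["unit_price"]), 2) == round(float(g["unit_price"]), 2),
        "description": fuzz(p["description"], g["description"]) >= 0.80,
        "unit": norm_text(p["unit"]) == norm_text(g["unit"]),
    }


gold = [
    {"sku": "A-100", "description": "Hex bolt M8 x 40 zinc", "quantity": 10, "unit": "EA", "unit_price": 0.25},
    {"sku": "B-200", "description": "Flat washer M8 stainless", "quantity": 20, "unit": "EA", "unit_price": 0.05},
]
pred = [
    {"sku": "B-2O0", "description": "Flat washer M8 stainles", "quantity": "20", "unit": "ea", "unit_price": "0.050"},
    {"sku": "A-100", "description": "Hex bolt M8x40 zinc", "quantity": 10, "unit": "EA", "unit_price": 0.25},
]
matches = match_line_items(pred, gold)
print(matches)  # [(0, 1), (1, 0)]
print(score_pair(pred[0], gold[1]))
```

In this example the misread SKU (letter O for zero) fails the exact pass, but the line is still paired with the right gold line through its description. The SKU sub-field is then scored as wrong while quantity, unit, price, and description pass. That is how a line-item F1 of 1.00 can coexist with a SKU accuracy below 100%.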

Each PO renders in one of 8+ layout variants, including clean table, nested header, dense, multi-currency, sparse free-text body, handwritten annotations, and multi-page long. The scan and rough tiers go further: we rasterize the PO to an image and degrade it with skew, blur, sensor noise, low DPI, and JPEG compression, so the model must read pixels. This is the same vision-OCR task it faces on real inbound mail. The rough tier is a deliberately bad fax of a small-font PO. Degradation is deterministic per seed, so the whole set is reproducible: clone the repo, run python -m moa.cli gen-samples --count 120, then eval.
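
As an illustration of "deterministic per seed", the sketch below draws degradation settings from a random generator seeded per document. It only picks the parameters and does not touch any image. The function name, parameter names, and value ranges are all hypothetical, not the ones the repo uses.

```python
import random


def degradation_params(seed, tier):
    """Draw degradation settings for one PO; the same (seed, tier) always gives the same result."""
    # A private generator keeps the draw independent of global random state.
    rng = random.Random(f"{tier}:{seed}")
    rough = tier == "rough"
    # All ranges below are made-up placeholders for illustration.
    return {
        "skew_deg": round(rng.uniform(-3.0, 3.0), 2),
        "blur_radius": round(rng.uniform(0.5, 2.0 if rough else 1.0), 2),
        "noise_sigma": round(rng.uniform(2.0, 12.0 if rough else 6.0), 2),
        "dpi": rng.choice([72, 96] if rough else [150, 200]),
        "jpeg_quality": rng.randint(20, 40) if rough else rng.randint(60, 85),
    }


first = degradation_params(7, "rough")
second = degradation_params(7, "rough")
print(first == second)  # True
```

Because every draw comes from the seeded generator, regenerating the set with the same seeds yields the same degraded images, provided the rendering step itself is deterministic.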


Run the harness on your own POs.

Email us a few anonymized samples and we will run them through the same harness, then send you the per-field numbers on your own documents.