Last fall I decided to test something I’d been hearing about at every legal ops conference for two years: AI-powered metadata extraction. The pitch is simple. You upload your contracts, the AI reads them, and it spits out structured data: party names, effective dates, expiration dates, renewal terms, governing law, payment terms, the whole catalog. Instead of a human opening each PDF and typing values into a spreadsheet, the machine does it in seconds.

I’d been maintaining my contract metadata mostly by hand. Not proudly. I had a spreadsheet with about 200 active contracts and maybe 15 fields per contract, and I’d been filling them in as I onboarded each agreement. It was working, but it was slow, and I knew I had gaps. Contracts I’d uploaded to ContractSafe but never fully tagged. Fields I’d left blank because I was in a hurry. The usual stuff.

So I ran the experiment. I pointed the AI at 200 contracts and compared what it extracted to what I knew was actually in those agreements.
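
If you want to run a similar comparison yourself, here's a rough sketch in Python. It assumes both the AI export and your manual tracker are CSVs keyed by a contract ID; the file names and field list are placeholders, not the output of any particular tool.

```python
# Sketch: compare an AI extraction export against a manual metadata tracker.
# Assumes both files are CSVs keyed by a "contract_id" column; file names
# and field names are placeholders.
import csv

FIELDS = ["party_name", "effective_date", "expiration_date", "governing_law"]

def load(path):
    with open(path, newline="") as f:
        return {row["contract_id"]: row for row in csv.DictReader(f)}

manual = load("manual_tracker.csv")
ai = load("ai_extraction.csv")

mismatches = []
for cid, truth in manual.items():
    extracted = ai.get(cid, {})
    for field in FIELDS:
        want = (truth.get(field) or "").strip().lower()
        got = (extracted.get(field) or "").strip().lower()
        if want and want != got:
            mismatches.append((cid, field, want, got))

print(f"{len(mismatches)} field-level mismatches across {len(manual)} contracts")
for cid, field, want, got in mismatches[:20]:
    print(f"{cid}: {field} manual={want!r} ai={got!r}")
```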

Here’s what happened.

The easy stuff was almost perfect

Party names, effective dates, basic dollar amounts: the AI nailed these. Not 100%, but close enough that I’d call it reliable. I spot-checked about 40 contracts against my manual entries and the AI matched on party names and effective dates in all but two cases (both were scanned documents with poor image quality, which I’ll get to).

This tracks with what researchers are finding. A paper from Box’s AI team tested LLM-based extraction on contract datasets and found that models achieve above 90% accuracy on straightforward fields like party names and signatures. For basic, clearly stated metadata, the technology works.

If all you need is to get party names and execution dates out of a pile of PDFs, AI extraction will save you a real amount of time. I’d been spending maybe two minutes per contract on just those fields. Across 200 contracts, that’s almost seven hours of work the AI did in about ten minutes.

The hard stuff was where it fell apart

The problem is that the fields I actually needed help with (the ones I’d been leaving blank because they required careful reading) were exactly the fields the AI struggled with.

Renewal terms were messy. About a third of my contracts have auto-renewal clauses, and the language varies wildly. Some say “this agreement shall automatically renew for successive one-year periods.” Some say “unless either party provides written notice of non-renewal at least 60 days prior to expiration.” Some bury the renewal language in a general “term and termination” section where it shares a paragraph with three other concepts. The AI got the basic renewal/no-renewal distinction right most of the time. But it missed notice periods in about a quarter of the contracts I checked, and it occasionally confused a termination-for-convenience clause with a renewal opt-out.
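
One cheap cross-check before trusting an extracted notice period: flag any contract where a day count shows up near the word "notice" and route those to a human. The sketch below is illustrative only; it's a review trigger I'd add on top, not how any extraction tool actually works.

```python
# Sketch: flag contracts whose text mentions a day count near the word
# "notice" so a human double-checks the AI's renewal fields.
# Deliberately loose; a review trigger, not an extractor.
import re

DAYS_RE = re.compile(r"\b(\d{1,3})\s*(?:calendar\s+|business\s+)?days\b", re.IGNORECASE)

def notice_periods(text: str, window: int = 120) -> list[int]:
    """Return day counts that appear within `window` characters of 'notice'."""
    hits = []
    for m in DAYS_RE.finditer(text):
        context = text[max(0, m.start() - window): m.end() + window].lower()
        if "notice" in context:
            hits.append(int(m.group(1)))
    return hits

sample = ("unless either party provides written notice of non-renewal "
          "at least 60 days prior to expiration")
print(notice_periods(sample))  # [60]
```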

Governing law was surprisingly unreliable. You’d think “this Agreement shall be governed by the laws of the State of Delaware” would be easy to extract. It usually was. But in contracts where the governing law was stated in a choice-of-law clause buried in a dense miscellaneous section, or where it was referenced indirectly (“the laws of the jurisdiction set forth in Exhibit A”), the AI either returned nothing or returned the wrong value.

Payment terms were the worst. “Net 30” is easy. But “payment due within 30 days of receipt of invoice, subject to the discount schedule in Appendix B, with late payments accruing interest at 1.5% per month” is not easy. The AI would extract “Net 30” and ignore the rest. Which is exactly the kind of incomplete extraction that creates problems downstream when someone relies on the metadata without reading the actual contract.
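
To make that concrete, here's the shape a payment-terms field would need to have before anyone could safely rely on it without opening the contract. The field names are my own invention for illustration, not any product's schema.

```python
# Sketch: what a payment-terms record needs to capture beyond "Net 30".
# Field names are illustrative, not tied to any tool.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaymentTerms:
    net_days: Optional[int] = None            # e.g. 30 for "Net 30"
    trigger: str = "invoice"                   # what starts the clock
    discount_reference: Optional[str] = None   # e.g. "Appendix B discount schedule"
    late_interest_monthly_pct: Optional[float] = None  # e.g. 1.5
    notes: str = ""                            # anything that doesn't fit above

# The clause quoted above would need every one of these populated:
example = PaymentTerms(
    net_days=30,
    trigger="receipt of invoice",
    discount_reference="Appendix B",
    late_interest_monthly_pct=1.5,
)
print(example)
```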

The accuracy question nobody wants to answer honestly

Here’s what frustrates me about how the industry talks about AI extraction accuracy. Vendors love to say “95% accuracy” or “99% accuracy” without specifying what they’re measuring. Ninety-five percent of what? Fields? Contracts? Characters?
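
To see why the denominator matters, here's a toy calculation with made-up numbers. The exact same set of extraction errors can look like 95% accuracy or 25% accuracy depending on whether you count fields or contracts.

```python
# Toy calculation: same errors, different denominators. Numbers are invented.
contracts = 200
fields_per_contract = 15
total_fields = contracts * fields_per_contract   # 3000
wrong_fields = 150                                # pretend the AI missed these

field_accuracy = 1 - wrong_fields / total_fields  # 0.95 -> "95% accurate"

# If those 150 errors land one per contract, 150 of the 200 contracts
# contain at least one wrong field:
contracts_with_an_error = 150
contract_accuracy = 1 - contracts_with_an_error / contracts  # 0.25 -> "25% accurate"

print(f"field-level:    {field_accuracy:.0%}")
print(f"contract-level: {contract_accuracy:.0%}")
```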

Concord’s field guide on AI in CLM lays this out more honestly than most vendors do. They recommend targeting 85% or better accuracy for OCR and metadata extraction on mixed document types, and they specifically note that accuracy varies by document type and quality. They also point out that most organizations see 20 to 40 percent efficiency improvements from AI-driven contract automation, not the 90% reductions you hear in sales demos.

That matches my experience exactly. The AI didn’t eliminate my metadata work. It cut it roughly in half. I still had to review every extraction, correct errors on complex fields, and manually fill in the stuff it couldn’t parse. But cutting the work in half is still significant when you’re looking at 200 contracts.

Scanned documents are a different animal

About 30 of my 200 contracts were scanned PDFs. Old agreements, some from before we had a digital-first process. A few were scans of faxes, which tells you how old they were.

The AI handled clean, high-resolution scans reasonably well. But the older, lower-quality scans were a problem. Party names got garbled. Dates were misread (a “2019” became a “2014” in one case because the scan was slightly blurry). Dollar amounts were occasionally wrong by orders of magnitude because the OCR misread a digit.

This isn’t an AI problem specifically. It’s an OCR problem that compounds when you layer extraction on top of it. If the underlying text conversion is wrong, the extraction will be wrong too, and it won’t know it’s wrong. The same Box research paper found that even advanced systems struggle with complex layouts, poor scan quality, and handwritten annotations. The technology is only as good as the input.

My takeaway: if you’re going to run AI extraction, clean up your source documents first. Re-scan anything that’s blurry. Convert paper documents to high-quality digital copies before you feed them to the machine. The time you spend on document prep will come back as better extraction quality.
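
A quick way to triage a backlog is to check which PDFs have little or no embedded text layer, since those are almost certainly image-only scans that need OCR or a re-scan before extraction. A minimal sketch, assuming the pypdf package and a placeholder folder path:

```python
# Sketch: find PDFs with little or no embedded text, i.e. likely image-only
# scans that need OCR or a re-scan before extraction. Assumes pypdf is
# installed; the folder path and threshold are placeholders.
from pathlib import Path
from pypdf import PdfReader

def embedded_text_chars(pdf_path: Path) -> int:
    reader = PdfReader(pdf_path)
    return sum(len(page.extract_text() or "") for page in reader.pages)

for pdf in sorted(Path("contracts/").glob("*.pdf")):
    chars = embedded_text_chars(pdf)
    if chars < 500:  # arbitrary cutoff: a real contract has far more text
        print(f"likely image-only scan, needs OCR/re-scan: {pdf.name} ({chars} chars)")
```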

What I actually use it for now

After the experiment, I settled into a workflow that I think is realistic about what AI extraction can and can’t do.

For new contracts, I use ContractSafe’s AI extraction to pull the easy fields automatically: parties, effective date, expiration date, contract type. That saves me the most tedious part of onboarding a new agreement. Then I manually review and fill in the fields that require judgment: renewal terms with notice periods, payment structures, obligation details, anything that lives in an exhibit or appendix.
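
If you want to encode that split so a script can triage each extraction, a minimal sketch might look like this. The field names are illustrative, not tied to ContractSafe or any other product.

```python
# Sketch: split extracted fields into auto-accept vs. needs-a-human.
# Field names are illustrative placeholders.
AUTO_ACCEPT = ["parties", "effective_date", "expiration_date", "contract_type"]
ALWAYS_REVIEW = ["renewal_notice_days", "payment_terms", "obligations", "exhibits"]

def review_queue(extracted: dict) -> dict:
    """Split one contract's extracted fields into trusted vs. human-review."""
    return {
        "accept": {k: v for k, v in extracted.items() if k in AUTO_ACCEPT and v},
        "review": {k: extracted.get(k) for k in ALWAYS_REVIEW},
    }

print(review_queue({"parties": "Acme Corp / Widget LLC", "payment_terms": "Net 30"}))
```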

For the existing backlog, I ran extraction in batches and used it as a starting point, not a finished product. The AI gave me a first pass that was maybe 70% complete and 85% accurate on the fields it did fill in. I spent a few hours over a couple of weeks cleaning up the rest. That’s still faster than doing it all from scratch.

I don’t trust it for anything where being wrong has consequences. If I need to know the exact notice period for a renewal that’s coming up in 60 days, I’m reading the contract. I’m not relying on what the AI extracted six months ago.

The honest version of what AI extraction is

Here’s how I’d describe it to someone considering it: AI extraction is a very fast, somewhat unreliable research assistant. It’s great at the boring, repetitive stuff that takes you two minutes per contract but adds up to hours across a portfolio. It’s not great at the nuanced stuff where the answer depends on reading three paragraphs in context and understanding how they interact.

MIT’s State of AI in Business report found that 95% of enterprise AI pilots fail to deliver ROI. Legal is actually one of the few sectors where AI is showing measurable efficiency gains, probably because the work is text-based and the savings are easy to quantify. But “measurable gains” is different from “magic.” The gains are real. They’re just more modest than the demos suggest.

The vendors who are honest about this (and some are) will tell you the same thing. Target 85% accuracy on mixed documents. Expect to do human review. Plan for the AI to handle volume and for humans to handle complexity.

That’s been my experience. It’s useful. It’s not transformative. And if you go in expecting it to replace your metadata process instead of accelerate it, you’re going to be disappointed.

