When a county government came to us with 40,000 PDFs spanning 25 years of public records and a 90-day compliance deadline, we had to build a pipeline capable of handling document diversity at serious scale. This wasn't a theoretical exercise—it was a real-world accessibility remediation project that exposed the gaps between manual workflows and what's actually required at enterprise scale.
The Real-World Context: County Government Archive
The client was a medium-sized county government with over two decades of accumulated public records, ranging from planning documents and permit records to health department notices and financial disclosures. These PDFs formed the backbone of public records access and directly affected citizens' ability to interact with government services.
The legal mandate was clear: all PDFs needed to be WCAG 2.1 AA compliant within 90 days. There was no option to host an HTML alternative or delay. The pressure was intense, but the opportunity was significant—we could demonstrate that large-scale remediation is possible when the right infrastructure is in place.
Discovery and Initial Assessment
Before touching a single PDF, we spent two weeks analyzing the archive. This discovery phase was critical and often underestimated in project planning.
Sampling and Classification
We randomly sampled 500 PDFs across the archive to understand the landscape. The results told a story: 62% were native PDFs created from Office documents, 38% were scanned images with no OCR. Of the scanned documents, 31% contained handwritten annotations or non-standard formatting that would require human review.
Beyond format, we classified by complexity: text-only documents (easiest), documents with tables (moderate), image-heavy documents with charts and diagrams (harder), and forms with complex layouts (hardest). We also tracked document age—older documents often had encoding issues or non-standard fonts that caused problems in modern accessibility tools.
Metadata and Accessibility Baseline
We discovered that nearly 100% of the archive lacked proper document language tags, 94% had no document title in the PDF metadata, and approximately 88% had no semantic structure whatsoever. Some documents had been scanned with OCR engines that captured text but didn't understand context: a scanned form might contain all the words, but none of them were tagged as form fields.
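For teams running a similar audit, the catalog-level checks are easy to script. Below is a minimal sketch, assuming pypdf; it reports whether a document carries a language tag, a title, and any structure tree at all.

```python
# Minimal baseline audit, assuming pypdf. /Lang and /StructTreeRoot live in the
# document catalog; the title comes from the document information dictionary.
from pypdf import PdfReader

def accessibility_baseline(path: str) -> dict:
    reader = PdfReader(path)
    catalog = reader.trailer["/Root"]
    return {
        "has_language": "/Lang" in catalog,
        "has_title": bool(reader.metadata and reader.metadata.title),
        "has_struct_tree": "/StructTreeRoot" in catalog,  # any tagging at all
        "pages": len(reader.pages),
    }
```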
Building a Classification-First Strategy
The key insight was that not all documents should be processed the same way. A simple compliance letter doesn't need the same level of attention as a multi-page budget report with complex tables and charts.
Document Type Routing
We created five processing tiers:
- Tier 1 (Fully automated): Simple text documents under 5 pages, no tables, no images beyond logos. Estimated 28% of archive.
- Tier 2 (Automated + light review): Native PDFs with moderate complexity—tables, occasional charts, readable text layers. Estimated 35% of archive.
- Tier 3 (Automated + structured QA): Scanned documents with good OCR, or native PDFs with complex layouts. Estimated 22% of archive.
- Tier 4 (Hybrid human-machine): Scanned documents with unclear text, handwriting, or unusual layouts. Estimated 13% of archive.
- Tier 5 (Manual): Documents requiring fundamental reconstruction—severely degraded scans, image-only documents, or items with legal significance requiring human verification. Estimated 2% of archive.
This tiered approach meant we could optimize cost and time allocation. Tier 1 documents could be processed in seconds each. Tier 5 documents might require 30–45 minutes apiece, but they represented only 800 documents out of 40,000.
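To make the routing concrete, here is a sketch of the kind of rule-based router we mean. The feature names and thresholds are illustrative stand-ins for whatever your discovery-phase classifier produces, not the exact rules we shipped.

```python
# Illustrative tier router. Features are assumed to come from an upstream
# classification step; thresholds are examples, not project constants.
from dataclasses import dataclass

@dataclass
class DocFeatures:
    pages: int
    is_scanned: bool
    has_tables: bool
    has_charts: bool
    is_form: bool
    has_handwriting: bool
    ocr_confidence: float = 1.0   # native PDFs keep the default

def assign_tier(f: DocFeatures) -> int:
    if f.is_scanned and f.has_handwriting and f.ocr_confidence < 0.5:
        return 5                              # fundamental reconstruction
    if f.is_scanned and (f.has_handwriting or f.ocr_confidence < 0.85):
        return 4                              # hybrid human-machine
    if f.is_scanned or f.is_form or f.has_charts:
        return 3                              # automated + structured QA
    if f.has_tables or f.pages > 5:
        return 2                              # automated + light review
    return 1                                  # fully automated
```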
Handling OCR at Scale
OCR was one of the biggest operational challenges. For scanned documents, we needed accurate text recognition, but OCR itself introduced new problems.
OCR Strategy and Tool Selection
We tested three engines: Tesseract (open source, free, variable quality), ABBYY (excellent accuracy, per-page licensing cost), and Amazon Textract (AWS, scalable, acceptable quality). Given the volume, we used a hybrid approach: ABBYY for critical documents and Tier 4/5 items where accuracy directly affected accessibility, and Textract for Tier 2/3 documents where speed mattered more than perfection.
OCR Accuracy Challenges
OCR engines struggle with certain document types. Historical documents with degraded scans, documents with colored backgrounds or watermarks, and documents with non-standard fonts all caused accuracy drops of 10–25%. We built an automated confidence scoring system: if OCR returned confidence below 85%, the document was flagged for human review.
We also discovered that font substitution was critical. If OCR produced "1" when the original was "l" (one vs. lowercase L), it could break comprehension. Our post-OCR validation caught these via pattern matching and dictionary checking.
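As a sketch of what such a gate can look like, the snippet below uses Amazon Textract's per-word confidences on a rendered page image (ABBYY exposes similar scores through its own SDK); the 85% threshold matches our project rule, while the dictionary ratio is illustrative.

```python
# Hedged sketch: OCR a rendered page image with Amazon Textract, then decide
# whether the result should be routed to human review.
import boto3

textract = boto3.client("textract")

def ocr_with_confidence(page_image_bytes: bytes) -> tuple[str, float]:
    resp = textract.detect_document_text(Document={"Bytes": page_image_bytes})
    words = [b for b in resp["Blocks"] if b["BlockType"] == "WORD"]
    text = " ".join(w["Text"] for w in words)
    mean_conf = sum(w["Confidence"] for w in words) / max(len(words), 1)
    return text, mean_conf / 100.0

def needs_human_review(text: str, confidence: float, lexicon: set[str]) -> bool:
    if confidence < 0.85:                      # our project threshold
        return True
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    unknown = sum(1 for t in tokens if t and t not in lexicon and not t.isdigit())
    return unknown / max(len(tokens), 1) > 0.15   # too many out-of-dictionary words
```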
Building the Processing Pipeline
Architecture Overview
The pipeline had four stages:
- Intake & Metadata: Ingest, extract metadata, assign tier, queue for processing.
- Automated Remediation: Apply tier-appropriate transformations—add language tag, run OCR if needed, add document structure, generate alt text for images.
- Quality Assurance: Automated and sampled manual checks, validation against WCAG criteria.
- Delivery & Archiving: Export, store, track, and provide reports to stakeholders.
Detailed Processing Stages
Stage 1: Intake. Documents came from a shared file server. We built a scheduled crawler that ingested new PDFs, extracted metadata (title, creation date, page count), and assigned them to processing tiers based on rules and heuristics. This stage also checked that files were actually valid PDFs; surprisingly, some weren't, whether corrupted or using unusual encodings.
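A simplified version of that crawler, assuming pypdf for the sanity check, might look like this; files that fail to parse go to a quarantine list rather than into the pipeline.

```python
# Illustrative intake crawler: walk a directory, pull basic metadata, and
# quarantine anything that cannot be parsed as a PDF.
from pathlib import Path
from pypdf import PdfReader
from pypdf.errors import PdfReadError

def crawl(intake_dir: str):
    valid, quarantined = [], []
    for path in Path(intake_dir).rglob("*.pdf"):
        try:
            reader = PdfReader(path)
            meta = reader.metadata
            valid.append({
                "path": str(path),
                "title": meta.title if meta else None,
                "created": meta.creation_date if meta else None,
                "pages": len(reader.pages),
            })
        except (PdfReadError, OSError, ValueError):
            quarantined.append(str(path))      # not a parseable PDF
    return valid, quarantined
```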
Stage 2: Automated Remediation. For native PDFs:
- Add document language tag (detected from content analysis or metadata).
- Extract text layer and analyze reading order.
- Auto-tag headings using font size and styling heuristics.
- Detect and tag lists.
- Flag tables for review or apply auto-tagging if structure is obvious.
- Generate basic alt text for images using computer vision (Azure Computer Vision API), then flag for human review.
- Detect form fields and tag them.
For scanned documents:
- Run OCR.
- Create embedded text layer.
- Apply heuristic-based structure tagging.
- Attempt image-based alt text generation.
- Flag for manual review based on confidence scores.
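As an example of the lighter automated fixes, here is a minimal sketch of the language-tag and title step from the lists above, assuming pikepdf; the BCP-47 code and title string would come from the content-analysis step, not be hard-coded.

```python
# Minimal sketch: write a catalog-level /Lang entry and an XMP document title.
import pikepdf

def set_language_and_title(src: str, dst: str, lang: str, title: str) -> None:
    with pikepdf.open(src) as pdf:
        pdf.Root.Lang = pikepdf.String(lang)     # document language tag
        with pdf.open_metadata() as meta:
            meta["dc:title"] = title             # XMP document title
        pdf.save(dst)
```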
Stage 3: Quality Assurance. This was multi-layered:
- Automated validation: Run PAC 2024 in batch mode on all documents, log failures.
- Sampling QA: Randomly select 2–5% of each document type and manually verify it in Adobe Acrobat Pro against an accessibility checklist.
- Tier-specific review: All Tier 4 and Tier 5 documents received manual review before delivery.
- Regression testing: Occasionally test older batches to ensure consistency.
The Automation Ratio and Human Effort
Our final breakdown was granular:
- 73% required no human intervention: These documents passed automated processing and automated validation without issues. Mostly Tier 1 and simple Tier 2 documents.
- 19% required minor human correction: Alt text needed refinement, reading order had minor issues, or a heading tag was applied slightly incorrectly. Average time: 3–5 minutes per document.
- 8% required substantial review: OCR accuracy was too low, document structure was non-standard, or legal/safety content required human verification. Average time: 25–40 minutes per document.
This distribution is crucial for cost modeling. If you manually remediated everything, you'd invest 40,000 × 30 minutes = 20,000 hours. Our actual human effort was approximately 2,400 hours (mostly from the 8% category plus sampling for the 19% category). That's an 88% reduction in human labor.
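The arithmetic behind those figures is straightforward; the snippet below redoes it using the midpoints of the per-document time ranges stated above.

```python
# Back-of-the-envelope check of the labor figures for this project.
TOTAL_DOCS = 40_000

minor_fix_hours = 0.19 * TOTAL_DOCS * 4 / 60     # 3–5 min, midpoint 4 min
major_fix_hours = 0.08 * TOTAL_DOCS * 32.5 / 60  # 25–40 min, midpoint 32.5 min
manual_baseline_hours = TOTAL_DOCS * 30 / 60     # 30 min/doc if done entirely by hand

hybrid_hours = minor_fix_hours + major_fix_hours
print(f"hybrid effort:  {hybrid_hours:,.0f} h")          # ~2,240 h (+ sampling QA ~ 2,400 h)
print(f"manual effort:  {manual_baseline_hours:,.0f} h")  # 20,000 h
print(f"reduction:      {1 - hybrid_hours / manual_baseline_hours:.0%}")  # ~89%
```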
Handling Edge Cases and Special Document Types
Forms
PDF forms with complex layouts were particularly tricky. AcroForm fields need to be tagged and semantically associated with their labels. Heuristics worked sometimes (label above or to the left of the field), but not always. We developed a two-pass approach: automated tagging followed by confidence scoring, with manual review for low-confidence forms.
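The first pass of that approach amounts to a nearest-label search. The sketch below uses hypothetical (x0, y0, x1, y1) boxes in PDF coordinates (y increasing upward); the real geometry would come from whatever PDF library you use for extraction.

```python
# Illustrative label guess: prefer text directly left of the field on the same
# line, otherwise text directly above it; distance doubles as a confidence score.
def guess_label(field_box, text_spans):
    fx0, fy0, fx1, fy1 = field_box
    best, best_score = None, float("inf")
    for text, (tx0, ty0, tx1, ty1) in text_spans:
        same_line = ty0 < fy1 and ty1 > fy0             # vertical overlap
        left_of = same_line and tx1 <= fx0
        above = ty0 >= fy1 and tx0 < fx1 and tx1 > fx0  # horizontal overlap
        if not (left_of or above):
            continue
        dist = (fx0 - tx1) if left_of else (ty0 - fy1)
        if dist < best_score:
            best, best_score = text, dist
    # large or missing score -> low confidence -> route to manual review
    return best, best_score
```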
Spreadsheets Saved as PDF
This was surprisingly common—finance and planning departments exported spreadsheets to PDF without considering structure. A spreadsheet with 50 columns and no headers becomes nearly unusable when converted to PDF without structure. We implemented a detector for grid-like structures and auto-tagged them as tables, but complex financial spreadsheets required human intervention to ensure semantic correctness.
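A detector of this kind can be surprisingly simple: group words into rows and check whether most rows share the same column start positions. The sketch below is illustrative; the word boxes would come from your extraction step, and the thresholds would need tuning.

```python
# Illustrative grid detector over (text, (x0, top, x1, bottom)) word tuples.
from collections import Counter, defaultdict

def looks_like_grid(words, x_tolerance: float = 5.0, min_rows: int = 5) -> bool:
    rows = defaultdict(list)
    for text, (x0, top, x1, bottom) in words:
        rows[round(top / 3)].append(round(x0 / x_tolerance) * x_tolerance)
    if len(rows) < min_rows:
        return False
    column_votes = Counter(x for xs in rows.values() for x in set(xs))
    shared_columns = [x for x, n in column_votes.items() if n >= 0.7 * len(rows)]
    return len(shared_columns) >= 3       # three or more aligned columns
```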
Image-Only Documents
About 2% of the archive consisted of scanned documents without OCR, or image-only files (sometimes just a PDF wrapper around a single image). These required OCR, and if OCR confidence was low, they were flagged for potential re-scanning or manual transcription. Some items, particularly historical documents and photographs, required thoughtful alt text that conveyed context rather than merely describing content.
Multi-Language Documents
A handful of documents were in Spanish or other languages. Language detection was automated, but OCR accuracy degraded significantly for non-English text. These required manual review more often.
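The detection step itself is a one-liner with an off-the-shelf library; the sketch below assumes the langdetect package, with the resulting code feeding both the /Lang tag and the OCR engine's language hint.

```python
# Illustrative language detection over already-extracted text.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0                # make detection deterministic

def detect_language(extracted_text: str) -> str:
    try:
        return detect(extracted_text)   # e.g. "en", "es"
    except Exception:
        return "und"                    # undetermined; route to manual review
```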
Timeline and Throughput Metrics
Processing Velocity
Week 1: Discovery and classification (no documents processed, but 500 samples analyzed).
Weeks 2–10: Production processing, averaging 4,600 documents per week after the first week of setup. This varied by tier:
- Tier 1: 800–1,200 per week (high automation).
- Tier 2: 1,200–1,600 per week.
- Tier 3: 900–1,100 per week (more review time).
- Tier 4: 200–300 per week (human-intensive).
- Tier 5: 50–80 per week (mostly manual).
Final Result: 40,000 documents remediated in 67 days (9.5 weeks). This included discovery, processing, and QA.
Quality Metrics
94.3% of documents passed PAC 2024 validation on first delivery. The remaining 5.7% had minor issues, typically missing image descriptions or incorrect list markup. Post-delivery, we made two rapid correction passes to reach 99.8% compliance before final handoff.
Cost Analysis
Cost per Document
Total tooling and infrastructure cost (OCR licensing, image analysis, PDF processing, validation tools, and hosting): $7,200.
Per-document tooling cost: $7,200 / 40,000 = $0.18 per document.
Labor was tracked separately: roughly 2,400 hours at an average rate of $65/hour, or about $156,000 if fully contracted. For organizations doing this internally, the dominant cost is that labor.
Comparison: Manual accessibility remediation by specialized firms typically runs $8–12 per page; even counting only one page per document, that's $320,000–480,000 for a 40,000-document archive. Our tooling came to roughly 2% of that figure, and even with the full labor allocation included, total cost landed at roughly one-third to one-half of the manual estimate while reaching 94.3% first-pass and 99.8% final compliance.
Cost Breakdown
- OCR licensing (ABBYY + Textract): ~$2,100.
- Image analysis (Azure Computer Vision): ~$400.
- PDF processing tools and APIs: ~$800.
- Validation tools (PAC licenses): ~$1,200.
- Infrastructure and hosting: ~$2,700.
Lessons Learned
What Worked Exceptionally Well
Classification first. Investing two weeks in understanding the archive before processing was the best decision. Routing documents to appropriate processing paths eliminated wasted effort on heavy automation for documents that needed human review.
Staged QA. We didn't wait until the end to validate quality. Continuous validation from week 2 onward let us catch systemic issues early. If we'd discovered a pattern error after processing 30,000 documents, we'd have wasted weeks.
Hybrid human-machine workflow. The 73/19/8 split works because machines are fast at tedious tasks but bad at judgment, while humans excel at edge cases. Combining the two delivers both speed and quality.
Confidence scoring. Our automated scoring system for OCR accuracy and tagging confidence was invaluable. It let us route documents intelligently rather than applying uniform processing.
What Was Harder Than Expected
Historical degradation. Documents from 1990–2005 often had scanning issues, compression artifacts, or encoding problems that modern tools struggle with. We didn't fully account for the time needed to preprocess these items.
Context and semantic meaning. Machines can detect that something is a heading or a list item based on formatting, but they can't always determine whether a heading hierarchy makes sense or whether a table's semantic structure reflects the author's intent. This is where human review remains essential.
Stakeholder expectations. The client initially expected "100% automated." When we explained that 8% would need human review, there was pushback. Educating them about why machines can't make semantic judgments for every document was important for trust and project success.
What We'd Do Differently
We'd invest more time upfront in building confidence scores for different document types. We built these reactively, after processing started; building them during discovery would have reduced rework.
We'd also have reserved more budget for re-OCR. Some documents failed OCR on the first pass and required re-scanning at higher resolution. Having a re-scan workflow ready earlier would have been more efficient.
Recommendations for Large-Scale Archive Projects
For Organizations Starting Similar Projects
- Start with discovery, not processing. Sample at least 500 documents. Understand your archive before committing to a process.
- Build a tiered workflow. Don't apply the same remediation to every document. Classify by type and complexity.
- Invest in quality assurance infrastructure. Continuous validation beats post-hoc fixes.
- Use OCR carefully. OCR is powerful but imperfect. Always validate accuracy, especially for documents with legal or safety implications.
- Plan for edge cases. Budget time and resources for the documents that don't fit standard patterns. They'll represent 2–10% of your archive but 20–30% of your effort.
- Measure and report. Track metrics: processing speed, quality pass rate, human review rate by tier. These help refine the process and communicate value to stakeholders.
- Prepare for timeline challenges. Even with automation, large projects encounter unexpected delays. Budget extra time for handling discovery-phase surprises.
Technology Stack Recommendations
For OCR: Use a hybrid approach if volume is high. Open-source tools like Tesseract are free but require tuning. Commercial engines like ABBYY are fast and accurate but cost more. Cloud APIs (Textract, GCP Vision) offer scalability without upfront licensing costs.
For tagging and structure detection: libraries such as PDF.js (pdfjs-dist) in JavaScript or iText on the JVM work well for content extraction. For automated tagging, a combination of heuristics and machine learning (for example, spaCy for NLP over the extracted text) can detect headings, lists, and basic structure.
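As a sketch of the heuristic side, the snippet below flags text noticeably larger than a page's median font size as heading candidates, assuming pdfplumber for extraction; the 1.3× threshold is illustrative and would be tuned per archive.

```python
# Illustrative heading candidate detection based on relative font size.
import pdfplumber
from statistics import median

def candidate_headings(path: str) -> list[str]:
    headings = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            words = page.extract_words(extra_attrs=["size"])
            if not words:
                continue
            body_size = median(w["size"] for w in words)
            run = []
            for w in words:
                if w["size"] >= 1.3 * body_size:
                    run.append(w["text"])        # part of a large-text run
                elif run:
                    headings.append(" ".join(run))
                    run = []
            if run:
                headings.append(" ".join(run))
    return headings
```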
For validation: PAC 2024 (Windows-only) or veraPDF (open source) can be run in batch mode. Both provide detailed violation reports that feed back into quality workflows.
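A batch run over veraPDF's command-line interface can be as simple as the sketch below; the exact flags depend on your veraPDF version, and we assume here that the PDF/UA-1 profile is requested and that a validation failure yields a non-zero exit code.

```python
# Hedged sketch of batch validation via the veraPDF CLI.
import subprocess
from pathlib import Path

def validate_batch(pdf_dir: str) -> dict[str, bool]:
    results = {}
    for pdf in Path(pdf_dir).glob("*.pdf"):
        proc = subprocess.run(
            ["verapdf", "--flavour", "ua1", str(pdf)],
            capture_output=True, text=True,
        )
        results[pdf.name] = proc.returncode == 0   # keep proc.stdout for the detailed report
    return results
```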
For image alt text: Azure Computer Vision and AWS Rekognition both offer batch processing. Don't rely on generated descriptions alone—they require human review, but they provide a strong starting point.
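A hedged sketch of that workflow with the Azure Computer Vision SDK (azure-cognitiveservices-vision-computervision) follows; the endpoint and key are placeholders, and every generated caption still goes to a human reviewer.

```python
# Illustrative draft alt-text generation; low-confidence captions get flagged.
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

client = ComputerVisionClient(
    "https://<resource>.cognitiveservices.azure.com/",
    CognitiveServicesCredentials("<key>"),
)

def draft_alt_text(image_path: str) -> tuple[str, float]:
    with open(image_path, "rb") as img:
        result = client.describe_image_in_stream(img, max_candidates=1)
    if not result.captions:
        return "", 0.0
    best = result.captions[0]
    return best.text, best.confidence       # low confidence -> flag for review
```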
Final Thought
A 40,000-document archive is not an edge case anymore. Government agencies, financial institutions, healthcare systems, and educational organizations all manage archives of this scale or larger. The question isn't whether to remediate them—legal and ethical obligations make this clear. The question is how to do it cost-effectively and at quality. A well-designed, phased approach combining automation and human expertise can achieve 94%+ compliance at a fraction of manual costs and in a fraction of the time.