The Promise of AI Alt Text Generation
Vision AI has fundamentally transformed how we approach alt text for PDFs. Models like Claude's vision capabilities, GPT-4V, and other large multimodal models can look at an image and produce descriptive alt text that captures both what's depicted and its functional purpose. For many PDF use cases, this works remarkably well.
In internal testing across thousands of PDF images, modern vision models produce alt text meeting WCAG 2.1 AA requirements for approximately 87% of images without requiring human review. For photos, logos, simple diagrams, and common business graphics, the quality is often indistinguishable from what an expert accessibility consultant would write.
But that remaining 13% is important. Understanding where AI succeeds and where it fails is critical to building a reliable remediation workflow.
How Modern Vision Models Describe Images
What Vision Models Actually Do
A vision model breaks an image into patches, encodes each patch into numerical representations, and processes these encodings through attention mechanisms that learn relationships between different regions of the image. The model has been trained on billions of image-text pairs, so it learns statistical correlations between visual features and natural language descriptions.
When asked to describe an image, the model doesn't "see" like humans do. It doesn't consciously understand that "this is a dog" because it somehow perceived dog-ness. Rather, it recognizes patterns in pixel values that correlate with the text token "dog" in its training data, and produces that token.
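The patch step described above can be illustrated in a few lines of NumPy. This is only the tokenization stage (splitting the image into patch "tokens"), not the learned encoder or attention layers that follow:

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    This mirrors the first step of a vision transformer: each patch
    becomes one token that the attention layers then relate to the others.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "dims must divide evenly"
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)          # group patch rows/cols together
        .reshape(-1, patch_size * patch_size * c)
    )
    return patches

# A 224x224 RGB image yields 196 patch tokens of 768 values each,
# matching the common ViT-Base configuration.
img = np.zeros((224, 224, 3), dtype=np.float32)
print(image_to_patches(img).shape)  # (196, 768)
```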
Why This Works Well for Many Images
For straightforward images—a photograph of two people shaking hands, a simple bar chart with clear labels, a company logo—this statistical approach works. The model has seen millions of similar images in training, so the patterns are well-established.
The Training Data Problem
Vision models are only as good as their training data. If a model was trained heavily on internet images (mostly photos, screenshots, common graphics), it will be excellent at those and mediocre at specialized visualizations (medical diagrams, specialized technical charts, scanned handwritten documents).
Where AI Excels: Image Categories That Require Minimal Review
Photographs and Real-World Objects
Modern vision models are excellent at describing photographs. A photo of an office meeting, a landscape, a product, or people in action—the model will produce accurate alt text. Example outputs:
- Photograph input: "Three colleagues in an office setting reviewing documents on a desk with a laptop"
- Actual context: Could be from a case study, training material, or compliance document
- AI output: "Three people in business attire review documents together at a desk with a laptop nearby." — Accurate and useful.
Simple Logos and Brand Graphics
A company logo, product badge, or simple icon—the model identifies these reliably. "RemeDocs logo" or "Microsoft Office icon" is straightforward for vision models.
Standard Charts and Graphs
Simple bar charts, pie charts, and line graphs with clear labels are handled well. The model can often extract the key data relationships. Example:
- Input: A bar chart showing quarterly revenue with four bars labeled Q1–Q4, with values 10M, 12M, 15M, 18M
- AI output: "Bar chart showing quarterly revenue growth from Q1 ($10M) through Q4 ($18M), with steady increases each quarter." — Accurate and conveys the key insight.
Simple Diagrams and Flowcharts
A basic organizational chart, simple process diagram, or flowchart with clear boxes and labels—vision models handle these well. The model identifies the boxes, their labels, and connection directions.
Maps and Geographic Visualizations
A map showing store locations, regional data, or geographic boundaries—the model can describe what's shown. "Map of the United States with regional sales data color-coded by state" is achievable.
Tables and Data Layouts
While tables should ideally be tagged as true table structures in a PDF, when they appear as images, vision models can extract table structure. "A 5-column table with headers: Product, Q1, Q2, Q3, Q4, containing regional revenue data" is achievable.
Where AI Struggles: Complex Images Requiring Human Review
Multi-Series Charts with Color-Only Differentiation
This is one of the most common failures. A line chart with five lines in different colors, showing revenue trends for five product lines—the model struggles because:
- The model must identify each line, which requires distinguishing by color
- If the colors are similar or if rendering has degraded quality, differentiation fails
- Without labels on the lines themselves, the model must map colors to a legend, and that mapping can be fragile
A vision model might produce: "Line chart showing trends for five different metrics." But it often can't reliably say which line is which product or what the specific values are.
Complex Infographics with Dense Information
An infographic packing 20+ data points, multiple visual elements, icons, and nested information—the model struggles to prioritize what matters. It might describe shallow visual features ("an infographic with many colors and icons") without capturing the key insights the infographic was designed to convey.
Technical and Architectural Diagrams
A complex systems architecture diagram, network topology, or technical schematic with dozens of labeled components and relationships—vision models often fail. Why:
- The model must read text labels precisely
- It must understand the meaning of connection types (solid vs. dashed lines, arrows, etc.)
- It must understand domain context (what does "load balancer" mean in a system architecture?)
A model might describe the visual layout ("boxes connected by lines") without understanding the functional relationships or the domain-specific meaning.
Scanned Documents and Handwritten Text
A scanned image of a handwritten form, hand-drawn diagram, or faded document—vision models struggle with legibility and context. OCR-style text extraction is unreliable, and understanding intent from hand-drawn diagrams is nearly impossible.
Highly Specialized Domain Images
Medical imaging (X-rays, MRI scans), scientific microscopy images, or highly specialized technical diagrams (circuit schematics, geological cross-sections)—vision models trained on general internet data lack the domain-specific knowledge to interpret these accurately. A model trained on general data won't understand the clinical significance of an MRI finding.
Images Dependent on Surrounding Context
An image whose meaning is context-dependent—a before/after comparison, a diagram illustrating a specific process, or a chart meant to support a particular argument in the text—the model generates alt text in isolation, missing the intended context. Example:
- In a document discussing "the top five reasons users abandon forms," an image shows a form with 20 fields
- AI alt text (in isolation): "A web form with multiple fields for user information"
- Proper alt text (contextual): "Example of a form with excessive fields, demonstrating the complexity that leads to abandonment"
Decorative Images Masquerading as Content
Vision models tend to over-describe. A purely decorative graphic (a divider line, a background image, a design flourish) gets verbose alt text. A truly accessible PDF would mark this as an artifact (the PDF equivalent of an empty alt attribute), but the model has no way to know that.
What Good Alt Text Looks Like Per WCAG 2.1 AA
Functional Purpose, Not Just Description
WCAG 2.1 (Success Criterion 1.1.1) requires a text alternative that "serves the equivalent purpose" of the non-text content. This doesn't mean writing a novel about the image. It means answering: "Why is this image here? What information or function does it serve in this document?"
Length Guidelines
- Simple images: 10–15 words. "A photo of a team meeting in the company cafeteria."
- Standard charts/diagrams: 20–50 words. "Bar chart comparing customer satisfaction scores: Product A (85%), Product B (72%), Product C (91%)."
- Complex visualizations: 50–150 words or a link to a long description. For a complex multi-series chart, the short alt text might be "Revenue trends by product line, 2023–2025 (see detailed breakdown below)" with a text table in the document.
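These word-count bands are easy to enforce as a first-pass lint on generated alt text. A minimal sketch, assuming three illustrative category names (not a standard):

```python
def check_alt_length(alt_text: str, category: str) -> bool:
    """Return True if the alt text word count falls inside the rough
    guideline band for its image category. Bands follow the guidelines
    above; the category names are illustrative."""
    bands = {
        "simple": (1, 15),      # photos, logos
        "standard": (10, 50),   # basic charts and diagrams
        "complex": (30, 150),   # multi-series charts, infographics
    }
    lo, hi = bands[category]
    n = len(alt_text.split())
    return lo <= n <= hi

check_alt_length("A photo of a team meeting in the company cafeteria.", "simple")  # True
check_alt_length("Chart.", "standard")  # False - too vague to meet the band
```

A check like this cannot judge quality, only flag outputs that are obviously too terse or too verbose for their category.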
Accessibility vs. Redundancy
If the image is purely for visual enhancement and the information is already in surrounding text, alt text should be minimal or empty. If the image conveys information not in text, alt text must be comprehensive.
Building a Human-in-the-Loop Workflow
The 87% Rule: When to Trust AI
Based on testing, vision models produce acceptable alt text for approximately 87% of PDF images without human review. The remaining 13% are typically:
- Complex multi-series charts (5%)
- Specialized technical/medical diagrams (3%)
- Scanned or low-quality images (3%)
- Context-dependent images where surrounding text is critical (2%)
Workflow Design: Categories of Review
Category A: Auto-accept (no review)
- Photographs and real-world objects
- Simple logos and icons
- Basic bar/pie/line charts with 3 or fewer series
- Simple process diagrams and org charts
Category B: AI-generated + spot-check (10% sample review)
- Standard business charts with 4+ series
- Maps and geographic visualizations
- Standard infographics
- Tables rendered as images
Category C: Human review required (100% review)
- Complex technical diagrams
- Medical or specialized domain images
- Scanned documents or handwritten content
- Context-critical images where meaning depends on surrounding text
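The three tiers above can be expressed as a simple routing table. A sketch, where the image-type labels and tier names are hypothetical, not a RemeDocs API:

```python
from enum import Enum

class Review(Enum):
    AUTO_ACCEPT = "auto-accept"
    SPOT_CHECK = "10% sample review"
    FULL_REVIEW = "100% human review"

# Hypothetical mapping from detected image type to review tier,
# following Categories A-C above.
REVIEW_TIER = {
    "photograph": Review.AUTO_ACCEPT,
    "logo": Review.AUTO_ACCEPT,
    "simple_chart": Review.AUTO_ACCEPT,
    "org_chart": Review.AUTO_ACCEPT,
    "multi_series_chart": Review.SPOT_CHECK,
    "map": Review.SPOT_CHECK,
    "infographic": Review.SPOT_CHECK,
    "table_image": Review.SPOT_CHECK,
    "technical_diagram": Review.FULL_REVIEW,
    "medical_image": Review.FULL_REVIEW,
    "scanned_document": Review.FULL_REVIEW,
    "context_critical": Review.FULL_REVIEW,
}

def route(image_type: str) -> Review:
    # Unknown types default to full review rather than auto-accept:
    # failing safe costs review time, failing open costs compliance.
    return REVIEW_TIER.get(image_type, Review.FULL_REVIEW)
```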
Cost-Benefit Analysis
Using this categorized approach, a typical document set breaks down as:
- 65% Category A: Auto-accept at ~$0.01–$0.02 per image
- 25% Category B: AI + spot-check at ~$0.05–$0.10 per image (including review time)
- 10% Category C: Full human review at ~$0.50–$2.00 per image (depending on complexity)
Average cost per image across all three categories: ~$0.15–$0.25 with AI assistance, vs. $0.50–$2.00 for full manual alt text.
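The blended figure follows directly from the category mix and the midpoint of each cost band:

```python
# Weighted average cost per image, using the category mix above and
# the midpoint of each quoted cost band.
mix = {"A": 0.65, "B": 0.25, "C": 0.10}
cost_mid = {"A": 0.015, "B": 0.075, "C": 1.25}  # midpoints of $0.01-0.02, $0.05-0.10, $0.50-2.00

blended = sum(mix[k] * cost_mid[k] for k in mix)
print(blended)  # ~0.15 per image, the low end of the ~$0.15-$0.25 range
```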
How RemeDocs Approaches Alt Text Generation
Multi-Model Approach
Rather than relying on a single vision model, RemeDocs uses multiple vision models and compares their outputs. If models agree, confidence is high. If they disagree, the image is flagged for human review.
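One cheap way to approximate "do the models agree" is word-set overlap between outputs. A production system would more likely compare embeddings, so treat this as a sketch of the flagging logic, not the actual similarity measure:

```python
def agreement(alt_a: str, alt_b: str) -> float:
    """Jaccard overlap of lowercase word sets between two models'
    alt text outputs. A crude stand-in for a real similarity
    measure such as embedding cosine similarity."""
    a, b = set(alt_a.lower().split()), set(alt_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def needs_review(outputs: list[str], threshold: float = 0.5) -> bool:
    # Flag for human review if any pair of model outputs disagrees.
    return any(
        agreement(outputs[i], outputs[j]) < threshold
        for i in range(len(outputs))
        for j in range(i + 1, len(outputs))
    )
```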
Contextual Awareness
RemeDocs analyzes surrounding text to understand context. If an image appears in a section titled "Budget Trends 2023–2025," the model understands the image should be contextualized within that frame.
Image Category Detection
RemeDocs pre-classifies images (photograph, chart, diagram, table, etc.) and applies category-specific alt text generation rules. A chart gets parsed for axis labels and data values; a photo gets described for its subjects and composition.
Confidence Scoring
Each alt text output gets a confidence score. High-confidence outputs (photographs, simple charts) bypass human review. Medium-confidence outputs get a quick review. Low-confidence outputs (complex diagrams, scanned content) get full expert review.
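A minimal sketch of that routing, assuming a confidence score normalized to [0, 1]; the 0.9 and 0.6 cut points are illustrative and would need tuning against a labeled review set:

```python
def review_tier(confidence: float) -> str:
    """Map a model confidence score in [0, 1] to a review path.
    Thresholds are illustrative, not tuned values."""
    if confidence >= 0.9:
        return "auto-accept"    # photographs, simple charts
    if confidence >= 0.6:
        return "quick review"   # spot-check by a reviewer
    return "expert review"      # complex diagrams, scanned content
```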
Best Practices for Reviewing AI-Generated Alt Text
Rule 1: Does It Convey Functional Purpose?
Read the alt text without looking at the image. Can you understand what the image is conveying to someone using a screen reader? If not, revise.
Rule 2: Is It Appropriately Detailed?
Simple images shouldn't have paragraph-length alt text. Complex diagrams shouldn't be reduced to one sentence. Match detail to complexity.
Rule 3: Don't Repeat Surrounding Text
If the paragraph immediately before the image says "Budget trends show Q4 growth of 15%," the image's alt text doesn't need to repeat this. It should add information the image uniquely conveys.
Rule 4: Be Specific About Data
For charts: include axis labels, key values, and trends. "Chart showing revenue growth" is too vague. "Revenue chart: Q1 $2M, Q2 $2.8M, Q3 $3.5M, showing 75% growth over three quarters" is specific.
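Once a chart's data points have been extracted, specific alt text can be assembled mechanically. A sketch, with a hypothetical chart_alt helper that reproduces the example above:

```python
def chart_alt(title: str, points: list[tuple[str, float]], unit: str = "$") -> str:
    """Assemble specific alt text from a chart's extracted data points,
    following Rule 4: name the values, then state the trend."""
    series = ", ".join(f"{label} {unit}{value:g}M" for label, value in points)
    first, last = points[0][1], points[-1][1]
    growth = (last - first) / first * 100
    return f"{title}: {series}, showing {growth:.0f}% growth over {len(points)} quarters."

chart_alt("Revenue chart", [("Q1", 2), ("Q2", 2.8), ("Q3", 3.5)])
# "Revenue chart: Q1 $2M, Q2 $2.8M, Q3 $3.5M, showing 75% growth over 3 quarters."
```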
Rule 5: Mark Decorative Images Appropriately
If an image is purely decorative (background texture, divider, icon with no semantic meaning), mark it as an artifact in the PDF tag structure, the equivalent of an empty alt attribute in HTML, rather than giving it a description. Vision models often fail at this—they describe the visual content when no description at all is correct.
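A coarse pre-filter can catch some decorative candidates before they reach a vision model at all. A heuristic sketch; the size and aspect-ratio thresholds are illustrative, and borderline cases still need human judgment:

```python
def likely_decorative(width: int, height: int, min_area: int = 2500) -> bool:
    """Heuristic pre-filter for decorative images: very small images
    and extreme-aspect-ratio strips (divider lines, borders) are
    flagged as probably decorative. Thresholds are illustrative."""
    aspect = max(width, height) / max(1, min(width, height))
    return width * height < min_area or aspect > 20

likely_decorative(800, 4)    # True - almost certainly a divider line
likely_decorative(640, 480)  # False - plausibly content, send to the model
```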
The Reality: AI is a Multiplier, Not a Replacement
AI alt text generation is genuinely useful and can handle the majority of PDF images without human review. But it's not magic. It's a tool that handles routine cases efficiently and flags complex cases for expert attention. The most effective approach combines:
- AI generation for speed and baseline quality
- Automated quality scoring to identify low-confidence cases
- Human review for complex or domain-specific images
- SME validation for specialized documents
This hybrid workflow scales remediation to thousands of PDFs while maintaining quality standards that meet WCAG 2.1 AA. It's faster than full manual alt text and more reliable than AI-only automation.