The Promise of AI Alt Text Generation
Vision AI has fundamentally transformed how we approach alt text for PDFs. Models like Claude's vision capabilities, GPT-4V, and other large multimodal models can look at an image and produce descriptive alt text that captures both what's depicted and its functional purpose. For many PDF use cases, this works remarkably well.
In internal testing across thousands of PDF images, modern vision models produce alt text meeting WCAG 2.1 AA requirements for approximately 87% of images without requiring human review. For photos, logos, simple diagrams, and common business graphics, the quality is often indistinguishable from what an expert accessibility consultant would write.
But that remaining 13% is important. Understanding where AI succeeds and where it fails is critical to building a reliable remediation workflow.
How Modern Vision Models Describe Images
What Vision Models Actually Do
A vision model breaks an image into patches, encodes each patch into numerical representations, and processes these encodings through attention mechanisms that learn relationships between different regions of the image. The model has been trained on billions of image-text pairs, so it learns statistical correlations between visual features and natural language descriptions.
When asked to describe an image, the model doesn't "see" like humans do. It doesn't consciously understand that "this is a dog" because it somehow perceived dog-ness. Rather, it recognizes patterns in pixel values that correlate with the text token "dog" in its training data, and produces that token.
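The patch step described above can be illustrated in a few lines of NumPy. This is only the tokenization stage (splitting the image into patch "tokens"), not the learned encoder or attention layers that follow:

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    This mirrors the first step of a vision transformer: each patch
    becomes one token that the attention layers then relate to the others.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "dims must divide evenly"
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)          # group patch rows/cols together
        .reshape(-1, patch_size * patch_size * c)
    )
    return patches

# A 224x224 RGB image yields 196 patch tokens of 768 values each,
# matching the common ViT-Base configuration.
img = np.zeros((224, 224, 3), dtype=np.float32)
print(image_to_patches(img).shape)  # (196, 768)
```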
Why This Works Well for Many Images
For straightforward images—a photograph of two people shaking hands, a simple bar chart with clear labels, a company logo—this statistical approach works. The model has seen millions of similar images in training, so the patterns are well-established.
The Training Data Problem
Vision models are only as good as their training data. If a model was trained heavily on internet images (mostly photos, screenshots, common graphics), it will be excellent at those and mediocre at specialized visualizations (medical diagrams, specialized technical charts, scanned handwritten documents).
Where AI Excels: Image Categories That Require Minimal Review
Photographs and Real-World Objects
Modern vision models are excellent at describing photographs. A photo of an office meeting, a landscape, a product, or people in action—the model will produce accurate alt text. Example outputs:
- Photograph input: "Three colleagues in an office setting reviewing documents on a desk with a laptop"
- Actual context: Could be from a case study, training material, or compliance document
- AI output: "Three people in business attire review documents together at a desk with a laptop nearby." — Accurate and useful.
Simple Logos and Brand Graphics
A company logo, product badge, or simple icon—the model identifies these reliably. "RemeDocs logo" or "Microsoft Office icon" is straightforward for vision models.
Standard Charts and Graphs
Simple bar charts, pie charts, and line graphs with clear labels are handled well. The model can often extract the key data relationships. Example:
- Input: A bar chart showing quarterly revenue with four bars labeled Q1–Q4, with values 10M, 12M, 15M, 18M
- AI output: "Bar chart showing quarterly revenue growth from Q1 ($10M) through Q4 ($18M), with steady increases each quarter." — Accurate and conveys the key insight.
Simple Diagrams and Flowcharts
A basic organizational chart, simple process diagram, or flowchart with clear boxes and labels—vision models handle these well. The model identifies the boxes, their labels, and connection directions.
Maps and Geographic Visualizations
A map showing store locations, regional data, or geographic boundaries—the model can describe what's shown. "Map of the United States with regional sales data color-coded by state" is achievable.
Tables and Data Layouts
While tables should ideally be tagged as true table structures in a PDF, when they appear as images, vision models can extract table structure. "A 5-column table with headers: Product, Q1, Q2, Q3, Q4, containing regional revenue data" is achievable.
Where AI Struggles: Complex Images Requiring Human Review
Multi-Series Charts with Color-Only Differentiation
This is one of the most common failures. A line chart with five lines in different colors, showing revenue trends for five product lines—the model struggles because:
- The model must identify each line, which requires distinguishing by color
- If the colors are similar or if rendering has degraded quality, differentiation fails
- Without labels on the lines themselves, the model must map colors to a legend, and that mapping can be fragile
A vision model might produce: "Line chart showing trends for five different metrics." But it often can't reliably say which line is which product or what the specific values are.
Complex Infographics with Dense Information
An infographic packing 20+ data points, multiple visual elements, icons, and nested information—the model struggles to prioritize what matters. It might describe shallow visual features ("an infographic with many colors and icons") without capturing the key insights the infographic was designed to convey.
Technical and Architectural Diagrams
A complex systems architecture diagram, network topology, or technical schematic with dozens of labeled components and relationships—vision models often fail. Why:
- The model must read text labels precisely
- It must understand the meaning of connection types (solid vs. dashed lines, arrows, etc.)
- It must understand domain context (what does "load balancer" mean in a system architecture?)
A model might describe the visual layout ("boxes connected by lines") without understanding the functional relationships or the domain-specific meaning.
Scanned Documents and Handwritten Text
A scanned image of a handwritten form, hand-drawn diagram, or faded document—vision models struggle with legibility and context. OCR-style text extraction is unreliable, and understanding intent from hand-drawn diagrams is nearly impossible.
Highly Specialized Domain Images
Medical imaging (X-rays, MRI scans), scientific microscopy images, or highly specialized technical diagrams (circuit schematics, geological cross-sections)—vision models trained on general internet data lack the domain-specific knowledge to interpret these accurately. A model trained on general data won't understand the clinical significance of an MRI finding.
Images Dependent on Surrounding Context
An image whose meaning is context-dependent—a before/after comparison, a diagram illustrating a specific process, or a chart meant to support a particular argument in the text—the model generates alt text in isolation, missing the intended context. Example:
- In a document discussing "the top five reasons users abandon forms," an image shows a form with 20 fields
- AI alt text (in isolation): "A web form with multiple fields for user information"
- Proper alt text (contextual): "Example of a form with excessive fields, demonstrating the complexity that leads to abandonment"
Decorative Images Masquerading as Content
Vision models tend to over-describe. A purely decorative graphic (a divider line, a background image, a design flourish) gets verbose alt text. A truly accessible PDF would mark this as an artifact (the PDF equivalent of an empty alt attribute), but the model has no way to know that.
What Good Alt Text Looks Like Per WCAG 2.1 AA
Functional Purpose, Not Just Description
WCAG 2.1 (Success Criterion 1.1.1) requires a text alternative that "serves the equivalent purpose" of the non-text content. This doesn't mean writing a novel about the image. It means answering: "Why is this image here? What information or function does it serve in this document?"
Length Guidelines
- Simple images: 10–15 words. "A photo of a team meeting in the company cafeteria."
- Standard charts/diagrams: 20–50 words. "Bar chart comparing customer satisfaction scores: Product A (85%), Product B (72%), Product C (91%)."
- Complex visualizations: 50–150 words or a link to a long description. For a complex multi-series chart, the short alt text might be "Revenue trends by product line, 2023–2025 (see detailed breakdown below)" with a text table in the document.
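These word-count bands are easy to enforce as a first-pass lint on generated alt text. A minimal sketch, assuming three illustrative category names (not a standard):

```python
def check_alt_length(alt_text: str, category: str) -> bool:
    """Return True if the alt text word count falls inside the rough
    guideline band for its image category. Bands follow the guidelines
    above; the category names are illustrative."""
    bands = {
        "simple": (1, 15),      # photos, logos
        "standard": (10, 50),   # basic charts and diagrams
        "complex": (30, 150),   # multi-series charts, infographics
    }
    lo, hi = bands[category]
    n = len(alt_text.split())
    return lo <= n <= hi

check_alt_length("A photo of a team meeting in the company cafeteria.", "simple")  # True
check_alt_length("Chart.", "standard")  # False - too vague to meet the band
```

A check like this cannot judge quality, only flag outputs that are obviously too terse or too verbose for their category.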
Accessibility vs. Redundancy
If the image is purely for visual enhancement and the information is already in surrounding text, alt text should be minimal or empty. If the image conveys information not in text, alt text must be comprehensive.
Building a Human-in-the-Loop Workflow
The 87% Rule: When to Trust AI
Based on testing, vision models produce acceptable alt text for approximately 87% of PDF images without human review. The remaining 13% are typically:
- Complex multi-series charts (5%)
- Specialized technical/medical diagrams (3%)
- Scanned or low-quality images (3%)
- Context-dependent images where surrounding text is critical (2%)
Workflow Design: Categories of Review
Category A: Auto-accept (no review)
- Photographs and real-world objects
- Simple logos and icons
- Basic bar/pie/line charts with 3 or fewer series
- Simple process diagrams and org charts
Category B: AI-generated + spot-check (10% sample review)
- Standard business charts with 4+ series
- Maps and geographic visualizations
- Standard infographics
- Tables rendered as images
Category C: Human review required (100% review)
- Complex technical diagrams
- Medical or specialized domain images
- Scanned documents or handwritten content
- Context-critical images where meaning depends on surrounding text
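The three tiers above can be expressed as a simple routing table. A sketch, where the image-type labels and tier names are hypothetical, not a RemeDocs API:

```python
from enum import Enum

class Review(Enum):
    AUTO_ACCEPT = "auto-accept"
    SPOT_CHECK = "10% sample review"
    FULL_REVIEW = "100% human review"

# Hypothetical mapping from detected image type to review tier,
# following Categories A-C above.
REVIEW_TIER = {
    "photograph": Review.AUTO_ACCEPT,
    "logo": Review.AUTO_ACCEPT,
    "simple_chart": Review.AUTO_ACCEPT,
    "org_chart": Review.AUTO_ACCEPT,
    "multi_series_chart": Review.SPOT_CHECK,
    "map": Review.SPOT_CHECK,
    "infographic": Review.SPOT_CHECK,
    "table_image": Review.SPOT_CHECK,
    "technical_diagram": Review.FULL_REVIEW,
    "medical_image": Review.FULL_REVIEW,
    "scanned_document": Review.FULL_REVIEW,
    "context_critical": Review.FULL_REVIEW,
}

def route(image_type: str) -> Review:
    # Unknown types default to full review rather than auto-accept:
    # failing safe costs review time, failing open costs compliance.
    return REVIEW_TIER.get(image_type, Review.FULL_REVIEW)
```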
Cost-Benefit Analysis
Using this categorized approach, a typical document set breaks down as:
- 65% Category A: Auto-accept at ~$0.01–$0.02 per image
- 25% Category B: AI + spot-check at ~$0.05–$0.10 per image (including review time)
- 10% Category C: Full human review at ~$0.50–$2.00 per image (depending on complexity)
Average cost per image across all three categories: ~$0.15–$0.25 with AI assistance, vs. $0.50–$2.00 for full manual alt text.
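The blended figure follows directly from the category mix and the midpoint of each cost band:

```python
# Weighted average cost per image, using the category mix above and
# the midpoint of each quoted cost band.
mix = {"A": 0.65, "B": 0.25, "C": 0.10}
cost_mid = {"A": 0.015, "B": 0.075, "C": 1.25}  # midpoints of $0.01-0.02, $0.05-0.10, $0.50-2.00

blended = sum(mix[k] * cost_mid[k] for k in mix)
print(blended)  # ~0.15 per image, the low end of the ~$0.15-$0.25 range
```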
How RemeDocs Approaches Alt Text Generation
Multi-Model Approach
Rather than relying on a single vision model, RemeDocs uses multiple vision models and compares their outputs. If models agree, confidence is high. If they disagree, the image is flagged for human review.
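One cheap way to approximate "do the models agree" is word-set overlap between outputs. A production system would more likely compare embeddings, so treat this as a sketch of the flagging logic, not the actual similarity measure:

```python
def agreement(alt_a: str, alt_b: str) -> float:
    """Jaccard overlap of lowercase word sets between two models'
    alt text outputs. A crude stand-in for a real similarity
    measure such as embedding cosine similarity."""
    a, b = set(alt_a.lower().split()), set(alt_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def needs_review(outputs: list[str], threshold: float = 0.5) -> bool:
    # Flag for human review if any pair of model outputs disagrees.
    return any(
        agreement(outputs[i], outputs[j]) < threshold
        for i in range(len(outputs))
        for j in range(i + 1, len(outputs))
    )
```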
Contextual Awareness
RemeDocs analyzes surrounding text to understand context. If an image appears in a section titled "Budget Trends 2023–2025," the model understands the image should be contextualized within that frame.
Image Category Detection
RemeDocs pre-classifies images (photograph, chart, diagram, table, etc.) and applies category-specific alt text generation rules. A chart gets parsed for axis labels and data values; a photo gets described for its subjects and composition.
Confidence Scoring
Each alt text output gets a confidence score. High-confidence outputs (photographs, simple charts) bypass human review. Medium-confidence outputs get a quick review. Low-confidence outputs (complex diagrams, scanned content) get full expert review.
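A minimal sketch of that routing, assuming a confidence score normalized to [0, 1]; the 0.9 and 0.6 cut points are illustrative and would need tuning against a labeled review set:

```python
def review_tier(confidence: float) -> str:
    """Map a model confidence score in [0, 1] to a review path.
    Thresholds are illustrative, not tuned values."""
    if confidence >= 0.9:
        return "auto-accept"    # photographs, simple charts
    if confidence >= 0.6:
        return "quick review"   # spot-check by a reviewer
    return "expert review"      # complex diagrams, scanned content
```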
Best Practices for Reviewing AI-Generated Alt Text
Rule 1: Does It Convey Functional Purpose?
Read the alt text without looking at the image. Can you understand what the image is conveying to someone using a screen reader? If not, revise.
Rule 2: Is It Appropriately Detailed?
Simple images shouldn't have paragraph-length alt text. Complex diagrams shouldn't be reduced to one sentence. Match detail to complexity.
Rule 3: Don't Repeat Surrounding Text
If the paragraph immediately before the image says "Budget trends show Q4 growth of 15%," the image's alt text doesn't need to repeat this. It should add information the image uniquely conveys.
Rule 4: Be Specific About Data
For charts: include axis labels, key values, and trends. "Chart showing revenue growth" is too vague. "Revenue chart: Q1 $2M, Q2 $2.8M, Q3 $3.5M, showing 75% growth over three quarters" is specific.
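Once a chart's data points have been extracted, specific alt text can be assembled mechanically. A sketch, with a hypothetical chart_alt helper that reproduces the example above:

```python
def chart_alt(title: str, points: list[tuple[str, float]], unit: str = "$") -> str:
    """Assemble specific alt text from a chart's extracted data points,
    following Rule 4: name the values, then state the trend."""
    series = ", ".join(f"{label} {unit}{value:g}M" for label, value in points)
    first, last = points[0][1], points[-1][1]
    growth = (last - first) / first * 100
    return f"{title}: {series}, showing {growth:.0f}% growth over {len(points)} quarters."

chart_alt("Revenue chart", [("Q1", 2), ("Q2", 2.8), ("Q3", 3.5)])
# "Revenue chart: Q1 $2M, Q2 $2.8M, Q3 $3.5M, showing 75% growth over 3 quarters."
```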
Rule 5: Mark Decorative Images Appropriately
If an image is purely decorative (background texture, divider, icon with no semantic meaning), mark it as an artifact in the PDF tag structure, the equivalent of an empty alt attribute in HTML, rather than giving it a description. Vision models often fail at this—they describe the visual content when no description at all is correct.
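A coarse pre-filter can catch some decorative candidates before they reach a vision model at all. A heuristic sketch; the size and aspect-ratio thresholds are illustrative, and borderline cases still need human judgment:

```python
def likely_decorative(width: int, height: int, min_area: int = 2500) -> bool:
    """Heuristic pre-filter for decorative images: very small images
    and extreme-aspect-ratio strips (divider lines, borders) are
    flagged as probably decorative. Thresholds are illustrative."""
    aspect = max(width, height) / max(1, min(width, height))
    return width * height < min_area or aspect > 20

likely_decorative(800, 4)    # True - almost certainly a divider line
likely_decorative(640, 480)  # False - plausibly content, send to the model
```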
The Reality: AI is a Multiplier, Not a Replacement
AI alt text generation is genuinely useful and can handle the majority of PDF images without human review. But it's not magic. It's a tool that handles routine cases efficiently and flags complex cases for expert attention. The most effective approach combines:
- AI generation for speed and baseline quality
- Automated quality scoring to identify low-confidence cases
- Human review for complex or domain-specific images
- SME validation for specialized documents
This hybrid workflow scales remediation to thousands of PDFs while maintaining quality standards that meet WCAG 2.1 AA. It's faster than full manual alt text and more reliable than AI-only automation.