A Complete Guide to PDF Tag Trees: What They Are and Why They Matter

Feb 8, 2025 · 12 min read

By Kate Mitchell, Lead Accessibility Engineer

The tag tree is the hidden skeleton of every accessible PDF. It is a hierarchical, machine-readable structure embedded within the PDF file that describes the document's logical organization. When you open a PDF in a screen reader, the tag tree is what the assistive technology actually reads. Without a proper tag tree, a screen reader encounters only a flat, meaningless stream of characters with no semantic structure — no way to identify headings, paragraphs, lists, tables, or images.

For sighted users, the visual layout of a PDF provides immediate context: you can see the heading, skip to the section you need, spot a table, and understand its purpose at a glance. For blind or low-vision users relying on screen readers, the tag tree is the only way to navigate and understand document structure. A broken or missing tag tree forces them to listen to every word sequentially, making even simple documents exhausting and inaccessible.

Why the tag tree matters for accessibility

The tag tree directly impacts how assistive technology interprets a PDF. When a user navigates with a screen reader, they rely on structure tags to:

Navigate by heading: Screen reader users jump between headings using keyboard shortcuts, just as sighted users scan visually. Proper H1–H6 tags create a navigable outline.
Understand document flow: The tag tree defines the reading order. Without it, text may be read in random or reversed order, making comprehension impossible.
Interpret tables: Complex data tables require proper Table, row, and header cell markup so screen readers can announce row and column headers with each cell.
Identify images and figures: Alt text lives inside the tag tree. Without tags, images are invisible to assistive technology.
Find lists and list items: Nested L and LI tags allow users to navigate lists programmatically.
Access interactive forms: Form fields must be tagged and labeled so users know what each field does and how to complete it.

From a legal and compliance perspective, a proper tag tree is foundational to WCAG 2.1 AA and Section 508 compliance. The U.S. Department of Education's Office for Civil Rights (OCR) consistently cites missing or broken tag trees in accessibility complaint resolutions. More importantly, a well-formed tag tree is the single most effective way to make PDFs actually usable for people with disabilities.

Understanding PDF tag tree anatomy

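A tag tree resembles the DOM in HTML: a hierarchy of nested elements. The tag names below are written in angle brackets for readability; inside the file they are PDF name objects such as /P or /H1. At the root is a Document tag that wraps all content. Inside, structural elements create a logical hierarchy: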

Root and structural elements

<Document> — The root container. Every tagged PDF has exactly one.
<Part> — A major section or chapter. Optional, but useful for long documents.
<Sect> — A logical division within a part, often corresponding to a subsection. (The standard tag name is Sect, not Section.)
<Art> — A standalone article or story, useful in magazines or newsletters. (The standard tag name is Art, not Article.)

Heading hierarchy

<H> — Generic heading (less preferred; specify level when possible).
<H1>, <H2>, ... <H6> — Heading levels. H1 is the document title; H2 starts main sections; H3 starts subsections. Hierarchy must never skip levels (e.g., H1 followed by H3 violates the rule).
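
The no-skipped-levels rule is easy to check mechanically. The sketch below is a minimal illustration, not part of any particular tool: it assumes the tag names have already been extracted into a flat list in reading order, and the function name find_skipped_headings is made up for this example.

```python
import re

def find_skipped_headings(tags):
    """Return (index, previous_level, level) for each heading that skips a level.

    `tags` is a flat list of tag names in reading order, e.g. ["H1", "P", "H3"].
    This input format is an assumption made for the example.
    """
    problems = []
    previous = 0  # 0 means no heading has been seen yet
    for index, name in enumerate(tags):
        match = re.fullmatch(r"H([1-6])", name)
        if not match:
            continue  # not a numbered heading
        level = int(match.group(1))
        # Going deeper by more than one level is a skip; going back up is fine.
        if level > previous + 1:
            problems.append((index, previous, level))
        previous = level
    return problems

print(find_skipped_headings(["H1", "P", "H2", "H3", "H2", "P"]))  # []
print(find_skipped_headings(["H1", "P", "H3", "P"]))              # [(2, 1, 3)]
```

Note that a document opening with H2 is also reported, because the first heading is compared against level 0.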

Text and paragraph elements

<P> — Paragraph. Standard body text goes here.
<BlockQuote> — Extended quotation.
<Caption> — Title or description for a figure or table.
<Span> — Inline text with semantic meaning, like emphasis or code. Used sparingly inside paragraphs.

List elements

<L> — List container (ordered or unordered).
<LI> — List item. Always a direct child of <L>.
<Lbl> — Optional label inside an LI (e.g., "1." or "a."). It holds the bullet or number so that the marker stays separate from the item's content.
<LBody> — Content of the list item. Can contain inline elements or nested lists.
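
The nesting rules above can be expressed as a small table of allowed children. This is a sketch under simplifying assumptions: the tree is modelled as (tag, children) tuples with plain strings for text, and the rule table covers only what this section states (L holds LI; LI holds Lbl and LBody). The names ALLOWED_CHILDREN and list_nesting_errors are hypothetical.

```python
# Allowed direct children, taken from the list rules described above.
ALLOWED_CHILDREN = {
    "L": {"LI"},
    "LI": {"Lbl", "LBody"},
}

def list_nesting_errors(node, path=""):
    """Return a message for every list-nesting violation at or below `node`."""
    tag, children = node
    here = f"{path}/{tag}"
    errors = []
    for child in children:
        if isinstance(child, str):
            continue  # text content, nothing to check
        child_tag = child[0]
        allowed = ALLOWED_CHILDREN.get(tag)
        if allowed is not None and child_tag not in allowed:
            errors.append(f"{here}: unexpected <{child_tag}> inside <{tag}>")
        elif child_tag == "LI" and tag != "L":
            errors.append(f"{here}: <LI> outside <L>")
        errors.extend(list_nesting_errors(child, here))
    return errors

good = ("L", [("LI", [("Lbl", ["1."]), ("LBody", ["Scan the page"])])])
bad = ("L", [("P", ["Scan the page"]), ("LI", [("LBody", ["Run OCR"])])])

print(list_nesting_errors(good))  # []
print(list_nesting_errors(bad))   # ['/L: unexpected <P> inside <L>']
```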

Table elements

<Table> — Table container.
<THead> — Optional header row(s).
<TBody> — Optional body rows.
<TFoot> — Optional footer row(s).
<TR> — Table row.
<TH> — Table header cell. Should carry a Scope attribute (Row, Column, or Both); complex tables with irregular headers may need explicit Headers and ID associations instead.
<TD> — Table data cell.

Table markup is one of the most critical and most frequently broken elements. Screen readers announce the header cell contents when reading each data cell, but only if the TH/Scope relationship is correctly established. Even one missing Scope attribute can leave a row or column unannounced, and in a complex table that is enough to make the data unintelligible for blind users.
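
A check for missing Scope attributes might look like the following. It is a sketch only: the table is modelled as nested dictionaries with "tag", "attrs", "text", and "children" keys (an assumed shape, not a real library's format), and the cell values are made-up sample data.

```python
VALID_SCOPES = {"Row", "Column", "Both"}

def th_without_scope(node):
    """Return the text of every TH cell whose Scope is missing or invalid."""
    bad = []
    if node["tag"] == "TH" and node.get("attrs", {}).get("Scope") not in VALID_SCOPES:
        bad.append(node.get("text", ""))
    for child in node.get("children", []):
        bad.extend(th_without_scope(child))
    return bad

table = {"tag": "Table", "children": [
    {"tag": "TR", "children": [
        {"tag": "TH", "text": "Trial", "attrs": {"Scope": "Column"}},
        {"tag": "TH", "text": "Result"},  # Scope attribute is missing
    ]},
    {"tag": "TR", "children": [
        {"tag": "TD", "text": "1"},
        {"tag": "TD", "text": "Pass"},
    ]},
]}

print(th_without_scope(table))  # ['Result']
```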

Content elements

<Figure> — Container for an image or graphic. Needs alternative text in its /Alt attribute; a <Caption> can supplement it.
<Formula> — Mathematical equation or formula. Should have a descriptive /Alt attribute.
<Artifact> — Decorative or non-semantic content (like a background image or dividing line). Properly tagged artifacts are ignored by screen readers.

Every <Figure> must have an alternative text description. Put it in the /Alt attribute on the Figure tag itself, because that is where PDF/UA checkers and screen readers look for it. A <Caption> child element adds visible context but does not replace /Alt. Providing both is better still.

How to inspect and view tag trees

Adobe Acrobat (Pro)

Adobe Acrobat Pro is the most straightforward tool for tag tree inspection. Open your PDF and navigate to View → Show/Hide → Navigation Panes → Tags (or press Ctrl+6 on Windows, Cmd+6 on Mac). Menu names vary between Acrobat versions; newer releases list the panel as Accessibility tags under Side panels. A side panel opens showing the complete tag tree as a collapsible, hierarchical outline. You can expand and collapse sections, click tags to jump to them in the document, and see the exact nesting structure. This is the primary method used by accessibility professionals to audit PDF structure.

In the Tags panel, look for:

Is the root Document tag present?
Are headings in correct hierarchical order (no skipped levels)?
Do lists contain only LI elements?
Do tables have proper TH headers with scope attributes?
Do images have Figure tags with alt text?
Is the tag tree deeply nested, or is it flat?

PDF Accessibility Checker (PAC)

PAC 2024 is a free, specialized tool originally developed by the Swiss foundation "Access for all." It provides automated PDF accessibility checking and includes a detailed tag tree inspector. PAC visualizes the tag tree alongside the PDF content, making it easy to see how tags map to document elements. PAC also identifies specific tagging errors: missing tags, empty tags, incorrect nesting, and misused artifact tags.

For organizations doing systematic remediation work, PAC is invaluable because it:

Generates a detailed accessibility report with every issue listed and categorized.
Shows tag tree structure in a clear, navigable interface.
Identifies the exact location of each error in the document.
Provides remediation guidance for each issue.
Is free and runs on Windows (there is no native Mac version).

Other tools

PDF/UA validators check conformance with ISO 14289-1, the ISO standard for accessible PDFs. Command-line tools like pdfinfo and Python libraries such as pypdf (the successor to PyPDF2) or pikepdf can programmatically inspect tag tree structure, which is useful for batch validation. However, these technical tools require developer familiarity and don't provide the visual interface that Acrobat and PAC offer.
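
For a first pass over a large folder, even a crude byte scan can separate obviously untagged files from the rest. The sketch below is a rough heuristic, not a validator: it looks for the /StructTreeRoot and /MarkInfo keys in the raw bytes, so it will report a false negative when the document catalog is stored in a compressed object stream. A real check should use a PDF parser such as the libraries named above. The function name triage_pdf_bytes is made up for this example.

```python
import re

def triage_pdf_bytes(data):
    """Heuristic only: report whether tagging markers appear in the raw bytes."""
    return {
        # The catalog of a tagged PDF points at its structure tree root.
        "struct_tree_root": b"/StructTreeRoot" in data,
        # Tagged PDFs also declare /MarkInfo << /Marked true >>.
        "marked": re.search(rb"/Marked\s+true", data) is not None,
    }

tagged = b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
untagged = b"<< /Type /Catalog /Pages 1 0 R >>"

print(triage_pdf_bytes(tagged))    # {'struct_tree_root': True, 'marked': True}
print(triage_pdf_bytes(untagged))  # {'struct_tree_root': False, 'marked': False}
```

To try it on a real file, pass the result of pathlib.Path("report.pdf").read_bytes(), where report.pdf is a placeholder filename. A positive result only says that tags exist, not that they are correct.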

Common tag tree problems and their causes

Flat or missing tag tree

This is the most common failure. A PDF created from a scan with no OCR contains only page images, so there is no text to tag at all. A PDF exported from a tool that doesn't generate tags (like many older Microsoft Word→PDF conversions) contains text but no structure. Either way the document may look fine on screen, while a screen reader finds either nothing to read or an unstructured stream of text.

Cause: The original file was not created with accessibility in mind. Scanning without OCR, exporting from inaccessible templates, or copying text into a tool that strips tagging all produce flat PDFs.

Broken heading hierarchy

Headings jump levels: H1 is followed directly by H3, or H2s are scattered throughout with no clear hierarchy. Screen readers rely on heading levels to build an outline. When levels are skipped, users cannot reliably navigate by heading.

Cause: Source documents have inconsistent heading styles (someone used "Heading 3" for visual emphasis rather than semantic structure). When exported to PDF, these inconsistencies transfer directly into the tag tree.

Tables without scope attributes

A table might have TH and TD tags, but the header cells lack Scope attributes. Screen readers cannot announce which column or row a data cell belongs to.

Cause: Automated PDF conversion tools (like Microsoft Word→PDF) sometimes tag tables structurally but don't set scope attributes. Manual remediation is required.

Empty or placeholder alt text

Images are tagged with <Figure> but the alt text is empty, says "image," "picture," or contains only a filename. This provides no meaningful description.

Cause: Template-based PDF creation tools insert placeholder alt text. Original document creators don't replace it with actual descriptions. Alt text writing requires human judgment and cannot be fully automated.
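
Placeholder alt text of the kinds listed above can be flagged automatically, even though writing the real description cannot. The following is a minimal sketch: the placeholder words come from this section, while the list of file extensions and the function name is_placeholder_alt are assumptions made for the example.

```python
import re

# Placeholder values named in this section; extend the set as needed.
PLACEHOLDERS = {"", "image", "picture"}
# Assumed pattern for "alt text that is only a filename".
FILENAME = re.compile(r"^[\w\-. ]+\.(png|jpe?g|gif|tiff?|bmp|svg)$", re.IGNORECASE)

def is_placeholder_alt(alt):
    """Return True when alt text is empty, a generic word, or just a filename."""
    text = (alt or "").strip()
    return text.lower() in PLACEHOLDERS or bool(FILENAME.match(text))

print(is_placeholder_alt("Image"))         # True
print(is_placeholder_alt("IMG_0042.jpg"))  # True
print(is_placeholder_alt("Graph showing correlation between variables"))  # False
```

A False result means only that the text is not an obvious placeholder; whether it actually describes the image still needs human review.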

Artifact tags misused or missing

Decorative elements (background images, dividing lines, page numbers) are tagged as content, cluttering the document structure for screen reader users. Conversely, actual content is marked as "Artifact" and hidden from assistive technology.

Cause: Automated tagging mistakes. PDF authoring tools struggle to distinguish between decorative and semantic content.

Incorrect tag nesting

Lists contain paragraphs instead of list items; tables are nested inside text blocks; sections lack proper part/section wrapper tags. This confuses screen readers about the document's logical structure.

Cause: Manual tagging errors or tool-generated incorrect nesting. Fixing requires understanding the PDF specification and careful re-tagging.

Reading order issues

The tag tree order doesn't match the visual reading order of the document. A screen reader reads text in tag tree order, not visual order. Multi-column layouts, sidebars, and footnotes are especially prone to reading order problems.

Cause: PDFs with complex layouts (newsletters, magazines, technical documents with sidebars). Automatic tag generation cannot always infer the correct reading order from visual position alone.

How to fix broken tag trees

Manual remediation with Adobe Acrobat Pro

For small documents or targeted fixes, Acrobat's Tags panel allows manual editing:

Right-click a tag to edit its properties (type, attributes).
Drag tags to reorder them or move them to different parents.
Delete incorrect tags and manually add correct ones.
Add or edit alt text on figure tags directly in the Tags panel.

This approach is precise but labor-intensive. A single complex document might take hours to fully remediate. For organizations with thousands of PDFs, manual remediation is not scalable.

Automated remediation tools

Tools like RemeDocs, Acrobat's Make Accessible action, and PDF2HTML converters with re-tagging can automatically analyze PDF content and generate a proper tag tree. Modern ML-based tools:

Detect headings by visual prominence and hierarchy.
Identify and scope table headers automatically.
Recognize list structures from indentation and formatting.
Assign reading order based on visual layout analysis.
Generate placeholder alt text (which humans should review and improve).

Automated remediation is not perfect — tables with unusual layouts, complex figures, and non-standard structures may still require manual review. But it handles the bulk of systematic tagging, particularly for high-volume document sets.

Re-creating the PDF from source

When the original source document (Word, InDesign, etc.) is available and accessible, regenerating the PDF from source is often faster than manual remediation. Ensuring the source file uses proper styles, heading levels, table formatting, and image descriptions, then exporting to PDF with accessibility options enabled, produces a properly tagged PDF in one step.

This approach only works if the source document exists and is accessible. For legacy scanned documents or PDFs created from inaccessible sources, this option is not available.

How automated remediation builds proper tag trees

When RemeDocs or similar tools process a PDF, they perform these steps:

Content analysis

The tool extracts all text, images, and layout information from the PDF, ignoring the original (often missing) tag tree. It analyzes font sizes, weights, colors, and positioning to infer semantic meaning.

Structure inference

The largest, boldest text is likely a document title (H1). Slightly smaller, bold text recurring throughout is likely section headings (H2). Text with consistent indentation and bullets is a list. Tabular arrangements of text are tables.
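
To make the heading heuristic concrete, here is a deliberately simple sketch. It is not how RemeDocs or any specific product works; it only illustrates the idea of ranking bold text by size. The input format (dictionaries with "text", "size", and "bold" keys), the body_size parameter, and the function name infer_heading_levels are all assumptions.

```python
def infer_heading_levels(runs, body_size):
    """Tag bold runs larger than body text as H1, H2, ... by descending size."""
    # Distinct sizes used by bold text above body size, largest first (max six).
    heading_sizes = sorted(
        {r["size"] for r in runs if r["bold"] and r["size"] > body_size},
        reverse=True,
    )[:6]
    level_for_size = {size: f"H{i}" for i, size in enumerate(heading_sizes, start=1)}
    return [
        (level_for_size.get(r["size"], "P") if r["bold"] else "P", r["text"])
        for r in runs
    ]

runs = [
    {"text": "Advances in Quantum Computing", "size": 24, "bold": True},
    {"text": "Abstract paragraph", "size": 11, "bold": False},
    {"text": "Introduction", "size": 16, "bold": True},
    {"text": "Introductory paragraph", "size": 11, "bold": False},
    {"text": "Methodology", "size": 16, "bold": True},
]

for tag, text in infer_heading_levels(runs, body_size=11):
    print(tag, text)
```

Real documents break this heuristic quickly (pull quotes, bold table labels, decorative type), which is why inferred headings still need review.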

Tag generation

The tool creates a new tag tree with appropriate structural tags, builds reading order based on visual position, and assigns heading levels, list nesting, and table scope attributes.

Image and alt text handling

For images, the tool may attempt OCR to extract any embedded text (for charts or diagrams) or generate a generic placeholder description. Organizations should review and improve auto-generated alt text, as it may miss nuance or context.

Validation and output

The remediated PDF is validated against WCAG and PDF/UA standards, and any remaining issues are flagged for human review.

Relationship between tag trees and reading order

The reading order of a screen reader is determined by the tag tree, not the visual layout. This is critical to understand because it explains why visually correct documents can still be inaccessible:

If the tag tree reflects tags in the wrong order, the screen reader will read them in that wrong order, regardless of how the document looks on screen.
For single-column documents, tag order usually matches visual order naturally.
For multi-column layouts, sidebars, or complex grids, visual and tag order often diverge. The tag tree must explicitly define the correct reading order.
A PDF also has a separate content order (the sequence in which objects are drawn in each page's content stream), which some tools use for reflow, but in practice the tag tree is the primary determinant of reading order for screen readers.

A document whose tags are all of the right type but arranged in the wrong sequence is still inaccessible. During remediation, always verify both the tag structure and the reading order.
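
The effect of tag order on reading order can be shown with a depth-first walk of the tree. This sketch assumes the tree is modelled as (tag, children) tuples with plain strings for text; the two-column page is an invented example.

```python
def reading_order(node):
    """Return text leaves in depth-first tag order, ignoring visual position."""
    tag, children = node
    out = []
    for child in children:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(reading_order(child))
    return out

# The order a sighted reader follows on a two-column page.
intended = ["Left 1", "Left 2", "Right 1", "Right 2"]

# Tagged row by row across both columns: every tag is a valid P, but the order is wrong.
mis_tagged = ("Document", [
    ("P", ["Left 1"]), ("P", ["Right 1"]),
    ("P", ["Left 2"]), ("P", ["Right 2"]),
])

# Tagged column by column, matching the intended order.
fixed = ("Document", [
    ("Sect", [("P", ["Left 1"]), ("P", ["Left 2"])]),
    ("Sect", [("P", ["Right 1"]), ("P", ["Right 2"])]),
])

print(reading_order(mis_tagged) == intended)  # False
print(reading_order(fixed) == intended)       # True
```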

Real examples: good vs. bad tag trees

Good tag tree example: Academic paper

A properly tagged academic paper has this structure:

Document
- H1 "Advances in Quantum Computing"
- P Abstract paragraph
- H2 "Introduction"
- P Introductory paragraph
- H2 "Methodology"
- P Methodology text
- Table Experimental results (with TH headers, scope attributes)
- H2 "Results"
- P Results paragraph
- Figure with alt text "Graph showing correlation between variables"
- H2 "Conclusion"
- P Conclusion paragraph

A blind user can navigate this document by pressing the heading key ("H") in their screen reader to jump section-to-section, understand the document flow, and access all data in tables and figures.

Bad tag tree example: Scanned document with no remediation

A PDF created by scanning a printed book with no OCR has this structure:

Document (sometimes even this is missing)
- Page images only, with no text objects and no semantic tags

A screen reader user hears nothing useful: each page is just a picture of text, so the reader typically reports an empty document. Even after OCR adds a text layer, an untagged file is read as one continuous stream with no headings, lists, or tables. The document is completely unusable.

Bad tag tree example: Flat structure from automated conversion

A Word document exported to PDF without proper source formatting becomes:

Document
- P "Advances in Quantum Computing" (visually large, but tagged as body text)
- P Abstract paragraph
- P "Introduction" (should be H2, but is just a paragraph)
- P Introductory paragraph
- P "Methodology"
- P Methodology text
- Table with no header scope attributes
- [... all content flattened into P tags with no hierarchy]

A screen reader user cannot navigate by heading. To find the methodology section, they must listen to every paragraph sequentially. Tables are incomprehensible because header relationships are missing. The document is technically "tagged" but structurally useless.
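
A quick profile of tag counts makes the difference between these examples measurable. The sketch below assumes the same (tag, children) tuple model used for illustration elsewhere; the rule "no heading tags means flat" is a simplification chosen for this example, and the names tag_profile and looks_flat are hypothetical.

```python
from collections import Counter

HEADING_TAGS = ("H", "H1", "H2", "H3", "H4", "H5", "H6")

def tag_profile(node):
    """Return (Counter of tag names, maximum nesting depth) for a tag tree."""
    tag, children = node
    counts = Counter([tag])
    depth = 1
    for child in children:
        if isinstance(child, str):
            continue  # text content does not add a tag or a level
        child_counts, child_depth = tag_profile(child)
        counts += child_counts
        depth = max(depth, 1 + child_depth)
    return counts, depth

def looks_flat(node):
    """Simplified rule: a tree with no heading tags offers no heading navigation."""
    counts, _ = tag_profile(node)
    return not any(name in counts for name in HEADING_TAGS)

flat = ("Document", [
    ("P", ["Advances in Quantum Computing"]),
    ("P", ["Abstract paragraph"]),
    ("P", ["Introduction"]),
    ("Table", []),
])
good = ("Document", [
    ("H1", ["Advances in Quantum Computing"]),
    ("P", ["Abstract paragraph"]),
    ("H2", ["Introduction"]),
])

print(tag_profile(flat)[0]["P"], looks_flat(flat))  # 3 True
print(looks_flat(good))                             # False
```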

Tag trees and compliance

A proper tag tree is not just an accessibility best practice. It is required or expected under the following laws and standards:

WCAG 2.1 Level AA (criterion 1.3.1 Info and Relationships, a Level A criterion that AA conformance includes): Document structure must be conveyed programmatically.
Section 508 of the Rehabilitation Act (technical standards for federal agencies): Electronic documents must have proper structure tags.
ADA Title II and Title III (civil rights for public and private entities): Auxiliary aids and services, including accessible documents, are required.
PDF/UA (ISO 14289-1) (universal accessibility standard): PDFs must have a valid logical structure tree.

Auditors, regulators, and plaintiffs in accessibility litigation specifically examine tag trees. A missing or broken tag tree is immediate evidence of non-compliance and a red flag during accessibility reviews.

Getting started with tag trees

For individual PDFs: Open the file in Adobe Acrobat Pro, open the Tags panel, and spend 15 minutes inspecting the structure. If the panel is empty or the structure is badly broken, your document needs remediation before it can be considered accessible.

For document collections: Run a sample of PDFs through PAC or a similar tool to identify systemic issues. Is the problem flat/missing tags, broken hierarchy, or missing alt text? The answer determines your remediation strategy.

For organizations with high-volume PDFs: Implement automated remediation for bulk processing, then allocate human time to manual review of complex documents. Build a style guide and template library so new PDFs are tagged correctly from creation, not repaired afterward.

The tag tree is invisible to sighted users, but without it the document itself is invisible to screen reader users. Investing in proper tag trees is investing in genuine, not performative, accessibility.
