PDF Accessibility

Main Takeaways

  • PDFs are great for printing. Making them fully accessible is possible, but it can be challenging.
  • Sharing other forms of electronic documents is often the easier way to ensure you’re being as inclusive as possible.
  • "Searchable text" that can be generated using OCR (Optical Character Recognition) is one component of PDF accessibility, but more is needed.
  • Properly remediating for accessibility includes making sure that electronic documents have section headings, images have alt text, and other elements are properly tagged.
  • If you must share a PDF, make sure the source document has all the accessibility features mentioned above, then "export" to PDF rather than "print" to PDF.

Sharing PDF Documents

The Portable Document Format (PDF) is a popular way to share documents that display consistently on screen and when printed. The ability to display and print PDFs is nearly ubiquitous across computing platforms, leading to the popular perception that the format is “accessible” because it is widely supported without needing to buy additional software.

PDFs pose several accessibility challenges for people with disabilities, and PDF is one of the hardest file formats to make accessible. HTML pages, Word Documents, and Google Documents are recommended alternatives to PDFs because they are easier to make compliant with accessibility standards. This Targeted Accessibility Guide (TAG) outlines many of the issues related to PDF accessibility and offers guidance on choosing the most inclusive way forward.

About Accessible PDFs

Understanding how to ensure PDFs are accessible can be complicated because it depends greatly on how they are created. It is helpful to differentiate between text-based documents that are created as PDFs, such as converting an Office-type document to a PDF file, and those that are assembled from other files that don’t include “searchable text”, such as when you create a PDF from multiple JPG image files or by scanning printed pages.

Text in PDF

PDF is designed to produce documents that print consistently. A PDF can combine text, images, and other elements, which makes it hard to determine with a quick spot check whether the text being presented is also accessible.

To be accessible, PDFs require:

  • text that can be accessed through a screen reader or other assistive technology,
  • “tags” that correctly identify the organization of the document such as section headings and reading order, 
  • a linear reading order that clarifies complex layouts (like multiple column news articles), and
  • alternative text to describe visual elements.
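
A rough first pass over these requirements can be scripted. The sketch below is an illustration only and is not part of any tool named in this guide. It searches a PDF's raw bytes for a few standard PDF dictionary keys (/StructTreeRoot, /MarkInfo, /Lang, /Font). The function name scan_pdf_bytes and the sample data are hypothetical. Treat it as a heuristic: many PDFs store these objects in compressed streams where a byte search cannot see them, and finding a marker says nothing about the quality of the tags or text, so a full accessibility checker is still needed.

```python
import re

# Standard PDF dictionary keys that hint at accessibility-related structure.
# Heuristic only: keys stored inside compressed streams will not be found.
MARKERS = {
    "tagged structure": rb"/StructTreeRoot",
    "marked content flag": rb"/MarkInfo",
    "document language": rb"/Lang\b",
    "fonts (text may be present)": rb"/Font\b",
}


def scan_pdf_bytes(data: bytes) -> dict[str, bool]:
    """Report which marker keys appear anywhere in the raw PDF bytes."""
    return {
        label: re.search(pattern, data) is not None
        for label, pattern in MARKERS.items()
    }


# Hypothetical fragment standing in for a real file. To scan a file, pass
# its bytes instead, e.g. scan_pdf_bytes(open("example.pdf", "rb").read()).
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /MarkInfo << /Marked true >> "
    b"/StructTreeRoot 5 0 R /Lang (en-US) >>\nendobj\n"
)
for label, found in scan_pdf_bytes(sample).items():
    print(f"{label}: {'found' if found else 'not found'}")
```

A result of "not found" for fonts is a hint that the file may be image-only and need OCR, while a missing structure tree suggests the file is untagged. Neither result is conclusive on its own.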

This document covers general information that is important to creating accessible PDFs, regardless of the files they are built from.

Optical Character Recognition (OCR)

Scanning a document is similar to the process of making a photocopy: it creates an image of a page. Optical Character Recognition is the process of detecting text in an image and embedding it into the document so that the text is searchable. OCR is a required component of scanning physical documents to create an accessible PDF, but a PDF that only has OCR is not truly accessible. 

Office-type documents, such as Microsoft Word and Google Docs, include searchable text which is retained in the process of conversion to PDF when the proper steps are taken. Therefore, OCR is not required as part of this process because searchable text, order, and structure are preserved accurately. 

If you assemble a PDF from image files of scanned pages, you will need to perform OCR. The quality of the image plays a large role in the accuracy of the searchable text in the PDF. Clearer images of “text” (more pixels, clear contrast, less compression, and fewer distortions) generally result in better outcomes for OCR. 

After running OCR, you can review the result and clean it up. Unreviewed OCR is sometimes called “dirty OCR.” Dirty OCR rarely supports accessibility well, because the resulting text must be accurate, in the right order, and structured. Accuracy and equivalence are benchmarks for accessibility conformance, so, just as with automatic captioning, it is important to verify that OCR content correctly represents the words and structure of the source document.
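
One way to put a number on how "dirty" a page of OCR is would be to proofread a sample by hand and compare it with the raw OCR output. The sketch below is a minimal illustration using Python's standard difflib module. The function name word_accuracy and the sample strings are hypothetical, and the score only reflects words matched in order; it does not assess tags, headings, or other structure.

```python
from difflib import SequenceMatcher


def word_accuracy(ocr_text: str, reference_text: str) -> float:
    """Fraction of reference words that the OCR output matched, in order."""
    ocr_words = ocr_text.split()
    ref_words = reference_text.split()
    if not ref_words:
        # Nothing to compare against: perfect only if the OCR is empty too.
        return 1.0 if not ocr_words else 0.0
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Hypothetical sample: a hand-corrected line and its raw OCR counterpart.
reference = "The Function and Aim of N.C. State College"
raw_ocr = "The Functlon and Aim of N.C. State Col1ege"
print(f"word accuracy: {word_accuracy(raw_ocr, reference):.2f}")
```

A low score on a sample page suggests that the whole document needs review or a better scan before it is shared.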

As noted above, the outcome will depend largely on the quality of the input. Images scanned from microfilm, for example, will likely result in inaccurate OCR and require additional remediation. 

Software

Tools that can make PDFs more accessible include Microsoft Office applications and Adobe Acrobat. Staff computers come installed with Microsoft Office, and Adobe Acrobat Pro DC is available through Software Center (Windows) and Self Service (Mac). You can also generate accessible PDFs from Google Docs using the Grackle extension licensed by the university.

The preferred option for making an accessible PDF is to export a Microsoft Office document that has been authored with recommended practices, including:

  • Proper use of headings
  • Alternative text associated with images
  • Header rows and cells assigned in tables
  • Appropriate color contrast and font size

Similarly, a Google Docs document can be exported using Grackle to create a PDF that preserves accessibility features.

Don’t use the “Print” to PDF option from these tools. The result typically loses the tags and structure that make a PDF accessible, and in some cases it is an image-only PDF with no searchable text at all.

Adobe Acrobat can create PDFs from a variety of file formats, whether they are Office-type documents or image files. Additionally, Acrobat Pro DC can perform OCR.

Additional programs, such as Foxit, PDFtk (toolkit), and PAC, are available for download on the Web. Tools like Tesseract, Google Vision, and Microsoft Azure are command-line or API-based applications that perform OCR but may not have all the features needed to make PDFs fully accessible in other regards. They can be combined with tools like Acrobat to further remediate the PDF for accessibility.
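
As an illustration of the command-line route, the sketch below wraps the Tesseract command that writes a PDF with a searchable text layer. It assumes Tesseract is installed and available on the PATH; the function names and the file name page-001.png are hypothetical. The output has searchable text but no tags, headings, or alternative text, so it still needs remediation in a tool like Acrobat.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, output_base: str,
                            language: str = "eng") -> list[str]:
    """Build the command line for OCR to a PDF with a searchable text layer."""
    # The trailing "pdf" selects PDF output; Tesseract appends ".pdf"
    # to output_base on its own.
    return ["tesseract", image_path, output_base, "-l", language, "pdf"]


def ocr_to_searchable_pdf(image_path: str, output_base: str,
                          language: str = "eng") -> bool:
    """Run Tesseract if it is installed; return False when it is missing."""
    if shutil.which("tesseract") is None:
        return False
    command = build_tesseract_command(image_path, output_base, language)
    subprocess.run(command, check=True)
    return True


# Show the command that would be run for a hypothetical scanned page.
print(" ".join(build_tesseract_command("page-001.png", "page-001")))
```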

Getting Started

Before sharing a document, first consider “Does it need to be a PDF?” 

PDF files are ideal for printing, but web pages and Office-type documents will generally be more accessible. Working from the original source document makes it easier to remediate accessibility barriers. 

You can include the PDF for convenience along with more accessible formats, but use extra diligence if you choose to share only the PDF. Also keep in mind that others who are unaware of the challenges related to PDF accessibility may share the PDF instead of the more accessible version.

Next, consider the quality of source files. 

When you’re creating a PDF from scans of pages, better quality images will produce more accurate OCR results. Office-type documents will provide the most accessible result when the source document is authored with appropriate practices and verified with the built-in tools such as the “Check Accessibility” feature in Microsoft Word.

Finally, verify that any PDF you share with others is accessible.

You can use programs like Adobe Acrobat or PAC to check for accessibility. Tools like Adobe Acrobat will both check and remediate issues in a PDF, though not all issues can be corrected automatically. Manual remediation can take a lot of effort for long or complex documents.

Libraries’ staff will occasionally need to assess PDFs provided by a third party to evaluate resources for license renewal. In this case a tool that only scans for accessibility, like PAC, is more appropriate. If the tool being used also does remediation, make sure that step is disabled or bypassed to get an accurate assessment of the document.

Accessible PDFs: What To Look For

Cornell’s resource Create Accessible PDFs is tremendously helpful in documenting what should be done to make PDFs accessible for users, and how to do it using common software programs. Below is a summary of considerations for assessing and remediating PDF documents with accessibility checker tools:

  • Provide a meaningful title in the title field, e.g., the document title, or text provided in the primary heading (heading level 1).
  • Define the language the document is written in, and tag any sections written in different languages.
  • Tag elements in the document.
    • Ensure document headings are tagged appropriately. 
    • If you have bulleted or ordered lists, structure them properly so a user can navigate through them using a screen reader.
    • If you have tables, structure them properly so a user can navigate through them using a screen reader.
    • If your PDF is a form, each form field needs to be tagged as the correct form field type and have appropriate field description text.
    • Tag links and provide descriptive link text within PDFs so a screen reader can announce links properly.
  • Provide useful alternative text for images.
  • Use bookmarks in long documents to allow users to navigate through them.
  • Define the reading order, which tells a screen reader in what order to follow text and determines how keyboards and other devices will interact with the document.
  • Ensure color contrast is appropriate for low-vision readers. 
  • Use device simulators to manually test how users will experience your document.
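
The color contrast item in the list above can be checked numerically. The sketch below assumes the WCAG 2.x definitions of relative luminance and contrast ratio, which this guide does not name explicitly; the function names and sample colors are illustrative. Under WCAG 2.x, level AA calls for a ratio of at least 4.5:1 for normal-size text.

```python
def _linear_channel(value: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG 2.x formula)."""
    c = value / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color written as '#RRGGBB'."""
    digits = hex_color.lstrip("#")
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return (0.2126 * _linear_channel(r)
            + 0.7152 * _linear_channel(g)
            + 0.0722 * _linear_channel(b))


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, from 1.0 (none) up to about 21.0."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


# Sample check: gray text on a white page against the 4.5:1 AA threshold.
ratio = contrast_ratio("#767676", "#FFFFFF")
print(f"{ratio:.2f}:1, meets AA for normal text: {ratio >= 4.5}")
```

A calculation like this only covers colors you can identify, such as body text on a plain page background. Text over images or gradients in a scanned document still calls for a visual check.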

Content Reading Order

Scanned documents may include columned or other complex layouts that require an extra check to ensure they are presented in the correct order.

[Figure: Scan of a news article, "North Carolina Agriculture and Industry," from 1923. A lithograph of a building is prominently featured beside an article titled "The Function and Aim of N.C. State College."]

The example above is a low-quality, low-contrast scan with tightly packed columns. The image also has distortions from the scanning process. In a multi-column layout, the OCR needs to follow the correct reading order: the heading first, then all of the text from the first column, then the second column. When columns are tightly packed like this, the OCR may fail to detect the columns and instead encode the reading order left to right across both of them, resulting in text that would be impossible to understand.
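
The difference between the two reading orders can be sketched with a few word positions. This is a simplified illustration, not how any particular OCR engine works: the Word type, the coordinates, and the column boundaries are all hypothetical, and the column positions are assumed to be known in advance. Headings that span several columns are not handled.

```python
from bisect import bisect_right
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word
    y: float  # top edge of the word, increasing down the page


def naive_order(words: list[Word]) -> list[str]:
    """Read straight across the page, line by line, ignoring columns."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def column_order(words: list[Word], column_starts: list[float]) -> list[str]:
    """Read each column top to bottom before moving to the next column."""
    def sort_key(w: Word) -> tuple[int, float, float]:
        # Index of the last column whose left edge is at or before the word.
        column = max(bisect_right(column_starts, w.x) - 1, 0)
        return (column, w.y, w.x)
    return [w.text for w in sorted(words, key=sort_key)]


# Two short columns, two lines each; the second column starts at x = 200.
words = [
    Word("The", 10, 100), Word("college", 40, 100),
    Word("Its", 210, 100), Word("aim", 240, 100),
    Word("serves", 10, 120), Word("farmers.", 60, 120),
    Word("is", 210, 120), Word("practical.", 230, 120),
]
print(" ".join(naive_order(words)))
print(" ".join(column_order(words, column_starts=[0, 200])))
```

The first line printed interleaves the two columns, which is the failure described above. The second keeps each column together.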

Manually correcting reading order issues like this is possible, but can be extremely difficult. Selecting the right tools for scanning and rescanning tricky documents with up-to-date tools may yield better results.

The Limitations of Automated Remediation

While many accessibility issues can be detected automatically, every tool also leaves some checks to be performed manually. The quality of alternative text for images, the reading order, and whether elements like tables were properly detected are just a few examples of manual checks to perform after an accessibility scan.

Similarly, advanced tools like Adobe Acrobat can automatically fix a variety of issues, such as performing OCR on a document when searchable text is absent, but verifying the accuracy of the result is a manual process.

Contact

For many academic uses including course readings, the campus Disability Resource Office (DRO) can help students obtain resources that have already been remediated for accessibility. Libraries’ staff and students registered with the DRO can request accessible materials by providing the title, ISBN, and edition needed.

Resources

Please check out the Libraries’ Accessibility Committee’s other Targeted Accessibility Guides (TAGs) for recommendations on other ways to find and remediate accessibility barriers.
