[Athen] [EXT] Favorite PDF to HTML converter

Terrill Thompson tft at uw.edu
Thu Aug 13 09:48:27 PDT 2020


OK, I decided I would go ahead and take a risk for the benefit of science.

I tested several of the PDF-to-HTML conversion tools that appear on
independent reviewers' "Top 5", "Top 10", etc. lists of PDF-to-HTML
converters. These tools all seem to be focused solely on preserving
the look and feel of the PDF, not the underlying structure (if the PDF
is tagged). Some of them do an impressive job with visual accuracy,
but they do it all with <div> elements, so the HTML output, while it
may look good, is not at all accessible.
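
One rough way to check a converter's output for this problem is to count structural elements against <div> elements. The following is a minimal Python sketch using only the standard library. The names TagCounter and summarize, and the particular set of tags treated as "semantic", are assumptions made for illustration and are not part of any tool tested here.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags treated as carrying document structure. This set is an assumption;
# adjust it to whatever structure matters for your documents.
SEMANTIC_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6",
                 "p", "ul", "ol", "li", "table", "th"}


class TagCounter(HTMLParser):
    """Counts start tags and notes <img> elements with no alt attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        if tag == "img" and "alt" not in dict(attrs):
            self.images_missing_alt += 1


def summarize(html):
    """Return counts of <div> tags, semantic tags, and images lacking alt."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    semantic = sum(n for tag, n in parser.counts.items() if tag in SEMANTIC_TAGS)
    return {
        "div": parser.counts["div"],
        "semantic": semantic,
        "images_missing_alt": parser.images_missing_alt,
    }
```

A result with many <div> elements and a semantic count of zero would match the pattern described above. A count is only a first-pass signal and does not replace testing with a screen reader.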

The only tools I tested that even attempt to preserve the underlying
structure of a tagged PDF are those that actively market themselves as
accessibility solutions, and they're not free. I was looking for a
free solution, but so far I've come up empty-handed.

The three conversion tools/methods I found that preserve tagged structure are:

1. Ally
This produces HTML output that preserves heading structure, alt text
on images, lists, and table headers. Ally is the *only* tool I found
that successfully converts table headers.
The only measure Ally failed on in our test was a complex table with
nested rows and columns. In the tagged PDF, we used Adobe Acrobat's
Table Editor to assign rowspan and colspan, and explicitly associated
table data cells with their relevant table headers. All of this is
reflected in the tagged PDF structure, so I expected it to show up in
the HTML as rowspan, colspan, id, and headers attributes. Instead,
with no rowspan or colspan attributes, Ally's HTML table was a bit of
a mess.

2. SensusAccess
This produces HTML output that preserves heading structure and alt
text on images, but not table headers, and list items get converted to
paragraphs.

3. Save As > Other > HTML or Export to HTML from Adobe Acrobat.
This produces HTML output that preserves heading structure and alt
text on images, but not table headers. In Adobe Acrobat DC, it also
correctly converts lists, which isn't true in earlier versions (I also
tested Acrobat Pro XI, which converted list items to paragraphs). Not
surprisingly, Adobe also attempts to preserve the look and feel by
adding a huge amount of inline CSS to the HTML file. If that's not
your intent, this is an annoyance and would require an extra step to
strip out all the bloat.
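
The extra cleanup step for Acrobat's output could look something like the following. This is a minimal sketch using only the Python standard library, assuming the bloat consists of style attributes and <style> blocks. The names StyleStripper and strip_inline_styles are hypothetical, and the sketch has not been run against real Acrobat exports.

```python
from html import escape
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emits HTML while dropping style attributes and <style> blocks."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_style_block = False

    def _render(self, tag, attrs, closer=""):
        kept = [(k, v) for k, v in attrs if k != "style"]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        return f"<{tag}{rendered}{closer}>"

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style_block = True
        elif not self.in_style_block:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if not self.in_style_block:
            self.out.append(self._render(tag, attrs, " /"))

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style_block = False
        elif not self.in_style_block:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_style_block:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_inline_styles(html):
    """Return the HTML with style attributes and <style> elements removed."""
    parser = StyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Headings, lists, and alt text pass through untouched, so the structure Acrobat did preserve survives the cleanup. Comments and processing instructions are dropped by this sketch, which may or may not matter for a given document.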

Also, since well-tagged PDFs are rare in the wild, I tested each of
these three tools/methods with an untagged document. Our sample
document included one image (a banner image, with text) and
SensusAccess was the only tool that performed OCR on the image and
included the converted text in the HTML document (as a paragraph).
However, SensusAccess was also the only tool of the three that made no
effort to add structure (no headings, lists, or table headers).

Exporting from Adobe Acrobat (either version) resulted in a
decently tagged document, with h1 and h2 headings properly assigned,
and lists properly coded. Our sample document included both a simple
table and a complex table with nested rows and columns. Acrobat got
the rowspan and colspan right for the complex table so the table
looked correct. However, it didn't assign table header elements or any
other accessibility markup to either table.
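
A table can look correct and still lack header markup, so a quick automated check may help when comparing outputs. The following is a minimal Python sketch using only the standard library. The names TableAudit and audit_tables and the report keys are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class TableAudit(HTMLParser):
    """Tallies table cell markup relevant to accessibility."""

    def __init__(self):
        super().__init__()
        self.report = {
            "th": 0,                   # header cells
            "td": 0,                   # data cells
            "spanning_cells": 0,       # cells with rowspan or colspan
            "td_with_headers": 0,      # data cells pointing at header ids
            "th_with_id_or_scope": 0,  # header cells that can be referenced
        }

    def handle_starttag(self, tag, attrs):
        if tag not in ("th", "td"):
            return
        names = dict(attrs)
        self.report[tag] += 1
        if "rowspan" in names or "colspan" in names:
            self.report["spanning_cells"] += 1
        if tag == "td" and "headers" in names:
            self.report["td_with_headers"] += 1
        if tag == "th" and ("id" in names or "scope" in names):
            self.report["th_with_id_or_scope"] += 1


def audit_tables(html):
    """Return a dict of cell counts for all tables in the given HTML."""
    parser = TableAudit()
    parser.feed(html)
    parser.close()
    return parser.report
```

Output like Acrobat's, with correct spans but no header elements, would show a nonzero spanning_cells count alongside zero th cells. A complex table that carried its PDF structure through would also show td_with_headers and th_with_id_or_scope above zero.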

Ally could do everything Acrobat did (headings and lists were properly
coded), but it one-upped Acrobat by properly coding column headers in
the simple table. It severely messed up the complex table though,
which more than anything might just reinforce the notion that complex
tables should be avoided whenever possible.

I suspect there are other tools out there that are comparable to the
three I found. Please share if you know of any.

Thanks!
Terrill

---
Terrill Thompson
Manager, IT Accessibility Team
UW-IT Accessible Technology Services
University of Washington
tft at uw.edu

On Wed, Aug 12, 2020 at 5:18 PM Krista Greear
<krista at inclusiveinstructionaldesign.com> wrote:

>

> If you are a Canvas school and use Ally, you can leverage the fantastic work of Utah State University with their File to Canvas Page tool, available on GitHub. Designed for PDFs in the course though, not necessarily content that lives outside the LMS. But you could bring outside content into the LMS, then leverage the tool, then post publicly.

>

> Christopher Phillips at USU is likely the best point of contact for further questions.

>

>

> On Wed, Aug 12, 2020 at 3:05 PM Michael Nakai <michaelnakai at weber.edu> wrote:

>>

>> We have ABBYY FineReader that we use for processing textbooks. It has HTML Export and as an OCR program can read out text from images in a PDF.

>>

>> On Wed, Aug 12, 2020 at 12:48 PM Hunziker, Dawn A - (hunziker) <hunziker at arizona.edu> wrote:

>>>

>>> Hi all,

>>>

>>> Please reply to the list - We're also trying to move campus away from PDFs for sharing basic information that would be better in HTML format...

>>>

>>> Thank you,

>>> Dawn

>>>

>>> Dawn Hunziker

>>> IT Accessibility Consultant, Sr. | Disability Resources

>>> The University of Arizona | hunziker at arizona.edu

>>> drc.arizona.edu | itaccessibility.arizona.edu

>>> 520-626-9409

>>>

>>> -----Original Message-----

>>> From: athen-list <athen-list-bounces at mailman12.u.washington.edu> On Behalf Of Terrill Thompson

>>> Sent: Wednesday, August 12, 2020 11:33 AM

>>> To: Access Technology Higher Education Network <athen-list at u.washington.edu>

>>> Subject: [EXT][Athen] Favorite PDF to HTML converter

>>>

>>> Hi All,

>>>

>>> We're actively encouraging folks to choose HTML over PDF, which in many cases means converting existing PDFs into HTML. To support that, does anyone have a favorite PDF to HTML converter? A Google search reveals a few zillion choices, and I'm afraid to try most of them as doing so could be a spyware risk.

>>>

>>> I know we can Save As > Other > HTML from Adobe Acrobat, but the output is less than ideal.

>>>

>>> Thanks for any suggestions!

>>>

>>> Terrill

>>>

>>> ---

>>> Terrill Thompson

>>> Manager, IT Accessibility Team

>>> UW-IT Accessible Technology Services

>>> University of Washington

>>> tft at uw.edu

>>> _______________________________________________

>>> athen-list mailing list

>>> athen-list at mailman12.u.washington.edu

>>> http://mailman12.u.washington.edu/mailman/listinfo/athen-list

>>>

>>

>>

>>

>> --

>> Michael Nakai

>> Adaptive Technology

>>

>> Weber State University Disability Services

>> 3885 West Campus Drive Dept 1129

>> Ogden, Utah 84408-1129

>> (801) 626-6413

>>

>> To slow or prevent the spread of COVID-19, Disability Services at both Ogden and Davis campuses are currently closed to visitors.

>>

>> We will remain open remotely through phone and email to provide services and accommodations. Be aware that any calls coming from our office may appear as Private or No Caller ID. For inquiries about testing accommodations, call 801-626-6896 or email dsctesting at weber.edu. See our website, weber.edu/disabilityservices, for the most up-to-date information about our department. Visit the WSU website for the most up-to-date information concerning COVID-19: https://www.weber.edu/coronavirus/default.html

>

>

