[Athen] Title, tags, lang - where are they in a PDF document? Beginning or end?

Chagnon | PubCom chagnon at pubcom.com
Wed May 24 11:45:56 PDT 2017


Hi Corrine,

Maybe this can help.



“Title” can be several things.

* The visual text title on the cover.
* The H1 tag in a PDF’s tag tree.
* The Title field in the PDF’s metadata (File/Properties and select the 1st thumbtab at the top, Description).



The only one of the above that is stored “at the beginning” of the document is the visual text title. Everything else is contained within the code of the PDF file itself.
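
To make “within the code of the PDF file” a little more concrete: in the PDF file format, a reader normally starts from the end of the file, where a `startxref` line gives the byte offset of the cross-reference data. From there it locates the document catalog and the Info dictionary. Below is a minimal Python sketch of that first step, using only the standard library. The function name `find_startxref`, the 2048-byte tail size, and the hand-made sample bytes are illustrative assumptions, not part of Acrobat or any PDF library.

```python
import re

def find_startxref(data: bytes, tail_size: int = 2048):
    """Return the byte offset given by the last 'startxref' entry, or None.

    Only the tail of the file is inspected, because that is where the
    keyword normally sits. The tail size is an arbitrary assumption.
    """
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    return int(matches[-1]) if matches else None

# Hand-made stand-in for the end of a PDF file (not a complete, valid PDF).
sample = (
    b"%PDF-1.4\n"
    b"...objects and page content...\n"
    b"trailer\n<< /Root 1 0 R /Info 2 0 R >>\n"
    b"startxref\n1234\n%%EOF\n"
)

print(find_startxref(sample))
```

The practical point for a scan that wants to save CPU and memory is that the title, tags, and language are not tied to the first few pages, so limiting a check to pages 1-10 does not line up with how the file is organized.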



The tags are stored within the content (the concept of at the beginning or end doesn’t mean anything). Open the tags panel to view the tags tree and see if your scan produced tags, and if so, which tags and if they are in the correct, logical reading order. You can view the tags panel on the left side of Acrobat by View / Show-Hide / Navigation Panes / Tags.



The language attribute is also set and viewed in the PDF’s metadata. (File/Properties and select the last thumbtab at the top, Advanced.)
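
For an automated check, the three items above correspond to keys inside the PDF’s internal dictionaries: `/Title` in the Info dictionary, `/StructTreeRoot` and `/MarkInfo` in the document catalog for tags, and `/Lang` in the catalog for language. Below is a rough Python sketch of a byte-level check, assuming the dictionaries are stored uncompressed. The names `quick_pdf_flags` and `_find_lang` and the sample bytes are hypothetical. In files that use compressed object streams (PDF 1.5 and later) this search can miss the keys, and `/Title` also appears in bookmark entries, so a real PDF parser would be more reliable.

```python
import re

def _find_lang(data: bytes):
    """Return the first literal-string /Lang value found, or None."""
    m = re.search(rb"/Lang\s*\(([^)]*)\)", data)
    return m.group(1).decode("latin-1") if m else None

def quick_pdf_flags(data: bytes) -> dict:
    """Heuristic flags from a raw byte search (assumes uncompressed objects)."""
    return {
        # May also match bookmark titles, so treat as a hint only.
        "has_title_key": re.search(rb"/Title\s*[(<]", data) is not None,
        # A structure tree root is what the Tags panel displays.
        "has_struct_tree": b"/StructTreeRoot" in data,
        # /MarkInfo << /Marked true >> declares the file as tagged.
        "marked": re.search(rb"/Marked\s+true", data) is not None,
        "lang": _find_lang(data),
    }

# Hand-made fragment that imitates a catalog and an Info dictionary.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Lang (en-US) "
    b"/MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Title (Course Syllabus) >>\nendobj\n"
)

print(quick_pdf_flags(sample))
```

A `True` for `has_struct_tree` only says that tags exist. It says nothing about whether they are the right tags or in a logical reading order, which still needs a look at the tags tree.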

I think we’re all confused by your question!

“this is a scan using code not a physical scanner” doesn’t make sense to me. Technically, a scan is the result of a tangible product (like a book or printed document) being captured by a scanner. I’ve been in the publishing industry for decades (worked with original scanners) and have never heard of a program that scans a digital file.



Maybe you mean something like convert a digital file? There are several processes that can be run on a file:

* OCR dead (graphical, scanned) text to create live, machine-readable text.
* Add tags to a PDF that has live readable text.



I’m also wondering how you made a file out of the Moodle instance, as there are many ways to do that, too. Maybe another process would produce a better result for your needs.



--Bevi Chagnon



— — —

Bevi Chagnon | <http://www.pubcom.com/> www.PubCom.com

Technologists, Consultants, Trainers, Designers, and Developers

for publishing & communication


| Acrobat PDF | Print | EPUBS | Sec. 508 Accessibility |


— — —









From: athen-list [mailto:athen-list-bounces at mailman13.u.washington.edu] On Behalf Of Corrine Schoeb
Sent: Wednesday, May 24, 2017 11:17 AM
To: athen-list at u.washington.edu
Subject: Re: [Athen] Title, tags, lang - where are they in a PDF document? Beginning or end?



Thank you to everyone who has responded so far.



I think I need to clarify - this is a scan using code, not a physical scanner. We've developed a scan for our Moodle instance. Right now, it can recognize text vs. an image of text, but we are working on refining that scan further. Large documents take up a lot of CPU/memory, so we are thinking we might be able to limit our scan to the first 5-10 pages to see if there is a title, tags, etc. I'm just not sure where that data is stored - at the beginning or at the end of the PDF.



I know this is a very technical question and a bit obscure, but I figured this might be the right group.





On Wed, May 24, 2017 at 8:34 AM, Corrine Schoeb <kschoeb1 at swarthmore.edu> wrote:

We are working on creating a scan of PDF documents, some of which are 100+ pages. Rather than scan the full document to find out if it is tagged and has a title and language, we thought we might be able to do the first 5-10 pages, but I'm not sure where the title, tag, lang data is stored in a PDF.



So my question is, are the title, tag, and lang attributes of a PDF stored at the beginning of a PDF or at the end?



--



Corrine Schoeb
Technology Accessibility Coordinator, ITS

610-957-6208



*** Swarthmore College ITS will never ask you for your password, including by email. Please keep your passwords private to protect yourself and the security of our network.



To learn more about web security visit http://www.swarthmore.edu/its/security











