[thelist] PDF Summary Data...

Mon Feb 11 06:29:01 CST 2002

Hello,

Has anyone ever needed to strip the summary data out of a PDF file?
Well, I do... and it seems to be a little trickier than I first thought  :)

In version 1.4 PDFs, the summary info is stored as an XML (RDF) structure
within the file, e.g.:
  <rdf:Description about=''
   xmlns='http://ns.adobe.com/pdf/1.3/'
   xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
   <pdf:ModDate>2002-02-08T16:10:47Z</pdf:ModDate>
   <pdf:CreationDate>2002-02-08T16:09:57Z</pdf:CreationDate>
   <pdf:Producer>Acrobat Web Capture 5.0</pdf:Producer>
   <pdf:Title>This is a test title</pdf:Title>
   <pdf:Subject>My Subject</pdf:Subject>
   <pdf:Author>Mike King</pdf:Author>
  </rdf:Description>

Now, you can get at this information if you open the PDF in a text editor,
but when I try to ereg through it with PHP I can't get a match  :(
I've tried opening the file in both binary and text mode, and it makes no
difference. strstr can find the start of the tags, but I want to ereg out
the values of all of the tags!
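
To make it concrete, here is a minimal sketch of the kind of extraction I'm
after, written in Python rather than PHP purely as an illustration. It
assumes the metadata sits uncompressed in the file, as in the sample above,
and that every value is a simple <pdf:Name>text</pdf:Name> element. The names
PDF_TAG, extract_summary and extract_summary_from_file are made up for this
example.

```python
import re

# Matches <pdf:Name>value</pdf:Name>. A bytes pattern is used so the search
# runs over the raw file contents, including any binary data around the XML.
PDF_TAG = re.compile(rb"<pdf:(\w+)>(.*?)</pdf:\1>", re.DOTALL)


def extract_summary(data: bytes) -> dict:
    """Return {tag name: value} for every pdf: element found in data."""
    summary = {}
    for name, value in PDF_TAG.findall(data):
        # Tag names are ASCII; values are assumed (not guaranteed) to be UTF-8.
        text = value.decode("utf-8", errors="replace").strip()
        summary[name.decode("ascii")] = text
    return summary


def extract_summary_from_file(path: str) -> dict:
    """Read the file in binary mode and extract the summary fields."""
    with open(path, "rb") as fh:
        return extract_summary(fh.read())
```

Calling extract_summary_from_file on a PDF containing the sample above should
give a dictionary with keys such as Title, Subject and Author. What I want is
the equivalent result from PHP.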

I'm not worried about pre-1.4 PDFs, 'cause hopefully all 1,035 of them will
be upgraded soon  :)

Anyone got any ideas?

Cheers
mk