[thelist] Offshore Outsourcing
nets at searchtools.com
Wed May 15 00:49:09 CDT 2002
Google can extract text that's already in PDFs generated from
applications such as Word, FrameMaker, or Quark. But it takes an OCR
program to convert *scanned* images to text (whether they're in PDF
or not), and OCR programs are generally difficult to train on old
documents.
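
To make the distinction concrete, here is a rough Python sketch (my
own illustration, not anything Google has published) that guesses
whether a PDF was generated from an application or is just a wrapper
around scanned images. It assumes that a PDF which draws real text
declares a /Font resource somewhere in its raw bytes. The function
name looks_text_based is hypothetical, and the check is only a
heuristic: newer PDFs can keep these dictionaries inside compressed
object streams, where a plain byte search will not find them.

```python
def looks_text_based(pdf_bytes: bytes) -> bool:
    """Guess whether a PDF carries extractable text.

    Heuristic (an assumption, not a guarantee): PDFs that draw real
    text declare font resources, so the name /Font appears in the
    file. Image-only scans usually contain /Image XObjects and no
    fonts at all.
    """
    return b"/Font" in pdf_bytes


# Tiny hand-made fragments standing in for real files.
generated = (
    b"%PDF-1.4\n1 0 obj << /Type /Page "
    b"/Resources << /Font << /F1 2 0 R >> >> >> endobj"
)
scanned = (
    b"%PDF-1.4\n1 0 obj << /Type /XObject "
    b"/Subtype /Image /Width 2550 /Height 3300 >> endobj"
)

print(looks_text_based(generated))
print(looks_text_based(scanned))
```

A file that fails this check is a candidate for OCR; one that passes
can usually go straight to a text extractor.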
For those who are interested, I have a rant on PDF on the web,
especially with regard to searchability, which covers some of this
as well:
<http://www.searchtools.com/info/pdf.html>
Hope that helps,
Avi
At 9:48 PM -0700 5/14/02, Marcia Welter wrote:
>Mind you, I'm coming from a point of technical ignorance on this, but
>there's some vague stirring of recognition beginning to lurk with this
>issue. It's got something to do with Google's ability for indexing PDF
>documents, both in their regular SERPs and in their catalogs.
>
>To do that, wouldn't Google have to be able to parse the PDF documents? And
>if they can be parsed, wouldn't there be some technology possible to render
>a conversion to HTML?
--
Complete Guide to Search Engines for Web Sites and Intranets
<http://www.searchtools.com>