[thelist] Offshore Outsourcing

nets at searchtools.com
Wed May 15 00:49:09 CDT 2002


Google can extract text that's already in PDFs generated from
applications such as Word, FrameMaker, or Quark.  But it takes an OCR
program to convert *scanned* images to text (whether they're in PDF
or not), and OCR programs are generally difficult to train on old
documents.
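
To make the distinction concrete, here is a rough sketch (mine, not
anything Google has published) of how one might guess which kind of
PDF a file is. It assumes the PDF's objects are stored uncompressed,
so that font and image markers are visible in the raw bytes; the
function name classify_pdf_bytes is made up for illustration. A
scanned PDF that already carries an OCR text layer would be reported
as "text", and a file with compressed object streams may come back as
"unknown", so treat the result as a hint only.

```python
import re


def classify_pdf_bytes(data: bytes) -> str:
    """Crude guess at whether a PDF carries real text or only page images.

    Returns "text" if a font resource is referenced, "scanned" if only
    image objects are found, and "unknown" otherwise. This is a heuristic
    over raw bytes, not a PDF parser.
    """
    # Text drawn by an application (Word, FrameMaker, Quark...) needs fonts.
    has_fonts = re.search(rb"/Font\b", data) is not None
    # Scanned pages are stored as image XObjects.
    has_images = re.search(rb"/Subtype\s*/Image\b", data) is not None

    if has_fonts:
        return "text"      # extractable without OCR, in principle
    if has_images:
        return "scanned"   # would need OCR to get any text out
    return "unknown"
```

To try it on a file, read the file in binary mode and pass the bytes
in, e.g. classify_pdf_bytes(open("some.pdf", "rb").read()).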

For those who are interested, I have a rant on PDF on the web,
especially as regards searchability, which covers some of this as
well:

   <http://www.searchtools.com/info/pdf.html>

Hope that helps,

Avi

  At 9:48 PM -0700 5/14/02, Marcia Welter wrote:
>Mind you, I'm coming from a point of technical ignorance on this, but
>there's some vague stirring of recognition beginning to lurk with this
>issue. It's got something to do with Google's ability for indexing PDF
>documents, both in their regular SERPs and in their catalogs.
>
>To do that, wouldn't Google have to be able to parse the PDF documents? And
>if they can be parsed, wouldn't there be some technology possible to render
>a conversion to HTML?

--
Complete Guide to Search Engines for Web Sites and Intranets
    <http://www.searchtools.com>


