[thelist] [SPAM] cron job for crawling web pages
Anthony Baratta
Anthony at Baratta.com
Thu Apr 6 02:04:50 CDT 2006
What are you going to process the pages for? If you use Swish-e, you can
spider the pages and index them into an index file or MySQL.
http://swish-e.org/
Either way, the Perl script that does the spidering could be tweaked to
skip the Swish-e indexing step and instead store the raw HTML for
whatever processing you want to do later.
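
To illustrate the "store now, process later" idea, here is a rough sketch in Python. It is not the Swish-e spider and does not use any Swish-e code; it only shows the storing step. The names store_page and fetch_html, the crawl_store directory, and the URL list are all made up for the example.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_html(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def store_page(url, out_dir, fetch=fetch_html):
    """Fetch a URL and save the unprocessed HTML under out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any address maps to a safe, stable file name.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    target = out_path / name
    target.write_bytes(fetch(url))
    return target


if __name__ == "__main__":
    # Hypothetical URL list and output directory; replace with your own.
    for page_url in ["http://swish-e.org/"]:
        print(store_page(page_url, "crawl_store"))
```

Because the file name is derived from the URL, a repeat run from cron overwrites the earlier copy of each page rather than piling up duplicates. Whatever processing you decide on later can then read the files in crawl_store without touching the network.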
Hope that helps.