[thelist] [SPAM] cron job for crawling web pages

Anthony Baratta Anthony at Baratta.com
Thu Apr 6 02:04:50 CDT 2006


What are you going to process the pages for? If you use Swish-e, you can 
spider the pages and index them into an index file or MySQL.

http://swish-e.org/

Either way, the Perl script that does the spidering could be tweaked to 
skip the Swish-e indexing step and instead store the raw HTML for 
whatever processing you want to do later.
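
If you would rather not touch the Swish-e spider at all, the "store the 
raw HTML for later" idea can be sketched on its own. This is only a rough 
illustration in Python, not the Swish-e script: the names fetch_html and 
store_pages and the crawl_store directory are made up for the example, 
and it fetches a fixed list of URLs rather than following links.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_html(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def store_pages(urls, out_dir, fetch=fetch_html):
    """Fetch each URL and save the unprocessed HTML under out_dir.

    Returns a dict mapping each URL to the file it was saved in.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = {}
    for url in urls:
        # Hash the URL to get a stable, filesystem-safe file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        target = out_path / name
        try:
            target.write_bytes(fetch(url))
        except OSError as err:
            # urllib's URLError and socket timeouts are OSError subclasses.
            print(f"skipped {url}: {err}")
            continue
        saved[url] = target
    return saved


# Example call (hypothetical URL list and directory):
# store_pages(["http://swish-e.org/"], "crawl_store")
```

A cron entry would then just run a small script that calls store_pages 
with your list of URLs, and the later processing step reads the saved 
files whenever it suits you.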

Hope that helps.


