[thelist] cron job for crawling web pages

VOLKAN ÖZÇELİK volkan.ozcelik at gmail.com
Thu Apr 6 00:44:10 CDT 2006


Hi List,

The current project I'm working on requires connecting to certain provider
sites and storing the raw HTML from those sites for later processing.

Note that the project is neither illegal nor unethical: the content
provider sites have agreed to let us crawl them.

Secondly, the provider sites do not offer an XML feed of any sort; no RSS,
nothing.
We have to grab the raw pages and mine the data ourselves.

Now come the questions:

Given that we will be using some *nix distribution (Fedora Core 5 most
probably), MySQL for the storage part, and again most probably Apache as the
web server and Ruby on Rails with Ajax for the MVC structure (wow!):

What technology would be appropriate here?
IMHO, writing a servlet and calling it periodically with a cron job would do
it.
Can it be done with Ruby? If so, how hard would that be?
I believe Java is much better than Ruby in terms of low-level networking
capabilities, but I am not sure, since I'm not as experienced in Rails as I
am in Java.
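
To make the question concrete, here is roughly the kind of script I picture
on the Ruby side, using nothing but the standard library's Net::HTTP. The
URLs, the output directory, and the crontab line are placeholders I made up,
not our real providers:

#!/usr/bin/env ruby
# fetch_pages.rb -- cron-driven fetcher; a sketch only. The URLs, the
# output directory and the crontab timing below are made-up placeholders.
# Example crontab entry to run it nightly at 02:15:
#   15 2 * * * /usr/bin/ruby /home/crawler/fetch_pages.rb >> /var/log/crawler.log 2>&1

require 'net/http'
require 'uri'
require 'fileutils'

OUT_DIR = '/var/spool/crawler'   # raw HTML lands here for later mining
FileUtils.mkdir_p(OUT_DIR)

urls = [
  'http://provider-one.example.com/listing.html',
  'http://provider-two.example.com/listing.html',
]

urls.each do |url|
  begin
    html = Net::HTTP.get(URI.parse(url))   # plain GET, returns the body
    name = url.gsub(/[^\w.-]/, '_')        # crude but safe file name
    File.open(File.join(OUT_DIR, name), 'w') { |f| f.write(html) }
  rescue => e
    $stderr.puts "#{url}: #{e.message}"    # log the failure, move on
  end
end

If something that small is all it takes, a servlet container starts to look
like overkill: cron plus a script like the above would handle the fetching,
and the Rails app would only have to read the stored pages (or the MySQL
rows) afterwards. Does that sound right?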

Secondly, though I can theoretically deduce what to do, I have not worked
with a Unix distribution for a long, long time (I'm more of a
win*server/sqlserver/.net guy).
How hard will my life be? What should I expect as the unexpected?

And finally, I will be running VMware on my local machine (because I do not
want to mess up my Win2k OS, since I'm developing .NET projects there). Have
you used it to deploy a *nix distribution? Did it cause any problems?

Thank you very much in advance,
--
Volkan Ozcelik
+>Yep! I'm blogging! : http://www.volkanozcelik.com/volkanozcelik/blog/
+> My projects/studies/trials/errors : http://www.sarmal.com/


