[thelist] httrack or maybe wget help?

Sam-I-Am sam at sam-i-am.com
Thu Nov 6 15:24:52 CST 2003


I need to download a site I'm working on in order to take a static 
snapshot to archive and show the client. I have to do this every week 
or so, and sometimes a few times in succession (as the process usually 
throws up bugs I need to fix and then re-download).
I've been using WinHTTrack, but its quirks and lack of flexibility 
have me looking elsewhere. I thought maybe wget could help, but I don't 
see all the options I need (a sketch of what I'd try is below, after 
the list). Here are the things I do like about httrack:

* it's fairly fast (much faster than the Perl script I wrote to do the 
same, being multi-threaded and all)
* it gives some useful options for the filenames and directory structure 
it creates, including creating a filename from the MD5 of the 
querystring - important for me as I have some URLs like 
/?context=codeView and so on
* it does some rewriting of the html to fix paths like the one above
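
Here's roughly what I'd try with wget, pieced together from the man 
page and not properly tested (dev.example.com is just a placeholder 
for the real site):

    wget -m -np -E http://dev.example.com/project/
        # -m  recursive mirror (with timestamping)
        # -np don't wander up above the start directory
        # -E  save server-generated pages with a .html extension
        # -k  would rewrite links, but I'd probably leave it off
        #     since I don't need everything relativized
        # wget 1.9 apparently adds --restrict-file-names=windows,
        #     which might help with the ?querystring filenames

As far as I can tell, though, there's nothing like httrack's 
MD5-of-querystring naming.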

Things I don't like:
* I'd like more control over the URL-to-filename conversion (e.g. to 
convert /?context=codeView to codeView.html) - see the rename sketch 
after this list
* it seems to rewrite the paths in the html whether I like it or not. 
I'd like more control over where and how it does so.
* It's quirky about re-mirroring just a section of the site - in a case 
where I just want to grab changes to one or two files, I end up 
re-scanning the whole thing. Files outside of the current project 
definition are treated as external, and all links to them get garbled.
* Sometimes I get duplicates of my files, where it doesn't spot that a 
file really is the same thing, so I end up with lots of copies like 
image-01.gif, again with the html rewritten accordingly.
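
I could probably fix the worst of the naming myself with a rename pass 
afterwards - something along these lines in cygwin bash (the glob is a 
guess, since the exact name the tool saves /?context=codeView under 
depends on the version and options):

    for f in index.html[?@]context=*; do
        # strip everything up to "context=" and tack .html back on,
        # so index.html?context=codeView becomes codeView.html
        mv -- "$f" "${f##*context=}.html"
    done

but I'd obviously rather the tool got the names right in the first place.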

I drop the output onto a server, so I don't really need _all_ the paths 
relativized. The option to create index.html files from dir/ URLs is a 
good one though.
Another option I'm pursuing is generating a URL list as one step, 
possibly editing it, and feeding it to wget or httrack or some other 
program to do the downloading - a rough version of that is sketched below.
I only want to download the (server-generated) html files. All the other 
stuff I can mirror across more efficiently with Beyond Compare or similar.
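
A rough version of that second step, assuming I can produce urls.txt by 
hand or from the CMS (and as far as I know wget already writes dir/ URLs 
out as dir/index.html):

    wget -i urls.txt -x -nH -E
        # -i  read the list of URLs to fetch from urls.txt
        # -x  recreate the directory structure locally
        # -nH don't prefix everything with the hostname
        # -E  add .html to extensionless, server-generated pages

and that would also take care of only grabbing the html, since the list 
would only contain page URLs.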

Any advice or thoughts? I'm on a Windows box, with most of the cygwin 
tools installed. My Perl isn't really up to this kind of task - I can do 
it, but the result is too slow to be practical. The site has around 2000 
unique URLs/pages, and I'm shooting for 10-20 minutes max.
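
One thought on the speed front: since wget only fetches one URL at a 
time, I could split the URL list and run a handful of wgets in parallel 
from cygwin bash - a rough sketch, with the chunk size picked out of 
the air:

    split -l 200 urls.txt chunk.       # ~10 chunks for ~2000 URLs
    for f in chunk.*; do
        wget -q -i "$f" -x -nH -E &    # one background wget per chunk
    done
    wait                               # let them all finish

No idea yet whether that gets me under the 10-20 minute mark, or whether 
the server will appreciate it.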

TIA,
Sam







