[thelist] httrack or maybe wget help?
Sam-I-Am
sam at sam-i-am.com
Thu Nov 6 15:24:52 CST 2003
I need to download a site I'm working on in order to take a
static snapshot to archive and show the client. I have to do this every
week or so, and sometimes a few times in succession (as the process
usually throws up bugs I need to fix, then re-download).
I've been using WinHTTrack, but its quirks and lack of flexibility
have me looking elsewhere. I thought maybe wget could help, but I don't
see all the options I need. Here are the things I do like about HTTrack:
* it's fairly fast (much faster than the Perl script I wrote to do the
same job, being multi-threaded and all)
* it gives some useful options for the filenames and directory structure
it creates, including creating a filename from the MD5 of the
querystring - important for me, as I have some URLs like
/?context=codeView and so on
* it does some rewriting of the HTML to fix paths like the above
Things I don't like:
* I'd like more control over the URL->filename conversion (e.g. to
convert /?context=codeView to codeView.html)
* it seems to rewrite the paths in the HTML whether I like it or not.
I'd like more control over where and how it does so.
* it's quirky about re-mirroring just a section of the site - in a case
where I just want to grab changes to one or two files, I end up
re-scanning the whole thing. Files outside the current project
definition are treated as external, and all links to them get garbled.
* sometimes I get duplicates of my files, where it doesn't spot that a
file really is the same thing, so I end up with lots of image-01.gif
copies, again with the HTML rewritten accordingly.
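To show what I mean by the URL->filename conversion, here's roughly the mapping I'm after, sketched in shell/sed (the /?context= pattern is just my example from above; real URLs would need a more careful rule):

```shell
# Rough sketch of the conversion I want: /?context=codeView -> codeView.html
# (assumes URLs of exactly this /?key=value shape)
url="/?context=codeView"
name=$(printf '%s' "$url" | sed -e 's|^/?context=||' -e 's|$|.html|')
echo "$name"
```

That's trivial for one pattern, of course - the point is I want to supply rules like this to the mirroring tool rather than get an MD5 hash.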
I drop the output onto a server, so I don't really need _all_ the paths
relativized. The option to create index.html files from dir/ URLs is a
good one, though.
Other options I'm pursuing are generating a URL list as one step,
possibly editing it, and feeding it to wget or HTTrack or some other
program to do the downloading.
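Concretely, I'm imagining something like this (the flags are standard wget long/short options as far as I know, and example.com stands in for the real site):

```shell
# Step 1: a URL list, generated however - here just written by hand:
cat > urls.txt <<'EOF'
http://example.com/
http://example.com/?context=codeView
EOF

# Step 2 (after hand-editing the list): feed it to wget. -nd keeps a
# flat layout, -E saves pages with an .html suffix, -P picks the output
# directory. Not run here, just the shape of the call:
#   wget --input-file=urls.txt -E -nd -P snapshot/

wc -l < urls.txt
```

The appeal is that the edit-the-list step gives me exactly the "just re-grab these two files" control HTTrack doesn't.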
I only want to download the (server-generated) HTML files. All the other
stuff I can mirror across more efficiently with Beyond Compare or similar.
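Failing the URL-list route, maybe a recursive wget with a reject list would do it - this is me guessing at the invocation (the suffix list and host are hypothetical):

```shell
# Compose (but don't actually run here) a recursive fetch that skips
# static assets. --reject matches filename suffixes, so extensionless
# server-generated pages should still come down:
reject='gif,jpg,jpeg,png,css,js,ico'
cmd="wget --recursive --level=inf --reject $reject -E http://example.com/"
echo "$cmd"
```

I haven't tested whether that's fast enough over 2000 pages, which is the other half of the problem.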
Any advice or thoughts? I'm on a Windows box, with most of the Cygwin
tools installed. My Perl isn't really up to this kind of task - I can do
it, but the result is too slow to be practical. The site has around 2000
unique URLs/pages; I'm shooting for 10-20 minutes max.
TIA,
Sam