[thelist] Counting words on a site

Joe Crawford jcrawford at avencom.com
Thu Mar 28 18:58:01 CST 2002


I'd like to address counting words on a website --

I recently did this for a proposal we worked on here at AVENCOM. The
technique I used was this:

I ran linklint <http://www.linklint.org/> against the site. Linklint is
nice because it will give you a full list of the URLs on a site. It can
also help you identify javascript: links, which are otherwise
unspiderable, and multimedia content (Flash and the like), which can't
be counted in an automated way. There's also the danger of words in
graphics with no alt text, which would need to be translated as well.
Those are the pitfalls of depending too much on a robot to do a word
count of a site.

So I took the results linklint generated and built a list of URLs to
feed to lynx, which I ran with -dump -nolist against each one, piping
each dump through the Unix wc command. I kept a running tally for each
URL so I could see each one.
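
The loop was roughly this (a sketch; urls.txt is my name for the
one-URL-per-line list built from the linklint output):

    # dump each page as plain text and count its words
    while read url; do
      count=`lynx -dump -nolist "$url" | wc -w`
      echo "$url $count"
    done < urls.txt > counts.txt

That leaves a url-and-count pair on each line of counts.txt, which is
the running tally.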

The site in question also had a bunch of PDFs (over 200) as well, so I
used curl <http://curl.haxx.se/> to grab each one locally, then used
xpdf's pdftotext to extract the text, which gives you something wc can
count.
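
Give or take, that part looked like this (pdf-urls.txt is a
hypothetical name; the - argument makes pdftotext write to stdout):

    # fetch each PDF and count the words in its extracted text
    while read url; do
      file=`basename "$url"`
      curl -s -o "$file" "$url"
      count=`pdftotext "$file" - | wc -w`
      echo "$url $count"
    done < pdf-urls.txt > pdf-counts.txt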

Luckily the site was almost exclusively HTML and PDF, or this would
have been harder.

THEN I did a sanity check of my results. Note that a lynx dump dumps
the navigation as well, which, strictly speaking, will recur on every
page of a template-driven site. So I compared the lynx/wc numbers
against the word counts I saw in TextPad <http://www.textpad.com/>
(paste text from a page into TextPad, check Properties for another
word count). On that basis I removed about 150 words from each page,
which was the global, redundant stuff.
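
Applying that correction is a one-liner (assuming the url-and-count
pairs from the earlier loop, flooring at zero for very short pages):

    # knock the ~150 words of recurring navigation off each page
    awk '{ c = $2 - 150; if (c < 0) c = 0; print $1, c }' \
        counts.txt > corrected.txt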

I did the same kind of sanity check on some representative PDF files,
and got an offset of about -10%.
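
Same idea for the PDFs, scaling instead of subtracting:

    # apply the roughly -10% offset seen in the spot checks
    awk '{ printf "%s %d\n", $1, $2 * 0.9 }' \
        pdf-counts.txt > pdf-corrected.txt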

I dumped the raw and corrected numbers for each URL into Excel and did
a total. Voila: a word count for each URL, a correction for each, and
totals.
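
(If you want to double-check the Excel total from the shell, summing
the corrected counts is one more awk:

    awk '{ total += $2 } END { print total }' \
        corrected.txt pdf-corrected.txt

Same arithmetic, no spreadsheet.)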

Then in some final QA I checked a few other URLs against my results,
and they were pretty close, so I was happy.

When all was said and done it came out to about a million and a half
words, which told us that for what the prospective client wanted, we
would not be able to do the work for the budget in the RFP -- we just
couldn't figure out how we could break even on it, let alone make
money. We told them so, so some other company will take that work.

I wish it had ended with us bagging the client, but sometimes it's
good to know just when you can't afford to take a client.

And that's me,

	- Joe <personal: http://artlung.com/>
--
Joe Crawford | Web Development and Design for AVENCOM
      m: mailto:jcrawford at avencom.com
      p: 619.230.0241
      w: http://www.avencom.com/
