[thelist] FW: Link Rot

Luther, Ron Ron.Luther at hp.com
Wed Aug 12 14:23:09 CDT 2009

Hi Gang,

I'm thinking of maybe expanding this into an article for evolt ... but I thought I would run it by y'all first to see if it was worth developing any further.

I was talking to some folks about a 'knowledge management system' yesterday and got a bit annoyed when I asked about the link rot issue and they immediately threw up their hands and gave up on even pretending to address the problem.  I really hate that "impossible" word, and anyway, it didn't seem like it should be that hard a thing to tackle.  ;-)

It's mostly at a conceptual level, but here is what I thought up this morning.  I think it should work for any large content system.

(1) Start by implementing an internal version of a system like Tinyurl.com.  There are several of these around the web, and they don't look very hard to build.  You input a 'long' URL.  They insert it into a DB table and give you a much shorter [indexed] URL.  When you click on the short URL you are taken to their website, where an application looks up the longer URL in the DB and redirects you to that location.  I suspect that very little coding is actually needed ... and even less for an internal app where you can maim anyone abusing the system!
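
For what it's worth, here is a rough sketch of step (1) in Python, just to show how little code the lookup table needs.  Everything in it is my own assumption rather than how Tinyurl actually works: the SQLite table named links, the base-36 keys derived from the row id, and the function names open_db, shorten, and resolve.  The "http://BigEvolt/" prefix is the made-up internal host used in step (2).

```python
import sqlite3

BASE = "http://BigEvolt/"  # hypothetical internal short-link host
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def open_db(path=":memory:"):
    """Open the database and create the single table of long URLs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " long_url TEXT NOT NULL UNIQUE)"
    )
    return conn


def encode(n):
    """Turn a positive row id into a short base-36 key."""
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits)) or ALPHABET[0]


def shorten(conn, long_url):
    """Store long_url (once) and return its short URL."""
    conn.execute("INSERT OR IGNORE INTO links (long_url) VALUES (?)", (long_url,))
    conn.commit()
    row = conn.execute(
        "SELECT id FROM links WHERE long_url = ?", (long_url,)
    ).fetchone()
    return BASE + encode(row[0])


def resolve(conn, short_url):
    """Return the long URL behind short_url, or None if the key is unknown."""
    key = short_url[len(BASE):] if short_url.startswith(BASE) else short_url
    try:
        link_id = int(key, 36)
    except ValueError:
        return None
    row = conn.execute(
        "SELECT long_url FROM links WHERE id = ?", (link_id,)
    ).fetchone()
    return row[0] if row else None
```

A real deployment would still need a small web handler that calls resolve() and answers with an HTTP redirect to the result; that part is left out here.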

(2) Add some code to the 'validation' or 'publication' approval process to ensure that no non-tiny URLs can be loaded into your content system.  Again, this should be pretty easy to do (if you find "http://blahblah" and not "http://BigEvolt/xx", flag it as an error) and should be fairly easy to incorporate into the workflow 'approval' process.
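
The check in step (2) might look something like the sketch below.  The regular expression is a deliberately crude assumption about what counts as a URL in your content, and find_raw_urls and validate are hypothetical names; a real system would hook this into whatever its approval workflow already calls.

```python
import re

SHORT_PREFIX = "http://BigEvolt/"  # hypothetical internal short-link host

# Crude URL matcher: http(s):// followed by anything that is not
# whitespace, a quote, an angle bracket, or a closing parenthesis.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+", re.IGNORECASE)


def find_raw_urls(text):
    """Return every URL in text that does not go through the short-link host."""
    prefix = SHORT_PREFIX.lower()
    return [u for u in URL_RE.findall(text) if not u.lower().startswith(prefix)]


def validate(text):
    """Return (ok, offenders) so the approval step can reject or report."""
    offenders = find_raw_urls(text)
    return (not offenders, offenders)
```

Calling validate() on a page that links to "http://blahblah/page" returns False together with that URL, while a page that only links to "http://BigEvolt/xx" passes.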

(2B)  For extra credit you should be able to work up a spider to find all existing non-tiny URLs in your system, run them through the process, and automagically replace them in your content, which lets you retrofit systems that already contain a considerable amount of content.  {Yeah, yeah - there may be some pain while you work out how to exclude your include, graphic, and CSS files by naming convention.  I'm not convinced (unless y'all say so) that this is such a biggie.  It may force naming conventions to be a little more formally structured and organized - but for an internal content app that may not be a bad thing.  [If you're aggregating content from dozens of sources where you have no control over their structure, it may not be a workable solution - c'est la vie!] }
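
Here is one way the replacement pass in (2B) could go, as a sketch only.  It assumes assets can be recognised by file extension (the list in ASSET_EXTENSIONS is a guess; your naming convention will differ), and it takes the shortening routine as a parameter called shorten, so any function that maps a long URL to a short one can be plugged in.  The names is_excluded and retrofit are hypothetical.

```python
import re

SHORT_PREFIX = "http://BigEvolt/"  # hypothetical internal short-link host
URL_RE = re.compile(r"https?://[^\s\"'<>)]+", re.IGNORECASE)

# Assumed naming convention: these extensions mark includes, graphics, and CSS.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")


def is_excluded(url):
    """True for links that are already short or that point at asset files."""
    path = url.split("?", 1)[0].split("#", 1)[0].lower()
    return url.startswith(SHORT_PREFIX) or path.endswith(ASSET_EXTENSIONS)


def retrofit(text, shorten):
    """Replace each remaining long URL in text with shorten(url)."""

    def swap(match):
        raw = match.group(0)
        url = raw.rstrip(".,;:!?")  # keep sentence punctuation out of the URL
        tail = raw[len(url):]
        if is_excluded(url):
            return raw
        return shorten(url) + tail

    return URL_RE.sub(swap, text)
```

The spider part (walking every page in the content system and writing the result back) depends entirely on how your content is stored, so it is not shown.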

(3) Now the fun part.  The benefit of an external tinyurl service is that the 'tiny' link is unlikely to wrap and break in an email.  The benefit of an internal implementation of such a service is that it gives you a single database table containing *all* of the reference URLs in your entire content system.  That is a big plus.  Huge.  Once you have that information in one place you can write a routine that checks the links one-by-one and reads the HTTP return status code.  That allows you to generate a report and, for example, send Joel an email containing all of the "404" (or whatever else you choose to test for) broken links, along with identification of the page containing the bad hyperlink and the name of the person who entered that page, so he can track them down and 're-educamate' them.
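
The checking routine in step (3) could be sketched as follows, using only the Python standard library.  I'm assuming each row pulled from the table carries the long URL plus the page and author it came from (the idea above needs those columns, but their layout here is made up), that a HEAD request is good enough (some servers refuse HEAD, in which case you would fall back to GET), and that links which give no answer at all should be reported alongside the 404s.  The names http_status, broken_links, and format_report are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, bad URL


def broken_links(rows, fetch_status=http_status, bad=(404,)):
    """rows: iterable of (long_url, page, author). Return the failing ones."""
    report = []
    for url, page, author in rows:
        status = fetch_status(url)
        if status is None or status in bad:
            report.append(
                {"url": url, "page": page, "author": author, "status": status}
            )
    return report


def format_report(report):
    """Render the failures as plain text, ready to paste into an email."""
    lines = ["Broken links found: %d" % len(report)]
    for item in report:
        status = "no response" if item["status"] is None else item["status"]
        lines.append(
            "%s  [%s]  page: %s  entered by: %s"
            % (item["url"], status, item["page"], item["author"])
        )
    return "\n".join(lines)
```

Passing fetch_status as a parameter is only there so the routine can be tried out against canned results without touching the network.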

Yes?  No?  Any value here?  Or is everyone already using a much simpler procedure that I haven't been clued in on yet?

