[thelist] Regex Riddle

Bill Moseley moseley at hank.org
Sat Aug 28 23:12:10 CDT 2010


On Sat, Aug 28, 2010 at 2:50 PM, Frank Marion <lists at frankmarion.com> wrote:

> I'm trying to figure out a regex that will enable me to simply do search
> engine safe urls, but it has to do it only on relative urls in a body of
> html text. Relative urls obviously belong to the domain, whereas, typically,
> a full url points to another server. I've been busting my butt on this for
> the last couple days with poor results.
>
> Essentially, what I want to do is to replace ampersands ( & and &amp;) and
> equal signs (=) with a forward slash (/). So essentially, I'm going from
> index.cfm?foo=bar&poo=bear to index.cfm?foo/bar/poo/bear
>

I guess I'd take a different approach.  I'd use code that knows about URLs
and then pull out the parts.   Not 100% clear what you are after, though --
what is a "search engine safe" url?

As an example, using Perl's URI module you could inspect if there's a host
or not (or pass in a base URL and determine if the host is local or not, if
that's what you are after).

$ perl -MURI -lwe 'print "absolute" if URI->new( "http://example.com/foo/bar.html" )->can( "host" )'
absolute
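
For anyone not working in Perl, roughly the same check can be sketched with Python's standard urllib.parse instead of the URI module. The function name is_local and the default base URL are made up for illustration, and treating scheme-only links such as mailto: as non-local is an assumption about what you want.

```python
from urllib.parse import urlsplit

def is_local(href, base="http://example.com/"):
    """Guess whether href stays on the same site as base (hypothetical helper)."""
    parts = urlsplit(href)
    # Links like mailto: or javascript: have a scheme but no host;
    # assume they should be left alone.
    if parts.scheme and not parts.netloc:
        return False
    # A relative reference has no host component at all.
    if parts.hostname is None:
        return True
    # Absolute URL: local only if the host matches the base host.
    return parts.hostname == urlsplit(base).hostname

if __name__ == "__main__":
    for href in ("index.cfm?foo=bar", "http://example.com/a", "http://other.example.org/a"):
        print(href, is_local(href))
```

Comparing against a base host means absolute links back to your own site can be rewritten too, if that is what you are after.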

And likewise you could pull out the query keys and values and join them with
a slash.
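
A minimal sketch of that step, again in Python's urllib.parse rather than Perl. The name slash_query is hypothetical, and the sketch assumes the href has already been un-escaped (no &amp; left in it) and that percent-encoding each piece is acceptable for your URLs.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, quote

def slash_query(href):
    """Rewrite ?a=b&c=d as ?a/b/c/d (hypothetical helper)."""
    parts = urlsplit(href)
    # parse_qsl returns decoded (key, value) pairs in their original order.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Re-encode every key and value so a literal slash inside a value
    # cannot be mistaken for a separator.
    flat = "/".join(quote(piece, safe="") for pair in pairs for piece in pair)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, flat, parts.fragment))

if __name__ == "__main__":
    print(slash_query("index.cfm?foo=bar&poo=bear"))
```

Because the URL is parsed rather than pattern-matched, the path and any fragment are left alone and only the query string changes.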


> 07 <a href="index.cfm?foo=bar&regex=fun">
> 08 <a href="/index.cfm?foo=bar&amp;regex=fun">
>

07 is incorrect, of course, though forgetting to escape ampersands in hrefs
is common practice.  Depending on the tools you use to extract the href from
the markup, it may or may not be un-escaped already.  If it is not, you should
un-escape it first.
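
To illustrate the two cases, here is a small Python sketch using only the standard library. html.unescape decodes a raw href string that was pulled out by hand, while html.parser.HTMLParser hands attribute values to handle_starttag already decoded. HrefCollector and hrefs_from are made-up names for this example.

```python
from html import unescape
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser passes attribute values with entities already decoded.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def hrefs_from(markup):
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs

if __name__ == "__main__":
    # Raw string extracted by hand: decode it yourself.
    print(unescape("/index.cfm?foo=bar&amp;regex=fun"))
    # Extracted by a parser: nothing more to decode.
    print(hrefs_from('<a href="/index.cfm?foo=bar&amp;regex=fun">link</a>'))
```

Note that the bare ampersand in 07 is ambiguous to an entity decoder, since &reg is itself a named character reference, which is one more reason to fix the markup at the source.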



-- 
Bill Moseley
moseley at hank.org

