[thelist] Regex Riddle

Frank Marion lists at frankmarion.com
Sat Aug 28 16:50:18 CDT 2010

I'm trying to figure out a regex that will let me generate search-engine-safe
URLs, but it has to act only on relative URLs in a body of HTML text.
Relative URLs obviously belong to the domain, whereas a full URL points to
another server. I've been busting my butt on this for the last couple of
days with poor results.

Essentially, what I want to do is replace ampersands (& and &amp;) and
equal signs (=) with a forward slash (/). So I'm going from
index.cfm?foo=bar&poo=bear to index.cfm?foo/bar/poo/bear
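
To make the target transformation concrete, here is a minimal sketch for a single URL. It assumes Python rather than ColdFusion, and the helper name flatten_query is hypothetical; it only touches the part after the "?".

```python
import re

def flatten_query(url: str) -> str:
    """Replace '&amp;', '&' and '=' in the query string with '/'."""
    path, sep, query = url.partition("?")
    if not sep:
        return url  # no query string, nothing to rewrite
    # List '&amp;' before '&' so the entity collapses to a single '/'.
    return path + sep + re.sub(r"&amp;|&|=", "/", query)

print(flatten_query("index.cfm?foo=bar&poo=bear"))
# prints: index.cfm?foo/bar/poo/bear
```

Splitting on the "?" first keeps the replacement away from the path, so the regex itself can stay a simple alternation.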

My regexes mostly match the overall string, but capture incorrectly to one
degree or another; the backreferences are all wrong.

Below are my three most (dubiously) successful regexes, the replace test I
use just to check how it all parses, and some sample data to be matched.

Number 3 is probably the closest, but I can't seem to figure out how to get
(\w+)=(\w+)([&|&]*) to be greedy until the end of the query string.

How would you approach this problem?

Try #1: [/\.]*index\.cfm((\?)(([^=]+)=([a-zA-Z0-9]+))(([&|\b&\b]) 
Try #2: [/\.]*index\.cfm(\??)([^=]+)(=)([a-zA-Z0-9]+)([&|\b&\b])*
Try #3: [/\.]*index.cfm\??(\w+)=(\w+)([&|&]*)

Replace parse test:
/index.cfm [\1] [\2] [\3] [\4] [\5] [\6] [\7] [\8] [\9]
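
The same kind of parse test can be sketched in Python (an assumption; the post itself targets ColdFusion) by printing the capture groups of Try #3 against one sample line. The pattern is copied as posted. Because it contains only one key=value group, it can capture at most one pair per match, which may be why the later pairs never show up in the backreferences.

```python
import re

# Try #3 exactly as posted in the message.
TRY3 = re.compile(r'[/\.]*index.cfm\??(\w+)=(\w+)([&|&]*)')

sample = '<a href="../index.cfm?foo=bar&regex=fun">'
m = TRY3.search(sample)

print(m.group(0))  # prints: ../index.cfm?foo=bar&
print(m.groups())  # prints: ('foo', 'bar', '&')
```

The match stops after the first ampersand, so "regex=fun" is left outside the match entirely.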

Sample data to match:

01 <a href="index.cfm">
02 <a href="/index.cfm">
03 <a href="../index.cfm">

04 <a href="index.cfm?foo=bar">
05 <a href="/index.cfm?foo=bar">
06 <a href="../index.cfm?foo=bar">

07 <a href="index.cfm?foo=bar&regex=fun">
08 <a href="/index.cfm?foo=bar&amp;regex=fun">
09 <a href="../index.cfm?foo=bar&regex=fun">

10 <a href="index.cfm?foo=bar&regex=fun&amp;really=itis">
11 <a href="/index.cfm?foo=bar&regex=fun&amp;really=itis">
12 <a href="../index.cfm?foo=bar&regex=fun&amp;really=itis">

13 <a href="index.cfm?method=contact">Contact</a>
14 <a href="/index.cfm?method=contact">Contact</a>
15 <a href="../../index.cfm?method=contact">Contact</a>

<a href="http://www.example.com/index.cfm?method=this&cat=12&amp;that=948 
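
One possible approach to the sample data above, sketched in Python under stated assumptions: use one regex to find only the relative index.cfm links, then rewrite each query string in a replacement callback instead of trying to capture every pair with backreferences. The names HREF and rewrite_relative_links are hypothetical, and the sketch assumes double-quoted href attributes that begin with an optional run of "/", "./" or "../".

```python
import re

# Anchor directly after href=" so absolute URLs (http://...) never match.
# Group 1: relative path ending in index.cfm; group 2: raw query string.
HREF = re.compile(r'href="((?:\.{0,2}/)*index\.cfm)\?([^"]*)"')

def rewrite_relative_links(html: str) -> str:
    """Turn ?a=b&c=d into ?a/b/c/d, on relative index.cfm links only."""
    def repl(m: re.Match) -> str:
        # '&amp;' is listed first so the entity becomes a single '/'.
        query = re.sub(r"&amp;|&|=", "/", m.group(2))
        return f'href="{m.group(1)}?{query}"'
    return HREF.sub(repl, html)

print(rewrite_relative_links(
    '<a href="../index.cfm?foo=bar&regex=fun&amp;really=itis">'))
# prints: <a href="../index.cfm?foo/bar/regex/fun/really/itis">
```

Links without a query string, and links that start with a scheme and host, are left untouched because the pattern requires the path to start right after the opening quote and to be followed by "?".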


Frank Marion
lists [_at_] frankmarion.com
