[thelist] Regex Riddle

Frank Marion lists at frankmarion.com
Sat Aug 28 16:50:18 CDT 2010


I'm trying to figure out a regex that will let me produce search-engine-safe
URLs, but only for relative URLs within a body of HTML text. Relative URLs
obviously belong to the domain, whereas a full URL typically points to
another server. I've been busting my butt on this for the last couple of
days with poor results.

Essentially, what I want to do is replace ampersands (& and &amp;) and
equal signs (=) in the query string with a forward slash (/). So I'm going
from index.cfm?foo=bar&poo=bear to index.cfm?foo/bar/poo/bear
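
To pin down the target transformation on a single URL, here is a minimal
sketch in Python (the site itself is ColdFusion, so this is only an
illustration). The function name slashify_query is made up for the example,
and it assumes everything after the first "?" is the query string.

```python
import re

def slashify_query(url: str) -> str:
    """Replace '&amp;', '&' and '=' in the query string with '/'."""
    path, sep, query = url.partition("?")
    if not sep:
        return url  # no query string, nothing to rewrite
    # '&amp;' is listed first so it is consumed whole, not as '&' plus 'amp;'
    return path + sep + re.sub(r"&amp;|&|=", "/", query)

print(slashify_query("index.cfm?foo=bar&poo=bear"))
# index.cfm?foo/bar/poo/bear
```

This deliberately ignores the relative-versus-absolute question; it only
shows what should happen to the query string once a URL has been selected.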

My regexes mostly match the overall string, but they capture incorrectly to
one degree or another, so the backreferences are all wrong.

Below are my three most (dubiously) successful regexes, the replace test I
use just to check how it all parses, and some sample data to be matched.

Number 3 is probably the closest, but I can't seem to figure out how to get
(\w+)=(\w+)([&|&amp;]*) to be greedy until the end of the query string.

How would you approach this problem?


Regex:
Try #1: [/\.]*index\.cfm((\?)(([^=]+)=([a-zA-Z0-9]+))(([&|\b&amp;\b])([^=]+)=([^\"]+)))?
Try #2: [/\.]*index\.cfm(\??)([^=]+)(=)([a-zA-Z0-9]+)([&|\b&amp;\b])*
Try #3: [/\.]*index.cfm\??(\w+)=(\w+)([&|&amp;]*)



Replace parse test:
/index.cfm [\1] [\2] [\3] [\4] [\5] [\6] [\7] [\8] [\9]


Sample data to match:

01 <a href="index.cfm">
02 <a href="/index.cfm">
03 <a href="../index.cfm">

04 <a href="index.cfm?foo=bar">
05 <a href="/index.cfm?foo=bar">
06 <a href="../index.cfm?foo=bar">

07 <a href="index.cfm?foo=bar&regex=fun">
08 <a href="/index.cfm?foo=bar&amp;regex=fun">
09 <a href="../index.cfm?foo=bar&regex=fun">

10 <a href="index.cfm?foo=bar&regex=fun&amp;really=itis">
11 <a href="/index.cfm?foo=bar&regex=fun&amp;really=itis">
12 <a href="../index.cfm?foo=bar&regex=fun&amp;really=itis">

13 <a href="index.cfm?method=contact">Contact</a>
14 <a href="/index.cfm?method=contact">Contact</a>
15 <a href="../../index.cfm?method=contact">Contact</a>

<a href="http://www.example.com/index.cfm?method=this&cat=12&amp;that=948">Moar stuff</a>
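
One possible way to approach the sample data above, sketched in Python
rather than as a single regex replace. In most regex flavours a repeated
capture group keeps only the text of its last iteration, so numbered
backreferences cannot hold an arbitrary number of key=value pairs. The
sketch therefore captures the whole query string in one group and rewrites
it in a callback. The names HREF and rewrite_relative_links are
hypothetical, and "relative" is assumed to mean an href that starts with
neither a scheme (such as http:) nor "//".

```python
import re

# Matches href="...index.cfm?query" only when the value is relative:
# the negative lookahead rejects a leading scheme ("http:") or "//".
HREF = re.compile(
    r'href="(?![a-zA-Z][a-zA-Z0-9+.-]*:|//)([^"?]*index\.cfm)\?([^"]*)"'
)

def rewrite_relative_links(html: str) -> str:
    """Rewrite the query strings of relative index.cfm links in html."""
    def fix(match):
        path, query = match.group(1), match.group(2)
        # Second pass over the captured query string only.
        query = re.sub(r"&amp;|&|=", "/", query)
        return f'href="{path}?{query}"'
    return HREF.sub(fix, html)

sample = '<a href="../index.cfm?foo=bar&regex=fun&amp;really=itis">'
print(rewrite_relative_links(sample))
# <a href="../index.cfm?foo/bar/regex/fun/really/itis">
```

Links without a query string (samples 01 to 03) and the absolute
example.com link do not match the pattern, so they pass through unchanged.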

Thanks!

--
Frank Marion
lists [_at_] frankmarion.com






