[thelist] Regex Riddle
Frank Marion
lists at frankmarion.com
Sat Aug 28 16:50:18 CDT 2010
I'm trying to figure out a regex that will let me produce search-engine-safe
URLs, but it has to apply only to relative URLs in a body of HTML text.
Relative URLs obviously belong to the domain, whereas a full URL typically
points to another server. I've been busting my butt on this for the last
couple of days with poor results.
Essentially, what I want to do is replace ampersands (& and &amp;) and
equal signs (=) with a forward slash (/), so I'm going from
index.cfm?foo=bar&poo=bear to index.cfm?foo/bar/poo/bear
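To pin down the target transformation on its own, here is a minimal Python sketch. It assumes the two ampersand forms are a bare & and the HTML entity &amp;, and that only the part after the first ? should change. The name flatten_query is made up for illustration.

```python
import re

def flatten_query(url: str) -> str:
    """Replace '&amp;', '&' and '=' after the first '?' with '/'."""
    path, sep, query = url.partition("?")
    if not sep:
        return url  # no query string, nothing to rewrite
    # '&amp;' is listed first so the entity is consumed whole
    return path + sep + re.sub(r"&amp;|[&=]", "/", query)

print(flatten_query("index.cfm?foo=bar&poo=bear"))      # index.cfm?foo/bar/poo/bear
print(flatten_query("index.cfm?foo=bar&amp;poo=bear"))  # index.cfm?foo/bar/poo/bear
```

This only handles a single URL string; picking the relative URLs out of the surrounding HTML is a separate step.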
My regexes mostly match the overall string, but they capture incorrectly to
one degree or another: the backreferences are all wrong.
Below are my three most (dubiously) successful regexes, the replace test I
use just to check how it all parses, and some sample data to be matched.
Number 3 is probably the closest, but I can't seem to figure out how to get
(\w+)=(\w+)([&|&amp;]*) to be greedy until the end of the query string.
How would you approach this problem?
Regex:
Try #1: [/\.]*index\.cfm((\?)(([^=]+)=([a-zA-Z0-9]+))(([&|\b&amp;\b])([^=]+)=([^\"]+)))?
Try #2: [/\.]*index\.cfm(\??)([^=]+)(=)([a-zA-Z0-9]+)([&|\b&amp;\b])*
Try #3: [/\.]*index.cfm\??(\w+)=(\w+)([&|&amp;]*)
Replace parse test:
/index.cfm [\1] [\2] [\3] [\4] [\5] [\6] [\7] [\8] [\9]
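The parse test can be reproduced in Python to see what each group holds. The sketch below is not one of the three attempts: it is a hypothetical variant of Try #3 with the name=value pair wrapped in a repeated group, so that the match runs to the end of the query string. In Python's re module, as in most regex engines, a capture group inside a repetition keeps only its last iteration, so numbered backreferences cannot list an arbitrary number of pairs.

```python
import re

# Hypothetical variant of Try #3: the name=value pair sits inside a
# repeated non-capturing group, so the match extends over every pair.
PAIRS = re.compile(r"[/.]*index\.cfm\??(?:(\w+)=(\w+)(?:&amp;|&)?)*")

def parse_test(href: str):
    """Return the contents of every capture group, or None if nothing matched."""
    match = PAIRS.search(href)
    return match.groups() if match else None

print(parse_test("/index.cfm?foo=bar&regex=fun&really=itis"))
# ('really', 'itis')  <- only the last pair survives in the groups
```

The overall match does cover the whole query string here, but the groups alone are not enough to rebuild it in a single replace expression.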
Sample data to match:
01 <a href="index.cfm">
02 <a href="/index.cfm">
03 <a href="../index.cfm">
04 <a href="index.cfm?foo=bar">
05 <a href="/index.cfm?foo=bar">
06 <a href="../index.cfm?foo=bar">
07 <a href="index.cfm?foo=bar&regex=fun">
08 <a href="/index.cfm?foo=bar&regex=fun">
09 <a href="../index.cfm?foo=bar&regex=fun">
10 <a href="index.cfm?foo=bar&regex=fun&really=itis">
11 <a href="/index.cfm?foo=bar&regex=fun&really=itis">
12 <a href="../index.cfm?foo=bar&regex=fun&really=itis">
13 <a href="index.cfm?method=contact">Contact</a>
14 <a href="/index.cfm?method=contact">Contact</a>
15 <a href="../../index.cfm?method=contact">Contact</a>
<a href="http://www.example.com/index.cfm?method=this&cat=12&that=948">Moar stuff</a>
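One possible approach, sketched in Python rather than in a single replace expression: capture the whole query string of a relative index.cfm link in one group, then rewrite that group in a callback. The names RELATIVE_HREF and rewrite_links are made up for illustration, and the sketch assumes double-quoted, lowercase href attributes like the ones in the sample data.

```python
import re

# Group 1: href=" followed by an optional run of "/" and "." and "index.cfm?"
# Group 2: the whole query string, up to the closing quote
# An absolute URL has "http://..." right after the quote, so it never matches.
RELATIVE_HREF = re.compile(r'(href="[/.]*index\.cfm\?)([^"]*)')

def rewrite_links(html: str) -> str:
    """Flatten the query string of every relative index.cfm link in html."""
    def flatten(match):
        query = re.sub(r"&amp;|[&=]", "/", match.group(2))
        return match.group(1) + query
    return RELATIVE_HREF.sub(flatten, html)

samples = [
    '<a href="index.cfm">',
    '<a href="../index.cfm?foo=bar&regex=fun&really=itis">',
    '<a href="http://www.example.com/index.cfm?method=this&cat=12&that=948">',
]
for line in samples:
    print(rewrite_links(line))
# <a href="index.cfm">
# <a href="../index.cfm?foo/bar/regex/fun/really/itis">
# <a href="http://www.example.com/index.cfm?method=this&cat=12&that=948">
```

Splitting the job into "find the link" and "rewrite the query" sidesteps the need for one backreference per name=value pair. Single-quoted or unquoted attributes would need a broader pattern.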
Thanks!
--
Frank Marion
lists [_at_] frankmarion.com