[thelist] [ColdFusion] Regular Expression Questions

.jeff jeff at members.evolt.org
Mon Feb 24 17:14:01 CST 2003


hey all,

got a regex question.  i have a string of text in which i'm fixing up links
to append some session-related information.  i'm looking for a few
different things -- relative links, root-relative links, and absolute
links that match a list of predetermined domain names.  i've got the logic
figured out for handling the searching and replacing for each of these
separately.  right now, my difficulty is in getting the expressions
correct so that they don't match domains that merely contain the target
string but aren't actually the right domain.  let's assume the domain i'm
looking for is example.com.

i've used '<a href="(https?:\/\/[www\.]?example\.com[^"]+)', but don't like
it because it's too specific.  if, for whatever reason, we were to add
another subdomain to the mix, links that used that new, non-www subdomain
would stop working.  for example, we already use 'secure.' for one of the
sites.  i really don't want a subexpression like '[www\.|secure\.]?' either.
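
one thing worth checking with the first attempt: square brackets define a
character class (one character out of a set), not a group, so '[www\.]?'
reads as "an optional single w or dot" rather than "an optional www.".  a
minimal sketch in python's re module, not cf, so how cf behaves here is an
assumption; the sample links are made up for illustration:

```python
import re

# first attempt from above, unchanged
first = re.compile(r'<a href="(https?:\/\/[www\.]?example\.com[^"]+)')

# hypothetical sample links, one per case of interest
samples = {
    'bare': '<a href="http://example.com/page">',
    'one_w': '<a href="http://wexample.com/page">',
    'www': '<a href="http://www.example.com/page">',
}

# True where the pattern finds a match in the tag
results = {name: bool(first.search(tag)) for name, tag in samples.items()}
```

in python, 'bare' and 'one_w' match and 'www' does not, so the pattern is
probably worth re-testing in cf against a real www link.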

i've used '<a href="(https?:\/\/[a-z0-9\-\.]*example\.com[^"]+)', but don't
like it because it matches things like 'http://wwwexample.com' which is
obviously not the desired effect.
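
to make the problem with the second attempt concrete, here is a small
sketch, again in python rather than cf (so the syntax is an assumption as
far as cf goes) and with made-up sample links.  nothing in the class
'[a-z0-9\-\.]*' forces the run of characters to end in a dot:

```python
import re

# second attempt from above, unchanged
second = re.compile(r'<a href="(https?:\/\/[a-z0-9\-\.]*example\.com[^"]+)')

# hypothetical sample links
samples = [
    '<a href="http://www.example.com/page">',
    '<a href="http://example.com/page">',
    '<a href="http://wwwexample.com/page">',
]

# all three match, including the unwanted wwwexample.com
matched = [bool(second.search(tag)) for tag in samples]
```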

i've used '<a href="(https?:\/\/[[a-z0-9\-]+\.]?example\.com[^"]+)' and
'<a href="(https?:\/\/[[a-z0-9\-]*\.]?example\.com[^"]+)', but don't like
either because even though they match things like 'http://www.example.com',
they don't seem to match 'http://example.com'.

so, i guess i'm looking for a subexpression that says "if there are any of
the characters in the range of a-z, 0-9, and - (the only valid characters
for a subdomain), then they must be followed by a dot, but the existence of
the subdomain is optional".
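
one way to express that rule is to put the label and its trailing dot in
parentheses and make the whole group optional.  this is only a sketch in
python's re module, not cf; the helper name matches() and the test urls are
hypothetical, and the rest of the pattern is carried over from the attempts
above:

```python
import re

# ([a-z0-9-]+\.)? : one or more subdomain characters followed by a dot,
# with the whole group optional, so the dot is only required when a
# subdomain is actually present
link = re.compile(r'<a href="(https?://([a-z0-9-]+\.)?example\.com[^"]+)')

def matches(url):
    """Return True if the pattern matches an anchor tag with this href."""
    return link.search('<a href="%s">' % url) is not None
```

with this, 'http://example.com/page', 'http://www.example.com/page' and
'https://secure.example.com/cart' match, while 'http://wwwexample.com/page'
does not.  changing the '?' after the group to '*' would also allow more
than one level, such as 'a.b.example.com'.  the inner parentheses add a
second captured subexpression, so the full url is still \1 and the
subdomain becomes \2.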

so, next question is how to find relative and root-relative links in the
document.  obviously, '<a href="(\/?[^"]+)' is far too general, since it
matches absolute links as well.
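
one possible approach, sketched in python rather than cf, is to capture
every href with a loose pattern and then sort the values by how they start,
instead of trying to do it all in one expression.  the names HREF, SCHEME,
classify() and find_links() are hypothetical, and treating any
'scheme:' prefix (http:, mailto:, and so on) as absolute is an assumption:

```python
import re

# loose capture of every double-quoted href value
HREF = re.compile(r'<a href="([^"]+)"')
# a url scheme: a letter, then letters, digits, +, . or -, then a colon
SCHEME = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.\-]*:')

def classify(href):
    """Label an href as absolute, fragment, root-relative or relative."""
    if href.startswith('//') or SCHEME.match(href):
        return 'absolute'
    if href.startswith('#'):
        return 'fragment'
    if href.startswith('/'):
        return 'root-relative'
    return 'relative'

def find_links(html):
    """Return (href, kind) pairs for every anchor found in html."""
    return [(m.group(1), classify(m.group(1))) for m in HREF.finditer(html)]
```

for example, find_links('<a href="/about/">x</a> <a href="page.html">y</a>')
gives one root-relative and one relative link.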

an added bonus would be a regex i could use to quote all unquoted attribute values.
i've already got one that'll swap quotes from single to double (find
'=''([^'']*)''' and replace with '="\1"').  if i don't clean up attribute
values first, i'm likely to miss some links that should be fixed.
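
a rough sketch of the quoting step, in python rather than cf.  it relies on
a replacement callback, which is a python feature and may not carry over to
cf directly.  the names TAG, ATTR and quote_attributes() are hypothetical,
and it assumes no attribute value contains a '>' character:

```python
import re

# an opening tag: '<', a letter, then everything up to the first '>'
TAG = re.compile(r'<[a-zA-Z][^>]*>')
# name=value, where value is double-quoted, single-quoted, or bare;
# quoted values are consumed whole so a '=' inside them is left alone
ATTR = re.compile(r'''(\s[\w:-]+)=("[^"]*"|'[^']*'|[^\s"'>]+)''')

def _quote(m):
    name, value = m.group(1), m.group(2)
    if value[0] in '"\'':
        return m.group(0)            # already quoted, leave untouched
    return f'{name}="{value}"'       # wrap the bare value in double quotes

def quote_attributes(html):
    """Quote bare attribute values inside opening tags only."""
    return TAG.sub(lambda t: ATTR.sub(_quote, t.group(0)), html)
```

single-quoted values are left as they are here, so the single-to-double
swap would still be run as a separate step.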

i'm using coldfusion.  non-cf explanations will still be helpful, but any
syntax that differs from cf's will likely be frustrating.

thanks,

.jeff

http://evolt.org/
jeff at members.evolt.org
http://members.evolt.org/jeff/



