[thelist] PHP RegEx
Anthony Baratta
anthony at baratta.com
Wed Jul 11 18:55:54 CDT 2007
Here's an option:
(?<=<a href="https?://)(.[^"]*)(" target="_new">)(.[^<]*)(?=</a>)
Returns:
group(0) www.Company.com" target="_new">Company
group(1) www.Company.com
group(2) " target="_new">
group(3) Company
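
Two notes if you run this through PHP's preg_* functions. First, PCRE has traditionally required every alternative inside a lookbehind to have a fixed length, so `https?` inside `(?<=...)` may not compile as written. Splitting it into two fixed-length alternatives avoids that. Second, on the original question: element 0 of each match set is always the text matched by the whole pattern, and every named group is also stored under its numeric index, which is why the array looks doubled up. The sketch below is only one way to handle it. The group names come from your pattern, while the `$named` variable and the `~` delimiter are my own choices, and it assumes PHP 5.6 or later for `ARRAY_FILTER_USE_KEY`.

```php
<?php
// Sample input copied from the HTML in the quoted message.
$subject = '<a href="http://www.Company.com" target="_new">Company</a>';

// Same idea as the pattern above, but the lookbehind is split into two
// fixed-length alternatives, and the groups are named as in the quoted
// message (domain, bizname).
$pattern = '~(?<=<a href="http://|<a href="https://)'
         . '(?P<domain>[^"]*)" target="_new">(?P<bizname>[^<]*)(?=</a>)~';

$named = [];
if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $set) {
        // $set[0] is the full match and each named group is repeated under
        // a numeric index, so keep only the entries with string keys.
        $named[] = array_filter($set, 'is_string', ARRAY_FILTER_USE_KEY);
    }
}

print_r($named);
```

As far as I know there is no flag that makes preg_match_all leave out element 0, so filtering afterwards (or simply ignoring the numeric keys) is the usual approach.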
-----Original message-----
From: Jon Molesa rjmolesa at consoltec.net
Date: Wed, 11 Jul 2007 10:25:58 -0700
To: Anthony Baratta anthony at baratta.com, "thelist at lists.evolt.org" thelist at lists.evolt.org
Subject: Re: [thelist] PHP RegEx
> The html I'm parsing looks like:
> <font face="Verdana, Arial, Helvetica" size="-2" class="small">
> <a href="http://www.Company.com" target="_new">Company</a>
> </font>
>
> The <a> is optional so 0 or more. But the company name will always be
> there.
>
> I don't really have a problem with my regex as it is working as is, but
> I'm sure it could be improved upon. My question really is why does:
>
> $pattern ='/(?:[.*]*class="small">[\s\n]*)<a\shref="(?:http|https)+(?::\/\/){1}(?P<domain>.*\..*\.(?:com|net|edu|biz|org|info|name|us|cc|tv|gov){1})(?:.*)"\starget="_new">(?P<bizname>.*)<\/a>/';
>
> if(preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
>     print_r($matches);
> }
>
> return this:
>
> Array
> (
> [0] => Array
> (
> [0] => class="small">
> <a href="http://www.Company.com" target="_new">Company</a>
> [domain] => www.Company.com
> [1] => www.Company.com
> [bizname] => Company
> [2] => Company
> )
> )
>
> As you can see I'm getting back too much information. I could forget
> about it and just use the data I need, but I'm really looking for an
> understanding as to why this is happening. Especially the [0][0] part
> of the array. From the php manual and what I've read so far this
> doesn't make sense to me. I would expect to only get [0][domain] and
> [0][bizname]. Any insights or explanations will be appreciated. If
> you have improvements on the regex I'll happily take those too, but for
> now I'm really interested in understanding the php function
> preg_match_all and why it returns so much. Thanks all.
>
> *On Wed, Jul 11, 2007 at 09:50:48AM -0700 Anthony Baratta <anthony at baratta.com> wrote:
>
> > Date: Wed, 11 Jul 2007 09:50:48 -0700
> > From: "Anthony Baratta" <anthony at baratta.com>
> > To: thelist at lists.evolt.org <thelist at lists.evolt.org>
> > X-Mailer: IceWarp Web Mail 5.6.7
> > Subject: Re: [thelist] PHP RegEx
> >
> > Can you post the actual regEx and a few examples of strings you are searching?
> >
> > -----Original message-----
> > From: Jon Molesa rjmolesa at consoltec.net
> > Date: Wed, 11 Jul 2007 09:46:56 -0700
> > To: TheList thelist at lists.evolt.org
> > Subject: [thelist] PHP RegEx
> >
> > > Hey guys, I posted this on regex at googlegroups.com also, but not sure how
> > > used that list is.
> > >
> > > I can't understand why when I use
> > >
> > > preg_match_all('/(regex)/', 'text string', $matches, PREG_SET_ORDER)
> > >
> > > that the entire regex gets returned in the array $matches. Even if
> > > I use (?:regex) on the whole regex I still get the whole thing back.
> > > Is there any way to drop that part? I know I could just ignore it,
> > > when looping through the array, but I'd like a bit more control over it.
> > >
> > > It just seems as if I'm getting back more than I should.
> > > Thanks.
> > >
> > > --
> > > Jon Molesa
> > > rjmolesa at consoltec.net
> > > if you're bored or curious
> > > http://rjmolesa.com
> > > --
> >
> > --
> >
> > * * Please support the community that supports you. * *
> > http://evolt.org/help_support_evolt/
> >
> > For unsubscribe and other options, including the Tip Harvester
> > and archives of thelist go to: http://lists.evolt.org
> > Workers of the Web, evolt !
>
> --
> Jon Molesa
> rjmolesa at consoltec.net
> if you're bored or curious
> http://rjmolesa.com
> --