[thelist] PHP RegEx

Jon Molesa rjmolesa at consoltec.net
Wed Jul 11 12:25:58 CDT 2007


The HTML I'm parsing looks like:
	<font face="Verdana, Arial, Helvetica" size="-2" class="small">
		<a href="http://www.Company.com" target="_new">Company</a>
	</font>

The <a> is optional (zero or one occurrence), but the company name will
always be there.
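
To make the "optional link" case concrete, here is a minimal sketch of one way both variants could be matched. It assumes that when there is no link, the name sits directly inside the <font> element (only the linked form is shown above, so that layout is a guess). The helper name extract_company and the simplified pattern are made up for illustration and are not the regex I'm actually using.

```php
<?php
// Hypothetical helper: returns array('url' => ..., 'bizname' => ...) or null.
function extract_company($html)
{
    // Assumed structure: optional <a ...> wrapper, then the name, then </font>.
    $pattern = '/class="small">\s*'
             . '(?:<a\s+href="(?P<url>[^"]*)"[^>]*>)?'  // optional opening link
             . '\s*(?P<bizname>[^<]+?)\s*'              // company name, trimmed
             . '(?:<\/a>\s*)?<\/font>/';                // optional closing link
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    return array('url' => $m['url'], 'bizname' => $m['bizname']);
}

// Sample inputs: the linked form from above, and an assumed unlinked form.
$linked = '<font face="Verdana, Arial, Helvetica" size="-2" class="small">
        <a href="http://www.Company.com" target="_new">Company</a>
    </font>';
$plain = '<font face="Verdana, Arial, Helvetica" size="-2" class="small">
        Company
    </font>';

print_r(extract_company($linked));
print_r(extract_company($plain));
```

In the unlinked case the url entry comes back as an empty string, because the optional group did not take part in the match.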

My regex works as is, though I'm sure it could be improved.  My real
question is why does:

	$pattern ='/(?:[.*]*class="small">[\s\n]*)<a\shref="(?:http|https)+(?::\/\/){1}(?P<domain>.*\..*\.(?:com|net|edu|biz|org|info|name|us|cc|tv|gov){1})(?:.*)"\starget="_new">(?P<bizname>.*)<\/a>/';

	if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
		print_r($matches);
	}

	return this:

	Array
	(
		[0] => Array
			(
				[0] => class="small">
																					<a href="http://www.Company.com" target="_new">Company</a>
				[domain] => www.Company.com
				[1] => www.Company.com
				[bizname] => Company
				[2] => Company
			)
	)

As you can see, I'm getting back more than I asked for.  I could ignore
the extras and just use the data I need, but I'd like to understand why
this happens, especially the [0][0] element of the array.  From the PHP
manual and what I've read so far, this doesn't make sense to me: I would
expect to get only [0][domain] and [0][bizname].  Any insights or
explanations would be appreciated.  If you have improvements to the
regex I'll happily take those too, but for now I'm mainly interested in
understanding the PHP function preg_match_all and why it returns so
much.  Thanks all.
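
For anyone who wants to reproduce this, here is a minimal sketch of the behavior and of one possible workaround. As far as I can tell, element 0 of each match set always holds the text matched by the whole pattern, and each named group is reported twice: once under its name and once under its numeric index. The pattern below is a simplified stand-in (an assumption, not my original regex), and named_groups_only is a hypothetical helper that just drops the integer keys.

```php
<?php
// Sample input modeled on the HTML quoted earlier in this message.
$subject = '<font face="Verdana, Arial, Helvetica" size="-2" class="small">
        <a href="http://www.Company.com" target="_new">Company</a>
    </font>';

// Simplified stand-in pattern with the same two named groups.
$pattern = '/class="small">\s*<a\shref="https?:\/\/(?P<domain>[^"\/]+)[^"]*"'
         . '\starget="_new">(?P<bizname>[^<]*)<\/a>/';

// Hypothetical helper: keep only the string-keyed (named) entries of one set.
function named_groups_only(array $set)
{
    $named = array();
    foreach ($set as $key => $value) {
        if (is_string($key)) {
            $named[$key] = $value;
        }
    }
    return $named;
}

preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

// $matches[0] has five entries: 0, domain, 1, bizname, 2.
// After filtering, only domain and bizname remain.
$clean = array_map('named_groups_only', $matches);
print_r($clean);
```

The filtering happens after the match, since the full match and the numeric duplicates are always part of what preg_match_all fills in.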

On Wed, Jul 11, 2007 at 09:50:48AM -0700, Anthony Baratta <anthony at baratta.com> wrote:

> Can you post the actual regEx and a few examples of strings you are searching?
> 
> -----Original message-----
> From: Jon Molesa rjmolesa at consoltec.net
> Date: Wed, 11 Jul 2007 09:46:56 -0700
> To: TheList thelist at lists.evolt.org
> Subject: [thelist] PHP RegEx
> 
> > Hey guys, I posted this on regex at googlegroups.com also, but I'm not sure how
> > active that list is.
> > 
> > I can't understand why when I use 
> > 
> > preg_match_all('/(regex)/', 'text string', $matches, PREG_SET_ORDER) 
> > 
> > that the entire match gets returned in the array $matches.  Even if 
> > I use (?:regex) on the whole regex I still get the whole thing back.  
> > Is there any way to drop that part?  I know I could just ignore it 
> > when looping through the array, but I'd like a bit more control over it.
> > 
> > It just seems as if I'm getting back more than I should.
> > Thanks.
> > 
> > -- 
> > Jon Molesa
> > rjmolesa at consoltec.net
> > if you're bored or curious
> > http://rjmolesa.com
> > -- 

-- 
Jon Molesa
rjmolesa at consoltec.net
if you're bored or curious
http://rjmolesa.com


