[thelist] PHP RegEx

Anthony Baratta anthony at baratta.com
Wed Jul 11 18:55:54 CDT 2007


Here's an option:

(?<=<a href="https?://)(.[^"]*)(" target="_new">)(.[^<]*)(?=</a>)

Returns:

group(0)    www.Company.com" target="_new">Company
group(1)    www.Company.com
group(2)    " target="_new">
group(3)    Company
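
A minimal sketch of trying this from PHP, under a few assumptions. PCRE lookbehinds have traditionally had to be fixed-length, so (?<=<a href="https?://) may be rejected by preg_match_all() on many PHP builds. The version below therefore drops the lookarounds and matches the surrounding markup with ordinary capture groups. The variable names ($html, $pattern) and the '#' delimiter are my own choices; the sample string is the one from the quoted message below.

```php
<?php
// Sample input taken from the quoted message below.
$html = '<font face="Verdana, Arial, Helvetica" size="-2" class="small">
        <a href="http://www.Company.com" target="_new">Company</a>
    </font>';

// Same idea as the pattern above, minus the lookbehind/lookahead:
// group 1 = host part of the URL, group 2 = link text.
// '#' is used as the delimiter so '/' needs no escaping.
$pattern = '#<a href="https?://([^"]+)" target="_new">([^<]+)</a>#';

if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[0] is always the complete matched text.
        echo $m[1], ' => ', $m[2], "\n";
    }
}
```

With the sample input this should print "www.Company.com => Company". Because the markup is no longer hidden inside lookarounds, $m[0] here is the whole <a>...</a> element rather than just the middle portion.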

-----Original message-----
From: Jon Molesa rjmolesa at consoltec.net
Date: Wed, 11 Jul 2007 10:25:58 -0700
To: Anthony Baratta anthony at baratta.com, "thelist at lists.evolt.org" thelist at lists.evolt.org
Subject: Re: [thelist] PHP RegEx

> The html I'm parsing looks like:
> 	<font face="Verdana, Arial, Helvetica" size="-2" class="small">
>         <a href="http://www.Company.com" target="_new">Company</a>
>     </font>
> 
> The <a> is optional so 0 or more.  But the company name will always be
> there.
> 
> I don't really have a problem with my regex as it is working as is, but
> I'm sure it could be improved upon.  My question really is why does:
> 
> 	$pattern ='/(?:[.*]*class="small">[\s\n]*)<a\shref="(?:http|https)+(?::\/\/){1}(?P<domain>.*\..*\.(?:com|net|edu|biz|org|info|name|us|cc|tv|gov){1})(?:.*)"\starget="_new">(?P<bizname>.*)<\/a>/';
> 
> 	if(preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
>     	print_r($matches);
> 	}
> 
> 	return this:
> 
> 	Array
> 	(
> 		[0] => Array
> 			(
> 				[0] => class="small">
> 																					<a href="http://www.Company.com" target="_new">Company</a>
> 				[domain] => www.Company.com
> 				[1] => www.Company.com
> 				[bizname] => Company
> 				[2] => Company
> 			)
> 	)
> 
> As you can see I'm getting back too much information.  I could forget
> about it and just use the data I need, but I'm really looking for an
> understanding as to why this is happening.  Especially the [0][0] part
> of the array.  From the php manual and what I've read so far this
> doesn't make sense to me.  I would expect to only get [0][domain] and
> [0][bizname].  Any insights or explanations will be appreciated.  If
> you have improvements on the regex I'll happily take those too, but for
> now I'm really interested in understanding the php function
> preg_match_all and why it returns so much.  Thanks all.
> 
> *On Wed, Jul 11, 2007 at 09:50:48AM -0700 Anthony Baratta <anthony at baratta.com> wrote:
> 
> > Date: Wed, 11 Jul 2007 09:50:48 -0700
> > From: "Anthony Baratta" <anthony at baratta.com>
> > To: thelist at lists.evolt.org <thelist at lists.evolt.org>
> > X-Mailer: IceWarp Web Mail 5.6.7
> > Subject: Re: [thelist] PHP RegEx
> > 
> > Can you post the actual regEx and a few examples of strings you are searching?
> > 
> > -----Original message-----
> > From: Jon Molesa rjmolesa at consoltec.net
> > Date: Wed, 11 Jul 2007 09:46:56 -0700
> > To: TheList thelist at lists.evolt.org
> > Subject: [thelist] PHP RegEx
> > 
> > > Hey guys, I posted this on regex at googlegroups.com also, but I'm not sure how
> > > active that list is.
> > > 
> > > I can't understand why when I use 
> > > 
> > > preg_match_all('/(regex)/', 'text string', $matches, PREG_SET_ORDER) 
> > > 
> > > that the entire regex gets returned in the array $matches.  Even if 
> > > I use (?:regex) on the whole regex I still get the whole thing back.  
> > > Is there any way to drop that part?  I know I could just ignore it, 
> > > when looping through the array, but I'd like a bit more control over it.
> > > 
> > > It just seems as if I'm getting back more than I should.
> > > Thanks.
> > > 
> > > -- 
> > > Jon Molesa
> > > rjmolesa at consoltec.net
> > > if you're bored or curious
> > > http://rjmolesa.com
> > > -- 
> > 
> > -- 
> > 
> > * * Please support the community that supports you.  * *
> > http://evolt.org/help_support_evolt/
> > 
> > For unsubscribe and other options, including the Tip Harvester 
> > and archives of thelist go to: http://lists.evolt.org 
> > Workers of the Web, evolt ! 
> 
> -- 
> Jon Molesa
> rjmolesa at consoltec.net
> if you're bored or curious
> http://rjmolesa.com
> -- 
> 
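
On the quoted question of why preg_match_all() hands back so much: as far as I know this is documented behaviour rather than something the pattern controls. Index 0 of each match always holds the text matched by the whole pattern, and every named group is stored twice, once under its name and once under its number. Wrapping the pattern in (?:...) only stops that group from getting its own number; it does not suppress index 0. A small sketch of trimming each match down to the named keys afterwards follows. The simplified pattern and the helper name named_groups_only() are my own, hypothetical choices, not something from the thread or the PHP library.

```php
<?php
// Hypothetical helper: keep only the string-keyed (named) entries of one match.
function named_groups_only(array $match): array
{
    // ARRAY_FILTER_USE_KEY passes each key (not value) to the callback.
    return array_filter($match, 'is_string', ARRAY_FILTER_USE_KEY);
}

$subject = '<a href="http://www.Company.com" target="_new">Company</a>';

// Simplified pattern for illustration, not the original one from the thread.
$pattern = '#<a\shref="https?://(?P<domain>[^"/]+)[^"]*"\starget="_new">(?P<bizname>[^<]*)</a>#';

$rows = [];
if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
    // Each $match holds [0] (the full match) plus every named group twice:
    // once by name and once by number.
    $rows = array_map('named_groups_only', $matches);
}

print_r($rows);
```

After filtering, each entry in $rows should contain only the 'domain' and 'bizname' keys, which is the shape the question was expecting.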


