[thelist] php parse/reg ex

Tue Apr 25 14:39:14 CDT 2006

Marshall Wood wrote:
> I have a field in my database that contains a block of data.  The data
> is formatted so it can be parsed.
> 
> Here is an example of the data. I'll give the names of the values
> instead of the values themselves; there are usually 30 of them in one
> field. Name:::::Value~|||||, the ::::: separates the Name from the
> Value, the ~ separates one Value from the next, if there is more than
> one Value per Name, and ||||| separates the Name from the next Name,
> but not in all cases.  This is an example of one that has no value:
> Name:::::~|||||.  Occasionally they get a bit out of whack, down
> further in the field. Here is an example of one that is out of whack,
> but I still need the values that it holds:
> Name:::::|||||||||||||||Value~|||||Value~|||||Value~|||||~|||||.
> 
> I have been trying explode() but that does no good. I think it's fine
> for the first "Name", then the Value ends up being the rest of the
> data.  :(  Any and all help would be appreciated. I am thinking RegEx
> might be the best bet, but a push start would be awesome.

Interesting puzzle.

This could be a hairy regular expression, so it would help to have a 
little more information:

--You didn't mention line endings. Do they occur, and do they mean 
anything? Are we to assume there can be more than one Name on a line?
By default, '.' in a PCRE pattern does not match a newline, so a 
pattern built on '.' stops at line endings unless you add the 's' 
modifier. As far as I know, preg_match_all() returns every match it 
finds, so the number of Names in a field should not be a problem.

--Do Name and Value follow any other rules we can parse by? Perhaps Name 
is alphanumeric? Or maybe Value is numeric? Anything like that would 
make things a lot simpler. If you can specify the character set used 
for Name and Value, it would be much easier than using only ':::::' and 
'~' and '|||||' as delimiters.

--Can you give us some sample data to play with? :)
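
On the line-ending and match-count questions above, a quick check along 
these lines may help. It is only a sketch: the 150-record string is made 
up for illustration, and the variable names are my own.

```php
<?php
// Hypothetical sample: 150 Name/Value records, one per line.
$data = '';
for ($i = 1; $i <= 150; $i++) {
    $data .= "Name{$i}:::::Value{$i}~|||||\n";
}

// preg_match_all() returns the number of matches it collected.
$count = preg_match_all('/Name\d+/', $data, $matches);

// By default '.' stops at a newline; the 's' modifier lets it cross lines.
$withoutS = preg_match('/Name1:.+Name2:/', $data);
$withS    = preg_match('/Name1:.+Name2:/s', $data);

var_dump($count, $withoutS, $withS);
```

If $count comes back as 150, there is no 100-match ceiling to worry 
about, and the two preg_match() results show whether line endings 
change what '.' will match.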

Generally, I would build a regular expression one step at a time, 
testing it as I go on real data.

Barring that, I'll take a stab at it.

Let's start with parsing for 'Name'. It looks like we can identify a 
Name as a sequence of characters preceded by '|||||' and followed by 
':::::'. Since '|' has special meaning in regular expressions 
(alternation), we have to escape it. ':' has no special meaning on its 
own, so escaping it is optional but harmless.

(?<=\|\|\|\|\|)Name(?=\:\:\:\:\:)

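or (?<=\|\|\|\|\|)(.+?)(?=\:\:\:\:\:)

(The '.+?' is lazy, so it stops at the first ':::::' it reaches. A 
greedy '.+' would run on to the last ':::::' in the data.)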

and to find all occurrences of Name, you'd make that into a repeating 
pattern:

(?:(?<=\|\|\|\|\|)(.+?)(?=\:\:\:\:\:))+
or ((?<=\|\|\|\|\|).+?(?=\:\:\:\:\:))+

But the final '+' is redundant with preg_match_all(), which already 
searches the whole string for every occurrence.
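
To make that concrete, here is a small sketch of the lookaround pattern 
run against a made-up string (the Names and Values are invented, since 
we have not seen real data). It uses a lazy '.+?' so that each match 
stops at the nearest ':::::', and it has no trailing '+'. Note that a 
Name at the very start of the field, with no '|||||' in front of it, 
would be missed by the lookbehind.

```php
<?php
// Made-up sample data; the real field's contents are not shown in the thread.
$sample = '|||||Color:::::Red~Blue~|||||Size:::::~|||||Shape:::::Round~|||||';

// Lookbehind for '|||||', lazy match, lookahead for ':::::'.
// No trailing '+' is needed: preg_match_all() repeats the search itself.
$pattern = '/(?<=\|\|\|\|\|).+?(?=\:\:\:\:\:)/';

preg_match_all($pattern, $sample, $found);
print_r($found[0]); // Color, Size, Shape
```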

So we plug this into php:

$MyRegEx = '/(?<=\|\|\|\|\|).+?(?=\:\:\:\:\:)/';
// Argument order: pattern, subject, then the array that receives the matches.
preg_match_all($MyRegEx, $MyData, $MyResultArray);
var_dump($MyResultArray);

See if you can get that to run [pray] and, if so, if it correctly picks 
out all the Names and nothing else.
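
If the Names come out cleanly, one possible next step is to pull the 
Values at the same time. The following is a rough sketch, not a tested 
solution for your data: parseField() is a hypothetical helper, the 
sample string is invented, and it assumes a Name never contains '|', 
':' or '~'. It captures each Name plus everything up to the next 
"Name:::::", then splits that block on '~' and throws away the stray 
'|' runs, which is how it copes with the out-of-whack records.

```php
<?php
// Hypothetical helper, not from the thread.
// Assumes Names never contain '|', ':' or '~'.
function parseField($data)
{
    $result = array();
    // Capture a Name, then everything up to the next "Name:::::" or the end of the data.
    $pattern = '/([^|:~]+):::::(.*?)(?=[^|:~]+:::::|\z)/s';
    preg_match_all($pattern, $data, $records, PREG_SET_ORDER);

    foreach ($records as $record) {
        $values = array();
        foreach (explode('~', $record[2]) as $piece) {
            // Strip stray '|' runs and whitespace left over from the delimiters.
            $piece = trim($piece, "| \t\r\n");
            if ($piece !== '') {
                $values[] = $piece;
            }
        }
        $result[trim($record[1])] = $values;
    }
    return $result;
}

// Invented sample, including a record with no Value and an out-of-whack one.
$sample = 'Color:::::Red~Blue~|||||Size:::::~|||||'
        . 'Shape:::::|||||||||||||||Round~|||||Square~|||||~|||||';
print_r(parseField($sample));
```

With this sample, Color should map to Red and Blue, Size to an empty 
array, and Shape to Round and Square. A Value that itself contains '|' 
at its edges or ':::::' anywhere would break these assumptions.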

Then check back here... We'll be waiting.

--John