[thelist] reg exp for last whack

John Hicks johnlist at gulfbridge.net
Sun Apr 16 23:39:13 CDT 2006


Canfield, Joel wrote:
> on win xp pro trying to parse a list of file paths to find dups; about
> 6,000 files in an extensive directory structure
> 
> i have a text dump with one full path and file name per line. my
> thinking was to split the path from the filename, then dump it to SQL
> and query for dups.
> 
> tried this in textpad for a reg exp and it says it's invalid:
> 
>     \\\([a-z0-9]*\)\n

This works in Textpad:

Find:  \\\([^\\]+\)$

Replace with:  \t\1

(Use $ instead of \n)
(Identify the filename as being the string of non-backslash characters 
from the last backslash to the end of the line.)

This simply inserts a tab to separate the path from the file name, which 
is what I believe you wanted to do.
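
If you end up doing this outside Textpad, the same find/replace can be 
sketched in Python. This is only a rough equivalent, assuming Python's re 
module, where groups are written ( ) rather than Textpad's \( \). The 
helper name split_last_whack is made up for illustration.

```python
import re

# Last backslash, then the run of non-backslash characters up to end of line.
# Python writes groups as ( ), where the Textpad pattern above uses \( \).
LAST_WHACK = re.compile(r"\\([^\\]+)$")


def split_last_whack(line):
    """Replace the last backslash in one path with a tab (hypothetical helper)."""
    return LAST_WHACK.sub(lambda m: "\t" + m.group(1), line)


if __name__ == "__main__":
    print(split_last_whack(r"C:\music\+SortedByYear\2000\Don Henley Workin It mp3"))
```

Call it once per line of the dump rather than on the whole file at once, 
since [^\\] will also match a newline.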

Or you can flip the path and filename like so:

Find:  ^\(.+\)\\\([^\\]+\)$

Replace with:  \2\t\1

This renders from:

C:\music\+SortedByYear\2000\Don Henley Workin It mp3
C:\music\+SortedByYear\2000\Don Henley Taking you Home mp3
C:\music\+SortedByYear\2000\Don Henley Inside Job mp3
C:\music\10,000 Maniacs\Blind Man's Zoo\Dust Bowl wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Eat for Two wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Hateful Hate wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Headstrong wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Jubilee wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Please Forgive Us wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Poison in the Well wma

to:

Don Henley Workin It mp3	C:\music\+SortedByYear\2000
Don Henley Taking you Home mp3	C:\music\+SortedByYear\2000
Don Henley Inside Job mp3	C:\music\+SortedByYear\2000
Dust Bowl wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Eat for Two wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Hateful Hate wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Headstrong wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Jubilee wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Please Forgive Us wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Poison in the Well wma	C:\music\10,000 Maniacs\Blind Man's Zoo
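
The flip can be sketched the same way in Python, again assuming the re 
module's ( ) group syntax in place of Textpad's \( \). The name flip_path 
is hypothetical, and it expects one path at a time.

```python
import re

# Group 1: everything up to the last backslash; group 2: the filename after it.
FLIP = re.compile(r"^(.+)\\([^\\]+)$")


def flip_path(line):
    """Return 'filename<TAB>directory' for one full path (hypothetical helper)."""
    return FLIP.sub(lambda m: m.group(2) + "\t" + m.group(1), line)


if __name__ == "__main__":
    print(flip_path(r"C:\music\10,000 Maniacs\Blind Man's Zoo\Dust Bowl wma"))
```

Because .+ is greedy, group 1 runs to the last backslash on the line, so 
group 2 is always just the filename.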

Then you can sort (hmmm... strange that your example data is already 
sorted by filename) and check for dupes.

I would think the checking for dupes should be done programmatically 
(e.g. via perl or bash or php). How were you planning on doing it via SQL?
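
To give an idea of what I mean by programmatically, here is a minimal 
sketch in Python (my choice for the example, nothing special about it). It 
assumes one full path per line, treats filenames as case-insensitive the 
way Windows does, and skips the regex entirely by splitting on the last 
backslash. The function name find_dupes and the second sample path are 
made up for illustration.

```python
from collections import defaultdict


def find_dupes(paths):
    """Map each filename that occurs more than once to the directories holding it."""
    dirs_by_name = defaultdict(list)
    for path in paths:
        path = path.strip()
        if not path:
            continue  # ignore blank lines in the dump
        # Split on the last backslash: directory on the left, filename on the right.
        directory, _, name = path.rpartition("\\")
        # Lowercase the key because Windows filenames are case-insensitive.
        dirs_by_name[name.lower()].append(directory)
    return {name: dirs for name, dirs in dirs_by_name.items() if len(dirs) > 1}


if __name__ == "__main__":
    sample = [
        r"C:\music\+SortedByYear\2000\Don Henley Inside Job mp3",
        r"C:\music\Don Henley\Inside Job\Don Henley Inside Job mp3",  # made-up path
        r"C:\music\10,000 Maniacs\Blind Man's Zoo\Jubilee wma",
    ]
    for name, dirs in find_dupes(sample).items():
        print(name, dirs)
```

For the real dump you would pass in the lines of the text file instead of 
the sample list.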

--John

> 
> what i'm trying to write is
> 
>     find a whack (escaped)       \\
>     tagged expression opening    \(
>     any number of alphanumerics  [a-z0-9]* 
>         (textpad is not case sensitive unless you specify)
>     tagged expression closing    \)
>     newline                      \n
> 
> and replace it with
> 
>     \t\1\n
> 
> a tab, the tagged expression, a newline
> 
> i've stripped all non-alphanumerics from the file names, trying to avoid
> having to include every possible special character in the reg exp.
> 
> here's a sample of the data
> 
> C:\music\+SortedByYear\2000\Don Henley Nobody Else in the World But You mp3
> C:\music\+SortedByYear\2000\Don Henley The Genie mp3
> C:\music\+SortedByYear\2000\Don Henley They're Not Here, They're Not Coming mp3
> C:\music\+SortedByYear\2000\Don Henley Workin It mp3
> C:\music\+SortedByYear\2000\Don Henley Taking you Home mp3
> C:\music\+SortedByYear\2000\Don Henley Inside Job mp3
> C:\music\10,000 Maniacs\Blind Man's Zoo\Dust Bowl wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Eat for Two wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Hateful Hate wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Headstrong wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Jubilee wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Please Forgive Us wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Poison in the Well wma
> 
> i'm stumped; just dug out my mastering reg exp book to finally go
> through it; any help in the meantime?
> 
> spinhead



