[thelist] reg exp for last whack
John Hicks
johnlist at gulfbridge.net
Sun Apr 16 23:39:13 CDT 2006
Canfield, Joel wrote:
> on win xp pro trying to parse a list of file paths to find dups; about
> 6,000 files in an extensive directory structure
>
> i have a text dump with one full path and file name per line. my
> thinking was to split the path from the filename, then dump it to SQL
> and query for dups.
>
> tried this in textpad for a reg exp and it says it's invalid:
>
> \\\([a-z0-9]*\)\n
This works in Textpad:
Find: \\\([^\\]+\)$
Replace with: \t\1
(Use $ instead of \n to match the end of the line.)
(The filename is the run of non-backslash characters between the last
backslash and the end of the line.)
This simply replaces the last backslash with a tab, separating the path
from the file name, which is what I believe you wanted to do.
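If you ever want to run the same substitution outside Textpad, here is a rough Python sketch of it. The function name split_with_tab is made up for illustration, and note that Python's re module writes groups as plain ( ) rather than Textpad's \( \):

```python
import re

# Same idea as the Textpad pattern: a backslash, then a captured run of
# non-backslash characters, anchored at the end of the line.
LAST_WHACK = re.compile(r"\\([^\\]+)$")

def split_with_tab(line: str) -> str:
    """Replace the last backslash in a path with a tab (hypothetical helper)."""
    # In the replacement template, \t is a tab and \1 is the captured filename.
    return LAST_WHACK.sub(r"\t\1", line)

if __name__ == "__main__":
    print(split_with_tab(r"C:\music\10,000 Maniacs\Blind Man's Zoo\Dust Bowl wma"))
```

A line with no backslash at all is returned unchanged, since the pattern never matches.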
Or you can flip the path and filename like so:
Find: ^\(.+\)\\\([^\\]+\)$
Replace with: \2\t\1
This renders from:
C:\music\+SortedByYear\2000\Don Henley Workin It mp3
C:\music\+SortedByYear\2000\Don Henley Taking you Home mp3
C:\music\+SortedByYear\2000\Don Henley Inside Job mp3
C:\music\10,000 Maniacs\Blind Man's Zoo\Dust Bowl wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Eat for Two wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Hateful Hate wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Headstrong wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Jubilee wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Please Forgive Us wma
C:\music\10,000 Maniacs\Blind Man's Zoo\Poison in the Well wma
to:
Don Henley Workin It mp3	C:\music\+SortedByYear\2000
Don Henley Taking you Home mp3	C:\music\+SortedByYear\2000
Don Henley Inside Job mp3	C:\music\+SortedByYear\2000
Dust Bowl wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Eat for Two wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Hateful Hate wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Headstrong wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Jubilee wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Please Forgive Us wma	C:\music\10,000 Maniacs\Blind Man's Zoo
Poison in the Well wma	C:\music\10,000 Maniacs\Blind Man's Zoo
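For what it's worth, the flip can also be sketched in Python. This is only an illustration, not something Textpad needs; flip_path is a made-up name, and the pattern is the Textpad one rewritten with Python's unescaped ( ) groups:

```python
import re

# Greedy (.+) backtracks to the last backslash, so group 1 is the folder
# and group 2 is the filename (the trailing run of non-backslashes).
FLIP = re.compile(r"^(.+)\\([^\\]+)$")

def flip_path(line: str) -> str:
    """Return 'filename<TAB>folder' for a full Windows path (hypothetical helper)."""
    return FLIP.sub(r"\2\t\1", line)

if __name__ == "__main__":
    print(flip_path(r"C:\music\+SortedByYear\2000\Don Henley Workin It mp3"))
```

With the filename first, an ordinary line sort puts identical names next to each other.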
Then you can sort (hmmm... strange that your example data is already
sorted by filename) and check for dupes.
I would think the checking for dupes should be done programmatically
(e.g. via perl or bash or php). How were you planning on doing it via SQL?
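To give an idea of what I mean by programmatically, here is a minimal sketch in Python (any of the languages above would do just as well). The name find_dupes is hypothetical, and treating names as case-insensitive is my assumption, since Windows file names generally are:

```python
import ntpath
from collections import defaultdict

def find_dupes(lines):
    """Map each filename that occurs more than once to the folders holding it."""
    by_name = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        # ntpath.split handles backslash-separated paths on any platform.
        folder, name = ntpath.split(line)
        # Assumption: compare names case-insensitively.
        by_name[name.lower()].append(folder)
    return {name: folders for name, folders in by_name.items() if len(folders) > 1}

if __name__ == "__main__":
    sample = [
        r"C:\music\a\Dust Bowl wma",
        r"C:\music\b\dust bowl wma",
        r"C:\music\a\Jubilee wma",
    ]
    for name, folders in find_dupes(sample).items():
        print(name, folders)
```

For the real list you would pass in the lines read from your text dump instead of the made-up sample paths.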
--John
>
> what i'm trying to write is
>
> find a whack (escaped) \\
> tagged expression opening \(
> any number of alphanumerics [a-z0-9]*
> (textpad is not case sensitive unless you specify)
> tagged expression closing \)
> newline \n
>
> and replace it with
>
> \t\1\n
>
> a tab, the tagged expression, a newline
>
> i've stripped all non-alphanumerics from the file names, trying to avoid
> having to include every possible special character in the reg exp.
>
> here's a sample of the data
>
> C:\music\+SortedByYear\2000\Don Henley Nobody Else in the World But You
> mp3
> C:\music\+SortedByYear\2000\Don Henley The Genie mp3
> C:\music\+SortedByYear\2000\Don Henley They're Not Here, They're Not
> Coming mp3
> C:\music\+SortedByYear\2000\Don Henley Workin It mp3
> C:\music\+SortedByYear\2000\Don Henley Taking you Home mp3
> C:\music\+SortedByYear\2000\Don Henley Inside Job mp3
> C:\music\10,000 Maniacs\Blind Man's Zoo\Dust Bowl wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Eat for Two wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Hateful Hate wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Headstrong wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Jubilee wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Please Forgive Us wma
> C:\music\10,000 Maniacs\Blind Man's Zoo\Poison in the Well wma
>
> i'm stumped; just dug out my mastering reg exp book to finally go
> through it; any help in the meantime?
>
> spinhead