[thelist] Friday Freebie

Daniel J. Cody djc at starkmedia.com
Fri Mar 30 15:07:47 CST 2001

Thought I'd help out with the Friday Freebie this week since Scott is
apparently umm, 'occupied', today.. 

<tip type="using shell scripts to parse your referer logs">

Let's say you've got an extra five minutes on your hands and you
wanna see *quickly* what kind of search terms people are using to get to
your site from Google (google.com, google.yahoo.com, etc). Instead of
having to run your logs through a full-blown analyzer for that info,
here are a couple of nice little tricks to use:

First, let's get all the Google entries from the referer_log into a file:

[djc at leo logs]$ grep google referer_log > google.ref

Now we have a file that contains *only* the lines that mention google.
Here are a couple of lines from that file for comparison's sake:

-> /
http://www.google.com/search?q=logitech+mousewheel+freeze ->
-> /
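
One catch with a bare grep is that it also keeps lines where "google"
only shows up in the page on your own site (the part after the ->).
Here's a rough sketch of a tighter filter, assuming every line looks like
"referer -> page" as in the sample above. The function name
google_referers is made up for illustration:

```shell
# Hypothetical helper: keep only lines whose referer *host* contains
# "google". Assumes the referer URL starts the line, followed by " -> ".
# Reads the files given as arguments, or stdin if there are none.
google_referers() {
    grep -E '^https?://[^/ ]*google[^/ ]*[/ ]' "$@"
}

# Usage: google_referers referer_log > google.ref
```

This should still catch google.yahoo.com and friends, since the match is
on the host name rather than on the whole line.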

Now we want to scrape off the crap at the beginning of the URL and get
at the terms people searched for. We use a combination of sed and awk
(gawk) here. The first sed reads the google.ref file (our stripped-down
referer_log) and replaces every equals sign, along with the one character
in front of it, with a space. As long as the query is the first parameter
in the URL, as in the sample above, that leaves the search terms as the
second whitespace-separated column, which is the good stuff, so awk prints
only that column. Then it gets piped back into sed a couple more times
(I'm sure there's a better way to do it than this) to remove annoying
characters: the first swaps each ampersand and the character after it for
a space, and the second turns the plus signs into spaces:
(split after a pipe, so the shell still reads it as one command)
[djc at leo logs]$ sed -e "s/.=/ /g" google.ref | awk '{ print $2 }' |
    sed -e "s/\&./ /g" | sed -e "s/+/ /g"

So when it comes through the tube, we get something like this:

browsers %2bDownload
logitech mousewheel freeze
web browser
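
For the "better way" mentioned above, here is one possible single-sed
version, offered as a sketch and not the definitive fix. It assumes the
search terms live in a q= parameter (as in the sample line) and pulls
that parameter out wherever it sits in the URL. google_terms is a
hypothetical name:

```shell
# Hypothetical helper: print the value of the q= parameter from each
# referer line, with plus signs turned into spaces.
# Lines without a q= parameter are skipped (sed -n ... p).
# Reads the files given as arguments, or stdin if there are none.
google_terms() {
    sed -n 's/^[^ ]*[?&]q=\([^& ]*\).*/\1/p' "$@" | tr '+' ' '
}

# Usage: google_terms google.ref
```

Because it looks for q= by name, a made-up referer like
http://www.google.com/search?hl=en&q=web+browser still yields
"web browser", where the column-based pipeline would pick up the hl
value instead. Percent-encoded characters such as %2b are left alone.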

As you can see (or will see), some of the really wack chars will stick
around, like the %2b above, which is a URL-encoded plus sign, but you
should be able to get a decent idea. If you throw a sort -u on the end
of that like so:
[djc at leo logs]$ sed -e "s/.=/ /g" google.ref | awk '{ print $2 }' |
    sed -e "s/\&./ /g" | sed -e "s/+/ /g" | sort -u

it will drop the duplicates and even alphabetize them for you :)
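
sort -u tells you which terms showed up, but not how often. If you'd
rather see the most popular searches first, something along these lines
should do it. top_terms and its default of 10 are just illustrative
choices, not part of the original tip:

```shell
# Hypothetical helper: read one search phrase per line on stdin and
# print the most frequent ones, each prefixed with its count.
# The optional first argument is how many to show (default 10).
top_terms() {
    sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage: tack it onto the end of the pipeline in place of sort -u,
#   ... | sed -e "s/+/ /g" | top_terms 20
```

The first sort groups identical phrases together so uniq -c can count
them, and the second sort puts the biggest counts on top.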

</tip>

