[thelist] Setting up a search engine in Perl, CGI or PHP and MySQL

Eugaia eugaia at hotmail.com
Thu Dec 11 02:32:49 CST 2003


Sorry about the length of this email, but I'm trying to be clear about my needs so that I can get answers to some specific questions.

I am trying to set up a site that is a different type of search engine. Instead of the top-down approach of redirecting users to other websites based on conventional search criteria, I'm looking at actually cataloguing the pages of different websites and structuring them in a directory format (a bit like the Open Directory Project, but using computers rather than people for most of the cataloguing process), which is essentially a bottom-up approach.

Basically, items will be catalogued through direct access to data on other people's servers, using various scripts/algorithms.  Initially I will be organising information from commercial sites with affiliate programs, since they should allow direct (albeit restricted) access to their databases.  For each site included in the catalogue, I will need to store information about what I'm cataloguing.

For example, if I am cataloguing a poster of the Mona Lisa from AllPosters.com, I will need to store information such as the artist, the price of the poster, the artistic style and so on, so that a user on my site could reach it via a section on posters, or on Renaissance art, or on Leonardo da Vinci, etc.  I will also want to make sure that posters from other online stores are correctly placed alongside the one from AllPosters.com when a user requests posters of the Mona Lisa - I don't want the search to run a live text search for posters named Mona Lisa as such; I want the item to have been catalogued in advance.  This means that accurate price comparisons could be made on any item.
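To make that concrete, here's roughly the shape of schema I have in mind - just a sketch, and all the table and column names below are placeholders I've invented for this email:

<?php
// Rough schema sketch: one row per catalogued item, with each store's
// offer hanging off it, so price comparisons fall out of a simple join.
mysql_connect('localhost', 'user', 'pass');   // placeholder credentials
mysql_select_db('catalogue');

// One row per distinct item (the poster itself, not any store's copy).
mysql_query("CREATE TABLE item (
    item_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    title   VARCHAR(255) NOT NULL,    -- e.g. 'Mona Lisa (poster)'
    artist  VARCHAR(255),             -- e.g. 'Leonardo da Vinci'
    style   VARCHAR(100)              -- e.g. 'Renaissance'
)");

// One row per store carrying the item, so the same poster from
// AllPosters.com and from other stores share an item_id.
mysql_query("CREATE TABLE offer (
    offer_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    item_id  INT UNSIGNED NOT NULL,   -- which item this offer is for
    store    VARCHAR(100) NOT NULL,   -- e.g. 'AllPosters.com'
    url      VARCHAR(255) NOT NULL,   -- product page on the store's site
    price    DECIMAL(8,2),
    INDEX (item_id)
)");

// An item can sit in several directory categories at once
// (posters, Renaissance art, Leonardo da Vinci, ...).
mysql_query("CREATE TABLE category (
    category_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    parent_id   INT UNSIGNED          -- NULL for top-level sections
)");
mysql_query("CREATE TABLE item_category (
    item_id     INT UNSIGNED NOT NULL,
    category_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (item_id, category_id)
)");
?>

A price comparison for one item is then just SELECT store, price FROM offer WHERE item_id = ... ORDER BY price.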

As I see it, there will need to be three scripts/algorithms.  The first will extract the relevant information from a site that is to be catalogued and return the various fields that will then be put into the database (this will probably differ from site to site, though some sites will share similar scripts, and generic scripts will be written for this process).  These extractors will then pass their results on to a second algorithm.
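For the first script, what I have in mind per site is something like the sketch below (the function name and the regexes are invented for illustration - each store's page layout would need its own patterns):

<?php
// Site-specific extractor sketch: fetch one product page and pull out
// the fields the cataloguing script needs.  Patterns are placeholders.
function extract_poster($url)
{
    $html = file_get_contents($url);   // fetch the remote product page
    if ($html === false) {
        return false;                  // fetch failed, skip this page
    }

    $item = array('url' => $url);

    // Each pattern below would be tailored to one store's HTML layout.
    if (preg_match('/<h1>([^<]+)<\/h1>/', $html, $m)) {
        $item['title'] = trim($m[1]);
    }
    if (preg_match('/Artist:\s*([^<]+)</', $html, $m)) {
        $item['artist'] = trim($m[1]);
    }
    if (preg_match('/\$(\d+\.\d{2})/', $html, $m)) {
        $item['price'] = $m[1];
    }

    return $item;   // handed to the second script for insertion
}
?>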

The second algorithm will then insert the relevant data into the database, including such things as URLs, image locations and various site-structure information (such as how the item can be reached through the directory structure).  Comparisons will be made with items already catalogued, so that the same item from different stores can be correctly placed together (this may require some human checking as well, though hopefully not much).
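Sketched against the same placeholder schema, the second script might boil down to this (matching on a normalised title is crude, and borderline matches would go into the human-checking queue mentioned above):

<?php
// Second-script sketch: attach a freshly extracted offer to an
// existing item if one matches, otherwise create a new item.
function catalogue_item($item, $store)
{
    $title = mysql_real_escape_string(strtolower(trim($item['title'])));

    // Is an item with this (normalised) title already catalogued?
    $res = mysql_query("SELECT item_id FROM item
                        WHERE LOWER(title) = '$title'");
    if ($res && mysql_num_rows($res) > 0) {
        $row     = mysql_fetch_assoc($res);
        $item_id = $row['item_id'];    // same item, different store
    } else {
        $t = mysql_real_escape_string($item['title']);
        $a = mysql_real_escape_string($item['artist']);
        mysql_query("INSERT INTO item (title, artist)
                     VALUES ('$t', '$a')");
        $item_id = mysql_insert_id();  // brand-new catalogue entry
    }

    // Either way, record this store's offer against the item.
    $s = mysql_real_escape_string($store);
    $u = mysql_real_escape_string($item['url']);
    $p = mysql_real_escape_string($item['price']);
    mysql_query("INSERT INTO offer (item_id, store, url, price)
                 VALUES ($item_id, '$s', '$u', '$p')");
}
?>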

The third script is the search engine itself.  This will not just return a page of results like a conventional search engine, but might redirect a user to an individual page if there is a very high chance that it is the page the user wanted (e.g. if David Blaine was entered, the user would be redirected to .../david_blaine.html).  If there are several possible results - a search for pool might yield results relating to swimming pools, but also to the game - then all results related to the former could be found through links on a page called swimming_pool.html, and the latter on pool.html.  These two possibilities would be listed on one page, along with some of their sub-categories underneath the main headings, and any other possible derivations of the word pool.
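The core of the third script, as I picture it, is a redirect-or-disambiguate decision like this (the page table is another placeholder; a possible definition for it is sketched further down):

<?php
// Search-script sketch: one exact hit redirects, several hits produce
// a disambiguation page.
$term = mysql_real_escape_string(strtolower(trim($_GET['q'])));

$res = mysql_query("SELECT name, filename FROM page
                    WHERE keyword = '$term'");

if ($res && mysql_num_rows($res) == 1) {
    // One clear match (e.g. 'david blaine'): send the user straight there.
    $row = mysql_fetch_assoc($res);
    header('Location: /' . $row['filename']);
    exit;
}

// Several senses (e.g. 'pool'): list each candidate page; the real page
// would also show sub-categories underneath each heading.
while ($row = mysql_fetch_assoc($res)) {
    echo '<a href="/' . htmlspecialchars($row['filename']) . '">'
       . htmlspecialchars($row['name']) . "</a><br>\n";
}
?>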

My questions are these:  Is MySQL suitable for such a task (my gut instinct is yes)?  And, more importantly, what language should I use for the search engine script?

Obviously, being a search engine, it needs to be quick at retrieving results from the database.  I intend to build the rest of the site in PHP (Linux/Apache/MySQL/PHP with Turck MMCache is my current idea for the setup - Turck MMCache speeds up PHP processing by caching compiled scripts).  I'm thinking that Perl might be quicker as a search engine script, but I just don't know.  What about plain CGI?  Is there anyone with experience of writing search engines who can make some suggestions?
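Rather than guessing, I suppose I could time the same lookup from each candidate language; the PHP side of such a micro-benchmark would just be something like:

<?php
// Crude timing harness: run the same indexed lookup many times and
// report the total, for comparison against a Perl equivalent.
function now_ms()
{
    list($usec, $sec) = explode(' ', microtime());
    return ($sec + $usec) * 1000.0;
}

$start = now_ms();
for ($i = 0; $i < 1000; $i++) {
    $res = mysql_query("SELECT filename FROM page WHERE keyword = 'pool'");
    mysql_fetch_assoc($res);
}
printf("1000 lookups in %.1f ms\n", now_ms() - $start);
?>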

Considering that there will potentially be results catalogued for anything you could think of, how should I structure the tablespace of the search engine part of the database?  This tablespace could potentially reach terabytes in size.  Should I have just one table with all the possible results, or would it be more efficient to somehow split the results across multiple tables?  If so, are there any suggestions for how I should split them up?

One possibility I've thought of is to have one table for each search result (so potentially tens of millions of tables), with a list of all the tables kept in the database's data dictionary.  Each table would then contain data about what should be shown for its page, including the links to other pages.  If a search did not match a page that had already been structured, then suggested results pages could be generated from pages with similar names.
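For comparison, the "just one table" option from above might look like the sketch below (again, the names are placeholders).  The question is really whether one index over tens of millions of rows beats tens of millions of tiny tables:

<?php
// The one-big-table alternative: every pre-built results page is a row,
// and an index on the search term makes each lookup a single B-tree
// probe no matter how many millions of rows the table holds.
mysql_query("CREATE TABLE page (
    page_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    keyword  VARCHAR(100) NOT NULL,   -- normalised search term
    name     VARCHAR(255) NOT NULL,   -- heading shown to the user
    filename VARCHAR(255) NOT NULL,   -- e.g. 'swimming_pool.html'
    body     TEXT,                    -- what should be shown on the page
    INDEX (keyword)
)");
?>

Note that keyword is deliberately not unique - pool would have one row for swimming_pool.html and one for pool.html, which is what the search script above relies on.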

Also, would Perl, CGI or PHP (or something else) be better suited for the other two algorithms/scripts, which will insert data into the database?

Any responses to these questions will be gratefully received.  Thanks in advance.

Thanks,

Marcus.

