[thesite] thetips database

Seth Bienek seth at sethbienek.com
Mon Jun 18 13:35:52 CDT 2001


Hey Dan,

> I think I may have mentioned this before, Seth, but rather than parsing 
> that big ass mbox file, there are all those little weekly text files you 
> can parse. e.g. http://lists.evolt.org/archive/Week-of-Mon-20010611.txt

That's the first time I can remember seeing it mentioned.  But after having a look at it, I'd like to keep it open as a 'plan b' option.  I'd like to keep the components of the message intact, including the header information, but if this doesn't pan out then I will definitely go to the .txt files.  Sure looks easier.  But then how much fun is 'easy'? :)  Uh.. Unless you're talking about, like, chicks and stuff.
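
To illustrate what "keeping the components intact" could look like, here is a minimal Python sketch (Python is used only for illustration, and the sample message and the name `split_message` are made up). It separates one raw message into its headers and its body using the standard `email` package.

```python
from email import message_from_string

# Hypothetical sample message, not taken from the real archive.
RAW = """From: someone at example.com
Subject: [thesite] example
Date: Mon, 18 Jun 2001 13:35:52 -0500

Body text here.
"""

def split_message(raw):
    """Return (headers, body) for a single raw, non-multipart message."""
    msg = message_from_string(raw)
    # dict() keeps only one value per header name; a real import would
    # probably want msg.items() as-is to preserve repeated headers.
    headers = dict(msg.items())
    body = msg.get_payload()
    return headers, body

headers, body = split_message(RAW)
print(headers["Subject"])
print(body)
```

Each header could then go into its own column while the body is stored untouched.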

> this would also solve having to parse the 50MB file every time you wanted 
> to update the DB. just a thought..

I have a semi-solution for this: storing the cursor position. But the file would still have to be read into memory every time.  What produces this text file?  Could its format be altered?  Is it produced in a batch or on-the-fly? (This is going somewhere, I promise.)
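
A rough sketch of the stored-position idea, in Python purely for illustration (not the template code actually in use). It assumes the archive is only ever appended to and that the environment can seek within a file, in which case only the new bytes get read. `read_new_bytes` and the offset file are made-up names.

```python
import os

def read_new_bytes(mbox_path, offset_path):
    """Return the bytes appended to mbox_path since the last call."""
    # Load the previously saved byte offset; start at 0 if none exists.
    try:
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    except FileNotFoundError:
        offset = 0

    # If the file is now smaller than the saved offset, it was rebuilt
    # or truncated, so start over from the beginning.
    if os.path.getsize(mbox_path) < offset:
        offset = 0

    with open(mbox_path, "rb") as f:
        f.seek(offset)          # skip everything already processed
        chunk = f.read()        # read only the new tail of the file
        new_offset = f.tell()

    # Remember where we stopped for the next run.
    with open(offset_path, "w") as f:
        f.write(str(new_offset))
    return chunk
```

The returned chunk would still need to be split into messages, and a run that stops mid-message would need extra care that this sketch does not handle.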

There's a little light bulb floating over my head...

Seth

> -----Original Message-----
> From: thesite-admin at lists.evolt.org
> [mailto:thesite-admin at lists.evolt.org]On Behalf Of Daniel J. Cody
> Sent: Monday, June 18, 2001 12:13 PM
> To: thesite at lists.evolt.org
> Subject: Re: [thesite] thetips database
> 
> 
> 
> 
> Seth Bienek wrote:
> 
> > Hey Dean,
> > 
> > 
> >>Okay, so the problem that you are having is getting the whole e-mail
> >>message into the database?
> >>
> > 
> > Nope.  The database is fine. The problem is parsing the entire
> > .mbox archive (over 50 meg) without any errors.  And since it's
> > so memory and processor-intensive, I can only run the template
> > against the entire archive every couple of hours, or else there
> > ends up being overlapping threads and other issues.. Once the
> > initial database population is done, there shouldn't be any more
> > problems, but that first step is the big one.
> > 
> > I have some ideas that I will test today, and I'll let you know
> > if I can't get it squared away by this evening.
> 
> 
> I think I may have mentioned this before, Seth, but rather than parsing 
> that big ass mbox file, there are all those little weekly text files you 
> can parse. e.g. http://lists.evolt.org/archive/Week-of-Mon-20010611.txt
> 
> these are split up into easy to digest 500KB - 1MB weekly files. they're 
> also what Dean's tip harvester is using to extract shit now.
> 
> this would also solve having to parse the 50MB file every time you wanted 
> to update the DB. just a thought..
> 
> 
> > As far as the structure of 'thetips' as it is now, I haven't
> > looked at it but I'm sure it will need to be reworked.
> 
> 
> CREATE TABLE THETIPS (
>    TIP_ID     NUMBER (8)    NOT NULL,
>    TIP_DATE   DATE          NOT NULL,
>    AUTHOR_ID  NUMBER (8),
>    TIP_TYPE   VARCHAR2 (200),
>    AUTHOR     VARCHAR2 (50),
>    BODY       LONG,
>    PRIMARY KEY ( TIP_ID )
> );
> 
> if you guys need anything else, please lemme know :)
> 
> .djc.
> 
> 
> 
> 
> 





