[thesite] Tip Harvester question (was: [***] Formatting tips )

Sat Mar 31 14:41:33 CST 2001

> How exactly will this query find duplicate tips, especially when
> quotes in emails tend to be led by any number of line-begin
> tokens?  If this actually can work, I'd love to use it in a number
> of applications, so believe me, I'm not trying to be argumentative.

hi joshua

you are right, there would be a difference between the tip harvester's
processing of this tip --

<tip>
Epsum factorial non deposit quid pro quo hic escorol.
Olypian quarrels et gorilla congolium sic ad nauseum.
</tip>

and this one --

><tip>
>Epsum factorial non deposit quid pro quo hic escorol.
>Olypian quarrels et gorilla congolium sic ad nauseum.
></tip>

if the harvester just picks up the text, line feeds and quote tokens
and all, then these two tips will end up having different bodies

this is an area where the front end harvester logic must be very careful

note that in order to attribute a tip to the appropriate author, it is
important to detect when the tip is being quoted, and not include those
instances

my only point was that expecting "<tip" to be in columns 1-4 wasn't
going to catch them all
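
a rough sketch of what that front end check might look like, in python
(my choice here, not necessarily what the harvester is written in). it
assumes the quote tokens are ">" or ":" and that a tip is a <tip> ...
</tip> block; the names QUOTE_PREFIX, TIP_BLOCK and find_tips are made up
for illustration

```python
import re

# Assumption: a quote prefix is any run of '>' or ':' tokens, with optional
# spaces or tabs around them, at the start of a line.
QUOTE_PREFIX = r"[ \t]*(?:[>:][ \t]*)*"

# Match a <tip> block wherever the tag sits on its line, not just column 1.
TIP_BLOCK = re.compile(
    r"^(?P<prefix>" + QUOTE_PREFIX + r")<tip>[ \t]*\n"
    r"(?P<body>.*?)"
    r"^" + QUOTE_PREFIX + r"</tip>",
    re.MULTILINE | re.DOTALL,
)


def find_tips(message):
    """Return a list of (is_quoted, raw_body) for each <tip> block found."""
    tips = []
    for match in TIP_BLOCK.finditer(message):
        # A tip counts as quoted if its opening tag is preceded by a quote token.
        is_quoted = any(ch in ">:" for ch in match.group("prefix"))
        tips.append((is_quoted, match.group("body")))
    return tips
```

the harvester could then store only the tips where is_quoted is false and
credit them to the sender of that message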

as far as what you would do if both versions did get stored, you could use
regexp (i'm guessing) to strip the ">" and ":" quote tokens from the start
of each line (i.e. turn "\n>" and "\n:" back into "\n") and then try the
GROUP BY again
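
here is a minimal sketch of that idea done outside the database, in
python. it assumes the tip bodies have already been fetched as
(contentid, body) pairs, and that quote tokens are ">" or ":" at the
start of a line; normalize and group_duplicates are hypothetical names

```python
import re
from collections import defaultdict

# Assumption: quote tokens are runs of '>' or ':' at the start of a line.
LEADING_QUOTES = re.compile(r"^[ \t]*(?:[>:][ \t]*)+", re.MULTILINE)


def normalize(body):
    """Strip leading quote tokens and collapse all whitespace to single spaces."""
    text = LEADING_QUOTES.sub("", body)
    return " ".join(text.split())


def group_duplicates(rows):
    """Group contentids by normalized body; return only groups with duplicates."""
    groups = defaultdict(list)
    for contentid, body in rows:
        groups[normalize(body)].append(contentid)
    return [ids for ids in groups.values() if len(ids) > 1]
```

one caveat: a tip line that legitimately starts with ":" or ">" loses
that character too, so this is only good for comparing bodies, not for
cleaning up the stored text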

another strategy for detecting duplicates is to search for a particular
known phrase

e.g.
         select contentid, contentname, body
           from content   -- table name assumed
          where body like '%user%control%fonts%'

which gets around the problem that line breaks and quote tokens might be
between those words
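
to see the wildcard behaviour for yourself, here is a small sketch using
python's built-in sqlite3 module as a stand-in database. the table name
content and the function find_by_phrase are assumptions for illustration;
only the column names come from the query above

```python
import sqlite3


def find_by_phrase(rows, words):
    """Return contentids whose body contains the words in order.

    rows  -- (contentid, contentname, body) tuples
    words -- the known phrase, split into words
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table content (contentid integer, contentname text, body text)"
    )
    conn.executemany("insert into content values (?, ?, ?)", rows)
    # '%' between the words lets line breaks and quote tokens sit between them.
    pattern = "%" + "%".join(words) + "%"
    cur = conn.execute(
        "select contentid from content where body like ? order by contentid",
        (pattern,),
    )
    ids = [row[0] for row in cur.fetchall()]
    conn.close()
    return ids
```

calling find_by_phrase(rows, ["user", "control", "fonts"]) builds the same
'%user%control%fonts%' pattern as the query above. keep in mind that "%"
will also match unrelated text between the words, so a short phrase can
pull in tips that are not really duplicates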

note that this method doesn't work in our case because you can't use LIKE
on an oracle LONG datatype

come to think of it, the GROUP BY might not work either

:o(

rudy