[thelist] Text Comparison

John Hicks johnlist at gulfbridge.net
Thu Jun 8 14:25:48 CDT 2006


Hershel Robinson wrote:
> Let's say this is the input:
> 
> 1 To be or not to be that is the question.
> 2 To or not that's the question.
> 3 To been or to bend is the question.
> 
> The output is like this, but I will use _ instead of spaces:
> 
> 1_To___be_or_not_to_be_that_is_the_question.
> 2_To______or_not_______that's__the_question.
> 3_To_been_or_____to_bend____is_the_question.
> 
> So now you can easily see that all texts begin with 'To' and then have 
> 'or' and end with 'the question.' Now you can also see where some texts 
> have more words, fewer words, or different words than the other texts. 
> All the algorithm must do is figure out the common words and then output 
> them lined up vertically so one can visually scan and see the 
> differences easily.

I'm not a mathematician, but my intuition tells me this doesn't always 
have a clean answer, and when it does there are often several equally 
valid ways to line the words up.

I have a hunch you'll need to devise a heuristic based on the nature of 
your inputs and desired outputs (i.e. you'll need to tweak it a lot).
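
To make that concrete, here is a rough sketch of one possible heuristic.
Everything in it is my own assumption: Python as the language, the
hypothetical names align() and render(), and the choice to lean on the
standard library's difflib.SequenceMatcher. It folds the texts in one at
a time and only lines up words that match exactly, so it is a starting
point to tweak, not a general solution.

```python
from difflib import SequenceMatcher


def align(texts):
    """Greedy progressive alignment of whitespace-separated words.

    Returns one row per text; all rows have the same length and
    None marks a gap. Only exact (case-sensitive) matches share a column.
    """
    columns = []  # each column holds one entry per text processed so far
    for row_index, text in enumerate(texts):
        words = text.split()
        # Every column contains a single distinct word, so use it as the key.
        keys = [next(w for w in col if w is not None) for col in columns]
        matcher = SequenceMatcher(a=keys, b=words, autojunk=False)
        new_columns = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Matching words join the existing columns.
                for col, word in zip(columns[i1:i2], words[j1:j2]):
                    new_columns.append(col + [word])
            else:
                # Existing columns with no counterpart get a gap ...
                for col in columns[i1:i2]:
                    new_columns.append(col + [None])
                # ... and unmatched new words get columns of their own.
                for word in words[j1:j2]:
                    new_columns.append([None] * row_index + [word])
        columns = new_columns
    # Transpose columns back into one row per input text.
    return [[col[r] for col in columns] for r in range(len(texts))]


def render(rows, gap="_"):
    """Pad every cell to its column width so matching words line up."""
    widths = [max(len(w) for w in col if w is not None) for col in zip(*rows)]
    lines = []
    for row in rows:
        cells = [(w or "").ljust(width, gap) for w, width in zip(row, widths)]
        lines.append(gap.join(cells))
    return lines


texts = [
    "To be or not to be that is the question.",
    "To or not that's the question.",
    "To been or to bend is the question.",
]
for line in render(align(texts)):
    print(line)
```

Because near-misses like "be"/"been" or "that"/"that's" each get their
own column here, the result is wider than Hershel's hand-made example.
Treating similar words as a match, or changing the order in which the
texts are folded in, is exactly the sort of tweaking I'd expect to be
needed, and different choices can give different, equally defensible
alignments.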

--John


