[thelist] Text Comparison
John Hicks
johnlist at gulfbridge.net
Thu Jun 8 14:25:48 CDT 2006
Hershel Robinson wrote:
> Let's say this is the input:
>
> 1 To be or not to be that is the question.
> 2 To or not that's the question.
> 3 To been or to bend is the question.
>
> The output is like this, but I will use _ instead of spaces:
>
> 1_To___be_or_not_to_be_that_is_the_question.
> 2_To______or_not_______that's__the_question.
> 3_To_been_or_____to_bend____is_the_question.
>
> So now you can easily see that all texts begin with 'To' and then have
> 'or' and end with 'the question.' Now you can also see where some texts
> have more words or less words or different words than the other texts.
> All the algorithm must do is figure the common words and then output
> them lined up vertically so one can visually scan and see the
> differences easily.
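I'm not a mathematician, but my intuition tells me this is not always
solvable (the shared words can turn up in a different order in different
texts), and when it is solvable there are often several equally valid
ways to line the words up.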
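
A quick way to see the ambiguity, as a rough Python sketch. It compares
just two texts and lists every way of matching the largest possible
number of words in order. The name best_alignments is made up for this
example; it is not from any library.

```python
from functools import lru_cache

def best_alignments(a, b):
    """List every largest in-order set of word matches between a and b.

    Each alignment is a tuple of (index_in_a, index_in_b) pairs.
    """
    @lru_cache(maxsize=None)
    def lcs(i, j):
        # Length of the longest common subsequence of a[i:] and b[j:].
        if i == len(a) or j == len(b):
            return 0
        best = max(lcs(i + 1, j), lcs(i, j + 1))
        if a[i] == b[j]:
            best = max(best, 1 + lcs(i + 1, j + 1))
        return best

    def walk(i, j):
        need = lcs(i, j)
        if need == 0:
            yield ()
            return
        # Try every pair that can start a maximum-length match from here.
        for p in range(i, len(a)):
            for q in range(j, len(b)):
                if a[p] == b[q] and 1 + lcs(p + 1, q + 1) == need:
                    for rest in walk(p + 1, q + 1):
                        yield ((p, q),) + rest

    return list(walk(0, 0))

if __name__ == "__main__":
    for pairs in best_alignments("to be or not to be".split(), "to be".split()):
        print(pairs)
```

For "to be or not to be" against "to be" this finds three equally good
alignments, so something has to pick one. For "a b" against "b a" it can
only ever line up one of the two shared words, which is the unsolvable
case.
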
I have a hunch you'll need to devise a heuristic based on the nature of
your inputs and desired outputs (i.e. you'll need to tweak it a lot).
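
For what it's worth, here is a minimal sketch of one such heuristic in
Python, as a starting point for the tweaking rather than a finished
answer. It adds the texts one at a time and uses the standard library's
difflib.SequenceMatcher to match each new text against the columns built
so far. The function names align and render and the fill character are
my own choices. Two assumptions to note: words must match exactly
(including case and punctuation), and a differing word such as "been"
gets its own column instead of sharing one with "be", so the result is
wider than the hand-made example above.

```python
from difflib import SequenceMatcher

GAP = ""

def align(texts):
    """Greedy progressive alignment. Returns one list of cells per text."""
    rows = []  # rows[r][c] is the word of text r in column c, or GAP
    keys = []  # keys[c] is the single word that column c holds
    for text in texts:
        words = text.split()
        matcher = SequenceMatcher(a=keys, b=words, autojunk=False)
        col_map, new_keys, new_row = [], [], []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Existing columns are always kept; the new text fills them
            # only where its words matched.
            for c in range(i1, i2):
                col_map.append(c)
                new_keys.append(keys[c])
                new_row.append(keys[c] if tag == "equal" else GAP)
            if tag != "equal":
                # Words not matched here get brand-new columns.
                for w in words[j1:j2]:
                    col_map.append(None)
                    new_keys.append(w)
                    new_row.append(w)
        # Re-lay the earlier rows out on the widened grid.
        rows = [[row[c] if c is not None else GAP for c in col_map]
                for row in rows]
        rows.append(new_row)
        keys = new_keys
    return rows

def render(rows, fill="_"):
    """Pad every column to its widest word and join the cells with fill."""
    if not rows:
        return []
    widths = [max(len(row[c]) for row in rows) for c in range(len(rows[0]))]
    return [fill.join(cell.ljust(w, fill) for cell, w in zip(row, widths))
            for row in rows]

TEXTS = [
    "To be or not to be that is the question.",
    "To or not that's the question.",
    "To been or to bend is the question.",
]

if __name__ == "__main__":
    for line in render(align(TEXTS)):
        print(line)
```

The result depends on the order in which the texts are fed in, and
SequenceMatcher simply takes the first of several equally good matches,
so this is exactly the kind of thing you would end up tuning for your
own inputs.
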
--John