Continuing from my previous blog (Data preparation – Normalization subsystem – Clustering Text using Fingerprinting) in this blog I will examine the distance approach a.k.a nearest neighbor to clustering text strings.
The distance approach to clustering provides better flexibility in finding clusters compared to the fingerprinting method. This enables finding clusters by fine tuning the distance parameter and finding acceptable clusters. However, the distance approach has a drawback – Performance – It requires n*(n-1)/2 comparisons. Techniques such as blocking can be used to reduce the number of comparisons – with blocking enabled the number of comparisons become n*m (m-1)/2 where n is the number of blocks and m is the average size of the block. In this approach interesting clusters can be found by varying the distance and block counts.
The following sections discusses the various types of distance approaches with examples.
Levenshtein distance measures the numbers of edits that are required to change one string to other. This is useful in clustering data which have spelling mistakes. For example, setting a distance of 3 will cluster names such as Mississippi and Misisipi.
In this approach the similarity between 2 strings STR1 and STR2 is evaluated by compressing STR1 and then compressing STR1 plus STR2. If the resulting difference between these compressed strings is minimal or zero then STR1 and STR2 can be considered to be similar.
In order for the Kolmogorov approach to work optimally, my understanding is that Prediction by Partial matching (PPM) seems like the most effective compression algorithm. PPM is form of higher order arithmetic coding – Arithmetic coding replaces a input symbol by a specific code. The following link provides a great explanation of Arithmetic coding.
This approach is useful for finding sequences in relatively larger strings (e.g. DNA sequences). A simple example using this approach with a distance set to 2 will cluster strings – Jon Foo, KR Jon Foo, Jon Foo KR, Jon K R Foo.
In my next blog post I will discuss token based clustering approaches.
It would be great if you can share what type of clustering technique has been useful for your data set.