Data preparation – Normalization subsystem – Clustering Text using Distance Methods

Continuing from my previous blog (Data preparation – Normalization subsystem – Clustering Text using Fingerprinting) in this blog I will examine the distance approach a.k.a nearest neighbor  to clustering text strings.

The distance approach to clustering provides better flexibility in finding clusters compared to the fingerprinting method.  This enables finding clusters by fine tuning the distance parameter and finding acceptable clusters. However, the distance approach has a drawback – Performance –  It requires n*(n-1)/2 comparisons.  Techniques such as blocking can be used to reduce the number of comparisons – with blocking enabled the number of comparisons become n*m (m-1)/2 where n is the number of blocks and m is the average size of the block. In this approach interesting clusters can be found by varying the distance and block counts.

The following sections discusses the various types of distance approaches with examples.

Levenshtein Distance

Levenshtein distance measures the numbers of edits  that are required to change one string to other.  This is useful in clustering data which have spelling mistakes. For example, setting a distance of 3 will cluster names such as  Mississippi and  Misisipi.

Kolmogorov Distance

In this approach the similarity between 2 strings STR1 and STR2 is evaluated by compressing STR1 and then compressing STR1 plus STR2. If the resulting difference between these compressed strings is minimal or zero then STR1 and STR2 can be considered to be similar.

In order for the Kolmogorov approach to work optimally, my understanding is that Prediction by Partial matching (PPM) seems like the most effective compression algorithm. PPM is form of higher order arithmetic coding – Arithmetic coding replaces a input symbol by a specific code. The following link provides a great explanation of Arithmetic coding.

http://cotty.16×16.com/compress/nelson1.htm

This approach is useful for finding sequences in relatively larger strings (e.g. DNA sequences). A simple example using this approach with a distance set to 2 will cluster strings – Jon Foo, KR Jon Foo, Jon Foo KR, Jon K R Foo.

In my next blog post I will discuss token based clustering approaches.

It would be great if you can share what type of clustering technique has been useful for your data set.

Advertisements

About atiru

Product Strategist and architect for harnessing value from data.
This entry was posted in Topics related to Organizational Behavior. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s