Continuing the series on clustering text to normalize a data set, in this blog post I will examine token-based clustering. Token-based clustering uses tokens to evaluate the similarity between two strings and to determine their membership in a cluster.
The following sections discuss the various token-based approaches. In each approach, specifying the tokenization function is a key step. Tokenizing a string can be a straightforward or an involved task; the subject deserves a blog post of its own.
Depending on the use case, strings can be scrubbed of stop words before evaluating similarity. Stemming and lemmatization (e.g. men's -> men, are -> be) can also be applied as required. These preprocessing steps can improve the quality of the resulting clusters.
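As a minimal sketch of this preprocessing, assuming a simple regex tokenizer and a tiny illustrative stop-word list (a real pipeline would use a much fuller list and a stemming/lemmatization library):

```python
import re

# Illustrative stop-word list; real lists are much larger.
STOP_WORDS = {"the", "of", "and", "a"}

def tokenize(s, stop_words=STOP_WORDS):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", s.lower())
    return [t for t in tokens if t not in stop_words]

print(tokenize("The Children's Hospital of Oakland"))
# ["children's", 'hospital', 'oakland']
```

The regex and stop-word set here are assumptions for illustration; the right choices depend on the data being normalized.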
Jaccard Coefficient
The Jaccard coefficient is one of the basic approaches to determining similarity between strings. It computes the ratio of the tokens shared by two strings (the intersection) to the union of the tokens of the two strings.
E.g. 1:
Arkansas, Mississippi, Tennessee (these three states are connected by the Mississippi river)
Arkansas, Missouri, Tennessee (also connected by the Mississippi river)
Intersection = {Arkansas, Tennessee} (2 tokens); union = {Arkansas, Mississippi, Missouri, Tennessee} (4 tokens)
Jaccard coefficient = 2/4
E.g. 2:
Prof. Dumbledore, Albus
Dumbledore, Albus
Jaccard coefficient = 2/3
The Jaccard coefficient is good at finding similarity based on the existence of tokens rather than their position in the string, which makes it useful for strings where word swaps are common (e.g. names, places, things).
However, the Jaccard coefficient is sensitive to the presence of an additional word: in E.g. 2, the presence of the title 'Prof.' reduced the similarity. It is also not suitable when there are spelling mistakes. In E.g. 1, if Mississippi and Tennessee were wrongly spelled in one of the rows, the Jaccard coefficient would drop to 1/5.
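The examples above can be reproduced with a small function. Here tokenization is plain whitespace splitting, which is an assumption; note that with this tokenizer 'Prof.' keeps its trailing period as part of the token:

```python
def jaccard(s1, s2):
    """Jaccard coefficient: |intersection| / |union| of the two token sets."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

print(jaccard("Arkansas Mississippi Tennessee", "Arkansas Missouri Tennessee"))  # 0.5
print(jaccard("Prof. Dumbledore, Albus", "Dumbledore, Albus"))  # 2/3
# A misspelling in one row drops the score to 1/5:
print(jaccard("Arkansas Missisippi Tenessee", "Arkansas Missouri Tennessee"))  # 0.2
```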
Cosine Similarity using TF-IDF (Term Frequency - Inverse Document Frequency)
Cosine similarity, which is widely used in information retrieval, measures the angle between two n-dimensional vectors and uses that angle to determine the similarity between strings.
The vectors can represent two (or n) strings (in some cases one of the strings is a query string) or two (or n) documents. If the angle between the vectors is close to zero, the vectors may be considered similar (cos(0) = 1); conversely, if the angle is close to 90 degrees, the strings or documents may be considered dissimilar (cos(90) = 0).
Now, how do we define the vectors? After obtaining the tokens, we compute the TF-IDF of the terms extracted from a document or string. TF measures the frequency of occurrence of a term; intuitively, it captures the importance of a term within the document.
IDF, on the other hand, measures how frequently a term occurs across the corpus of documents or strings and assigns higher weights to terms that occur less frequently, and vice versa.
The combination, TF-IDF, normalizes a term's weight and assigns higher weights to discriminating terms in the document corpus.
The tf-idf score is computed using the following formula (logarithms are base 10):
tf-idf score = log(tf + 1) * log(N/df)
where tf is the term frequency (the number of times term t occurs in document d),
df is the document frequency (the number of documents that contain the term), and
N is the total number of documents in the corpus.
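The formula can be sketched as a small helper (the function name is mine; logs are base 10 to match the worked example that follows):

```python
import math

def tf_idf(tf, df, n_docs):
    """tf-idf score: log-damped term frequency times inverse document frequency."""
    return math.log10(tf + 1) * math.log10(n_docs / df)

print(f"{tf_idf(1, 5, 10):.4f}")  # 0.0906  (term occurring in half the corpus)
print(f"{tf_idf(1, 1, 10):.4f}")  # 0.3010  (term occurring in a single document)
```

Note how the rarer term gets the higher weight, even though both occur once in their document.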
E.g. let's say a column of a data set contained the following hospital names:
- Alameda Hospital
- Alta Bates Summit Medical Center
- Children’s Hospital
- Eden Medical Center
- Fairmont Hospital
- Washington Hospital
- John George Psychiatric Pavilion
- Highland Hospital
- Valley Fair Medical Center
Let's say we want to find the similarity between Fairmont Hospital and Washington Hospital. Taking N = 10 and noting that 'hospital' appears in five of the names (df = 5):
tf-idf (hospital) = log (1 + 1) * log (10/5) = 0.0906
tf-idf (fairmont) = log (1 +1) * log (10/1) = 0.3010
tf-idf(washington) = log (1 + 1) * log (10/1) = 0.3010
Only the shared term 'hospital' contributes to the dot product, since 'fairmont' and 'washington' do not co-occur:
Cosine similarity = (0.0906 * 0.0906) / (sqrt(0.0906^2 + 0.3010^2) * sqrt(0.0906^2 + 0.3010^2)) = 0.0082 / 0.0988 ≈ 0.083
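Putting it together, here is a sketch of the cosine-similarity computation over the two tf-idf vectors, represented as dicts keyed by term (the weights are the values computed above):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

fairmont = {"fairmont": 0.3010, "hospital": 0.0906}
washington = {"washington": 0.3010, "hospital": 0.0906}
print(round(cosine_similarity(fairmont, washington), 3))  # 0.083
```

The low score reflects that the only shared token, 'hospital', is a weak discriminator in this corpus; the heavily weighted terms 'fairmont' and 'washington' pull the vectors apart.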