In this blog I will examine the normalization sub-system, one of the sub-systems I described in my earlier blog on Data Preparation Sub-Systems. A key objective of this step is to ensure data consistency. For example, when working with tabular data, it is important that columns have consistent values. Consistency ensures that similar data values are grouped together and that no duplicates are present in the data. This is an important requirement in various verticals – e.g. customer data, product inventory data, etc.
Faceting is a common technique used to summarize a column of data. Faceting breaks a column up into multiple groups, typically showing counts for each group. Users can then combine groups or apply changes that affect a particular group.
In this blog I will discuss the use of clustering to identify groups in simple text data. I will focus on methods that identify similar data based on its syntactic representation. This approach is useful for grouping data with inconsistencies such as differences in spelling, spacing, and data representation ("San Jose" and "san jose", or "Alfred Hitchcock" and "Hitchcock, Alfred"). In my upcoming blogs I will discuss semantic matching, which extends matching to semantic similarity – "San Francisco" and "SFO", or "Malibu" (the beach or the car model).
Broadly speaking, clustering text means finding groups of values that represent the same thing. Depending on the application, the data can be clustered based on strict or lenient matches.
There are several techniques that can be used to cluster text data, falling into two broad methodologies – the Key Collision method and the Nearest Neighbor method.
In this blog I will discuss the various types of Key Collision methods. These methods typically have linear performance characteristics. The following sections discuss the various approaches and their use-case scenarios.
The fingerprinting method is fast and simple and applies well in many scenarios where strict contextual information can be ignored – e.g. names, places, organizations, and things. In this method, the most variable parts of the text string are eliminated: punctuation, control characters, and white space. The text string is then normalized – converted to lowercase, its tokens sorted and deduplicated, and the result converted to an ASCII representation. Performing these steps in the proper order will cluster text strings such as "San Jose" and "san jose", or "Alfred Hitchcock" and "Hitchcock, Alfred".
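The steps above can be sketched as a small Python function (a minimal illustration, not a production implementation – the exact normalization order may vary by tool):

```python
import string
import unicodedata

def fingerprint(value: str) -> str:
    """Generate a fingerprint key for a text value.

    Steps: trim, lowercase, convert to ASCII, strip punctuation,
    then sort and deduplicate the whitespace-separated tokens.
    """
    # Lowercase and trim surrounding whitespace
    value = value.strip().lower()
    # Normalize accented characters to their ASCII equivalents
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation characters
    value = value.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, deduplicate, and sort the tokens
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

# Values that differ only in case, token order, or punctuation
# produce the same key, so they fall into the same cluster:
fingerprint("San Jose")           # -> "jose san"
fingerprint("san jose")           # -> "jose san"
fingerprint("Alfred Hitchcock")   # -> "alfred hitchcock"
fingerprint("Hitchcock, Alfred")  # -> "alfred hitchcock"
```

Values whose keys collide are placed in the same cluster, which is where the name "key collision" comes from.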
The above method is very useful when collating data from multiple non-standardized data sets.
The character n-gram approach is similar to fingerprinting, with the key difference that the tokens of a text string are created using n-grams, where n = 1, 2, 3, and so on. This approach is useful for finding clusters of text strings that have small differences. For example, using 1-grams, the strings "Massachusetts" and "Masachusets" can be clustered together. Compared to the fingerprinting method, this approach can generate more false positives.
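A minimal sketch of the n-gram variant: instead of splitting on whitespace, the key is built from the sorted, deduplicated character n-grams of the cleaned string. The misspelled "Masachusets" contains the same set of 1-grams as "Massachusetts", so both produce the same key:

```python
def ngram_fingerprint(value: str, n: int = 1) -> str:
    """Key a value by its sorted, deduplicated character n-grams."""
    # Lowercase and keep only alphanumeric characters
    cleaned = "".join(ch for ch in value.lower() if ch.isalnum())
    # Slide a window of size n across the cleaned string
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))

# Both spellings share the same set of characters, so with n=1
# they collide on the key "acehmstu":
ngram_fingerprint("Massachusetts", 1)  # -> "acehmstu"
ngram_fingerprint("Masachusets", 1)    # -> "acehmstu"
```

Larger values of n make the key more selective, trading fewer false positives for fewer matches.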
In this approach the tokens are generated based on pronunciation. This is useful for clustering similar-sounding words – for example, "Sudan" and "Sweden". Neither the fingerprinting nor the n-gram approach will work in this scenario.
The Metaphone method is commonly used for indexing words by their English pronunciation.
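A full Metaphone implementation is lengthy, so as an illustration here is a sketch of Soundex, an older and simpler phonetic keying algorithm in the same family (this is a stand-in for Metaphone, not the algorithm itself). Both "Sudan" and "Sweden" map to the same code:

```python
def soundex(word: str) -> str:
    """Compute the 4-character Soundex code for a word."""
    # Map consonants to their Soundex digit; vowels and h/w/y get none
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()        # keep the first letter as-is
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        # Append a digit only when it differs from the previous code
        if digit and digit != prev:
            result += digit
        # h and w do not reset the previous code; vowels do
        if ch not in "hw":
            prev = digit
    # Pad with zeros and truncate to the standard 4 characters
    return (result + "000")[:4]

# Similar-sounding words collide on the same phonetic key:
soundex("Sudan")   # -> "S350"
soundex("Sweden")  # -> "S350"
```

Metaphone refines this idea with more detailed rules for English pronunciation, which is why it is the more common choice for phonetic clustering.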
In my upcoming blog I will discuss Distance approaches to clustering.