In this blog I will discuss some of the key tasks involved in data preparation for Big Data. This is a general list of sub-systems that apply when doing data preparation, so not every task listed will apply in every circumstance. Each task below deserves a separate blog of its own.
Exploration – Big Data comes in all shapes, and getting to know the content of the data is the first step. For example, an application log file or system log file can have multiple columns of information, with each column holding scalar values, free-form text, or complex text that may or may not conform to a regular expression. The ability to visualize the data and come to a conclusion about its shape is the focus of this step. Specifying column names and column data types (wherever appropriate) can also be performed in this step.
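As a minimal sketch of exploration, the snippet below checks how many lines of a (hypothetical) log sample conform to a candidate regular expression; the sample lines and the pattern are illustrative assumptions, not a real log format.

```python
import re

# Hypothetical sample of raw log lines; real logs vary widely in shape.
log_lines = [
    "2021-03-14 12:01:22 INFO  Service started",
    "2021-03-14 12:01:25 WARN  Disk usage at 91%",
    "free-form note left by an operator",
]

# Candidate shape: timestamp, level, free-form message.
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+)\s+(.*)$")

def explore(lines):
    """Count how many lines conform to the candidate pattern."""
    matched = [ln for ln in lines if pattern.match(ln)]
    return len(matched), len(lines)

matched, total = explore(log_lines)
print(f"{matched} of {total} lines match the candidate shape")
```

A low match rate is itself a finding: it tells you the column is free-form text rather than structured, which shapes the profiling and extraction steps that follow.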
Profiling – Now that we have an idea of the shape, in this step we capture measurements. For example, if we are working with a system log file that is a TSV file with 10 columns, then providing a summary of each column will be helpful. The summary can include the distribution of values or clusters of values. The ability to visualize the content and make decisions about which values can be combined, altered, or deleted can be performed in this step.
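A simple profiling sketch: compute the value distribution of one column of a tab-separated file. The sample rows (HTTP status and response time) are invented for illustration.

```python
from collections import Counter

# Hypothetical TSV rows (status code, response time in ms).
rows = ["200\t12", "200\t15", "404\t8", "200\t11", "500\t120"]

def profile_column(rows, index):
    """Return the distribution of values in one tab-separated column."""
    values = [r.split("\t")[index] for r in rows]
    return Counter(values)

print(profile_column(rows, 0))  # Counter({'200': 3, '404': 1, '500': 1})
```

Even a distribution this crude surfaces decisions: the lone 500 row might be an outlier worth deleting, or the very signal the analysis is after.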
Transform – At this stage we have a good idea of the shape and content and would like to transform the content into a more granular form, combine it into richer data, or standardize it. For example, we may have a column with multiple comma-separated values that we want to split apart. Alternatively, we may have a column with dates and times in a non-standard form and would like to convert them to a canonical format (ISO 8601).
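The two transforms mentioned above can be sketched in a few lines; the record and its day/month/year input format are assumptions for the example.

```python
from datetime import datetime

# Hypothetical input record.
record = {"tags": "prod,web,us-east", "seen": "14/03/2021 12:01:22"}

# Split a comma-separated column into granular values.
tags = record["tags"].split(",")

# Convert a non-standard date/time to ISO 8601
# (the input format string is an assumption about the source data).
iso = datetime.strptime(record["seen"], "%d/%m/%Y %H:%M:%S").isoformat()

print(tags)  # ['prod', 'web', 'us-east']
print(iso)   # 2021-03-14T12:01:22
```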
Extraction – When data in a column is unstructured but contains useful information that could serve as either dimensions or facts, extracting that information is valuable. There are many ways to extract information, each with pros and cons, including regular expressions and NLP techniques such as named-entity recognition.
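A regex-based extraction sketch, pulling an email address and an IPv4 address out of an invented free-text log message. Regexes are fast and transparent but brittle on messy input; NLP approaches trade that brittleness for more setup.

```python
import re

# Hypothetical unstructured log message.
message = "Login failed for alice@example.com from 10.0.0.5 at 12:01:22"

# Simple illustrative patterns, not production-grade validators.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

emails = EMAIL.findall(message)
ips = IPV4.findall(message)

print(emails)  # ['alice@example.com']
print(ips)     # ['10.0.0.5']
```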
Enrichment – Richer analysis can be achieved with data that is enriched. For example, enriching a column containing zip codes to include city, county, latitude, and longitude can enable better slicing and dicing of the data, better visualization, and richer parameters for data mining algorithms (e.g. classification).
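A minimal enrichment sketch: join each record against a reference table keyed by zip code. The lookup table here is a hand-built stand-in; real enrichment would draw on a reference dataset or geocoding service.

```python
# Hypothetical reference table keyed by zip code.
ZIP_INFO = {
    "94065": {"city": "Redwood City", "county": "San Mateo",
              "lat": 37.53, "lon": -122.25},
}

def enrich(record):
    """Add city/county/lat/lon fields derived from the zip code."""
    info = ZIP_INFO.get(record.get("zip"), {})
    return {**record, **info}

row = enrich({"customer": "C-1001", "zip": "94065"})
print(row["city"], row["county"])  # Redwood City San Mateo
```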
Data Quality – Data quality has long been an important aspect of data preparation for structured data, especially in MDM (customer data hubs, product data hubs), where features such as name and address verification and product name disambiguation were important. The same techniques can be applied when doing data preparation for Big Data.
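A toy data-quality check in the spirit of name and address verification: flag records with a missing name or a malformed US zip code. The rules are illustrative; real MDM-style verification is far more involved.

```python
import re

def check_record(rec):
    """Return a list of data-quality issues found in one customer record."""
    issues = []
    if not rec.get("name", "").strip():
        issues.append("missing name")
    # US zip: 5 digits, optionally followed by -4 digits (ZIP+4).
    if not re.fullmatch(r"\d{5}(-\d{4})?", rec.get("zip", "")):
        issues.append("malformed zip")
    return issues

print(check_record({"name": "", "zip": "9406"}))
# ['missing name', 'malformed zip']
```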
Publish – The publish aspect of data preparation for Big Data can vary based on the downstream analysis desired. The following lists some of the options.
- Publishing in a format suitable for data mining algorithms to consume.
- Publishing as a property graph that is suitable for executing graph algorithms – GraphX, GraphLab
- Publishing to a graph database such as Neo4j and using Cypher for graph analytics
- Publishing as RDF – useful when integrating multiple data sources and using SPARQL for querying
- Publishing to BI systems as dimensions and facts, which can then be used by products such as OBIEE
- Publishing to discovery systems such as Endeca
- Publishing to search systems such as Solr / Lucene.
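To make one of the options above concrete, here is a minimal sketch of publishing a record as RDF in N-Triples form. The base URI and predicate names are made up for illustration, not a real vocabulary.

```python
def to_ntriples(subject, props):
    """Serialize one record as N-Triples (illustrative URIs, literal objects)."""
    base = "http://example.org"
    lines = []
    for pred, obj in props.items():
        lines.append(f'<{base}/{subject}> <{base}/{pred}> "{obj}" .')
    return "\n".join(lines)

nt = to_ntriples("customer/C-1001", {"city": "Redwood City", "zip": "94065"})
print(nt)
```

Once multiple sources are published this way, they can be loaded into a triple store and queried together with SPARQL.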
I will go into depth on the above topics in upcoming blogs.