A key aspect of successfully implementing analytics for Big Data is data preparation. In this post I will examine some of the important differences between preparing data for Big Data and preparing structured data. In both scenarios data preparation is the key to enabling a successful BI system, without which BI systems will conform to the adage – garbage in, garbage out.
It is important to get a good understanding of what data preparation means along multiple dimensions. I feel data preparation is at the stage ETL was at before being formalized by stalwarts like Dr. Kimball. Also, the process of data preparation and the roles involved in it have some key differences compared to structured ETL systems.
Let's get to it…
Typically, the key stakeholders in a business specify the business metrics they would like to see in order to evaluate performance and determine the next steps for their respective groups. For example, the VP of sales would like to see sales numbers across geographical regions and products for the quarter. The VP of marketing would like demographic information about the customer base to better understand its segments and target campaigns.
These requirements are then passed on to a BI analyst, who prepares the requirements for the BI system and the data warehouse, which in turn determine the requirements for the ETL. Typically, the requirements involve validating data constraints (null, not null, FK relationships), transformations, and fact and dimension management (please refer to Dr. Kimball's 34 ETL subsystems, which provide a great explanation of data preparation requirements and the corresponding ETL tasks).
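To make the flavor of these constraint checks concrete, here is a minimal plain-Python sketch of not-null and FK validation. The order/customer records and field names are hypothetical illustrations, not Kimball's actual subsystems.

```python
# Hypothetical fact rows (orders) and a dimension (customers); names are made up.
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 250.0},
    {"order_id": 2, "customer_id": "C9", "amount": 99.5},   # unknown customer
    {"order_id": 3, "customer_id": "C2", "amount": None},   # null amount
]
customers = [{"customer_id": "C1"}, {"customer_id": "C2"}]

def validate(fact_rows, dim_rows):
    """Apply not-null and FK checks; return passing rows plus the violations."""
    dim_keys = {d["customer_id"] for d in dim_rows}
    clean, violations = [], []
    for row in fact_rows:
        if row["amount"] is None:                 # not-null constraint
            violations.append((row["order_id"], "null amount"))
        elif row["customer_id"] not in dim_keys:  # FK constraint
            violations.append((row["order_id"], "unknown customer"))
        else:
            clean.append(row)
    return clean, violations

clean, violations = validate(orders, customers)
```

In a real warehouse these checks are pushed into the ETL tool or the database, but the logic is the same: every fact row either passes every constraint or lands in an error stream for remediation.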
Unlike traditional BI, where the stakeholders knew what they wanted, Big Data BI is all about getting the data to speak and using that to make decisions. And unlike structured data, where the requirements are bounded and the size and shape of the data are well understood, the requirements for preparing Big Data are more complex and open-ended.
The key to working with Big Data is exploration and defining the requirements as you go. As such, it is preferable for the BI analyst to work directly with the data and prepare it in a manner that facilitates downstream analysis.
In order to meaningfully work with Big Data, the BI analyst should intuitively be able to:
- Transform and normalize the data
- Manage data quality
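As a rough sketch of what those two activities can look like on semi-structured records: normalize inconsistent values into a canonical shape, then apply a quality step such as deduplication. All field names, formats, and the normalization table below are made up for illustration.

```python
import datetime

# Hypothetical records with inconsistent country and date formats.
raw = [
    {"country": "usa",  "signup": "2014-03-01"},
    {"country": "U.S.", "signup": "03/15/2014"},
    {"country": "usa",  "signup": "2014-03-01"},   # exact duplicate
]

COUNTRY_MAP = {"usa": "US", "u.s.": "US"}          # toy normalization table

def normalize(rec):
    """Transform one record into a canonical shape."""
    country = COUNTRY_MAP.get(rec["country"].lower(), rec["country"])
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            signup = datetime.datetime.strptime(rec["signup"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    else:
        signup = None                              # flag for a quality report
    return {"country": country, "signup": signup}

# Quality step: normalize, then drop exact duplicates while preserving order.
seen, prepared = set(), []
for rec in raw:
    norm = normalize(rec)
    key = tuple(sorted(norm.items()))
    if key not in seen:
        seen.add(key)
        prepared.append(norm)
```

Note the ordering: duplicates can only be detected reliably after normalization, since two differently formatted records may describe the same entity.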
I will discuss the details of the above in another post.
Roles and responsibilities
In my post, BI in the era of Big Data, I identified a variety of roles for enabling traditional BI. Given the high degree of uncertainty in the requirements, the roles and responsibilities for enabling Big Data BI are different. Since the focus is primarily on exploring and tinkering with the data to extract valuable nuggets, the business analyst, or a person who is familiar with the domain data, is best suited for this job. Then, depending on the type of downstream analysis required, the processed data is presented to the appropriate analysis group.
In traditional BI the goal of data preparation is straightforward – publish the prepared data to a structured data warehouse for analysis. In the case of Big Data, however, the goal depends on the type of analysis that can be applied to the data. Prior to, or while, getting a handle on the data (its shape, size, and profile), it is important to define the goal so the appropriate analysis can be performed. The following list offers some thoughts on the analyses that can be performed:
- Traditional data warehouse like analysis
- Data mining – classification, clustering, associations
- Recommendations, collaborative filtering
- Semantic querying – SPARQL
- Graph analytics
- Log mining
The list can go on. Another dimension is whether the analysis should be done in real time, in batch, or both.
The requirements for both human and technology resources are different from those of the traditional data preparation task.
First, human resources:
- Someone who is knowledgeable about the data being prepared and the potential outcomes of the analysis. This person has the knowledge to have a dialog with the data.
- Someone who is knowledgeable about managing Big Data technologies (Hadoop, Spark, graph databases, etc.).
- Data scientist – this is a broad category. Depending on the analysis desired, having a data scientist with the right skill set will be helpful. For example, if the data is text-intensive, then someone with NLP skills will help; if the objective is classification, then someone with data mining skills and experience applying classification algorithms will be helpful.
- Overall architect – someone who can glue the various pieces together, including fellow team members' contributions and the technology, and take responsibility for delivering a finished product.
Next, technology resources:
- A platform to handle Big Data (volume, velocity, and variety)
- Data preparation is an iterative task and hence needs a processing platform built for iteration (hint: Spark), with Spark Streaming and Storm/Kafka for real-time use cases.
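To illustrate why iteration matters, here is a plain-Python stand-in for the explore-profile-refine loop; Spark's chainable transformations express the same pattern at cluster scale. The log lines and field names are invented for the sketch.

```python
# Toy raw input: some lines are well formed, some are not.
logs = [
    "2015-01-01 INFO  start",
    "2015-01-01 ERROR disk full",
    "garbage line",
    "2015-01-02 ERROR timeout",
]

def parse(line):
    """Split a line into date/level/message, or return None if malformed."""
    parts = line.split(None, 2)
    if len(parts) == 3:
        return {"date": parts[0], "level": parts[1], "msg": parts[2]}
    return None

# Iteration 1: parse and profile -- how much of the data is even well formed?
parsed = [p for p in map(parse, logs) if p is not None]

# Iteration 2: after inspecting the profile, narrow to what the analysis needs.
errors = [p for p in parsed if p["level"] == "ERROR"]
```

Each pass answers a question that shapes the next pass, which is exactly the workflow an iterative platform like Spark makes cheap on large datasets by keeping intermediate results in memory.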
In the following posts I will discuss in depth the activities involved in data preparation. Meanwhile, I would like to pose the following questions:
What is your strategy for data preparation? What is your use case? Do you prefer building one yourself, or would you love to see a product or service that does this for you?