In this blog post I will discuss the role of data preparation when working with batch or real-time data sets.
Irrespective of whether the analysis happens in real time or in batch, some degree of data preparation is required for good analytics – the cleaner, richer, and more structured the data set, the better the resulting analytics. However, the extent to which the data needs to be prepared depends on the data set and the application performing the analytics.
Looking at data preparation as a continuum, at one end of the spectrum we have minimal data preparation, and as we move along the continuum the data preparation requirements increase – refer to my blog post Data Preparation Sub-Systems for a description of the sub-systems that make up a data preparation processing pipeline.
At its core, data preparation is a series of actions performed on a data set that transform it into a more usable form. Determining which actions to perform requires a good understanding of the data set and the downstream use cases for the prepared data – the data domain expert will need to explore samples of the data (whether it is anticipated to flow through a real-time framework or is captured in a batch store) and determine the appropriate data preparation sub-systems to deploy in the processing pipeline.
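The idea of data preparation as a series of actions can be sketched as a simple pipeline – each action takes a data set and returns a transformed one, and the expert's job is to choose which actions go in the sequence. This is a minimal illustration in plain Python; the action names and record fields are hypothetical:

```python
from functools import reduce

# Hypothetical preparation actions; each takes and returns a list of records.
def lowercase_names(records):
    return [{**r, "name": r["name"].lower()} for r in records]

def drop_missing_age(records):
    return [r for r in records if r.get("age") is not None]

def prepare(records, actions):
    """Apply a sequence of preparation actions, in order, to a data set."""
    return reduce(lambda data, action: action(data), actions, records)

raw = [{"name": "Alice", "age": 34}, {"name": "BOB", "age": None}]
clean = prepare(raw, [lowercase_names, drop_missing_age])
# clean → [{"name": "alice", "age": 34}]
```

The point of the composable shape is that the same `prepare` call works regardless of which sub-systems the domain expert selects for a given data set.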
As discussed earlier, richer metadata and more structured data enable better insights. Depending on the requirements, data domain experts will explore the data and determine which sub-systems need to be applied. For example, a trivial case may be a data set that requires only simple transformations to produce a format suitable for a machine learning algorithm. More complex scenarios include data sets with columns whose complex structures need to be broken down into more granular dimensions or facts, data sets that contain duplicates, or data that needs to be enriched – the list is endless.
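To make the more complex scenarios concrete, here is a toy sketch of three such sub-systems – de-duplication, breaking a complex column into granular dimensions, and enrichment from a lookup – applied to a small sample. The field names and the region lookup are purely hypothetical:

```python
# A small sample with a duplicate row and a nested "address" column.
sample = [
    {"id": 1, "address": {"city": "Austin", "zip": "78701"}},
    {"id": 1, "address": {"city": "Austin", "zip": "78701"}},  # duplicate
    {"id": 2, "address": {"city": "Denver", "zip": "80202"}},
]

def deduplicate(records):
    """Drop records whose key has already been seen."""
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def flatten_address(records):
    """Break the complex address column into granular dimensions."""
    return [{"id": r["id"],
             "city": r["address"]["city"],
             "zip": r["address"]["zip"]} for r in records]

REGION = {"Austin": "South", "Denver": "West"}  # hypothetical enrichment source

def enrich_region(records):
    """Enrich each record with a region looked up from an external source."""
    return [{**r, "region": REGION.get(r["city"], "Unknown")} for r in records]

prepared = enrich_region(flatten_address(deduplicate(sample)))
```

In a real pipeline each of these would be a full sub-system (e.g. fuzzy de-duplication, schema-driven flattening, enrichment against a reference data service), but the staging order shown – clean, then reshape, then enrich – is a common pattern.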
The selected sub-systems can then be deployed in the data processing pipeline. Using the Lambda Architecture as one reference architecture for building Big Data systems, the data preparation sub-systems can be deployed in either the batch layer (e.g., as a Spark job processing historical data on HDFS) or the speed layer (e.g., as bolts in a Storm topology processing feeds from Kafka).
In many use cases the results of analyzing the batch layer are used to drive the speed layer, so ensuring consistent data preparation across the two is important. For example, classification models are typically built from modeling samples and tested using data from the batch layer. The resulting model is then deployed in the speed layer to classify the incoming data stream. In such cases it is important to ensure that the shape of the data fed to the model is consistent with the shape of the data used to build it.
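One common way to keep the shapes consistent is to share a single feature-preparation routine between the batch layer (where the model is built) and the speed layer (where it scores the stream). The sketch below uses a deliberately trivial "model" (a mean threshold) and hypothetical field names; the point is only that both layers call the same `prepare_features`:

```python
def prepare_features(record):
    """Shared preparation routine: both the batch (training) path and the
    speed (scoring) path use this, so the model always sees the same shape."""
    return [float(record["amount"]), float(record["hour"])]

# Batch layer: derive a toy threshold "model" from prepared historical records.
history = [{"amount": 10.0, "hour": 2}, {"amount": 500.0, "hour": 3}]
threshold = sum(prepare_features(r)[0] for r in history) / len(history)

def classify(record):
    """Speed layer: apply the SAME preparation before scoring a stream record."""
    features = prepare_features(record)
    return "high" if features[0] > threshold else "low"
```

If the speed layer hand-rolled its own feature extraction instead, any drift between the two implementations would silently degrade the model's accuracy – which is exactly the consistency risk described above.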
In conclusion, determining the scope of data preparation, and ensuring that this scope is applied consistently across the batch and speed layers wherever applicable, is important for accurate analytics.
Share your experience – what is the scope of your data preparation on a scale of 1 to 10, where 1 is minimal and 10 is complex?