Data Preparation for Batch and Real-Time Data

In this blog post I will discuss the role of data preparation when working with batch or real-time data sets.

Irrespective of whether data is analyzed in real time or in batch, some degree of data preparation is required to ensure better analytics: cleaner, richer, and more structured data sets yield better insights. However, the extent to which data needs to be prepared depends on the data set and the application performing the analytics.

Looking at data preparation as a continuum, at one end of the spectrum we have minimal data preparation, and as we move along the continuum the preparation requirements increase. Refer to my blog post Data Preparation Sub-Systems for a description of the sub-systems that make up a data preparation processing pipeline.

At its core, data preparation is a series of actions performed on a data set that transform it into a more usable form. Determining which actions to perform requires a good understanding of the data set and the downstream use cases for the prepared data. This implies that a data domain expert will need to explore samples of the data (whether anticipated to flow through the real-time framework or captured in a batch store) and determine the appropriate data preparation sub-systems to deploy in the processing pipeline.
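The idea of preparation as a series of actions can be sketched in plain Python. The action names here (`strip_whitespace`, `normalize_case`) and the sample record are purely illustrative, not from any specific library:

```python
# A preparation sub-system is just an action applied to a record;
# the pipeline runs the chosen actions in order.

def strip_whitespace(record):
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def normalize_case(record):
    return {k: v.lower() if isinstance(v, str) else v for k, v in record.items()}

def prepare(record, actions):
    """Run each preparation action over the record, in order."""
    for action in actions:
        record = action(record)
    return record

sample = {"name": "  Alice ", "city": "LONDON"}
prepared = prepare(sample, [strip_whitespace, normalize_case])
print(prepared)  # {'name': 'alice', 'city': 'london'}
```

Exploring a sample of records with a pipeline like this is one way the domain expert can decide which actions are actually needed before anything is deployed.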

As discussed earlier, richer meta-data and more structured data enable better insights. Depending upon the requirements, data domain experts will explore the data and determine which sub-systems need to be applied. A trivial case may be a data set that requires only simple transformations to produce a format suitable for a machine learning algorithm. More complex scenarios include data sets with columns whose complex structures need to be broken down into more granular dimensions or facts, data sets containing duplicates, or data that needs to be enriched; the list is endless.
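Two of the scenarios above, breaking a compound column into granular fields and removing duplicates, can be sketched as follows. The `full_address` field and its comma-separated layout are assumptions made for this example:

```python
# Break a compound column into granular fields, then de-duplicate.

def split_address(record):
    # "street, city, zip" becomes three separate, more granular columns.
    street, city, zip_code = [p.strip() for p in record.pop("full_address").split(",")]
    record.update({"street": street, "city": city, "zip": zip_code})
    return record

def dedupe(records, key):
    # Keep the first record seen for each distinct key.
    seen, unique = set(), []
    for r in records:
        k = tuple(r[f] for f in key)
        if k not in seen:
            seen.add(k)
            unique.append(r)
    return unique

rows = [
    {"id": 1, "full_address": "1 Main St, Springfield, 01101"},
    {"id": 1, "full_address": "1 Main St, Springfield, 01101"},
]
cleaned = dedupe([split_address(r) for r in rows], key=["id"])
```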

The chosen sub-systems can then be deployed in the data processing pipeline. Using the Lambda Architecture as a reference architecture for building Big Data systems, the data preparation sub-systems can be deployed in either the batch layer (e.g. as a Spark process working over historical data on HDFS) or the speed layer (e.g. as bolts in a Storm topology processing feeds from Kafka).
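A sketch of that dual deployment, with plain Python standing in for the frameworks: in practice `batch_job` would be a Spark job and `on_stream_event` the body of a Storm bolt, but the key point is that both invoke the same preparation sub-system:

```python
# One preparation sub-system, two deployment contexts.

def prepare(record):
    # The shared sub-system: trim and lower-case every value.
    return {k: str(v).strip().lower() for k, v in record.items()}

def batch_job(stored_records):
    # Batch layer: apply the sub-system over the whole historical store.
    return [prepare(r) for r in stored_records]

def on_stream_event(record, emit):
    # Speed layer: apply the same sub-system to each arriving record.
    emit(prepare(record))

emitted = []
on_stream_event({"user": "  Alice "}, emitted.append)
```

Packaging the sub-system once and calling it from both layers avoids the drift that creeps in when each layer re-implements the same cleaning logic.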

In many use cases the results from analyzing the batch layer drive the speed layer, so ensuring consistent data preparation across the two is important. For example, classification models are typically built and tested using data from the batch layer; the resulting model is then deployed in the speed layer to classify the incoming data stream. In such cases it is important to ensure that the shape of the data fed to the model is consistent with the shape of the data used to build the model.
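One way to enforce that shape consistency is a single feature-extraction function shared by batch model building and speed-layer scoring. The `FEATURES` list and record fields below are assumptions for illustration:

```python
# A single source of truth for feature order and defaults, used by
# both the batch (training) and speed (scoring) layers.

FEATURES = ["age", "income"]

def to_feature_vector(record):
    return [float(record.get(f, 0.0)) for f in FEATURES]

# Batch layer: build the training matrix from historical records.
training = [to_feature_vector(r) for r in [{"age": 34, "income": 55000}]]

# Speed layer: score an incoming record with the *same* extraction,
# so its shape matches what the model was trained on.
incoming = to_feature_vector({"age": 41})  # missing income -> default 0.0
assert len(incoming) == len(training[0])
```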

In conclusion, determining the scope of data preparation, and ensuring that scope is applied consistently across the batch and speed layers wherever applicable, is essential for accurate analytics.

Share your experience: what is the scope of your data preparation on a scale of 1 to 10, with 1 being minimal and 10 being complex?



About atiru

Product Strategist and architect for harnessing value from data.
