Before discussing a data preparation platform for Big Data, let's look at some of the requirements:
- Since there is no a priori knowledge of the data content, data preparation is a highly interactive and visual process.
- Getting a profile of the data is important. Understanding the layout of the data and basic statistics such as the distribution of values, null counts, etc. is essential. For example, with tabular data, profile information about each column is important.
- Data preparation is highly iterative. The ability to make a series of changes and view the transformed data as the changes are applied is important.
- Just as important as making a series of changes is the ability to undo them.
- The latency in applying a transformation and viewing the results should be minimal.
- The ability to merge data sets using exact or fuzzy joins is important.
- The results of data preparation should feed seamlessly into downstream analytics.
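To make the profiling requirement concrete, here is a minimal sketch in plain Python of the kind of per-column statistics a preparation tool would surface (the column name and data below are made up for illustration):

```python
from collections import Counter

def profile_column(values):
    """Compute basic profile statistics for one column of tabular data:
    row count, null count, distinct values, and the most frequent values."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }

# Hypothetical "country" column containing one missing value.
country = ["US", "US", "IN", None, "FR", "US"]
print(profile_column(country))
# → {'count': 6, 'nulls': 1, 'distinct': 3, 'top_values': [('US', 3), ('IN', 1), ('FR', 1)]}
```

A real platform would compute these statistics in parallel across columns and partitions, but the per-column summary it presents to the user is essentially this.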
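The fuzzy-join requirement can likewise be sketched with the standard library's `difflib`: two records match when their keys are "similar enough" rather than identical. The data sets, keys, and threshold below are illustrative assumptions, not a production matching strategy:

```python
import difflib

def fuzzy_join(left, right, threshold=0.8):
    """Join two lists of (key, value) records, matching keys whose
    difflib similarity ratio meets the threshold (case-insensitive)."""
    joined = []
    for lkey, lval in left:
        for rkey, rval in right:
            score = difflib.SequenceMatcher(None, lkey.lower(), rkey.lower()).ratio()
            if score >= threshold:
                joined.append((lkey, lval, rval))
    return joined

# Hypothetical data sets: "Acme Corp" and "ACME Corp." should still match.
revenue = [("Acme Corp", 120), ("Globex", 45)]
sector = [("ACME Corp.", "manufacturing"), ("Initech", "software")]
print(fuzzy_join(revenue, sector))
# → [('Acme Corp', 120, 'manufacturing')]
```

At scale this naive pairwise comparison is quadratic; real systems use blocking or indexing to narrow the candidate pairs first, but the matching logic per pair is the same idea.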
Looking at these requirements, it is clear that we need a processing platform that is highly performant (in-memory), supports highly iterative processing, and seamlessly chains the output of the data preparation step with analytic processing. The following diagram illustrates an example of a value chain consisting of data preparation and graph analysis.
The platform best suited to this type of data preparation work is Apache Spark. The following diagram illustrates the Spark stack, which provides the capabilities for ingesting, processing, and publishing data for downstream analytics.
Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms (Wikipedia).
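Spark's programming model chains lazy transformations over an in-memory data set, with nothing computed until an action is invoked. As a rough analogy only (this is plain Python with illustrative names, not Spark's API), the same pattern can be sketched with generators:

```python
def parse(lines):
    # "Transformation": lazily split CSV-like lines into fields.
    return (line.split(",") for line in lines)

def keep_valid(rows):
    # "Transformation": lazily drop rows with any empty field.
    return (row for row in rows if all(field for field in row))

# "Load" a small data set into memory once (Spark would cache an RDD
# in cluster memory and query it repeatedly).
lines = ["alice,30", "bob,", "carol,25"]

# Chain transformations; nothing runs until the final "action" (list()).
result = list(keep_valid(parse(lines)))
print(result)
# → [['alice', '30'], ['carol', '25']]
```

The payoff in Spark is that once the data is cached in memory, each iteration of an interactive preparation session or machine learning algorithm re-queries it without going back to disk.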
In my upcoming blog posts I will illustrate the processing of a data set using Spark and perform analysis on the resulting data set using GraphX.