Manipulating Graphs

In my earlier post on Graph Databases I discussed two types of graph models – the Property Graph Model and the RDF Model.

In this post I will discuss the API standards available for manipulating graph models.

Property Graph API a.k.a Blueprints API

The Blueprints API can be seen as the JDBC of property graph databases: it provides a collection of interfaces for graph database providers to implement.

This API is specified as part of the TinkerPop stack, an open-source project in the graph space. The current version of TinkerPop is 2.x. The key supporters of TinkerPop include Neo4j and Titan, among others.

Example using Blueprint API

import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph;

Graph graph = new Neo4jGraph("/tmp/my_graph");
Vertex a = graph.addVertex(null);
Vertex b = graph.addVertex(null);
Edge e = graph.addEdge(null, a, b, "knows");
e.setProperty("since", 2006);
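Blueprints itself is only a set of interfaces, so running the snippet above needs a backing store such as Neo4j on the classpath. To make the property graph data model concrete without any dependencies, here is a self-contained sketch in plain Java – not the Blueprints API, and all class names below are mine – of vertices and edges that carry property maps:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy property graph: vertices and edges both carry a map of properties.
// Illustrates the data model only; it is not the Blueprints API.
public class TinyPropertyGraph {
    static class Element {
        final Map<String, Object> properties = new HashMap<>();
        void setProperty(String key, Object value) { properties.put(key, value); }
        Object getProperty(String key) { return properties.get(key); }
    }
    static class Vertex extends Element {
        final long id;
        Vertex(long id) { this.id = id; }
    }
    static class Edge extends Element {
        final Vertex out, in;     // directed edge: out --label--> in
        final String label;
        Edge(Vertex out, Vertex in, String label) {
            this.out = out; this.in = in; this.label = label;
        }
    }

    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();
    private long nextId = 0;

    Vertex addVertex() { Vertex v = new Vertex(nextId++); vertices.add(v); return v; }
    Edge addEdge(Vertex out, Vertex in, String label) {
        Edge e = new Edge(out, in, label); edges.add(e); return e;
    }

    public static void main(String[] args) {
        // Mirrors the Blueprints example above: a --knows--> b, since 2006.
        TinyPropertyGraph g = new TinyPropertyGraph();
        Vertex a = g.addVertex();
        Vertex b = g.addVertex();
        Edge e = g.addEdge(a, b, "knows");
        e.setProperty("since", 2006);
        System.out.println(e.label + " since " + e.getProperty("since"));
    }
}
```

A Blueprints provider essentially implements this contract against its own storage engine, which is what makes graph code portable across vendors.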

The TinkerPop stack is illustrated below.



RDF API a.k.a Jena API

Jena is a Java API for creating and manipulating RDF graphs. Graph database providers, whether they support a native or non-native graph store, implement Jena to support creating and manipulating RDF graphs.

Please refer to my post on RDF Serialization and Triplestores, which discusses the many ways RDF can be serialized and persisted.

A parallel framework to Jena is Sesame (also known as OpenRDF), likewise an open-source Java framework for storing and querying RDF data. Some providers implement Sesame's SAIL (Storage And Inference Layer) API.

Example using RDF API

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.VCARD;

String personURI = "http://somewhere/JohnSmith";
String givenName = "John";
String familyName = "Smith";
String fullName = givenName + " " + familyName;

// create an empty Model
Model model = ModelFactory.createDefaultModel();

// create the resource
// and add the properties cascading style
Resource johnSmith
    = model.createResource(personURI)
           .addProperty(VCARD.FN, fullName)
           .addProperty(VCARD.Given, givenName)
           .addProperty(VCARD.Family, familyName);

The following table illustrates a sample of vendors and their support for graph APIs. Please check with each vendor for the exact versions in which the APIs are supported.



*Note – SailGraph implements the Blueprints interface TransactionalGraph, turning a SAIL-based RDF store into a Blueprints graph; conversely, GraphSail turns any IndexableGraph implementation (another Blueprints interface) into an RDF store (Blueprints refers to these as "ouplementations").

Posted in Topics related to Graph Databases and Compute, Linked Data (RDF)

RDF Serialization and Triplestores

In my earlier post on Graph Databases I introduced RDF. Briefly, RDF is a language for expressing data models using statements expressed as triples; each statement is composed of a subject, a predicate, and an object. RDF adds several important concepts that make these models much more precise and robust. These additions play an important role in removing ambiguity when transmitting semantic data between machines that may have no other knowledge of one another. I find RDF very useful for integrating streams of related data.

In this post I will discuss the many ways RDF can be represented or serialized.

RDF Serialization

RDF serialization is the standard way to represent and share semantic data. There are five serialization formats in common use:

  • N-Triples, the simplest of the notations;
  • N3 (Notation3), a compaction of the N-Triples format;
  • Turtle, a subset of N3. Unlike full N3, whose expressive power goes well beyond RDF, Turtle can only serialize valid RDF graphs;
  • RDF/XML, one of the most frequently used serialization formats;
  • RDFa (RDF in attributes), which can be embedded in markup languages such as XHTML.

Example of Serialization using N3

@prefix atiru: <>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix semperp: <>.
@prefix tobes: <>.

atiru:ts a foaf:Person;
    foaf:homepage <>;
    foaf:interest <>;
    foaf:knows semperp:Jon,
        [ a foaf:Person;
          foaf:mbox <>;
          foaf:name "Jon Foo" ];
    foaf:mbox <>;
    foaf:name "Ananth Tiru";
    foaf:nick "atiru".
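For comparison, here is a single statement in two of the other formats listed above: N-Triples spells out every URI in full, one triple per line, while Turtle abbreviates them with prefixes. The URIs below are illustrative only (the vCard namespace matches the one the Jena VCARD vocabulary uses):

```
# N-Triples: one triple per line, full URIs
<http://example.org/JohnSmith> <http://www.w3.org/2001/vcard-rdf/3.0#FN> "John Smith" .

# Turtle: the same triple, abbreviated with a prefix
@prefix vcard: <http://www.w3.org/2001/vcard-rdf/3.0#> .
<http://example.org/JohnSmith> vcard:FN "John Smith" .
```

The two documents describe exactly the same graph; only the notation differs, which is why tools can convert freely between the formats.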


Triplestores

Triplestores are Database Management Systems (DBMS) for data modeled using RDF. Triplestores can be broadly classified into three categories: native triplestores, RDBMS-backed triplestores, and NoSQL triplestores.

Native triplestores are implemented from scratch and exploit the RDF data model to efficiently store and access RDF data. Examples include 4store, AllegroGraph, Bigdata, Jena TDB, Sesame, Stardog, OWLIM, and uRiKA.

RDBMS-backed triplestores are built by adding an RDF-specific layer to an existing RDBMS. Examples include Jena SDB, IBM DB2, Oracle Database (with the Spatial and Graph option), and Virtuoso.

NoSQL triplestores are built on top of so-called NoSQL databases, which include key-value stores and document databases. An example is the Oracle NoSQL Database.
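Whatever the storage category, most triplestores expose SPARQL as the query language. As a quick illustration, a query over vCard-style data such as the Jena example in my earlier post might look like this (the prefix and data are assumptions for illustration, not tied to any particular product):

```
PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#>

# Find the full name of every resource that has one.
SELECT ?person ?name
WHERE {
  ?person vcard:FN ?name .
}
```

The same query runs unchanged against a native, RDBMS-backed, or NoSQL triplestore, which is one of the main attractions of standardizing on RDF.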

Posted in Topics related to Graph Databases and Compute, Linked Data (RDF)

Data Preparation for Batch and Real-time data

In this blog post I will discuss the role of data preparation when working with batch or real-time data sets.

Irrespective of whether the analysis happens in real time or in batch, some degree of data preparation is required to ensure better analytics – cleaner, richer, and more structured data sets yield better analytics. However, the extent to which data needs to be prepared depends on the data set and the application performing the analytics.

Looking at data preparation as a continuum, at one end of the spectrum we have minimal data preparation, and as we move along the continuum the data preparation requirements increase – refer to my blog post Data Preparation Sub-Systems for a description of the sub-systems that make up a data preparation processing pipeline.

At bottom, data preparation is a series of actions performed on a data set that transforms it into a more usable form. Determining which actions to perform requires a good understanding of the data set and of the downstream use cases for the prepared data. This implies that the data domain expert will need to explore samples of data (whether anticipated to flow through a real-time framework or captured in a batch store) and determine the appropriate data preparation sub-systems to deploy in the processing pipeline.

As discussed earlier, richer metadata and more structured data enable better insights. Depending on the requirements, data domain experts will explore the data and determine which sub-systems need to be applied. A trivial case may be a data set that requires only simple transformations to produce a format suitable for a machine learning algorithm. More complex scenarios include data sets with columns whose complex structures need to be broken down into more granular dimensions or facts, data sets containing duplicates, or data that needs to be enriched – the list is endless.

The determined sub-systems can then be deployed in the data processing pipeline. Using the Lambda Architecture as a reference architecture for building Big Data systems, the data preparation sub-systems can be deployed in either the batch layer (e.g., as a Spark process working over historical data on HDFS) or the speed layer (e.g., as Storm bolts processing feeds from Kafka).

In many use cases the results of analyzing the batch layer are used to drive the speed layer, so ensuring consistent data preparation across the two is important. For example, classification models are typically built on samples and tested against data from the batch layer; the resulting model is then deployed in the speed layer to classify the incoming data stream. In such cases it is important that the shape of the data fed to the model be consistent with the shape of the data used to build it.

In conclusion, determining the scope of data preparation, and ensuring that it is applied consistently across the batch and speed layers wherever applicable, is important for accurate analytics.

Share your experience – what is the scope of your data preparation on a scale of 1 to 10, 1 being minimal and 10 being complex?


Posted in Data Management and Analytics

Data Preparation Platform for Big Data

Before discussing a data preparation platform for Big Data, let's look at some of the requirements:

  • Since there is no a priori knowledge of the data content, data preparation is a highly interactive and visual process.
  • Getting a profile of the data is important: understanding its layout and basic statistics such as the distribution of values, nulls, etc. For example, with tabular data, profile information about each column is important.
  • Data preparation is highly iterative. The ability to make a series of changes and view the transformed data as the changes are being made is important.
  • As important as it is to make a series of changes, it is equally important to be able to undo them.
  • The latency between applying a transformation and viewing the results should be minimal.
  • The ability to merge data sets using exact or fuzzy joins is important.
  • The ability to seamlessly use the results of data preparation to perform analytics.

Looking at these requirements, it is clear that we need a processing platform that is highly performant (in-memory), supports highly iterative processing, and seamlessly chains the output of the data preparation step into analytic processing. The following diagram illustrates an example of a value chain consisting of data preparation and graph analysis.



The platform best suited to this type of data preparation work is Apache Spark. The following diagram illustrates the Spark stack, which provides capabilities for ingesting, processing, and publishing data for downstream analytics.


Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms (Wikipedia).

In upcoming blog posts I will illustrate processing a data set using Spark and performing analysis on the resulting data set using GraphX.


Posted in Data Management and Analytics

Data preparation – to cloud or not to cloud

From both a consumer and a producer perspective, the decision to go to the cloud or not is an important and sometimes difficult one. The following are some points to consider when making the decision.

  • Data locality – Where is the majority of the data that needs to be prepared being collected? High-speed networks; convenient, reliable, and durable storage in the cloud; and strategies for copying across the WAN are minimizing the effects of data locality. I came across a product called WANdisco which enables bi-directional replication of HDFS and HBase between two data centers. If anyone reading this blog is using it, send me your experience and use case!
  • Real-time decisions – If decisions need to be made in real time, then preparation needs to happen close to where the data is gathered.
  • Complementary services – Typically, data preparation is part of a value chain that sits between sources of data and consumers of processed data such as BI systems, discovery systems, graph DBs, analytic applications, etc. The source systems can include applications (CRM, HCM, SCM, etc.), application logs, and so on. Evaluate the optimal location for data preparation based on the locality of the upstream and downstream applications in the value chain.
  • Security – Cloud services provide good support for security: encryption, access control, and data isolation. However, on-premises businesses that are strongly focused on security and strict data governance will need to run through their security checklists to decide whether moving data to the cloud for data preparation and other downstream services is the right approach.
  • Business reasons – Strategic decisions may demand moving to the cloud or staying on-premises. Global businesses may find moving to the cloud a strategic investment in the long run, while small and medium businesses may find the cloud an economical alternative and a faster-to-market strategy.

Where would you do your data preparation, and why?

Posted in Data Management and Analytics

Data Preparation – Make or Buy Decision

A make-or-buy decision is always on the minds of executives in any product or services area, especially when working on cutting-edge products. The key question I always ask is: what is the core competency of the organization? Refer to the groundbreaking paper 'The Core Competence of the Corporation' by Prof. C.K. Prahalad and Prof. Gary Hamel.

If the focus of the organization is understanding its business better, making predictions about customer behavior, improving customer engagement, optimizing its supply chain, integrating its financial assets for better governance, or a plethora of other use cases, then buying a service or product that does data preparation seems like the right choice. While there may be a slight learning curve in using such products, the returns can be very valuable. The following are some areas where buying can have a more positive impact than making:

  • Human resources – Hiring people with the right skills can be a challenge, and having an army of resources can actually be counterproductive.
  • Communicating requirements – Data preparation for Big Data is best done when the data domain expert is in a dialog with the data. Since most of the time you are in discovery mode, having someone else do it can be difficult and in many cases does not make sense. It is like Mr. Monk, the detective, having someone else look for clues!
  • Time to decision – With everything moving at the speed of light, spending time in areas that are not your core competency can be counterproductive.
  • Hardware and software choices – These can be difficult. POCs can be never-ending, and without proper objectives for evaluating functionality or performance they can end up as research projects with a pile of documents leading to no decisions.

The strategy to make depends on factors such as the following:

  • The product or service does not meet the Pareto rule (80-20) in functionality or performance.
  • The organization is well versed in data preparation, and the process is so customized that migrating to another product does not meet the ROI.

One important question that typically comes up, whether you buy or make, is whether to go cloud or on-premises – I will discuss this in an upcoming blog.

Also refer to my other related blogs on this subject:

Data Preparation for Big Data

Data Preparation sub-systems for Big Data

BI in the era of Big Data

Which do you like to do – build or buy? And why? Feedback and discussion appreciated!


Posted in Data Management and Analytics

Data preparation sub-systems for Big Data

In this blog I will discuss some of the key tasks related to data preparation for Big Data. This is a general list of sub-systems, so not all of the tasks listed will apply in all circumstances. Each task below deserves a separate blog by itself.

Exploration – Big Data comes in all shapes; getting to know the content of the data is the first step. For example, an application or system log file can have multiple columns of information, with each column holding scalar values, free-form text, or complex text that may or may not conform to a regular expression. The ability to visualize the data and come to a conclusion about its shape is the focus of this step. Specifying column names and column data types (wherever appropriate) can also be performed here.

Profiling – Now that we have an idea of the shape, in this step we capture the measurements. For example, if we are working with a system log file that is a TSV file with 10 columns, then providing a summary of each column will be helpful. The summary can include the distribution or clustering of values. The ability to visualize the content and decide which values can be combined, altered, or deleted belongs to this step.
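As an illustration of the kind of measurements involved, here is a minimal, self-contained profiling sketch for a single column (class and field names are mine; a real profiler would add histograms, type inference, min/max, and so on):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A minimal column-profiling sketch: given one column of a delimited file,
// report the first measurements a data domain expert would look at.
public class ColumnProfile {
    final long rows;      // total number of values, including empty ones
    final long nulls;     // null or empty values
    final long distinct;  // distinct non-empty values

    ColumnProfile(List<String> column) {
        rows = column.size();
        nulls = column.stream().filter(v -> v == null || v.isEmpty()).count();
        Set<String> seen = new HashSet<>(column);
        seen.remove(null);
        seen.remove("");
        distinct = seen.size();
    }

    public static void main(String[] args) {
        // A hypothetical "status" column from a 7-row log extract.
        List<String> status = Arrays.asList("OK", "OK", "ERROR", "", "OK", null, "WARN");
        ColumnProfile p = new ColumnProfile(status);
        System.out.println("rows=" + p.rows + " nulls=" + p.nulls + " distinct=" + p.distinct);
        // rows=7 nulls=2 distinct=3
    }
}
```

Even this tiny summary immediately tells the analyst whether a column is usable as a dimension (few distinct values) or needs cleansing first (many nulls).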

Transform – At this stage we have a good idea of the shape and content and would like to transform the content into a more granular form, combine it into richer data, or standardize it. For example, we may have a column with multiple comma-separated values that we want to split. Alternatively, we may have a column with dates and times in a non-standard form that we would like to convert to a canonical format (ISO 8601).
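The two transformations mentioned – splitting a multi-valued column and normalizing dates to ISO 8601 – can be sketched in a few lines of plain Java (illustrative only; the input formats and method names are assumptions):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Two small transform steps of the kind described above.
public class TransformSteps {

    // Split a multi-valued cell: "red, green , blue" -> [red, green, blue]
    static List<String> splitColumn(String cell) {
        return Arrays.stream(cell.split(","))
                     .map(String::trim)
                     .collect(Collectors.toList());
    }

    // Normalize an assumed US-style date to ISO 8601:
    // "03/15/2014" -> "2014-03-15"
    static String toIso8601(String usDate) {
        DateTimeFormatter in = DateTimeFormatter.ofPattern("MM/dd/yyyy", Locale.US);
        return LocalDate.parse(usDate, in).format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(splitColumn("red, green , blue")); // [red, green, blue]
        System.out.println(toIso8601("03/15/2014"));          // 2014-03-15
    }
}
```

In a real pipeline these functions would run per record inside the batch or speed layer; the point is that each transform is a small, testable unit that can be applied consistently in both.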

Extraction – When the data in a column is unstructured but holds useful information that could serve as dimensions or facts, extracting that information is useful. There are many ways to extract information, each with pros and cons, including regular expressions and NLP (e.g., named entity recognition).
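A small sketch of the regular-expression approach: pulling an IP address and an HTTP status code out of a hypothetical access-log line (the log format, pattern, and names below are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regular-expression extraction: turn one unstructured log line into
// structured fields that can serve as dimensions or facts downstream.
public class LogExtractor {
    // Assumed line shape: "<ip> <anything> <3-digit status> <byte count>"
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{1,3}(?:\\.\\d{1,3}){3}) .* (\\d{3}) \\d+$");

    // Returns {ip, statusCode}, or null when the line does not match.
    static String[] extract(String logLine) {
        Matcher m = LINE.matcher(logLine);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] fields = extract("10.0.0.7 GET /index.html 200 5120");
        System.out.println(fields[0] + " -> " + fields[1]); // 10.0.0.7 -> 200
    }
}
```

Regular expressions are fast and precise when the format is stable; NLP techniques earn their keep when the text is genuinely free-form and no single pattern will hold.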

Enrichment – Richer analysis can be achieved with data that is enriched. For example, a column holding a zip code that is enriched to include city, county, latitude, and longitude enables better slicing and dicing of the data, better visualization, and richer parameters for data mining algorithms (e.g., classification).
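A minimal sketch of zip-code enrichment, using a hard-coded reference table in place of what would in practice be a geocoding service or a licensed reference data set (all names and records here are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

// Enrichment sketch: widen a CSV record by joining its trailing zip code
// against a reference table. The table is hard-coded for illustration.
public class ZipEnricher {
    private static final Map<String, String[]> ZIP_REFERENCE = new HashMap<>();
    static {
        // zip -> { city, county }
        ZIP_REFERENCE.put("94065", new String[] { "Redwood City", "San Mateo" });
        ZIP_REFERENCE.put("10001", new String[] { "New York", "New York" });
    }

    // "order-42,94065" -> "order-42,94065,Redwood City,San Mateo"
    // Unknown zips get empty city/county columns so the schema stays fixed.
    static String enrich(String csvRecord) {
        String zip = csvRecord.substring(csvRecord.lastIndexOf(',') + 1);
        String[] geo = ZIP_REFERENCE.get(zip);
        return geo == null ? csvRecord + ",," : csvRecord + "," + geo[0] + "," + geo[1];
    }

    public static void main(String[] args) {
        System.out.println(enrich("order-42,94065"));
        // order-42,94065,Redwood City,San Mateo
    }
}
```

The enriched columns become new dimensions for slicing and dicing, and latitude/longitude added the same way feed directly into visualization and classification.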

Data Quality – Data quality has long been an important aspect of data preparation for structured data, especially in MDM (customer data hubs, product data hubs), where features such as name and address verification and product name disambiguation were important. The same can be applied when preparing Big Data.

Publish – The publish aspect of data preparation for Big Data varies with the downstream analysis desired. The following are some of the options:

  • Publishing in a format suitable for data mining algorithms to consume.
  • Publishing as a property graph suitable for executing graph algorithms – GraphX, GraphLab.
  • Publishing to a graph DB such as Neo4j and using Cypher for graph analytics.
  • Publishing as RDF – useful when integrating multiple data sources and querying with SPARQL.
  • Publishing to BI systems as dimensions and facts, which can then be used by products such as OBIEE.
  • Publishing to discovery systems such as Endeca.
  • Publishing to search systems such as Solr/Lucene.

I will go into depth on the above topics in upcoming blogs.


Posted in Data Management and Analytics