Platform Flexibility – Key for realizing IoT solutions

One of the realities of realizing an IoT solution (Enterprise, Commercial, Industrial) is that the requirements are so drastically different. Among the several dimensions used to evaluate a platform for realizing an IoT solution, a key one is the flexibility of the platform. In the context of an IoT solution, flexibility can be viewed as the ability to integrate with other technologies and the ability to seamlessly fit into the solution environment. Let us examine each of these requirements.

Integration with other technologies

A solution typically involves integrating various technologies to realize outcomes. Given the scope of IoT solutions, integrations are an essential part of the solution. At every stage in an IoT solution stack, technologies need to integrate on both the inbound and outbound sides.

Fit in the solution environment

Every vertical-specific IoT solution has its own set of specific requirements. The solution architecture varies: the ingestion sources, the consumer targets, the deployment location, and the end-to-end weaving of the solution are all different. Depending upon the vertical solution being realized, technologies need to fit into the solution environment.

HSDP’s Flexibility

One of the key design decisions for the HSDP platform was to keep the platform as flexible as possible. Given our involvement in building a variety of vertical-specific solutions with earlier versions of HSDP, we realized that a key feature for the platform is to provide flexibility.

The following are some of the vertical solutions that HSDP was used to realize:

  • Finance – Computing in real time the index for the Tokyo stock exchange
  • Utilities – Monitoring and managing HVAC systems
  • IT Ops – Real-time analysis of log files
  • Banking – Analyzing transactions
  • Performance tuning of race cars
  • Traffic management in large cities
  • Crowd management in public places

The sources and targets for each of the above applications are vastly different. The data source, volume, and velocity of ingesting stock market data, for example, are very different from ingesting from a sensor in a transportation vehicle. The location of gathering the data, analyzing the data, and taking action on the insight is also very different. The consumers of the insights are different. The actions they will take are different. The latency requirements are very different. The list goes on.

The HSDP 3.0 platform includes the following features that enable it to be flexible enough to meet the demands of a vertical IoT solution:

  • The HSDP SDK enables developing vertical-specific ingest and publish adapters. The ingest adapter SDK allows ingesting data from a variety of sources, performing transformations, and publishing the data on a stream. The publish adapter SDK allows subscribing to insights from a stream and transforming insights into actions.
  • HSDP provides deployment options that enable solutions to run in environments where resources are limited.
  • HSDP provides the capability of deriving insights and taking actions at the edges, and of seamlessly cascading insights from the edges to the core.
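As a purely illustrative sketch of the ingest-adapter pattern described above (all names and interfaces here are hypothetical – this is not the actual HSDP SDK API):

```python
# Hypothetical ingest-adapter pattern: read from a source, transform,
# publish onto a stream. None of these names come from the HSDP SDK.

class IngestAdapter:
    def __init__(self, source, transform, stream):
        self.source = source        # any iterable of raw records
        self.transform = transform  # callable: raw record -> event, or None
        self.stream = stream        # anything with an append method

    def run(self):
        for raw in self.source:
            event = self.transform(raw)
            if event is not None:   # transforms may filter records out
                self.stream.append(event)

# Toy usage: ingest CSV-like sensor lines, keep readings above a threshold.
readings = ["sensor-1,72.5", "sensor-2,19.0", "sensor-3,88.1"]

def parse_if_hot(line, threshold=70.0):
    sensor, value = line.split(",")
    value = float(value)
    return {"sensor": sensor, "value": value} if value > threshold else None

stream = []
IngestAdapter(readings, parse_if_hot, stream).run()
# stream now holds the two readings above 70.0
```

A publish adapter would be the mirror image: subscribe to a stream and map insights to actions.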

The following diagram illustrates, at a high level, HSDP's integration capability and fit in an IoT solution environment.





Posted in IoT

Introducing HSDP – A platform for realizing streaming analytics for an IoT solution

For the last year or so I have been involved in defining the platform vision, features, and go-to-market strategy for HSDP – Hitachi's Streaming Data Analytics Platform. Streaming analytics is not something new; however, the range of applications and use cases where it can be applied has increased in scope. As noted by Forrester – the market for streaming analytics platforms is growing far beyond its roots in industrial operations and financial services.

Since 2009, HSDP has been used to realize streaming solutions for a variety of verticals, which included Finance, Energy, Utilities, Banking, IT Ops, Transportation, and Smart Cities. Many of the solutions that were built were IoT-type solutions.

Based on our experience with building customized streaming solutions for our customers, the following are some of the requirements that we repeatedly observed:

  • Ability to ingest from a variety of data sources and publish insights to a variety of targets
  • Ability to perform multi-stage, geo-distributed computations, and the ability to provide and cascade insights from the edges to the center
  • Ability to process (for example, filtering, enriching), perform descriptive analysis (for example, aggregations, statistical computations), and perform predictive analysis (for example, classification, clustering) at the edges, at the center, and anywhere in between
  • Ability to securely handle large volumes and velocities of data with minimal or no disruptions
  • Ability to be deployed on hardware with limited resources, and on commodity hardware in a clustered environment deployed on-premises or in the cloud
  • Flexibility for integrating with a variety of technologies, including OSS, visualization systems, decision systems, control systems, data stores, etc.
  • Support for solution development, deployment, management, and monitoring
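The multi-stage processing called out above (filtering, enrichment, descriptive aggregation) can be sketched as chained generator stages; this is an illustrative pattern in Python, not HSDP code:

```python
# Illustrative streaming stages: filter -> enrich -> aggregate.
# Generators model unbounded streams; each stage consumes the previous one.

def source(events):
    yield from events

def filter_stage(stream, predicate):
    return (e for e in stream if predicate(e))

def enrich_stage(stream, site):
    # tag each event with the location it was processed at
    return ({**e, "site": site} for e in stream)

def aggregate_stage(stream):
    total = count = 0
    for e in stream:
        total += e["value"]
        count += 1
    return {"count": count, "mean": total / count}

events = [{"value": 10}, {"value": 3}, {"value": 20}, {"value": 7}]
pipeline = enrich_stage(filter_stage(source(events), lambda e: e["value"] >= 5),
                        site="edge-01")
summary = aggregate_stage(pipeline)
# summary == {"count": 3, "mean": 37/3}
```

In an edge-to-center deployment, the edge would run the filter/enrich stages and forward only summaries to the center.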

Recognizing the current market trends, and based on our experience in building custom streaming solutions, the following were some of the key goals that I set for the streaming platform:

  • Transform the platform into an enterprise-class platform that addresses the requirements we observed
  • Simplify the user experience when using the platform for building solutions
  • Keep the platform flexible so it can be blended with the choices the solution provider or integrator makes when building and deploying a specific IoT solution

With the above goals in mind, I collaborated with the engineering architects at Hitachi and defined the features and the user experience for the platform.

We are now on track to release the product on May 3rd, 2016. An introduction to the platform and its capabilities can be found here.


Posted in IoT

Data Preparation – Normalization Subsystem – Clustering using Tokens

Continuing on the subject of clustering text to facilitate normalizing a data set, in this blog post I will examine clustering using tokens. Token-based clustering uses tokens to evaluate the similarity between two strings and determine membership in a cluster.

The following sections discuss the various types of token-based approaches. In each approach, specifying the tokenization function is a key step. Tokenizing a string can be a straightforward or an involved task – this subject requires a blog post in itself.

Depending upon the use case, prior to evaluating similarity, the strings can be scrubbed to remove stop words. Also, as required, stemming and lemmatization (e.g. men's -> men, are -> be) can be used. Depending on the use case, performing the above steps can improve finding quality clusters.

Jaccard Coefficient

The Jaccard coefficient is one of the basic approaches to determining similarity between strings. It is computed as the ratio of the tokens shared between two strings (the intersection) to the total set of distinct tokens across both strings (the union).

E.g. #1

Arkansas, Mississippi, Tennessee     (These 3 states are connected by the Mississippi river)

Arkansas, Missouri, Tennessee  (also connected by the Mississippi river)

Jaccard coefficient = 2/4  (shared: Arkansas, Tennessee; union: Arkansas, Mississippi, Missouri, Tennessee)

E.g. #2

Albus Dumbledore

Prof. Dumbledore,  Albus

Jaccard coefficient = 2/3  (shared: Albus, Dumbledore; union: Albus, Dumbledore, Prof.)

The Jaccard coefficient is good at finding similarity based on the existence of tokens rather than the position of the tokens in the string – useful for finding similarities in strings where word swaps are common (e.g. names, places, things).

However, the Jaccard coefficient is sensitive to the presence of an additional word – in the above example, the presence of the title "Prof." reduced the similarity. The Jaccard coefficient is also not suitable when there are spelling mistakes. In E.g. #1, if Mississippi and Tennessee were wrongly spelled in one of the rows, then the Jaccard coefficient would drop to 1/5.
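A minimal sketch of the Jaccard computation on token sets (the regex tokenizer is an assumption; as noted above, stop-word removal or stemming may be applied first):

```python
import re

def tokens(s):
    # lowercase and split on non-alphanumeric characters
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(s1, s2):
    t1, t2 = tokens(s1), tokens(s2)
    return len(t1 & t2) / len(t1 | t2)

jaccard("Arkansas, Mississippi, Tennessee", "Arkansas, Missouri, Tennessee")  # 2/4
jaccard("Albus Dumbledore", "Prof. Dumbledore, Albus")                        # 2/3
```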

Cosine Similarity using TF-IDF (Term Frequency–Inverse Document Frequency)

Cosine similarity, which is widely used in information retrieval use cases, measures the angle between two n-dimensional vectors and uses this information to determine the similarity between strings.

The vectors could be for each of the two (or n) strings (in some cases one of the strings is a query string) or two (or n) documents. If the angle between the vectors is close to zero then the vectors may be considered similar (cos(0) = 1); conversely, if the angle is well above zero or close to 90 degrees then the strings or documents may be considered dissimilar (cos(90) = 0).

Now, how do we define the vectors? After we have obtained the tokens, we compute the TF-IDF of the terms we have extracted from a document or string. TF measures the frequency of occurrence of a term; intuitively, it captures the importance of a term in the document.

IDF, on the other hand, is based on the frequency of a term across the corpus of documents or strings and assigns higher weights to terms which occur less frequently, and vice versa.

The combination of TF-IDF  normalizes a term and assigns higher weight to discriminating terms in the document corpus.

The tf-idf score is computed using the following formula (logarithms here are base 10):

tf-idf score = log(tf + 1) × log(N/df)

where tf is the term frequency – the number of times term t occurs in document d,

df is the document frequency – the number of documents that contain the term, and

N is the total number of documents.

E.g. Let's say a column of a data set contained the following names of hospitals:

  • Alameda Hospital
  • Alta Bates Summit Medical Center
  • Children’s Hospital
  • Eden Medical Center
  • Fairmont Hospital
  • Washington Hospital
  • John George Psychiatric Pavilion
  • Highland Hospital
  • Kaiser
  • Valley Fair Medical Center

Let's say we want to find the similarity between Fairmont Hospital and Washington Hospital:

tf-idf (hospital) =  log (1  + 1) * log (10/5)  = 0.0906

tf-idf (fairmont) = log (1 +1) * log (10/1)  = 0.3010

tf-idf(washington) = log (1 + 1)  * log (10/1) = 0.3010

Cosine similarity = (0.0906 × 0.0906) / (√(0.0906² + 0.3010²) × √(0.0906² + 0.3010²)) = 0.0082 / 0.0988 ≈ 0.083

The two names come out only weakly similar: the one shared term, "hospital", carries a low weight because it appears in half the corpus.
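A sketch that reproduces the numbers above (base-10 logs; the whitespace tokenizer and apostrophe stripping are assumptions):

```python
import math
from collections import Counter

corpus = [
    "Alameda Hospital", "Alta Bates Summit Medical Center",
    "Children's Hospital", "Eden Medical Center", "Fairmont Hospital",
    "Washington Hospital", "John George Psychiatric Pavilion",
    "Highland Hospital", "Kaiser", "Valley Fair Medical Center",
]

def tokenize(s):
    return s.lower().replace("'", "").split()

N = len(corpus)                      # 10 documents
df = Counter()                       # document frequency per term
for doc in corpus:
    df.update(set(tokenize(doc)))    # e.g. df["hospital"] == 5

def tfidf_vector(doc):
    tf = Counter(tokenize(doc))
    return {t: math.log10(1 + f) * math.log10(N / df[t]) for t, f in tf.items()}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(v1) * norm(v2))

sim = cosine(tfidf_vector("Fairmont Hospital"), tfidf_vector("Washington Hospital"))
# sim ≈ 0.083: low, because the only shared term "hospital" is weakly weighted
```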


Posted in Data Management and Analytics

Data preparation – Normalization subsystem – Clustering Text using Distance Methods

Continuing from my previous blog (Data preparation – Normalization subsystem – Clustering Text using Fingerprinting), in this blog I will examine the distance approach, a.k.a. nearest neighbor, to clustering text strings.

The distance approach to clustering provides better flexibility in finding clusters compared to the fingerprinting method: acceptable clusters can be found by fine-tuning the distance parameter. However, the distance approach has a drawback – performance – it requires n(n-1)/2 comparisons. Techniques such as blocking can be used to reduce the number of comparisons: with blocking enabled, the number of comparisons becomes n·m(m-1)/2, where n is the number of blocks and m is the average size of a block. In this approach, interesting clusters can be found by varying the distance and block counts.
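To make the savings concrete, here is the comparison count with and without blocking for a hypothetical data set of 10,000 records split into 100 blocks:

```python
records = 10_000
full = records * (records - 1) // 2      # 49,995,000 pairwise comparisons

blocks, m = 100, 100                     # 100 blocks of ~100 records each
blocked = blocks * m * (m - 1) // 2      # 495,000 comparisons, ~100x fewer
```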

The following sections discuss the various types of distance approaches with examples.

Levenshtein Distance

Levenshtein distance measures the number of edits (insertions, deletions, and substitutions) required to change one string into another. This is useful for clustering data which has spelling mistakes. For example, setting a distance of 3 will cluster names such as Mississippi and Misisipi.
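A standard dynamic-programming sketch of Levenshtein distance confirms the example:

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

levenshtein("Mississippi", "Misisipi")  # 3: delete two s's and one p
```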

Kolmogorov Distance

In this approach the similarity between two strings STR1 and STR2 is evaluated by compressing STR1 and then compressing STR1 plus STR2. If the size difference between these compressed outputs is minimal or zero, then STR1 and STR2 can be considered similar, since STR2 added little new information.

In order for the Kolmogorov approach to work optimally, my understanding is that Prediction by Partial Matching (PPM) is the most effective compression algorithm. PPM is a form of higher-order arithmetic coding – arithmetic coding replaces an input symbol with a specific code. The following link provides a great explanation of arithmetic coding.


This approach is useful for finding sequences in relatively larger strings (e.g. DNA sequences). A simple example: using this approach with a distance set to 2 will cluster the strings Jon Foo, KR Jon Foo, Jon Foo KR, and Jon K R Foo.
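A rough sketch of the idea using zlib (the post suggests PPM compresses best; zlib is used here only because it ships with Python's standard library):

```python
import zlib

def compression_distance(s1, s2):
    # Extra bytes needed to compress s1+s2 beyond compressing s1 alone.
    # If s2 adds little new information, the difference stays small.
    c1 = len(zlib.compress(s1.encode()))
    c12 = len(zlib.compress((s1 + s2).encode()))
    return c12 - c1

similar = compression_distance("Jon Foo", "KR Jon Foo")
dissimilar = compression_distance("Jon Foo", "Wolfgang Amadeus Mozart")
# similar < dissimilar: "KR Jon Foo" mostly repeats "Jon Foo"
```

The exact byte counts depend on the compressor, so in practice a threshold (the "distance" above) is tuned per data set.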

In my next blog post I will discuss token based clustering approaches.

It would be great if you can share what type of clustering technique has been useful for your data set.

Posted in Topics related to Organizational Behavior

Data preparation – Normalization subsystem – Clustering Text using Fingerprinting

In this blog I will examine the normalization subsystem, one of the subsystems I called out in my earlier blog, Data Preparation Sub-Systems. A key objective of this step is to ensure data consistency. For example, when working with tabular data, it is important to ensure columns have consistent values. A key goal of being consistent is to ensure similar data values are grouped together, thus ensuring no duplicates are present in the data. This is an important requirement in various verticals – e.g. customer data, product inventory data, etc.

Faceting is a common technique used to summarize a column of data. Faceting breaks up a column into multiple groups, typically showing counts for each group. Users can then combine groups or apply changes that impact a particular group.

In this blog I will discuss the use of clustering to identify groups in simple text data. I will be focusing on methods which facilitate identifying similar data based on the syntactic representation of the data. This approach is useful for grouping data which has inconsistencies such as differences in spelling, spaces, and data representation (San Jose and san jose, or Alfred Hitchcock and Hitchcock, Alfred). In my upcoming blogs I will discuss semantic matching, which will extend matches to semantic similarity – San Francisco and SFO, or Malibu (beach or car model).

Broadly speaking, clustering text refers to finding groups of values that represent the same thing. Depending upon the application, the data can be clustered based on strict or lenient matches.

There are several techniques that can be used to cluster text data. Broadly speaking, there are two methodologies for clustering text data – the Key Collision method and the Nearest Neighbor method.

In this blog I will discuss the various types of Key Collision methods. These methods typically have linear performance characteristics. The following sections discuss the various approaches and their use case scenarios.

Fingerprinting

The fingerprinting method is fast and simple and has good applicability in many scenarios where strict contextual information can be ignored, e.g. names, places, organizations, and things. In this method, the most varying parts of the text string are eliminated – punctuation, control characters, and white space. Further, the text string is normalized – converted to a lowercase representation, the tokens in the string are sorted and deduplicated, and the text is converted to an ASCII representation. Doing the above in the proper order will result in clustering text strings such as San Jose and san jose, or Alfred Hitchcock and Hitchcock, Alfred.

The above method is very useful when collating data from multiple non-standardized data sets.
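A minimal sketch of the fingerprint key described above (the exact normalization order here is an assumption; OpenRefine's fingerprint clustering uses a similar key):

```python
import re
import unicodedata

def fingerprint(s):
    # fold to ASCII, lowercase, strip punctuation/control characters
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    tokens = re.findall(r"[a-z0-9]+", s.lower())
    # dedupe and sort so token order does not matter
    return " ".join(sorted(set(tokens)))

fingerprint("San Jose") == fingerprint("san  jose")                # True
fingerprint("Alfred Hitchcock") == fingerprint("Hitchcock, Alfred")  # True
```

Strings that share a key fall into the same cluster, so clustering reduces to a hash-table lookup.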

Character n-grams

The character n-grams approach is similar to fingerprinting, with the key difference being that the tokens from a text string are created using n-grams, where n = 1, 2, 3, …. This approach is useful for finding clusters of text strings which have small differences. For example, using 1-grams, the strings “Massachusetts” and “Masachusets” can be clustered together. Compared to the fingerprinting method, this approach can generate false positives.
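The same keying idea with character n-grams as tokens (a sketch; n=1 reproduces the Massachusetts example):

```python
def ngram_fingerprint(s, n=1):
    # lowercase and keep only alphanumeric characters
    s = "".join(ch for ch in s.lower() if ch.isalnum())
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}
    return "".join(sorted(grams))

# Both misspellings share the same set of 1-grams, so they get the same key.
ngram_fingerprint("Massachusetts") == ngram_fingerprint("Masachusets")  # True
```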

Phonetic Fingerprinting

In this approach the tokens are generated based on pronunciation. This approach is useful for clustering similar-sounding words, for example “Sudan” and “Sweden”. Both the fingerprinting and n-gram approaches will not work in this scenario.

The Metaphone method is commonly used for indexing words by their English pronunciation.
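Metaphone itself is fairly involved; as a simpler illustration of phonetic keying, here is a sketch of Soundex (an older phonetic key, with slightly simplified h/w handling), which assigns the same code to “Sudan” and “Sweden”:

```python
def soundex(word):
    # map consonants to digit classes; h/w are ignored, vowels reset runs
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue                       # simplified: skip h/w entirely
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code if ch not in "aeiouy" else ""
    return (result + "000")[:4]            # pad/truncate to letter + 3 digits

soundex("Sudan")   # "S350"
soundex("Sweden")  # "S350" – same key, so the two cluster together
```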

In my upcoming blog I will discuss Distance approaches to clustering.

Posted in Data Management and Analytics

Ensuring data consistency between cloud and on-premises

Enterprises today have greater flexibility in determining whether investing in applications, platforms, and infrastructure should be a capital expenditure, an operational expenditure, or both. As such, enterprises are increasingly using a mix of public cloud, private cloud, and on-premises strategies to gain sustainable competitive advantage. However, enterprises have to ensure that business processes run effectively and reliably irrespective of whether the applications and their associated data are on on-premises instances or in the cloud.

For example, an enterprise which has adopted a CRM strategy could be relying on an on-premises marketing application for developing and nurturing leads, and could be using a SaaS-based sales application to create opportunities and quotes. The sales and marketing teams which use these systems need to be able to access and share the data in a reliable and cohesive way. This example can be extended to other application areas, such as HR, supply chain, and finance, and to the demands users place on getting a consistent view of the data.

Another example: an enterprise may have established an on-premises Business Intelligence and reporting platform which is used by employees in various roles to retrieve their respective reports and perform analysis. Typically, a Business Intelligence platform requires data from various sources to be aggregated so it can provide rich capabilities to slice and dice the data. Enterprises which have a mix of public cloud, private cloud, and on-premises will need to ensure that the relevant data from these sources is made available to the Business Intelligence system.

Most enterprises have spent years avoiding the data “silos” that inhibit productivity. IT has had its fill of new integration paradigms, from CORBA to client/server to web services, EAI, SOA, and replicating databases. After decades of locking down critical issues such as interface definitions, governance, reliability, transaction management, exception handling, and transaction monitoring, it is imperative to extend these solutions into environments which have a mix of cloud and on-premises applications and their associated data.

Refer to my blog and an associated technical white paper on this subject –

Share your comments on how your organization's data is distributed between the cloud and on-premises, and what solutions you are adopting to keep it consistent in real time.

Posted in Data Management and Analytics

Graph Computation and Analytics

The graph APIs (Blueprints, Jena, SAIL) discussed in my post Manipulating Graph are good for creating and updating graph databases (property graphs and RDF). At a level higher than the graph APIs, a technology such as Gremlin (or Cypher for Neo4j), which is considered a domain-specific language (DSL), can be used for creating graph analytical applications. Several graph algorithms (e.g. ranking and centrality algorithms) can be implemented using Gremlin. Gremlin is built on the TinkerPop 2.x Blueprints API.

However, a DSL such as Gremlin, which operates directly on the graph storage layer, can be limiting in functionality and performance, especially when working with large graphs. For such cases we need to use a specialized graph processing system.

In this post I will discuss three types of graph compute systems which I got exposed to during my evaluation of graph computation engines for my project.

Green-Marl

Green-Marl is a domain-specific language (DSL) for graph data analysis that originated at Stanford. Green-Marl allows users to describe their algorithms in intuitive ways while the performance is delivered by the compiler. Specifically, the compiler translates the given DSL program into an equivalent, parallelized, high-performing program written in a general-purpose language.

Currently, their compiler can produce parallel C++ code targeting multi-core/multi-socket shared-memory environments; it can also generate Java code for a Map-Reduce-like framework, targeting distributed execution. Green-Marl claims its DSL is intuitive and concise and improves productivity.

Green-Marl Process

  • Write code in GM DSL
  • Compile
  • Invoke the C++ or Java code generated by GM
  • GM provides loaders for loading persisted graphs into memory for processing.

GraphLab

GraphLab is a graph-parallel system that enables advanced analytics and machine learning on graphs. Graph-parallel systems (GraphLab, Pregel) address the drawbacks of data-parallel systems (e.g. Hadoop) when performing computations on a graph. They are specialized graph systems with APIs to capture complex graph dependencies and exploit graph structure to reduce communication and facilitate parallel computation. Graph-parallel systems reduce both the resources and time required to perform graph analytics. The following table lists a comparison between GraphLab and Hadoop when doing triangle counting on Twitter (40 million users and 1.4 billion links – info obtained from a GraphLab presentation).



Looking at the GraphLab API, it supports loading a graph from a text file or a previously saved graph binary file. The text file is typically generated by an ETL process which produces content in a format suitable for GraphLab to use.

GraphLab shrinks the amount of resources and time for graph computation by orders of magnitude compared to graph algorithms written on Hadoop. However, it does not address the big picture of the data processing pipeline, which includes graph creation and post-processing.

GraphX

GraphX is the Spark API which combines both data-parallel and graph-parallel computation. GraphX addresses the big picture of data processing, which includes graph creation, computation, and post-processing.

The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API. The GraphX API enables users to view data both as a graph and as collections (i.e., RDDs) without data movement or duplication. By incorporating recent advances in graph-parallel systems, GraphX is able to optimize the execution of graph operations.



Share your favorite graph compute system and use case!

Posted in Topics related to Graph Databases and Compute, Linked Data (RDF)