The graph APIs (Blueprints, Jena, SAIL) discussed in my post Manipulating Graph are good for creating and updating graph databases (property graphs and RDF stores). One level above the graph APIs, a domain-specific language (DSL) such as Gremlin (or Cypher for Neo4j) can be used to build graph analytics applications. Several graph algorithms (e.g. ranking and centrality algorithms) can be implemented in Gremlin, which is built on the TinkerPop 2.x Blueprints API.
A DSL such as Gremlin operates directly on the graph storage layer, which can limit both functionality and performance, especially when working with large graphs. For such workloads we need a specialized graph processing system.
In this post I will discuss three graph compute systems that I came across while evaluating graph computation engines for my project.
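To make "centrality algorithm" concrete, here is a minimal sketch of degree centrality in plain Python. It is not Gremlin and does not use any graph database API; the edge list and the function name are made up for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {v: 0.0 for v in neighbors}
    # A vertex connected to every other vertex scores 1.0.
    return {v: len(adj) / (n - 1) for v, adj in neighbors.items()}


# Hypothetical toy graph: "a" is connected to every other vertex.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
centrality = degree_centrality(edges)
# Vertices ordered from most to least central.
ranked = sorted(centrality, key=centrality.get, reverse=True)
```

On a graph database the same idea would be expressed as a traversal over the stored vertices and edges rather than over an in-memory list.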
Green-Marl is a domain-specific language (DSL) for graph data analysis that originated at Stanford. Green-Marl lets users describe their algorithms in an intuitive way while the compiler delivers the performance. Specifically, the compiler translates the given DSL program into an equivalent, parallelized, high-performing program written in a general-purpose language.
Currently, the compiler can produce parallel C++ code targeting multi-core/multi-socket shared-memory environments. It can also generate Java code for a Map-Reduce-like framework, targeting distributed execution. Green-Marl claims that its DSL is intuitive and concise and that it improves productivity.
- Write code in GM DSL
- Invoke the C++ or Java code generated by GM
- GM provides loaders for loading persisted graphs into memory for processing.
GraphLab is a graph-parallel system that enables advanced analytics and machine learning on graphs. Graph-parallel systems (GraphLab, Pregel) address the drawbacks of data-parallel systems (e.g. Hadoop) when performing computations on a graph. They are specialized graph systems with APIs that capture complex graph dependencies and exploit graph structure to reduce communication and facilitate parallel computation. As a result, a graph-parallel system reduces both the resources and the time required to perform graph analytics. The following table compares GraphLab with Hadoop for triangle counting on the Twitter graph (40 million users and 1.4 billion links; figures obtained from a GraphLab presentation).
Looking at the GraphLab API, it supports loading a graph from a text file or from a previously saved binary graph file. The text file is typically generated by an ETL process that writes its output in a format suitable for GraphLab to use.
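As a rough illustration of what consuming such an ETL output looks like, the sketch below parses a whitespace-separated edge list in plain Python. The file layout (one "source target" pair per line, "#" for comments) and the function name are assumptions made for this example, not GraphLab's actual format or loader API.

```python
import io


def load_edge_list(lines):
    """Parse 'source target' pairs, skipping blank lines and '#' comments."""
    edges = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()  # accepts spaces or tabs as separators
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(parts)}")
        edges.append((int(parts[0]), int(parts[1])))
    return edges


# Hypothetical ETL output, held in memory here instead of a real file.
sample = io.StringIO("# source target\n1 2\n2\t3\n\n1 3\n")
edges = load_edge_list(sample)
```

The same function works on a real file handle, since it only needs an iterable of lines.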
GraphLab shrinks the resources and time needed for graph computation by orders of magnitude compared to graph algorithms written on Hadoop. However, it does not address the big picture of the data processing pipeline, which also includes graph creation and post-processing.
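For readers unfamiliar with the triangle counting benchmark mentioned above, here is a minimal single-machine sketch in plain Python. It is not GraphLab code and says nothing about how GraphLab parallelizes the work; it only shows why the computation depends on each vertex's neighborhood. The toy edge list is made up.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected graph given as an edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    total = 0
    # Snapshot the items so the dict is not modified while iterating.
    for u, adj_u in list(neighbors.items()):
        for v in adj_u:
            if v <= u:
                continue
            # A common neighbor w with u < v < w closes a triangle;
            # the ordering ensures each triangle is counted once.
            total += sum(1 for w in adj_u & neighbors[v] if w > v)
    return total


# Hypothetical toy graph with two triangles: (1, 2, 3) and (3, 4, 5).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
triangles = count_triangles(edges)
```

Every step needs the neighbor sets of two adjacent vertices at once, which is the kind of graph dependency that graph-parallel systems are designed to handle.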
GraphX is the Spark API that combines data-parallel and graph-parallel computation. It addresses the big picture of data processing, which includes graph creation, computation and post-processing.
The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API. The GraphX API enables users to view data both as a graph and as collections (i.e., RDDs) without data movement or duplication. By incorporating recent advances in graph-parallel systems, GraphX is able to optimize the execution of graph operations.
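The idea of treating the same data as both a collection and a graph can be sketched in a few lines of plain Python. This is only a conceptual illustration with made-up data; it does not use Spark, RDDs or the GraphX API.

```python
from collections import Counter

# Hypothetical data: vertex properties and labeled edges kept as plain collections.
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [(1, 2, "follows"), (2, 3, "follows"), (3, 1, "likes"), (1, 3, "follows")]

# Collection view: filter the edges like any other dataset.
follows = [(src, dst) for src, dst, label in edges if label == "follows"]

# Graph view: aggregate over the same edges per destination vertex (in-degree).
in_degree = Counter(dst for _, dst in follows)

# Back to a collection: join the graph result with the vertex properties.
report = sorted((vertices[v], in_degree.get(v, 0)) for v in vertices)
```

The filtering, the graph aggregation and the final join all run over the same in-memory data, which mirrors the pipeline of graph creation, computation and post-processing in one place.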
Share your favorite graph compute system and use case!