What is the Most Widely-used Graph Query Language in 2018?

A newly unified property graph query language (GQL) vs. Gremlin...and SPARQL

For some time now, Emil Eifrem and the team at Neo4j have been advocating for a single, unified Property Graph query language to rule them all.

The idea is to push this unified PG query language to an industry standard for Graph DBMS (analogous to SQL for RDBMS). Since reading Eifrem's post last month, I have been looking into which graph query language is the most widely used, and what the new GQL means for that language and for graph databases in general. Here's what I've found.

The most common and perhaps most widely used graph query language is Gremlin. It is the query language of the Apache TinkerPop graph computing framework (which is a community project written in Groovy). Gremlin is supported by nearly all graph databases that support Property Graphs (PG). A close second is Cypher from Neo4j. Cypher, PGQL and G-CORE are being fused into GQL, a single unified Path Property Graph (PPG) query language. For RDF graphs, SPARQL is the de facto standard, defined by the W3C alongside RDF. In recent years GraphQL (open sourced by Facebook) has become a wildly popular API query language that can also be used to query graph data.

To give you a proper overview of each of the graph query languages mentioned above, the rest of this article goes over each of them and presents a brief syntax example.

Let's get started.

What is a Graph Query Language?

A Graph Database Query Language, or a graph query language for short, is a concrete mechanism for creating, manipulating and querying graph data in a graph database.

Graph query languages are the SQL equivalents for Graph DBMS. Most developers will be familiar with some dialect of SQL (such as the ones used by PostgreSQL and MySQL), which makes SQL the de facto database query language.

In a previous article, I discussed the differences between graph databases and relational databases at length. In a nutshell, graph databases tend to have lower latency at scale, tend to represent high-value relationships more accurately, and tend to aggregate queries more efficiently than relational databases.

This post discusses the two most popular and widely supported graph database query languages, Gremlin and SPARQL, and introduces a third contender to the mix: GQL, a combination of Cypher, PGQL and G-CORE.

Wait, what about GraphQL?

Is GraphQL a Graph Query Language?

GraphQL is, strictly speaking, not a graph query language. Even though its most popular use case is querying graph data, GraphQL is an API query language, while Gremlin, SPARQL and now GQL are all query languages for graph databases. Nothing in the GraphQL specification warrants the use of the word graph in its name or requires a graph data structure, and GraphQL can be used to query other data structures besides graphs.

GraphQL calls itself a data query language and runtime, not just a graph query language. More accurately, GraphQL is an alternative to REST for APIs.

Neo4j calls GraphQL a specification for querying a slice of an application graph. GraphQL queries then return data in a tree structure that matches the front-end view, regardless of where the data was fetched from.

From this, it is safe to say that GraphQL is largely agnostic of the underlying data structure in your API, but it returns your data as a tree.

GraphQL can therefore be used as a graph query language but this is not its only use case. In fact, with Neo4j you can either query your graph data using GraphQL directly or have it sitting on top of Cypher.
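
To picture what "graph in, tree out" means, here is a minimal Python sketch. It is not how any GraphQL server is implemented; the `people` data, the `resolve` function and the dictionary-based selection are all made up for illustration, with the selection standing in for a GraphQL query.

```python
# Graph data: people and who they know (note the cycle between ada and bob).
people = {
    "ada": {"name": "Ada", "knows": ["bob"]},
    "bob": {"name": "Bob", "knows": ["ada"]},
}


def resolve(person_id, selection):
    """Build a tree for one person, following only the requested fields.

    `selection` maps a field name to None (plain value) or to a nested
    selection (used here for the `knows` relationship).
    """
    person = people[person_id]
    result = {}
    for field_name, sub_selection in selection.items():
        if sub_selection is None:
            result[field_name] = person[field_name]
        else:
            result[field_name] = [
                resolve(other_id, sub_selection)
                for other_id in person[field_name]
            ]
    return result


# Roughly comparable to the GraphQL selection: { name knows { name } }
tree = resolve("ada", {"name": None, "knows": {"name": None}})
```

Even though the underlying data contains a cycle, the result is a finite tree, because the selection decides how deep the traversal goes.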

What is the Path Property Graph (PPG) Model?

As one of the popular methods of representing graphs, the Property Graph (PG) data model is a directed graph with labels on both nodes and edges, as well as (property, value) pairs associated with both.
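
As a rough illustration of that definition, a property graph can be sketched in a few lines of Python. The class and field names (`Node`, `Edge`, `PropertyGraph`) are my own and do not come from any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    labels: set = field(default_factory=set)        # e.g. {"Person"}
    properties: dict = field(default_factory=dict)  # (property, value) pairs


@dataclass
class Edge:
    source: str   # id of the node the edge starts from
    target: str   # id of the node the edge points to (edges are directed)
    label: str    # e.g. "KNOWS"
    properties: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: dict = field(default_factory=dict)  # node id -> Node
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, edge):
        # Both endpoints must already exist in the graph.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("edge endpoints must be existing nodes")
        self.edges.append(edge)


graph = PropertyGraph()
graph.add_node(Node("p1", {"Person"}, {"name": "Alice"}))
graph.add_node(Node("p2", {"Person"}, {"name": "Bob"}))
graph.add_edge(Edge("p1", "p2", "KNOWS", {"since": 2015}))
```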

PG is widely adopted by graph databases such as AgensGraph, Amazon Neptune, ArangoDB, Blazegraph, CosmosDB, DataStax Enterprise Graph, HANA Graph, JanusGraph, Neo4j, Oracle PGX, OrientDB, Sparksee, Stardog, TigerGraph and Titan, amongst others.

Because nodes, edges and paths are all first-class citizens in G-CORE, paths have identity and can also have labels and (property, value) pairs associated with them. This is the basis for an extended property graph model called the Path Property Graph (PPG) model, which is backward compatible with the Property Graph (PG) model.

As a result, G-CORE as a language aims to balance path query expressiveness and evaluation complexity with the PPG model.
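
To make the difference from the plain PG model concrete, here is a minimal Python sketch in which a path is stored as its own object, with an identity, labels and properties. The `Path` class, the `path_nodes` helper and the sample data are hypothetical and only meant to convey the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    id: str          # paths have their own identity in the PPG model
    edge_ids: list   # ordered ids of the edges that make up the path
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


# Edges of an ordinary property graph: edge id -> (source, label, target).
edges = {
    "e1": ("alice", "knows", "bob"),
    "e2": ("bob", "knows", "carol"),
}

# A stored path over those edges, with its own label and properties.
intro = Path("path1", ["e1", "e2"], {"IntroductionChain"}, {"hops": 2})


def path_nodes(path, edges):
    """Return the nodes visited by the path, in order."""
    nodes = [edges[path.edge_ids[0]][0]]
    for edge_id in path.edge_ids:
        source, _label, target = edges[edge_id]
        # Consecutive edges must connect end to start.
        if source != nodes[-1]:
            raise ValueError("edges do not form a connected path")
        nodes.append(target)
    return nodes
```

Dropping the `Path` objects leaves an ordinary property graph, which is one way to read the backward compatibility mentioned above.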


GQL (Cypher + PGQL + G-CORE)

The new GQL (Graph Query Language) is a fusion of three property graph query languages:

  1. Neo4j's Cypher (and the openCypher derivative from its community);
  2. PGQL from Oracle, and;
  3. G-CORE, a research language proposal from the Linked Data Benchmark Council (LDBC Graph Query Language Task Force).

The idea behind open sourcing Cypher in October 2015 (in the openCypher project) was to make it a vendor-independent language. This idea has not caught on in the graph community for the most part, and vendor lock-in remains a huge issue with nearly all graph databases today.

Cypher Query Language

Cypher is an SQL-inspired, declarative query language for describing graph patterns visually using an ASCII-art syntax. Developers transitioning to graph databases from their relational counterparts often find Cypher intuitive and easy to pick up.

In Cypher syntax, a node (or vertex) representing a person is wrapped in parentheses and assigned a variable, e.g. person in (person:Person). This way the properties of the person can be accessed later with dot notation, e.g. person.name.

The relationships between nodes are represented with --> between two nodes. Additional information is then tucked inside with square brackets -[...]-> such as -[:KNOWS|:LIKES]->.
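
A rough way to picture how such a pattern is evaluated is as a filter over relationships. The Python sketch below mimics a MATCH / WHERE / RETURN query over a small in-memory list. The data, the `since` property and the `match_relationships` function are invented for illustration, and a real Cypher engine works very differently under the hood.

```python
# Each relationship is (start_node, type, properties, end_node).
relationships = [
    ("alice", "KNOWS", {"since": 2012}, "bob"),
    ("alice", "LIKES", {"since": 2017}, "carol"),
    ("bob", "KNOWS", {"since": 2016}, "carol"),
]


def match_relationships(relationships, min_since):
    """Mimic: MATCH (a)-[rel]->(b) WHERE rel.since > min_since
    RETURN rel.since, type(rel)"""
    rows = []
    for _start, rel_type, props, _end in relationships:        # MATCH
        if "since" in props and props["since"] > min_since:    # WHERE
            rows.append((props["since"], rel_type))            # RETURN
    return rows


rows = match_relationships(relationships, 2015)
```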

Here's a sample of Cypher syntax.

              
// rel is bound to each relationship between n1 and n2; $value is a query parameter
MATCH (n1)-[rel]->(n2)
WHERE rel.property > $value
RETURN rel.property, type(rel)
              
            

Property Graph Query Language (PGQL)

Property Graph Query Language (PGQL) describes itself as a language that combines the power of graph pattern matching and SQL. Like Cypher, PGQL is very SQL-like.

The language heavily relies on graph pattern matching which allows patterns to be matched against vertices and edges in a data graph.

SQL constructs such as GROUP BY, ORDER BY and many others are also found in PGQL.

PGQL also has regular path expressions for reachability analysis.

Here's a sample of PGQL syntax.

              
/* Devices and switches are connected by two edges. */
  PATH connects_to AS (:Device|Switch) <- (:Connection) -> (d:Device|Switch)

/* Only consider switches with OPEN status. */
 WHERE d.status IS NULL OR d.status = 'OPEN'
SELECT d1.name AS source, d2.name AS destination
  FROM electric_network

/* We match the connects_to pattern one or more (+) times. */
 MATCH (d1:Device) -/:connects_to+/-> (d2:Device)
 WHERE d1.name = 'DS'
 ORDER BY d2.name
              
            

Like Cypher, PGQL also uses ASCII-art syntax for matching nodes, edges and paths:

  1. (n:Person) matches a node (vertex) n with label Person.
  2. -[e:friend_of]-> matches a connection (edge) e with label friend_of.
  3. -/:friend_of+/-> matches a path consisting of one or more edges, each with label friend_of.
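
The "one or more edges" path pattern is essentially a reachability question. As a sketch of the idea (not of how PGQL engines evaluate it), the following Python function does a breadth-first search over edges with a given label. The edge list and the `reachable` function are made up for this example.

```python
from collections import deque

# Directed, labelled edges: (source, label, target).
edges = [
    ("ann", "friend_of", "ben"),
    ("ben", "friend_of", "cat"),
    ("cat", "works_with", "dan"),
    ("cat", "friend_of", "ann"),
]


def reachable(start, label, edges):
    """Nodes reachable from `start` via one or more `label` edges."""
    found = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for source, edge_label, target in edges:
            if source == node and edge_label == label and target not in found:
                found.add(target)
                queue.append(target)
    return found


friends = reachable("ann", "friend_of", edges)
```

Here `ann` ends up in her own result set because the friend_of edges form a cycle, while `dan` is excluded because the only edge leading to him has a different label.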

PGQL also allows subqueries for comparing data from different graphs.

The following PGQL example finds people who are on Facebook but not on Twitter.

              
SELECT p1.name
  FROM facebook_graph
 MATCH (p1:Person)                           /* Match persons in the Facebook graph.. */
 WHERE NOT EXISTS (                          /* ..such that there doesn't exist..     */
                    SELECT p2
                      FROM twitter_graph
                     MATCH (p2:Person)       /* ..a person in the Twitter graph..     */
                     WHERE p1.name = p2.name /* ..with the same name.                 */
                  )
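
For readers who think in code rather than queries, the NOT EXISTS subquery is an anti-join between two graphs. A minimal Python sketch, with both graphs reduced to made-up lists of person names and a hypothetical `only_in_first` helper:

```python
# Two separate graphs, reduced here to the names of their Person vertices.
facebook_graph = {"persons": ["Ann", "Ben", "Cat"]}
twitter_graph = {"persons": ["Ben", "Dan"]}


def only_in_first(first, second):
    """Names of persons in `first` with no same-named person in `second`."""
    second_names = set(second["persons"])      # the inner SELECT
    return [
        name for name in first["persons"]      # the outer MATCH
        if name not in second_names            # WHERE NOT EXISTS (...)
    ]


result = only_in_first(facebook_graph, twitter_graph)
```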
              
            

G-CORE

G-CORE is a community effort between industry and academia to shape the future of graph query languages.

G-CORE is designed by the LDBC Graph Query Language Task Force which was founded in 2012 to establish standard benchmarks for Graph DBMS.

G-CORE argues that query languages for graph DBMS should have two key characteristics:

  1. They should be composable, meaning that graphs are the input and output of queries.
  2. They should treat paths as first-class citizens, meaning that paths can be the outputs of certain queries.

A G-CORE graph query looks like this:

              
CONSTRUCT (n)
    MATCH (n:Person)
       ON social_graph
    WHERE n.employer = 'Acme'
              
            

And a multi-graph query would look like this:

              
CONSTRUCT (c)<-[:worksAt]-(n)
    MATCH (c:Company) ON company_graph,
          (n:Person) ON social_graph
    WHERE c.name = n.employer
    UNION social_graph
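
Composability is easier to appreciate in code. In this Python sketch every operation takes graphs and returns a new graph, so results can be chained. The dictionary layout and the `construct_employees` and `union` functions are my own simplification, not G-CORE semantics.

```python
def construct_employees(social_graph, employer):
    """Return a new graph holding only the people who work for `employer`."""
    nodes = {
        node_id: props
        for node_id, props in social_graph["nodes"].items()
        if props.get("employer") == employer
    }
    return {"nodes": nodes, "edges": []}


def union(graph_a, graph_b):
    """Combine two graphs into a new one; the inputs are left untouched."""
    return {
        "nodes": {**graph_a["nodes"], **graph_b["nodes"]},
        "edges": graph_a["edges"] + graph_b["edges"],
    }


social_graph = {
    "nodes": {
        "p1": {"label": "Person", "name": "Ann", "employer": "Acme"},
        "p2": {"label": "Person", "name": "Ben", "employer": "Initech"},
    },
    "edges": [("p1", "knows", "p2")],
}

acme = construct_employees(social_graph, "Acme")
# Because the result is itself a graph, it can feed the next operation.
combined = union(acme, social_graph)
```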
              
            

In a nutshell, the new fused GQL language combines:

  1. CRUD from Cypher
  2. Regular path queries (RPQs)
  3. Graph construction from G-CORE and Neo4j
  4. Composability from G-CORE and Cypher

Gremlin

Gremlin is the graph traversal language of Apache TinkerPop. It is designed, developed and distributed by the Apache TinkerPop community project.

Apache TinkerPop is an open source, vendor-agnostic graph computing framework for both graph databases (OLTP) and graph analytics systems (OLAP).

Once a dataset is TinkerPop-enabled, the data can be modeled with graphs which can then be traversed with Gremlin.

Gremlin, like GQL, works on the Property Graph. In fact, Gremlin describes itself as a functional, data-flow language that enables users to succinctly express complex traversals on (or queries of) their application's property graph.

Gremlin is the most widely used and supported graph database query language today, and TinkerPop is right at the center of it, with the world's largest graph development ecosystem extending its core.

The most popular graph systems supporting Gremlin and Apache TinkerPop are:

  1. Fully-managed graph database services such as Amazon Neptune and IBM Graph.
  2. Distributed graph databases such as Cosmos DB, GRAKN.AI, JanusGraph and Titan.
  3. In-memory graph databases such as Bitsy and TinkerGraph.
  4. OLTP graph databases such as HGraphDB, Neo4j, OrientDB, Apache S2Graph and Unipop.
  5. OLAP graph databases such as Hadoop (Spark).
  6. RDF graph databases such as BlazeGraph and Stardog.

Gremlin is Turing complete. As a graph traversal machine, Gremlin is composed of a graph G, a traversal Ψ and traversers T. These traversers (T) move about the graph according to instructions in the traversal (Ψ).
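
The three parts of the machine can be imitated in a few lines of Python. This is a toy model, not TinkerPop code: the graph data, the ratings and the `out`, `values` and `run` functions are all invented to show traversers being pushed through a sequence of steps.

```python
# The graph G: outgoing edges per vertex, plus some vertex properties.
out_edges = {
    "marko": [("knows", "josh"), ("knows", "vadas")],
    "josh": [("created", "lop"), ("created", "ripple")],
    "vadas": [],
    "lop": [],
    "ripple": [],
}
properties = {"lop": {"rating": 4.0}, "ripple": {"rating": 5.0}}


def out(label):
    """Step: move each traverser along outgoing edges with `label`."""
    def step(traversers):
        return [
            target
            for vertex in traversers
            for edge_label, target in out_edges[vertex]
            if edge_label == label
        ]
    return step


def values(key):
    """Step: map each traverser from its vertex to that vertex's property."""
    def step(traversers):
        return [
            properties[vertex][key]
            for vertex in traversers
            if key in properties.get(vertex, {})
        ]
    return step


def run(traversal, start):
    """Execute the traversal: feed the traversers T through every step."""
    traversers = list(start)
    for step in traversal:
        traversers = step(traversers)
    return traversers


# The traversal is just a list of steps built in the host language.
traversal = [out("knows"), out("created"), values("rating")]
ratings = run(traversal, ["marko"])
average = sum(ratings) / len(ratings)
```

Since the traversal is built from ordinary functions of the host language, it can be composed, stored and reused like any other value.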

Gremlin is typically embedded in the user's native programming language. The host language defines the Ψ of the Gremlin machine, which means Gremlin has support for:

  1. Imperative and declarative querying: An imperative Gremlin traversal tells the traversers how to proceed at each step, while a declarative traversal allows the traversers to select a pattern to execute from a collection of (usually nested) patterns.
  2. Host language agnosticism: Although Gremlin is Groovy-based, Gremlin traversals can be written in any programming language that supports function composition and nesting. This avoids writing part of a query in a database query language (say, SQL) and the rest in a programming language (say, Groovy or Java), and the traversals get the advantages of the host language and its tooling.
  3. User-defined DSLs.
  4. An extensible compiler.
  5. Multi-machine (distributed) execution models.
  6. Hybrid depth-first (DFS) and breadth-first (BFS) evaluation.

Here's a sample Gremlin traversal written in Java.
              
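import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class GremlinTinkerPopExample {
  public void run(String name, String property) {

    // Open the graph (the configuration is elided here).
    Graph graph = GraphFactory.open(...);
    GraphTraversalSource g = graph.traversal();

    // Start at the vertex with the given name, follow "knows" then "created"
    // edges, and average the requested property over the vertices reached.
    double avg = g.V().has("name",name).
                   out("knows").out("created").
                   values(property).mean().next().doubleValue();

    System.out.println("Average rating: " + avg);
  }
}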
              
            

Gremlin opens a lot of possibilities for graph databases with modern programming languages. A Gremlin traversal can:

  1. Open an embedded graph database
  2. Open a remote graph database by serializing itself across the network
  3. Send itself for cluster-wide distributed analysis (with an OLAP processor)

Gremlin has a REPL console with many neat shorthands. The Gremlin REPL is useful for understanding how graph traversals work and for ad-hoc analysis. The REPL also supports plugins.

SPARQL

SPARQL is an SQL-like declarative query language created by the W3C to query RDF graphs. Like Property Graphs, RDF graphs (stored in triple stores) are directed, labeled graphs of nodes and edges.

SPARQL (SPARQL Protocol And RDF Query Language) is a W3C standard designed to meet the use cases identified by the RDF Data Access Working Group. Even though it's also a protocol, for most use cases SPARQL's greatest value is as a query language for RDF graphs (RDF being another W3C standard).

RDF data is described as a collection of three-part statements (triples), each with:

  1. a subject (entity identifier);
  2. a predicate (attribute name), and;
  3. an object (attribute value)

The object can also be another entity, which can in turn be the subject of other triples. By connecting triples into networks of data, we form RDF graphs.

The subject and predicate are described using URIs. URIs look similar to URLs, except that they are just identifiers.

URIs, which are easy to confuse with URLs, are made simpler to write using RDF's Turtle syntax. Turtle shortens a URI by replacing everything before its last part with an abbreviated prefix, e.g. sn:employee01 vcard:name "Jack". Here the prefixes vcard and sn, declared with @prefix vcard and @prefix sn, stand in for longer URIs.

RDF makes it easy to mix a standard vocabulary such as vcard with a custom vocabulary such as sn in the example above.
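
Prefix expansion is simple string substitution, as this small Python sketch shows. The `expand` function is hypothetical, and the namespace for `sn` is a placeholder I made up; only the vcard namespace is the real one.

```python
PREFIXES = {
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    # Hypothetical namespace for the custom `sn` vocabulary.
    "sn": "http://example.org/staff/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'sn:employee01' into a full URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return prefixes[prefix] + local_name


full = expand("sn:employee01")
```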

Data in a relational database table can be represented in RDF as a collection of triples: the row identifier becomes the subject, the column identifier the predicate and the cell value the object.
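
That mapping can be sketched in Python as follows. The table contents and the `rows_to_triples` function are invented for illustration, and the subjects and predicates are written as prefixed names rather than full URIs to keep the example short.

```python
def rows_to_triples(prefix, rows, key_column):
    """Turn each cell of a table into a (subject, predicate, object) triple."""
    triples = []
    for row in rows:
        subject = f"{prefix}:{row[key_column]}"   # row identifier
        for column, value in row.items():
            if column == key_column:
                continue
            # column identifier -> predicate, cell value -> object
            triples.append((subject, f"{prefix}:{column}", value))
    return triples


employees = [
    {"id": "employee01", "name": "Jack", "title": "Engineer"},
    {"id": "employee02", "name": "Jill", "title": "Analyst"},
]
triples = rows_to_triples("sn", employees, "id")
```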

Here's an example of a simple SPARQL query:

              
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>

SELECT ?person
WHERE
{
  ?person vcard:first-name "Jack" .
  ?person vcard:last-name ?lastName .
}
              
            

SPARQL uses triple patterns to select data, such as those in the WHERE clause above. Triple patterns are like triples, but with variables acting as wildcards in place of one or more parts.

SPARQL allows PREFIX definitions so that we do not have to write out full URIs inside graph queries.
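
To see how triple patterns select data, here is a naive Python matcher that binds variables (terms starting with `?`) against a list of triples and joins the patterns on shared variables. The data, including the surnames, and the `match_pattern` and `query` functions are made up; real SPARQL engines are far more sophisticated.

```python
triples = [
    ("sn:employee01", "vcard:first-name", "Jack"),
    ("sn:employee01", "vcard:last-name", "Smith"),
    ("sn:employee02", "vcard:first-name", "Jill"),
    ("sn:employee02", "vcard:last-name", "Jones"),
]


def match_pattern(pattern, triple, bindings):
    """Try to match one triple pattern; return extended bindings or None."""
    new_bindings = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if new_bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must be equal to the triple's part.
            return None
    return new_bindings


def query(patterns, triples):
    """Return all variable bindings that satisfy every pattern."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


results = query(
    [
        ("?person", "vcard:first-name", "Jack"),
        ("?person", "vcard:last-name", "?lastName"),
    ],
    triples,
)
```

Because both patterns share `?person`, only triples about the same subject can satisfy them together, which is what makes a group of triple patterns behave like a join.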