What is a Graph Query Language?
A Graph Database Query Language, or a graph query language for short, is a concrete mechanism for creating, manipulating and querying graph data in a graph database.
Graph query languages are the SQL equivalents for graph DBMSs. Most developers will be familiar with some dialect of SQL (such as those used by PostgreSQL and MySQL), which makes SQL the de facto database query language.
In a previous article, I discussed the differences between graph databases and relational databases at length. In a nutshell, graph databases tend to have lower latency at scale, represent high-value relationships more accurately, and aggregate queries more efficiently than relational databases.
This post discusses the two most popular and widely supported graph database query languages, Gremlin and SPARQL, and introduces a third contender to the mix: GQL, a combination of Cypher, PGQL, and G-CORE.
Wait, what about GraphQL?
Is GraphQL a Graph Query Language?
GraphQL is, strictly speaking, not a graph query language. Even though its most popular use case is querying graph data, GraphQL is an API query language, while Gremlin, SPARQL and now GQL are all query languages for graph databases. Nothing in the GraphQL specification warrants the use of "graph" in its name or requires a graph data structure, and GraphQL can be used to query other data structures besides graphs.
GraphQL calls itself a data query language and runtime, not just a graph query language. More accurately, GraphQL is an alternative to REST for APIs.
Neo4j calls GraphQL a specification for querying a slice of an application graph. GraphQL queries then return data in a tree structure that matches the front-end view, regardless of where the data was fetched from.
From this, it is safe to say that GraphQL is largely agnostic of the underlying data structure in your API, but it returns your data as a tree.
GraphQL can therefore be used as a graph query language, but this is not its only use case. In fact, with Neo4j you can either query your graph data using GraphQL directly or have it sit on top of Cypher.
What is the Path Property Graph (PPG) Model?
One of the most popular ways of representing graphs, the Property Graph (PG) data model is a directed graph with labels on both nodes and edges, as well as (property, value) pairs associated with both.
PG is widely adopted by graph databases such as AgensGraph, Amazon Neptune, ArangoDB, Blazegraph, CosmosDB, DataStax Enterprise Graph, HANA Graph, JanusGraph, Neo4j, Oracle PGX, OrientDB, Sparksee, Stardog, TigerGraph and Titan, amongst others.
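To make the property graph model concrete, here is a minimal sketch in Python. The class and method names (Node, Edge, PropertyGraph, out_edges) are hypothetical and chosen for illustration only. They do not correspond to the API of any of the databases listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    labels: set = field(default_factory=set)        # e.g. {"Person"}
    properties: dict = field(default_factory=dict)  # (property, value) pairs


@dataclass
class Edge:
    id: str
    source: str  # id of the node the edge leaves
    target: str  # id of the node the edge enters
    labels: set = field(default_factory=set)        # e.g. {"KNOWS"}
    properties: dict = field(default_factory=dict)  # (property, value) pairs


class PropertyGraph:
    """A directed graph whose nodes and edges carry labels and properties."""

    def __init__(self):
        self.nodes = {}  # node id -> Node
        self.edges = {}  # edge id -> Edge

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, edge):
        # Edges are directed, and both endpoints must already exist.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges[edge.id] = edge

    def out_edges(self, node_id, label=None):
        """Edges leaving node_id, optionally restricted to one label."""
        return [
            e for e in self.edges.values()
            if e.source == node_id and (label is None or label in e.labels)
        ]


g = PropertyGraph()
g.add_node(Node("n1", {"Person"}, {"name": "Alice"}))
g.add_node(Node("n2", {"Person"}, {"name": "Bob"}))
g.add_edge(Edge("e1", "n1", "n2", {"KNOWS"}, {"since": 2015}))

# Names of everyone Alice knows, found by following labeled out-edges.
friends = [g.nodes[e.target].properties["name"] for e in g.out_edges("n1", "KNOWS")]
```

Note that the edge itself carries a property (since), which is what distinguishes a property graph from a plain labeled graph.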
Because nodes, edges and paths are all first-class citizens in G-CORE, paths have identity and can also have labels and (property, value) pairs associated with them. This is the basis for an extended property graph model called the Path Property Graph (PPG) model, which is backward compatible with the Property Graph (PG) model.
As a result, G-CORE as a language aims to balance path query expressiveness and evaluation complexity with the PPG model.
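The idea of paths as first-class citizens can be sketched in Python as follows. This is an illustrative sketch only, not G-CORE's formal definition: the names (Path, PathPropertyGraph, add_path) are hypothetical, and as a simplifying assumption it only accepts paths that follow edges in their stored direction.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    # Alternating ids: node, edge, node, ..., node.
    sequence: list
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


class PathPropertyGraph:
    def __init__(self):
        self.nodes = {}  # node id -> {"labels": set, "properties": dict}
        self.edges = {}  # edge id -> {"source", "target", "labels", "properties"}
        self.paths = {}  # path id -> Path; stays empty for a plain property graph

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = {"labels": set(labels), "properties": properties}

    def add_edge(self, edge_id, source, target, labels=(), **properties):
        self.edges[edge_id] = {
            "source": source, "target": target,
            "labels": set(labels), "properties": properties,
        }

    def add_path(self, path_id, sequence, labels=(), **properties):
        """Store a path under its own identity, with labels and properties."""
        if len(sequence) % 2 == 0:
            raise ValueError("a path must start and end with a node")
        for node_id in sequence[0::2]:
            if node_id not in self.nodes:
                raise KeyError(node_id)
        for i in range(1, len(sequence), 2):
            edge = self.edges[sequence[i]]
            # Assumption: edges are followed in their stored direction only.
            if (edge["source"], edge["target"]) != (sequence[i - 1], sequence[i + 1]):
                raise ValueError(f"edge {sequence[i]} does not connect its neighbours")
        self.paths[path_id] = Path(list(sequence), set(labels), properties)


g = PathPropertyGraph()
g.add_node("a", ["Person"], name="Alice")
g.add_node("b", ["Person"], name="Bob")
g.add_node("c", ["Person"], name="Carol")
g.add_edge("e1", "a", "b", ["KNOWS"])
g.add_edge("e2", "b", "c", ["KNOWS"])

# The path has its own id, a label, and a property, just like a node or an edge.
g.add_path("p1", ["a", "e1", "b", "e2", "c"], ["FriendOfFriend"], hops=2)
```

A graph whose paths collection is empty is just an ordinary property graph, which is one way to read the backward compatibility mentioned above.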
GQL (Cypher + PGQL + G-CORE)
The new GQL (Graph Query Language) is a fusion of three property graph query languages: Cypher, PGQL, and G-CORE.
The idea behind open sourcing Cypher in October 2015 (in the openCypher project) was to make it a vendor-independent language. This idea has, for the most part, not caught on in the graph community, and vendor lock-in remains a huge issue with nearly all graph databases today.
Cypher Query Language
Cypher is an SQL-inspired, declarative query language for describing graphs visually using an ASCII-art syntax. Its syntax is very SQL-like, so developers transitioning to graph databases from their relational counterparts often find Cypher intuitive and easy to pick up.
In Cypher syntax, a node or vertex representing a person is wrapped in parentheses and assigned a variable.
The relationships between nodes are represented with
Here's a sample of Cypher syntax.
Property Graph Query Language (PGQL)
Property Graph Query Language (PGQL) describes itself as a language that combines the power of graph pattern matching and SQL. PGQL, like Cypher, is very SQL-like.
The language relies heavily on graph pattern matching, which allows patterns to be matched against vertices and edges in a data graph.
SQL constructs such as
PGQL also has regular path expressions for reachability analysis.
Here's a sample of PGQL syntax.
Like Cypher, PGQL also uses ASCII-art syntax for matching nodes, edges and paths:
PGQL also allows subqueries for comparing data from different graphs.
The following PGQL example finds people who are on Facebook but not on Twitter.
G-CORE
G-CORE is a community effort between industry and academia to shape the future of graph query languages.
G-CORE argues that graph DBMS should support two key characteristics:
G-CORE is designed by the LDBC Graph Query Language Task Force. The LDBC (Linked Data Benchmark Council) was founded in 2012 to establish standard benchmarks for graph DBMSs.
A G-CORE graph query looks like this:
And a multi-graph query would look like this:
In a nutshell, the new fused GQL language combines:
Gremlin
Gremlin is the graph traversal language of Apache TinkerPop. It is designed, developed, and distributed by the Apache TinkerPop community project.
Apache TinkerPop is an open-source, vendor-agnostic graph computing framework for both graph databases (OLTP) and graph analytics systems (OLAP).
Once a dataset is TinkerPop-enabled, the data can be modeled with graphs which can then be traversed with Gremlin.
Gremlin, like GQL, works on the property graph. In fact, Gremlin describes itself as a functional, data-flow language that enables users to succinctly express complex traversals on (or queries of) their application's property graph.
Gremlin is the most widely used and supported graph database query language today, and TinkerPop is right at the center of it, with the world's largest graph development ecosystem extending its core.
The most popular graph systems supporting Gremlin and Apache TinkerPop are:
Gremlin is Turing complete. As a graph traversal machine, Gremlin is composed of a graph G, a traversal Ψ and a set of traversers T. The traversers (T) move about the graph according to the instructions in the traversal (Ψ).
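The relationship between G, Ψ and T can be illustrated with a toy traversal machine in Python. This is a sketch of the idea only, not TinkerPop's implementation or its Python API: the helper names (Traverser, out, has, values, run) and the sample data are made up for this example.

```python
from dataclasses import dataclass

# G: a tiny property graph (sample data invented for illustration).
VERTICES = {
    1: {"name": "marko"},
    2: {"name": "vadas"},
    3: {"name": "lop"},
    4: {"name": "josh"},
}
EDGES = [(1, "knows", 2), (1, "knows", 4), (1, "created", 3), (4, "created", 3)]


@dataclass(frozen=True)
class Traverser:
    """One member of T: a current location plus the path taken so far."""
    location: object
    path: tuple = ()


def has(key, value):
    """Step that keeps only traversers sitting on a matching vertex."""
    def step(t):
        return [t] if VERTICES[t.location].get(key) == value else []
    return step


def out(label):
    """Step that moves a traverser along outgoing edges with this label."""
    def step(t):
        return [
            Traverser(dst, t.path + (t.location,))
            for src, lbl, dst in EDGES
            if src == t.location and lbl == label
        ]
    return step


def values(key):
    """Step that replaces a vertex with one of its property values."""
    def step(t):
        return [Traverser(VERTICES[t.location][key], t.path + (t.location,))]
    return step


def run(traversal, start_ids):
    """Spawn one traverser per start vertex and apply each step of Ψ in order."""
    traversers = [Traverser(v) for v in start_ids]
    for step in traversal:
        traversers = [nxt for t in traversers for nxt in step(t)]
    return [t.location for t in traversers]


# Ψ: the traversal, expressed as an ordered list of steps.
psi = [has("name", "marko"), out("knows"), values("name")]
names = run(psi, VERTICES)
```

Here psi is loosely analogous to the Gremlin traversal g.V().has('name','marko').out('knows').values('name'). Each step can split one traverser into many or drop it entirely, which is how a single traversal fans out across the graph.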
Gremlin is typically implemented in the user's native programming language. The user's native programming language defines the Ψ of the Gremlin machine, which means it has support for:
Gremlin opens a lot of possibilities for graph databases with modern programming languages. A Gremlin traversal can:
Gremlin has a REPL console with many neat shorthands. The Gremlin REPL is useful for understanding how graph traversals work and for ad-hoc analysis. The REPL also supports plugins.
SPARQL
SPARQL is an SQL-like declarative query language created by the W3C to query RDF graphs. Like property graphs, RDF graphs (or triple stores) are directed, labeled graphs made up of nodes and edges.
SPARQL (SPARQL Protocol And RDF Query Language) is a W3C standard designed to meet the use cases identified by the RDF Data Access Working Group. Even though it's also a protocol, for most use cases SPARQL's greatest value is as a query language for RDF graphs (RDF being another W3C standard).
RDF data is described as a collection of three-part statements (triples), each with a subject, a predicate, and an object.
The object can also be a triple. By connecting triples into networks of data, we form RDF graphs.
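As a rough illustration of how triples connect into a graph, the Python sketch below stores each triple as a plain tuple and follows the links from subject to object. The ex: identifiers and the reachable helper are invented for this example and stand in for real URIs.

```python
# Each triple is a (subject, predicate, object) tuple.
# The "ex:" names are made-up identifiers standing in for full URIs.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:acme", "ex:locatedIn", "ex:berlin"),
]


def reachable(start, triples):
    """Return every node reachable from start by following subject -> object."""
    seen = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for subject, _predicate, obj in triples:
            if subject == node and obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


# The object of one triple is the subject of the next, so the three
# separate statements form one connected graph.
connected = reachable("ex:alice", triples)
```

Nothing in the list itself says "graph". The network emerges simply because the same identifier appears as the object of one statement and the subject of another.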
The subject and predicate are described using URIs. URIs are similar to URLs, except that they are just identifiers.
URIs, which are easy to confuse with URLs, are made simpler to write using RDF's Turtle syntax. This often shortens a URI by replacing everything before its last part with an abbreviated prefix, e.g.
RDF makes it easy to mix standard vocabulary such as
Data in a relational database table can be represented in RDF as a collection of triples: the row identifier becomes the subject, the column identifier becomes the predicate, and the value becomes the object.
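The row-to-triple mapping described above can be sketched in a few lines of Python. The function name rows_to_triples and the identifier scheme (table/key for subjects, table#column for predicates) are assumptions made for this example, not a standard mapping. Skipping NULL values rather than emitting a triple for them is likewise a design choice of the sketch.

```python
def rows_to_triples(table, rows, key):
    """Turn relational rows into (subject, predicate, object) triples.

    table: table name, used to build identifiers (hypothetical scheme)
    rows:  list of dicts, one per row, mapping column name -> value
    key:   name of the column holding the row identifier
    """
    triples = []
    for row in rows:
        subject = f"{table}/{row[key]}"  # row identifier -> subject
        for column, value in row.items():
            if column == key or value is None:
                continue  # the key is already in the subject; NULLs are skipped
            predicate = f"{table}#{column}"  # column identifier -> predicate
            triples.append((subject, predicate, value))  # cell value -> object
    return triples


rows = [
    {"id": 1, "name": "Alice", "city": "Berlin"},
    {"id": 2, "name": "Bob", "city": None},
]
triples = rows_to_triples("person", rows, key="id")
```

Each non-key cell of the table becomes exactly one triple, so a row with two populated columns yields two triples that share the same subject.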
Here's an example of a simple SPARQL query:
SPARQL uses triple patterns to select data, such as with the