When (and Why) to Choose Graph Databases over Relational Databases

Having worked with Neo4j for a little over 4 years now, I have noticed that most people coming from years of working with Relational Database tables find grasping how Graph Databases work rather daunting. To determine why and when to use graph databases instead of relational databases, I compared Neo4j and Amazon Neptune with PostgreSQL. Here's what I found.

Most data that fits a relational data structure also fits a graph data structure. Graph databases such as Amazon Neptune and Neo4j are NoSQL databases. You will get the most from Graph Databases if your data is huge, has intricately structured high-value relationships, and is constantly evolving (real-time). Graph Databases will also make data visualization and aggregation of queries a breeze. You might want to hold off if your data is not related at all.

In November 2017, AWS announced Amazon Neptune, its first Graph Database, ending its long, conspicuous absence from the Graph Community. Neo4j was and still is the front runner in this space, and certainly one of the best known.

What is a Graph Database?

A graph database is a type of NoSQL database that uses graph theory (graph data models) to store, map and query relationships.

In graph theory, a graph consists of vertices (nodes) connected by edges (arcs).

A graph database is thus essentially a collection of vertices and edges. A vertex represents an entity, a discrete object such as a person, place or event, while an edge represents a relationship between vertices, such as one person knowing another, or a person having been involved in an event at a certain place.

A vertex in a graph database has a unique identifier and a set of edges. Both vertices and edges can have an arbitrary number of key/value pairs, i.e. properties.

Properties typically express non-relational information about vertices and edges.

A graph as used in graph databases is often referred to as a property graph.

When a graph is undirected, its edges have no direction: the two vertices joined by an edge are interchangeable, and the relationship holds both ways.

A graph database models vertices and edges as first-class entities. This allows complex interactions to be expressed in a way that mimics a more natural form of data modeling and representation.
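The vertex, edge and property description above can be sketched in a few lines of plain Python. This is only an illustration of the data model, not how Neo4j or Neptune store data internally; the class names (`Vertex`, `Edge`, `PropertyGraph`), the labels and the sample data are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: str                                            # unique identifier
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str                                        # id of the vertex the edge leaves
    target: str                                        # id of the vertex the edge enters
    label: str                                         # relationship type, e.g. "KNOWS"
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph with directed edges."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Vertex] = {}
        self.out_edges: Dict[str, List[Edge]] = {}

    def add_vertex(self, vertex_id: str, **properties: Any) -> Vertex:
        vertex = Vertex(vertex_id, dict(properties))
        self.vertices[vertex_id] = vertex
        self.out_edges.setdefault(vertex_id, [])
        return vertex

    def add_edge(self, source: str, target: str, label: str, **properties: Any) -> Edge:
        edge = Edge(source, target, label, dict(properties))
        self.out_edges[source].append(edge)
        return edge

    def neighbors(self, vertex_id: str, label: str) -> List[Vertex]:
        """Vertices reachable from vertex_id over outgoing edges with this label."""
        return [self.vertices[e.target] for e in self.out_edges[vertex_id] if e.label == label]


graph = PropertyGraph()
graph.add_vertex("alice", kind="person", name="Alice")
graph.add_vertex("bob", kind="person", name="Bob")
graph.add_vertex("launch", kind="event", name="Product launch")
graph.add_edge("alice", "bob", "KNOWS", since=2015)           # edges carry properties too
graph.add_edge("alice", "launch", "ATTENDED", role="speaker")

print([v.properties["name"] for v in graph.neighbors("alice", "KNOWS")])
```

The edges in this sketch are directed: Alice knows Bob, but nothing is recorded in the other direction. One simple way to treat a relationship as undirected would be to add the same edge both ways.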

What is Graph Data?

Most data can be represented as graphs.

Data that is composed of heterogeneous sets of objects (which can be represented as vertices) that can be related to one another in complex ways (which can be represented as edges) is a perfect fit for a graph data model.

While data in tables can also be related, as represented in relational databases, the relationships are somewhat simplistic when contrasted with graph data. Data that lends itself to complex many-to-many relationships is more naturally represented with graphs.

Apache TinkerPop is a popular, widely supported graph computing framework that uses the Gremlin graph traversal language.

Gremlin traverses property graphs using a sequence of map-steps, filter-steps or sideEffect-steps in queries.
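To give a feel for what those three kinds of steps do, here is a rough imitation in plain Python. It does not use Gremlin or TinkerPop at all; the helper names (`filter_step`, `side_effect_step`, `map_step`) and the sample vertices are hypothetical and exist only to show how the steps chain together.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Vertex = Dict[str, Any]


def filter_step(traversers: Iterable[Vertex], predicate: Callable[[Vertex], bool]) -> Iterator[Vertex]:
    """Filter-step: let a traverser through only if the predicate holds."""
    return (v for v in traversers if predicate(v))


def side_effect_step(traversers: Iterable[Vertex], action: Callable[[Vertex], None]) -> Iterator[Vertex]:
    """SideEffect-step: run an action for each traverser, then pass it on unchanged."""
    for v in traversers:
        action(v)
        yield v


def map_step(traversers: Iterable[Vertex], transform: Callable[[Vertex], Any]) -> Iterator[Any]:
    """Map-step: turn each traverser into a new value."""
    return (transform(v) for v in traversers)


people: List[Vertex] = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 27},
    {"name": "Carol", "age": 41},
]

seen: List[str] = []                                   # filled in by the side effect
older = filter_step(people, lambda v: v["age"] > 30)
logged = side_effect_step(older, lambda v: seen.append(v["name"]))
names = list(map_step(logged, lambda v: v["name"]))    # the pipeline is lazy until consumed

print(names)
```

In Gremlin itself, the filter and map parts of this pipeline would look roughly like `g.V().has('age', gt(30)).values('name')`, with the steps chained in the same left-to-right order.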

Is My Data a Graph?

There really aren't a lot of true hierarchies in data. They rarely exist in practice.

Graph data is a much better representation of how data actually works in the real world.

Here are three common pointers as to whether your data is better off with graphs than with relational or hierarchical databases.

  1. If data is best represented by many-to-many relationships.
  2. If these complex relationships between data change often (highly flexible but important relationships).
  3. If data has unstructured relationships (complex but non-hierarchical - much closer to an unstructured network).

How do Graph Databases work?

On an abstract level, graph databases see data through a completely different model from relational databases. A graph database sees your data as vertices related by edges, while a relational database sees your data as a set of tables connected through primary and foreign keys.

At a lower level, a graph database is just a huge index of data vertices. A graph query targets clear, explicit vertices, never touching the others. There are no hidden assumptions. A relational query, by contrast, may sweep across the large dataset named in its FROM clause only to collect a single field, particularly when no suitable index exists.
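The difference between the two access patterns can be sketched as follows. This is a deliberately simplified model in plain Python, with made-up data and function names; it assumes the relational side has no index on the column being searched, which a real database would usually have.

```python
from typing import Dict, List, Tuple

# Relational-style: one "follows" table of (follower, followee) rows.
follows_rows: List[Tuple[str, str]] = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "alice"),
]


def followees_by_scan(user: str) -> Tuple[List[str], int]:
    """Without an index, every row is examined to answer the question."""
    rows_examined = 0
    result = []
    for follower, followee in follows_rows:
        rows_examined += 1
        if follower == user:
            result.append(followee)
    return result, rows_examined


# Graph-style: each vertex keeps direct references to its own outgoing edges.
adjacency: Dict[str, List[str]] = {}
for follower, followee in follows_rows:
    adjacency.setdefault(follower, []).append(followee)
    adjacency.setdefault(followee, [])


def followees_by_adjacency(user: str) -> Tuple[List[str], int]:
    """Only the edges attached to the starting vertex are touched."""
    neighbors = adjacency[user]
    return list(neighbors), len(neighbors)


print(followees_by_scan("bob"))
print(followees_by_adjacency("bob"))
```

Both functions return the same followees; what differs is the second number, the amount of data touched. The scan grows with the size of the whole table, while the adjacency lookup grows only with the number of edges on the vertex being asked about.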

When to Use Graph Databases instead of Relational Databases (The Pros)

Graph databases are a better fit for some problems than others. Generally, data that can be modeled on a graph database can also be modeled on a relational database. Using graph databases offers the following advantages over relational databases.

  1. Low-latency at Large Scale

    A unique value proposition of graph databases is superior performance when querying huge datasets.

    Relational databases have a somewhat limited ability to handle multiple joins, especially on big datasets, without introducing an unnecessary level of complexity. The complex relational join query is a back-breaker.

    Graph databases excel at querying huge volumes of related data. Graphs flow from a relationship-centric data structure that stores data in its natural relationships, as opposed to adapting it to fit a tabular model. Data is accessed exactly as it was defined at load time. Query processing is faster because non-relevant data is easily bypassed.

    A key niche carved out by graph databases is shaping up to be real-time big data, particularly because of the flexibility of queries, coupled with their efficiency.

    Some of the performance bottlenecks of relational databases can be directly attributed to inefficient access patterns such as sequential scans.

  2. Intricately Structured High Value Relationships

    Graph databases are an AI favorite because of their ability to model complex data relationships.

    Data relationships are intricately structured to accommodate inference of things such as indirect facts and tangentially related information. The edges are just as important and detailed as the vertices.

    The capability of graph databases to accommodate rich relationship data is virtually unmatched by any other database technology available today. Google cites one of the key advantages of its graph-based semi-supervised machine learning approach as the ability to model labeled and unlabeled data jointly during learning by leveraging the graph data structure. This allows multiple signals to be combined into a single graph, with graph learning applied over it.

    The capability for knowledge inference from relationships in graph data structures has also been emphasized by DeepMind, especially as an optimization and configuration for neural networks.

  3. Near Perfect Data Visualization

    Data visualization is a notable graph database forte. Graph data structures are the industry standard here.

    Combining multiple dimensions to visualize large datasets such as time series, demographics etc. is one of the default use cases. Graph data structures are perfectly suited for modeling natural, intuitive data relationships.

  4. Aggregating Queries

    Aggregating queries in a tabular data structure is a pain because tables already dictate how data is grouped. Grouping data from an arbitrary selection of data points is awkward in a relational database.

    Schema evolution in graph queries is a key advantage in this regard. You can aggregate and manipulate your data by simply dropping or adding vertices that extend or shrink your data.

  5. Constantly Evolving Real-time Data

    Relational databases do not easily adapt to the constantly changing object types that are common in real-time and live-update applications.

    Highly expressive graph query languages adapt well to querying a constantly changing underlying schema.

    Other NoSQL databases are just as adaptive to constantly changing object types.
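Going back to the first advantage in the list above, the contrast between stacking joins and following edges can be sketched with a small breadth-first traversal. In a relational model, each extra hop typically means one more self-join on a relationship table (or a recursive query); in the sketch below it is one more pass over the frontier. The `knows` data and the `within_hops` function are made up for the example and are not taken from any particular graph database.

```python
from collections import deque
from typing import Dict, List, Set

# Adjacency list: who each person knows directly (illustrative data).
knows: Dict[str, List[str]] = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}


def within_hops(start: str, max_hops: int) -> Set[str]:
    """Breadth-first traversal that follows at most max_hops edges from start."""
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == max_hops:
            continue                       # do not expand past the hop limit
        for neighbor in knows[vertex]:
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    visited.discard(start)                 # report only the people reached
    return visited


print(sorted(within_hops("alice", 1)))
print(sorted(within_hops("alice", 2)))
```

Changing the depth of the question only changes the `max_hops` argument, and vertices that are not connected to the starting point are never visited.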

When NOT to Use Graph Databases (The Cons)

As with any popular technology, there is a tendency towards solving every database problem with graph databases.

When paired with the right use case, graph databases are a viable solution.

Some cases are not a good fit for graph databases, especially those that have no need for advanced data structures such as graphs.

Let's look at some of these.

  1. Unrelated data

    If your data is not related, graphs might not be the best fit. Most data is naturally connected in some way, but sometimes a dataset has no connections via properties and no connections at runtime via queries.

    It is conceivable that a dataset fits into, say, exactly one object type. This is not an ideal use case for a graph database.

  2. No standard query language

    NoSQL databases generally have no standard query language comparable to SQL. This can be both an advantage and a drawback. Graph databases are NoSQL. There are a host of different query languages with no central authority, e.g. Cypher, Gremlin, SPARQL and so on.

    As a result, few of these languages have support and tooling outside their immediate ecosystem. This often derails enterprise adoption because of the need for skilled in-house developer teams.

  3. Few proficient graph developers

    The graph data model is fairly new and not as mature or as popular as relational data models. The ecosystem is still catching up, and it is still hard to find and keep talent.

Which is the best Graph Database?

This is a bit hard to answer...

Mostly because Neo4j is pretty much what most people using graph data are familiar with.

A good place to start would probably be db-engines.

For multi-model graph databases, DataStax seems to be a close second behind Microsoft Azure Cosmos DB. OrientDB and ArangoDB are also popular alternatives to Neo4j.

The top 5 Graph Databases: Alternatives to Neo4j

  1. OrientDB
  2. ArangoDB
  3. AllegroGraph
  4. DataStax
  5. Virtuoso

As I said earlier, Neo4j is a front runner by a wide margin right now.

It still seems a little too early to tell what impact Amazon Neptune will have in the Graph community.

How good is Neo4j as a graph database?

Andrew Nikishaev had an interesting take on Neo4j after using it for a year.