Amazon Neptune or Neo4j: The Quick Blow-by-blow Starter Guide

What is going on in the Graph Database world right now may be history in the making. There are strong indications that Graph Database technology is no longer a fringe domain.

Graph databases are now going mainstream and vendors are vying for domination amid claims graphs are the most natural way to model data in the real world. And it is not the Oracles and IBMs of the world that are leading this space right now.

AWS says their new fully-managed Graph Database service is optimized to handle billions of relationships and run queries in milliseconds.

Neo4j says that the latest version, 3.3, is 50% faster at writes than version 3.2, that it supports realtime transactions and graph traversal applications, and that it is massively scalable.

Graph databases are known for their brilliance at representing many-to-many relationships that are so hard to model in NoSQL and Relational Databases.

Here is the definitive side-by-side comparison of the most popular Graph Database (Neo4j) and Amazon Neptune, AWS's first foray into Graph Databases, to help you understand how you can now benefit from more choices and more features.

Amazon Neptune

Neo4j

Amazon Neptune is a fully-managed cloud-based high-performance graph database that is generally available on AWS. You can use open and popular graph query languages such as Gremlin and SPARQL to query connected data.

AWS handles provisioning, patching, backup, recovery, failure detection and repair for you.

Neo4j is the world's leading native graph database platform. It is written in Java and Scala and is accessible with the Cypher Query Language, developed internally at Neo4j and later open sourced through the openCypher project.

Neo4j has been working exclusively with graph technology since 2007 and it boasts the world's largest ecosystem for Graph DBMS.

Open Source

Nothing is mentioned about the availability of an open source version. It is safe to assume Amazon Neptune is closed source.

Neo4j 1.0 first released both open source and commercial packages back in 2007.

Neo4j is currently available in a GPL3-licensed open source community edition, with online backup and high availability extensions licensed under the Affero GPL.

High Availability and Replication

Amazon Neptune divides your data into 10GB "chunks" spread across many disks. Each chunk is replicated six ways across three availability zones. Loss of up to two copies does not affect writes, while the loss of up to three copies does not affect reads.
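The arithmetic behind this fault tolerance can be sketched as a simple quorum check. The quorum sizes below (4 of 6 copies for writes, 3 of 6 for reads) are assumptions based on the Aurora-style storage design, and `availability` is a hypothetical helper written for illustration, not a Neptune API.

```python
# Toy model of quorum-based storage: 6 copies of each chunk,
# 2 in each of 3 availability zones.
COPIES = 6
WRITE_QUORUM = 4  # assumption: a write needs 4 of 6 copies
READ_QUORUM = 3   # assumption: a read needs 3 of 6 copies


def availability(lost_copies: int) -> dict:
    """Return whether writes and reads still succeed after losing copies."""
    if not 0 <= lost_copies <= COPIES:
        raise ValueError("lost_copies must be between 0 and 6")
    healthy = COPIES - lost_copies
    return {
        "writes": healthy >= WRITE_QUORUM,
        "reads": healthy >= READ_QUORUM,
    }


# Show the effect of losing 0..6 copies of a chunk.
for lost in range(COPIES + 1):
    print(lost, availability(lost))
```

Under these assumptions, losing a whole availability zone (two copies) still leaves a write quorum, and losing a zone plus one more copy still leaves a read quorum.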

Neptune supports up to 15 read replicas, replicated asynchronously with automated failover (replica instances share the same underlying storage as the primary instance).

Although there is no support for cross-region replicas, it is possible to prioritize and modify certain replicas as failover targets by assigning a promotion priority.

High Availability on Amazon Neptune boils down to the number of replicas and their priority tiers.
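How priority tiers could drive the choice of a failover target is sketched below. This is an illustration only: it assumes that a lower tier number means a higher promotion priority and that ties are broken in favour of the larger instance. The `Replica` class, its fields and the replica names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str        # hypothetical instance identifier
    tier: int        # promotion priority tier; lower is assumed to win
    memory_gib: int  # stand-in for instance size, used as a tie-breaker


def pick_failover_target(replicas):
    """Choose the replica to promote when the primary fails."""
    if not replicas:
        raise ValueError("no replicas available for failover")
    # Sort key: lowest tier first, then largest instance.
    return min(replicas, key=lambda r: (r.tier, -r.memory_gib))


replicas = [
    Replica("replica-a", tier=1, memory_gib=16),
    Replica("replica-b", tier=0, memory_gib=16),
    Replica("replica-c", tier=0, memory_gib=64),
]
print(pick_failover_target(replicas).name)  # replica-c
```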

To increase availability, you increase replicas - Amazon Neptune is vertically scalable instead of being horizontally scalable, and there is no sharding!

Neo4j instances use master-slave cluster replication in high availability (HA) mode: a Master maintains a master copy of each data object and replicates it to each Slave (the full dataset is replicated across the entire cluster).

Updates are typically made through the master, which keeps accepting them regardless of how many other instances fail, as long as it remains available itself.

Neo4j doesn't have master-master replication and there is no way to set master priority for instances.

Although writes are synced with the elected master, reads can be done locally on each slave, which means read capacity increases linearly with the number of instances.

Neo4j supports in-memory sharding of the graph along natural "chunks" that can be kept hot on specific instances. Queries that map to those chunks can then be routed.

Neo4j supports full and incremental backups from running clusters.

ACID Transactions

It has been implied that Amazon Neptune has no support for ACID transactions. This is simply incorrect.

Amazon Neptune provides an ACID transactional model similar to that of Aurora and DynamoDB, featuring a write master (for immediate consistency) with transactions committed on distributed replicas (once at least four of them complete the update).

A possible exception to this might however be the bulk upload feature which might conceivably suspend ACID guarantees to enable higher write throughput rates.

Neo4j, like most graph databases, uses the ACID rather than the BASE consistency model, meaning:

All operations in a transaction succeed or every operation is rolled back (Atomicity);

The database is structurally sound on completion of a transaction (Consistency);

Transactions run sequentially without conflicting with one another (Isolation), and;

Results of a transaction are permanent (Durability).
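The atomicity and consistency properties listed above can be seen in a few lines of code. The sketch below deliberately uses SQLite from the Python standard library as a stand-in, not Neo4j or Neptune; the `account` table and the `transfer` and `balances` helpers are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This update violates the CHECK constraint on an overdraft.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM account"))


print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails halfway through, and the credit that had already been applied to the destination account is rolled back with it, so the balances are unchanged.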

While maintaining fully ACID transactions, Neo4j is able to commit tens of thousands of transactions per second.

Graph Visualization

Amazon Neptune conspicuously lacks graph data visualization. Considering that visual exploration is central to formulating queries and understanding graphs, this should be a standard feature.

Neptune offers visualization via partners which come with add-on costs.

The default Neo4j server comes with a powerful, customizable in-browser graph visualization tool. The Neo4j Browser is based on the built-in D3.js library.

On top of being an easy way to visualize graph data, it can be used for querying, adding data and creating relationships, amongst other things. Queries run in the Neo4j Browser are rendered as a visual graph, in a table format or as an ASCII-table result.

Advanced Data Analytics

Amazon Neptune does not support advanced data analytics with solutions such as Spark and GraphX. While it should be trivial to apply advanced analytics to your graph, you will struggle to integrate and move your graph data around.

Neo4j allows specific delegation for ad-hoc reporting and analytics instances. Analytics jobs can be run without compromising capacity.

Query Languages and API Support

Amazon Neptune lets you choose between the Property Graph (PG) model with its open source query language, Apache TinkerPop Gremlin (guided by DataStax's Marko Rodriguez), and the W3C standard Resource Description Framework (RDF) model with its standard query language, SPARQL.

AWS cites different use cases for the two data models. Customers in domains that use subject-predicate-object triples, such as knowledge graphs or clinical data stores, prefer RDF, while customers with variably structured data sources, such as social media, prefer PG.
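To make the difference between the two models concrete, here is a minimal sketch that stores the same small fact ("Alice knows Bob") both ways using plain Python data structures. It does not use Gremlin, SPARQL or any Neptune API; the identifiers, the `http://example.org/` namespace and the helper functions are all hypothetical.

```python
# Property graph: nodes and edges carry labels and key-value properties.
nodes = {
    "p1": {"label": "Person", "name": "Alice"},
    "p2": {"label": "Person", "name": "Bob"},
}
edges = [
    {"from": "p1", "to": "p2", "label": "KNOWS", "since": 2015},
]

# RDF: everything is a (subject, predicate, object) triple.
EX = "http://example.org/"  # hypothetical namespace
triples = {
    (EX + "alice", EX + "name", "Alice"),
    (EX + "bob", EX + "name", "Bob"),
    (EX + "alice", EX + "knows", EX + "bob"),
}


def pg_friends(node_id):
    """Names of the people a node points to via KNOWS edges."""
    return sorted(
        nodes[e["to"]]["name"]
        for e in edges
        if e["from"] == node_id and e["label"] == "KNOWS"
    )


def rdf_friends(subject):
    """Names of the resources a subject is linked to via ex:knows."""
    known = {o for s, p, o in triples if s == subject and p == EX + "knows"}
    return sorted(o for s, p, o in triples if s in known and p == EX + "name")


print(pg_friends("p1"))
print(rdf_friends(EX + "alice"))
```

Note that the `since` property on the KNOWS edge has no direct counterpart in the triple set: attaching data to a relationship in plain RDF needs extra modeling, which is one reason mapping between the two models takes real work.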

Once you declare which model to use, the two are not interoperable. This does not come as a surprise, as bridging the two models is not trivial. While getting two graph databases for the price of one looks attractive, having to extract, transform and load (ETL) data from one to the other is not.

AWS chose not to include RDF inference in Amazon Neptune, citing its impact on scalability. If you want inference, you have to use a reasoner engine in addition to Neptune.

Inference provides the ability to process RDFS or OWL rules used to declare schema when adding data. Schema declared often includes classes, inheritances, types, restrictions for nodes, edges and properties.
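As a rough illustration of what a reasoner adds, the sketch below forward-chains two RDFS-style rules (subclass transitivity and type inheritance) over a handful of triples. It is a toy under stated assumptions, not a real reasoner and not something Neptune provides; the `ex:` names and the `infer` function are hypothetical.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def infer(triples):
    """Forward-chain two RDFS-style rules until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # Only chain through "o is a subclass of o2" statements.
                if p2 != SUBCLASS or o != s2:
                    continue
                if p == SUBCLASS:  # subclass transitivity
                    new.add((s, SUBCLASS, o2))
                elif p == TYPE:    # type inheritance
                    new.add((s, TYPE, o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


data = {
    ("ex:Cardiologist", SUBCLASS, "ex:Doctor"),
    ("ex:Doctor", SUBCLASS, "ex:Person"),
    ("ex:alice", TYPE, "ex:Cardiologist"),
}

# Print only the triples that were derived rather than stated.
for triple in sorted(infer(data) - data):
    print(triple)
```

The derived triples state that `ex:alice` is also an `ex:Doctor` and an `ex:Person`, facts that were never loaded explicitly. Without inference, a query for all persons would miss her.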

There is no proper tooling for importing or exporting data.

Even though Amazon Neptune has tools for ingesting data in CSV, RDF and GraphML, these tools only handle simple static files. Even with DynamoDB streams for dynamic data import, you still have to write the ingestion code.

The same goes for exporting: although it is possible with Gremlin and SPARQL, there is no proper tooling around it.
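The kind of ingestion code you end up writing yourself might look like the sketch below, which turns in-memory records into CSV text for a bulk load. The header names (`~id`, `~label`, `~from`, `~to`, `name:String`) are assumed to follow Neptune's Gremlin CSV load format and should be checked against the current AWS documentation; the record shapes and function names are hypothetical.

```python
import csv
import io


def vertices_to_csv(people):
    """Serialize dicts with 'id' and 'name' keys into loader-style CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~label", "name:String"])  # assumed header format
    for person in people:
        writer.writerow([person["id"], "Person", person["name"]])
    return buf.getvalue()


def edges_to_csv(friendships):
    """Serialize (from_id, to_id) pairs into loader-style CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])  # assumed header format
    for i, (src, dst) in enumerate(friendships, start=1):
        writer.writerow([f"e{i}", src, dst, "KNOWS"])
    return buf.getvalue()


people = [{"id": "p1", "name": "Alice"}, {"id": "p2", "name": "Bob"}]
print(vertices_to_csv(people))
print(edges_to_csv([("p1", "p2")]))
```

In practice the resulting files would still need to be uploaded somewhere the loader can read them, which is a further step this sketch leaves out.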

Property Graph solutions fall into two major camps, Gremlin and Cypher.

Rather than use either RDF / SPARQL or PG / TinkerPop Gremlin, Neo4j chose to build its own custom query language, Cypher, for use with its Property Graphs.

Cypher was largely the invention of Andrés Taylor while working for Neo4j in 2011, then called Neo Technology.

It was later open sourced in October 2015 through the openCypher project. openCypher has some industry support, most prominently from SAP HANA Graph, Redis and AgensGraph.

In October 2017, Neo4j announced support for Cypher on Spark. Cypher on Spark is just a wrapper around GraphFrames (to which Cypher already delegates under the hood): data is processed in-memory while Cypher is used as an interface. Neo4j CEO Emil Eifrem admits that this approach incurs a performance penalty and besides, if you really wanted to use graph data on Spark, there was always GraphX.

Apparently, there is now also Cypher for Gremlin.

One of the pain points of Graph technology might actually be too much choice when it comes to query languages. It's almost as if every platform has its own query language.

Security

Amazon Neptune isolates your Graph Data in Virtual Private Clouds, which can be connected to on-premises infrastructure over encrypted IPsec VPNs. There are also firewall settings and network access controls for database instances.

Permissions are managed with standard AWS IAM roles. Specific resources such as database instances and snapshots can be grouped and assigned to specific roles.

Amazon Neptune instances can be encrypted with AWS KMS. This encrypts data stored at rest together with automated backups and replicas in the same cluster. Encrypting an existing unencrypted Neptune instance is not supported; you have to migrate your data to a new encrypted instance to achieve this.

As a managed service, AWS keeps your Amazon Neptune instances up-to-date with the latest patches (you can control when patches are applied).

Native user role management is only available with Neo4j Enterprise Edition. Although it's possible to create multiple users in the community edition, all users assume the privileges of an admin for the available functionality.

Neo4j supports pluggable authentication with the LDAP protocol which allows integration with Active Directory, OpenLDAP, and other LDAP-compatible authentication services such as Kerberos.

Neo4j's native user and role management can thus be completely turned off and LDAP groups mapped to native roles.

Neo4j supports subgraph access control making it possible to restrict a user's access to specified portions of a graph.

Neo4j recommends volume encryption with tools such as BitLocker, manual patching and regular review of the neo4j.conf file, amongst other security guidelines.