SuttaCentral

Using Graph Database (ArangoDB)


#1

I decided to look into using a graph database, which can elegantly handle relationships and data that doesn’t fit well into simple tables. Fundamentally, a graph database thinks in terms of vertices (things) and edges (connections between things).


A graph database can model complex relationships: for example here the Pali Nikayas are children of “su” (Suttas) and “pi” (Pali). There is no need for one of these to be the root of the data.
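As a rough illustration of that multi-parent idea (not ArangoDB code, just plain Python), the structure can be sketched as a set of vertices plus a list of edges. The uids "su" and "pi" come from the example above; "pi-su" is a made-up uid standing in for the Pali Nikayas.

```python
# Vertices are plain documents keyed by uid.
# "pi-su" is a hypothetical uid used only for this sketch.
vertices = {
    "su": {"name": "Suttas"},
    "pi": {"name": "Pali"},
    "pi-su": {"name": "Pali Nikayas"},
}

# Each edge is a (parent, child) pair; a child may have several parents.
edges = [
    ("su", "pi-su"),
    ("pi", "pi-su"),
]


def parents_of(uid):
    """Return the uids of every vertex with an edge pointing at `uid`."""
    return sorted(parent for parent, child in edges if child == uid)


def children_of(uid):
    """Return the uids of every vertex that `uid` points at."""
    return sorted(child for parent, child in edges if parent == uid)


print(parents_of("pi-su"))  # ['pi', 'su']
```

Neither "su" nor "pi" has to be the root: the same child is reachable from both.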

I found a database called ArangoDB that thoroughly exceeds expectations. ArangoDB is a multi-model database, which means it can act as a document store (like MongoDB), a key-value store (like Redis) and a graph database (like Neo4j). Not only that, but it is geared towards serving JSON data to client-side rendering frameworks - in that sense it is highly practical.

Some highlights:

  1. Natively JSON: basically it speaks, thinks in and understands JSON.
  2. Can act as a standalone backend server, presenting a REST API. It has authentication and permissions.
  3. Is a turnkey application. You just install it and it’s ready to go - no dependencies or complications.
  4. The devs really care about performance: it has a fast C/C++ core with a V8 JavaScript layer on top of the core (Foxx microservices), rather than being a Java monstrosity.
  5. It has a web interface with a built-in visualizer and query analyzer.
  6. There is a consistently clear emphasis on ease of use and ease of learning.
  7. ArangoDB decreases the number of things you need to know, rather than increases.

Some things it does right:

  1. Plugins are implemented in JavaScript and sit close to the core, so they are ideal for implementing custom logic exposed on API endpoints. It is 1000000% easier to implement a plugin for ArangoDB than for one of the Java monstrosities like Elasticsearch.
  2. It is a database for a modern PWA: practically a drop-in replacement for something like MySQL if you want to use JSON and have a REST API - and that’s to say nothing of its multi-model capabilities.
  3. Microservices are a big buzzword but they can also be a nightmare to manage. With ArangoDB, instead of running a bunch of different services, you run just one: ArangoDB. The microservices are JS plugins that run inside it - snug up against the data where there is no communication overhead. Instead of proliferation of services you get consolidation. At first it seemed like a weird bolted-on feature. Now I recognize it’s bloody brilliant, in fact it’s practically common sense.
  4. The Arango Query Language (AQL) resembles Python and JavaScript by using for ... in ... constructions and such. The flow of AQL is decidedly straightforward and it is far more readable than SQL. A good language provides expressive power, that is, relatively few words are required to express your will and make the software carry it out. Like SQL, AQL fulfills this promise, while being more readable.
  5. Pragmatic. A “pure” graph DB can be rendered impractical by adhering to an ideology of graph purity. A multi-model database does not suffer from ideology: by also acting as a document store and key-value store, it can enjoy extremely high performance for operations for which a graph model is ill-suited.
  6. Import/Export is as JSON files, cleanly separated into structure and data.
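
To give a feel for the resemblance mentioned in point 4, here is a small sketch that puts an AQL query next to a Python comprehension with the same flow. The collection name `texts` and the fields `uid` and `lang` are invented for the example, and the AQL string is shown only for comparison; nothing is sent to a server.

```python
# Hypothetical documents; in ArangoDB these would live in a collection.
texts = [
    {"uid": "dn1", "lang": "pli"},
    {"uid": "dn1-en", "lang": "en"},
    {"uid": "mn1", "lang": "pli"},
]

# AQL form of the query, kept as a string purely for side-by-side reading.
AQL_QUERY = """
FOR doc IN texts
    FILTER doc.lang == "pli"
    RETURN doc.uid
"""

# Python comprehension following the same FOR / FILTER / RETURN flow.
pali_uids = [doc["uid"] for doc in texts if doc["lang"] == "pli"]

print(pali_uids)  # ['dn1', 'mn1']
```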

There are a lot of things that traditional databases do wrong now… to be fair, MySQL is over 20 years old, so it comes from a completely different era of computing. ArangoDB is only 5 years old; it has grown up in the modern era of computing, and it shows. It also seems to be a product of some very fine German engineering (it is an open-source project, but backed by a German database company).

I believe that ArangoDB can be a complete backend solution for delivering data to the frontend, using custom endpoints implemented in JavaScript for any logic too fancy for the standard API functions. So we could use it, but what are the compelling reasons?

  • To get a consistency guarantee. When all the data is loaded into a graph database and you try to link everything together with edges, you find out whether there are problems like typos in uids.
  • To get data wrangling functions and a REST API for free - also has a nice web interface baked in.
  • To have the data in a format which can be exported, and then imported into other applications like visualizers.
  • A farewell to Python (at least in the web server). While I love Python, there are advantages in using one language consistently.
  • Infinite possibilities: ArangoDB offers both great flexibility and ease of expression with blazing fast performance (contrast Elasticsearch, which is powerful but too sluggish for many tasks I can imagine), and it is easy to implement highly performant plugins for custom logic.
  • A clear winner: ArangoDB is clearly better than the alternatives - the closest competitor is OrientDB, but ArangoDB offers a host of side benefits, most of which come from the C/C++ core + V8 architecture.
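
The consistency point in the first bullet can be sketched in a few lines: once everything is expressed as vertices and edges, an edge whose endpoint does not match any known uid stands out immediately. This is plain Python rather than anything ArangoDB-specific, and the uids are made up for the example.

```python
def find_dangling_edges(vertex_uids, edges):
    """Return every edge that refers to a uid with no matching vertex."""
    known = set(vertex_uids)
    return [(src, dst) for src, dst in edges if src not in known or dst not in known]


# Hypothetical data: "nm1" is a deliberate typo for "mn1".
vertex_uids = ["dn1", "dn2", "mn1"]
edges = [("dn1", "mn1"), ("dn2", "nm1")]

print(find_dangling_edges(vertex_uids, edges))  # [('dn2', 'nm1')]
```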

Why not:

  • The more a service on a server is leveraged, the harder it might be to implement functionality in offline mode. Flipside: with ArangoDB any custom logic is implemented in JavaScript, which can be shared with the client, and at the end of the day the data is internally basically JSON, so it can be understood readily by JavaScript - just some of the data-wrangling functions we got for free earlier have to be reproduced.
  • Not all beer and pizzas. For example, vinaya parallels in the graph are basically a bomb (if there are 20 things all parallel to each other, that’s roughly 400 links if expressed in the most naive way as a graph). That said, the fact that ArangoDB is a multi-model database with a powerful plugin architecture entirely mitigates this issue - that is to say, even though it doesn’t solve all problems, it also doesn’t get in the way of solutions.
  • Graph databases aren’t very well known, and the most well known one (Neo4j) is kind of esoteric. ArangoDB is the youngest of those to appear on the radar.
  • Technology lock-in. This is obviously unavoidable: you have to use some technology or another. A good strategy is probably to minimize how esoteric your technologies of choice are. ArangoDB isn’t popular, I believe mainly due to being new, but it’s also very straightforward in every way.
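
The arithmetic behind the parallels "bomb" can be checked quickly. Counting one directed link for every ordered pair among 20 mutually parallel items gives 20 × 19 = 380, close to the rough figure of 400 (about n²) quoted above. One possible alternative, assumed here purely for illustration, is to link each item to a single group vertex, which needs only 20 links. The uids are invented.

```python
from itertools import permutations


def naive_links(members):
    """One directed link for every ordered pair of distinct members."""
    return list(permutations(members, 2))


def group_links(group_id, members):
    """One link from a shared group vertex to each member."""
    return [(group_id, member) for member in members]


# Hypothetical uids for 20 rules that are all parallels of each other.
members = [f"rule-{i}" for i in range(1, 21)]

print(len(naive_links(members)))             # 380
print(len(group_links("group-1", members)))  # 20
```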

I’m surprised I didn’t look into a graph database before now - although that’s because I barely even knew they existed. No graph DB has risen to prominence, and it could easily be assumed they are only suitable for esoteric purposes. Multi-model databases are even more unusual and even newer. It’s probably only in the past 2 or 3 years that it would have started looking like a good idea.


#2

Thanks Blake for that clear and helpful description. Even with my limited understanding I can see some exciting benefits here.

Regarding offline use, would it be right to say that we can implement the core functions first, then enhance it by dedicated JS on the client side to provide additional functions?


#3

In short, yes, it can be done incrementally. In long, this is what I imagine implementing offline functionality will entail:

Caching:
Save requests that have already been made, allowing the user to go back through history without making repeated requests. No problems here, it’s just a fancy caching layer with some rules for expiry. There’s barely any need for custom code here because it’s a stock pattern in PWAs. It is very easy to implement; in fact, I’m sure polymer-cli will do it for you.
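
In a real PWA this would live in the service worker, but the pattern itself ("a caching layer with some rules for expiry") is small enough to sketch in Python. The class name, the single max-age rule and the `fetch` callback are all assumptions made for this example.

```python
import time


class ExpiringCache:
    """Remember responses by URL and refetch them once they are too old."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age = max_age_seconds
        self.clock = clock  # injectable so expiry can be exercised without waiting
        self._store = {}    # url -> (time stored, response body)

    def get(self, url, fetch):
        """Return the cached body for `url`, calling `fetch(url)` only if needed."""
        now = self.clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]
        body = fetch(url)
        self._store[url] = (now, body)
        return body


# Usage with a stand-in fetch function that records how often it is called.
requests_made = []


def fake_fetch(url):
    requests_made.append(url)
    return f"body of {url}"


cache = ExpiringCache(max_age_seconds=60)
cache.get("/api/dn1", fake_fetch)
cache.get("/api/dn1", fake_fetch)  # served from the cache
print(len(requests_made))  # 1
```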

Site Data:
Download the site data (i.e. everything except texts). That would come to, I think, no more than 10 MB uncompressed, so a few MB compressed.
My thinking is that the simplest way to do this is as a big bundle of JavaScript which is ready to be consumed by Polymer and only has to be sliced up by the service worker. The data would probably be division-centric, with some more stuff on the side.
Oh, and you need some parallels-calculating logic, because the vinaya parallels are too big to “unfold” in advance. At the moment I am thinking of calculating the parallels in JavaScript anyway, and that would be very straightforward to do isomorphically - that is, the parallels calculation code is first implemented on the server, then straight up reused on the client. Or maybe even just send the basic parallels data across the wire and always calculate the parallels on the client - that would save bandwidth, and parallels aren’t hard to calculate. At the moment the whole parallels data is only 140 KB gzipped, so we could just send the whole lot and not feel too guilty (Google doesn’t feel guilty about sending 500 KB across the wire each time I load Gmail).
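
A minimal sketch of what "calculating the parallels on the client" could look like, written in Python for brevity. It assumes the basic parallels data is a list of groups, each group being a list of uids that are all parallels of one another; that data shape and the uids are assumptions, not the actual SuttaCentral format.

```python
def build_index(groups):
    """Map each uid to the list of groups it appears in."""
    index = {}
    for group in groups:
        for uid in group:
            index.setdefault(uid, []).append(group)
    return index


def parallels_of(uid, index):
    """Expand the parallels of `uid` on demand instead of storing every pair."""
    found = set()
    for group in index.get(uid, []):
        found.update(group)
    found.discard(uid)
    return sorted(found)


# Hypothetical groups; "c" belongs to two of them.
groups = [["a", "b", "c"], ["c", "d"]]
index = build_index(groups)

print(parallels_of("c", index))  # ['a', 'b', 'd']
print(parallels_of("d", index))  # ['c']
```

Only the compact group lists travel over the wire; the pairwise relationships are unfolded per lookup.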

It is important to have some way to synchronize the bundle of JSON on the client with the latest from the server. I was thinking of using json-diff-patch: periodically a new bundle of site data is generated and assigned an MD5-style version. If a client wants to download the offline site, the data is downloaded. If the client already has it, it periodically sends a message to the server saying “hey, I have this version, got updates?”, and if it is out of date the server sends a diff-patch with only the changes between the two versions. This requires keeping a history from the last month or so, because the server needs to know what the data used to look like in order to calculate a diff - but the history doesn’t need to go back forever, because the server can just say “I don’t recognize what you have, it’s too old or it’s corrupt, here, just take the complete data”. The main gain from using a diff-patch approach is that updates are (generally) minuscule and there’s no need to feel bad about not getting user consent.
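
The exchange described above can be sketched roughly as follows. This is Python rather than the JavaScript it would eventually be, and a crude top-level-only diff stands in for a real JSON diff library; the class, function and field names are all invented for the example. Pruning history older than a month is left out.

```python
import hashlib
import json


def version_of(bundle):
    """MD5-style fingerprint of a bundle, stable across key ordering."""
    payload = json.dumps(bundle, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def make_patch(old, new):
    """Crude stand-in for a JSON diff: top-level keys to set and to delete."""
    return {
        "set": {k: v for k, v in new.items() if k not in old or old[k] != v},
        "delete": sorted(k for k in old if k not in new),
    }


def apply_patch(bundle, patch):
    updated = {k: v for k, v in bundle.items() if k not in patch["delete"]}
    updated.update(patch["set"])
    return updated


class SyncServer:
    """Keeps past bundles so it can diff against what a client already has."""

    def __init__(self, bundle):
        self.history = {}  # version -> bundle
        self.publish(bundle)

    def publish(self, bundle):
        self.latest = version_of(bundle)
        self.history[self.latest] = bundle

    def check(self, client_version):
        current = self.history[self.latest]
        if client_version == self.latest:
            return {"kind": "up-to-date"}
        old = self.history.get(client_version)
        if old is None:
            # Too old or corrupt: fall back to sending the complete data.
            return {"kind": "full", "version": self.latest, "data": current}
        return {"kind": "patch", "version": self.latest,
                "patch": make_patch(old, current)}


# Hypothetical site data in two successive versions.
v1 = {"dn1": "Brahmajala", "dn2": "Samannaphala"}
v2 = {"dn1": "Brahmajala Sutta", "dn3": "Ambattha"}

server = SyncServer(v1)
client_data = dict(v1)
server.publish(v2)

reply = server.check(version_of(client_data))
client_data = apply_patch(client_data, reply["patch"])
print(reply["kind"], version_of(client_data) == reply["version"])  # patch True
```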

ArangoDB is in an ideal position to facilitate this, as it can run the same json-diff-patch library as the client and, being a database, it can also manage the patch history. Very little of the data wrangling would need to be reproduced on the client (because the data is pre-wrangled and delivered in bulk), but it does require some extra code on the server (to bundle the data), on the client (to unbundle the data) and on both (to synchronize).

(by sheer coincidence: one of ArangoDB’s cookbook recipes is a JSON diff algorithm)

Texts:
The texts are by far the largest amount of data. It is important to split the body of texts up into parts that can be made available offline - perhaps by division+language (an individual text can always be made available offline by just browsing to it).
On the client side there needs to be an interface for selecting content to be made available offline. It might not be necessary to have any bulk/batch delivery of texts; one request per text might well work out fine - after all, each time I go to Gmail the browser performs about 200 requests and SN has about 900 texts, so the request volume would at least be within an order of magnitude of acceptable. There does need to be some way to keep the downloaded texts up to date, but there are a considerable number of possible approaches - redownloading in the case of a change probably makes the most sense.
Clearly there should be some kind of user consent to download/update texts, and ideally informed consent i.e. an estimate for how much data will be downloaded.
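
The estimate for informed consent could be as simple as summing the sizes of the selected division+language parts and dividing by the connection speed. The sketch below is only that arithmetic; the sizes, the part names and the bandwidth figure are all made up.

```python
def estimate_download(sizes_mb, selection, mbit_per_second):
    """Return (total size in MB, estimated seconds) for the selected parts."""
    total_mb = sum(sizes_mb[part] for part in selection)
    seconds = total_mb * 8 / mbit_per_second  # 8 bits per byte
    return total_mb, seconds


# Hypothetical compressed sizes per (division, language) pair.
sizes_mb = {
    ("discourses", "en"): 6.0,
    ("discourses", "pli"): 4.0,
    ("monastic-code", "en"): 2.5,
}

selection = [("discourses", "en"), ("discourses", "pli")]
total, seconds = estimate_download(sizes_mb, selection, mbit_per_second=8)

print(f"Estimated download size: {total:.1f} MB")       # Estimated download size: 10.0 MB
print(f"Estimated download time: {seconds:.0f} seconds")  # Estimated download time: 10 seconds
```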


#4

Okay, thanks.

For offline use, I suggest the following for the UI.

There’s no need for excessive granularity in selecting texts. Language + collection is quite enough. English Pali suttas are currently only 6.7 MB zipped.

Offline use

Please select from the following options for offline use. You can change these at any time.

  • Select main language English ⏷
    • Language for site interface + translations. Default setting is the current site language.
  • Select additional translation languages (None) ⏷
    • Download translations in additional languages. (Optional)
  • Select Collection
    • Discourses [Y/n]
    • Monastic Code [y/N]
    • Abhidhamma [y/N]
  • Include root texts? [y/N]
    • Download texts in Pali, ancient Chinese, Sanskrit, and Tibetan.
  • Update automatically? [Y/n]
    • Recommended. Updates are incremental and will usually be very small.
  • Download only when connected to WiFi? [Y/n]
    • Recommended especially if you are on a mobile network with limited data.

Estimated download size: 100 MB
Estimated download time on current connection: 5 minutes

DOWNLOAD


#5

I downloaded the documentation mobis to my kindle and had a look through it back at my hut. There is a huge amount to digest there, but generally:

No binaries are available for Raspbian, but if it compiles on Windows, Linux/ARM should not be a problem. Version 3.2.3 is available here - https://www.arangodb.com/download-major/source/ - let me know if you’re working/playing with another version. Of course with a web interface we can share a playground server - I just have latency issues here due to the satellite connection.

Foxx looks cool - especially since we can scale up to “Cluster-FOXX” (tee-hee, maybe renders differently in German). My JavaScript is pretty much limited to alert('You clicked the button!'); but I’m sure I can find a good tutorial.

The graph mode for data modeling would definitely fill the role of our old IMM module.

All in all, yes, you’re going to be locked in to a specific server and API. I kinda think any database system that isn’t SQL based may well be obsolete in 5 years, but that’s what rewrites are for. 🙂


#6

For those interested in the graph features of ArangoDB, this free course is available:


#7

Hi JR,

Just to let you know, since then, we haven’t updated here, but we have gone ahead with ArangoDB. Blake is very happy with it!

The main structure of the backend is falling into place, using ArangoDB, Docker, nginx, Flask, and Swagger, and keeping Elasticsearch. The frontend was mostly built by Vimala in Polymer 2.0, and now the team is working on marrying the front and back together. The end result will be a PWA with full offline capability (except search!).

As well as our two normal developers, we’re now working with a team in Poland to ensure it’ll be done on time, and done well.


#8

Oh! That would be Jakub & Hubert. I’ve been getting notifications of their pull requests. Hi folks!

What’s a PWA?


#9

No, wait!

You can set up the latest version for yourself if you like:


#10

Yeah, I’ll definitely do that - just over 2 weeks till the end of rains. We seem to have a large number of git repos for SC, I’ll need to see what I actually need. My plan is:

  1. Install the dependencies - I’m thinking I’ll leave Elasticsearch for now.
  2. Get the code.
  3. Work through some of the JavaScript materials on https://developer.mozilla.org
  4. Start playing around with the SuttaCentral code and see where I might be able to help.

#11

You’ll need next-sc and nextdata.


#12

I already had them cloned and SSH set up, so two git pull commands later I’ve got the lot. Happy days.