AWS Neptune DB vs. Dynamo DB for entity lineage - database

I'm trying to evaluate what might work best for the following use-case:
There exists a set of entities that can be represented as a graph. Each vertex in the graph represents an entity, and each (uni-directional) edge represents a child-to-parent relationship. An entity may have multiple parents, and a parent may have multiple child entities. Usually, there is a "master" entity to which all entities can trace back. No entity can be removed. The requirement is that it should be easy to trace all the ancestors of any entity. The following are some conditions on the basis of which I'd like to evaluate:
deep trees (the highest ancestor can be far away) vs. shallow trees (the highest ancestor is usually not far away)
broad traversal paths (a vertex can have many parents) vs. narrow traversal paths (a vertex usually does not have many parents)
any other important conditions that I've missed
Using this graph as an example:
In a regular DynamoDB-like database, this would be represented as:
entity | parents
-------|-----------
A      | []
B      | [A]
C      | [A]
D      | [A]
E      | [B, C, D]
F      | [C, D]
A pre-existing condition is:
I'm far more familiar with DynamoDB, but have only very basic familiarity with NeptuneDB or any graph database, and therefore DynamoDB requires less up-front time investment. On the other hand, NeptuneDB is of course better suited for relationship-graph storage, but under what conditions is it worth the technical overhead?

There are of course many ways to model and store connected data. As you have observed, you could store a graph using adjacency lists as in your example. When working with highly connected data, where a Graph Database such as Amazon Neptune can really help is with the creation and execution of queries. For example, using the Gremlin query language (Neptune supports both TinkerPop/Gremlin and RDF/SPARQL), finding the most distant ancestor of vertex 'E' can be as simple as:
g.V('E').repeat(out()).until(__.not(out()))
No matter how deep the tree gets, the query stays the same. If you were to model the data using adjacency lists you would have to write code to traverse the "graph" yourself. A graph database engine like Amazon Neptune is optimized to efficiently execute these types of queries.
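For comparison, a rough sketch of the hand-written traversal you would need with the adjacency-list model in DynamoDB might look like this (Python/boto3; the table name "Entities", key "entity" and attribute "parents" simply mirror the example table above, not anything prescribed):

import boto3

# Assumes a table "Entities" with partition key "entity" (String) and a list
# attribute "parents", mirroring the adjacency-list table in the question.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Entities")

def ancestors(entity_id):
    """Walk the parent links, one GetItem per visited entity."""
    seen, frontier = set(), [entity_id]
    while frontier:
        current = frontier.pop()
        item = table.get_item(Key={"entity": current}).get("Item", {})
        for parent in item.get("parents", []):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

print(ancestors("E"))  # {'B', 'C', 'D', 'A'} for the example graph

Every extra level of depth and every extra parent adds another round trip here, which is exactly where the deep-vs-shallow and broad-vs-narrow conditions from the question start to bite.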
So in summary, you could do it using Dynamo or using Neptune, but if the graph becomes complex then using a Graph Database with a built-in set of graph querying capabilities should make the work you have to do a lot easier when writing queries to traverse the graph. The decision will come down to, as you note, the trade-off between reusing what you already know well versus learning something new to gain the ability to easily write and execute queries no matter how complex the connected data becomes. I hope this helps you make that decision.
You will find a simple example of using Gremlin to model and traverse a tree here:
http://www.kelvinlawrence.net/book/PracticalGremlin.html#btree

Related

Why don't most graph databases support bidirectional edges?

I have to choose a graph database system and am very surprised that the mainstream ones don't support this feature.
Why is it such a no-go for database systems? And why don't developers out there seem to ask for it? There should be a reason I'm not aware of.
Thanks for your help.
To my understanding, a "pure" bidirectional graph database cannot support cases where there are also unidirectional relationships, Twitter for example.
So the question becomes "why are there no hybrid (bidirectional and unidirectional) graph databases?" There are two problems with this solution:
It might not save storage as you expected, because for a bidirectional relationship a hybrid graph database would need to store three edges instead of just one: A -> B, B -> A, and A <-> B. The reason is that some very common queries involve unidirectional relationships.
The cost of some basic queries is rather high. For example, there are two frequently asked questions in graph databases:
Find all friends of A
Find all friends of B
Commonly a graph database saves all friends of A as adjacent edges (AB, AC, AD, …). To find all of A's friends it just needs to locate A and skim to the first edge whose prefix is not A. Suppose A has m friends and there are n records in the database in total; then the query complexity is O(log(n)) + O(m). The same logic applies to B. However, if a bidirectional edge is used, say A<->B, the cost of the query for A's friends is the same, but the query for B's friends would be O(n) because a full database scan is required.
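To make the cost argument concrete, here is a small sketch (Python, with a made-up in-memory edge list standing in for the storage) of the "locate A, then skim" lookup described above:

import bisect

# Hypothetical sorted list of directed edges, one (from, to) pair per edge.
edges = sorted([("A", "B"), ("A", "C"), ("A", "D"),
                ("B", "A"), ("C", "A"), ("D", "A")])

def friends(vertex):
    """Binary-search to the first edge leaving `vertex`, then scan: O(log n) + O(m)."""
    i = bisect.bisect_left(edges, (vertex,))
    result = []
    while i < len(edges) and edges[i][0] == vertex:
        result.append(edges[i][1])
        i += 1
    return result

print(friends("A"))  # ['B', 'C', 'D']

If the same friendship were stored only once as a single undirected edge under A, the lookup for B's friends would have no sorted prefix to jump to, which is why the answer puts that case at O(n).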

How should I store tag hierarchy data in DB for fast retrieval of hierarchy chains?

The data I am dealing with is a hierarchy of tags which are strings, and we would be performing reads far more than writes. We would be searching for a tag, and for whichever tag the search string is a substring of, we would need to return its full hierarchy with respect to its root, like if we have the following tree:
              Animals
                 |
         -----------------
         |               |
       Tiger            Pets
                         |
                 -----------------
                 |               |
               Dogs            Donkey
In case our search string is "Do" we need to get Dogs->Pets->Animals and Donkey->Pets->Animals. The data could be pretty large and the searches need to be as fast as possible. How should I model the data so as to get the required results? Which is more suitable for this: RDBMS or NoSQL?
From the relational part of the world, there are several ways you can model and implement hierarchies. You can find in-depth coverage in Joe Celko's Trees and Hierarchies in SQL for Smarties book. For this particular task, I think the path enumeration model could work quite well. In this model, you store a path in every node of the tree, so it is easy to search for nodes and output the path.
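As a rough illustration of the path enumeration model (SQLite here purely for convenience; table and column names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT PRIMARY KEY, path TEXT)")
# Every row carries its full path from the root, so no joins are needed
# to rebuild the hierarchy chain at read time.
conn.executemany("INSERT INTO tags VALUES (?, ?)", [
    ("Animals", "Animals"),
    ("Tiger",   "Animals/Tiger"),
    ("Pets",    "Animals/Pets"),
    ("Dogs",    "Animals/Pets/Dogs"),
    ("Donkey",  "Animals/Pets/Donkey"),
])

# Substring search on the tag name; the stored path already is the chain.
for name, path in conn.execute(
        "SELECT name, path FROM tags WHERE name LIKE ?", ("%Do%",)):
    print(" -> ".join(reversed(path.split("/"))))
# Dogs -> Pets -> Animals
# Donkey -> Pets -> Animals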
I don't see any issues with an RDBMS-based implementation. I'd look at NoSQL if 'large' is actually 'enormously huge'. I believe that searches would be faster on an RDBMS.
The decision of which technology to use is subjective. There are a lot more things you should consider than just this one query when making this decision. Also keep in mind that there is no such thing as a "typical" NoSQL database: there are dozens of different database technologies which all work differently.
But should you decide to use MongoDB, you can store the tag hierarchy by creating a document for each leaf-node which also includes the full tag-hierarchy of that leaf in an array like this:
{
    name: "German Shepherd",
    hierarchy: [
        "Animals",
        "Pets",
        "Dogs"
    ]
}
A find({ hierarchy: "Dogs" }) will return all documents where "Dogs" appears anywhere in the hierarchy chain. You can create an index on {hierarchy:1} which will drastically speed up this query (indexes on arrays create separate index keys for all array entries). MongoDB preserves the order of array entries, so you can rely on the order of the hierarchy array to accurately represent the hierarchy.
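In application code that could look roughly like this (pymongo; the database and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient()                 # assumes a local MongoDB instance
tags = client["tagdb"]["tags"]         # made-up database/collection names

# Multikey index: MongoDB creates one index key per entry of the array.
tags.create_index("hierarchy")

# All documents whose hierarchy chain contains "Dogs".
for doc in tags.find({"hierarchy": "Dogs"}):
    print(doc["name"], " -> ".join(reversed(doc["hierarchy"])))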

How to persist a graph data structure in a relational database?

I've considered creating a Vertices table and an Edges table but would building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?
Side note: I've heard of Neo4j but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like mongodb though.
The answer is unfortunately: your consideration is completely right on every point. You have to store Nodes (Vertices) in one table, and Edges referencing a FromNode and a ToNode, to convert a graph data structure to a relational data structure. And you are also right that this ends up in a large number of lookups, because you are not able to partition it into subgraphs that might be queried at once. You have to traverse from Node to Edge to Node to Edge to Node... and so on (recursively, while SQL works with sets).
The point is...
Relational, graph-oriented, object-oriented, and document-based are different types of data structures that meet different requirements. That's what it's all about, and why so many different NoSQL databases (most of them simple document stores) came up: it simply makes no sense to organize big data in a relational way.
Alternative 1 - Graph oriented database
But there are also graph-oriented NoSQL databases which make the graph data model a first-class citizen, like OrientDB, which I am playing around with a little bit at the moment. The nice thing about it is that although it persists data as a graph, it can still be used in a relational or even object-oriented or document-oriented way as well (i.e. by querying with plain old SQL). Nevertheless, traversing the graph is the optimal way to get data out of it, for sure.
Alternative 2 - working with graphs in memory
When it comes to fast routing, routing frameworks like Graphhopper build up the complete graph (billions of nodes) in memory. Because Graphhopper uses a memory-mapped implementation of its GraphStore, that even works on Android devices with only a few MB of memory needed. The complete graph is read from the database into memory at startup, and routing is then done there, so there is no need to look up the database.
I faced this same issue and decided to finally go with the following structure, which requires 2 database queries, then the rest of the work is in memory:
Store nodes in a table and reference the graph with each node record:
Table Nodes
id | title | graph_id
---------------------
105 | node1 | 2
106 | node2 | 2
Also store edges in another table and again reference the graph these edges belong to with each edge:
Table Edges
id | from_node_id | to_node_id | graph_id
-----------------------------------------
1 | 105 | 106 | 2
2 | 106 | 105 | 2
Get all the nodes with one query, then get all the edges with another.
Now build your preferred way to store the graph (e.g., adjacency list) and proceed with your application flow.
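A sketch of that build step (plain Python; the two lists stand in for the result sets of the two queries above):

from collections import defaultdict

# Stand-ins for "SELECT id FROM Nodes WHERE graph_id = 2" and
# "SELECT from_node_id, to_node_id FROM Edges WHERE graph_id = 2".
nodes = [105, 106]
edges = [(105, 106), (106, 105)]

adjacency = defaultdict(list)
for node_id in nodes:
    adjacency[node_id]                 # make sure isolated nodes appear too
for from_id, to_id in edges:
    adjacency[from_id].append(to_id)

print(dict(adjacency))                 # {105: [106], 106: [105]}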
Adding to the previous answers: MS SQL Server added support for graph architecture starting with SQL Server 2017.
It follows the described pattern of having Nodes and Edges tables (which should be created with special "AS NODE" and "AS EDGE" keywords).
It also has a new MATCH keyword, introduced "to support pattern matching and traversal through the graph", like this (friend is the name of an edge table in the example below):
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
There is also a really good set of articles on SQL Server Graph Databases on redgate Hub.
I am going to disagree with the other posts here. If you have a special class of graphs with restrictions, you can often get away with a more specialized design (for example, a limited number of edges per vertex, only needing to traverse one way, etc.).
However, for storing an arbitrary graph, relational databases are an excellent choice. They're designed with an incredibly good set of trade-offs that perform well in almost all situations. In addition, data needs tend to change over time, and a relational database lets you painlessly change the storage and lookup without changing the data representation.
Let's review your design:
one table for vertices (id, data)
one table for edges (startId, endId, data)
First observe that the storage is efficient as it is proportional to the data to store. If we have 10 vertices and 10 edges, we store 20 pieces of information.
Now, let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in O(log(n)) (maybe better depending on the index):
Given a node tell me the edges leaving it
Given a node tell me the edges entering it
Given an edge tell me the node it came from or enters
That's all the basic queries you need.
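As a self-contained illustration of that design and those three lookups (SQLite used here only because it is handy; the schema is the generic one described above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE edges (start_id INTEGER, end_id INTEGER, data TEXT);
    -- An index per direction keeps both traversal directions at O(log n).
    CREATE INDEX edges_by_start ON edges (start_id);
    CREATE INDEX edges_by_end   ON edges (end_id);
""")
conn.executemany("INSERT INTO vertices VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.execute("INSERT INTO edges VALUES (1, 2, 'a to b')")

print(conn.execute("SELECT * FROM edges WHERE start_id = 1").fetchall())  # edges leaving 1
print(conn.execute("SELECT * FROM edges WHERE end_id = 2").fetchall())    # edges entering 2
print(conn.execute("SELECT start_id, end_id FROM edges WHERE data = 'a to b'").fetchall())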
Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable size. It's a little easier to traverse. But what if you want to traverse in the other direction? Now you have to store a list of edges entering each vertex as well.
Now you have two copies of that information, and the database (or you the developer) must do a lot of work to make sure they don't ever get out of sync.
O(log(n)) vs O(1)
Relational database indices typically store data in a sorted form, or as others have pointed out, can also use a hash table.
Even if you are stuck with sorted data, it's going to perform very well.
First note that big-O measures scalability, not performance. Hashes can be slower than a plain loop for small data sets. Even though hashing's O(1) is better, binary search's O(log n) is pretty darn good. You can search a billion records in 30 steps! In addition, it is cache- and branch-predictor-friendly.
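The 30-step figure is just the base-2 logarithm of a billion:

import math

print(math.ceil(math.log2(1_000_000_000)))  # 30 comparisons suffice
print(2 ** 30)                               # 1073741824, i.e. just over a billion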

The right record access implementation

I am looking into indexing engines, specifically Apache Lucene/Solr. We are willing to use it for our searches, yet one of the problems solved by our framework's search is row-level access.
Solr does not provide record access out of the box:
<...> Solr does not concern itself with security either at the document level or the communication level.
And in the section about document level security: http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
There are a few suggestions - either use Manifold CF (which is highly undocumented and seems to be in a very pre-beta stage) or write your own request handler/search component (that part is marked as a stub) - I guess that the latter one would have a bigger impact on performance.
So I assume not much is being done in this field.
In the recently released 4.0 version of Solr, they have introduced joining two indexed entities. Joining might seem like a nice idea, since our framework also does a join to know whether the record is accessible for the user. The problem here is that sometimes we do an inner join, and sometimes an outer join (depending on the optimistic (everything that's not forbidden is allowed) or pessimistic (everything is forbidden except what is explicitly allowed) security setting in the scope).
To give a better understanding of what our structure looks like:
Documents
DocumentNr | Name
-----------|-----
1          | Foo
2          | Bar
DocumentRecordAccess
DocumentNr | UserNr | AllowRead | AllowUpdate | AllowDelete
-----------|--------|-----------|-------------|------------
1          | 1      | 1         | 1           | 0
So for example the generated query for the Documents in pessimistic security setting would be:
SELECT * FROM Documents AS d
INNER JOIN DocumentRecordAccess AS dra ON dra.DocumentNr=d.DocumentNr AND dra.AllowRead=1 AND dra.UserNr=1
This would return only Foo, but not Bar. And in the optimistic setting:
SELECT * FROM Documents AS d
LEFT JOIN DocumentRecordAccess AS dra ON dra.DocumentNr=d.DocumentNr AND dra.AllowRead=1 AND dra.UserNr=1
Returning both - the Foo and the Bar.
Coming back to my question - maybe someone has already done this and can share their insight and experience?
I am afraid there's no easy solution here. You will have to sacrifice something to get ACLs working together with the search.
If your corpus size is small (I'd say up to 10K documents), you could create a cached bit set of forbidden (or allowed, whichever less verbose) documents and send relevant filter query (+*:* -DocumentNr:1 ... -DocumentNr:X). Needless to say, this doesn't scale. Sending large queries will make the search a bit slower, but this is manageable (up to a point of course). Query parsing is cheap.
If you can somehow group these documents and apply ACLs on document groups, this would allow cutting on query length and the above approach would fit perfectly. This is pretty much what we are using - our solution implements taxonomy and has taxonomy permissions done via fq query.
If you don't need to show the overall result set count, you can run your query and filter the result set on the client side. Again, not perfect.
You can also denormalize your data structures and store both tables flattened in a single document like this:
DocumentNr: 1
Name: Foo
Allowed_users: u1, u2, u3 (or Forbidden_users: ...)
The rest is as easy as sending user id with your query.
Above is only viable if the ACLs are rarely changing and you can afford reindexing the entire corpus when they do.
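With the flattened document above, restricting a search to one user is then just an extra filter query on the request; a sketch over Solr's HTTP API (the core name and user id are made up):

import requests

params = {
    "q": "Name:Foo*",            # whatever the actual search is
    "fq": "Allowed_users:u1",    # restrict to documents this user may read
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/documents/select", params=params)
print(resp.json()["response"]["numFound"])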
You could write a custom query filter which would have cached BitSets of allowed or forbidden documents by user (group?) retrieved from the database. This would require not only providing DB access for the Solr webapp but also extending/repackaging the .war which comes with Solr. While this is relatively easy, the harder part would be cache invalidation: the main app should somehow signal the Solr app when ACL data gets changed.
Options 1 and 2 are probably more reasonable if you can put Solr and your app onto the same JVM and use javabin driver.
It's hard to advise more without knowing the specifics of the corpus/ACLs.
I agree with mindas and what he has suggested (solution 4); I have implemented my solution the same way, but the difference is that I have a few different types of ACLs: at the user-group level, the user level, and even the document level (private access).
The solution is working fine. But the main concern in my case is that the ACLs get changed frequently and need to be updated in the index, while search performance should not be affected either.
I am trying to manage this with load balancing and by adding a few more nodes to the cluster.
mindas, unicron, can you please put your thoughts on this?

Nosql DB for undirected graphs?

I want to store a graph of millions of nodes where each node links to another in an undirected manner (point A to B, and automatically B points to A). I have examined Neo4j and OrientDB as possible solutions, but they seem to be oriented toward directed graphs, and Neo4j not being free for >1 million nodes is not a solution for me.
Can you help me decide which of the other NoSQL DBs (Redis, CouchDB, MongoDB, ...) would suit this best, and how it could be implemented? I want to make no-property (just give me the linked elements) breadth-first queries with 2 depth levels (having A<->B, B<->C, C<->D, querying A should give me B and C, but not D).
OrientDB has no limitation on the number of nodes. Furthermore, the default model is bi-directional. You can use it for FREE, also for commercial purposes, since the applied license is Apache 2.
The GraphDB is documented here: http://code.google.com/p/orient/wiki/GraphDatabase. Basically you can use the native API or the Blueprints implementation. The native API has an extension of the SQL language with special operators for graphs. Example:
SELECT FROM Account WHERE friends TRAVERSE (1,7) (address.city.country.name = 'New Zealand')
That means: give me all the accounts with a friend that lives in New Zealand, where friends are taken up to the 7th level of depth.
The second one allows you to use the full Blueprints stack, such as the Gremlin language, to create your super-complex queries.
Neo4j always stores relationships/edges as directed, but when traversing/querying you can easily treat the graph as undirected by using Direction.BOTH or in some cases by not defining a direction at all. (This way there's no need for "double" edges to cover both directions, you simply ignore the direction - and there's no performance penalty when traversing edges "backwards".)
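For example, the same idea expressed in Cypher through the Neo4j Python driver: leaving the arrow off the relationship pattern ignores direction, so a depth-limited query gives exactly the "A returns B and C but not D" behaviour asked for (the connection details and the LINKED relationship type are assumptions):

from neo4j import GraphDatabase

# Assumed connection details; adjust to your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# No arrow on -[:LINKED*1..2]- means direction is ignored; the *1..2 bound
# stops the traversal after two hops, so D (three hops from A) is excluded.
query = (
    "MATCH (a {name: $name})-[:LINKED*1..2]-(b) "
    "RETURN DISTINCT b.name AS name"
)
with driver.session() as session:
    names = [record["name"] for record in session.run(query, name="A")]
print(names)
driver.close()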
The 1 million "primitives" limit was removed quite a while ago. If your code is open source, you can use the community version for any size of DB. For other cases there are the commercial versions, which include one free alternative.
