I have to choose a graph database system and am very surprised that the mainstream ones don't support this feature.
Why is it such a no-go for database systems? And why don't developers out there seem to ask for it? There must be a reason I'm not aware of.
Thanks for your help.
To my understanding, a "pure" bidirectional graph database cannot support cases where there are also unidirectional relationships, as on Twitter, for example.
So the question becomes "why are there no hybrid (bidirectional and unidirectional) graph databases?" There are two problems with this solution:
It might not save as much storage as you expect, because for a bidirectional relationship a hybrid graph database would need to store three edges instead of just one: A -> B, B -> A, and A <-> B. The reason is that some very common queries involve unidirectional relationships.
The cost of some basic queries is rather high. For example, there are two frequently asked questions in graph databases:
Find all friends of A
Find all friends of B
Commonly, a graph database stores all of A's friends as adjacent edges (AB, AC, AD, …). To find all of A's friends, it just needs to locate A and scan forward to the first edge whose prefix is not A. Suppose A has m friends and there are n records in the database in total; then the query complexity is O(log(n)) + O(m). The same logic applies to B. However, if a single bidirectional edge is used, say A<->B, the cost of the query for A's friends is the same, but the query for B's friends would be O(n): the edge is stored only under the prefix A, so a full database scan is required to find it.
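To make the cost difference concrete, here is a minimal Python sketch. It assumes the store is nothing more than a list of (source, target) tuples sorted by source; the function names and the sample data are made up for illustration and do not come from any particular database.

```python
from bisect import bisect_left

def friends_by_prefix(sorted_edges, vertex):
    """Neighbours of `vertex` when every edge is stored under its source.

    Binary search to the first edge with this prefix, O(log n),
    then walk forward over the m matching edges, O(m).
    """
    i = bisect_left(sorted_edges, (vertex,))
    result = []
    while i < len(sorted_edges) and sorted_edges[i][0] == vertex:
        result.append(sorted_edges[i][1])
        i += 1
    return result

def friends_by_scan(edges, vertex):
    """Neighbours of `vertex` when A<->B is stored once, under A only.

    The vertex may be either endpoint, so every record is inspected: O(n).
    """
    result = []
    for a, b in edges:
        if a == vertex:
            result.append(b)
        elif b == vertex:
            result.append(a)
    return result

# Each friendship stored twice, once per direction.
stored_twice = sorted([("A", "B"), ("A", "C"), ("A", "D"),
                       ("B", "A"), ("C", "A"), ("D", "A")])
# Each friendship stored once, as a single bidirectional edge.
stored_once = sorted([("A", "B"), ("A", "C"), ("A", "D")])

print(friends_by_prefix(stored_twice, "B"))  # fast lookup finds A
print(friends_by_prefix(stored_once, "B"))   # prefix lookup misses the edge
print(friends_by_scan(stored_once, "B"))     # full scan is needed to find A
```

With the edge stored once, the prefix lookup for B comes back empty even though B is A's friend, which is why the full scan is needed.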
This is for a project that will map metadata. There are many more nodes but this particular one became a debate in the team.
Which model would yield the best query performance? Or does it not matter?
Option 1
Permission metadata is explicit as edges between nodes.
Option 2
Permission metadata is inside the properties of the edge.
Option 3
???
Let me comment for ArangoDB here, being one of its developers.
There is a third possibility, namely to have a single vertex collection and multiple edge collections for the different access methods. You would then "officially" have 3 graphs that share the same vertex set.
I would expect that this is better in performance, because each access type would only have to deal with a single type of edge and access would be fast.
Obviously it all depends on your queries. My statement holds for queries like "what are all the Entities a Person can update?" or "who can select this Entity?".
I could imagine that your standard query is more "Can this person delete that Entity?" or "Which access rights does this person have for that Entity?".
These two questions are probably not efficient with any of the approaches suggested, because as far as I see, all of them would then require a search, either in the outgoing edges of the Person or in the incoming edges of the Entity.
What would be needed here is a kind of "vertex-centric index", that is, an index that can be used for the set of outgoing or incoming edges of a given vertex. Suppose, for example, that you use your option 2 (or indeed 1, this does not matter so much) and have a sorted index on all edges that is sorted first by Person and then by Entity. Then finding the (probably singleton) set of edges from a given Person to a given Entity is a lookup with time complexity O(log(#edges)).
We at ArangoDB are currently busy adding this feature, which will appear in one of the next two releases.
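To illustrate the idea (this is not ArangoDB's API, just a plain Python sketch), assume each edge is a hypothetical (person, entity, access_type) record and the index is simply the sorted list of those records:

```python
from bisect import bisect_left

# Hypothetical permission edges: (person, entity, access_type).
edges = [
    ("bob", "doc1", "delete"),
    ("alice", "doc2", "select"),
    ("alice", "doc1", "update"),
    ("alice", "doc1", "select"),
]

# The "index": all edges sorted first by person, then by entity.
index = sorted(edges)

def access_rights(index, person, entity):
    """All access types `person` has on `entity`.

    One binary search, O(log(#edges)), then a walk over the usually
    tiny run of edges that share the (person, entity) prefix.
    """
    i = bisect_left(index, (person, entity))
    rights = []
    while i < len(index) and index[i][:2] == (person, entity):
        rights.append(index[i][2])
        i += 1
    return rights

def can(index, person, entity, access_type):
    """Answer 'Can this person <access_type> that entity?'."""
    return access_type in access_rights(index, person, entity)

print(access_rights(index, "alice", "doc1"))
print(can(index, "bob", "doc1", "delete"))
```

Without such an index, the same question would mean iterating over all outgoing edges of the Person or all incoming edges of the Entity.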
I can only speak for Neo4j here:
I don't know that it would matter much, but definitely benchmark! Both relationships and properties are stored as linked lists, so it will still need to traverse them. But if you have more relationships between Person and Entity nodes then putting them in properties starts to become more attractive.
I recommend checking out the free O'Reilly book Graph Databases to learn more about the internals of Neo4j. But benchmarks will always be the gold standard.
Good day!
I need to find a database for storing and processing complex structured information.
Something like a mind map. I need to have some arbitrary values in groups with connections to each other, and the connections must also have titles.
The biggest problem is that I need to get all the related values without knowing exactly what the connections are or how many of them there are.
For example:
With VALUE 3 connected
VALUE 1 from the group A as NAME OF COMMUNICATION 1
and VALUE 2 from group B as NAME OF COMMUNICATION 2
and ...
Up to any level of connections (i.e., the values of all properties connected to the associated properties, and so on for those properties, down to a predetermined level), although this part can be implemented in the application logic.
I looked at some NoSQL databases, but they do not allow such requests without knowing the exact values or links. I considered building this on MySQL with a lot of logic in the application to handle all this, but perhaps there is a storage system better suited to such a task?
I would be grateful for any help.
http://magika.tk/struct.png - A schematic example.
As Philipp says, mind maps are a type of graph, usually a spider diagram. A graph-based NoSQL database, such as Neo4j, would be suitable. Here's a longer list. Graph databases store information about the nodes and the edges. Each node has a pointer to all its adjacent nodes, so counting connections and groups should be very fast.
I've considered creating a Vertices table and an Edges table but would building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?
Side note: I've heard of Neo4j but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like mongodb though.
The answer is unfortunately: your consideration is completely right on every point. You have to store nodes (vertices) in one table and edges referencing a FromNode and a ToNode in another to convert a graph data structure into a relational one. You are also right that this ends up in a large number of lookups, because you cannot partition the graph into subgraphs that could be queried at once. You have to traverse from node to edge to node to edge to node, and so on (recursively, whereas SQL works with sets).
The point is...
Relational, graph-oriented, object-oriented and document-based are different types of data structures that meet different requirements. That's what it's all about, and it is why so many different NoSQL databases (most of them simple document stores) came up: it simply makes no sense to organize big data in a relational way.
Alternative 1 - Graph oriented database
But there are also graph-oriented NoSQL databases that make the graph data model a first-class citizen, like OrientDB, which I am playing around with a little bit at the moment. The nice thing about it is that, although it persists data as a graph, it can still be used in a relational or even object-oriented or document-oriented way (e.g., by querying with plain old SQL). Nevertheless, traversing the graph is surely the optimal way to get data out of it.
Alternative 2 - working with graphs in memory
When it comes to fast routing, routing frameworks like GraphHopper build up the complete graph (billions of nodes) in memory. Because GraphHopper uses a memory-mapped implementation of its GraphStore, this works even on Android devices and needs only a few MB of memory. The complete graph is read from the database into memory at startup, and routing is then done there, so there is no need to look anything up in the database.
I faced this same issue and decided to finally go with the following structure, which requires 2 database queries, then the rest of the work is in memory:
Store nodes in a table and reference the graph with each node record:
Table Nodes
id  | title | graph_id
----------------------
105 | node1 | 2
106 | node2 | 2
Also store edges in another table and again reference the graph these edges belong to with each edge:
Table Edges
id | from_node_id | to_node_id | graph_id
-----------------------------------------
1  | 105          | 106        | 2
2  | 106          | 105        | 2
Get all the nodes with one query, then get all the edges with another.
Now build your preferred way to store the graph (e.g., adjacency list) and proceed with your application flow.
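As a rough sketch of those two queries plus the in-memory step, here is a Python version using an in-memory SQLite database. The table and column names follow the tables above; the load_graph helper is a name made up for this example, and SQLite is only a stand-in for whatever database you actually use.

```python
import sqlite3
from collections import defaultdict

def load_graph(conn, graph_id):
    """Load one graph with two queries and build an adjacency list."""
    # Query 1: all nodes of the graph.
    nodes = {
        node_id: title
        for node_id, title in conn.execute(
            "SELECT id, title FROM nodes WHERE graph_id = ?", (graph_id,)
        )
    }
    # Query 2: all edges of the graph.
    adjacency = defaultdict(list)
    for from_id, to_id in conn.execute(
        "SELECT from_node_id, to_node_id FROM edges WHERE graph_id = ?",
        (graph_id,),
    ):
        adjacency[from_id].append(to_id)
    return nodes, dict(adjacency)

# Set up the sample data from the tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, title TEXT, graph_id INTEGER);
    CREATE TABLE edges (id INTEGER PRIMARY KEY, from_node_id INTEGER,
                        to_node_id INTEGER, graph_id INTEGER);
    INSERT INTO nodes VALUES (105, 'node1', 2), (106, 'node2', 2);
    INSERT INTO edges VALUES (1, 105, 106, 2), (2, 106, 105, 2);
""")

nodes, adjacency = load_graph(conn, 2)
print(nodes)
print(adjacency)
```

From here on, all traversal happens on the adjacency dictionary, with no further database reads.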
To add to the previous answers: MS SQL Server supports graph architecture starting with SQL Server 2017.
It follows the described pattern of having Nodes and Edges tables (which should be created with special "AS NODE" and "AS EDGE" keywords).
It also introduces a new MATCH keyword "to support pattern matching and traversal through the graph", used like this (friend is the name of the edge table in the example below):
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
There is also a really good set of articles on SQL Server Graph Databases on Redgate Hub.
I am going to disagree with the other posts here. If you have a special class of graphs with restrictions, you can often get away with a more specialized design (for example, a limited number of edges per vertex, only needing to traverse one way, etc.).
However, for storing an arbitrary graph, relational databases are an excellent choice. They're designed with an incredibly good set of tradeoffs that perform well in almost all situations. In addition, data needs tend to change over time, and a relational database lets you painlessly change the storage and lookup without changing the data representation.
Let's review your design:
one table for vertices (id, data)
one table for edges (startId, endId, data)
First observe that the storage is efficient as it is proportional to the data to store. If we have 10 vertices and 10 edges, we store 20 pieces of information.
Now, let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in at most O(log(n)) time (maybe better, depending on the index).
Given a node tell me the edges leaving it
Given a node tell me the edges entering it
Given an edge tell me the node it came from or enters
That's all the basic queries you need.
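Here is a small sketch of those three queries against the two-table design, using Python's sqlite3 module. The index names, helper functions and sample rows are assumptions made for illustration. I also assume indexes on startId and endId so that both directions are indexed lookups, and I use SQLite's implicit rowid to identify an edge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE edges (startId INTEGER, endId INTEGER, data TEXT);
    -- One index per direction, so neither lookup needs a table scan.
    CREATE INDEX edges_by_start ON edges (startId);
    CREATE INDEX edges_by_end ON edges (endId);
    INSERT INTO vertices VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO edges VALUES (1, 2, 'a->b'), (1, 3, 'a->c'), (3, 1, 'c->a');
""")

def edges_leaving(conn, vertex_id):
    """Given a node, the ids of the nodes its outgoing edges point to."""
    rows = conn.execute(
        "SELECT endId FROM edges WHERE startId = ?", (vertex_id,))
    return sorted(row[0] for row in rows)

def edges_entering(conn, vertex_id):
    """Given a node, the ids of the nodes its incoming edges come from."""
    rows = conn.execute(
        "SELECT startId FROM edges WHERE endId = ?", (vertex_id,))
    return sorted(row[0] for row in rows)

def endpoints(conn, edge_rowid):
    """Given an edge (by SQLite rowid), the node it leaves and enters."""
    return conn.execute(
        "SELECT startId, endId FROM edges WHERE rowid = ?", (edge_rowid,)
    ).fetchone()

print(edges_leaving(conn, 1))
print(edges_entering(conn, 1))
print(endpoints(conn, 3))
```

Both directions read the same single copy of each edge; only the indexes differ.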
Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable size. It is a little easier to traverse. But what if you want to traverse in the other direction? Now you have to store a list of edges entering each vertex as well.
Now you have two copies of that information, and the database (or you the developer) must do a lot of work to make sure they don't ever get out of sync.
O(log(n)) vs O(1)
Relational database indices typically store data in a sorted form, or as others have pointed out, can also use a hash table.
Even if you are stuck with sorted it's going to perform very well.
First, note that big O measures scalability, not performance. Hashes can be slower than plain loops for small data sets. Even though hashing at O(1) is asymptotically better, binary search at O(log(n)) is pretty darn good. You can search a billion records in 30 steps! In addition, it is cache and branch predictor friendly.
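The "30 steps" figure is easy to check. This small Python sketch counts how many times a sorted range of n records can be halved before nothing is left, which is the worst-case number of probes for a binary search (the function name is made up for this example):

```python
def binary_search_steps(n):
    """Worst-case number of probes to binary-search n sorted records."""
    steps = 0
    while n > 0:
        n //= 2      # each probe discards half of the remaining range
        steps += 1
    return steps

print(binary_search_steps(1_000))          # 10
print(binary_search_steps(1_000_000_000))  # 30
```

Growing the data a million-fold, from a thousand to a billion records, only adds 20 steps.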
Is there a difference between a graph and a hypergraph database?
Is every hypergraph database system also a graph database system?
I am asking for a side-by-side comparison. If it is possible to show this in one row:
Graph support: No/Graph/Hypergraph
Or if it is better to use two rows:
Graph support: No/Yes
Hypergraph support: No/Yes
Or do "graph" and "hypergraph" mean the same thing in the database context?
How a certain graph database handles its edges is an implementation detail. Hence an answer cannot really be given in regards to "[hyper]graph databases in general".
From the point of mathematical graph theory however there is a difference:
Edges as known from standard graphs model (directed or undirected) 1:1 connections.
Hyperedges as known from hypergraphs model (directed or undirected) n:n connections.
Graph vs. Hypergraph:
A simple graph can be considered a special case of the hypergraph, namely the 2-uniform hypergraph. However, when stated without any qualification, an edge is always assumed to consist of at most 2 vertices, and a graph is never confused with a hypergraph.
(Source)
Undirected hyperedges:
A[n] [undirected] hyperedge is an edge that is allowed to take on any number of vertices, possibly more than 2. A graph that allows any hyperedge is called a hypergraph.
(Source)
Directed hyperedges:
Directed hypergraphs (Ausiello et al., 1985; Gallo et al., 1993) are a generalization of directed graphs (digraphs) and they can model binary relations among subsets of a given set.
(Source)
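To make the "2-uniform" remark concrete, here is a minimal Python sketch that models an undirected hyperedge as a set of vertices. The representation and the function name are my own choices for illustration, not taken from any graph database.

```python
def is_uniform(hyperedges, k):
    """True if every hyperedge connects exactly k vertices."""
    return all(len(edge) == k for edge in hyperedges)

# An ordinary graph: every edge joins exactly two vertices.
graph_edges = [frozenset({"a", "b"}), frozenset({"b", "c"})]

# A hypergraph: one hyperedge joins three vertices at once.
hypergraph_edges = [frozenset({"a", "b", "c"}), frozenset({"c", "d"})]

print(is_uniform(graph_edges, 2))       # True: a graph is a 2-uniform hypergraph
print(is_uniform(hypergraph_edges, 2))  # False: not expressible as plain edges
```

Any structure that can hold the second list can also hold the first, but not the other way around without extra modelling.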
I want to store a graph of millions of nodes where each node links to another in an undirected manner (if A points to B, B automatically points to A). I have examined Neo4j and OrientDB as possible solutions, but they seem to be oriented toward directed graphs, and Neo4j, not being free for more than 1 million nodes, is not a solution for me.
Can you help me decide which of the other NoSQL DBs (Redis, CouchDB, MongoDB, ...) would suit this best, and how it could be implemented? I want to run no-property (just give me the linked elements) breadth-first queries with 2 depth levels (having A<->B, B<->C, C<->D, querying A should give me B and C, but not D).
OrientDB has no limitation on the number of nodes. Furthermore, the default model is bidirectional. You can also use it for free for commercial purposes, since it is licensed under Apache 2.
The GraphDB is documented here: http://code.google.com/p/orient/wiki/GraphDatabase. Basically, you can use the native API or the Blueprints implementation. The native API offers an extension of the SQL language with special operators for graphs. Example:
SELECT FROM Account WHERE friends TRAVERSE (1,7) (address.city.country.name = 'New Zealand')
That means: give me all the accounts that have a friend living in New Zealand. Friends are followed up to the 7th level of depth.
The second one allows you to use the full Blueprints stack, such as the Gremlin language, to create your super-complex queries.
Neo4j always stores relationships/edges as directed, but when traversing/querying you can easily treat the graph as undirected by using Direction.BOTH or in some cases by not defining a direction at all. (This way there's no need for "double" edges to cover both directions, you simply ignore the direction - and there's no performance penalty when traversing edges "backwards".)
The 1 million "primitives" limit was removed quite a while ago. If your code is open source, you can use the community version for any size of DB. For other cases there are the commercial versions, which include one free alternative.
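Independent of any particular database, the query from the question can be sketched in plain Python. Here I assume each edge is stored once and direction is ignored when the in-memory adjacency sets are built; the function names are made up for illustration.

```python
from collections import defaultdict, deque

def build_undirected(edges):
    """Adjacency sets in which every stored edge is usable both ways."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def neighbours_within(adjacency, start, max_depth):
    """Breadth-first search returning nodes at most max_depth hops away."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, depth + 1))
    return found

# The example from the question: A<->B, B<->C, C<->D, each stored once.
adjacency = build_undirected([("A", "B"), ("B", "C"), ("C", "D")])
print(neighbours_within(adjacency, "A", 2))  # ['B', 'C']
```

Querying from D with the same depth gives C and B, which shows that the stored direction of each edge does not matter.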