Graph database modeling: multiple edges are better than single edges with properties?

Graph database modeling: multiple edges are better than single edges with properties? - database

This is for a project that will map metadata. There are many more nodes but this particular one became a debate in the team.
Which model would yield the best query performance? Or it does not matter?
Option 1
Permission metadata is explicit as edges between nodes.
Option 2
Permission metadata is inside the properties of the edge.
Option 3
???

Let me comment for ArangoDB here, being one of its developers.
There is a third possibility, namely to have a single vertex collections and multiple edge collections for the different access methods. You would then "officially" have 3 graphs that share the same vertex set.
I would expect that this is better in performance, because each access type would only have to deal with a single type of edge and access would be fast.
Obviously it all depends on your queries. My statement holds for queries like "what are all the Entities a Person can update?" or "who can select this Entity?".
I could imagine that your standard query is more "Can this person delete that Entity?" or "Which access rights does this person have for that Entity?".
These two questions are probably not efficient with any of the approaches suggested, because as far as I see, all of them would then require a search, either in the outgoing edges of the Person or in the incoming edges of the Entity.
What would be needed here are a kind of "vertex centric indices", that is an index that can be used for the set of outgoing or incoming edges of a given vertex. If you, for example would use your option 2 (or indeed 1, this does not matter so much), and have a sorted index on all edges that is sorted first by Person and then by Entity. Then it is a lookup with time complexity O(log(#edges)) to find the (probably singleton) set of edges from a given Person to a given Entity.
We at ArangoDB are currently busy to add this feature, which will appear in one of the next two releases.

I can only speak for Neo4j here:
I don't know that it would matter much, but definitely benchmark! Both relationships and properties are stored as linked lists, so it will still need to traverse them. But if you have more relationships between Person and Entity nodes then putting them in properties starts to become more attractive.
I recommend checking out the free O'Reilly book Graph Databases to learn more about the internals of Neo4j. But benchmarks will always be the gold standard.

Related

Data organisation in graph database

This is more of a logical question rather than technical one. I am asking for data organisation guidance for my requirements. Please keep in mind that I am willing to use a graph database for this purpose (though I am pretty new at that). So guidance in graph database context would be much appreciated.
Let me provide an overview of the scenario. There are two entities in the app, User and House. User can owns a house or rents a house. If an user rents a house, there should be time period mentioned for which the user has rented the house. An user may rent same house for different periods.
Demo Dataset:
A (User) -owns-> H1, H2, H3 (House) - one-liner for brevity
X -rents-> H2 (start=DATE1, end=DATE2)
Y -rents-> H2 (start=DATE3, end=DATE4)
X -rents-> H2 (start=DATE5, end=DATE6) - user rents same house again
I am assuming that User and House would be nodes and owns and rents would be edges. Rent period would be properties of rents edges. Please point out if there is any better way.
Questions:
Is this possible in graph database in general to have multiple edges of same type between two nodes? Should I keep just one edge for rent of a specific user to specific house and add periods? Or should I maintain multiple edges for multiple periods?
Is it possible to query for something like: "fetch all the houses that were empty for a period of 3 months"? This should fetch the houses that have a gap of 3 months between consecutive end and next start dates in rents. These houses may not be empty now.
I have checked neo4j, cayley, dgraph etc. Which may be better with this scenario?
Any guidance of how I should keep the data with relationships would be much appreciated. Have a nice day.

I think this may be solved, but I would just add that TerminusDB may be useful to assess as part of your process. The reason that I sat this is that you are:
TerminusDB uses an expressive logical schema language that allows anything that is logically expressible. So you could have multiple edges of same type between two nodes. However, data modeling is something of an art - as your question suggests - so it will depend on your domain. (I always think that 'deal' could be an edge or a node depending on context - you can participate in a deal or you could strike a deal with another party).
As TerminusDB is a native revision control database, time bound queries can be relatively straightforward. You can get a delta, or a series of deltas, between two events.

There could be a better answer than this, still posting my experience with graph on the given requirement if this of any help to you.
I think it is the best fit for the graph DB for your requirement and to answer your questions.
It is more of designing your graph model to suit the purpose and I think you can have multiple rent edges with different periods from node user to node house.
Which way you can maintain the history and you can later delete the older/expired period edges if you want.
[Just to avoid duplicates] Assume here you need to make sure the edge would be created between nodes (user & house) only if the period slot is free.
You can add logic to the query while creating the edge between nodes.
With the given demo data set, here is the sample graph I have created based on the scenario you have described.
http://console.neo4j.org/?id=bxu3sp
Click on the above console link and you can run the below cypher query in the query window at the bottom
MATCH (user:User)-[rent:RENT]->(house:House)
WITH house, rent
ORDER BY rent.startDate
WITH collect(rent) as rents, house
UNWIND range(0, size(rents)-1) as index
WITH rents, index, house
WHERE duration.inDays(date(rents[index].endDate), date(rents[index+1].startDate)).days > 30
RETURN house
This would get the list of houses that was with no allocation for a given period range.
I'm not an expert and never used other than neo4j and so far with my experience on neo4j, documentation is really good and it is powerful with additions like Kafka integrations, GraphQL, Halin monitoring, APOC, etc.
I'd say it is a learning curve, just explore and play around with it to get yourself into the graph DB world.
Update:
In case of the same user renting the same house for different periods then the graph would look something like below as said you should avoid creating duplicates by not allowing edges for the same/overlapping window period between any user node and any house node. here in this graph, I have created edges for the different and non-overlapping start/end date so which is valid and not a duplicate.

Global Indexing in Graph Databases

I am very new to graph databases and am trying to work on a survey of different graph databases. I am not able to understand what exactly the global indexing in graph databases are.
Can someone please help me to understand what is Global indexing in Graph Databases.

I am not sure whether all graph databases agree on the notion of what a global index is, but generally it means an index that applies to the whole graph. Such an index allows to efficiently retrieve vertices based on some indexed property, e.g.: find all person vertices with the name Manoj. Most graph queries use a global index to find one or a small number of vertices as an entry point into the graph and then traverse the graph from there.
Opposed to global indexes are vertex-centric indexes. They only apply to a specific vertex and can be used to make queries with so-called supernodes more efficient. The idea here is to index a property of incident edges of the vertex that can reduce the number of neighboring vertices returned to those that are really interesting for the query. Such a vertex-centric index could for example for twitter be used to index the followedSince property on follower edges. This would allow to efficiently query for all followers of Katy Perry that began following her on her birthday. Without an index you would have to check the property for all of her (currently over 95 Mio.) followers for this query.
(Your question didn't mention vertex-centric indexes but I think it helps to understand why global indexes are called that way when you know about vertex-centric indexes, as they are basically local indexes.)
For more information about indexing in graph databases see the respective sections in the documentation of graph databases like Titan or DSE Graph.

Modeling for Graph DBs

Coming from as SQL/NoSQL background I am finding it quite challenging to model (efficiently that is) the simplest of exercises on a Graph DB.
While different technologies have limitations and best practices, I am uncertain whether the mindset that I am using while creating the models is the correct one, hence, I am in the need of guidance, advice and/or resources to help me get closer to the right practices.
The initial exercise I have tried is representing a file share entire directory (subfolders and files) in a graph DB. For instance some of the attributes and queries I would like to include are;
The hierarchical structure of the folders
The aggregate size at the current node
Being able to search based on who created a file/folder
Being able to search on file types
This brings me to the following questions
When/Which attributes should be used for edges. Only those on which I intend to search? Only relationships?
Should I wish to extend my graph capabilities, for instance, search on files bigger than X? How does one try to maximize the future capabilities/flexibility of the model so that such changes do not cause massive impacts.
Currently I am exploring InfiniteGraph and TitanDB.

1) The only attribute I can think of to describe an edge in a folder hierarchy is whether it is a contains or contained-by relationship.
(You don't even need that if you decide to consider all your edges one or the other. In your case, it looks like you'll almost always be interrogating descendants to search and to return aggregate size).
This is a lot simpler than a network, or a hierarchy where the edges may be of different types. Think an organization chart that tracks not only who manages whom, but who supports whom, mentors whom, harasses whom, whatever.
2) I'm not familiar with the two databases you mentioned, but Neo4J allows indexes on node properties, so adding an index on file_size should not have much impact. It's also "schema-less," so that you can add attributes on the fly and various nodes may contain different attributes.

How to persist a graph data structure in a relational database?

I've considered creating a Vertices table and an Edges table but would building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?
Side note: I've heard of Neo4j but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like mongodb though.

The answer is unfortunately: Your consideration is completely right in every point. You have to store Nodes (Vertices) in one table, and Edges referencing a FromNode and a ToNode to convert a graph data structure to a relational data structure. And you are also right, that this ends up in a large number of lookups, because you are not able to partition it into subgraphs, that might be queried at once. You have to traverse from Node to Edge to Node to Edge to Node...and so on (Recursively, while SQL is working with Sets).
The point is...
Relational, Graph oriented, Object oriented, Document based are different types of data structures that meet different requirements. Thats what its all about and why so many different NoSQL Databases (most of them are simple document stores) came up, because it simply makes no sense to organize big data in a relational way.
Alternative 1 - Graph oriented database
But there are also graph oriented NoSQL databases, which make the graph data model a first class citizen like OrientDB which I am playing around with a little bit at the moment. The nice thing about it is, that although it persists data as a graph, it still can be used in a relational or even object oriented or document oriented way also (i.e. by querying with plain old SQL). Nevertheless Traversing the graph is the optimal way to get data out of it for sure.
Alternative 2 - working with graphs in memory
When it comes to fast routing, routing frameworks like Graphhopper build up the complete Graph (Billions of Nodes) inside memory. Because Graphhopper uses a MemoryMapped Implementation of its GraphStore, that even works on Android Devices with only some MB of Memory need. The complete graph is read from database into memor at startup, and routing is then done there, so you have no need to lookup the database.

I faced this same issue and decided to finally go with the following structure, which requires 2 database queries, then the rest of the work is in memory:
Store nodes in a table and reference the graph with each node record:
Table Nodes
id | title | graph_id
---------------------
105 | node1 | 2
106 | node2 | 2
Also store edges in another table and again reference the graph these edges belong to with each edge:
Table Edges
id | from_node_id | to_node_id | graph_id
-----------------------------------------
1 | 105 | 106 | 2
2 | 106 | 105 | 2
Get all the nodes with one query, then get all the edges with another.
Now build your preferred way to store the graph (e.g., adjacency list) and proceed with your application flow.

Adding to the previous answers the fact that MS SQL Server adds support for Graph Architecture starting with 2017.
It follows the described pattern of having Nodes and Edges tables (which should be created with special "AS NODE" and "AS EDGE" keywords).
It also has new MATCH keyword introduced "to support pattern matching and traversal through the graph" like this (friend is a name of edge table in the below example):
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
There is also a really good set of articles on SQL Server Graph Databases on redgate Hub.

I am going to disagree with the other posts here. If you have special class of graphs with restrictions, you can often get away with a more specialized design (for example, limited number of edges per vertex, only need to traverse one way, etc).
However, for storing an arbitrary graph, relational databases are an excellent choice. They're designed with an incredibly good set of tradeoffs that perform well in almost all situations. In addition, data needs tend to change overtime, and a relational database let's you painlessly change the storage and lookup without changing the data representation.
Let's review your design:
one table for vertices (id, data)
one table for edges (startId, endId, data)
First observe that the storage is efficient as it is proportional to the data to store. If we have 10 vertices and 10 edges, we store 20 pieces of information.
Now, let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in at least log(n) (maybe better depending on index).
Given a node tell me the edges leaving it
Given a node tell me the edges entering it
Given an edge tell me the node it came from or enters
That's all the basic queries you need.
Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable size. It a little easier to traverse. But, what if you want to traverse the other direction? Now you have you store a list of edges entering each vertex as well.
Now you have two copies of that information, and the database (or you the developer) must do a lot of work to make sure they don't ever get out of sync.
O(log(n)) vs O(1)
Relational database indices typically store data in a sorted form, or as others have pointed out, can also use a hash table.
Even if you are stuck with sorted it's going to perform very well.
First note that big oh measures scalability, not performance. Hashes, can be slower than many loops for small data sets. Even though hashing O(1) is better, binary search O(log2) is pretty darn good. You can search a billion records in 30 steps! In addition, it is cache and branch predictor friendly.

Object oriented programming in Graph databases

Graph databases store data as nodes, properties and relations. If I need to retrieve some specific data from an object based upon a query, then I would need to retrieve multiple objects (as the query might have a lot of results).
Consider this simple scenario in object oriented programming in graph-databases:
I have a (graph) database of users, where each user is stored as an object. I need to retrieve a list of users living in a specific place (the place property is stored in the user object). So, how would I do it? I mean unnecessary data will be retrieved every time I need to do something (in this case, the entire user object might need to be retrieved). Isn't functional programming better in graph databases?
This example is just a simple analogy of the above stated question that came to my mind. Don't take it as a benchmark. So, the question remains, How great is object oriented programming in graph-databases?

A graph database is more than just vertices and edges. In most graph databases, such as neo4j, in addition to vertices having an id and edges having a label they have a list of properties. Typically in java based graph databases these properties are limited to java primatives -- everything else needs to be serialized to a string (e.g. dates). This mapping to vertex/edge properties can either be done by hand using methods such as getProperty and setProperty or you can something like Frames, an object mapper that uses the TinkerPop stack.

Each node has attributes that can be mapped to object fields. You can do that manually, or you can use spring-data to do the mapping.

Most graph databases have at least one kind of index for vertices/edges. InfiniteGraph, for instance, supports B-Trees, Lucene (for text) and a distributed, scaleable index type. If you don't have an index on the field that you're trying to use as a filter you'd need to traverse the graph and apply predicates yourself at each step. Hopefully, that would reduce the number of nodes to be traversed.

Blockquote I need to retrieve a list of users living in a specific place (the place property is stored in the user object).
There is a better way. Separate location from user. Instead of having a location as a property, create a node for locations.
So you can have (u:User)-[:LIVES_IN]->(l:Location) type of relationship.
it becomes easier to retrieve a list of users living in a specific place with a simple query:
match(u:User)-[:LIVES_IN]->(l:Location) where l.name = 'New York'.
return u,l.
This will return all users living in New York without having to scan all the properties of each node. It's a faster approach.

Why not use an object-oriented graph database?
InfiniteGraph is a graph database built on top of Objectivity/DB which is an massively scalable, distributed object-oriented database.
InfiniteGraph allows you to define your vertices and edges using a standard object-oriented approach, including inheritance. You can also embed a defined data type as an attribute in another data type definition.
Because InfiniteGraph is object-oriented, it give you access to query capabilities on complex data structures that are not available in the popular graph databases. Consider the following diagram:
In this diagram I create a query that determines the inclusion of the edge based on an evaluation of the set of CallDetail nodes hanging off the Call edge. I might only include the edge in my results if there exists a CallDetail with a particular date or if the sum of the callDurations of all of the CallDetails that occurred between two dates is over from threshold. This is the real power of object-oriented database in solving graph problems: You can support a much more complex data model.
I'm not sure why people have comingled the terms graph database and property graph. A property graph is but one way to implement a graph database, and not particular efficient. InfiniteGraph is a schema-based database and the schema provides several distinct advantages, one of which object placement.
Disclaimer: I am the Director of Field Operation for Objectivity, Inc., maker of InfiniteGraph.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight