I am very new to graph databases and am working on a survey of different graph databases. I do not understand what exactly global indexing in graph databases is.
Can someone please help me understand what global indexing in graph databases is?
I am not sure whether all graph databases agree on the notion of what a global index is, but generally it means an index that applies to the whole graph. Such an index lets you efficiently retrieve vertices based on some indexed property, e.g. find all person vertices with the name Manoj. Most graph queries use a global index to find one or a small number of vertices as an entry point into the graph and then traverse the graph from there.
In contrast to global indexes, there are vertex-centric indexes. They apply only to a specific vertex and can make queries involving so-called supernodes more efficient. The idea is to index a property of the vertex's incident edges, so that the neighboring vertices returned can be narrowed down to those that are actually relevant to the query. For Twitter, for example, a vertex-centric index could index the followedSince property on follower edges. This would let you efficiently query for all followers of Katy Perry who began following her on her birthday. Without an index, you would have to check the property on every one of her (currently over 95 million) followers for this query.
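To make the difference concrete, here is a minimal in-memory sketch in Python. It is not how any particular graph database implements its indexes; the names global_name_index, follower_index, add_person, add_follow and followers_since are made up for illustration, and vertex ids are assumed to be integers.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict

# Global index: one structure for the whole graph, mapping a property
# value to the ids of all vertices that carry it.
global_name_index = defaultdict(set)

# Vertex-centric index: one sorted list per vertex, holding
# (followedSince, follower_id) for that vertex's follower edges only.
follower_index = defaultdict(list)


def add_person(vertex_id, name):
    global_name_index[name].add(vertex_id)


def add_follow(follower_id, followee_id, followed_since):
    # insort keeps the per-vertex edge list sorted by the indexed property.
    insort(follower_index[followee_id], (followed_since, follower_id))


def followers_since(followee_id, day):
    """Followers whose followedSince equals `day`, found by binary search
    instead of scanning every follower edge of the vertex."""
    entries = follower_index[followee_id]
    lo = bisect_left(entries, (day,))
    hi = bisect_right(entries, (day, float("inf")))
    return [follower for _, follower in entries[lo:hi]]
```

A query would first use the global index to find the entry vertex (for example global_name_index["Katy Perry"]) and then use that vertex's own index via followers_since to avoid touching all of its edges.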
(Your question didn't mention vertex-centric indexes, but I think knowing about them helps to understand why global indexes are called that, as vertex-centric indexes are basically local indexes.)
For more information about indexing in graph databases see the respective sections in the documentation of graph databases like Titan or DSE Graph.
Related
I am interested in storing a set of users that have personality scores.
I would like them to be more connected (closer?) to each other based on formulas that are applied to their scores. The more similar the users are, the more connected or closer to each other they should be (like in a cluster). The closer nodes are to one another, the more similar they are.
I currently do this over multiple steps (some in SQL and other in code) from a relational database.
Most posts and documentation out there seem to focus on how to get started and what the advantages are at a high level compared to relational databases.
I am wondering if graph databases are better suited for this and would do most of the heavy lifting out of the box or more natively. Any details are greatly appreciated.
You could consider modeling it like this:
Here a vertex type/label named Score_range is introduced, together with the label User (with the property score).
User vertices are connected to Score_range vertices: for example, a User with score 101 is connected to Score_range(vertexID=100), which stands for the range [100, 110).
Thus, vertices with closer scores are more connected/clustered in this graph. In your application, you need to update the connections in the graph database whenever the scores are recalculated or changed.
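The bucketing step could look roughly like the following Python sketch. The bucket width of 10 is taken from the [100, 110) example; the function names and the plain dictionary standing in for the User-to-Score_range edges are assumptions for illustration, not part of any graph database API.

```python
BUCKET_WIDTH = 10  # assumed from the [100, 110) example above


def score_range_id(score):
    """Id of the Score_range vertex whose half-open range contains `score`."""
    return int(score // BUCKET_WIDTH) * BUCKET_WIDTH


def reassign(user_bucket, user_id, new_score):
    """Update the (hypothetical) user -> Score_range mapping after a score
    change. Returns (old_bucket, new_bucket) so the caller knows whether an
    edge in the graph database has to be dropped and re-created."""
    old = user_bucket.get(user_id)
    new = score_range_id(new_score)
    if old != new:
        user_bucket[user_id] = new
    return old, new
```

Only when the two returned bucket ids differ does the application need to touch the graph.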
Then, whether you run a clustering algorithm (e.g. Louvain) on the whole graph or use a graph query to find paths between any two user nodes (e.g. FIND PATH in Nebula Graph, an open-source distributed graph database that speaks openCypher), the closeness will be reflected.
However, since this closeness is actually numerical and sortable, simply handling the closeness relationship may not need a graph database, judging from the context you have provided.
PS. I drew a picture of a graph in the above schema:
This is for a project that will map metadata. There are many more nodes but this particular one became a debate in the team.
Which model would yield the best query performance? Or does it not matter?
Option 1
Permission metadata is explicit as edges between nodes.
Option 2
Permission metadata is inside the properties of the edge.
Option 3
???
Let me comment for ArangoDB here, being one of its developers.
There is a third possibility, namely to have a single vertex collection and multiple edge collections for the different access methods. You would then "officially" have 3 graphs that share the same vertex set.
I would expect that this is better in performance, because each access type would only have to deal with a single type of edge and access would be fast.
Obviously it all depends on your queries. My statement holds for queries like "what are all the Entities a Person can update?" or "who can select this Entity?".
I could imagine that your standard query is more "Can this person delete that Entity?" or "Which access rights does this person have for that Entity?".
These two questions are probably not efficient with any of the approaches suggested, because as far as I see, all of them would then require a search, either in the outgoing edges of the Person or in the incoming edges of the Entity.
What would be needed here is a kind of "vertex-centric index", that is, an index that can be used for the set of outgoing or incoming edges of a given vertex. Suppose, for example, that you use your option 2 (or indeed option 1, it does not matter much) and have a sorted index on all edges that is sorted first by Person and then by Entity. Then finding the (probably singleton) set of edges from a given Person to a given Entity is a lookup with time complexity O(log(#edges)).
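As a rough illustration of that lookup, here is a small Python sketch that uses a sorted list and binary search in place of a real database index. The edge tuples and the rights function are invented for this example and say nothing about how ArangoDB implements the feature.

```python
from bisect import bisect_left

# Hypothetical edge records (person, entity, access_right), kept sorted
# first by person and then by entity, like the index described above.
edges = sorted([
    ("alice", "invoice", "select"),
    ("alice", "invoice", "update"),
    ("alice", "report", "select"),
    ("bob", "invoice", "select"),
])


def rights(person, entity):
    """Access rights `person` holds on `entity`.

    bisect_left finds the first matching edge in O(log(#edges)); the loop
    then reads only the (usually tiny) run of edges for that pair."""
    i = bisect_left(edges, (person, entity))
    found = []
    while i < len(edges) and edges[i][:2] == (person, entity):
        found.append(edges[i][2])
        i += 1
    return found
```

"Can this person delete that Entity?" then becomes "delete" in rights(person, entity), without scanning all outgoing edges of the Person or all incoming edges of the Entity.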
We at ArangoDB are currently busy adding this feature, which will appear in one of the next two releases.
I can only speak for Neo4j here:
I don't know that it would matter much, but definitely benchmark! Both relationships and properties are stored as linked lists, so it will still need to traverse them. But if you have more relationships between Person and Entity nodes then putting them in properties starts to become more attractive.
I recommend checking out the free O'Reilly book Graph Databases to learn more about the internals of Neo4j. But benchmarks will always be the gold standard.
I am using Rexster/TITAN 0.4 over Cassandra.
The vertex keys are indexed using standard index as below.
g.makeKey("domain").dataType(String.class).indexed("standard", Vertex.class).make();
I am not using uniqueness constraints, for performance and scalability reasons.
There are around 10M vertices in the graph.
My requirement is to iterate over all vertices, identify any duplicates, and remove them.
Is there a way to get the sorted list of vertices directly from the index that is already present?
That is, a direct query on the index (the standard Titan index), similar to a "Direct Index Query",
so that I can partition the vertices into smaller batches and process them individually.
If this is not possible, what is the best way to achieve this?
I don't want to use Titan-Hadoop or similar solution just for finding/removing duplicates in graph.
I want to run the below query to get 1000 vertices in the sorted order.
gremlin> g.V.has('domain').domain.order[0..1000]
WARN com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx - Query requires iterating over all vertices [(domain <> null)]. For better performance, use indexes
But this query does not use the standard index that was created on 'domain', and it fails with an out-of-memory exception. I have ~10M vertices in the graph.
How can I force gremlin to use index in this particular case?
The answer is the same as the one I provided in the comments of your previous question:
Throw more memory at the problem (i.e. increase -Xmx to the console or whatever application is running your query) - which would be a short-term solution.
Use titan-hadoop.
Restructure your graph or queries in some way to allow a use of an index. This could mean giving up some performance on insert and using a uniqueness lock. Maybe you don't have to remove duplicates in your source data - perhaps you can dedup them in your Gremlin queries at the time of traversal. The point is that you'll need to be creative.
Despite your reluctance to use titan-hadoop "just for finding/removing duplicates in graph", that's the exact use case it is good at. You have a batch process that must iterate over all vertices, it can't fit in the memory you've allotted, and you don't want to use titan-hadoop. That's a bit like saying: "I have a nail and a hammer, but I don't want to use the hammer to bang in the nail." :)
How can I force gremlin to use index in this particular case?
There is no way in gremlin to do this. In theory, there might be a way to try to read from Cassandra directly (bypassing Titan), decode the binary result and somehow iterate and delete, but it's not known to me. Even if you figured it out, which would mean lots of hours trying to dig into the depths of Titan to see how to read the index data, it would be a hack that is likely to break at any time you upgrade Titan, as the core developers might close that avenue to you at any point as you are circumventing Titan in an unexpected way.
The best option is to simply use titan-hadoop to solve your problem. Unless your graph is completely static and no longer growing, you will reach a point where titan-hadoop is inevitable. How will you be sure that your graph is growing correctly when you have 100M+ edges? How will you gather global statistics about your data? How will you repair bad data that got into the database from a bug in your code? All of those things become issues when your graph reaches a certain scale and titan-hadoop is your only friend there at this time.
I've considered creating a Vertices table and an Edges table but would building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?
Side note: I've heard of Neo4j but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like mongodb though.
The answer is, unfortunately: your reasoning is right on every point. To convert a graph data structure to a relational one, you have to store nodes (vertices) in one table and edges, each referencing a FromNode and a ToNode, in another. You are also right that this results in a large number of lookups, because you cannot partition the graph into subgraphs that could be queried at once. You have to traverse from node to edge to node to edge to node, and so on (recursively, whereas SQL works with sets).
The point is...
Relational, graph-oriented, object-oriented, and document-based are different types of data structures that meet different requirements. That's what it's all about and why so many different NoSQL databases (most of them simple document stores) came up, because it simply makes no sense to organize big data in a relational way.
Alternative 1 - Graph oriented database
But there are also graph-oriented NoSQL databases that make the graph data model a first-class citizen, like OrientDB, which I am playing around with a little at the moment. The nice thing about it is that, although it persists data as a graph, it can still be used in a relational, object-oriented, or document-oriented way (e.g. by querying with plain old SQL). Nevertheless, traversing the graph is certainly the optimal way to get data out of it.
Alternative 2 - working with graphs in memory
When it comes to fast routing, routing frameworks like Graphhopper build up the complete graph (billions of nodes) in memory. Because Graphhopper uses a memory-mapped implementation of its GraphStore, this even works on Android devices with only a few MB of memory. The complete graph is read from the database into memory at startup, and routing is then done there, so there is no need to query the database.
I faced this same issue and decided to finally go with the following structure, which requires 2 database queries, then the rest of the work is in memory:
Store nodes in a table and reference the graph with each node record:
Table Nodes
id  | title | graph_id
----------------------
105 | node1 | 2
106 | node2 | 2
Also store edges in another table and again reference the graph these edges belong to with each edge:
Table Edges
id | from_node_id | to_node_id | graph_id
-----------------------------------------
1  | 105          | 106        | 2
2  | 106          | 105        | 2
Get all the nodes with one query, then get all the edges with another.
Now build your preferred way to store the graph (e.g., adjacency list) and proceed with your application flow.
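A minimal sketch of this two-query approach, using Python's built-in sqlite3 module as a stand-in for whatever database you use. The lower-case table names and the load_graph function are assumptions for the example; the columns and rows follow the tables above.

```python
import sqlite3
from collections import defaultdict

# In-memory database populated with the example rows from the tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, title TEXT, graph_id INTEGER);
CREATE TABLE edges (id INTEGER PRIMARY KEY, from_node_id INTEGER,
                    to_node_id INTEGER, graph_id INTEGER);
INSERT INTO nodes VALUES (105, 'node1', 2), (106, 'node2', 2);
INSERT INTO edges VALUES (1, 105, 106, 2), (2, 106, 105, 2);
""")


def load_graph(conn, graph_id):
    """Load one graph with exactly two queries and build an adjacency list."""
    # Query 1: all nodes of the graph, as {node_id: title}.
    nodes = dict(conn.execute(
        "SELECT id, title FROM nodes WHERE graph_id = ?", (graph_id,)))
    # Query 2: all edges of the graph, grouped by source node.
    adjacency = defaultdict(list)
    for src, dst in conn.execute(
            "SELECT from_node_id, to_node_id FROM edges WHERE graph_id = ?",
            (graph_id,)):
        adjacency[src].append(dst)
    return nodes, adjacency
```

After load_graph(conn, 2) returns, all traversal happens on the in-memory adjacency list and the database is not touched again.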
Adding to the previous answers: MS SQL Server supports graph architecture starting with SQL Server 2017.
It follows the described pattern of having node and edge tables (which should be created with the special "AS NODE" and "AS EDGE" keywords).
It also introduces a new MATCH keyword "to support pattern matching and traversal through the graph", like this (friend is the name of an edge table in the example below):
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
There is also a really good set of articles on SQL Server Graph Databases on redgate Hub.
I am going to disagree with the other posts here. If you have a special class of graphs with restrictions, you can often get away with a more specialized design (for example, a limited number of edges per vertex, only needing to traverse one way, etc.).
However, for storing an arbitrary graph, relational databases are an excellent choice. They're designed with an incredibly good set of tradeoffs that perform well in almost all situations. In addition, data needs tend to change over time, and a relational database lets you painlessly change the storage and lookup without changing the data representation.
Let's review your design:
one table for vertices (id, data)
one table for edges (startId, endId, data)
First observe that the storage is efficient as it is proportional to the data to store. If we have 10 vertices and 10 edges, we store 20 pieces of information.
Now, let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in at most O(log(n)) time (maybe better, depending on the index).
Given a node tell me the edges leaving it
Given a node tell me the edges entering it
Given an edge tell me the node it came from or enters
That's all the basic queries you need.
Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable size. It is a little easier to traverse. But what if you want to traverse in the other direction? Now you have to store a list of edges entering each vertex as well.
Now you have two copies of that information, and the database (or you the developer) must do a lot of work to make sure they don't ever get out of sync.
O(log(n)) vs O(1)
Relational database indices typically store data in a sorted form, or as others have pointed out, can also use a hash table.
Even if you are stuck with sorted it's going to perform very well.
First, note that big O measures scalability, not performance. Hashes can be slower than a plain loop for small data sets. Even though O(1) hashing is asymptotically better, O(log(n)) binary search is pretty darn good. You can search a billion records in about 30 steps! In addition, it is cache and branch-predictor friendly.
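The "30 steps" figure is easy to check. The short Python sketch below (the function name halvings is made up) counts how many times a sorted range of n records must be halved before a single candidate remains, which is the worst case for binary search.

```python
def halvings(n):
    """Worst-case number of halving steps binary search needs to narrow
    n sorted records down to one candidate, i.e. ceil(log2(n))."""
    steps = 0
    while n > 1:
        n = (n + 1) // 2  # keep the larger half, as in the worst case
        steps += 1
    return steps


print(halvings(10**9))  # prints 30
```

Doubling the data to two billion records adds just one more step.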
Graph databases store data as nodes, properties and relations. If I need to retrieve some specific data from an object based upon a query, then I would need to retrieve multiple objects (as the query might have a lot of results).
Consider this simple scenario in object oriented programming in graph-databases:
I have a (graph) database of users, where each user is stored as an object. I need to retrieve a list of users living in a specific place (the place property is stored in the user object). So, how would I do it? I mean unnecessary data will be retrieved every time I need to do something (in this case, the entire user object might need to be retrieved). Isn't functional programming better in graph databases?
This example is just a simple analogy of the above stated question that came to my mind. Don't take it as a benchmark. So, the question remains, How great is object oriented programming in graph-databases?
A graph database is more than just vertices and edges. In most graph databases, such as Neo4j, in addition to vertices having an id and edges having a label, they have a list of properties. Typically, in Java-based graph databases these properties are limited to Java primitives; everything else needs to be serialized to a string (e.g. dates). This mapping to vertex/edge properties can either be done by hand using methods such as getProperty and setProperty, or you can use something like Frames, an object mapper that uses the TinkerPop stack.
Each node has attributes that can be mapped to object fields. You can do that manually, or you can use spring-data to do the mapping.
Most graph databases have at least one kind of index for vertices/edges. InfiniteGraph, for instance, supports B-Trees, Lucene (for text) and a distributed, scalable index type. If you don't have an index on the field that you're trying to use as a filter, you'd need to traverse the graph and apply predicates yourself at each step. Hopefully, that would reduce the number of nodes to be traversed.
"I need to retrieve a list of users living in a specific place (the place property is stored in the user object)."
There is a better way. Separate location from user. Instead of having a location as a property, create a node for locations.
So you can have a relationship of the type (u:User)-[:LIVES_IN]->(l:Location).
It then becomes easier to retrieve a list of users living in a specific place with a simple query:
MATCH (u:User)-[:LIVES_IN]->(l:Location) WHERE l.name = 'New York'
RETURN u, l
This will return all users living in New York without having to scan all the properties of each node. It's a faster approach.
Why not use an object-oriented graph database?
InfiniteGraph is a graph database built on top of Objectivity/DB, which is a massively scalable, distributed object-oriented database.
InfiniteGraph allows you to define your vertices and edges using a standard object-oriented approach, including inheritance. You can also embed a defined data type as an attribute in another data type definition.
Because InfiniteGraph is object-oriented, it gives you access to query capabilities on complex data structures that are not available in the popular graph databases. Consider the following diagram:
In this diagram I create a query that determines the inclusion of the edge based on an evaluation of the set of CallDetail nodes hanging off the Call edge. I might only include the edge in my results if there exists a CallDetail with a particular date, or if the sum of the callDurations of all of the CallDetails that occurred between two dates is over some threshold. This is the real power of an object-oriented database in solving graph problems: you can support a much more complex data model.
I'm not sure why people have commingled the terms graph database and property graph. A property graph is but one way to implement a graph database, and not a particularly efficient one. InfiniteGraph is a schema-based database, and the schema provides several distinct advantages, one of which is object placement.
Disclaimer: I am the Director of Field Operation for Objectivity, Inc., maker of InfiniteGraph.