Sharding a graph database

I was wondering how databases like Dgraph and TigerGraph manage to shard a graph in order to support horizontal scaling without breaking the connections between nodes, while still supporting a lot of interesting algorithms.
They also claim to be native graph solutions, so an approach like Facebook's or Twitter's, for example, does not apply here.
The only solution that comes to my mind is spreading the graph among many small databases, which leads to a lot of node duplication to maintain the relationships.
Any ideas?
Thanks in advance.

So technically there are two principles to follow for graph sharding. The first one is Edge-Cut, which cuts an edge into two parts (incoming and outgoing) and stores them on different servers; each vertex associated with the edge is assigned to a specific server in the cluster. Nebula Graph, a distributed graph database, follows this method. The second one is Vertex-Cut, which cuts a vertex into N parts (depending on how many edges the vertex has) and stores them on different servers; each edge associated with the vertex is then assigned to a specific server in the cluster. GraphX did it this way.
However, graph sharding is an NP-hard problem, which makes it much harder than sharding in SQL. So some vendors do it differently than pure edge-cut or pure vertex-cut. For example, your idea, i.e. spreading subgraphs, is somewhat like Neo4j Fabric. Some vendors place the whole graph structure, not including the properties, into a single host's memory so that fetching subgraphs is very fast, while other vendors adopt an adjacency list to separate nodes and edges in the graph, without paying much attention to locality.
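The two cuts described above can be sketched with simple hash-based placement. This is a deliberately simplified toy (real systems use consistent hashing, balancing heuristics, and replication; the function and server layout here are made up for illustration):

```python
# Sketch of edge-cut vs. vertex-cut partitioning (hypothetical, simplified).

def edge_cut_partition(edges, num_servers):
    """Edge-cut: each vertex is assigned to one server; an edge (u, v) is
    split so that u's server stores the outgoing half and v's server
    stores the incoming half (the Nebula Graph style described above)."""
    servers = [{"out": [], "in": []} for _ in range(num_servers)]
    place = lambda v: hash(v) % num_servers
    for u, v in edges:
        servers[place(u)]["out"].append((u, v))
        servers[place(v)]["in"].append((u, v))
    return servers

def vertex_cut_partition(edges, num_servers):
    """Vertex-cut: each edge lives on exactly one server; a high-degree
    vertex is effectively 'cut' into replicas on every server that holds
    one of its edges (the GraphX style)."""
    servers = [[] for _ in range(num_servers)]
    for u, v in edges:
        servers[hash((u, v)) % num_servers].append((u, v))
    return servers
```

Note that in the edge-cut scheme every edge is stored twice (once per direction), while in the vertex-cut scheme edges are stored once but vertex state must be replicated.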

This is a critical question, and a weakness of large graph databases. Most of them either rely on read-only replicas or have issues with too many hops across the network.
But you asked about Dgraph, which is actually quite different: it does not break a graph up into disconnected or overlapping sub-graphs on different servers. Instead, it stores entire predicates on individual servers, which allows a given query to execute in a small number of network hops.
See the "developers in sf who use vim" example here: https://dgraph.io/blog/post/db-sharding/ .
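As a toy illustration of the predicate-sharding idea (the predicate names, data, and layout below are made up; Dgraph's actual storage uses posting lists and is far more involved):

```python
# Each predicate's entire set of triples lives on one server, so a query
# like "developers in SF who use vim" touches at most one server per
# predicate it filters on, regardless of how many people are involved.

shards = {
    "lives_in": 0,   # all <person, lives_in, city> triples on server 0
    "uses":     1,   # all <person, uses, tool> triples on server 1
}

data = {
    0: {"lives_in": {("alice", "sf"), ("bob", "sf"), ("carol", "nyc")}},
    1: {"uses": {("alice", "vim"), ("carol", "vim")}},
}

def subjects(predicate, obj):
    """One network hop: ask the server owning `predicate` for all
    subjects related to `obj`."""
    server = shards[predicate]
    return {s for s, o in data[server][predicate] if o == obj}

# Intersecting results from two predicates costs two hops total.
in_sf = subjects("lives_in", "sf")
vim_users = subjects("uses", "vim")
print(in_sf & vim_users)  # -> {'alice'}
```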

Related

Algorithms and data structures for 'tag' or key word based query on large data set?

I am trying to implement a storage system that supports tagging on data. A very simple application of this system is something like questions on Stack Overflow, which are tagged with multiple tags, and where a query may consist of multiple tags. It is also similar to searching Google with multiple keywords.
The data set maintained by this system will be very large, like several or tens of terabytes with billions of entries.
So what data structures and algorithms should I use in this system for maintaining and querying data? The data may be stored across a cluster of machines.
Are there any guides or papers that describe such a problem and its solutions?
You might want to read the two books below:
Collective Intelligence in Action
Satnam Alag (ISBN: 1933988312)
http://www.manning.com/alag/
"Capter 3. Extracting intelligence from tags" covers:
Three forms of tagging and the use of tags
A working example of how intelligence is extracted from tags
Database architecture for tagging
Developing tag clouds
Programming Collective Intelligence
Toby Segaran (ISBN: 978-0-596-52932-1)
http://shop.oreilly.com/product/9780596529321.do
 "Chapter 4. Searching and Ranking" covers:
Basic concepts of algorithms for search engine index
Design of a click-tracking neural network
Hope it helps.
Your problem is very difficult, but there are plenty of related papers and books. The Amazon Dynamo paper, Yahoo!'s PNUTS, and the Hadoop paper are good examples.
So, first, you must decide how your data will be distributed across the cluster. Data must be evenly distributed across the network, without hot spots; consistent hashing is a good solution for this problem. Data must also be redundant: any entry needs to be stored in several places to tolerate failures of individual nodes.
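A minimal consistent-hash ring looks like the sketch below (a toy version; production systems such as Dynamo add virtual nodes and replication on top of this):

```python
# Minimal consistent hashing: nodes and keys are hashed onto a ring,
# and a key belongs to the first node clockwise from its hash.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        # A stable hash (unlike Python's built-in hash(), which is
        # randomized per process).
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        """First node at or after the key's hash, wrapping around."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("entry:42"))
```

The point of the ring is that adding or removing a node only remaps the keys adjacent to it, rather than rehashing everything as naive `hash(key) % N` would.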
Next, you must decide how writes will occur in your system. Every write must be replicated across the nodes that hold the updated data entry. You might want to read about the CAP theorem and the concept of eventual consistency (Wikipedia has good articles on both). There is also a consistency/latency tradeoff. You can use different mechanisms for write replication: some kind of gossip protocol, or state machine replication.
I don't know what kind of tagging you mean: tags manually assigned to entries, or learned from the data. Either way, this is the field of information retrieval (IR). You might use some kind of inverted index to search entries efficiently by tags or keywords. You will also need some query result ranking algorithm.
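The inverted-index idea mentioned above can be sketched in a few lines: map each tag to the set of entry ids carrying it, then answer a multi-tag query by intersecting the postings sets (an in-memory toy; at terabyte scale each postings set would itself be sharded):

```python
# Toy inverted index for AND-style tag queries.
from collections import defaultdict

index = defaultdict(set)

def add_entry(entry_id, tags):
    for tag in tags:
        index[tag].add(entry_id)

def query(tags):
    """Entries matching ALL given tags. Intersecting the smallest
    postings set first keeps the work proportional to its size."""
    postings = sorted((index.get(t, set()) for t in tags), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

add_entry(1, ["python", "hashing"])
add_entry(2, ["python", "graphs"])
add_entry(3, ["graphs"])
print(query(["python", "graphs"]))  # -> {2}
```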

Is it possible to store graphs in HBase? If so, how do you model the database to support a graph structure?

I have been playing around with using graphs to analyze big data. It has been working great and is really fun, but I'm wondering what to do as the data gets bigger and bigger.
Let me know if there's any other solution, but I thought of trying HBase because it scales horizontally and I can get Hadoop to run analytics on the graph (most of my code is already written in Java). However, I'm unsure how to structure a graph in a NoSQL database. I know each node can be an entry in the database, but I'm not sure how to model edges and add properties to them (like names of nodes, attributes, PageRank, weights on edges, etc.).
Seeing how HBase/Hadoop are modeled after Bigtable and MapReduce, I suspect there is a way to do this, but I'm not sure how. Any suggestions?
Also, does what I'm trying to do make sense, or are there better solutions for big data graphs?
You can store an adjacency list in HBase/Accumulo in a column oriented fashion. I'm more familiar with Accumulo (HBase terminology might be slightly different) so you might use a schema similar to:
SrcNode(RowKey) EdgeType(CF):DestNode(CFQ) Edge/Node Properties(Value)
Where CF=ColumnFamily and CFQ=ColumnFamilyQualifier
You might also store node/vertex properties as separate rows using something like:
Node(RowKey) PropertyType(CF):PropertyValue(CFQ) PropertyValue(Value)
The PropertyValue could be either in the CFQ or the Value
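As a plain-Python model of the schema above (the node names and properties are made up, and HBase's actual API differs, but the row/CF/CFQ layout is the same idea):

```python
# Row key = source node; column family = edge type;
# column qualifier = destination node; value = edge properties.

table = {
    "alice": {                           # SrcNode (row key)
        "follows": {                     # EdgeType (column family)
            "bob":   {"weight": 0.9},    # DestNode (CFQ) -> edge props
            "carol": {"weight": 0.4},
        },
    },
}

def out_edges(table, src, edge_type):
    """All destinations for one row + column family. In a column store
    this is a single contiguous scan of that row's family, which is why
    the adjacency-list layout is efficient for traversals."""
    return table.get(src, {}).get(edge_type, {})

print(sorted(out_edges(table, "alice", "follows")))  # -> ['bob', 'carol']
```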
From a graph processing perspective, as mentioned by @Arnon Rotem-Gal-Oz, you could look at Apache Giraph, which is an implementation of Google's Pregel. Pregel is the method Google uses for large graph processing.
Using HBase/Accumulo as input to Giraph was recently (7 Mar 2012) submitted as a new feature request to Giraph: HBase/Accumulo Input and Output formats (GIRAPH-153).
You can store the graph in HBase as an adjacency list. For example, each row would have columns for general properties (name, PageRank, etc.) and a list of keys of adjacent nodes (if it is a directed graph, just the nodes you can reach from this node, or an additional column with the direction of each edge).
Take a look at Apache Giraph (you can also read a little more about it here); while it isn't about HBase, it is about handling graphs in Hadoop.
Also, you may want to look at Hadoop 0.23 (and up), as the YARN engine (aka MapReduce 2) is more open to non-MapReduce algorithms.
I would not use HBase in the way "Binary Nerd" recommended, as HBase does not perform very well when handling multiple column families.
Best performance is achieved with a single column family (a second one should only be used if you very often access only the content of one column family and the data stored in the other column family is very large).
There are graph databases built on top of HBase that you could try and/or study.
Apache S2Graph
provides a REST API for storing and querying graph data represented as edges and vertices. There you can find a presentation where the construction of row/column keys is explained, along with an analysis of how the performance of operations influenced, or was influenced by, the design.
Titan
can use other storage backends besides HBase and integrates with analytics frameworks. It is also designed with big data sets in mind.

Graph Data Structures with millions of nodes (Social network)

In the context of designing a social network using a graph data structure, where you can perform a BFS to find a connection from one person to another, I have some questions.
If there are a million users, the topology would be much more complicated and interconnected than the graphs we normally design, and I am trying to understand how you would solve these problems:
In the real world, servers fail. How does this affect you?
How could you take advantage of caching?
Do you search until the end of the graph (infinite)? How do you decide when to give up?
In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where you start traversing?
Your question seems interesting and curious :)
1) Well... of course, data is stored on disks, not in RAM.
Disks have systems that avoid failure, for example RAID-5.
Redundancy is the key: if one system fails, another is ready to take its place.
There is also redundancy combined with workload sharing: two computers work in parallel and share their jobs, but if one stops, the other takes the full workload.
In places like Google or Facebook, the redundancy factor is not 2, it is 1200000000 :)
Consider also that the data is not in a single server farm; Google has several datacenters connected together, so if one building goes down, another one takes its place.
2) Not an easy question at all, but usually these systems also have big caches in front of the disk arrays, so reading and writing data on disk is faster than on our laptops :)
Data can be processed in parallel by several concurrent systems, and this is the key to the speed of services like Facebook.
3) The end of the graph is not infinite, so it is possible with current technology.
The computational complexity of exploring all connections and all nodes of a graph is O(n + m), where n is the number of vertices and m the number of edges.
This means it is linear in the number of registered users and in the number of connections between users. And RAM these days is very cheap.
With linear growth it is easy to add resources when needed: just add more computers as you grow :)
Consider also that no one will perform a real search over every node; everything in Facebook is quite "local". You can view the direct friends of one person, not the friends of friends of friends... that would not be useful.
Getting the vertices directly connected to a vertex, if the data structure is well designed, is very easy and fast. In SQL it would be a simple SELECT, and if the tables are well indexed it will be very fast and not very dependent on the total number of users (see the concept of hash tables).
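The "when do you give up?" question from above is usually answered with a depth cap: stop after k hops, since paths longer than a few hops are rarely useful in a social graph. A minimal sketch (the friend graph here is made up):

```python
# BFS with a depth cap. Visiting each vertex and edge at most once is
# what gives the O(n + m) bound mentioned above.
from collections import deque

def bfs_path(graph, start, goal, max_depth=6):
    """Return a shortest path start -> goal, or None if none exists
    within max_depth hops. `graph` maps node -> iterable of neighbors."""
    if start == goal:
        return [start]
    seen = {start}
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        if len(path) > max_depth:
            continue  # give up on this branch: too many hops
        for nxt in graph.get(node, ()):
            if nxt in seen:
                continue
            if nxt == goal:
                return path + [nxt]
            seen.add(nxt)
            queue.append((nxt, path + [nxt]))
    return None

friends = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(bfs_path(friends, "a", "d"))  # -> ['a', 'b', 'c', 'd']
```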

Examples for Topological Sorting on Large DAGs

I am looking for real-world applications where topological sorting is performed on large graphs.
Some fields where I imagine you could find such instances are bioinformatics, dependency resolution, databases, hardware design, data warehousing... but I hope some of you may have encountered or heard of specific algorithms/projects/applications/datasets that require a topological sort.
Even if the data/project is not publicly accessible, any hints (and estimates of the order of magnitude of potential graph sizes) would be helpful.
Here are some examples I've seen so far for topological sorting:
When scheduling task graphs in a distributed system, it is usually necessary to sort the tasks topologically and then assign them to resources. I am aware of task graphs containing more than 100,000 tasks to be sorted in topological order. See this in this context.
Once upon a time I was working on a document management system. Each document in this system had some kind of precedence constraint relative to a set of other documents, e.g. its content type or field referencing. The system then had to be able to generate an ordering of the documents that preserved the topological order. As I recall, there were around 5,000,000 documents two years ago!
In the field of social networking, there is a famous query to find the largest friendship distance in the network. This problem requires traversing the graph with a BFS approach, comparable in cost to a topological sort. Consider the members of Facebook and find your answer.
If you need more real examples, do not hesitate to ask me. I have worked on lots of projects dealing with large graphs.
P.S. For large DAG datasets, you may take a look at the Stanford Large Network Dataset Collection and the Graphics# Illinois page.
I'm not sure if this fits what you're looking for, but do you know the Bio4j project?
Not all the content stored in its graph-based DB is suitable for topological sorting (there are directed cycles in an important part of the graph), but there are sub-graphs, like Gene Ontology and Taxonomy, where this ordering makes sense.
TopoR is a commercial topological PCB router that works by first routing the PCB as a topological problem and then translating the topology into physical space. It supports up to 32 electrical layers, so it should be capable of many thousands of connections (say 10^4).
I suspect integrated circuits may use similar methods.
The company where I work manages a (proprietary) database of software vulnerabilities and patches. Patches are typically issued by a software vendor (like Microsoft, Adobe, etc.) at regular intervals, and "new and improved" patches "supersede" older ones, in the sense that if you apply the newer patch to a host, the old patch is no longer needed.
This gives rise to a DAG where each software patch is a node with arcs pointing to a node for each superseding patch. There are currently close to 10K nodes in the graph, and new patches are added every week.
Topological sorting is useful in this context to verify that the graph contains no cycles; if one does arise, it means there was either an error in the addition of a new DB record, or corruption introduced by botched data replication between DB instances.
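The cycle check described above is exactly what Kahn's algorithm gives you for free: it produces a topological order, and any nodes left over indicate a cycle. A sketch (the patch names are made up):

```python
# Kahn's algorithm: topological sort with built-in cycle detection.
from collections import deque

def topo_sort(graph):
    """graph: node -> list of successors. Returns (order, has_cycle)."""
    indegree = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indegree[s] = indegree.get(s, 0) + 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for s in graph.get(n, ()):
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    # Nodes never reaching in-degree 0 are part of (or behind) a cycle.
    return order, len(order) != len(indegree)

patches = {"KB1": ["KB2"], "KB2": ["KB3"], "KB3": []}
order, has_cycle = topo_sort(patches)
print(order, has_cycle)  # -> ['KB1', 'KB2', 'KB3'] False
```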

Graph database to return a list of common friends among 2 person in a social network

Are there any graph databases that have a built-in feature to return the list of common friends between 2 or more people, just like in a social network such as Facebook? The result should be returned as fast as possible, without the need to perform complex calculations or traverse the database. If not, what are the ways to implement it? What about OrientDB? What about using a combination of a graph database and Redis?
Not sure about specific graph databases (I come at this from building my own graph database on top of Redis), but assuming "friend" means a direct connection, finding common friends is fairly simple: just get the full friends list of each person and calculate the intersection.
Redis has a command to do this natively and very fast. The SQL query for it is also fairly simple. Getting all connections for a single node should be available in any graph database, and even if you need to retrieve the full lists and calculate the intersection in application code, performance will probably be adequate as long as you don't have to deal with people who have thousands/millions of friends.
Where it gets more complex is dealing with indirect relationships. The intersection operation is the same, but the sets don't exist in the needed form without traversing the graph, so before calculating the intersection you need to build a set of all second-level connections for each user. You can either do this as the first step of your query or maintain permanent sets that are updated when connections change; the appropriate method depends on whether you need to optimize for data usage and write performance or for read performance.
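The direct-connection case above is just a set intersection. A plain-Python sketch (the names are made up; in Redis, with each user's friends stored in a set, the same operation is a single `SINTER` call):

```python
# Common friends as a set intersection.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol", "erin"},
}

def common_friends(a, b):
    # Redis equivalent (keys hypothetical): SINTER friends:alice friends:bob
    return friends.get(a, set()) & friends.get(b, set())

print(common_friends("alice", "bob"))  # -> {'carol'}
```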
