While reading about the theory of graph databases, I came across several different terms for the connections between nodes. What is the difference between relationships, edges and links?
There is no difference between the terms relationship, edge and link. It is just a matter of the term that the author has decided to use.
I was wondering how databases like Dgraph and TigerGraph manage to shard the graph in order to support horizontal scaling without breaking the connections between nodes, while still supporting a lot of interesting algorithms.
They claim to be native graph solutions, so an approach like Facebook's or Twitter's, for example, does not apply here.
The only solution that comes to my mind is spreading the graph across many small databases, which leads to a lot of node duplication to maintain the relationships.
Any ideas?
Thanks in advance
Technically, there are two main approaches to graph sharding. The first is Edge-Cut, which cuts an edge into two parts (incoming and outgoing) and stores them on different servers. Each vertex associated with the edge is assigned to a specific server in the cluster. Nebula Graph, a distributed graph database, follows this method. The second is Vertex-Cut, which cuts a vertex into N parts (depending on how many edges the vertex has) and stores them on different servers. Each edge associated with the vertex is then assigned to a specific server in the cluster. GraphX takes this approach.
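To make the two cuts concrete, here is a minimal Python sketch. It assumes plain hash-based placement, and the names `shard_of`, `edge_cut` and `vertex_cut` are made up for illustration; it is not how Nebula Graph or GraphX actually implement their partitioning.

```python
import zlib
from collections import defaultdict


def shard_of(key: str, num_shards: int) -> int:
    """Stable hash (assumed placement rule) so results are the same on every run."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def edge_cut(edges, num_shards):
    """Edge-Cut: each vertex lives on one shard; every edge is split into an
    outgoing half stored with its source and an incoming half stored with its target."""
    shards = defaultdict(lambda: {"out": [], "in": []})
    for src, dst in edges:
        shards[shard_of(src, num_shards)]["out"].append((src, dst))
        shards[shard_of(dst, num_shards)]["in"].append((src, dst))
    return shards


def vertex_cut(edges, num_shards):
    """Vertex-Cut: each edge lives on exactly one shard; a vertex is replicated
    on every shard that holds at least one of its edges."""
    edge_shards = defaultdict(list)
    vertex_replicas = defaultdict(set)
    for src, dst in edges:
        shard = shard_of(f"{src}->{dst}", num_shards)
        edge_shards[shard].append((src, dst))
        vertex_replicas[src].add(shard)
        vertex_replicas[dst].add(shard)
    return edge_shards, vertex_replicas


sample_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
by_vertex = edge_cut(sample_edges, num_shards=3)
by_edge, replicas = vertex_cut(sample_edges, num_shards=3)
```

With Edge-Cut the cost shows up as edges that span two servers; with Vertex-Cut it shows up as vertices that exist on several servers and have to be kept in sync.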
However, graph sharding is an NP-hard problem, which makes it much harder than sharding a SQL database, so some vendors do not use pure Edge-Cut or pure Vertex-Cut. For example, your idea of spreading subgraphs across databases is somewhat like Neo4j Fabric. Some vendors place the whole graph structure, not including the properties, in the memory of a single host so that fetching subgraphs is very fast. Others use adjacency lists to separate nodes and edges in the graph, without paying much attention to locality.
This is a critical question, and a weakness of large graph databases. Most of them use read-only replicas or suffer from too many hops across the network.
But you asked about Dgraph, which is actually completely different: it does not break a graph up into disconnected or even overlapping sub-graphs on different servers. Instead, it stores entire predicates on individual servers, which allows a given query to execute in a small number of network hops.
See the "developers in sf who use vim" example here: https://dgraph.io/blog/post/db-sharding/ .
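As a rough illustration of why predicate-based sharding keeps the hop count low, here is a toy Python sketch. It is not Dgraph's implementation or API: the classes `PredicateShard` and `Cluster`, the predicate names and the hand-written assignment of predicates to servers are all assumptions made for the example.

```python
from collections import defaultdict


class PredicateShard:
    """One server: holds every (subject, object) pair for its predicates."""

    def __init__(self):
        # predicate -> object value -> set of subjects
        self.data = defaultdict(lambda: defaultdict(set))

    def add(self, subject, predicate, obj):
        self.data[predicate][obj].add(subject)

    def subjects(self, predicate, obj):
        return set(self.data.get(predicate, {}).get(obj, set()))


class Cluster:
    def __init__(self, assignment):
        # assignment: predicate name -> server name (hypothetical, chosen by hand)
        self.assignment = assignment
        self.shards = {name: PredicateShard() for name in set(assignment.values())}
        self.calls = 0  # counts simulated network calls

    def add(self, subject, predicate, obj):
        self.shards[self.assignment[predicate]].add(subject, predicate, obj)

    def query(self, conditions):
        """Intersect the matches of each (predicate, value) condition.
        Each condition costs one call to the server owning that predicate."""
        result = None
        for predicate, obj in conditions:
            self.calls += 1
            matches = self.shards[self.assignment[predicate]].subjects(predicate, obj)
            result = matches if result is None else result & matches
        return result or set()


cluster = Cluster({"lives_in": "server-1", "uses": "server-2"})
cluster.add("alice", "lives_in", "sf")
cluster.add("alice", "uses", "vim")
cluster.add("bob", "lives_in", "sf")
cluster.add("bob", "uses", "emacs")
cluster.add("carol", "lives_in", "nyc")
cluster.add("carol", "uses", "vim")

found = cluster.query([("lives_in", "sf"), ("uses", "vim")])  # {"alice"}
```

The number of calls grows with the number of predicates in the query (two here), not with the number of people stored.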
Good day,
Real estate companies have several Buildings. Each Building is managed by one or more Managers, and each Manager has access to one or more Buildings. So there is a many-to-many relationship between Managers and Buildings, and there has to be a table such as Permissions to resolve it.
Please help me figure out the best design for the database.
I came up with two candidate diagrams. Which one is better? If neither of them is good, what should I change?
http://i.stack.imgur.com/Z0l6h.png
http://i.stack.imgur.com/Dg5Sv.png
Sincerely
The second picture seems closest.
I'd suggest moving the boxes around a little to show the hierarchy. Put Companies top and center, then on the next row, Managers on the left, Buildings on the right and Permissions between those two.
ER diagrams are used for two different purposes. One purpose is to illustrate the subject matter entities, and the relationships between them, as understood by subject matter experts. This is called a conceptual model of the data.
The other purpose is to illustrate a proposed database design, one where the relationships are not only expressed, but also implemented somehow. If the design is relational (which it usually is) many-to-many relationships are expressed by creating an intermediate table. This is called a physical model of the data (in some literature it's called a logical model). This is what you have done in your second diagram.
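For what it's worth, the intermediate table could look roughly like the sketch below, using Python's built-in `sqlite3` module. The diagrams aren't reproduced here, so every column name (`manager_id`, `building_id`, `company_id`, `name`) and all the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE Companies (
    company_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE Managers (
    manager_id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES Companies(company_id),
    name       TEXT NOT NULL
);
CREATE TABLE Buildings (
    building_id INTEGER PRIMARY KEY,
    company_id  INTEGER NOT NULL REFERENCES Companies(company_id),
    name        TEXT NOT NULL
);
-- Intermediate table: one row per (manager, building) pair.
CREATE TABLE Permissions (
    manager_id  INTEGER NOT NULL REFERENCES Managers(manager_id),
    building_id INTEGER NOT NULL REFERENCES Buildings(building_id),
    PRIMARY KEY (manager_id, building_id)
);
""")

conn.execute("INSERT INTO Companies VALUES (1, 'Acme Estates')")
conn.executemany("INSERT INTO Managers VALUES (?, ?, ?)",
                 [(1, 1, "Ann"), (2, 1, "Bob")])
conn.executemany("INSERT INTO Buildings VALUES (?, ?, ?)",
                 [(1, 1, "North Tower"), (2, 1, "South Tower")])
conn.executemany("INSERT INTO Permissions VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])


def buildings_for(manager_name):
    """Buildings a manager may access, found by joining through Permissions."""
    rows = conn.execute(
        """SELECT b.name FROM Buildings b
           JOIN Permissions p ON p.building_id = b.building_id
           JOIN Managers m ON m.manager_id = p.manager_id
           WHERE m.name = ? ORDER BY b.name""",
        (manager_name,),
    )
    return [name for (name,) in rows]


def managers_for(building_name):
    """Managers with access to a building, found by the same join in reverse."""
    rows = conn.execute(
        """SELECT m.name FROM Managers m
           JOIN Permissions p ON p.manager_id = m.manager_id
           JOIN Buildings b ON b.building_id = p.building_id
           WHERE b.name = ? ORDER BY m.name""",
        (building_name,),
    )
    return [name for (name,) in rows]
```

The composite primary key on Permissions keeps the same manager/building pair from being stored twice.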
Your first diagram could be cleaned up a little by eliminating the box named "permissions", and putting a crows-foot at both ends of the line connecting Managers and Buildings.
Now to come back to your question: which one is "better"? It depends. Sometimes a conceptual diagram is better for discussing the subject matter with the ultimate stakeholders: non-technical managers who work with the data all the time, and who might be called "subject matter experts".
A physical diagram is usually better when discussing the proposed design among data architects and programmers. It explains not only how the data works in concept, but also how the database is to be built. This kind of detail is glossed over by a conceptual model.
So you may end up with two diagrams, and use the appropriate one depending on your audience.
I'm new to DBs of any kind. It seems you can represent any relational database in graph form (although it might be a very flat graph), and any graph database in a relational database (with enough tables).
A graph can avoid a lot of lookups in other tables by having a hard link from one entry to another, so in many/most cases I can see the speed advantage of a graph. If your data is naturally hierarchical, and especially if it forms a tree, I see the logical/reasoning benefit to a graph over relational. I imagine a node of a graph which links to other nodes probably contains multiple maps or lists... which is effectively containing a relational DB within nodes of a graph.
Are there any disadvantages to a graph db vs a relational db? (Note: I'm not looking to things like missing features in implementations, but instead the theoretical pros/cons)
When should I still use a relational database? Even if I logically have a single mapping of an int to int I could do it in a graph.
Graph databases were superseded by relational-ish technology some 20 to 30 years ago.
The major theoretical disadvantage is that graph databases use TWO basic concepts to represent information (nodes and edges), whereas a relational database uses only one (the relation). This bleeds over into the language for data manipulation, in that a graph-based language must provide two distinct sets of operators: one for operating on nodes, and one for operating on edges. The relational model can suffice with only one.
More operators means more operators to implement for the DBMS builder, more opportunity for bugs, and for the user it means more distinct language constructs to learn. For example, adding information to a database is just INSERT in relational, in graph-based it can be either STORE (nodes) or CONNECT (edges). Removing information is just DELETE (relational), as opposed to either ERASE (nodes) or DISCONNECT (edges).
Building on Erwin Smout's fine answer, an important reason why the relational model supplanted the graph one is that a graph has a greater degree of "bias" baked into its structure than relations do. The edges of a graph are navigational links which user queries are expected to traverse in a particular way. A relational model of the same data assumes much less about how the data will be used. Users are free to join and manipulate relational data in ways that the database designer might not have foreseen. The disruptive costs of re-engineering graph database structures to support new requirements were a factor which drove the adoption of the relational model and its SQL-based offshoots in the 1980s.
Relational databases were designed to aggregate data, graph databases to find relationships.
If you have, for example, a financial domain, all connections are known and you only aggregate data by other data to find sums and so on.
Graph databases are better in more chaotic domains where the connections are more important and not all connections are apparent, for example:
networks of people, with different relations with one another
films and people creating them. Not just actors but the whole crew.
natural language processing and finding connections between recognized words
Data model is important, but what matters more is how you access your data. Notice, there are very few (none, actually) sharded or otherwise distributed graph databases out there. If you compare insertion speed into a typical relational database and a graph database, your relational database will most likely win.
Yes, the graph model is more versatile than the relational model, but that doesn't make it universal. In some cases, this versatility is a roadblock for optimizations.
In fact, modern graph databases are niche solutions for a narrow set of tasks: finding a route from A to B, working with friends in a social network, information technology in medicine.
For most business applications relational databases continue to prevail.
I'm missing the performance aspect in the answers above.
The performance of graph-based databases is inherently worse for scalar and maybe even tree-based models. Only if you have a real graph may they exhibit better performance.
Also, most graph DBs do not offer ACID support the way almost any RDBMS does.
From my real-life experience I can tell you that almost any evolving data model will sooner or later become a graph, and that's why graph DBs are superior in terms of flexibility and agility (they keep pace with the evolution of your data model).
That's why I don't think that RDBs will prevail "for most business applications", as @Kostja says. I think they will prevail where ACID capability is essential.
I am trying to implement a storage system that supports tagging on data. A very simple application of this system is something like questions on Stack Overflow, which are tagged with multiple tags, and a query may consist of multiple tags. This also looks like a Google search with multiple keywords.
The data set maintained by this system will be very large: several or tens of terabytes, with billions of entries.
So what data structures and algorithms should I use in this system for maintaining and querying the data? The data may be stored across a cluster of machines.
Are there any guides or papers that describe this kind of problem and its solutions?
You might want to read the two books below:
Collective Intelligence in Action
Satnam Alag (ISBN: 1933988312)
http://www.manning.com/alag/
"Chapter 3. Extracting intelligence from tags" covers:
Three forms of tagging and the use of tags
A working example of how intelligence is extracted from tags
Database architecture for tagging
Developing tag clouds
Programming Collective Intelligence
Toby Segaran (ISBN: 978-0-596-52932-1)
http://shop.oreilly.com/product/9780596529321.do
"Chapter 4. Searching and Ranking" covers:
Basic concepts of algorithms for search engine index
Design of a click-tracking neural network
Hope it helps.
Your problem is very difficult, but there are plenty of related papers and books. The Amazon Dynamo paper, the Yahoo PNUTS paper and the Hadoop paper are good examples.
So, first, you must decide how your data will be distributed across the cluster. Data must be evenly distributed across the network, without hot spots. Consistent hashing is a good solution for this problem. Also, data must be redundant: every entry needs to be stored in several places to tolerate faults of individual nodes.
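A minimal Python sketch of consistent hashing with replication might look like this. The class name `ConsistentHashRing`, the node names, the choice of SHA-256, 64 virtual nodes per server and 3 replicas are all assumptions for illustration, not taken from any of the papers mentioned.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes; each key maps to several distinct servers."""

    def __init__(self, nodes, vnodes=64):
        self._nodes = set(nodes)
        # Each server gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in self._nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def nodes_for(self, key, replicas=3):
        """Walk clockwise from the key's position and collect distinct servers.
        The first one is the primary owner, the rest hold redundant copies."""
        wanted = min(replicas, len(self._nodes))
        start = bisect.bisect_right(self._hashes, self._hash(key))
        found = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == wanted:
                    break
        return found


ring = ConsistentHashRing(["node-1", "node-2", "node-3", "node-4"])
owners = ring.nodes_for("entry-42")  # three distinct servers for this entry
```

The useful property is that when a server joins or leaves, only the keys next to its points on the ring move; everything else stays where it was.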
Next, you must decide how writes will occur in your system. Every write must be replicated across the nodes that contain the updated data entry. You might want to read about the CAP theorem and the concept of eventual consistency (Wikipedia has good articles about both). There is also a consistency-latency tradeoff. You can use different mechanisms for replicating writes: some kind of gossip protocol, or state machine replication.
I don't know what kind of tagging you mean: are the tags manually assigned to entries, or learned from the data? Either way, this is the field of information retrieval (IR). You might use some kind of inverted index to search entries efficiently by tags or keywords. You will also need some algorithm for ranking query results.
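Here is a small in-memory Python sketch of an inverted index for multi-tag queries. The class `TagIndex` and the sample data are hypothetical; a real system at terabyte scale would keep the posting lists on disk and spread them across machines, and this sketch leaves ranking out entirely.

```python
from collections import defaultdict


class TagIndex:
    """Inverted index: tag -> set of entry ids carrying that tag."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, entry_id, tags):
        for tag in tags:
            self._postings[tag.lower()].add(entry_id)

    def search(self, tags):
        """Return the ids of entries that carry ALL of the given tags."""
        postings = [self._postings.get(tag.lower(), set()) for tag in tags]
        if not postings:
            return []
        # Start from the shortest posting list to keep intersections small.
        postings.sort(key=len)
        result = set(postings[0])
        for other in postings[1:]:
            result &= other
            if not result:
                break
        return sorted(result)


index = TagIndex()
index.add(1, ["python", "database"])
index.add(2, ["python", "graph", "database"])
index.add(3, ["java", "graph"])

matches = index.search(["python", "database"])  # [1, 2]
```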
I haven't been able to find an example or documentation on how to do a many-to-many relation in SolrNet, so I hoped one of you experts might have a clue or a link that can point me in the right direction.
There is no many-to-many relationship in Solr, and in fact there are no relationships at all. Solr's index is a flat structure. You must denormalize your data, and how you do that depends on which searches you need. See http://wiki.apache.org/solr/SchemaDesign
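To show what denormalizing means here, the Python sketch below flattens a many-to-many relation into one flat document per item, with the related values copied into a list (which would map to a multi-valued field in the Solr schema). It does not use the SolrNet or Solr APIs at all; the books/authors data, the function `build_documents` and the field names are made up for the example.

```python
def build_documents(books, authors, book_authors):
    """Flatten books, authors and their link table into one document per book."""
    names_by_book = {}
    for book_id, author_id in book_authors:
        names_by_book.setdefault(book_id, []).append(authors[author_id])
    return [
        {
            "id": f"book-{book_id}",
            "title": title,
            # Related rows are copied into the document instead of being joined.
            "author_names": sorted(names_by_book.get(book_id, [])),
        }
        for book_id, title in sorted(books.items())
    ]


books = {1: "Graph Basics", 2: "Search Basics"}
authors = {10: "Ann", 11: "Bob"}
book_authors = [(1, 10), (1, 11), (2, 10)]  # the many-to-many link table

docs = build_documents(books, authors, book_authors)

# A search "books by Ann" is then a filter on one flat field, with no join.
by_ann = [doc["title"] for doc in docs if "Ann" in doc["author_names"]]
```

If you also need to search authors by book, you would build a second set of documents (one per author) in the same way, which is what "depends on what searches you need" comes down to.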