When to use graph databases, ontologies, and knowledge graphs - graph-databases

I've been struggling to understand when these technologies are useful from a practical standpoint, and how they are different from each other. Could an expert check my understanding?
Graph databases: These are easier to understand and manage than a relational databases when relationships are complex, inherited, inferred with varying degrees of confidence, and likely to change. Some examples: a user doesn't know how much depth they'll need in a hierarchy; is inferring relationships from social media with varying degrees of confidence in ID resolution, topic resolution, and the strength of a relationship; or doesn't know what kinds of call center data they're going to want to store; all of these can be stored in relational databases, but they will need constant updates. They're also more performative for certain tasks.
Ontologies: These formal and standardized representations of knowledge are used to break down data silos. For instance, let's say a B2B sales company derives revenue from several different lines of business, which take one-time payments, subscriptions, sales of IP, and consulting services. The revenue data is stored in many different databases with lots of idiosyncrasies. An ontology allows the user to define a "customer payment" as anything that "creates or refunds revenue," so that subject matter experts can appropriately label the payments in their databases. Ontologies can be used with either graph databases or relational databases, but the emphasis on class inheritance makes them far easier to implement in a graph database, where the taxonomy of classes can be easily modeled.
Knowledge graph: A knowledge graph is a graph database where language (meaning, the entity and node taxonomies) are governed by an ontology. So in our B2B example, "customer payment" edges have one-time payments, subscription, etc subtypes, and connect "customer" classes to "line of business" classes.
Is that basically correct?

A graph database (GD) is a database that can store graph data, which primarily has three types of elements: nodes, edges, and properties. Two popular types of graph databases are (1) Resource Description Framework (RDF)-based graph databases eg. Blazegraph and (2) Label Propagation Graph (LPG)-based graph databases eg. Neo4j. RDF represents knowledge in the form of subject, verb, and object (S-V-O) triplet, such as John livesIn London, and as its nodes and edges cannot hold properties, additional nodes or literals needs to be added to represent properties. LPG represents knowledge in the form of edges, nodes, and attributes where nodes and edges can hold properties in the form of key:value. Eg., a node can have label Person, and Person can have properties name:Tom Hanks, born:1956.
An ontology is a description of the concepts and their relationships, using instances of concepts, attributes of instances (and classes), restrictions of classes, and rules (if-then statements). These rules describe the logical inferences that can be drawn from the assertions/axioms that comprise the overall theory that the ontology describes. Upper level ontology (eg. DOLCE) describes general concepts and relations, whereas domain ontology (eg. Gene Ontology) describes concepts and relations in a particular domain. A graph database may have an ontology in its schema level for logical consistency checking.
Generally, a knowledge graph (KG) is an organization of a knowledge base as a graph having nodes and links between the nodes. An example of an early KG is Wordnet which captures semantic relationships between words and their meanings. Later Google developed their Google Knowledge Graph (GKG) building on DBpedia and Freebase using RDFa, Microdata and JSON-LD content extracted from indexed web pages, and used schema.org vocabulary to organize the nodes. Google reported that it held around 70 billion facts in GKG.
Graph databases supports queries, but not logical inference which needs an ontology. If the connections within the data are of primary focus (eg. friends of a friend), retrieval more important than storage, and data model changes often, then graph database would be a good fit.
Ontology is used when we need to infer new knowledge from the given knowledge. For eg, if given (1) Socrates is Man, and (2) AllMen are Mortal, then the reasoner or inference engine in an ontology can infer a new knowledge (3) Socrates is Mortal. This is made possible by description logic axioms that the Web Ontology Language (OWL) uses to describe resources. The OWL is serialized using Resource Description Framework (RDF).
Ontology is also used when we need to check consistency in the data model. For eg, if an axiom says Human and Sponge are disjoint classes and we make John (a human) instance of both Human and Sponge classes then it will fail consistency test.
Taxonomy is the IS-A class hierarchy which forms the backbone of an ontology.
Knowledge Graphs are often associated with linked open data (LOD) projects built upon standard Web technologies such as HTTP, RDF, URIs, and SPARQL. KG may use ontologies for reasoning and graph databases to store the knowledge. Several large organizations have introduced their KGs.

Related

How to draw a NoSQL database (document oriented) with UML?

I'm beginner in the use of NoSQL and aim to build a uber-like database. Is it possible to draw a CouchDB database (document oriented) with UML and in particular, how to do joins? Or is there another alternative better suited for NoSQL database modeling?
You can use UML class diagrams to model entities and aggregates of an application domain, regardless of the implementation technology.
You can also model a more concrete implementation that uses a NoSQL database, and in particular document stores such as CouchDB. Objects are stored in the database are kind-of dehydrated (i.e. object data without their behavior) into a document.
There are challenge that you will face:
mapping between the document world and the object world: a document may contain several related objects (no joins needed), as well as links to other objects (see also embedded/nested document vs. document references).
potentially unstructured (or loosely structured) documents: document databases are extremely flexible about the content of documents, and it is perfectly allowed to mix objects of completely unrelated classes into the same document collection. Moreover, the fields/properties/members of a document may be dynamic and evolve. In practice however, collections often contain similar objects that will vary mainly regarding the fields (for example to acknowledge existence of implicit classes). Documents may even be validated according to a schema to ensure consistency if needed.
UML classes are based on strong typing, whereas types in documents are as flexible as the rest of their content (e.g. a field from could be a date 2000-04-02 in one document or a string "a long time ago" in another one).
So before you start, you need to think of the mapping strategy. My advice would be to focus in UML on the design of your object model, and think of documents as a convenient grouping of related ones (DDD aggregates may help in this regard). The following rule of thumb may help for the modeling:
joins (e.g. links between independent documents) will be represented by associations.
systematic grouping of objects with other, may suggest existence of some kind of stronger relation such as UML composition.
fields that vary from document to document would be represented either with optional properties (multiplicity 0..1 or 0..*) or generalisation/specialization of the enclosing object, depending on the logic that explains the variation.

Do graph databases deprecate relational databases?

I'm new to DBs of any kind. It seems you can represent any relational database in graph form (although it might be a very flat graph), and any graph database in a relational database (with enough tables).
A graph can avoid a lot of lookups in other tables by having a hard link from one entry to another, so in many/most cases I can see the speed advantage of a graph. If your data is naturally hierarchical, and especially if it forms a tree, I see the logical/reasoning benefit to a graph over relational. I imagine a node of a graph which links to other nodes probably contains multiple maps or lists... which is effectively containing a relational DB within nodes of a graph.
Are there any disadvantages to a graph db vs a relational db? (Note: I'm not looking to things like missing features in implementations, but instead the theoretical pros/cons)
When should I still use a relational database? Even if I logically have a single mapping of an int to int I could do it in a graph.
Graph databases were deprecated by relational-ish technology some 20 to 30 years ago.
The major theoretical disadvantage is that graph databases use TWO basic concepts to represent information (nodes and edges), whereas a relational database uses only one (the relation). This bleeds over into the language for data manipulation, in that a graph-based language must provide two distinct sets of operators : one for operating on nodes, and one for operating on edges. The relational model can suffice with only one.
More operators means more operators to implement for the DBMS builder, more opportunity for bugs, and for the user it means more distinct language constructs to learn. For example, adding information to a database is just INSERT in relational, in graph-based it can be either STORE (nodes) or CONNECT (edges). Removing information is just DELETE (relational), as opposed to either ERASE (nodes) or DISCONNECT (edges).
Building on Erwin Smout's fine answer, an important reason why the relational model supplanted the graph one is that a graph has a greater degree of "bias" baked into its structure than relations do. The edges of a graph are navigational links which user queries are expected to traverse in a particular way. A relational model of the same data assumes much less about how the data will be used. Users are free to join and manipulate relational data in ways that the database designer might not have foreseen. The disruptive costs of re-engineering graph database structures to support new requirements were a factor which drove the adoption of the relational model and its SQL-based offshoots in the 1980s.
Relational databases were designed to aggregate data, graph to find relations.
If You have for example a financial domain, all connections are known, You only aggregate data by other data to find sums and so on.
Graph databases are better in more chaotic domain where to connections are more important, and not all connections are apparent, for example:
networks of people, with different relations with one and other
films and people creating them. Not just actors but the whole crew.
natural language processing and finding connections between recognized words
Data model is important, but what matters more is how you access your data. Notice, there are very few (none, actually) sharded or otherwise distributed graph databases out there. If you compare insertion speed into a typical relational database and a graph database, your relational database will most likely win.
Yes, graph model is more versatile than relational model, but it doesn't make it universal - in some cases, this versatility is a roadblock for optimizations.
In fact, modern graph databases are a niche solutions for a narrow set of tasks - finding a route from A to B, working with friends in a social network, information technology in medicine.
For most business applications relational databases continue to prevail.
I'm missing the performance aspect in the answers above.
Performance of graph based data bases is inherently worse for scalar and maybe even tree based models. Only if you have a real graph, they may exhibit better performance.
Also most graph DBs do not feature ACID support such as almost any RDBMS.
From my real life experience I can tell almost any evolving data model will sooner or later become a graph and that's why graph DBs are superior in terms of flexibility and agility (they keep pace with the evolution of your data model).
That's why I don't think that RDBs will prevail for "For most business applications" as #Kostja says. I think they will prevail where ACID capability is essential.

Is there such a thing as a schema in a graph database?

Is there such a thing as a schema in a graph database? For example, can you specify which types of node can have relationships with which other types of node?
What does such a schema look like?
Graph databases differ a lot in this area, just like das_weezul says. In the general case I think graph databases which are closer to object databases (OODB) also have built-in schema support. One nice thing about graph databases is that they're very well suited for mixing data and metadata. So a common approach for both dealing with schema support and security is to store this kind of metadata in a (sometimes hidden) part of the very same graph.
When it comes to Neo4j - where I'm on the team - there's currently at least two approaches in use for defining schemas:
Defining the schema in annotations, for example using Spring Data Graph (docs).
Using a meta-model layer on top of the database.
You'll find some more reading on this topic over at myNoSQL.
Yes. Schemas useful in selecting vertex labels, which are part both Neo4J 2 and Tinkerpop 3. I think writing down the schema helps clarify how the graph should be used, although most databases don't support validations against a schema.
I have a longer post on how to draw the schema as a graph. http://lambdazen.blogspot.com/2014/01/do-property-graphs-have-schemas.html
A graph database will always have a rudimentary schema consisting of (at least) Vertex and Edge objects, where an Edge can contain data about a particular relationship. The degree to which you can add to this schema varies widely across implementations. You may be able to customize the schema by inheriting from Edge and/or Vertex objects,for instance.
If the graph database uses an underlying RDBMS or ODBMS then you may have access to more powerful schema creation and manipulation capabilities.

What is the difference between graph-based databases and object-oriented databases?

What is the difference between graph-based databases (http://neo4j.org/) and object-oriented databases (http://www.db4o.com/)?
I'd answer this differently: object and graph databases operate on two different levels of abstraction.
An object database's main data elements are objects, the way we know them from an object-oriented programming language.
A graph database's main data elements are nodes and edges.
An object database does not have the notion of a (bidirectional) edge between two things with automatic referential integrity etc. A graph database does not have the notion of a pointer that can be NULL. (Of course one can imagine hybrids.)
In terms of schema, an object database's schema is whatever the set of classes is in the application. A graph database's schema (whether implicit, by convention of what String labels mean, or explicit, by declaration as models as we do it in InfoGrid for example) is independent of the application. This makes it much simpler, for example, to write multiple applications against the same data using a graph database instead of an object database, because the schema is application-independent. On the other hand, using a graph database you can't simply take an arbitrary object and persist it.
Different tools for different jobs I would think.
Yes, the API seems like the major difference, but is not really a superficial one. Conceptually a set of objects will form a graph and you could think of an API that treats this graph in a uniform way. Conversely, you could in theory mine a generic graph structure for patterns and map them to objects exposed via some API. But the design of the API of an actual product will generally have consequence on how data is actually stored, how it can be queried, so it would be far from trivial to, say, create a wrapper and make it look like something else. Also, an object-oriented database must offer some integrity guarantees and a typing structure that a graph database won't normally do. In fact, serious OO database are far from "free form" :)
Take a look at [HyperGraphDB][1] - it is both a full object-oriented database (like db4o) and a very advanced graph database both in terms of representational and querying capabilities. It is capable of storing generalized hypergraphs (where edges can point to more than one node and also to other edges as well), it has a fully extensible type system embedded as a graph etc.
Unlike other graph databases, in HyperGraphDB every object becomes a node or an edge in the graph, with none-to-minimal API intrusion and you have the choice of representing your objects as a graph or treating them in a way that is orthogonal to the graph structure (as "payload" values of your nodes or edges). You can do sophisticated traversals, customized indexing and querying.
An explanation why HyperGraphDB is in fact an ODMS, see the blog post Is HyperGraphDB an OO Database? at Kobrix's website.
As Will descibes from another angle, a graphdb will keep your data separated from your application classes and objects. A graphdb also has more built-in functionality to deal with graphs, obviously - like shortest path or deep traversals.
Another important difference is that in a graphdb like neo4j you can traverse the graph based on relationship (edge) types and directions without loading the full nodes (including node properties/attributes). There's also the choice of using neo4j as backend of an object db, still being able to use all the graphy stuff, see: jo4neo This project has a different approach that could also count as an object db on top of neo4j: neo4j.rb. A new option is to use Spring Data Graph, which gives graphdb support through annotations.
The same question was asked in the comments to this blogpost.
From a quick browse of both their websites:
The major difference is the way the APIs are structured, rather than the kind of free-form database you can build with them.
db4o uses an object mapping - you create a Java/C# class, and it uses reflection to persist it in the database.
neo4j has an explicit manipulation API.
Neo4j seemed, in my humble opinion, much nicer to interact with.
You might also consider a key-value store - you could make exactly the same free-form database with one of those.
The difference at low-level is not so huge. Both manage relationships as direct links without costly joins. Furthermore both have a way to traverse relationships with the Query language, but the graph database has operators to go recursively at Nth level.
But the biggest difference is in the domain: in a Graph databases all is based on the 2 types: vertexes and edges, even if usually you can define your own types as a sort of subtypes of Vertex or Edge.
In the ODBMS you have no Vertex and Edge concepts, unless you write your own.
With graph databases, you have a slight semblance of a chance that it is based on mathematical graph theory. With Object-oriented databases, you have the certainty that it is based on nothing at all (and most certainly no mathematical theory at all).

Database system that is not relational

What are the other types of database systems out there. I've recently came across couchDB that handles data in a non relational way. It got me thinking about what other models are other people is using.
So, I want to know what other types of data model is out there. (I'm not looking for any specifics, just want to look at how other people are handling data storage, my interest are purely academic)
The ones I already know are:
RDBMS (mysql,postgres etc..)
Document based approach (couchDB, lotus notes)
Key/value pair (BerkeleyDB)
db4o
Quote from the "about" page:
db4o is the open source object database that enables Java and .NET developers to store and retrieve any application object with only one line of code, eliminating the need to predefine or maintain a separate, rigid data model.
Older non-relational databases:
Network Database
Hierarchical Database
Both mostly went out of style when relational became feasible.
Column-oriented databases are also a bit of a different animal. Many of them do support standard relational database SQL though. These are generally used for data warehouse type applications.
Semantic Web is also a non-relational data storage paradigm. There are no relations, all metadata is stored in the same way as data, and every entity has potentially its own unique set of attributes. Open-source projects that implement RDF, a Semantic Web standard, include Jena and Sesame.
Isn't Amazon's SimpleDB non-relational?
db4o, as mentioned by Eric, is an Object-Oriented database management system (OODBMS).
There's object-based databases(Gemstore, for example). Google's Big-Table and Amason's Simple Storage I am not sure how you would categorize, but both are map-reduce based.
A non-relational document oriented database we have been looking at is Apache CouchDB.
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language.
Our interest was in providing a distributed access user preferences store that would be immune to shape changes to which we could serialize preference objects from Java and access those just as easily with Javascript from a XULRunner based client application.
I'd like to detail more on Bill Karwin's answer about semantic web and triplestores, since it's what I am working on at the moment, and I have something to say on it.
The idea behind a triplestore is to store a graph-based database, whose datamodel roots in RDF. With RDF, you describe nodes and associations among nodes (in other words, edges). Data is organized in triples :
start node ----relation----> end node
(in RDF speech: subject --predicate--> object). With this very simple data model, any data network can be represented by adding more and more triples, provided you give a meaning to nodes and relations.
RDF is very general, and it's a graph-based data model well suited for search criteria looking for all triples with a particular combination of subject, predicate, or object, in any combination. Eventually, through a query language called SPARQL, you can also perform more complex queries, an operation that boils down to a graph isomorphism search onto the graph, both in terms of topology and in terms of node-edge meaning (we'll see this in a moment). SPARQL allows you only SELECT (and similar) queries. No DELETE, no INSERT, no UPDATE. The information you query (e.g. specific nodes you are interested in) are mapped into a table, which is what you get as a result of your query.
Now, topology in itself does not mean a lot. For this, a Schema language has been invented. Actually, more than one, and calling them schema languages is, in some cases, very limitative. The most famous and used today are RDF-Schema, OWL (Lite and Full), and they predate from the obsolete DAML+OIL. The point of these languages is, boiling down stuff, to give a meaning to nodes (by granting them a type, also described as a triple) and to relationships (edges). Also, you can define the "range" and "domain" of these relationships, or said differently what type is the start node and what type is the end node: you can say for example, that the property "numberOfWheels" can be applied only to connect a node of type Vehicle to a non-zero integer value.
ns:MyFiat --rdf:type--> ns:Vehicle
ns:MyFiat --ns:numberOfWheels-> 4
Now, you can use these ontologies in two directions: validation and inference. Validation is not that fancy today, but I've seen instances of use. Inference is what is cool today, because it allows reasoning. Inference basically takes a RDF graph containing a set of triples, takes an ontology, mixes them into a triplestore database which contains an "inference engine" and like magic the inference engine invents triples according to your ontological description. Example: suppose you just store this information in the database
ns:MyFiat --ns:numberOfWheels--> 4
and nothing else. No type is specified about this node, but the inference engine will add automatically a triple saying that
ns:MyFiat --rdf:type--> ns:Vehicle
because you said in your ontology that only objects of type Vehicle can be described by a property numberOfWheels.
Conversely, you can use the inference engine to validate your data against the ontology so to refuse not compliant data (sort of like XML-Schema for XML). In this case, you will need both triples to have your data successfully accepted by the triplestore.
Additional characteristics of triplestores are Formulas and Context-aware storage. Formulas are statements (as usual, triples subject predicate object) that describe something hypothetical. I never used Formulas, so I won't go into more details of something I don't know. Context awareness are basically subgraphs: the problem with storing triples is that you don't have anything to say where these triples come from. Suppose you have two dealers that describe the same price of a component. One says that the price is 5.99 and the other 4.99. If you just store both triples into a database, now you don't know anything about who stated each information. There are two ways to solve this problem.
One is reification. Reification means that you store additional triples to describe another triple. It's wasteful, and makes life hell because you have to reify every and each triple you store. The alternative is context-awareness. Having a context-aware storage It's like being able to box a bunch of triples into a container with a label on it (the context identifier). You now can use this identifier as subject for additional statements, hence describing a bunch of triples in a single action.
4. Navigational. Includes Tree/Hierarchy and Graph/Network.
File systems, the semantic web, XML, Object databases, CODASYL, and many others all fit into this category.
Those 4 are pretty much it.
There is also what is referred to as an "inverted index" or "inverted list" database. Software AG's Adabas product would be an example. As with hierachical, these databases continue to be used in large corporate or university environments because of legacy considerations or due to a performance advantage in certain situations (typically high-end transactional applications).
There are BASE systems (Basically Available, Soft State, Eventually consistent) and they work well with simple data models holding vast volumes of data. Google's BigTable, Dojo's Persevere, Amazon's Dynamo, Facebook's Cassandra are some examples.
See LINK
The illuminate Correlation Database is a new revolutionary non-relational database. The Correlation Database Management Dystem (CDBMS) is data model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment. Unlike relational database management systems or column-oriented databases, a correlation database uses a value-based storage (VBS) architecture in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values (data is 100% indexed). Queries are performed using natural language instead of SQL (NoSQL).
Learn more at: www.datainnovationsgroup.com

Resources