Graph data modeling basics - data-modeling

I am playing around with orientdb quite some time now. In most of my projects I am dealing with GIS and ITS data from traffic networks...so I thought what would be a better datastore for a real world graph than a graph database?
So I wrote a python script to migrate a multimodal traffic network graph from a relational database to orientdb. The basic datamodel (traffic network nodes/crossings and edges/ways) is easy and I also took advantage of orientdbs spatial datatypes to store actually their real world representation. But now it gets hard for me to understand the principles of graph data modeling right.
In real world the nodes/crossings aren't very interesting...most properties are situated on the edges (type of way, lanes, width, etc...). In the graph datamodel the graph is used to associate entities, so the nodes are carrying most of the properties and edges are only to associate nodes with each other.
How would you model a real world traffic network graph in a graph data model the right way, and specifically how would you model aspects like a lane or the coating of a street to the network graphs edge .
P.S. Lanes and Properties of an edge should be their own classes, as they should only be referenced to an network graphs edge, as an edge can carry multiple types of traffic (train, street, walkways, bikeroutes, etc...)

Read the following articles, they deal with modeling issues (search google scholar)
Bordoloi, S. and Kalita, B. (2013a). Designing Graph Database Models from existing relational databases. International Journal of Computer Applications, 74(1).
Bordoloi, S. and Kalita, B. (2013b). ER Model to an Abstract Mathematical Model for Database Schema using Reference Graph. International Journal of Engineering Research And Development, e-ISSN, pages 51–60.
De Virgilio, R., Maccioni, A., and Torlone, R. (2014). Model-driven design of graph databases. In Conceptual Modeling, pages 172–185. Springer.
Park, Y., Shankar, M., Park, B.-H., and Ghosh, J. (2014). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on, pages 12–19. IEEE.

Related

When to use graph databases, ontologies, and knowledge graphs

I've been struggling to understand when these technologies are useful from a practical standpoint, and how they are different from each other. Could an expert check my understanding?
Graph databases: These are easier to understand and manage than a relational databases when relationships are complex, inherited, inferred with varying degrees of confidence, and likely to change. Some examples: a user doesn't know how much depth they'll need in a hierarchy; is inferring relationships from social media with varying degrees of confidence in ID resolution, topic resolution, and the strength of a relationship; or doesn't know what kinds of call center data they're going to want to store; all of these can be stored in relational databases, but they will need constant updates. They're also more performative for certain tasks.
Ontologies: These formal and standardized representations of knowledge are used to break down data silos. For instance, let's say a B2B sales company derives revenue from several different lines of business, which take one-time payments, subscriptions, sales of IP, and consulting services. The revenue data is stored in many different databases with lots of idiosyncrasies. An ontology allows the user to define a "customer payment" as anything that "creates or refunds revenue," so that subject matter experts can appropriately label the payments in their databases. Ontologies can be used with either graph databases or relational databases, but the emphasis on class inheritance makes them far easier to implement in a graph database, where the taxonomy of classes can be easily modeled.
Knowledge graph: A knowledge graph is a graph database where language (meaning, the entity and node taxonomies) are governed by an ontology. So in our B2B example, "customer payment" edges have one-time payments, subscription, etc subtypes, and connect "customer" classes to "line of business" classes.
Is that basically correct?
A graph database (GD) is a database that can store graph data, which primarily has three types of elements: nodes, edges, and properties. Two popular types of graph databases are (1) Resource Description Framework (RDF)-based graph databases eg. Blazegraph and (2) Label Propagation Graph (LPG)-based graph databases eg. Neo4j. RDF represents knowledge in the form of subject, verb, and object (S-V-O) triplet, such as John livesIn London, and as its nodes and edges cannot hold properties, additional nodes or literals needs to be added to represent properties. LPG represents knowledge in the form of edges, nodes, and attributes where nodes and edges can hold properties in the form of key:value. Eg., a node can have label Person, and Person can have properties name:Tom Hanks, born:1956.
An ontology is a description of the concepts and their relationships, using instances of concepts, attributes of instances (and classes), restrictions of classes, and rules (if-then statements). These rules describe the logical inferences that can be drawn from the assertions/axioms that comprise the overall theory that the ontology describes. Upper level ontology (eg. DOLCE) describes general concepts and relations, whereas domain ontology (eg. Gene Ontology) describes concepts and relations in a particular domain. A graph database may have an ontology in its schema level for logical consistency checking.
Generally, a knowledge graph (KG) is an organization of a knowledge base as a graph having nodes and links between the nodes. An example of an early KG is Wordnet which captures semantic relationships between words and their meanings. Later Google developed their Google Knowledge Graph (GKG) building on DBpedia and Freebase using RDFa, Microdata and JSON-LD content extracted from indexed web pages, and used schema.org vocabulary to organize the nodes. Google reported that it held around 70 billion facts in GKG.
Graph databases supports queries, but not logical inference which needs an ontology. If the connections within the data are of primary focus (eg. friends of a friend), retrieval more important than storage, and data model changes often, then graph database would be a good fit.
Ontology is used when we need to infer new knowledge from the given knowledge. For eg, if given (1) Socrates is Man, and (2) AllMen are Mortal, then the reasoner or inference engine in an ontology can infer a new knowledge (3) Socrates is Mortal. This is made possible by description logic axioms that the Web Ontology Language (OWL) uses to describe resources. The OWL is serialized using Resource Description Framework (RDF).
Ontology is also used when we need to check consistency in the data model. For eg, if an axiom says Human and Sponge are disjoint classes and we make John (a human) instance of both Human and Sponge classes then it will fail consistency test.
Taxonomy is the IS-A class hierarchy which forms the backbone of an ontology.
Knowledge Graphs are often associated with linked open data (LOD) projects built upon standard Web technologies such as HTTP, RDF, URIs, and SPARQL. KG may use ontologies for reasoning and graph databases to store the knowledge. Several large organizations have introduced their KGs.

Graph database vs. RDB with link/bridge tables

I work in the fraud/AML (anti-money laundering) field, and we are exploring using a graph database to unearth hidden connections and links. I've read a fair amount abut graph databases lately (mostly neo4j, but I think the concepts are similar across different products?), and from what I can tell, they seem to be well-suited to this domain. The issue is that I'm having a hard time getting buy-in from tech management, as they seem to think that we can do the same things with our existing data reporting model, which is in Hadoop, and is essentially a data warehouse which has specific tables that provide many-to-many link tables between the core tables (I believe Kimball calls them 'bridge' tables?).
In a way, they seem to provide the same functionality as the relationship tables in a graph DB. Given that we have already constructed the link tablesin Hadoop, would a graph database provide any performance advantage for the kinds of things we may want to do (e.g. How is Customer A connected to Customer B), or have we largely negated any performance advantage of a graph DB by building all of the link tables?
On similar hardware platforms, a relational database will never be able to keep up with a well constructed graph database when performing "path-between" queries. Never.
Every graph database product has its own internal storage representation, but they are all fundamentally designed to store nodes and edges and support navigational queries across those nodes and edges. Without the addition of new graph-support features, relational database will struggle to provide graph-like capabilities.
The other advantage of using a native graph database is that the graph query languages are specifically designed to support path-between queries. In Objectivity/DB, a massively scalable and distributable object/graph database, we can use the DO query language to find all of the paths between two entities up to a specified number of degrees apart in milliseconds or seconds. A DO query might look like the following:
Match p = (:Account { accountId = "1234"})
-[*..100]->
(:Account { accountId = "5678"})
return p;
Here, we are saying: Find all paths (p) from Account 1234 to Account 5678, where they are between 1 and 100 degrees apart.
To create and execute this same query in a relational database would be much more complicated (without the addition of graph features to the database) and the execution of a query like this in a relational database would be much more resource intensive (memory, cpu, I/O).
If you have the opportunity to explore graph database for your project, make sure you understand your scalability and data distribution requirements. That information will be key to selecting the correct product.
Disclaimer: I am the Director of Field Operations for Objectivity.

Why there are not so many graph databases as graph processing frameworks?

For graph databases, especially those are active and distributed, I knew some but not a lot. Like orientdb, Titan, Dex, etc.
Regarding the graph processing frameworks, there are huge set of tools like graphx, graph lab, powergraph, xstream, pregel, etc. and there are more coming out every year.
Can any one tell me the difference between those two categories of tools? Are they exchangeable? And why graph databases are not drawing enough attention as graph processing frameworks?
The difference between graph databases and graph processing frameworks is databases are built to save data in the basic form of a graph, where relationships between the data are built with edges and the data points are built with nodes/vertices. Some databases, like OrientDB extend this basic concept considerably, to make the database much more versatile. Others are less versatile. Though in general, the main goal is to persist the data an a graph-like form, edges and vertices.
With graph processing frameworks, on the other hand, they take a set of data and build analytical graphs out of the data. The goal is mainly analysis of graph like patterns or structures within the data.
I'll try to put this in an analogy, as I understand it.
Say you have a punch bowl full of punch (your data).
In a graph database scenario, the punch is already a graph and you can look into the bowl and see all the stuff in your graph and analyze it too.
With a graph processing framework, you have a punch bowl full of stuff too, but it is murky and you don't see any graphs in it directly. To get a graph of some type, you first have to ladle out some of the punch, in let's say, a "graph processing ladle". This allows you to see some kind of graph coherence, depending on the algorithms you choose to try and analyze the data with. Of course, depending on your machine or system, like Spark, the graph processing ladle could be huge, even just as big as your whole punch bowl or even bigger.
Still, it takes time and processing to make a "sensible graph" out of the punch (your data). The other thing about this is, if you want to store this newly found ladle of analyzed graph punch, you'd have to have another bowl to put it in. And, if you drop the ladle on the floor, your graph data is gone. This wouldn't happen with a graph database.
I hope that makes sense.
Scott
There are connections and isolations between the graph database and graph computing.
Connections:
Graph database will not only offer data storage but also a series of graph data processing, For example, to find solve the SSSP problem needs traversal and computation of the graph which must be supported by the graph processing framework.
Isolations:
You can't use the graph database for most of the graph computing like PageRank, Greedy Graph Coloring, because as a basic storage and query system, graph database doesn't need to have the ability to do computing jobs.
Correct me if I'm wrong, I'm also a freshman for graph computing.

How are OLAP, OLTP, data warehouses, analytics, analysis and data mining related?

I'm trying to understand what OLAP, OLTP, data mining, analytics etc. are about, and I feel like my understanding about some of these concepts is still a bit vague. Information about these subjects tend to be explained in a very complex manner on the internet.
I feel like a question like this is likely to be closed since it's a very broad one, so I'll try to narrow it down into two questions:
Question 1:
After doing research I understand the following about these concepts, is it correct?
Analysis is decomposing something complex, to understand the inner workings better.
Analytics is predictive analysis on information that requires alot of math and statistics.
There's many type of databases, but they are either OLTP (transactional) or OLAP (analytical).
OLTP databases use ER diagrams, and are therefore easier to update because they are in normalized form.
In contrast, OLAP uses the denormalized star schema's and is therefore easier to query
OLAP is used for predictive analysis and OLTP is usually used in more practical situations since theres no redundancy.
Data warehouses is a type of OLAP database, and usually consists out of multiple other databases.
Data mining is a tool used in analytics, where u use computer software to find out relationships between data so you can predict things (e.g. customer behavior).
Question 2:
I'm especially confused about the difference between analytics and analysis. They say analytics is multidimensional analysis, but what is that supposed to mean?
I will try to explain you from the top of the pyramid:
Business Intelligence (what you didn't mentioned) is term in IT which stands for a complex system and gives useful informations about company from data.
So, BI systems has target: Clean, accurate and meaningful informations.
Clean means there is no tech problems (missing keys, incomplete data ect). Accurate means accurate - BI systems are also used as fault checker of production database (logical faults - i.e invoice bill is too high, or inactive partner is used ect). It has been accomplished with rules. Meaningful is hard to explain, but in simple english, it's all your data (even excel table from the last meeting), in way you want.
So, BI system has back-end: It's data warehouse.
DWH is nothing else than a database (instance, not software). It can be stored in RDBMS, analytical db (columnar or document store types), or NoSQL databases.
Data warehouse is term used usually for whole database that I explained above. There could be number of data-marts (if Kimball model is used) - more often, or relational system in 3rd normalized form (Inmon model) called enterprise data warehouse.
Data marts are tables inside DWH that are related (star schema, snowflake schema). Fact table (business process in denormalized form ) and dimension tables.
Each data mart represents one business process. Example: DWH has 3 data marts. One is retail sales, second is export, and third is import. In retail you can see total sales, qty sold, import price, profit (measures) by SKU, date, store, city ect (dimensions).
Loading data in DWH is called ETL(extract, transform, load).
Extract data from multiple sources (ERP db, CRM db, excel files, web service...)
Transform data (clean data, connect data from diff sources, match keys, mine data)
Load data (Load transformed data in specific data marts)
edit because of comment: ETL process is usually created with ETL tool, or manually with some programming language (python, c# ect) and APIs.
ETL process is group of SQLs, procedures, scripts and rules related and separated in 3 parts (look above), controlled by meta data.
It's either scheduled (every night, every few hours) or live (change data capture, triggers, transactions).
OLTP and OLAP are types of data processing. OLTP is used in transaction purpose, between database and software (usually only one way of input/output data).
OLAP is for analitical purpose, and this means there is multiple sources, historical data, high select query performance, mined data.
edit because of comment: Data Processing is way how data is stored and accessed from database. So, based on of your needs, database is set in different way.
Image from http://datawarehouse4u.info/:
Data mining is the computational process of discovering patterns in large data sets. Mined data can give you more insight view of business process or even forecast.
Analysis is a verb, which in BI world means simplicity of getting asked information from data. Multidimensional analysis actually says how system is slicing your data (with dimensions inside cube). Wikipedia said that analysis of data is a process of inspecting data with the goal of discovering useful information.
Analytics is a noun and it represent a result of analysis process.
Don't get so much fuss about those two words.
I can tell you about Data mining as i had project on Data mining. Data mining is not a tool ,Its a method of mining data and different tools used for data mining is WEKA ,RAPID MINER etc. Data mining follows many algorithms which are inbuilt in tools like Weka ,Rapid miner. Algorithms like Clustering algorithm , assosiation algorithm etc.
A simple example i can give you of data mining . Teacher is teaching science subject in a class by using different methods of teaching like using chalkboard,presentation,Practical. So now our aim is to find which method is suitable for students. Then we do survey and take students opinion 40 students like chalk board ,30 likes presentation and 20 likes practical method. So with help of this data we can make the rules for example Science subject should be taught by chalk board method.
To knw different algorithms you can use google :D.

Do graph databases deprecate relational databases?

I'm new to DBs of any kind. It seems you can represent any relational database in graph form (although it might be a very flat graph), and any graph database in a relational database (with enough tables).
A graph can avoid a lot of lookups in other tables by having a hard link from one entry to another, so in many/most cases I can see the speed advantage of a graph. If your data is naturally hierarchical, and especially if it forms a tree, I see the logical/reasoning benefit to a graph over relational. I imagine a node of a graph which links to other nodes probably contains multiple maps or lists... which is effectively containing a relational DB within nodes of a graph.
Are there any disadvantages to a graph db vs a relational db? (Note: I'm not looking to things like missing features in implementations, but instead the theoretical pros/cons)
When should I still use a relational database? Even if I logically have a single mapping of an int to int I could do it in a graph.
Graph databases were deprecated by relational-ish technology some 20 to 30 years ago.
The major theoretical disadvantage is that graph databases use TWO basic concepts to represent information (nodes and edges), whereas a relational database uses only one (the relation). This bleeds over into the language for data manipulation, in that a graph-based language must provide two distinct sets of operators : one for operating on nodes, and one for operating on edges. The relational model can suffice with only one.
More operators means more operators to implement for the DBMS builder, more opportunity for bugs, and for the user it means more distinct language constructs to learn. For example, adding information to a database is just INSERT in relational, in graph-based it can be either STORE (nodes) or CONNECT (edges). Removing information is just DELETE (relational), as opposed to either ERASE (nodes) or DISCONNECT (edges).
Building on Erwin Smout's fine answer, an important reason why the relational model supplanted the graph one is that a graph has a greater degree of "bias" baked into its structure than relations do. The edges of a graph are navigational links which user queries are expected to traverse in a particular way. A relational model of the same data assumes much less about how the data will be used. Users are free to join and manipulate relational data in ways that the database designer might not have foreseen. The disruptive costs of re-engineering graph database structures to support new requirements were a factor which drove the adoption of the relational model and its SQL-based offshoots in the 1980s.
Relational databases were designed to aggregate data, graph to find relations.
If You have for example a financial domain, all connections are known, You only aggregate data by other data to find sums and so on.
Graph databases are better in more chaotic domain where to connections are more important, and not all connections are apparent, for example:
networks of people, with different relations with one and other
films and people creating them. Not just actors but the whole crew.
natural language processing and finding connections between recognized words
Data model is important, but what matters more is how you access your data. Notice, there are very few (none, actually) sharded or otherwise distributed graph databases out there. If you compare insertion speed into a typical relational database and a graph database, your relational database will most likely win.
Yes, graph model is more versatile than relational model, but it doesn't make it universal - in some cases, this versatility is a roadblock for optimizations.
In fact, modern graph databases are a niche solutions for a narrow set of tasks - finding a route from A to B, working with friends in a social network, information technology in medicine.
For most business applications relational databases continue to prevail.
I'm missing the performance aspect in the answers above.
Performance of graph based data bases is inherently worse for scalar and maybe even tree based models. Only if you have a real graph, they may exhibit better performance.
Also most graph DBs do not feature ACID support such as almost any RDBMS.
From my real life experience I can tell almost any evolving data model will sooner or later become a graph and that's why graph DBs are superior in terms of flexibility and agility (they keep pace with the evolution of your data model).
That's why I don't think that RDBs will prevail for "For most business applications" as #Kostja says. I think they will prevail where ACID capability is essential.

Resources