Graph databases store data as nodes, properties and relationships. If I need to retrieve some specific data based on a query, I may need to retrieve multiple objects (since the query might have many results).
Consider this simple scenario for object-oriented programming with graph databases:
I have a (graph) database of users, where each user is stored as an object. I need to retrieve a list of users living in a specific place (the place is stored as a property of the user object). So, how would I do it? I mean, unnecessary data will be retrieved every time I need to do something (in this case, the entire user object might need to be retrieved). Isn't functional programming better suited to graph databases?
This example is just a simple analogy for the question that came to my mind; don't take it as a benchmark. So, the question remains: how well does object-oriented programming work with graph databases?
A graph database is more than just vertices and edges. In most graph databases, such as Neo4j, vertices have an id and edges have a label, and in addition both carry a list of properties. Typically, in Java-based graph databases these properties are limited to Java primitives; everything else needs to be serialized to a string (e.g. dates). This mapping to vertex/edge properties can either be done by hand, using methods such as getProperty and setProperty, or you can use something like Frames, an object mapper that uses the TinkerPop stack.
Each node has attributes that can be mapped to object fields. You can do that manually, or you can use spring-data to do the mapping.
Most graph databases have at least one kind of index for vertices/edges. InfiniteGraph, for instance, supports B-trees, Lucene (for text) and a distributed, scalable index type. If you don't have an index on the field you're trying to filter on, you'd need to traverse the graph and apply the predicates yourself at each step; hopefully that reduces the number of nodes that must be traversed.
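To make that difference concrete, here is a small, purely illustrative Python sketch (plain dictionaries, no real graph API; all names are made up) contrasting a full scan with predicate filtering against an index lookup:

    # Hypothetical in-memory "property graph": vertex id -> property dict.
    vertices = {
        1: {"name": "Alice", "place": "New York"},
        2: {"name": "Bob", "place": "Boston"},
        3: {"name": "Carol", "place": "New York"},
    }

    # Without an index: visit every vertex and apply the predicate yourself.
    def scan(vertices, key, value):
        return [vid for vid, props in vertices.items() if props.get(key) == value]

    # With an index: precompute value -> vertex ids once, then do a single lookup.
    index = {}
    for vid, props in vertices.items():
        index.setdefault(props.get("place"), []).append(vid)

    print(scan(vertices, "place", "New York"))  # O(V) traversal: [1, 3]
    print(index.get("New York", []))            # O(1) lookup:    [1, 3]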
> I need to retrieve a list of users living in a specific place (the place property is stored in the user object).
There is a better way. Separate location from user. Instead of having a location as a property, create a node for locations.
So you can have (u:User)-[:LIVES_IN]->(l:Location) type of relationship.
It then becomes easier to retrieve a list of users living in a specific place with a simple query:

    MATCH (u:User)-[:LIVES_IN]->(l:Location)
    WHERE l.name = 'New York'
    RETURN u, l
This will return all users living in New York without having to scan all the properties of each node. It's a faster approach, especially once you add an index on the Location name property.
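For completeness, here is a minimal sketch of running that query from application code, using the official Neo4j Python driver; the URI, credentials and returned property are placeholders for your own setup:

    from neo4j import GraphDatabase  # official Neo4j Python driver

    # Placeholder connection details; adjust for your instance.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    query = """
    MATCH (u:User)-[:LIVES_IN]->(l:Location)
    WHERE l.name = $place
    RETURN u.name AS name
    """

    with driver.session() as session:
        for record in session.run(query, place="New York"):
            print(record["name"])

    driver.close()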
Why not use an object-oriented graph database?
InfiniteGraph is a graph database built on top of Objectivity/DB, which is a massively scalable, distributed object-oriented database.
InfiniteGraph allows you to define your vertices and edges using a standard object-oriented approach, including inheritance. You can also embed a defined data type as an attribute in another data type definition.
Because InfiniteGraph is object-oriented, it gives you access to query capabilities on complex data structures that are not available in the popular graph databases. Consider the following example:
Suppose I create a query that determines the inclusion of an edge based on an evaluation of the set of CallDetail nodes hanging off the Call edge. I might only include the edge in my results if there exists a CallDetail with a particular date, or if the sum of the callDurations of all the CallDetails that occurred between two dates is over some threshold. This is the real power of an object-oriented database in solving graph problems: you can support a much more complex data model.
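To illustrate what such an edge-inclusion predicate computes, here is a hypothetical Python sketch; the Call/CallDetail names come from the description above, but the data layout and field names are assumptions, not InfiniteGraph's actual API:

    from datetime import date

    # Assumed layout: the CallDetail nodes hanging off one Call edge.
    call_details = [
        {"date": date(2015, 1, 10), "callDuration": 120},
        {"date": date(2015, 2, 3),  "callDuration": 300},
        {"date": date(2015, 6, 1),  "callDuration": 45},
    ]

    def include_edge(details, start, end, threshold):
        # Sum the callDurations of the CallDetails that occurred between
        # two dates; include the edge only if the total is over the threshold.
        total = sum(d["callDuration"] for d in details if start <= d["date"] <= end)
        return total > threshold

    print(include_edge(call_details, date(2015, 1, 1), date(2015, 3, 1), 400))  # True (420 > 400)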
I'm not sure why people have conflated the terms graph database and property graph. A property graph is but one way to implement a graph database, and not a particularly efficient one. InfiniteGraph is a schema-based database, and the schema provides several distinct advantages, one of which is object placement.
Disclaimer: I am the Director of Field Operation for Objectivity, Inc., maker of InfiniteGraph.
I'm trying to come up with a way of storing the underlying structure of C++ classes in a database, and trying to determine what the best type of database would be/how to lay out information in that database.
C++ classes by themselves, as code, would be parsed into an AST, which suggests a tree as a possible data structure; that would fit well into a graph database. However, I don't think the data could be stored purely as a tree once you consider that pointers could create a loop. The thought then would be a graph. Being someone primarily familiar with relational databases, I'm not sure how plausible that is. The primary pieces of data that would be needed for a class are:
Class name
Child nodes, including their type and offset within the struct
For child nodes that aren't a primitive type, there would be a relationship to that class. This by itself seems like it would fit well inside a graph database. The main thing I'm having a hard time envisioning is how things like pointers or arrays would be stored. Would those just have different attributes on the edge between the two classes? Is there some other way of storing this data that I'm missing which would work better?
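One way to picture the attributed-edge idea raised here is a sketch along these lines (pure Python, all names invented), where member and pointer links are simply edges with different attributes, so pointer cycles are unproblematic:

    # Classes (and primitives) as nodes; member/pointer links as attributed edges.
    nodes = {
        "List": {"kind": "class"},
        "Node": {"kind": "class"},
        "int":  {"kind": "primitive"},
    }

    edges = [
        # (from, to, attributes): 'kind' and 'offset' distinguish the link types.
        ("List", "Node", {"kind": "pointer", "field": "head", "offset": 0}),
        ("Node", "int",  {"kind": "member",  "field": "value", "offset": 0}),
        ("Node", "Node", {"kind": "pointer", "field": "next", "offset": 8}),  # a loop
    ]

    # List every pointer relationship, i.e. the loops a pure tree cannot express.
    for src, dst, attrs in edges:
        if attrs["kind"] == "pointer":
            print(f"{src}.{attrs['field']} -> {dst}")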
This is for a project that will map metadata. There are many more nodes but this particular one became a debate in the team.
Which model would yield the best query performance? Or does it not matter?
Option 1
Permission metadata is explicit as edges between nodes.
Option 2
Permission metadata is inside the properties of the edge.
Option 3
???
Let me comment for ArangoDB here, being one of its developers.
There is a third possibility, namely to have a single vertex collection and multiple edge collections for the different access methods. You would then "officially" have 3 graphs that share the same vertex set.
I would expect this to perform better, because each access type would only have to deal with a single type of edge, and access would be fast.
Obviously it all depends on your queries. My statement holds for queries like "what are all the Entities a Person can update?" or "who can select this Entity?".
I could imagine that your standard query is more "Can this person delete that Entity?" or "Which access rights does this person have for that Entity?".
These two questions are probably not efficient with any of the approaches suggested, because as far as I see, all of them would then require a search, either in the outgoing edges of the Person or in the incoming edges of the Entity.
What would be needed here is a kind of "vertex-centric index", that is, an index that can be used for the set of outgoing or incoming edges of a given vertex. If, for example, you used your option 2 (or indeed 1; it does not matter much here) and had a sorted index on all edges, sorted first by Person and then by Entity, then finding the (probably singleton) set of edges from a given Person to a given Entity would be a lookup with time complexity O(log(#edges)).
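A toy Python sketch of what such a vertex-centric index buys, using the standard-library bisect module over a sorted edge list (the data is made up):

    import bisect

    # (person, entity, access) triples, kept sorted by person, then entity.
    edges = sorted([
        ("alice", "report", "update"),
        ("alice", "ledger", "select"),
        ("bob",   "ledger", "delete"),
    ])

    def rights(person, entity):
        # O(log(#edges)) lookup of the (probably singleton) set of edges
        # from a given person to a given entity.
        lo = bisect.bisect_left(edges, (person, entity, ""))
        hi = bisect.bisect_right(edges, (person, entity, "\uffff"))  # sentinel above any access string
        return [access for _, _, access in edges[lo:hi]]

    print(rights("alice", "ledger"))  # ['select']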
We at ArangoDB are currently busy adding this feature; it will appear in one of the next two releases.
I can only speak for Neo4j here:
I don't know that it would matter much, but definitely benchmark! Both relationships and properties are stored as linked lists, so Neo4j will still need to traverse them either way. But the more relationships there are between Person and Entity nodes, the more attractive putting the permissions in properties becomes.
I recommend checking out the free O'Reilly book Graph Databases to learn more about the internals of Neo4j. But benchmarks will always be the gold standard.
I'm currently working on a project where I use natural language processing to extract emotions from text to correlate them with contextual information.
Definition of contextual information: any information that is relevant to describe an entity's situation in time and space.
Description of the data structure I'm looking for:
There is an arbitrary number of entities (an entity can be either a person or a group, for example Twitter hashtags) whose contextual information I want to track, along with their conversations with other entities. Conversations between entities are processed in order to classify their emotional features. Basic emotional features consist of a vector that specifies their occurrence as percentages: {fear: 0.1, happiness: 0.4, joy: 0.1, surprise: 0.9, anger: 0}
Entities can also submit any contextual information they'd like to share, for example: location, room temperature, blood pressure, and so on (I will refer to these as contextual variables).
Because neither the number of conversations of an entity nor the number of contextual variables they want to share is known at any point in time, the data structure needs to be able to adjust accordingly.
Important: every change in the data must also represent a state of its own, as I intend to correlate certain changes in state with each other.
Example: Bob and Alice have a conversation that shows high magnitude of fear. A couple of hours later they have another conversation that shows no more fear, but happiness.
Now, one could argue that a high magnitude of fear followed by happiness could actually be interpreted as the emotion relief.
However, in order to be able to extract this very information I need to be able to correlate different states with each other.
Same goes for using contextual information to correlate them with the tracked emotions in conversations.
This is why every state change must be recorded and available.
To make this more clear to you, I've created a graphic and attached it to the question.
Now, the actual question I have is: Which database/data structure can I use to solve this problem?
I've looked into event-sourcing databases but wasn't quite convinced if I can easily recreate a graph structure with them. I also looked at graph databases but didn't find what I was looking for.
Therefore it would be nice if someone here could at least point me in the right direction or help me adjust my structure accordingly to solve the problem. If, however, there are data structures supporting what I would call "graph databases with snapshots", then ease of use is probably the most important feature to filter for.
There's a database called Datomic by Rich Hickey (of Clojure fame) that stores facts over time. Every entry in the database is a fact with a timestamp, append-only as in Event Sourcing.
These facts can be queried with a relational/logical language à la Datalog (reminiscent of Prolog). Please see this post by kisai for a quick overview. It has been used for querying graphs with some success in the past: Using Datomic as a Graph Database.
While I have no experience with Datomic, it does seem to be quite suitable for your particular problem.
You have an interesting project. I do not work on things like this directly, but for my 2 cents:
It seems to me your picture is a bit flawed. You are trying to represent a graph database over time, but there isn't really a way to represent time this way.
If we examine the image, you have conversations and context data changing over time, but the facts of "Bob", "Alice" and "Malory" don't actually change over time. So let's remove them from the equation.
Instead, focus on the things you can model over time: a conversation, a context, a location. These will change as new data comes in, and they are excellent candidates for an event-sourced model. In your app, a conversation would be modeled as a series of individual events, which your aggregate combines and folds to generate a final state; that state would be your 'relief' determination.
For example, you could write logic where, if a conversation was angry and then a very happy event came in, the subject is now feeling relief.
What I would do is model these conversation states in your graph DB, connected to your 'fact' objects "Bob", "Alice", etc. A query such as "What is Alice feeling right now?" would then be a graph traversal through your conversation states, factoring in the context data connected to Alice.
To answer a question such as "What was Alice feeling 5 minutes ago?", you would take all the event streams for the conversations, rewind them to the appropriate point, and then examine the state of the conversations.
TLDR:
Separate the time dependent variables from the time independent variables and use event sourcing to model time.
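A minimal sketch of that separation in Python; the event layout, timestamps and the relief rule are all invented for illustration:

    from datetime import datetime

    # Append-only event stream for one conversation (the time-dependent part).
    events = [
        {"at": datetime(2015, 5, 1, 9, 0),  "emotions": {"fear": 0.8, "happiness": 0.1}},
        {"at": datetime(2015, 5, 1, 14, 0), "emotions": {"fear": 0.0, "happiness": 0.9}},
    ]

    def state_at(events, when):
        # Rewind: replay only the events up to 'when' and take the resulting state.
        past = [e["emotions"] for e in events if e["at"] <= when]
        return past[-1] if past else None

    def feels_relief(events, when):
        # Derived fact: high fear followed by happiness reads as relief.
        past = [e["emotions"] for e in events if e["at"] <= when]
        return (len(past) >= 2
                and past[-2].get("fear", 0) > 0.5
                and past[-1].get("happiness", 0) > 0.5)

    print(state_at(events, datetime(2015, 5, 1, 10, 0)))     # the 'fear' state
    print(feels_relief(events, datetime(2015, 5, 1, 15, 0)))  # True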
There is an obvious 1:1 correspondence between your states at a given time and a relational database with a given schema. So there is an obvious 1:1 correspondence between your set of states over time and a changing-schema database, i.e. a variable whose value is a database plus metadata, manipulated by both DDL and DML update commands. So there is no evidence that you shouldn't just use a relational DBMS.
Relational DBMSs allow generic querying with automated implementation at a certain computational complexity with certain opportunities for optimization. Any application can have specialized queries that make a specialized data structure and operators a better choice. But you must design your application and know about such special aspects to justify this. As it is, with the obvious correspondences between your states and relational states, this has not been justified.
EAV is frequently used instead of DDL and a changing schema. But under EAV the DBMS does not know about the real tables you are concerned with, whose columns are EAV attributes and which are explicit under the DDL/DML changing-schema approach. So EAV forgoes simplicity, clarity, optimization and, most of all, integrity and ACID. It can only be justified (compared to DDL/DML, assuming a relational representation is otherwise appropriate) by demonstrating that DDL with schema updates (adding, deleting and changing columns and tables) is worse (per the above) than EAV in your particular application.
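A small sqlite3 sketch of the contrast (table and column names are invented): under EAV every value is squeezed into one generic triple table, while under DDL/DML a new attribute is an ordinary, typed schema change:

    import sqlite3

    con = sqlite3.connect(":memory:")

    # EAV: the real tables are implicit; typing and integrity are lost.
    con.execute("CREATE TABLE eav (entity INTEGER, attribute TEXT, value TEXT)")
    con.execute("INSERT INTO eav VALUES (1, 'name', 'Alice')")
    con.execute("INSERT INTO eav VALUES (1, 'fear', '0.8')")  # a number stored as text

    # DDL/DML: explicit, typed columns; adding an attribute is plain DDL.
    con.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("INSERT INTO person VALUES (1, 'Alice')")
    con.execute("ALTER TABLE person ADD COLUMN fear REAL")
    con.execute("UPDATE person SET fear = 0.8 WHERE id = 1")

    print(con.execute("SELECT name, fear FROM person").fetchall())  # [('Alice', 0.8)]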
Just because you can draw a picture of your application state at some time using a graph does not mean that you need a graph database. What matters is what specialized queries/expressions you will be evaluating. You should understand what these are in terms of your problem domain, which is probably most easily expressed via some specialized data structure and operators, and relationally. Then you can compare the expressive and computational demands against a specialized data structure, a relational representation, and the models of particular graph databases. Be sure to google Stack Overflow.
According to Wikipedia "Neo4j is the most popular graph database in use today".
Coming from a SQL/NoSQL background, I am finding it quite challenging to model (efficiently, that is) the simplest of exercises on a graph DB.
While different technologies have their limitations and best practices, I am uncertain whether the mindset I am using while creating the models is the correct one. Hence, I am in need of guidance, advice and/or resources to help me get closer to the right practices.
The initial exercise I have tried is representing an entire file-share directory (subfolders and files) in a graph DB. For instance, some of the attributes and queries I would like to include are:
The hierarchical structure of the folders
The aggregate size at the current node
Being able to search based on who created a file/folder
Being able to search on file types
This brings me to the following questions:
When/which attributes should be used for edges? Only those on which I intend to search? Only relationships?
What if I wish to extend my graph's capabilities later, for instance to search for files bigger than X? How does one maximize the future capabilities/flexibility of the model so that such changes do not cause massive impacts?
Currently I am exploring InfiniteGraph and TitanDB.
1) The only attribute I can think of to describe an edge in a folder hierarchy is whether it is a contains or contained-by relationship.
(You don't even need that if you decide to consider all your edges one or the other. In your case, it looks like you'll almost always be interrogating descendants to search and to return aggregate size).
This is a lot simpler than a network, or a hierarchy where the edges may be of different types. Think an organization chart that tracks not only who manages whom, but who supports whom, mentors whom, harasses whom, whatever.
2) I'm not familiar with the two databases you mentioned, but Neo4j allows indexes on node properties, so adding an index on file_size should not have much impact. It's also "schema-less", so you can add attributes on the fly and different nodes may contain different attributes.
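As a sketch of how those requirements might look in Neo4j (via the official Python driver; the :Folder/:File labels, :CONTAINS relationship and the size/created_by properties are assumed names, and the connection details are placeholders):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Aggregate size: follow CONTAINS to any depth and sum the file sizes.
    size_query = """
    MATCH (f:Folder {name: $name})-[:CONTAINS*]->(file:File)
    RETURN sum(file.size) AS total
    """

    # Search on an ordinary node property (index it if it becomes hot).
    creator_query = """
    MATCH (file:File) WHERE file.created_by = $user
    RETURN file.name AS name
    """

    with driver.session() as session:
        print(session.run(size_query, name="projects").single()["total"])
        for record in session.run(creator_query, user="alice"):
            print(record["name"])

    driver.close()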
Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, a model which fits the data, and some attributes for identifying the data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series of the economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - the ARIMA orders found by auto.arima, in a suitable format; this again may be a list.
This is just an example. As I said, suppose I have a number of such objects combined into a list. I would like to save it in some suitable format. The obvious solution is simply to use save, but this does not scale very well for a large number of objects: for example, if I only wanted to inspect a subset of the objects, I would still need to load all of them into memory.
If my data were a data.frame, I could save it to a database. If I wanted to work with a particular subset of the data, I would use SELECT and rely on the database to deliver the required subset. SQLite has served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list into several related tables?
My motivation for this is to be able to easily generate various reports on the fitted models. I can write a bunch of functions which produce some report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but that is another problem.
I think I explained the basics of this somewhere once before. The gist of it is that:
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or textual serialization. My digest package uses that to turn the serialization into a hash, using different hash functions.
R has all the DB connectivity you need.
Now, what a suitable format and db schema is ... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
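The store-a-serialized-blob-plus-queryable-metadata pattern, sketched here in Python with pickle and sqlite3 for illustration; in R the same shape falls out of serialize()/unserialize() with RSQLite (file and column names are invented):

    import pickle
    import sqlite3

    con = sqlite3.connect("models.db")
    con.execute("""CREATE TABLE IF NOT EXISTS fits
                   (country TEXT, indicator TEXT, object BLOB)""")

    # The whole fitted object goes in as one serialized blob; the
    # identifying attributes become ordinary, queryable columns.
    obj = {"data": [1.0, 2.0, 2.1], "model": "ARIMA(1,0,0)"}
    con.execute("INSERT INTO fits VALUES (?, ?, ?)",
                ("USA", "GDP", pickle.dumps(obj)))
    con.commit()

    # SELECT delivers just the subset you asked for; deserialize on the way out.
    row = con.execute("SELECT object FROM fits WHERE country = 'USA'").fetchone()
    print(pickle.loads(row[0]))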
This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found. I recognise three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Besides the option of using a relational database, one can also use the HDF5 file format, which is designed to store a large number of possibly large objects. The choice depends on the type of data and the intended way of accessing it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
there is no anticipation in which subsets the data will be read out
convenient transfer of the data from one computer to another is not an issue or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogeneous and cannot be combined into a table-like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchical relationships, where the latter are contained in the former. Within an HDF5 file, the chunks of information can be arranged in a hierarchical way, e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available here.
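The layout above, written with Python's h5py for illustration (rhdf5 offers the equivalent calls in R); the file name and the example values are made up:

    import h5py
    import numpy as np

    # Write the hierarchical layout sketched above.
    with h5py.File("indicators.h5", "w") as f:
        f["Germany/GDP/data"] = np.array([3.3, 3.4, 3.5])
        f["Germany/GNP/data"] = np.array([3.1, 3.2, 3.3])
        f["Austria/GDP/data"] = np.array([0.39, 0.40, 0.41])

    # Read back only the group you need, without touching the rest.
    with h5py.File("indicators.h5", "r") as f:
        print(f["Germany/GDP/data"][:])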
Not sure if it is the same, but I had some good experience with time series objects with:

    str()

Maybe you can look into that.