Neo4j or MongoDB for relative attributes/relations - database

I wish to build a database of objects with various types of relations between them. It will be easier for me to explain by giving an example.
I wish to have a set of objects, each is described by a unique name and a set of attributes (say, height, weight, colour, etc.), but instead of values, these attributes may contain values which are relative to other objects. For example, I might have two objects, A and B, where A has height 1 and weight "weight of B + 2", and B has height "height of A + 3" and weight 4.
Some objects may have completely other attributes; for example, object C may represent a box, and objects A and B will be related to C by the relations "I appear x times in C".
Queries may include "what is the height of A/B" or what is the total weight of objects appearing in C with multiplicities.
I am a bit familiar with MongoDB, and fond of its simplicity. I heard of Neo4j, but never tried working with it. From its description, it sounds more suitable for my need (but I can't tell it is capable of the task). But is MongoDB, with its simplicity, suitable as well? Or perhaps a different database engine?
I am not sure it matters, but I plan to use python as the engine which processes the queries and their outputs.

Either can do this. I tend to prefer neo4j, but either way could work.
In neo4j, you'd create a graph consisting of a node (A) and its "base" (B). You could then connect them like this:
(A:Node { weight: "base+2" })-[:base]->(B:Node { weight: 2 })
Note that modeling in this way would make it possible to change the base relationship to point to another node without changing anything about A. The downside is that you'd need a mini calculator to expand expressions like "base+2", which is easy but in any case extra work.
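The "mini calculator" mentioned above could look roughly like the following Python sketch. It is only an illustration under assumptions: the `nodes` dictionary is a hypothetical in-memory stand-in for the graph (not a Neo4j API), and the only supported expression form is "base" plus or minus a number.

```python
import re

# Hypothetical stand-in for the graph: each node has a weight (a number or an
# expression such as "base+2") and the name of the node its :base edge points to.
nodes = {
    "A": {"weight": "base+2", "base": "B"},
    "B": {"weight": 2, "base": None},
}

# Accepts only "base+<number>" or "base-<number>"; anything else is rejected.
_EXPR = re.compile(r"^\s*base\s*([+-])\s*(\d+(?:\.\d+)?)\s*$")


def resolve_weight(name, seen=None):
    """Return the numeric weight of a node, expanding "base+N" expressions."""
    seen = set() if seen is None else seen
    if name in seen:
        raise ValueError(f"cyclic base reference at {name!r}")
    seen.add(name)

    node = nodes[name]
    value = node["weight"]
    if isinstance(value, (int, float)):
        return value

    match = _EXPR.match(value)
    if match is None:
        raise ValueError(f"unsupported expression {value!r}")
    if node["base"] is None:
        raise ValueError(f"{name!r} uses 'base' but has no base node")

    sign, amount = match.groups()
    base_value = resolve_weight(node["base"], seen)
    delta = float(amount)
    return base_value + delta if sign == "+" else base_value - delta
```

Calling `resolve_weight("A")` follows the base link to B and adds 2. Re-pointing `nodes["A"]["base"]` at another node changes the result without touching A's own expression, which is the benefit described above.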
Interpreting your question another way, you're in a situation where you'd probably want a trigger. Here's an article on neo4j triggers, how graphs handle this. If parsing that expression "base+2" at read time isn't what you want, and you want to actually set the value on A to be b.weight + 2, then you want a trigger. This would let you define some other function to be run when the graph gets updated in a certain way. In this case, when someone inserts a new :base relationship in the graph, you might check the base value (endpoint of the relationship) and add 2 to its weight, and set that new property value on the source of the relationship.

Yes, you can use either DBMS.
To help you decide, this is an example of how to support your use cases in neo4j.
To create your sample data:
CREATE
(a:Foo {id: 'A'}), (b:Foo {id: 'B'}), (c:Box {id: 123}),
(h1:Height {value: 1}), (w4:Weight {value: 4}),
(a)-[:HAS_HEIGHT {factor: 1, offset: 0}]->(h1),
(a)-[:HAS_WEIGHT {factor: 1, offset: 2}]->(w4),
(b)-[:HAS_WEIGHT {factor: 1, offset: 0}]->(w4),
(b)-[:HAS_HEIGHT {factor: 1, offset: 3}]->(h1),
(c)-[:CONTAINS {count: 5}]->(a),
(c)-[:CONTAINS {count: 2}]->(b);
"A" and "B" are represented by Foo nodes, and "C" by a Box node. Since a given height or weight can be referenced by multiple nodes, this example data model uses shared Weight and Height nodes. The HAS_HEIGHT and HAS_WEIGHT relationships have factor and offset properties to allow adjustment of the height or weight for a particular Foo node.
To query "What is the height of A":
MATCH (:Foo {id: 'A'})-[ra:HAS_HEIGHT]->(ha:Height)
RETURN ra.factor * ha.value + ra.offset AS height;
To query "What is the ratio of the heights of A and B":
MATCH
(:Foo {id: 'A'})-[ra:HAS_HEIGHT]->(ha:Height),
(:Foo {id: 'B'})-[rb:HAS_HEIGHT]->(hb:Height)
RETURN
TOFLOAT(ra.factor * ha.value + ra.offset) /
(rb.factor * hb.value + rb.offset) AS ratio;
Note: TOFLOAT() is used above to make sure integer division, which would truncate, is never used.
To query "What is the total weight of objects appearing in C":
MATCH (:Box {id: 123})-[c:CONTAINS]->()-[rx:HAS_WEIGHT]->(wx:Weight)
RETURN SUM(c.count * (rx.factor * wx.value + rx.offset));
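To make the arithmetic behind these queries concrete, here is a plain-Python mirror of the sample data. It is not a Neo4j client, just a sketch of the factor/offset calculation; the dictionary names are made up for illustration.

```python
# Shared value nodes from the sample data.
heights = {"h1": 1}
weights = {"w4": 4}

# (target value node, factor, offset), as stored on the relationships.
has_height = {"A": ("h1", 1, 0), "B": ("h1", 1, 3)}
has_weight = {"A": ("w4", 1, 2), "B": ("w4", 1, 0)}

# Counts stored on the CONTAINS relationships of box 123.
contains = {"A": 5, "B": 2}


def effective(rel, values):
    """Apply factor and offset to the shared value a relationship points at."""
    target, factor, offset = rel
    return factor * values[target] + offset


height_a = effective(has_height["A"], heights)   # 1 * 1 + 0
height_b = effective(has_height["B"], heights)   # 1 * 1 + 3
ratio = height_a / height_b                      # true division, like TOFLOAT()
total_weight = sum(
    count * effective(has_weight[name], weights)
    for name, count in contains.items()
)                                                # 5 * (4 + 2) + 2 * (4 + 0)
```

With the sample data this gives a height of 1 for A, 4 for B, a ratio of 0.25, and a total weight of 38 for the box.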

I have not used Mongo and decided not to after studying it. So filter my opinion with that in mind; users may find my comments easy to overcome. Mongo is not a true graph database. The user must create and manage the relationships. In Neo4j, relationships are "native" and robust.
There is a head to head comparison at this site:
[https://db-engines.com/en/system/MongoDB%3bNeo4j]
see also: https://stackoverflow.com/questions/10688745/database-for-efficient-large-scale-graph-traversal
There is a distinction between NoSQL (e.g., Mongo) and a true graph database. Many seem to assume that if it is not SQL then it's a graph database. This is not true. Most NoSQL databases do not store relationships. The free book describes this in chapter 2.
Personally, I'm sold on Neo4j. It makes relationships, traversing graphs and collecting lists along the path easy and powerful.

Related

Data structure for set inclusion queries

What I've done so far: I have a map of sets (sort of database), where each set is a collection of strings (different features measured by a doctor). It looks like this:
{{"temperature", "blood pressure"}: Model1, {"temperature", "weight"}: Model2, {"temperature", "blood pressure", "weight"}: Model3}
Each set maps to an ML model that is used for that particular set of measurements. Sets may have different numbers of features and overlapping features (e.g. "temperature" is frequent).
Task: doctor makes some measurements, e. g. someone measured only {"temperature", "weight"}. I have to check which sets from my database are inclusive in this set, so I know which model can be used with this data, e. g. for this example there is Model2 available. It's okay if the model does not require all measured features - I only require that the model does not need more features than those measured. I need a data structure to effectively make such queries.
Data: it's not organized in any way yet, and I'm not bound to a particular language (I prefer Python, since the rest of the application is in it, but it's not required). I can modify it in any way that's required, e.g. identify a model by a string ID, or throw this into some relational/non-relational database.
Question: what data structure / database type / data organization will be effective for such queries? I'm open to implementing data structures myself, as well as using SQL, MongoDB or any other solution.
The data structure that you want is a trie. That is, order the features in a canonical order (alphabetical works, frequency of feature in your models is somewhat more efficient) and put them into a nested structure (models, further_lookups) like this:
([], {
    'blood pressure': ([], {
        'temperature': ([Model1], {
            'weight': ([Model3], {})
        }),
    }),
    'temperature': ([], {
        'weight': ([Model2], {})
    }),
})
And now given a particular set of fields, you navigate in along the path you have and collect all of the models you run across. Which is to say [] for the start, [] after 'temperature' and [Model2] after 'weight'.
Note that you have to try both using and not using any particular field. So if you had 'blood pressure' as well then you'd need to both try searching for models with 'blood pressure' and without. This is easily done with recursion. It theoretically can take exponential time in the number of features that you have, but is unlikely to do so in practice.
I don't have recommendations for a good implementation of a trie in a data store.
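A minimal in-memory Python sketch of the trie lookup described above, assuming alphabetical order as the canonical order and plain strings as model identifiers. The function names `build_trie` and `find_models` are made up for this example.

```python
def build_trie(models):
    """Build a nested (models, further_lookups) trie from {feature set: model}."""
    root = ([], {})
    for features, model in models.items():
        node = root
        for feature in sorted(features):          # canonical (alphabetical) order
            node = node[1].setdefault(feature, ([], {}))
        node[0].append(model)
    return root


def find_models(trie, measured):
    """Return every model whose feature set is a subset of `measured`."""
    ordered = sorted(measured)
    found = []

    def walk(node, start):
        found.extend(node[0])                     # models that end at this node
        children = node[1]
        # Trying each later feature as the next step covers both "use this
        # feature" and "skip this feature" without visiting missing branches.
        for i in range(start, len(ordered)):
            child = children.get(ordered[i])
            if child is not None:
                walk(child, i + 1)

    walk(trie, 0)
    return found


models = {
    frozenset({"temperature", "blood pressure"}): "Model1",
    frozenset({"temperature", "weight"}): "Model2",
    frozenset({"temperature", "blood pressure", "weight"}): "Model3",
}
trie = build_trie(models)
```

For the example measurement, `find_models(trie, {"temperature", "weight"})` returns only Model2, while measuring all three features returns all three models. The feature sets are stored as `frozenset` keys because ordinary sets are not hashable.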

Neo4j: How can I display labels as nodes?

I have a question about Neo4j. I need to show labels in my graph database as node - like if I have only two types of labels in my database (for example Thing and Person), I want to have 2 extra nodes - Thing and Person with relationships to normal nodes.
Example - I have this:
Orange node is Person, red is Thing. So I want to have extra label nodes for every label in graph. So I want this:
Can this be created automatically?
You do not really want to do that, since a visualization with N nodes would then have N extraneous relationships to the special "label" nodes, making it hard (or even impossible) to see the actual data. Using different colors for different labels is a good compromise.
In any case, the top of the result panel (in the neo4j Browser) tells you which color belongs to which label, so you can already easily get the information you want.
[UPDATE]
However, if you really need to do something like that, there is no "automated" way. But you could use some APOC procedures to create virtual nodes and relationships that are not stored in the DB, but which can be visualized.
For example, if your original Cypher query is:
MATCH path=(p:Person)-[r:RELTYPE]->(t:Thing)
RETURN *
you can use this query to generate the appropriate virtual nodes and relationships:
MATCH path=(p:Person)-[r:RELTYPE]->(t:Thing)
WITH COLLECT(path) AS paths, COLLECT(DISTINCT p) AS ps, COLLECT(DISTINCT t) AS ts
CALL apoc.create.vNode(['V_Label'], {label: 'Person'}) YIELD node AS pLabel
CALL apoc.create.vNode(['V_Label'], {label: 'Thing'}) YIELD node AS tLabel
UNWIND ps AS person
CALL apoc.create.vRelationship(person, 'IS', {}, pLabel) YIELD rel AS pRel
WITH paths, ts, pLabel, tLabel, COLLECT(pRel) AS pRels
UNWIND ts AS thing
CALL apoc.create.vRelationship(thing, 'IS', {}, tLabel) YIELD rel AS tRel
RETURN *
A sample resulting visualization:

How do i properly design a data model for Recipe / Ingredient using MongoDB

Recently I designed a database model (ERD) using Hackolade.
The problem I'm facing is that, based on my current design, I can't query it the way I want. I studied ERDs with MySQL and I know that Mongo doesn't work the same way.
The idea is simple: I want a recipe that has an array of ingredients, where the ingredients come from a separate collection.
The recipe also holds the measurement of each ingredient, e.g. (1 tbsp sugar).
I also want to query by a list of ingredients and find the recipes that contain them.
I want these collections to be in a many-to-many relationship, so that a recipe can use ingredients that are already in the database.
I just don't know how to query the data.
I have tried a lot of ways using $elemMatch and populate, and all I get is an empty array as a result.
I'm expecting two types of query, where I can query by ingredient name or by recipe.
My expected result would look like this:
[{
    id: ...,
    name: ....,
    description: ...,
    macros: [...],
    ingredients: [
        {
            id,
            amount: ....,
            unit: ....,
            ingredient: {
                id: ....,
                name: ....
            }
        }
    ]
}, { ... }]
But instead I am getting
[]
Imho, your design is utterly wrong. You over normalized your data. I would do something much simpler and use embedding. The reasoning behind that is that you define your use cases first and then you model your data to answer the question arising from your use cases in the most efficient way.
Assumed use cases
As a user, I want a list of all recipes.
As a user, I want a list of all recipes by ingredient.
As a designer, I want to be able to show a list of all ingredients.
As a user, I want to be able to link to recipes for compound ingredients, should it be present on the site.
Surely, this is just a small excerpt, but it is sufficient for this example.
How to answer the questions
Ok, the first one is extremely simple:
db.recipes.find()[.limit()[.skip()]]
Now, how could we find by ingredient? Simple answer: create a text index on ingredient names (and probably some other fields, as you can only have one text index per collection). Then, the query is equally simple:
db.recipes.find({$text:{$search:"ingredient name"}})
"Hey, wait a moment! How do I get a list of all ingredients?" Let us assume we want a simple list of ingredients, with a number on how often they are actually used:
db.recipes.aggregate([
    // We want all ingredients as single values
    {$unwind:"$Ingredients"},
    // We want the response to be "Ingredient"
    {$project:{_id:0,"Ingredient":"$Ingredients.Name"}},
    // We count the occurrence of each ingredient
    // in the recipes
    {$group:{_id:"$Ingredient",count:{$sum:1}}}
])
This would actually be sufficient, unless you have a database of gazillions of recipes. In that case, you might want to have a deep look into incremental map/reduce instead of an aggregation. Hint: You should add a timestamp to the recipes to be able to use incremental map/reduce.
If you have a couple of hundred K to a couple of million recipes, you can also add an $out stage to preaggregate your data.
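For readers more at home in Python, the ingredient count computed by the aggregation can be mimicked on plain dictionaries. This is only a sketch of what the stages compute, not a MongoDB client call; the two sample recipes are made up and assume embedded `Ingredients` entries with a `Name` field.

```python
from collections import Counter

# Made-up sample documents with embedded ingredients.
recipes = [
    {
        "Name": "Surf & Turf Kebap",
        "Ingredients": [{"Name": "Prawns"}, {"Name": "Garlic Oil"}],
    },
    {
        "Name": "Garlic Prawns",
        "Ingredients": [
            {"Name": "Prawns"},
            {"Name": "Garlic Oil"},
            {"Name": "Parsley"},
        ],
    },
]


def ingredient_counts(docs):
    """Count how many times each ingredient name occurs across all recipes."""
    counts = Counter()
    for doc in docs:
        # Equivalent of $unwind: visit every embedded ingredient on its own.
        for item in doc.get("Ingredients", []):
            # Equivalent of $group with {$sum: 1}, keyed by ingredient name.
            counts[item["Name"]] += 1
    return counts


counts = ingredient_counts(recipes)
```

Each key of `counts` corresponds to one `_id` in the aggregation result, and its value to the `count` field.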
On measurements
Imho, it makes no sense to have defined measurements. There are teaspoons, tablespoons, metric and imperial measurements, groupings like "dozen" or specifications like "clove". Which you really do not want to convert to each other or even set to a limited number of measurements. How many ounces is a clove of garlic? ;)
Bottom line: Make it a free text field, maybe with some autocomplete suggestions.
Revised data model
Recipe
{
    _id: new ObjectId(),
    Name: "Surf & Turf Kebap",
    Ingredients: [
        {
            Name: "Flank Steak",
            Measurement: "200 g"
        },
        {
            Name: "Prawns",
            Measurement: "300g",
            Note: "Fresh ones!"
        },
        {
            Name: "Garlic Oil",
            Measurement: "1 Tablespoon",
            Link: "/recipes/5c2cc4acd98df737db7c5401"
        }
    ]
}
And the example of the text index:
db.recipes.createIndex({Name:"text","Ingredients.Name":"text"})
The theory behind it
A recipe is your basic data structure, as your application is supposed to store and provide them, potentially based on certain criteria. Ingredients and measurements (to the extent that it makes sense) can easily be derived from the recipes. So why bother storing ingredients and measurements independently? It only makes your data model unnecessarily complicated, while not providing any advantage.
hth

Referential Integrity with Neo4j

I am working on a project that uses a graph database to hold click data for a search engine. The nodes can be search terms or urls, and the edges hold a weight attribute, and a percentage of times that search led to someone clicking that URL.
Number of times the URL was clicked / Number of times term was searched
My issue is that when I update the edges, the percentage will be accurate, but if I later update the search term node and the searched count changes, the edge will no longer have the correct percentage. Is there a way in Neo4j to keep referential integrity? like a foreign key type thing?
The following info might be helpful.
If you stored the number of clicks instead of the percentage, there is no way to get inconsistent data. For example:
(:Term {id: 1, nSearches: 123})-[:HAS_URL {weight: 2, nClicks: 17}]->(:Url {id: 2})
With this data model, you'd calculate the percentage whenever you needed it.
For example, to find the 10 terms that have the highest percentage of visits to a specific URL:
MATCH (term:Term)-[r:HAS_URL]->(url:Url {id: 2})
RETURN url, term
ORDER BY TOFLOAT(r.nClicks)/term.nSearches DESC
LIMIT 10;
But notice that the inverse query (find the 10 URLs that have the highest percentage of visits from a specific term) does not even require that you calculate the percentage! This is because in this case the percentages all have the same denominator. So, you can just use nClicks for sorting:
MATCH (term:Term {id: 1})-[r:HAS_URL]->(url:Url)
RETURN term, url
ORDER BY r.nClicks DESC
LIMIT 10;
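The same idea in a small Python sketch: store only the raw counts and derive the percentage when it is read, so it can never go stale. The sample numbers, `n_searches`, `edges`, and the function names are all made up for illustration.

```python
# Made-up sample data: searches per term id, and (term id, url id, nClicks) edges.
n_searches = {1: 123, 2: 40}
edges = [(1, 2, 17), (2, 2, 10), (1, 3, 50)]


def click_rate(term_id, n_clicks):
    """Derive the percentage at read time from the stored counts."""
    return n_clicks / n_searches[term_id]


def top_terms_for_url(url_id, limit=10):
    """Terms ranked by the share of their searches that led to this URL."""
    rows = [(t, click_rate(t, c)) for t, u, c in edges if u == url_id]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


def top_urls_for_term(term_id, limit=10):
    """URLs ranked for one term; the common denominator makes nClicks enough."""
    rows = [(u, c) for t, u, c in edges if t == term_id]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]
```

If `n_searches[1]` later changes, the next call to `top_terms_for_url` reflects it automatically, because no percentage was ever stored on the edge.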
Unfortunately no, neo4j doesn't support this. You can still do it, with one of two methods. I'll tell you what they both are, then make a recommendation.
Relative to your relational database, I don't think you're looking for a foreign key or "referential integrity" -- I think what you're looking for is more like a trigger. A trigger is like a function or procedure that executes when data changes. In your case, it'd probably be good to have trigger functions that re-calculated all of the weight percentages on incident edges.
Option 1 - The capable Max De Marzi has got you covered there with a description of how you can do triggers in neo4j. Spoiling the surprise, there's a TransactionEventHandler in the java API. When the right kind of transaction comes through, you can catch that and do extra stuff.
Option 2 - the server provides an extension/plugin mechanism so that you could write this on your own. This is a big hammer, it can do just about anything, but it's harder to wield, too.
I'd recommend you look into Max's post and the TransactionEventHandler. You might then implement public void afterCommit(TransactionData transactionData, Object o). In that method, you'd check the transaction data to see if it was something of interest (not all transactions would be). If the transaction updated a search term node or changed its searched count, then you'd do your recomputation, fix your weights, and you should be good.

Neo4j output format

After working with neo4j and now coming to the point of considering to make my own entity manager (object manager) to work with the fetched data in the application, i wonder about neo4j's output format.
When i run a query it's always returned as tabular data. Why is this??
Sure tables keep a big place in data and processing, but it seems so strange that a graph database can only output in this format.
Now when I want to create an object graph in my application, I have to hydrate all the objects, which is not good for performance and doesn't leverage true graph performance.
Consider MATCH (A)-->(B) RETURN A, B when there is one A and three B's, it would return:
A B
1 1
1 2
1 3
That's the same A passed down 3 times over the database connection, while i only need it once and i know this before the data is fetched.
Something like this seems great http://nigelsmall.com/geoff
a load2neo is nice, a load-from-neo would also be nice! either in the geoff format or any other formats out there https://gephi.org/users/supported-graph-formats/
Each language could then implement its own functions to create the objects directly.
To clarify:
Relations between nodes are lost in tabular data
Redundant (non-optimal) format for graphs
Edges (relations) and vertices (nodes) are usually not in the same table. (makes queries more complex?)
Another consideration (which might deserve its own post): what's a good way to model relations in an object graph? As objects? Or as data/methods inside the node objects?
@Kikohs
Q: What do you mean by "Each language could then implement its own functions to create the objects directly."?
A: With a (partial) graph provided by the database (as the result of a query), a language such as PHP could provide a factory method (preferably in C) to construct the object graph (this is usually an expensive operation). But only if the object graph is well defined in a standard format (because this function should be simple and universal).
Q: Do you want to export the full graph or just the result of a query?
A: The result of a query. However a query like MATCH (n) OPTIONAL MATCH (n)-[r]-() RETURN n, r should return the full graph.
Q: you want to dump to the disk the subgraph created from the result of a query ?
A: No, existing interfaces like REST are preferred to get the query result.
Q: do you want to create the subgraph which comes from a query in memory and then request it in another language ?
A: No, I want the result of the query in a format other than tabular (examples mentioned).
Q: You make a query which only returns the name of a node, in this case, would you like to get the full node associated or just the name ? Same for the edges.
A: Nodes don't have names. They have properties, labels and relations. I would like enough information to retrieve A) the node ID, its labels, its properties and B) the relations to other nodes which are in the same result.
Note that the first part of the question is not a concrete "how-to" question, rather "why is this not possible?" (or if it is, i like to be proven wrong on this one). The second is a real "how-to" question, namely "how to model relations". The two questions have in common that they both try to find the answer to "how to get graph data efficiently in PHP."
@Michael Hunger
You have a point when you say that not all result data can be expressed as an object graph. It is reasonable to say that an alternative output format would only complement the table format, not replace it.
I understand from your answer that the natural (rawish) output format from the database is the result format with duplicates in it ("streams the data out as it comes"). In that case I understand that it's now left to another program (of the dev stack) to do the mapping. So my conclusion on neo4j implementing something like this:
Pro's - not having to do this in every implementation language (of the application)
Con's - 1) no application specific mapping is possible, 2) no performance gain if implementation language is fast
"Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results."
I don't understand this point entirely. Are you saying that these formats are not able to hold deduplicated results (in certain cases)? So in fact there is no possible textual format with which a graph can be described without duplication?
"There is also the questions on what you want to include in your output?"
I was under the assumption that the cypher language was powerful enough to specify this in the query. And so the output format would have whatever the database can provide as result.
"You could just return the paths that you get, which are unique paths through the graph in themselves".
Useful suggestion, i'll play around with this idea :)
"The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it".
Does the enriching process fetch additional data from the database or is the data already contained in the initial result?
There is more to it.
First of all as you said tabular results from queries are really commonplace and needed to integrate with other systems and databases.
Secondly oftentimes you don't actually return raw graph data from your queries, but aggregated, projected, sliced, extracted information out of your graph. So the relationships to the original graph data are already lost in most of the results of queries I see being used.
The only time that people need / use the raw graph data is when to export subgraph-data from the database as a query result.
The problem with doing that as a de-duplicated graph is that the db has to fetch all the result data into memory first to deduplicate, extract the needed relationships, etc.
Normally it just streams the data out as it comes and uses little memory with that.
Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results (which are returned as paths with potential duplicate nodes and relationships).
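A small client-side Python sketch of why deduplication needs memory: to know whether a node or relationship has already been emitted, everything seen so far has to be remembered. The row shape (dictionaries with an `id` key) is an assumption made for this example, not the format of any particular driver.

```python
def rows_to_graph(rows):
    """Collapse (source, relationship, target) rows into unique nodes and edges."""
    nodes = {}   # node id -> node data; grows with the whole result
    edges = {}   # relationship id -> (source id, type, target id)
    for source, rel, target in rows:
        # setdefault keeps the first copy and ignores later duplicates.
        nodes.setdefault(source["id"], source)
        nodes.setdefault(target["id"], target)
        edges.setdefault(rel["id"], (source["id"], rel["type"], target["id"]))
    return nodes, edges


# One A connected to three Bs: A is repeated in every tabular row.
a = {"id": 1, "name": "A"}
rows = [
    (a, {"id": 10, "type": "LINKS_TO"}, {"id": 2, "name": "B1"}),
    (a, {"id": 11, "type": "LINKS_TO"}, {"id": 3, "name": "B2"}),
    (a, {"id": 12, "type": "LINKS_TO"}, {"id": 4, "name": "B3"}),
]
nodes, edges = rows_to_graph(rows)
```

The three rows collapse into four nodes and three edges, but `nodes` and `edges` have to stay in memory until the last row has been read, which is the cost described above.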
There is also the question of what you want to include in your output. Just the nodes and rels returned? Or additionally all the other rels between the nodes that you return? Or all the rels of the returned nodes (but then you also have to include the end-nodes of those relationships)?
You could just return the paths that you get, which are unique paths through the graph in themselves:
MATCH p = (n)-[r]-(m)
WHERE ...
RETURN p
Another way to address this problem in Neo4j is to use sensible aggregations.
E.g. what you can do is to use collect to aggregate data per node (i.e. kind of subgraphs)
MATCH (n)-[r]-(m)
WHERE ...
RETURN n, collect([r,type(r),m])
or use the new literal map syntax (Neo4j 2.0)
MATCH (n)-[r]-(m)
WHERE ...
RETURN {node: n, neighbours: collect({ rel: r, type: type(r), node: m})}
The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it and then outputting it as cypher create statement(s).
A similar approach can be used for other output formats too if you need it. But so far there hasn't been the need.
If you really need this functionality it makes sense to write a server-extension that uses cypher for query specification, but doesn't allow return statements. Instead you would always use RETURN *, aggregate the data into an in-memory structure (SubGraph in the org.neo4j.cypher packages). And then render it as a suitable format (e.g. JSON or one of those listed above).
These could be starting points for that:
https://github.com/jexp/cypher-rs
https://github.com/jexp/cypher_websocket_endpoint
https://github.com/neo4j-contrib/rabbithole/blob/master/src/main/java/org/neo4j/community/console/SubGraph.java#L123
There are also other efforts, like GraphJSON from GraphAlchemist: https://github.com/GraphAlchemist/GraphJSON
And the d3 json format is also pretty useful. We use it in the neo4j console (console.neo4j.org) to return the graph visualization data that is then consumed by d3 directly.
I've been working with neo4j for a while now and I can tell you that if you are concerned about memory and performances you should drop cypher at all, and use indexes and the other graph-traversal methods instead (e.g. retrieve all the relationships of a certain type from or to a start node, and then iterate over the found nodes).
As the documentation says, Cypher is not intended for in-app usage, but more as an administration tool. Furthermore, in production-scale environments, it is VERY easy to crash the server by running the wrong query.
Secondly, there is no mention in the docs of an API method to retrieve the output as a graph-like structure. You will have to process the output of the query and build it yourself.
That said, in the example you give you say that there is only one A and that you know it before the data is fetched, so you don't need to do:
MATCH (A)-->(B) RETURN A, B
but just
MATCH (A)-->(B) RETURN B
(you don't need to receive A three times because you already know these are the nodes connected with A)
or better (if you need info about the relationships) something like
MATCH (A)-[r]->(B) RETURN r
