I'm trying to wrap my head around graph databases, so maybe someone could help explain the right way to model this relationship. This is mostly from the perspective of Neo4j, but I assume it would be applicable to most graph databases.
I have a Recipe node and Ingredient nodes. The Ingredient nodes have an ingredient_in relationship to the Recipe node. The relationship will hold several attributes; of particular note is an amount field with a unit of measure.
I can imagine that elsewhere in the graph there would be UnitOfMeasure nodes that have converts_to relationships with a conversion ratio.
The point I'm struggling with is how to represent the Ingredient->Recipe relationship as having a UnitOfMeasure. Coming from an RDBMS background, this feels like I need another node in between, but that feels wrong for a graph database.
It depends on two things:
a) do you have attributed relations or n-ary relations
b) how do you use the units and amounts - possibly a node in between is easier
Imo, using a "normal" design like this
Recipe -- Entry -- Ingredient
          amount: double
             |
             |
        UnitOfMeasure
is fine with Entry being a Node - even if you use a graph database which can handle attributed edges. The design would be quite the same with an attributed n-ary edge btw - the only difference would be that Entry, now possibly named "contains", would be an Edge not a Node.
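To make the shape of that design concrete, here is a minimal in-memory sketch in Python rather than in any particular graph database. The class names follow the diagram; the `amount_in` helper, the `converts_to` dictionary and the sample data are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UnitOfMeasure:
    name: str


@dataclass(frozen=True)
class Ingredient:
    name: str


@dataclass
class Entry:
    """The in-between node: one ingredient line of a recipe."""
    ingredient: Ingredient
    amount: float
    unit: UnitOfMeasure


@dataclass
class Recipe:
    name: str
    entries: List[Entry] = field(default_factory=list)


# Stand-in for converts_to relationships: (from, to) -> ratio.
Ratios = Dict[Tuple[str, str], float]


def amount_in(entry: Entry, target: UnitOfMeasure, ratios: Ratios) -> float:
    """Express an entry's amount in the target unit (hypothetical helper)."""
    if entry.unit == target:
        return entry.amount
    return entry.amount * ratios[(entry.unit.name, target.name)]


kilogram, gram = UnitOfMeasure("kilogram"), UnitOfMeasure("gram")
converts_to: Ratios = {("kilogram", "gram"): 1000.0}

bread = Recipe("bread")  # hypothetical sample data
bread.entries.append(Entry(Ingredient("flour"), 0.5, kilogram))

print(amount_in(bread.entries[0], gram, converts_to))  # 500.0
```

Because Entry is an object of its own, it can point at a UnitOfMeasure just like any other node, which is the main argument for the in-between node.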
I've been tasked with looking into Neo4j for our business needs. I've created some very small graphs to get used to the cypher syntax.
We have a scenario where a user will be able to search via many options, which will then need to show their related data and keep track of the items available in stock as the results are filtered down. Here is a simple example (with the same design as what we will need). We might have 4 items of clothing (T-shirt, sweater, jeans, shirt), and the user can select any one of them to reveal its sizes, colours etc. and keep track of the number in stock. However, the user should also be able to select a size or colour first to reveal the different items (T-shirt, jeans etc.). Basically, different combinations depending on what is selected.
Jeans (20 in stock) > red (6) > small (2) or large (4), Jeans > green > small or large, Small > red > t-shirts, shirts, Green > large > t-shirts
In this scenario, would the colour and size nodes need to be repeated for each item, or could I just create them once and reuse them? This is the thing I am a little confused about. We will have potentially 150+ choices (a list of countries) for one option node, and if each one has its own unique nodes related to it (which are repeated as new nodes for other options), that is a lot of duplicates. We could have a million plus nodes...
Sorry if this is a dumb question! Just trying to gather if there is a particular way of handling this kind of use case in Neo4j.
Thank you very much for your help and advice. :)
In essence, this problem can be traced back to the good old attributes vs. entities question in ER modeling.
Using separate entities. Creating singleton nodes for colors, sizes, countries etc. seems a working solution, and you can reuse them for multiple items. For example, if you want to assign the color red to an item n (assuming n is already bound earlier in the query), you'd issue this query: MATCH (r:Color {name: 'red'}) CREATE (n)-[:HAS_COLOR]->(r). To select all red items, use MATCH (n:Item)-[:HAS_COLOR]->(:Color {name: 'red'}) RETURN n. This approach makes it easy to select all available colors, e.g. MATCH (c:Color) RETURN DISTINCT c
Using attributes. Using properties should also work fine. Filtering is even easier (MATCH (n:Item {color: 'red'}) RETURN n), and listing available colors can be implemented with MATCH (n) RETURN DISTINCT n.color
In conclusion, as with most data modeling questions, you'll probably need to go through a couple of iterations to get the data model right and maybe do some benchmarking/performance tuning as well. Fortunately, Neo4j makes it very easy to experiment with different data models.
There are often many ways to create your data model, and you'll have to weigh the pros and cons and figure out what may work best.
Gabor covers one aspect to consider, attributes (properties) vs entities, in good detail.
Another aspect, just considering entities, is whether you want to use a tree structure, drilling down to a specific item whose attributes are defined by the nodes above it in the tree, or a model in which shared attribute nodes are connected directly to each item.
For example, you might have a tree like this:
(jeans:Clothing:Attribute{type:'jeans'})-[:COLOR]->(jeansColor:Color:Attribute{type:'red'})
(jeansColor)-[:SIZE]->(:Size:Attribute{type:'small'})-[:QUANTITY]->(:Stock{quantity:2})
(jeansColor)-[:SIZE]->(:Size:Attribute{type:'large'})-[:QUANTITY]->(:Stock{quantity:4})
In this model each successive node in the hierarchy only has a single parent. The :Color node with type 'red' would only be applicable to the :Clothing node for 'jeans', and there would be other :Color nodes for 'red' in different hierarchies for different types of clothing. Similarly, :Size nodes would only have meaning within their hierarchy, so the 'small' and 'large' sizes above would only be applicable to red jeans, and the :Stock nodes would be specific to the hierarchy as well. We're using a second label :Attribute on :Color and :Size nodes so we can address those nodes more generically if we want.
Queries for stock at each level would use variable-length relationships down to :Stock nodes and sum the quantities like so:
MATCH (:Clothing{type:'jeans'})-[*]->(s:Stock)
RETURN sum(s.quantity) as stock
Queries for type would work the other direction (note we can use the :Attribute label instead of :Color here if we wanted):
MATCH (:Color{type:'red'})<-[*]-(clothing:Clothing)
RETURN collect(distinct clothing.type) as clothing
This model requires fairly rigid trees, and many duplicated nodes (as nodes with the same properties need to be duplicated across different branches of the trees).
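As a rough illustration of the tree model outside Neo4j, the same hierarchy can be sketched as nested Python dictionaries. The red jeans quantities come from the question; the green and t-shirt figures and the helper names are invented for the example.

```python
# clothing -> colour -> size -> quantity in stock
tree = {
    "jeans": {
        "red": {"small": 2, "large": 4},
        "green": {"small": 6, "large": 8},  # made-up split of the remaining 14
    },
    "t-shirt": {
        "red": {"small": 3},  # a second, separate 'red' node; made-up quantity
    },
}


def stock(node):
    """Sum every quantity below a node, like the variable-length match."""
    if isinstance(node, int):
        return node
    return sum(stock(child) for child in node.values())


def clothing_with_colour(tree, colour):
    """Walk back up from a colour to the clothing types that carry it."""
    return sorted(name for name, colours in tree.items() if colour in colours)


print(stock(tree["jeans"]))               # 20
print(stock(tree["jeans"]["red"]))        # 6
print(clothing_with_colour(tree, "red"))  # ['jeans', 't-shirt']
```

Note that "red" has to appear once under every clothing type that comes in red, which is the duplication cost of this model.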
An alternate model to consider is one where the attribute nodes (:Clothing, :Color, :Size, and so on) have relationships directly with the related item, so each item is connected to all attributes which apply to it, similar to point 1 in Gabor's answer.
In this model, there is only one of each attribute node, so you won't have to deal with node duplication, but as the number of items in your db gets larger, the work done in your matches might get more complicated, since you would be looking for item nodes at the intersection of all the items connected to the attributes you're searching for (so to find small red jeans you would expand to all items from each of the small, red, and jeans attribute nodes and only keep the ones that are common between the three).
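The intersection step can be sketched with plain Python sets. This is only an illustration of the idea, not how Neo4j executes such a match; the item IDs, the index layout and the `find_items` helper are hypothetical.

```python
# One entry per shared attribute node, holding the items connected to it.
items_by_attribute = {
    ("clothing", "jeans"): {"item1", "item2", "item3"},
    ("color", "red"): {"item1", "item2", "item4"},
    ("size", "small"): {"item1", "item4"},
}


def find_items(index, *attributes):
    """Keep only the items connected to every requested attribute."""
    sets = [index.get(attribute, set()) for attribute in attributes]
    if not sets:
        return set()
    return set.intersection(*sets)


small_red_jeans = find_items(
    items_by_attribute,
    ("size", "small"),
    ("color", "red"),
    ("clothing", "jeans"),
)
print(small_red_jeans)  # {'item1'}
```

The cost grows with the number of items hanging off each attribute, which is the trade-off against the duplicated nodes of the tree model.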
I need a brief explanation of the differences between the network model, the relational model, and the entity set model: why and when would each one be used?
I have looked into various resources, but I'm after a brief and concise summary of the topic.
In graph terms, the relational model is an undirected n-ary graph in which the nodes are values and the edges are rows. Logically, tables represent predicates and rows represent propositions about entities which are represented as values of a domain. Edges are joined to form paths and processed in sets.
The entity-relationship model is also an undirected n-ary graph in which some tables represent sets of nodes and other tables represent sets of edges. It's a "semantic framework" built on top of the relational model, and while it offers a seemingly simpler and richer structure than the pure relational model, it's actually more complicated and less expressive. It's usually queried and processed using relational mechanisms.
The network data model is a directed binary graph in which the nodes are rows and the edges are pointers. Unlike relational models it's usually processed imperatively and edges are navigated to get to related nodes. It makes a hard distinction between attributes and relationships (unlike ER in which attributes are binary relationships).
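The contrast between navigating pointers and joining sets of rows can be sketched in a few lines of Python. This is a loose analogy rather than an implementation of either model, and the record layout, sample data and function names are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


# Network style: records hold direct references (pointers) to related records.
@dataclass
class Record:
    name: str
    members: List["Record"] = field(default_factory=list)


def members_by_navigation(owner: Record) -> List[str]:
    """Follow the owner's pointers one record at a time."""
    names = []
    for member in owner.members:
        names.append(member.name)
    return sorted(names)


# Relational style: rows are plain value tuples, related only by matching values.
departments: Set[Tuple[int, str]] = {(1, "Sales")}
employees: Set[Tuple[int, str, int]] = {(10, "Ann", 1), (11, "Bob", 1), (12, "Cy", 2)}


def members_by_join(dept_name: str) -> List[str]:
    """Join the two sets of rows on the department id."""
    return sorted(
        emp_name
        for dept_id, name in departments
        for _emp_id, emp_name, emp_dept in employees
        if name == dept_name and emp_dept == dept_id
    )


sales = Record("Sales", [Record("Ann"), Record("Bob")])
print(members_by_navigation(sales))  # ['Ann', 'Bob']
print(members_by_join("Sales"))      # ['Ann', 'Bob']
```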
I'm not very familiar with the entity set model and haven't seen it in the field. Did you perhaps mean the entity-relationship model? For more on the entity set model, see Data structures and accessing in data-base systems.
Different data models exist since people have different concepts of how best to organize, manipulate and access data. The relational model is the only one that has been proven equivalent to first-order logic. While knowledge of these models is invaluable in understanding data, we don't get to choose models freely, but must choose among software systems that tend to implement a mish-mash of models and features.
A quick Google search shows some pretty decent explanations.
https://en.wikipedia.org/wiki/Database_model
http://www.unixspace.com/context/databases.html
If this is not what you're asking, please clarify the question.
I have something that completely confuses me and I have no idea how to store this much data in a database. Below I'll explain exactly what I think I need to store in the database and how I plan to use that data (to store it efficiently).
Okay, so. I have around 40 points on a grid. I'll call them "objects". They have information associated with them such as coordinates (x,y), ID, number, resources, and then a lot of other objects and an amount that "defends" that point on the grid. There are over 100 different types of units that can defend the point. These units can be owned by any number of players. ID and number can be derived from each other easily (so both may not need to be stored).
What I need to do is store all this information every time I scan these points, along with the time of the scan. I'll then need to take this information out of the database to create graphs of a player's units over time to see if they are increasing or decreasing. I'd also like to plot each object's total defense over time to track how that is changing overall.
The frequency at which I scan these objects can vary, up to at most once a minute. I can't even conceive how I'll store all this information in a database.
Any help is appreciated! Ask any and all questions you need.. I know it's a wall of text, but please read it!
Edit: The number of objects on the grid can change at any instant. We can gain one or we can lose one.
The starting point is really to understand Entity Relationship Modeling. Although your requirements look very unique to you, in terms of an entity relationship model they are old hat. Basically, it is all about the types of relationships between objects that matter. Learn about one-to-one, one-to-many, and many-to-many relationships. The entity model is the place to start, and some tools even let you generate the tables from it. Once you understand how a given relationship translates to a relational database model, you are on your way. For example, one team has many baseball players, so this is a one-to-many relationship. Once you get this, it will be a lot easier to understand why you need foreign keys in tables, a unique id per row, etc. As you build out your tables, remember to model the relationships first and all the attributes later.
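The team and players example translates into tables roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module; the table names, column names and sample rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE team (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- The "many" side carries the foreign key to the "one" side.
    CREATE TABLE player (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        team_id INTEGER NOT NULL REFERENCES team(id)
    );
    INSERT INTO team (id, name) VALUES (1, 'Hypothetical Hawks');
    INSERT INTO player (name, team_id) VALUES ('Ann', 1), ('Bob', 1);
""")


def roster(conn, team_name):
    """List the players of one team by joining on the foreign key."""
    rows = conn.execute(
        "SELECT p.name FROM player AS p "
        "JOIN team AS t ON p.team_id = t.id "
        "WHERE t.name = ? ORDER BY p.name",
        (team_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(roster(conn, "Hypothetical Hawks"))  # ['Ann', 'Bob']
```

With enforcement switched on, inserting a player whose team_id matches no team raises sqlite3.IntegrityError, which is the database protecting the relationship for you.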
The other approach is to design your object model first, using say UML. Still, it's about relationships, inheritance, composition etc., which will also translate into a database design. But if you want to design starting from the database, then entity relationship modeling is the way to go.
Specifically a Multigraph.
Some colleague suggested this and I'm completely baffled.
Any insights on this?
It's pretty straightforward to store a graph in a database: you have a table for nodes, and a table for edges, which acts as a many-to-many relationship table between the nodes table and itself. Like this:
create table node (
    id integer primary key
);
create table edge (
    start_id integer references node,
    end_id integer references node,
    primary key (start_id, end_id)
);
However, there are a couple of sticky points about storing a graph this way.
Firstly, the edges in this scheme are naturally directed - the start and end are distinct. If your edges are undirected, then you will either have to be careful in writing queries, or store two entries in the table for each edge, one in either direction (and then be careful writing queries!). If you store a single edge, I would suggest normalising the stored form - perhaps always consider the node with the lowest ID to be the start (and add a check constraint to the table to enforce this). You could have a genuinely unordered representation by not having the edges refer to the nodes, but rather having a join table between them, but that doesn't seem like a great idea to me.
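A minimal sketch of the normalised single-row form, using Python's sqlite3 module. The check constraint, the `add_undirected_edge` and `neighbours` helpers and the sample nodes are illustrative assumptions, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY);
    CREATE TABLE edge (
        start_id INTEGER NOT NULL REFERENCES node(id),
        end_id   INTEGER NOT NULL REFERENCES node(id),
        PRIMARY KEY (start_id, end_id),
        CHECK (start_id <= end_id)  -- the lower ID is always stored as the start
    );
    INSERT INTO node (id) VALUES (1), (2), (3);
""")


def add_undirected_edge(conn, a, b):
    """Store the edge once, in canonical (low, high) order."""
    low, high = min(a, b), max(a, b)
    conn.execute(
        "INSERT OR IGNORE INTO edge (start_id, end_id) VALUES (?, ?)",
        (low, high),
    )


def neighbours(conn, node_id):
    """An undirected lookup has to check both columns."""
    rows = conn.execute(
        "SELECT end_id FROM edge WHERE start_id = ? "
        "UNION SELECT start_id FROM edge WHERE end_id = ?",
        (node_id, node_id),
    ).fetchall()
    return sorted(row[0] for row in rows)


add_undirected_edge(conn, 2, 1)
add_undirected_edge(conn, 1, 2)  # same edge again, ignored
add_undirected_edge(conn, 3, 1)
print(neighbours(conn, 1))  # [2, 3]
```

The "be careful writing queries" part shows up in `neighbours`: even with a canonical form, every lookup still has to consider both ends.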
Secondly, the schema above has no way to represent a multigraph. You can extend it easily enough to do so; if edges between a given pair of nodes are indistinguishable, the simplest thing would be to add a count to each edge row, saying how many edges there are between the referred-to nodes. If they are distinguishable, then you will need to add something to the edge table to allow them to be distinguished - an autogenerated edge ID might be the simplest thing (it would then replace the (start_id, end_id) pair as the primary key).
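For the distinguishable case, a sketch of the edge-ID variant with Python's sqlite3 module might look like this. The `label` column, the helper functions and the sample rows are hypothetical additions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY);
    CREATE TABLE edge (
        id       INTEGER PRIMARY KEY,  -- autogenerated, identifies each edge
        start_id INTEGER NOT NULL REFERENCES node(id),
        end_id   INTEGER NOT NULL REFERENCES node(id),
        label    TEXT                  -- hypothetical payload telling edges apart
    );
    INSERT INTO node (id) VALUES (1), (2);
""")


def add_edge(conn, start, end, label=None):
    """Insert one edge and return its generated ID."""
    cursor = conn.execute(
        "INSERT INTO edge (start_id, end_id, label) VALUES (?, ?, ?)",
        (start, end, label),
    )
    return cursor.lastrowid


def count_edges(conn, start, end):
    """Count parallel edges between an ordered pair of nodes."""
    row = conn.execute(
        "SELECT COUNT(*) FROM edge WHERE start_id = ? AND end_id = ?",
        (start, end),
    ).fetchone()
    return row[0]


first = add_edge(conn, 1, 2, "morning")
second = add_edge(conn, 1, 2, "evening")
print(first != second, count_edges(conn, 1, 2))  # True 2
```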
However, even having sorted out the storage, you have the problem of working with the graph. If you want to do all of your processing on objects in memory, and the database is purely for storage, then no problem. But if you want to do queries on the graph in the database, then you'll have to figure out how to do them in SQL, which doesn't have any inbuilt support for graphs, and whose basic operations aren't easily adapted to work with graphs. It can be done, especially if you have a database with recursive SQL support (PostgreSQL, Firebird, some of the proprietary databases), but it takes some thought. If you want to do this, my suggestion would be to post further questions about the specific queries.
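As one example of the kind of query that needs some thought, here is a reachability query written as a recursive common table expression. It is run through SQLite here only because it ships with Python; the sample graph and the `reachable_from` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY);
    CREATE TABLE edge (
        start_id INTEGER REFERENCES node(id),
        end_id   INTEGER REFERENCES node(id),
        PRIMARY KEY (start_id, end_id)
    );
    INSERT INTO node (id) VALUES (1), (2), (3), (4), (5);
    -- 1 -> 2 -> 3 -> 1 is a cycle; 4 -> 5 is a separate component
    INSERT INTO edge (start_id, end_id) VALUES (1, 2), (2, 3), (3, 1), (4, 5);
""")

REACHABLE_SQL = """
    WITH RECURSIVE reachable(id) AS (
        SELECT end_id FROM edge WHERE start_id = :start
        UNION  -- UNION rather than UNION ALL drops repeats, so cycles terminate
        SELECT e.end_id
        FROM edge AS e
        JOIN reachable AS r ON e.start_id = r.id
    )
    SELECT id FROM reachable ORDER BY id
"""


def reachable_from(conn, start):
    """Return every node reachable from start by following directed edges."""
    return [row[0] for row in conn.execute(REACHABLE_SQL, {"start": start})]


print(reachable_from(conn, 1))  # [1, 2, 3]
print(reachable_from(conn, 4))  # [5]
```

Node 1 appears in its own result because the cycle leads back to it.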
It's an acceptable approach. You need to consider how that information will be manipulated. More than likely you'll need a language separate from your database to do the kinds of graph-related computations this type of data implies. Skiena's Algorithm Design Manual has an extensive section on graph data structures and their manipulation.
Without considering what types of queries you might execute, start with two tables: vertices and edges. Vertices are simple: an identifier and a name. Edges are complex, given the multigraph. Edges should be uniquely identified by a combination of two vertices (i.e. foreign keys) and some additional information. The additional information is dependent on the problem you're solving. For instance, for flight information, it would be the departure and arrival times and the airline. Furthermore, you'll need to decide if the edge is directed (i.e. one way) or not and keep track of that information as well.
Depending on the computation you may end up with a problem that's better solved with some sort of artificial intelligence / machine learning algorithm. For instance, optimal flights. The book Programming Collective Intelligence has some useful algorithms for this purpose. But where the data is kept doesn't change the algorithm itself.
Well, the information has to be stored somewhere, a relational database isn't a bad idea.
It would just be a many-to-many relationship: a table of nodes and a table of edges/connections.
Consider how Facebook might implement the social graph in their database. They might have a table for people and another table for friendships. The friendships table has at least two columns, each being foreign keys to the table of people.
Since friendship is symmetric (on Facebook) they might ensure that the ID for the first foreign key is always less than the ID for the second foreign key. Twitter has a directed graph for its social network, so it wouldn't use a canonical representation like that.
Say we are representing school course data. The relevant part of the example encompasses three real-world concepts: school, campus, and semester. A school can have many campuses, and there is a finite number of semesters.
In the real world, if we wanted to specify a campus + semester combination, it would be elementary. But the data model needs to be represented using a tree structure, like
Foo University:
    Main campus
        Fall 2010
        Spring 2011
Bar College:
    North campus
        Spring 2011
    South campus
        Spring 2011
This pattern could continue. For instance, departments could exist in the real world as children of the school, but in the model they would be represented as child nodes of the semester, because what's important about them can change from semester to semester. Basically, we represent the permutations of a set of choices as a tree.
What is the name for this data model pattern?
In the heading you mention "choice permutations", which suggests a dynamic pattern (i.e. how to use such a structure for decision making). If it's this, then I'd agree with @robert that it's a Decision Tree.
In the body however you say
...the data model needs to be represented using a tree structure...
If your question is simply the name of this tree-based structural pattern, the answer is Hierarchical Database Model.
It's characterised by 1..N relationships between parent and child and pre-dates the Relational model (it was - and still is - the basis for IBM's IMS database system).
You allude to one of the problems with it. Namely, that the only way to model graph-based structures using it means denormalising and repeating elements. Removing that limitation is central to the Relational model.
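The repetition is easy to see if the tree from the question is written out as nested Python data. The sketch below counts how often each semester is stored and then flattens the tree into relation-like sets where each semester exists once; the helper names are made up for the example.

```python
from collections import Counter

# The hierarchy from the question: school -> campus -> semesters.
hierarchy = {
    "Foo University": {"Main campus": ["Fall 2010", "Spring 2011"]},
    "Bar College": {
        "North campus": ["Spring 2011"],
        "South campus": ["Spring 2011"],
    },
}


def semester_occurrences(tree):
    """Count how many times each semester is repeated in the tree."""
    return Counter(
        semester
        for campuses in tree.values()
        for semesters in campuses.values()
        for semester in semesters
    )


def to_relations(tree):
    """Flatten the tree: each semester stored once, plus a linking relation."""
    semesters = set()
    campus_semester = set()
    for school, campuses in tree.items():
        for campus, campus_semesters in campuses.items():
            for semester in campus_semesters:
                semesters.add(semester)
                campus_semester.add((school, campus, semester))
    return semesters, campus_semester


print(semester_occurrences(hierarchy)["Spring 2011"])  # 3
semesters, campus_semester = to_relations(hierarchy)
print(sorted(semesters))  # ['Fall 2010', 'Spring 2011']
```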
hth.
I would call it an Entity Tree. It's not so much a data modeling pattern as a natural representation of your Entity relationships.
Decision tree.
A decision tree is a decision support tool that uses a tree-like graph
or model of decisions and their possible consequences, including
chance event outcomes, resource costs, and utility. It is one way to
display an algorithm. Decision trees are commonly used in operations
research, specifically in decision analysis, to help identify a
strategy most likely to reach a goal. If in practice decisions have to
be taken online with no recall under incomplete knowledge, a decision
tree should be paralleled by a probability model as a best choice
model or online selection model algorithm. Another use of decision
trees is as a descriptive means for calculating conditional
probabilities.