Streamlining Neo4j query that conditionally creates new relationships

I have a graph database with 3 types of nodes and two relationships:
(p:PERSON)-[:manages]->(c:COMPANY)-[:seeks]->(s:SKILLS)
I want to create a new relationship between the nodes labeled (:PERSON), such that (p1:PERSON)-[:competes_with]->(p2:PERSON) and
(p2:PERSON)-[:competes_with]->(p1:PERSON), subject to p1.name <> p2.name,
so that I can represent competition for scarce labor in a variety of markets represented by (s:SKILLS).
The condition to establish the new relationship [:competes_with] is that 2 distinct person nodes (:PERSON) manage companies that seek at least 3 (:SKILLS) profiles that coincide between the 2 companies.
Orders of magnitude are:
|(:PERSON)| = 6000
|(:COMPANY)| = 15000
|(:SKILLS)| = 95000
In my plodding way, what I did was:
MATCH (p1:PERSON)-[:manages]->(:COMPANY)-[:seeks]->(s:SKILLS)
WITH p1, collect(DISTINCT s.skill_names) AS p1_skills
MATCH (p2:PERSON)-[:manages]->(:COMPANY)-[:seeks]->(s:SKILLS)
WITH p1,p1_skills, p2, collect(DISTINCT s.skill_names) AS p2_skills
WHERE p1 <> p2
UNWIND p1_skills AS sought_skills
WITH p1,p2, sought_skills, reduce(com_skills=[], sought_skills IN p2_skills | com_skills + sought_skills) AS NCS
WHERE size(NCS) >= 3
MERGE(p1)-[competes_with]->(p2)
MERGE(p2)-[competes_with]->(p1)
Given the size of the problem, this causes a 14GB RAM box to crash after a while with an "out-of-memory" exception.
So, besides the fact that I don't know whether my query actually does what I want (it crashes before completing), the question is: can I streamline this to make it work with smaller memory requirements? What would the improved query look like? Tx.

The standard Neo4j naming convention is to have camel-case label names that start with an upper-case character, and all-upper-case relationship names (and properties should start with a lower-case character). In this answer, I will follow the standard and use names like Person and MANAGES.
You don't need 2 COMPETES_WITH relationships between the same 2 Person nodes if the relationship is inherently bidirectional. Neo4j can navigate incoming and outgoing relationships equally easily, and the MATCH clause allows a relationship pattern to not specify a direction (e.g., MATCH (a)-[:FOO]-(b)). Also, the MERGE clause (but not CREATE) allows you to specify an undirected relationship -- which ensures that only one relationship exists between the 2 endpoints.
It seems that the COMPETES_WITH relationship really belongs between Company nodes, since that is really the source of the competition. Also, if a Person left a company, you should not have to remove any COMPETES_WITH relationships from that node (and you should also not have to add a COMPETES_WITH relationship to the replacement Person).
In addition, you should consider whether the COMPETES_WITH relationship is really needed in the first place. Every time the skills sought by a Company changes, you'd have to recalculate its COMPETES_WITH relationships. You should determine whether doing that is worth it, or whether your queries should just dynamically determine a company's competitors as needed.
Here is a simplified version of your original query:
MATCH (p1:Person)-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
WHERE p1 <> p2 // a person managing several companies must not be paired with themselves
WITH p1, p2, COUNT(s) AS num_skills
WHERE num_skills >= 3
MERGE (p1)-[:COMPETES_WITH]-(p2);
To find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:COMPETES_WITH]-(p2:Person)
RETURN p1, COLLECT(p2) AS competing_people;
If you changed the data model to have the COMPETES_WITH relationship between Company nodes:
MATCH (c1:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(c2:Company)
WITH c1, c2, COUNT(s) AS num_skills
WHERE num_skills >= 3
MERGE(c1)-[:COMPETES_WITH]-(c2);
With this model, to find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:MANAGES]->(:Company)-[:COMPETES_WITH]-(:Company)<-[:MANAGES]-(p2:Person)
RETURN p1, COLLECT(p2) AS competing_people;
If you did not have COMPETES_WITH relationships at all, to find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
WHERE p1 <> p2 // exclude the given person from their own competitors
WITH p1, p2, COUNT(s) AS num_skills
WHERE num_skills >= 3
RETURN p1, COLLECT(p2) AS competing_people;
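To make the matching rule easier to check on a small scale, here is a rough Python sketch of the same idea outside the database. The sample people, companies and skills are invented for illustration, and the sketch assumes one particular reading of the rule: it counts the distinct skills shared between everything two different people manage.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: who manages which companies, and which
# skills each company seeks. None of these names come from the question.
manages = {
    "alice": ["acme"],
    "bob": ["globex"],
    "carol": ["initech"],
}
seeks = {
    "acme": {"python", "sql", "cypher", "go"},
    "globex": {"python", "sql", "cypher"},
    "initech": {"python", "rust"},
}


def competing_pairs(manages, seeks, min_shared=3):
    """Return unordered pairs of distinct people sharing >= min_shared skills."""
    # Union of the skills sought by all companies each person manages.
    skills_by_person = defaultdict(set)
    for person, companies in manages.items():
        for company in companies:
            skills_by_person[person] |= seeks.get(company, set())

    pairs = set()
    # combinations() yields each unordered pair once and never pairs a
    # person with themselves, which mirrors a single undirected relationship.
    for p1, p2 in combinations(sorted(skills_by_person), 2):
        if len(skills_by_person[p1] & skills_by_person[p2]) >= min_shared:
            pairs.add((p1, p2))
    return pairs


pairs = competing_pairs(manages, seeks)
print(pairs)
```

A toy version like this can be useful for validating the expected result on a handful of nodes before running the Cypher query against the full graph.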

Related

Recipe Database Design

I am trying to create a database to store my recipes. However, I am not sure how to implement it. I looked at other questions like this, but they do not have the same focus as mine.
I assume any dish is actually just an ingredient, which can then be used in other dishes, or in this case in other ingredients. Any ingredient may have multiple recipes. For now, each recipe indicates how much of each ingredient is needed, but I also want to know how these ingredients are combined without having a long text description of it.
For example, in text, I would describe one (very bad) scrambled eggs recipe like this:
Scrambled eggs:
Cooked for 5 minutes(
    1g Butter,
    Whisked(
        1g Salt,
        1g Pepper,
        2 Eggs
    )
)
and then Scrambled eggs could be used in another recipe as an ingredient.
But how would that translate in a database? I don't need that database to be SQL based since this is a personal project, but I don't know any other kind of databases so far.
I thought about defining an Ingredient, as having an optional Technique associated with it but that means Whisked(1g salt, 1g pepper, 2 eggs) would have to be an Ingredient. Which I guess could work and I could also make the name of ingredients optional, but it seems awkward.
I also thought about defining a Recipe as having multiple TransformedIngredients which would contain a Technique applied to many Ingredients but sometimes a Recipe contains raw, untransformed, Ingredients and sometimes TransformedIngredients would need to be applied to TransformedIngredient. From what I know of databases that wouldn't work.
PS: I stumbled onto a functional programming Tiramisu recipe which, though very much focused on the techniques, displays fairly well what I'm trying to implement for my database.
I think what's confusing is that there are two different things to think about with a recipe, 'Items' and 'Steps'.
One database structure that comes to mind for this is a Star Schema structure which separates these ideas nicely (into Dimension and Fact tables, respectively).
A quick description of each:
Dimension
"The state of something" i.e. a record is merely there to describe what the thing is. A customer's address table would be an example of a dimension table.
Fact
"Things changing over time" i.e. each record relates to a dimension table, but has changing values. An example would be shipped purchases from a website to a customer's address. The address stays the same, but the shipments are getting constantly added to the table.
This isn't to say that Dimension tables don't change, too; obviously new users sign up for websites all the time. In the above address example, if a customer were to change his address, a new primary key value would be added for the new address.
Now on to your recipe examples:
Imagine you're cooking something. I would put anything that you hold in your hands in a "dimension" table. For example: DIM_INGREDIENT (with columns such as INGREDIENT_ID, INGREDIENT_NAME), and DIM_AMOUNT (AMOUNT_ID, AMOUNT, UNITS) to describe the amounts. And DIM_ACTION (ACTION_ID, TYPE, LENGTH, UNITS) to describe the action. There are more you can come up with; these are a few to get started.
Any steps I'd be taking could go in a FACT_RECIPE_STEPS table that would map to all the dimension tables. Any column that doesn't apply to a given step would hold a null value (i.e. "stir for 5 minutes" would have null for INGREDIENT_ID).
The FACT_RECIPE_STEPS could look like this:
RECIPE_ID, RECIPE_STEP, ACTION_STEP_ID, INGREDIENT_ID, AMOUNT_ID, ACTION_ID
What gets confusing is the "substep" of whisking the stuff together. I put that in another FACT table called FCT_ACTION_STEP since "whisking" is one action in the recipe list, but to perform the action you actually need to do three things.
I think the following is what some of the tables would look like with your data:
DIM_INGREDIENT
INGREDIENT_ID | INGREDIENT_NAME
1             | 'Scrambled eggs'
2             | 'Salt'
3             | 'Pepper'
4             | 'Eggs'
5             | 'Butter'

DIM_ACTION
ACTION_ID | TYPE    | LENGTH | UNITS
1         | 'Cook'  | 5      | 'minutes'
2         | 'Whisk' | null   | null

FCT_ACTION_STEP
STEP_ID | ACTION_ID
1       | 2

DIM_AMOUNT
AMOUNT_ID | AMOUNT | UNITS
1         | 1      | 'grams'
2         | 2      | null

FACT_RECIPE_STEPS
RECIPE_ID, RECIPE_STEP, ACTION_STEP_ID, INGREDIENT_ID, AMOUNT_ID, ACTION_ID
EDIT:
I was a bit unsure myself as to how to do the "Whisked" part of the recipe and thought that, when you add the whisked mixture to the final result, it's like adding one ingredient to the recipe. However, you need to prepare the mixture beforehand, and that takes three steps. It's basically like its own little recipe, and the FCT_ACTION_STEP table takes that other 'recipe' into account so that the result can be added as one row in the FACT_RECIPE_STEPS table.
Now that I think about it a bit more, it might be better to just assign "Whisked" as its own recipe in FACT_RECIPE_STEPS and DIM_INGREDIENT (called something like "Whisked spices for eggs") and get rid of the FCT_ACTION_STEP table altogether. That way you can easily make more complex recipes, such as "Eggs and Pancake Breakfast" where the Eggs part is the result of this recipe.
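The idea that a prepared mixture is just another ingredient with its own recipe can be sketched as a small recursive structure. This is only an illustration in Python, not a database design; the class name Item and the helper raw_ingredients are made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """An ingredient; if it has components, it is itself the result of a recipe."""
    name: str
    amount: Optional[float] = None
    unit: Optional[str] = None
    technique: Optional[str] = None  # how the components are combined
    components: List["Item"] = field(default_factory=list)


def raw_ingredients(item: Item) -> List[str]:
    """Walk the tree and list the names of the untransformed ingredients."""
    if not item.components:
        return [item.name]
    names: List[str] = []
    for component in item.components:
        names.extend(raw_ingredients(component))
    return names


# The scrambled eggs example, with the whisked mixture as a nested item.
whisked = Item(
    "Whisked spices for eggs",
    technique="Whisk",
    components=[Item("Salt", 1, "g"), Item("Pepper", 1, "g"), Item("Eggs", 2)],
)
scrambled_eggs = Item(
    "Scrambled eggs",
    technique="Cook for 5 minutes",
    components=[Item("Butter", 1, "g"), whisked],
)

print(raw_ingredients(scrambled_eggs))
```

Because scrambled_eggs is itself an Item, it can appear in the components of a larger recipe without any special case.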
You can add some other fields to tables but I believe this schema works for you.
recipe
------------
r_id PK
recipe_name
cooking_time
recipe_of_recipes
-----------------
ror_id PK
ror_name
recipe_ror (table for many to many relation-> defining a recipe as an ingredient)
-------------
r_ror_id PK
r_id FK
ror_id FK
ingredients
-------------
i_id PK
t_id FK
r_id FK
ror_id FK (added later)
ingredient_name
quantity
technique
-------------
t_id PK
technique_name
EDIT
Let's say you want to store a recipe (X) which is a combination of x and y recipes plus z ingredient.
To prepare X recipe (big X),
in recipe,ingredients and technique tables you store
the x recipe and w,t,r ingredients with technique of p
the y recipe and b,n,m ingredients with technique of v
also z ingredient with technique of f (for this I forgot to add field ror_id as a FK in ingredients table)
You can define 2 different recipes (x and y) as ingredients of a recipe (X) using the recipe_ror table. This table relates different recipes as one (a many-to-many relationship between the recipe and recipe_of_recipes tables).
If you also want to store the technique for X,x or y recipes(like cook in your example), you can also add t_id field as FK to recipe and recipe_of_recipes table.

Is this use case a candidate for Graph Database application?

Consider I have some users U1, U2, U3 each with property 'age' such that;
U1.age = 10
U2.age = 30
U3.age = 70
I also have some lists which are dynamic collections of users based on some criteria, say L1, L2, L3, such that;
L1: where age < 60
L2: where age < 30
L3: where age > 20
Since the lists are dynamic, the relationship between lists and users is established only through the user properties and list criteria. There is no hard mapping to indicate which users belong to which list. When the age of any user changes or when the criteria of any list changes, the users associated with a list may also change.
In this scenario, at any point of time it is very easy to get the users associated with a list by querying users matching the list criteria.
But getting the lists associated with a user is an expensive operation, which involves first determining the users associated with each list and then picking those lists whose result contains the user in question.
Could this be a candidate for using Graph Database? And why? (I'm considering Neo4j) If yes, how to model the nodes and the relationships so that I can easily get the lists given a user.
Since 2.3 Neo4j does allow index range queries. Assume you have an index:
CREATE INDEX on :User(age)
Then this query gives you the list of people younger than 60 years and is performed via the index:
MATCH (u:User) WHERE u.age < 60 RETURN u
However, I would not store the age; instead I'd store the date of birth as a long property. Otherwise you have to change the age over and over again.
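Deriving the age from a stored date of birth is a small calculation. A minimal Python sketch follows, assuming the stored long value has already been converted to a calendar date; the function name age_on is made up for this example.

```python
from datetime import date


def age_on(date_of_birth: date, today: date) -> int:
    """Return the age in whole years on the given day."""
    years = today.year - date_of_birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


print(age_on(date(1990, 6, 15), date(2020, 6, 14)))
print(age_on(date(1990, 6, 15), date(2020, 6, 15)))
```

With this approach an age criterion such as "younger than 60" becomes a comparison against a cutoff date of birth, so the stored value never needs updating.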
Update based on comment below
Assume you have a node for each list:
CREATE (:List{name:'l1', min:20, max:999})
CREATE (:List{name:'l2', min:0, max:30})
CREATE (:List{name:'l3', min:0, max:60})
Let's find all the lists a user U1 belongs to:
MATCH (me:User{name:'U1'})
WITH me.age as age
MATCH (l:List) WHERE age >= l.min AND age <= l.max // find lists
WITH l
MATCH (u:User) WHERE u.age >= l.min AND u.age <= l.max // users in each of those lists
RETURN l.name, collect(u)
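The two lookups (lists for a user, users for a list) are symmetric once each list is described by a min/max range. A rough Python sketch of that logic, using the list bounds from the CREATE statements above and the ages from the question; the function names are invented for this illustration.

```python
# List definitions as (min, max) age bounds, both inclusive,
# mirroring the :List nodes created above.
lists = {"l1": (20, 999), "l2": (0, 30), "l3": (0, 60)}

# User ages taken from the question.
users = {"U1": 10, "U2": 30, "U3": 70}


def lists_for_user(user):
    """Names of all lists whose range contains the user's age."""
    age = users[user]
    return sorted(name for name, (lo, hi) in lists.items() if lo <= age <= hi)


def users_in_list(list_name):
    """Names of all users whose age falls inside the list's range."""
    lo, hi = lists[list_name]
    return sorted(name for name, age in users.items() if lo <= age <= hi)


print(lists_for_user("U1"))
print(users_in_list("l3"))
```

Neither direction needs a stored user-to-list relationship; the membership is recomputed from the criteria each time.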
Update 2
A completely different idea would be to use a timetree. Both the users and your list definitions are connected to the timetree.

modeling correct star schema for ssas tabular

I'm using ssas tabular (powerpivot) and need to design a data-model and write some DAX.
I have 4 tables in my relational database-model:
Orders(order_id, order_name, order_type)
Spots (spot_id,order_id, spot_name, spot_time, spot_price)
SpotDiscount (spot_id, discount_id, discount_value)
Discounts (discount_id, discount_name)
One order can include multiple spots but one spot (spot_id 1) can only belong to one order.
One spot can include different discounts and every discount have one discount_value.
Ex:
Order_1 has spot_1 (spot_price 10), spot_2 (spot_price 20)
Spot_1 has discount_name_1(discount_value 10) and discount_name_2 (discount_value 20)
Spot_2 has discount_name_1(discount_value 15) and discount_name_3 (discount_value 30)
I need to write two measures: price(sum) and discount_value(average)
How do I correctly design a star schema with fact table (or maybe two fact tables) so that I in my powerpivot cube can get:
If I choose discount_name_1 I should get
order_1 with spot_1 and spot_2 and price on order_1 level will have value 30 (10 + 20) and discount_value = 12.5
If I choose discount_name_3 I should get
order_1 with only spot_2 and price on order level = 20 and discount_value = 30
Fact(OrderKey, SpotKey, DiscountKey, DateKey, TimeKey, Spot_Price, Discount_Value,...)
DimOrder, DimSpot, DimDiscount, etc....
TotalPrice:=
SUMX(
    SUMMARIZE(
        Fact
        ,Fact[OrderKey]
        ,Fact[SpotKey]
        ,Fact[Spot_Price]
    )
    ,Fact[Spot_Price]
)
AverageDiscount:=
AVERAGE(Fact[Discount_Value])
Fact table is denormalized and you end up with the simplest star schema you can have.
First measure deserves some explanation. [Spot_Price] is duplicated for any spot with multiple discounts, and we would get wrong results with a simple SUM(). SUMMARIZE() does a group by on all the columns passed to it, following relationships (if necessary, we're looking at a single table here so nothing to follow).
SUMX() iterates over this table and accumulates the value of the expression in its second argument. The SUMMARIZE() has removed our duplicate [Spot_Price]s so we accumulate the unique ones (per unique combination of [OrderKey] and [SpotKey]) in a sum.
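The deduplication that the two measures rely on can be imitated in plain Python to see why a simple sum over-counts. This is only an illustration of the logic, not DAX; the rows are the sample order from the question laid out as a denormalized fact table, and the function names are made up.

```python
# Denormalized fact rows: (order, spot, discount_name, spot_price, discount_value).
# spot_price repeats once per discount attached to the spot.
fact = [
    ("order_1", "spot_1", "discount_name_1", 10, 10),
    ("order_1", "spot_1", "discount_name_2", 10, 20),
    ("order_1", "spot_2", "discount_name_1", 20, 15),
    ("order_1", "spot_2", "discount_name_3", 20, 30),
]


def total_price(rows):
    """Sum spot_price once per distinct (order, spot, price) combination."""
    distinct = {(order, spot, price) for order, spot, _, price, _ in rows}
    return sum(price for _, _, price in distinct)


def average_discount(rows):
    """Plain average of discount_value over the rows."""
    return sum(row[4] for row in rows) / len(rows)


def filter_by_discount(rows, discount_name):
    """Keep only the rows for one discount, like a slicer selection."""
    return [row for row in rows if row[2] == discount_name]


naive_sum = sum(row[3] for row in fact)  # counts each spot price twice
print(naive_sum, total_price(fact))

selected = filter_by_discount(fact, "discount_name_1")
print(total_price(selected), average_discount(selected))
```

The naive sum doubles every spot price because each spot carries two discounts, while the grouped version counts each spot once.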
You say
One order can include multiple spots but one spot (spot_id 1) can only
belong to one order.
That is not supported in the table definitions you give just above that statement. In the table definitions, one order has only one spot but (unless you've added a unique index to Orders on spot_id) each Spot can have multiple orders. Each Spot can also have multiple discounts.
If you want to have the relationship described in your words, the table definitions should be:
Orders(order_id, order_name, order_type)
OrderSpot(order_id, spot_id) -- with a unique index on spot_id
Spots (spot_id, spot_name, spot_time, price)
or:
Orders(order_id, order_name, order_type)
Spots (spot_id, spot_name, spot_time, order_id, price)
You can create the SSAS cube with Order as the fact table, with one dimension in the Spot table. If you then add the SpotDiscount and Discount tables with their relations (SpotDiscount to Spot, Discount to SpotDiscount) you have a one-dimensional cube.
EDIT as per comments
Well, the Fact table would have order_id, order_name, order_type
The Dimension would be made up of the other 3 tables and have the columns you're interested in: probably spot_name, spot_time, spot_price, discount_name, discount_value.

Designing a multilevel company structure

I am trying to build a database model for the following structure:
I have companies with up to 3 hierarchical levels. For each unit I have a value (these values are assigned randomly, and duplicates are possible between companies but not within one). For example: Level 1: 222-Amazon; Level 2: 441-Amazon: Germany, 542-Britain; Level 3: 6-Distribution, 99-Shop, 124-Programming, 5-HR.
Of course for each company this is different. What I did is:
Table1:
ID_Worker
CompanyName
ID_CompanyLvL1
ID_CompanyLvL2
ID_CompanyLvL3
...
Table2:
ID_CompanyLevel1
Slot1
Slot2
...
Table3:
ID_CompanyLevel2
Slot1
Slot2
...
But with this approach I have the following problem: if two companies have the same number for a CompanyLevel1 (or 2 or 3) unit, I cannot distinguish them anymore.
Another approach that is not working is
Table1:
ID_Company
ID_Worker
ID_CompanyLevel1
...
Table2:
ID_CompanyLevel1
Slot1
ID_CompanyLevel2
...
Table3:
ID_CompanyLevel2
Slot
ID_CompanyLevel3
...
With this approach I cannot identify which person is in, e.g., which level 2 unit. Could anyone help me with this? I just cannot come up with the right design.
You need to decide whether the organization structure is purely hierarchical (an org unit can only belong to 0 or 1 other org unit), or whether it is graphical (an org unit can belong to 0, 1, or 1+ org units).
Your limit of three is a business rule, and should be enforced by database logic (trigger) and not the database schema.
Why the codes with the names?
If hierarchical, this is your schema:
create table organizations (
organization_id int primary key,
name varchar(whatever) not null,
parent_id int null references organizations(organization_id)
);
Use Recursive Common Table Expressions to query them.
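As a rough illustration of such a recursive query, here is a sketch that uses Python's built-in sqlite3 module. The SQLite column types and the sample rows are assumptions made for this example (the unit names are loosely borrowed from the question, and their placement in the tree is invented).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table organizations (
    organization_id integer primary key,
    name text not null,
    parent_id integer null references organizations(organization_id)
);
-- Hypothetical sample hierarchy.
insert into organizations values
    (1, 'Amazon', null),
    (2, 'Germany', 1),
    (3, 'Britain', 1),
    (4, 'Distribution', 2),
    (5, 'Shop', 2),
    (6, 'HR', 3);
""")


def subtree(conn, root_id):
    """Return (name, depth) for the given unit and everything below it."""
    return conn.execute("""
        with recursive tree(organization_id, name, depth) as (
            -- anchor: the starting unit
            select organization_id, name, 1
            from organizations
            where organization_id = ?
            union all
            -- recursive step: children of units already in the tree
            select o.organization_id, o.name, t.depth + 1
            from organizations o
            join tree t on o.parent_id = t.organization_id
        )
        select name, depth from tree order by depth, organization_id
    """, (root_id,)).fetchall()


print(subtree(conn, 1))
print(subtree(conn, 2))
```

The depth column also gives a place to check a business rule such as "at most 3 levels" without building that limit into the schema.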
If graphical, this is your schema:
create table organizations (
organization_id int primary key,
name varchar(whatever) not null
);
create table organizations_structure (
parent_organization_id int references organizations(organization_id),
child_organization_id int references organizations(organization_id),
primary key (parent_organization_id, child_organization_id),
check (parent_organization_id <> child_organization_id)
);
For anything like that, make sure you do not put yourself into a corner. For example:
I have companies with up to 3 hierarchical levels
No. You have companies with CURRENTLY up to 3 hierarchical levels. And you do not want them to scream at you when one of them decides to have 4.
I would suggest reading The Data Model Resource Book, Volume 1. It describes all kinds of standard data schemata, among them entity organizations (entity as in "legal, human or organizational entity"), which includes organigrams. Things are a lot more complex than you think when you do not want to put yourself into a corner that WILL force a rewrite of the program in the not too distant future.

Determining the functional dependencies of a relationship and their normal forms

I'm studying for a database test, and in the study guide there are some (many) exercises on database normalization and functional dependencies, but the teacher did not work through any similar exercise, so I would like someone to help me understand this one so that I can tackle the other 16 problems.
1) Given the following logical schema:
Relationship product_sales
POS    Zone    Agent  Product_Code  Qualification  Quantity_Sold
123-A  Zone-1  A-1    P1            8              80
123-A  Zone-1  A-1    P1            3              30
123-A  Zone-1  A-2    P2            3              30
456-B  Zone-1  A-3    P1            2              20
456-B  Zone-1  A-3    P3            5              50
789-C  Zone-2  A-4    P4            2              20
Assuming that:
• Points of Sale are grouped into Zones.
• At each Point of Sale there are agents.
• Each agent operates in a single POS.
• Two agents of the same points of sale can not market the same product.
• For each product sold by an agent, it is assigned a Qualification depending on the product and the quantity sold.
a) Indicate 4 functional dependencies present.
b) What is the normal form of this structure?
To get you started finding the 4 functional dependencies, think about which attributes depend on another attribute:
eg: does the Zone depend on the POS? (if so, POS -> Zone) or does the POS depend on the Zone? (in which case Zone -> POS).
Four of your five statements tell you something about the dependencies between attributes (or combinations of several attributes).
As for normalisation, there's a (relatively) clear tutorial here. The phrase "the key, the whole key, and nothing but the key" is also a good way to remember the 1st, 2nd and 3rd normal forms.
In your comment, you said
Well, according to the theory I've read I think it may be, but I have
many doubts: POS → Zone, {POS, Agent} → Zone, Agent → POS, {Agent,
Product_code, Quantity_Sold} → Qualification
I think that's a good effort.
I think POS->Zone is right.
I don't think {POS, Agent} → Zone is quite right. If you look at the sample data, and you think about it a bit, I think you'll find that Agent->POS, and that Agent->Zone.
I don't think {Agent, Product_code, Quantity_Sold} → Qualification is quite right. The requirement states "For each product sold by an agent, it is assigned a Qualification depending on the product and the quantity sold." The important part of that is "a Qualification depending on the product and the quantity sold". Qualification depends on product and quantity, so {Product_code, Quantity}->Qualification. (Nothing in the requirement suggests to me that the qualification might be different for identical orders from two different agents.)
So based on your comment, I think you have these functional dependencies so far.
POS->Zone
Agent->POS
Agent->Zone
Product_code, Quantity->Qualification
But you're missing at least one that has a significant effect on determining keys. Here's the requirement.
Two agents of the same points of sale can not market the same product.
How do you express the functional dependency implied in that requirement?
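Candidate dependencies can be sanity-checked against the sample rows with a few lines of code. The Python sketch below is only a helper for experimenting: the function name holds is invented, the rows assume that "Zona-1" in the sample data is a typo for "Zone-1", and a dependency that holds on six sample rows is not proven to hold in general (the stated business rules decide that).

```python
COLUMNS = ["POS", "Zone", "Agent", "Product_Code", "Qualification", "Quantity_Sold"]

# Sample rows from the exercise (assuming "Zona-1" was meant to be "Zone-1").
DATA = [
    ("123-A", "Zone-1", "A-1", "P1", 8, 80),
    ("123-A", "Zone-1", "A-1", "P1", 3, 30),
    ("123-A", "Zone-1", "A-2", "P2", 3, 30),
    ("456-B", "Zone-1", "A-3", "P1", 2, 20),
    ("456-B", "Zone-1", "A-3", "P3", 5, 50),
    ("789-C", "Zone-2", "A-4", "P4", 2, 20),
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]


def holds(rows, lhs, rhs):
    """Check whether lhs -> rhs is consistent with every row given."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = tuple(row[attr] for attr in rhs)
        # The same left-hand values must always map to the same right-hand values.
        if seen.setdefault(key, value) != value:
            return False
    return True


print(holds(rows, ["POS"], ["Zone"]))
print(holds(rows, ["Zone"], ["POS"]))
print(holds(rows, ["Agent"], ["POS"]))
print(holds(rows, ["Product_Code", "Quantity_Sold"], ["Qualification"]))
```

Trying other attribute combinations on the left-hand side is a quick way to test a guess for the remaining dependency before writing it down.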
