How to traverse nodes efficiently in PostgreSQL?

I'm trying to traverse nodes starting from a specific one using a recursive clause in PostgreSQL. (By the way, I'm new to PostgreSQL.) Here's a simplified version of my tables:
table: nodes
|--------------|------------------|
| id | node_name |
|--------------|------------------|
| 1 | A |
|--------------|------------------|
| 2 | B |
|--------------|------------------|
| 3 | C |
|--------------|------------------|
| 4 | D |
|--------------|------------------|
| 5 | E |
table: links
|--------------|---------------------|-----------------|
| id | id_from | id_to |
|--------------|---------------------|-----------------|
| 1 | 1 | 2 |
|--------------|---------------------|-----------------|
| 2 | 1 | 3 |
|--------------|---------------------|-----------------|
| 3 | 2 | 4 |
|--------------|---------------------|-----------------|
| 4 | 3 | 4 |
|--------------|---------------------|-----------------|
| 5 | 4 | 5 |
So it's just this simple directed graph (all edges go left to right):
  B
 / \
A   D - E
 \ /
  C
In this situation, what is an efficient way to get all vertices that can be reached starting from A?
What I tried:
A DFS solution found at "Simple Graph Search Algorithm in SQL (PostgreSQL)":
with recursive graph(node1, node2, path) as
(
select id_from, id_to, ARRAY[id_from] from links
where id_from = 1
union all
select nxt.id_from, nxt.id_to,array_append(prv.path, nxt.id_from)
from links nxt, graph prv
where nxt.id_from = prv.node2
and nxt.id_from != ALL(prv.path)
)
select * from graph
It gave me almost all the paths, but it visited vertex D twice (it ran the D -> E step twice). I want to skip already-visited vertices for efficiency.
So, how can I achieve this? Thanks in advance!

It is not that simple. Within a single query, the recursive paths are completely independent of each other, so a path knows nothing about what happens on a sibling path; in particular, it does not know that a sibling has already visited a given node.
SQL queries have no global variables, so that information cannot be shared between the recursion paths this way.
I'd recommend writing a function in PL/pgSQL, where you can solve the problem in a more conventional procedural way.
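As a rough illustration of what such a procedural function would do, here is a minimal Python sketch (not PL/pgSQL) of a breadth-first traversal that keeps one shared visited set, so D is expanded only once. LINKS mirrors the links table from the question; the name reachable_from is made up for this example, and the result includes the start node itself.

```python
from collections import defaultdict, deque

# (id_from, id_to) pairs copied from the links table in the question.
LINKS = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]


def reachable_from(start, links):
    """Return the nodes reachable from start in BFS order, expanding each node once."""
    adjacency = defaultdict(list)
    for id_from, id_to in links:
        adjacency[id_from].append(id_to)

    visited = {start}  # shared by all branches, unlike the per-path arrays in the CTE
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacency[node]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order


print(reachable_from(1, LINKS))  # [1, 2, 3, 4, 5]
```

Because the visited check happens before a node is queued, the D -> E step runs once even though D is reachable through both B and C.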

Related

Postgres query chain structure data

Assume there is a gigantic organization with a crazy management structure. Each employee has one or more managers, and managers are themselves employees who have one or more managers above them.
employee table
| id | name |managers_id|
| -------- | -------------- |-----------|
| 1 | Smith | 5,6 |
| 2 | Matt | 1 |
| 3 | Bob | 1,2 |
| 4 | Adam | 1,3 |
| 5 | Suzi | 6 |
| 6 | Emily | 23,25 |
| ... | ... | ... |
It is a one-way management chain with no loops, meaning it goes A-B-C-D, A-a-b-C-D, etc.; there is no such case as A-B-C-D-A.
The query should return the management chains. Say C has two management chains above it:
A-B-C
A-a-b-C
C also has one chain below:
C-D
The level of C along the chains does not matter.
In theory, there is no limit on the number of levels; the chain can keep going indefinitely.
I was thinking about 'inheritance', but it is probably not the solution.
Any tips on how to design this Postgres database, please? Thank you.

Find all possible relationships between two nodes in multi-parent hierarchy data model [SQL Server]

I have a data model that defines multi-parent hierarchical data. Each record represents a relationship between two nodes, one the parent node and the other the child node. In my case, a node can have multiple parents. I need to find all possible relationships between two nodes.
For example, take the table below.
---------------------------------
| id | parent_node | child_node |
---------------------------------
| 1 | NULL | A |
| 2 | NULL | B |
| 3 | A | C |
| 4 | A | D |
| 5 | B | D |
| 6 | B | E |
| 7 | C | G |
| 8 | C | H |
| 10 | D | I |
| 11 | E | I |
| 12 | E | J |
---------------------------------
This will form a graph like below
    A     B
   / \   / \
  C    D    E
 / \    \  / \
G   H    I    J
In the above model, A and B are the top-level nodes, and each has two children. Node D is a child of both node A and node B, and node I is a child of both node D and node E. All other nodes have exactly one parent.
I need to write a query that shows every possible relationship between one node and another.
For example,
A and C have a relationship, because C is a child of node A.
A and D have a relationship, because D is a child of node A.
A and G have a relationship, because G is a grandchild of node A.
This goes on for any number of levels.
Two nodes have no relationship if neither is a child or nth-level grandchild of the other.
If two nodes have no relationship, the pair does not show up.
The final outcome for the above graph will be as below,
----------------------------
| parent_node | child_node |
----------------------------
| A | C |
| A | D |
| C | G |
| C | H |
| D | I |
| A | G |
| A | H |
| A | I |
| B | D |
| B | E |
| B | I |
| E | I |
| E | J |
| B | J |
----------------------------
I am new to SQL Server. Please help me to solve this query.
By doing some research, I was able to write the query myself. As @SeanLange pointed out in the comments, this type of query is called a recursive CTE.
If the table name is nodes, the following query will create the new table relationship and store all possible relationships in it as mentioned in my question.
;with cte as (
select child_node
, parent_node
, child_node as root
from nodes
union all
select child.child_node
, child.parent_node
, parent.root
from cte parent
join nodes child
on parent.parent_node = child.child_node
)
select distinct parent_node,
root as child_node
into relationship
from cte
where parent_node is not null;
select * from relationship;
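To make the logic of the recursive CTE easier to follow, here is a minimal Python sketch of the same idea: start at each node and climb towards the roots, recording every ancestor on the way. ROWS copies the table from the question; all_relationships is a name invented for this example. It collects pairs in a set, so a pair that is reachable through two paths (B and I, via D and via E) is reported once.

```python
# (parent_node, child_node) rows from the question; None stands for NULL.
ROWS = [
    (None, "A"), (None, "B"),
    ("A", "C"), ("A", "D"), ("B", "D"), ("B", "E"),
    ("C", "G"), ("C", "H"), ("D", "I"), ("E", "I"), ("E", "J"),
]


def all_relationships(rows):
    """Return every (ancestor, descendant) pair implied by the parent/child rows."""
    parents = {}
    for parent, child in rows:
        if parent is not None:
            parents.setdefault(child, set()).add(parent)

    pairs = set()
    for node in {child for _, child in rows}:
        # Climb from node towards the roots, collecting every ancestor once.
        stack = list(parents.get(node, ()))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            pairs.add((ancestor, node))
            stack.extend(parents.get(ancestor, ()))
    return pairs


for parent, child in sorted(all_relationships(ROWS)):
    print(parent, child)
```

For the sample data this yields the 14 pairs listed in the question.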

Traversing and Getting Nodes in Graph without Loop

I have a person table which keeps some personal info, like the table below.
+----+------+----------+----------+--------+
| ID | name | motherID | fatherID | sex |
+----+------+----------+----------+--------+
| 1 | A | NULL | NULL | male |
| 2 | B | NULL | NULL | female |
| 3 | C | 1 | 2 | male |
| 4 | X | NULL | NULL | male |
| 5 | Y | NULL | NULL | female |
| 6 | Z | 5 | 4 | female |
| 7 | T | NULL | NULL | female |
+----+------+----------+----------+--------+
Also I keep marriage relationships between people. Like:
+-----------+--------+
| HusbandID | WifeID |
+-----------+--------+
| 1 | 2 |
| 4 | 5 |
| 1 | 5 |
| 3 | 6 |
+-----------+--------+
With this information we can picture the relationship graph.
The question is: how can I get all connected people, given the ID of any one of them?
For example:
When I give ID=1, it should return 1,2,3,4,5,6 (order is not important).
Likewise, when I give ID=6, it should return 1,2,3,4,5,6 (order is not important).
Likewise, when I give ID=7, it should return 7.
Please note: the relationships (edges) between person nodes may form loops anywhere in the graph. The example above shows only a small part of my data; the person and marriage tables may contain thousands of rows, and we do not know where loops may occur.
Similar questions were asked in:
PostgreSQL SQL query for traversing an entire undirected graph and returning all edges found
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=118319
But I can't code the working SQL. Thanks in advance. I am using SQL Server.
From SQL Server 2017 and Azure SQL DB onwards, you can use the new graph database capabilities and the new MATCH clause to answer queries like this, e.g.:
SELECT FORMATMESSAGE ( 'Person %s (%i) has mother %s (%i) and father %s (%i).', person.userName, person.personId, mother.userName, mother.personId, father.userName, father.personId ) msg
FROM dbo.persons person, dbo.relationship hasMother, dbo.persons mother, dbo.relationship hasFather, dbo.persons father
WHERE hasMother.relationshipType = 'mother'
AND hasFather.relationshipType = 'father'
AND MATCH ( father-(hasFather)->person<-(hasMother)-mother );
My results:
Full script available here.
For your specific questions, the current release does not include transitive closure (the ability to loop through the graph n number of times) or polymorphism (find any node in the graph), so answering these queries may involve loops, recursive CTEs or temp tables. I have attempted this in my sample script and it works for your sample data, but it's just an example; I'm not 100% sure it will work with other data.
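The loop-based approach mentioned above can be sketched outside the database. The following minimal Python example (an illustration only, not the linked script) treats parent links and marriages as undirected edges and walks them with a visited set, so loops in the data cannot cause endless traversal. PERSONS and MARRIAGES copy the sample tables; connected_people is a hypothetical name.

```python
from collections import defaultdict

# (ID, motherID, fatherID) from the person table; None stands for NULL.
PERSONS = [
    (1, None, None), (2, None, None), (3, 1, 2), (4, None, None),
    (5, None, None), (6, 5, 4), (7, None, None),
]
# (HusbandID, WifeID) from the marriage table.
MARRIAGES = [(1, 2), (4, 5), (1, 5), (3, 6)]


def connected_people(start, persons, marriages):
    """Return the sorted IDs of everyone connected to start by any chain of edges."""
    neighbours = defaultdict(set)

    def link(a, b):
        # Every relationship is treated as an undirected edge.
        neighbours[a].add(b)
        neighbours[b].add(a)

    for person_id, mother, father in persons:
        for parent in (mother, father):
            if parent is not None:
                link(person_id, parent)
    for husband, wife in marriages:
        link(husband, wife)

    seen = {start}  # guards against loops in the graph
    stack = [start]
    while stack:
        current = stack.pop()
        for other in neighbours[current]:
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return sorted(seen)


print(connected_people(1, PERSONS, MARRIAGES))  # [1, 2, 3, 4, 5, 6]
print(connected_people(7, PERSONS, MARRIAGES))  # [7]
```

In SQL the same idea usually means iterating until no new IDs are added to a working table, which is why loops, recursive CTEs or temp tables come up.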

Creating hierarchical data (tree) structures in Neo4j using "tree keys"

I have imported data from a CSV file and created a lot of Nodes, all of which are related to other Nodes within the same data set based on a "Tree Number" hierarchy system:
For example, the Node with Tree Number A01.111 is a direct child of Node A01, and the Node with Tree Number A01.111.230 is a direct child of Node A01.111.
What I am trying to do is create unique relationships between Nodes that are direct children of other Nodes. For example Node A01.111.230 should only have one "IS_CHILD_OF" relationship, with Node A01.111.
I have tried several things, for example:
MATCH (n:Node), (n2:Node)
WHERE (n2.treeNumber STARTS WITH n.treeNumber)
AND (n <> n2)
AND NOT ((n2)-[:IS_CHILD_OF]->())
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n);
This example results in creating unique "IS_CHILD_OF" relationships but not with the direct parent of a Node. Rather, Node A01.111.230 would be related to Node A01.
I'd like to suggest another general solution, also avoiding a cartesian product as @InverseFalcon points out.
Let's indeed start by creating an index for faster lookup, and inserting some test data:
CREATE CONSTRAINT ON (n:Node) ASSERT n.treeNumber IS UNIQUE;
CREATE (:Node {treeNumber: 'A01.111.230'});
CREATE (:Node {treeNumber: 'A01.111'});
CREATE (:Node {treeNumber: 'A01'});
Then we need to scan all nodes as potential parents, and look for children which start with the treeNumber of the parent (STARTS WITH can use the index) and have no dots in the "remainder" of the treeNumber (i.e. a direct child), instead of splitting, joining, etc.:
MATCH (p:Node), (c:Node)
WHERE c.treeNumber STARTS WITH p.treeNumber
AND p <> c
AND NOT substring(c.treeNumber, length(p.treeNumber) + 1) CONTAINS '.'
RETURN p, c
I replaced the creation of the relationship by a simple RETURN for profiling purposes, but you can simply replace it by CREATE UNIQUE or MERGE.
Actually, we can get rid of the p <> c predicate and the + 1 on the length by pre-computing the actual prefix which should match:
MATCH (p:Node)
WITH p, p.treeNumber + '.' AS parentNumber
MATCH (c:Node)
WHERE c.treeNumber STARTS WITH parentNumber
AND NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
RETURN p, c
However, profiling that query shows that the index is not used, and there is a cartesian product (so we have an O(n^2) algorithm):
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 26 | c, p, parentNumber | NOT(Contains(SubstringFunction(c.treeNumber,length(parentNumber),None),{ AUTOSTRING1})) AND StartsWith(c.treeNumber,parentNumber) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Apply | 2 | 9 | 0 | p, parentNumber -- c | |
| |\ +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| | +NodeByLabelScan | 9 | 9 | 12 | c | :Node |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Projection | 3 | 3 | 3 | parentNumber -- p | p; Add(p.treeNumber,{ AUTOSTRING0}) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 45
But if we simply add a hint like so
MATCH (p:Node)
WITH p, p.treeNumber + '.' AS parentNumber
MATCH (c:Node)
USING INDEX c:Node(treeNumber)
WHERE c.treeNumber STARTS WITH parentNumber
AND NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
RETURN p, c
it does use the index and we have something like an O(n*log(n)) algorithm (log(n) for the index lookup):
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 6 | c, p, parentNumber | NOT(Contains(SubstringFunction(c.treeNumber,length(parentNumber),None),{ AUTOSTRING1})) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Apply | 2 | 3 | 0 | p, parentNumber -- c | |
| |\ +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| | +NodeUniqueIndexSeekByRange | 9 | 3 | 6 | c | :Node(treeNumber STARTS WITH parentNumber) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Projection | 3 | 3 | 3 | parentNumber -- p | p; Add(p.treeNumber,{ AUTOSTRING0}) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
Total database accesses: 19
Note that I did cheat a bit when introducing the WITH step creating the prefix earlier, as I noticed it improved the execution plan and DB accesses over
MATCH (p:Node), (c:Node)
USING INDEX c:Node(treeNumber)
WHERE c.treeNumber STARTS WITH p.treeNumber
AND p <> c
AND NOT substring(c.treeNumber, length(p.treeNumber) + 1) CONTAINS '.'
RETURN p, c
which has the following execution plan:
Compiler CYPHER 3.0
Planner RULE
Runtime INTERPRETED
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DB Hits | Variables | Other |
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 9 | c, p | NOT(p == c) AND NOT(Contains(SubstringFunction(c.treeNumber,Add(length(p.treeNumber),{ AUTOINT0}),None),{ AUTOSTRING1})) |
| | +------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +SchemaIndex | 6 | 12 | c -- p | PrefixSeekRangeExpression(p.treeNumber); :Node(treeNumber) |
| | +------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabel | 3 | 4 | p | :Node |
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 25
Finally, for the record, the execution plan of the original query I wrote (i.e. without the hint) was:
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 21 | c, p | NOT(p == c) AND StartsWith(c.treeNumber,p.treeNumber) AND NOT(Contains(SubstringFunction(c.treeNumber,Add(length(p.treeNumber),{ AUTOINT0}),None),{ AUTOSTRING1})) |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +CartesianProduct | 9 | 9 | 0 | p -- c | |
| |\ +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | +NodeByLabelScan | 3 | 9 | 12 | c | :Node |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 37
It's not the worst one: the one without the hint but with the pre-computed prefix is! This is why you should always measure.
I think we can improve on the query a bit. First, ensure you have either a unique constraint or an index on :Node.treeNumber, as you'll need that to improve your parent node lookups in this query.
Next, let's match on child nodes, excluding root nodes (assuming no .'s in the root's treeNumber) and nodes that have already been processed and have a relationship already.
Then we'll find each node's parent by the treeNumber using our index, and create the relationship. This assumes that a child treeNumber always has 4 more characters, including the dot.
MATCH (child:Node)
WHERE child.treeNumber CONTAINS '.'
AND NOT EXISTS( (child)-[:IS_CHILD_OF]->() )
WITH child, SUBSTRING(child.treeNumber, 0, SIZE(child.treeNumber)-4) as parentNumber
MATCH (parent:Node)
WHERE parent.treeNumber = parentNumber
CREATE UNIQUE (child)-[:IS_CHILD_OF]->(parent)
I think this query avoids a cartesian product as you may get from other answers, and should be around O(n) (someone correct me if I'm wrong).
EDIT
In the event that each subset of numbers in treeNumbers is NOT constrained to 3 digits (as in your description, actually, with 'A01.111.23'), then you need a different means of deriving the parentNumber. Neo4j is a little weak here, as it lacks both an indexOf() function and a join() function to reverse a split(). You may need the APOC Procedures library installed to get access to a join() function.
The query to handle cases with variable counts of digits in the numeric subsets of treeNumber becomes this:
MATCH (child:Node)
WHERE child.treeNumber CONTAINS '.'
AND NOT EXISTS( (child)-[:IS_CHILD_OF]->() )
WITH child, SPLIT(child.treeNumber, '.') as splitNumber
CALL apoc.text.join(splitNumber[0..-1], '.') YIELD value AS parentNumber
WITH child, parentNumber
MATCH (parent:Node)
WHERE parent.treeNumber = parentNumber
CREATE UNIQUE (child)-[:IS_CHILD_OF]->(parent)
I think I just figured out a solution! (If someone has a more elegant one please do post)
I just realized that the "Tree Number" coding system always uses 3-digit numbers between the dots, e.g. A01.111.230 or C02.100. Therefore, if a Node is the direct child of another Node, its "Tree Number" should not only start with the Tree Number of the parent Node, it should also be 4 characters longer (one character for the dot '.' and 3 characters for the numeric value).
Therefore my solution that seems to do the job is:
MATCH (n:Node), (n2:Node)
WHERE (n2.treeNumber STARTS WITH n.treeNumber)
AND (length(n2.treeNumber) = (length(n.treeNumber) + 4))
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n);
For your requirement STARTS WITH won't work, since A01.111.23 does indeed start with A01 in addition to starting with A01.111.
The treeNumber is made up of several parts with '.' as the separator. Let's not make any assumptions about the maximum/minimum possible character lengths of the individual parts. What we need is to compare all but the last part of each node's treeNumber with that of the potential child node being tested. You can achieve this using Cypher's split() function as follows:
MATCH (n1:Node), (n2:Node)
WHERE split(n2.treeNumber,'.')[0..-1] = split(n1.treeNumber,'.')
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n1);
The split() function splits a string, at each occurrence of a given separator, into a list of strings (parts). In this context the separator is '.' to split any treeNumber. We can select a subset of a list in Cypher using the syntax list[{startIndex}..{endIndex}]. Negative indices for reverse lookup are permitted, such as the one used in the above query.
This solution should generalize to all possible treeNumber values, in the format at hand, irrespective of number of parts and individual part lengths.
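The "all but the last part" rule can be checked quickly outside Neo4j. Below is a minimal Python sketch (not Cypher) that derives each node's parent treeNumber by dropping the last dot-separated part and then looks the parent up in a set, which stands in for the index on treeNumber. TREE_NUMBERS is made-up sample data and both function names are hypothetical.

```python
# Sample tree numbers, invented for illustration.
TREE_NUMBERS = ["A01", "A01.111", "A01.111.230", "C02", "C02.100"]


def parent_tree_number(tree_number):
    """Drop the last dot-separated part; return None for a root like 'A01'."""
    head, separator, _ = tree_number.rpartition(".")
    return head if separator else None


def child_parent_pairs(tree_numbers):
    """Pair each node with its direct parent, if that parent exists in the data."""
    known = set(tree_numbers)  # plays the role of the index on treeNumber
    pairs = []
    for number in tree_numbers:
        parent = parent_tree_number(number)
        if parent in known:
            pairs.append((number, parent))
    return pairs


print(child_parent_pairs(TREE_NUMBERS))
# [('A01.111', 'A01'), ('A01.111.230', 'A01.111'), ('C02.100', 'C02')]
```

Each node does one lookup, so the work grows roughly linearly with the number of nodes, and nothing depends on the parts being exactly three digits long: 'A01.111.23' would also map to 'A01.111'.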

Fill sequence in sql rows

I have a table that stores a group of attributes and keeps them ordered in a sequence. The chance exists that one of the attributes (rows) could be deleted from the table, and the sequence of positions should be compacted.
For instance, if I originally have these set of values:
+----+--------+-----+
| id | name | pos |
+----+--------+-----+
| 1 | one | 1 |
| 2 | two | 2 |
| 3 | three | 3 |
| 4 | four | 4 |
+----+--------+-----+
And the second row was deleted, the position of all subsequent rows should be updated to close the gaps. The result should be this:
+----+--------+-----+
| id | name | pos |
+----+--------+-----+
| 1 | one | 1 |
| 3 | three | 2 |
| 4 | four | 3 |
+----+--------+-----+
Is there a way to do this update in a single query? How could I do this?
PS: I'd appreciate examples for both SQLServer and Oracle, since the system is supposed to support both engines. Thanks!
UPDATE: The reason for this is that users are allowed to modify the positions at will, as well as add or delete rows. Positions are shown to the user, and for that reason they should show a consistent sequence at all times (and this sequence must be stored, not generated on demand).
I'm not sure it works, but with Oracle I would try the following:
update my_table set pos = rownum;
This would work but may be suboptimal for large datasets:
SQL> UPDATE my_table t
2 SET pos = (SELECT COUNT(*) FROM my_table WHERE id <= t.id);
3 rows updated
SQL> select * from my_table;
ID NAME POS
---------- ---------- ----------
1 one 1
3 three 2
4 four 3
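The renumbering that this update performs can be sketched in a few lines of Python. This is only an illustration of the logic, not Oracle or SQL Server code; compact_positions is a hypothetical name, and the sketch orders rows by their current pos rather than by id, which is an assumption that starts to matter once users have reordered rows.

```python
# Rows left after deleting id 2 from the example table.
ROWS = [
    {"id": 1, "name": "one", "pos": 1},
    {"id": 3, "name": "three", "pos": 3},
    {"id": 4, "name": "four", "pos": 4},
]


def compact_positions(rows):
    """Return copies of the rows renumbered 1..n in their current pos order."""
    ordered = sorted(rows, key=lambda row: row["pos"])
    return [dict(row, pos=new_pos) for new_pos, row in enumerate(ordered, start=1)]


for row in compact_positions(ROWS):
    print(row["id"], row["name"], row["pos"])
# 1 one 1
# 3 three 2
# 4 four 3
```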
Do you really need the sequence values to be contiguous, or do you just need to be able to display the contiguous values? The easiest way to do this is to let the actual sequence become sparse and calculate the rank based on the order:
select id,
name,
dense_rank() over (order by pos) as pos,
pos as sparse_pos
from my_table
(note: this was written for Oracle; DENSE_RANK() OVER (ORDER BY ...) is a standard window function that SQL Server also supports)
If you make the position sparse in the first place, this would even make re-ordering easier, since you could make each new position halfway between the two existing ones. For instance, if you had a table like this:
+----+--------+-----+
| id | name | pos |
+----+--------+-----+
| 1 | one | 100 |
| 2 | two | 200 |
| 3 | three | 300 |
| 4 | four | 400 |
+----+--------+-----+
When it becomes time to move ID 4 into position 2, you'd just change the position to 150.
Further explanation:
Using the above example, the user initially sees the following (because you're masking the position):
+----+--------+-----+
| id | name | pos |
+----+--------+-----+
| 1 | one | 1 |
| 2 | two | 2 |
| 3 | three | 3 |
| 4 | four | 4 |
+----+--------+-----+
When the user, through your interface, indicates that the record in position 4 needs to be moved to position 2, you update the position of ID 4 to 150, then re-run your query. The user sees this:
+----+--------+-----+
| id | name | pos |
+----+--------+-----+
| 1 | one | 1 |
| 4 | four | 2 |
| 2 | two | 3 |
| 3 | three | 4 |
+----+--------+-----+
The only reason this wouldn't work is if the user is editing the data directly in the database. Though, even in that case, I'd be inclined to use this kind of solution, via views and instead-of triggers.
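The sparse-position idea can be tried out with a small Python sketch. It is an illustration under stated assumptions, not database code: positions are kept in a plain dict of id to pos, and move_between and displayed_order are names invented here. Moving a row stores the midpoint of its new neighbours' positions, and the displayed rank is recomputed from the sort order, as the ranking query does.

```python
# id -> sparse pos, matching the 100/200/300/400 example table.
POSITIONS = {1: 100, 2: 200, 3: 300, 4: 400}


def move_between(positions, row_id, before_id, after_id):
    """Return a copy where row_id sits halfway between two neighbouring rows."""
    updated = dict(positions)
    updated[row_id] = (positions[before_id] + positions[after_id]) // 2
    return updated


def displayed_order(positions):
    """Return (displayed_pos, id) pairs, ranking rows by their sparse pos."""
    ordered = sorted(positions, key=positions.get)
    return list(enumerate(ordered, start=1))


moved = move_between(POSITIONS, 4, 1, 2)   # id 4 gets pos 150
print(moved[4])                            # 150
print(displayed_order(moved))              # [(1, 1), (2, 4), (3, 2), (4, 3)]
```

With integer positions the gaps eventually run out after repeated moves into the same slot, so a real implementation would need to renumber occasionally or use a wider numeric type.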
