PostgreSQL / TypeORM: search array in array column - return only the highest arrays' intersection - arrays

Let's say we have 2 edges in a graph. Each edge has many events observed on it, and each event has one or several tags associated with it.
Let's say the first edge had 7 events with these tags: ABC, ABC, AC, BC, A, A, B.
The second edge had 3 events: BC, BC, C.
We want the user to be able to search
how many events occurred on every edge
by a set of given tags, which are not mutually exclusive, nor do they have a strict hierarchical relationship.
We represent this schema with 2 pre-aggregated tables:
Edges table:
+----+
| id |
+----+
| 1 |
| 2 |
+----+
EdgeStats table (which relates to the Edges table via edge_id):
+------+---------+-----------+---------------+
| id | edge_id | tags | metric_amount |
+------+---------+-----------+---------------+
| 1 | 1 | [A, B, C] | 7 |
| 2 | 1 | [A, B] | 7 |
| 3 | 1 | [B, C] | 5 |
| 4 | 1 | [A, C] | 6 |
| 5 | 1 | [A] | 5 |
| 6 | 1 | [B] | 4 |
| 7 | 1 | [C] | 4 |
| 8 | 1 | null | 7 | //null represents aggregated stats for given edge, not important here.
| 9 | 2 | [B, C] | 3 |
| 10 | 2 | [B] | 2 |
| 11 | 2 | [C] | 3 |
| 12 | 2 | null | 3 |
+------+---------+-----------+---------------+
Note that when the table has tags [A, B], for example, it represents the number of events that had at least one of these tags associated with them. So A OR B, or both.
Because the user can filter by any combination of these tags, DataTeam populated the EdgeStats table with all permutations of tags observed per given edge (edges are completely independent of each other, but I am looking for a way to query all edges with one query).
I need to filter this table by the tags that the user selected, let's say [A, C, D]. The problem is that we don't have tag D in the data. The expected return is:
+------+---------+-----------+---------------+
| id | edge_id | tags | metric_amount |
+------+---------+-----------+---------------+
| 4 | 1 | [A, C] | 6 |
| 11 | 2 | [C] | 3 |
+------+---------+-----------+---------------+
i.e. for each edge, the highest (largest) matching subset between what the user searched for and what we have in the tags column. Rows with id 5 and 7 were not returned because the information about them is already contained in row 4.
Why return [A, C] for an [A, C, D] search? Because there is no data on edge 1 with tag D, the metric amount for [A, C] equals the one for [A, C, D].
How do I write a query to return this?
If you can just answer the question above, you can ignore what's below:
If I needed to filter by [A], [B], or [A, B], the problem would be trivial - I could just search for an exact array match:
query.where("edge_stats.tags = :filter",
{
filter: ["A", "B"],
}
)
However, in the EdgeStats table I don't have every tag combination the user can search by (because there would be too many), so I need to find a cleverer solution.
Here is a list of a few possible solutions, all imperfect:
1) Try an exact match for all subsets of the user's search term - so if the user searches by tags [A, C, D], first try querying for [A, C, D]; if there is no exact match, try [C, D], [A, D], [A, C], and voila, we've got the match!
2) Use the <@ (is contained by) operator:
.where(
"edge_stats.tags <@ :tags",
{
tags: ["A", "C", "D"],
}
)
This will return all rows whose tags are a subset of [A, C, D], so rows 4, 5, 7, 11. Then it would be possible to filter out all but the highest subset match in code. But with this approach we couldn't use SUM and similar functions, and returning too many rows is not good practice.
3) An approach built on 2) and inspired by this answer:
.where(
"edge_stats.tags <@ :tags",
{
tags: ["A", "C", "D"],
}
)
.addOrderBy("edge.id")
.addOrderBy("CARDINALITY(edge_stats.tags)", "DESC")
.distinctOn(["edge.id"]);
For every edge, it finds all rows whose tags are a subset of [A, C, D] and keeps the highest match (highest meaning the longest array), thanks to ordering by cardinality and selecting only one row per edge.
So the returned rows are indeed 4 and 11.
This approach is great, but when I use it as one filtering step of a much larger query, I need to add a bunch of groupBy statements, and it adds a bit more complexity than I would like.
I wonder if there is a simpler approach that just gets the highest match between the array in the table's column and the array in the query argument?
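For reference, the "filter in code" step mentioned in approach 2) could look roughly like the sketch below. It is only an illustration in plain TypeScript, not TypeORM API: the names EdgeStatRow, isTagSubset and bestSubsetPerEdge are made up for this example, and the sample rows are copied from the EdgeStats table above.

```typescript
// One pre-aggregated row, mirroring the EdgeStats table from the question.
interface EdgeStatRow {
  id: number;
  edgeId: number;
  tags: string[] | null; // null marks the per-edge totals row
  metricAmount: number;
}

// True when every tag of `candidate` also appears in `search`.
function isTagSubset(candidate: string[], search: string[]): boolean {
  return candidate.every((tag) => search.indexOf(tag) !== -1);
}

// For each edge, keep the subset row with the longest tag array.
function bestSubsetPerEdge(rows: EdgeStatRow[], search: string[]): EdgeStatRow[] {
  const bestByEdge: { [edgeId: number]: EdgeStatRow } = {};
  for (const row of rows) {
    const tags = row.tags;
    if (tags === null || !isTagSubset(tags, search)) continue;
    const current = bestByEdge[row.edgeId];
    const currentSize = current && current.tags ? current.tags.length : -1;
    if (tags.length > currentSize) bestByEdge[row.edgeId] = row;
  }
  return Object.keys(bestByEdge).map((key) => bestByEdge[Number(key)]);
}

// Sample data copied from the EdgeStats table.
const edgeStatsSample: EdgeStatRow[] = [
  { id: 1, edgeId: 1, tags: ["A", "B", "C"], metricAmount: 7 },
  { id: 2, edgeId: 1, tags: ["A", "B"], metricAmount: 7 },
  { id: 3, edgeId: 1, tags: ["B", "C"], metricAmount: 5 },
  { id: 4, edgeId: 1, tags: ["A", "C"], metricAmount: 6 },
  { id: 5, edgeId: 1, tags: ["A"], metricAmount: 5 },
  { id: 6, edgeId: 1, tags: ["B"], metricAmount: 4 },
  { id: 7, edgeId: 1, tags: ["C"], metricAmount: 4 },
  { id: 8, edgeId: 1, tags: null, metricAmount: 7 },
  { id: 9, edgeId: 2, tags: ["B", "C"], metricAmount: 3 },
  { id: 10, edgeId: 2, tags: ["B"], metricAmount: 2 },
  { id: 11, edgeId: 2, tags: ["C"], metricAmount: 3 },
  { id: 12, edgeId: 2, tags: null, metricAmount: 3 },
];

// Search by [A, C, D]: row 4 wins on edge 1, row 11 on edge 2.
const bestRows = bestSubsetPerEdge(edgeStatsSample, ["A", "C", "D"]);
```

This is the same selection the expected result table describes, just done after fetching the rows, so it inherits the drawbacks already listed for approach 2).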

Your approach #3 should be fine, especially if you have an index on CARDINALITY(edge_stats.tags). However, regarding
"DataTeam populated EdgeStats table with all permutations of tags observed per given edge":
If you're using a pre-aggregation approach instead of running your queries on the raw data, I would recommend also recording the "tags observed per given edge" in the Edges table.
That way, you can run
SELECT s.edge_id, s.tags, s.metric_amount
FROM "EdgeStats" s
JOIN "Edges" e ON s.edge_id = e.id
WHERE s.tags = array_intersect(e.observed_tags, $1)
using a user-defined array_intersect function such as the one from here (PostgreSQL does not ship one built in).
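To illustrate what the join condition computes, here is a small sketch of the intersection step in TypeScript. The function names intersectTags and sameTagSet are invented for this example, and the observed-tag values are assumptions derived from the sample data (edge 1 saw A, B, C; edge 2 saw B, C). One thing to keep in mind: array equality in PostgreSQL is order-sensitive, so the stored tags and the intersection result would need a consistent element order; the sketch compares them as sets instead.

```typescript
// Tags of `observed` that also occur in `search`, in the order of `observed`.
function intersectTags(observed: string[], search: string[]): string[] {
  return observed.filter((tag) => search.indexOf(tag) !== -1);
}

// Order-insensitive comparison of two tag arrays that contain no duplicates.
function sameTagSet(left: string[], right: string[]): boolean {
  return left.length === right.length && left.every((tag) => right.indexOf(tag) !== -1);
}

const edge1Observed = ["A", "B", "C"]; // assumed observed_tags value for edge 1
const edge2Observed = ["B", "C"]; // assumed observed_tags value for edge 2
const userSearch = ["A", "C", "D"];

// The tag sets to look up with an exact match, one per edge.
const edge1Lookup = intersectTags(edge1Observed, userSearch);
const edge2Lookup = intersectTags(edge2Observed, userSearch);
```

With the sample data, the lookups are [A, C] for edge 1 and [C] for edge 2, which correspond to rows 4 and 11.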

Related

How to traverse nodes efficiently in postgresql?

I'm trying to traverse nodes starting from a specific one with the recursive clause in PostgreSQL. (Btw, I'm new to PostgreSQL.) Here's a simple version of my db tables:
table: nodes
|--------------|------------------|
| id | node_name |
|--------------|------------------|
| 1 | A |
|--------------|------------------|
| 2 | B |
|--------------|------------------|
| 3 | C |
|--------------|------------------|
| 4 | D |
|--------------|------------------|
| 5 | E |
table: links
|--------------|---------------------|-----------------|
| id | id_from | id_to |
|--------------|---------------------|-----------------|
| 1 | 1 | 2 |
|--------------|---------------------|-----------------|
| 2 | 1 | 3 |
|--------------|---------------------|-----------------|
| 3 | 2 | 4 |
|--------------|---------------------|-----------------|
| 4 | 3 | 4 |
|--------------|---------------------|-----------------|
| 5 | 4 | 5 |
So, it's just this simple directed graph (all edges go left to right):
  B
 / \
A   D - E
 \ /
  C
In this situation, what is an efficient way to get all vertices that can be visited starting from A?
What I tried:
The DFS solution found at Simple Graph Search Algorithm in SQL (PostgreSQL):
with recursive graph(node1, node2, path) as
(
select id_from, id_to, ARRAY[id_from] from links
where id_from = 1
union all
select nxt.id_from, nxt.id_to,array_append(prv.path, nxt.id_from)
from links nxt, graph prv
where nxt.id_from = prv.node2
and nxt.id_from != ALL(prv.path)
)
select * from graph
It gave me almost all paths, but it visited vertex D twice (so the D -> E step was performed twice). I want to skip already visited vertices for efficiency.
So, how can I achieve this? Thanks in advance!
It is not that simple. Within a single query, all recursive paths are completely independent, so each path does not know what is going on in a sibling path. It does not know that a certain node was already visited by a sibling.
Because SQL queries don't support any kind of global variables, it is impossible to share such information between the recursion paths that way.
I'd recommend writing a function in PL/pgSQL, which lets you solve the problem in a more conventional procedural way.
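As an illustration of that procedural idea, here is a sketch of a breadth-first traversal with one shared "visited" set. It is written in TypeScript rather than as a database function, so treat it only as the shape of the loop; GraphLink and reachableFrom are names invented for this example, and the links are copied from the question's table.

```typescript
// One row of the `links` table.
interface GraphLink {
  idFrom: number;
  idTo: number;
}

// Returns every node reachable from `startId`, expanding each node only once.
function reachableFrom(startId: number, links: GraphLink[]): number[] {
  const visited: { [nodeId: number]: boolean } = {};
  visited[startId] = true;
  const queue: number[] = [startId];
  const order: number[] = [];
  while (queue.length > 0) {
    const nodeId = queue.shift() as number;
    for (const link of links) {
      // The shared `visited` map is what the recursive CTE cannot express.
      if (link.idFrom === nodeId && !visited[link.idTo]) {
        visited[link.idTo] = true;
        queue.push(link.idTo);
        order.push(link.idTo);
      }
    }
  }
  return order;
}

const sampleLinks: GraphLink[] = [
  { idFrom: 1, idTo: 2 },
  { idFrom: 1, idTo: 3 },
  { idFrom: 2, idTo: 4 },
  { idFrom: 3, idTo: 4 },
  { idFrom: 4, idTo: 5 },
];

// Starting from A (id 1): B, C, D, E are each reported once.
const reachableFromA = reachableFrom(1, sampleLinks);
```

Because D is marked as visited when it is first reached via B, the second arrival via C is skipped and the D -> E step runs only once.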

What is the maximum number of tuples that can be returned by natural join?

Consider that the relation R(A,B,C) contains 200 tuples and the relation S(A,D,E) contains 100 tuples. What is the maximum number of tuples possible in a natural join of R and S?
Select one:
A. 300
B. 200
C. 100
D. 20000
It will be great if the answer is provided with some explanation.
The maximum number of tuples possible in natural join will be 20000.
You can find what natural join exactly is in this site.
Let us check for the given example:
Let the table R(A,B,C) be in the given format:
A | B | C
---------------
1 | 2 | 4
1 | 6 | 8
1 | 5 | 7
and the table S(A,D,E) be in the given format:
A | D | E
---------------
1 | 2 | 4
1 | 6 | 8
Here, the result of natural join will be:
A | B | C | D | E
--------------------------
1 | 2 | 4 | 2 | 4
1 | 2 | 4 | 6 | 8
1 | 6 | 8 | 2 | 4
1 | 6 | 8 | 6 | 8
1 | 5 | 7 | 2 | 4
1 | 5 | 7 | 6 | 8
Thus we can see the resulting table has 3*2=6 rows. This is the maximum possible value because both input tables have the same single value in column A (1). By the same reasoning, relations with 200 and 100 tuples give at most 200*100 = 20000 rows.
Natural join returns all tuple values that can be formed from (tuple-joining or tuple-unioning) a tuple value from one input relation and a tuple value from the other. Since they could agree on a single subtuple value for the common set of attributes, and there could be unique values for the non-common subtuples within each relation, you could get a unique result tuple from every pairing, although no more than that. So the maximum number of tuples is the product of the tuple counts of the relations.
Here that's option D, 20000.
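The product bound can be checked with a small sketch. The code below is a naive nested-loop natural join on the shared attribute A, written in TypeScript purely for illustration; the type and function names are invented, the small tables are the ones from the example above, and the large ones are generated so that every tuple has A = 1.

```typescript
interface RTuple { a: number; b: number; c: number; }
interface STuple { a: number; d: number; e: number; }
interface JoinedTuple { a: number; b: number; c: number; d: number; e: number; }

// Natural join of R(A,B,C) and S(A,D,E): pair tuples that agree on attribute A.
function naturalJoinOnA(r: RTuple[], s: STuple[]): JoinedTuple[] {
  const joined: JoinedTuple[] = [];
  for (const rt of r) {
    for (const st of s) {
      if (rt.a === st.a) {
        joined.push({ a: rt.a, b: rt.b, c: rt.c, d: st.d, e: st.e });
      }
    }
  }
  return joined;
}

// The 3-row and 2-row tables from the worked example.
const smallR: RTuple[] = [
  { a: 1, b: 2, c: 4 },
  { a: 1, b: 6, c: 8 },
  { a: 1, b: 5, c: 7 },
];
const smallS: STuple[] = [
  { a: 1, d: 2, e: 4 },
  { a: 1, d: 6, e: 8 },
];
const smallJoinSize = naturalJoinOnA(smallR, smallS).length;

// Worst case from the question: 200 and 100 tuples that all share A = 1.
const bigR: RTuple[] = [];
for (let i = 0; i < 200; i++) bigR.push({ a: 1, b: i, c: i });
const bigS: STuple[] = [];
for (let j = 0; j < 100; j++) bigS.push({ a: 1, d: j, e: j });
const bigJoinSize = naturalJoinOnA(bigR, bigS).length;
```

Here smallJoinSize is 6 and bigJoinSize is 20000, since every pairing matches on A and each pairing yields a distinct result tuple.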
A is present in both R and S, so according to natural join 100 tuples take part in the join process.
Option C 100 is the answer.

array formula with dates

I have four columns of dates (A, B, C, D). I want Excel to verify whether each period in "period 2" intersects any of the "period 1" periods, checking against ALL of rows 1, 2, 3, ...
For example, 20-28/01/2016 intersects 01-24/01/2016 AND 25/01-03/02/2016. The answer in this case in column E must be "wrong".
I think this has to be an array formula, because each cell must check an entire column. If it could be done without an array formula I would be very happy, because array formulas slow calculation down a lot on my computer.
  |     A    |     B    |     C    |     D    |    E    |
  |       period 1      |       period 2      |         |
1 |01/01/2016|24/01/2016|20/01/2016|28/01/2016| "wrong" |
2 |25/01/2016|03/02/2016|04/02/2016|10/02/2016| "ok"    |
3 |
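The check being asked for can be written down as a short sketch. This is TypeScript rather than an Excel formula, so it only shows the logic, and it assumes that "wrong" means the period 2 range overlaps at least one period 1 range, with both end dates inclusive. The names DatePeriod, utcDay, periodsOverlap and checkAgainstPeriod1 are made up for this example.

```typescript
// A date range with inclusive bounds, stored as UTC millisecond timestamps.
interface DatePeriod {
  start: number;
  end: number;
}

// Builds a timestamp from a calendar date (month is 1-based here).
function utcDay(year: number, month: number, dayOfMonth: number): number {
  return Date.UTC(year, month - 1, dayOfMonth);
}

// Two inclusive ranges overlap when each one starts before the other ends.
function periodsOverlap(p: DatePeriod, q: DatePeriod): boolean {
  return p.start <= q.end && q.start <= p.end;
}

// "wrong" if the period 2 range touches any period 1 range, otherwise "ok".
function checkAgainstPeriod1(period2: DatePeriod, allPeriod1: DatePeriod[]): string {
  return allPeriod1.some((p1) => periodsOverlap(p1, period2)) ? "wrong" : "ok";
}

// Columns A and B of the sample sheet.
const period1List: DatePeriod[] = [
  { start: utcDay(2016, 1, 1), end: utcDay(2016, 1, 24) },
  { start: utcDay(2016, 1, 25), end: utcDay(2016, 2, 3) },
];

// Columns C and D of rows 1 and 2.
const row1Result = checkAgainstPeriod1({ start: utcDay(2016, 1, 20), end: utcDay(2016, 1, 28) }, period1List);
const row2Result = checkAgainstPeriod1({ start: utcDay(2016, 2, 4), end: utcDay(2016, 2, 10) }, period1List);
```

For the two sample rows this gives "wrong" and "ok", matching column E. The overlap test itself is just two comparisons per pair of ranges, so it does not depend on any particular spreadsheet feature.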

Creating hierarchical data (tree) structures in Neo4j using "tree keys"

I have imported data from a CSV file and created a lot of Nodes, all of which are related to other Nodes within the same data set based on a "Tree Number" hierarchy system:
For example, the Node with Tree Number A01.111 is a direct child of Node A01, and the Node with Tree Number A01.111.230 is a direct child of Node A01.111.
What I am trying to do is create unique relationships between Nodes that are direct children of other Nodes. For example Node A01.111.230 should only have one "IS_CHILD_OF" relationship, with Node A01.111.
I have tried several things, for example:
MATCH (n:Node), (n2:Node)
WHERE (n2.treeNumber STARTS WITH n.treeNumber)
AND (n <> n2)
AND NOT ((n2)-[:IS_CHILD_OF]->())
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n);
This example results in creating unique "IS_CHILD_OF" relationships but not with the direct parent of a Node. Rather, Node A01.111.230 would be related to Node A01.
I'd like to suggest another general solution, also avoiding a cartesian product as @InverseFalcon points out.
Let's indeed start by creating an index for faster lookup, and inserting some test data:
CREATE CONSTRAINT ON (n:Node) ASSERT n.treeNumber IS UNIQUE;
CREATE (:Node {treeNumber: 'A01.111.230'})
CREATE (:Node {treeNumber: 'A01.111'})
CREATE (:Node {treeNumber: 'A01'})
Then we need to scan all nodes as potential parents, and look for children which start with the treeNumber of the parent (STARTS WITH can use the index) and have no dots in the "remainder" of the treeNumber (i.e. a direct child), instead of splitting, joining, etc.:
MATCH (p:Node), (c:Node)
WHERE c.treeNumber STARTS WITH p.treeNumber
AND p <> c
AND NOT substring(c.treeNumber, length(p.treeNumber) + 1) CONTAINS '.'
RETURN p, c
I replaced the creation of the relationship by a simple RETURN for profiling purposes, but you can simply replace it by CREATE UNIQUE or MERGE.
Actually, we can get rid of the p <> c predicate and the + 1 on the length by pre-computing the actual prefix which should match:
MATCH (p:Node)
WITH p, p.treeNumber + '.' AS parentNumber
MATCH (c:Node)
WHERE c.treeNumber STARTS WITH parentNumber
AND NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
RETURN p, c
However, profiling that query shows that the index is not used, and there is a cartesian product (so we have an O(n^2) algorithm):
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 26 | c, p, parentNumber | NOT(Contains(SubstringFunction(c.treeNumber,length(parentNumber),None),{ AUTOSTRING1})) AND StartsWith(c.treeNumber,parentNumber) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Apply | 2 | 9 | 0 | p, parentNumber -- c | |
| |\ +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| | +NodeByLabelScan | 9 | 9 | 12 | c | :Node |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Projection | 3 | 3 | 3 | parentNumber -- p | p; Add(p.treeNumber,{ AUTOSTRING0}) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 45
But if we simply add a hint like so
MATCH (p:Node)
WITH p, p.treeNumber + '.' AS parentNumber
MATCH (c:Node)
USING INDEX c:Node(treeNumber)
WHERE c.treeNumber STARTS WITH parentNumber
AND NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
RETURN p, c
it does use the index and we have something like an O(n*log(n)) algorithm (log(n) for the index lookup):
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 6 | c, p, parentNumber | NOT(Contains(SubstringFunction(c.treeNumber,length(parentNumber),None),{ AUTOSTRING1})) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Apply | 2 | 3 | 0 | p, parentNumber -- c | |
| |\ +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| | +NodeUniqueIndexSeekByRange | 9 | 3 | 6 | c | :Node(treeNumber STARTS WITH parentNumber) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Projection | 3 | 3 | 3 | parentNumber -- p | p; Add(p.treeNumber,{ AUTOSTRING0}) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
Total database accesses: 19
Note that I did cheat a bit when introducing the WITH step creating the prefix earlier, as I noticed it improved the execution plan and DB accesses over
MATCH (p:Node), (c:Node)
USING INDEX c:Node(treeNumber)
WHERE c.treeNumber STARTS WITH p.treeNumber
AND p <> c
AND NOT substring(c.treeNumber, length(p.treeNumber) + 1) CONTAINS '.'
RETURN p, c
which has the following execution plan:
Compiler CYPHER 3.0
Planner RULE
Runtime INTERPRETED
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DB Hits | Variables | Other |
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 9 | c, p | NOT(p == c) AND NOT(Contains(SubstringFunction(c.treeNumber,Add(length(p.treeNumber),{ AUTOINT0}),None),{ AUTOSTRING1})) |
| | +------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +SchemaIndex | 6 | 12 | c -- p | PrefixSeekRangeExpression(p.treeNumber); :Node(treeNumber) |
| | +------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabel | 3 | 4 | p | :Node |
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 25
Finally, for the record, the execution plan of the original query I wrote (i.e. without the hint) was:
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 21 | c, p | NOT(p == c) AND StartsWith(c.treeNumber,p.treeNumber) AND NOT(Contains(SubstringFunction(c.treeNumber,Add(length(p.treeNumber),{ AUTOINT0}),None),{ AUTOSTRING1})) |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +CartesianProduct | 9 | 9 | 0 | p -- c | |
| |\ +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | +NodeByLabelScan | 3 | 9 | 12 | c | :Node |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 37
It's not the worst one: the one without the hint but with the pre-computed prefix is! This is why you should always measure.
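Independently of the query plans, the direct-child test used in the pre-computed-prefix queries (starts with the parent's treeNumber plus a dot, and no further dot in the remainder) can be sketched as a plain predicate. This is TypeScript for illustration only, and isDirectChildKey is a name invented here.

```typescript
// True when `childKey` is exactly one level below `parentKey`.
function isDirectChildKey(parentKey: string, childKey: string): boolean {
  const prefix = parentKey + ".";
  // Equivalent of: c.treeNumber STARTS WITH parentNumber
  if (childKey.indexOf(prefix) !== 0) return false;
  // Equivalent of: NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
  const remainder = childKey.substring(prefix.length);
  return remainder.indexOf(".") === -1;
}

const directChild = isDirectChildKey("A01", "A01.111");
const grandChild = isDirectChildKey("A01", "A01.111.230");
const nestedDirectChild = isDirectChildKey("A01.111", "A01.111.230");
```

Including the dot in the prefix also means a node is never matched as its own child, which is why the p <> c predicate could be dropped.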
I think we can improve on the query a bit. First, ensure you have either a unique constraint or an index on :Node.treeNumber, as you'll need that to improve your parent node lookups in this query.
Next, let's match on child nodes, excluding root nodes (assuming no .'s in the root's treeNumber) and nodes that have already been processed and have a relationship already.
Then we'll find each node's parent by the treeNumber using our index, and create the relationship. This assumes that a child treeNumber always has 4 more characters, including the dot.
MATCH (child:Node)
WHERE child.treeNumber CONTAINS '.'
AND NOT EXISTS( (child)-[:IS_CHILD_OF]->() )
WITH child, SUBSTRING(child.treeNumber, 0, SIZE(child.treeNumber)-4) as parentNumber
MATCH (parent:Node)
WHERE parent.treeNumber = parentNumber
CREATE UNIQUE (child)-[:IS_CHILD_OF]->(parent)
I think this query avoids a cartesian product as you may get from other answers, and should be around O(n) (someone correct me if I'm wrong).
EDIT
In the event that each subset of numbers in treeNumbers is NOT constrained to 3 (as in your description, actually, with 'A01.111.23'), then you need a different means of deriving the parentNumber. Neo4j is a little weak here, as it lacks both an indexOf() function as well as a join() function to reverse a split(). You may need the APOC Procedures library installed to allow access to a join() function.
The query to handle cases with variable counts of digits in the numeric subsets of treeNumber becomes this:
MATCH (child:Node)
WHERE child.treeNumber CONTAINS '.'
AND NOT EXISTS( (child)-[:IS_CHILD_OF]->() )
WITH child, SPLIT(child.treeNumber, '.') as splitNumber
CALL apoc.text.join(splitNumber[0..-1], '.') YIELD value AS parentNumber
WITH child, parentNumber
MATCH (parent:Node)
WHERE parent.treeNumber = parentNumber
CREATE UNIQUE (child)-[:IS_CHILD_OF]->(parent)
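The idea behind this query (derive the parent's treeNumber by dropping the last dotted segment, then look the parent up directly) can be sketched outside Cypher as follows. This is a TypeScript illustration under the assumption that treeNumbers are unique; parentKeyOf and linkChildrenToParents are hypothetical names, and the lookup table stands in for the index.

```typescript
// Drops the last dotted segment; returns null for root keys without a dot.
function parentKeyOf(treeNumber: string): string | null {
  const lastDot = treeNumber.lastIndexOf(".");
  return lastDot === -1 ? null : treeNumber.substring(0, lastDot);
}

// Maps each child treeNumber to its parent's treeNumber in one pass.
function linkChildrenToParents(treeNumbers: string[]): { [child: string]: string } {
  const known: { [key: string]: boolean } = {};
  for (const key of treeNumbers) known[key] = true;

  const parentOf: { [child: string]: string } = {};
  for (const key of treeNumbers) {
    const parentKey = parentKeyOf(key);
    // Only link when the derived parent actually exists in the data set.
    if (parentKey !== null && known[parentKey] === true) {
      parentOf[key] = parentKey;
    }
  }
  return parentOf;
}

const treeLinks = linkChildrenToParents(["A01", "A01.111", "A01.111.230", "A01.111.23"]);
```

Each node is visited once and each parent is found with a single lookup, which matches the roughly O(n) behaviour described above, and segments of any length (such as the trailing 23) are handled the same way.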
I think I just figured out a solution! (If someone has a more elegant one please do post)
I just realized that the "Tree Number" coding system always uses 3-digit numbers between the dots, e.g. A01.111.230 or C02.100. Therefore, if a Node is the direct child of another Node, its "Tree Number" should not only start with the Tree Number of the parent Node, it should also be 4 characters longer (one character for the dot '.' and 3 characters for the numeric value).
Therefore my solution that seems to do the job is:
MATCH (n:Node), (n2:Node)
WHERE (n2.treeNumber STARTS WITH n.treeNumber)
AND (length(n2.treeNumber) = (length(n.treeNumber) + 4))
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n);
For your requirement STARTS WITH won't work, since A01.111.23 does indeed start with A01 in addition to starting with A01.111.
The treeNumber is made up of several parts with '.' as the separator. Let's not make any assumptions about the maximum/minimum possible character lengths of the individual parts. What we need is to compare all but the last part of each node's treeNumber with that of the potential child node being tested. You can achieve this using Cypher's split() function as follows:
MATCH (n1:Node), (n2:Node)
WHERE split(n2.treeNumber,'.')[0..-1] = split(n1.treeNumber,'.')
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n1);
The split() function splits a string, at each occurrence of a given separator, into a list of strings (parts). In this context the separator is '.' to split any treeNumber. We can select a subset of a list in Cypher using the syntax list[{startIndex}..{endIndex}]. Negative indices for reverse lookup are permitted, such as the one used in the above query.
This solution should generalize to all possible treeNumber values, in the format at hand, irrespective of number of parts and individual part lengths.

SQL query/functions to flatten a multiple table item and hierarchical item links

I have the data structure below, storing items and links between them in a parent-child relationship.
I need to display the result as shown below, one line per parent, with all its children.
Values are the ItemCodes by item type, e.g. C-1 and C-2 are the first 2 items of type C, and so on.
In a previous application version, there was at most one C and one H for each P.
So I used a mix of max() and group by, and the result was there.
But now, parents may be linked to different types and number of children.
I tried several techniques including adding temporary tables, views, use of PIVOT, ROLLUP, CUBE, stored procedures and cursors (!), but nothing worked for this specific problem.
I finally succeeded to adapt the query. However, there are many select from (select ...) clauses, as well as row_number based queries.
Also, the result is not dynamic, meaning the number of columns is fixed (which is acceptable).
My question is: what would be your approach for such an issue (if possible in a single query)? Thank you!
The table structure:
Item
-------------------------------
ItemId | ItemCode | ItemType
-------------------------------
1 | P1 | P
2 | C11 | C
3 | H11 | H
4 | H12 | H
5 | P2 | P
6 | C21 | C
7 | C22 | C
8 | C23 | C
9 | H21 | H
ItemLink
---------------------------------------
LinkId | ParentItemId | ChildItemId
---------------------------------------
1 | 1 | 2
2 | 1 | 3
3 | 1 | 4
4 | 5 | 6
5 | 5 | 7
6 | 5 | 8
7 | 5 | 9
Expected Result
-----------------------------------------------------
P C-1 C-2 ... C-N H-1 H-2 ... H-N
-----------------------------------------------------
P1 C11 NULL NULL NULL H11 H12 NULL NULL
P2 C21 C22 C23 NULL H21 NULL NULL NULL
...
Part of my current query (which is working):
!http://s12.postimg.org/r64tgjjnh/SOQuestion.png
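To make the target shape concrete, here is a sketch of the flattening done in application code instead of SQL. It is TypeScript for illustration only; the names CatalogItem, CatalogLink and flattenChildren are invented, the number of slots per type is fixed at 4 as in the expected result, and the sample links assume that the children of P2 point to ItemId 5, which is what the expected result implies.

```typescript
interface CatalogItem { itemId: number; itemCode: string; itemType: string; }
interface CatalogLink { parentItemId: number; childItemId: number; }

// One output row per parent of type "P": parent code, then `slots` columns per child type.
function flattenChildren(
  items: CatalogItem[],
  links: CatalogLink[],
  childTypes: string[],
  slots: number
): (string | null)[][] {
  const byId: { [itemId: number]: CatalogItem } = {};
  for (const item of items) byId[item.itemId] = item;

  const rows: (string | null)[][] = [];
  for (const parentItem of items) {
    if (parentItem.itemType !== "P") continue;
    const row: (string | null)[] = [parentItem.itemCode];
    for (const childType of childTypes) {
      // Codes of this parent's children of the given type, in link order.
      const codes = links
        .filter((l) => l.parentItemId === parentItem.itemId && byId[l.childItemId].itemType === childType)
        .map((l) => byId[l.childItemId].itemCode);
      // Pad with null so every row has the same number of columns.
      for (let i = 0; i < slots; i++) row.push(i < codes.length ? codes[i] : null);
    }
    rows.push(row);
  }
  return rows;
}

const sampleItems: CatalogItem[] = [
  { itemId: 1, itemCode: "P1", itemType: "P" },
  { itemId: 2, itemCode: "C11", itemType: "C" },
  { itemId: 3, itemCode: "H11", itemType: "H" },
  { itemId: 4, itemCode: "H12", itemType: "H" },
  { itemId: 5, itemCode: "P2", itemType: "P" },
  { itemId: 6, itemCode: "C21", itemType: "C" },
  { itemId: 7, itemCode: "C22", itemType: "C" },
  { itemId: 8, itemCode: "C23", itemType: "C" },
  { itemId: 9, itemCode: "H21", itemType: "H" },
];
// Assumption: the children of P2 are linked to ItemId 5.
const sampleItemLinks: CatalogLink[] = [
  { parentItemId: 1, childItemId: 2 },
  { parentItemId: 1, childItemId: 3 },
  { parentItemId: 1, childItemId: 4 },
  { parentItemId: 5, childItemId: 6 },
  { parentItemId: 5, childItemId: 7 },
  { parentItemId: 5, childItemId: 8 },
  { parentItemId: 5, childItemId: 9 },
];

const flattenedRows = flattenChildren(sampleItems, sampleItemLinks, ["C", "H"], 4);
```

The per-type position used here (first C, second C, ...) plays the same role as a row number partitioned by parent and item type in a SQL version, followed by a pivot over a fixed number of columns.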
