Let's say we have 2 edges in a graph, each of which has many events observed on it, and each event has one or several tags associated with it.
Say the first edge had 7 events with these tags: ABC, ABC, AC, BC, A, A, B.
The second edge had 3 events: BC, BC, C.
We want the user to be able to search how many events occurred on every edge for a given set of tags, where the tags are not mutually exclusive, nor do they have a strict hierarchical relationship.
We represent this schema with 2 pre-aggregated tables:
Edges table:
+----+
| id |
+----+
| 1 |
| 2 |
+----+
EdgeStats table (which relates to the Edges table via edge_id):
+------+---------+-----------+---------------+
| id | edge_id | tags | metric_amount |
+------+---------+-----------+---------------+
| 1 | 1 | [A, B, C] | 7 |
| 2 | 1 | [A, B] | 7 |
| 3 | 1 | [B, C] | 5 |
| 4 | 1 | [A, C] | 6 |
| 5 | 1 | [A] | 5 |
| 6 | 1 | [B] | 4 |
| 7 | 1 | [C] | 4 |
| 8    | 1       | null      | 7             | // null represents aggregated stats for the given edge; not important here
| 9 | 2 | [B, C] | 3 |
| 10 | 2 | [B] | 2 |
| 11 | 2 | [C] | 3 |
| 12 | 2 | null | 3 |
+------+---------+-----------+---------------+
Note that when the table has tags [A, B], for example, it represents the number of events that had at least one of these tags associated with them: A OR B, or both.
Because the user can filter by any combination of these tags, DataTeam populated the EdgeStats table with all combinations (subsets) of tags observed per given edge (edges are completely independent of each other; however, I am looking for a way to query all edges with one query).
I need to filter this table by the tags the user selected, let's say [A, C, D]. The problem is we don't have tag D in the data. The expected result is:
+------+---------+-----------+---------------+
| id | edge_id | tags | metric_amount |
+------+---------+-----------+---------------+
| 4 | 1 | [A, C] | 6 |
| 11 | 2 | [C] | 3 |
+------+---------+-----------+---------------+
i.e. for each edge, the largest matching subset between what the user searched for and what we have in the tags column. Rows with ids 5 and 7 were not returned because the information about them is already contained in row 4.
Why return [A, C] for an [A, C, D] search? Since there is no data on edge 1 with tag D, the metric amount for [A, C] equals the one for [A, C, D].
How do I write a query to return this?
If you can just answer the question above, you can ignore what's below:
If I needed to filter by [A], [B], or [A, B], the problem would be trivial - I could just search for an exact array match:
query.where("edge_stats.tags = :filter",
{
filter: [A, B],
}
)
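(In plain SQL that exact match is just an equality test on the array column; a sketch, assuming tags is text[]:)
SELECT edge_id, tags, metric_amount
FROM edge_stats
WHERE tags = ARRAY['A', 'B'];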
However, the EdgeStats table doesn't contain every tag combination the user can search by (that would be too many), so I need to find a more clever solution.
Here is a list of a few possible solutions, all imperfect:
try an exact match for all subsets of the user's search term: if the user searches by tags [A, C, D], first try querying for [A, C, D]; if there is no exact match, try [C, D], [A, D], [A, C], and voila, we got the match!
use the <@ (is contained by) operator:
.where(
  "edge_stats.tags <@ :tags",
  {
    tags: ["A", "C", "D"],
  }
)
This will return all rows whose tags are a subset of [A, C, D], i.e. rows 4, 5, 7, and 11. Then it would be possible to filter out all but the largest subset match per edge in code. But using this approach we couldn't use SUM and similar functions, and returning too many rows is not good practice.
approach built on 2) and inspired by this answer:
.where(
  "edge_stats.tags <@ :tags",
  {
    tags: ["A", "C", "D"],
  }
)
.addOrderBy("edge.id")
.addOrderBy("CARDINALITY(edge_stats.tags)", "DESC")
.distinctOn(["edge.id"]);
What it does: for every edge, find all rows whose tags are a subset of [A, C, D], and take the largest match (longest array), thanks to ordering by cardinality and selecting only one row per edge.
So the returned rows indeed are 4 and 11.
This approach is great, but when I use it as one filtering step of a much larger query, I need to add a bunch of groupBy statements, and it essentially adds a bit more complexity than I would like.
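(For reference, the plain SQL behind approach 3 is roughly the following; a sketch, assuming tags is a text[] column:)
SELECT DISTINCT ON (s.edge_id)
       s.id, s.edge_id, s.tags, s.metric_amount
FROM edge_stats s
WHERE s.tags <@ ARRAY['A', 'C', 'D']          -- tags is a subset of the filter
ORDER BY s.edge_id, CARDINALITY(s.tags) DESC;  -- keep the largest match per edge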
I wonder if there could be a simpler approach that simply gets the highest match between the array in the table's column and the array in the query argument?
Your approach #3 should be fine, especially if you have an index on CARDINALITY(edge_stats.tags). However,
DataTeam populated the EdgeStats table with all combinations (subsets) of tags observed per given edge
If you're using a pre-aggregation approach instead of running your queries on the raw data, I would recommend also recording the "tags observed per given edge" in the Edges table.
That way, you can
SELECT s.edge_id, s.tags, s.metric_amount
FROM "EdgeStats" s
JOIN "Edges" e ON s.edge_id = e.id
WHERE s.tags = array_intersect(e.observed_tags, $1)
using the array_intersect function from here.
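array_intersect is not a Postgres built-in; one common way to define it (a sketch, assuming both arrays have the same element type) is:
CREATE FUNCTION array_intersect(a anyarray, b anyarray)
RETURNS anyarray AS $$
    -- keep only the elements present in both arrays
    SELECT ARRAY(
        SELECT unnest(a)
        INTERSECT
        SELECT unnest(b)
    );
$$ LANGUAGE sql IMMUTABLE;
Note that array equality in Postgres is order-sensitive, so observed_tags, the stored tags, and the user's filter should all be kept in a canonical (e.g. sorted) order for the s.tags = ... comparison to match.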
I am using Scala and Spark.
I have two data frames.
The first one is like following:
+------+------+-----------+
| num1 | num2 | arr |
+------+------+-----------+
| 25 | 10 | [a,c] |
| 35 | 15 | [a,b,d] |
+------+------+-----------+
In the second one the data frame headers are
num1, num2, a, b, c, d
I have created a case class containing all the possible header columns.
Now what I want is: by matching the columns num1 and num2, check whether the array in the arr column contains the headers of the second data frame.
If it does, the value should be 1, else 0.
So the required output is:
+------+------+---+---+---+---+
| num1 | num2 | a | b | c | d |
+------+------+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 |
| 35 | 15 | 1 | 1 | 0 | 1 |
+------+------+---+---+---+---+
If I understand correctly, you want to transform the array column arr into one column per possible value, that would contain whether or not the array contains that value.
If so, you can use the array_contains function like this:
import org.apache.spark.sql.functions.{array_contains, col}
import spark.implicits._ // assumes a SparkSession named spark, for .toDF

val df = Seq((25, 10, Seq("a","c")), (35, 15, Seq("a","b","d")))
  .toDF("num1", "num2", "arr")
val values = Seq("a", "b", "c", "d")
df
  .select(Seq("num1", "num2").map(col) ++
    // cast the boolean to int so the output shows 1/0 instead of true/false
    values.map(x => array_contains(col("arr"), x).cast("int") as x) : _*)
  .show
+----+----+---+---+---+---+
|num1|num2| a| b| c| d|
+----+----+---+---+---+---+
| 25| 10| 1| 0| 1| 0|
| 35| 15| 1| 1| 0| 1|
+----+----+---+---+---+---+
I have a Postgres DB with a column that has a JSON array for the value in each row.
Example:
id | json_data
1 | [{"sub_id":"a1", "flag":"true", "type":"something"}, {"sub_id":"a2", "flag":"true", "type":"something"}]
2 | [{"sub_id":"b1", "flag":"false", "type":"something"}, {"sub_id":"b2", "flag":"false", "type":"something"}]
3 | [{"sub_id":"c1", "flag":"true", "type":"something"}]
I want to be able to create a new view so that I can interact with data structured like this:
id | sub_id | flag | type
1 | a1 | true | something
1 | a2 | true | something
2 | b1 | false | something
2 | b2 | false | something
3 | c1 | true | something
Perhaps there is something I am not understanding from the Postgres documentation. It seems like I need to leverage json_array_elements, but all of the documentation and related examples I see show the JSON being passed as a string to this function.
How do I use this for each row of a given column?
You need to unnest using jsonb_array_elements, then you can access each key:
select t.id,
j.e ->> 'sub_id' as sub_id,
j.e ->> 'flag' as flag,
j.e ->> 'type' as type
from the_table t
cross join jsonb_array_elements(t.json_data::jsonb) as j(e)
order by t.id;
This assumes your column is defined as jsonb (which it should be). If it's just json, you need to use json_array_elements() instead.
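Since the goal is a view, you can wrap the query directly (a sketch; the view name is a placeholder):
CREATE VIEW json_data_flat AS
SELECT t.id,
       j.e ->> 'sub_id' AS sub_id,
       j.e ->> 'flag'   AS flag,
       j.e ->> 'type'   AS type
FROM the_table t
CROSS JOIN jsonb_array_elements(t.json_data::jsonb) AS j(e);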
The purpose of the query is to compute, for each tag X, the probability that a tag Y also appears on the same post.
For example:
Table ListTagsFromPost:
id| Tags |
_________________
1 | <A><B><C><D>|
2 | <A><B> |
3 | <A><C> |
4 | <B><D> |
5 | <A><D> |
6 | <A><B><C> |
7 | <A><C> |
8 | <A><D> |
9 | <B><D> |
The query should return:
     |      A       |      B       |      C       |      D
  A  | 100          | 0.4285714286 | 0.5714285714 | 0.4285714286
  B  | 0.6          | 100          | 0.4          | 0.6
  C  | 100          | 0.5          | 100          | 0.25
  D  | 0.6          | 0.6          | 0.2          | 100
(Each cell is the probability of the column tag appearing on a post given the row tag; 100 means 100%.)
The problem is that the ListTagsFromPost table has 50,000 records.
That means there can be many distinct tags (not just A, B, C, D as in the example).
I need a simple query in order to get this result.
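For clarity, the computation is P(Y|X) = (posts with both X and Y) / (posts with X). If the tags were first normalized into a PostTags(id, tag) table with one row per post/tag pair (how to split the <A><B> strings depends on the RDBMS), a self-join sketch would be:
SELECT x.tag AS tag_x,
       y.tag AS tag_y,
       COUNT(*) * 1.0 / totals.cnt AS p_y_given_x
FROM PostTags x
JOIN PostTags y ON y.id = x.id                     -- co-occurrence on the same post
JOIN (SELECT tag, COUNT(*) AS cnt
      FROM PostTags
      GROUP BY tag) totals ON totals.tag = x.tag   -- posts containing X
GROUP BY x.tag, y.tag, totals.cnt;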
I would like to import a txt file that contains a defined structure. I need to create an identity for every record type and also a parent-child id that will make it easy to join specific record types.
I also have a lookup table that says which is the parent record type:
LOOKUP TABLE
TYPE | STRUCTURE LEVEL | PARENT
A | 1 |
B | 2 | A
C | 3 | B
D | 3 | B
E | 2 | A
F | 3 | E
And my data looks similar to:
TYPE | INFO
A | dummy
B | dummy2
C | dummy3
C | dummy4
D | dummy5
B | dummy6
B | dummy7
C | dummy8
B | dummy9
D | dummy10
E | dummy11
F | dummy12
If you look at the table data, there are some situations that I need to cover:
The first "B" record has 3 children (two of type "C" and one of type "D")
The second "B" record has no child
The third "B" record has no "D" child
The fourth "B" record has no "C" child
"B" and "E" records are siblings, and the same goes for "C" and "D"
I would like to get the following result (it does not matter whether the result is in a single table or not):
Table A
ID | PARENT_ID | TYPE | INFO
1 | | A | dummy
Table B
ID | PARENT_ID | TYPE | INFO
1 | 1 | B | dummy2
2 | 1 | B | dummy6
3 | 1 | B | dummy7
Table C
ID | PARENT_ID | TYPE | INFO
1 | 1 | C | dummy3
2 | 1 | C | dummy4
3 | 3 | C | dummy8
Table D
ID | PARENT_ID | TYPE | INFO
1 | 1 | D | dummy5
2 | 4 | D | dummy10
Table E
ID | PARENT_ID | TYPE | INFO
1 | 1 | E | dummy11
Table F
ID | PARENT_ID | TYPE | INFO
1 | 1 | F | dummy12
Sorry for the long explanation and thanks in advance for any help.
Solved. My workaround was to develop my own C# program for importing the data into SQL Server.
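For anyone who wants to stay in SQL, a set-based sketch (SQL Server syntax; it assumes the file is bulk-loaded in file order into a staging table RawLines(line_no, type, info) and that the lookup table is named LookupTable; all names are placeholders). The idea: number the rows per type, then attach each row to the closest preceding row of its parent type.
WITH numbered AS (
    SELECT r.line_no, r.type, r.info,
           -- per-type identity, assigned in file order
           ROW_NUMBER() OVER (PARTITION BY r.type ORDER BY r.line_no) AS id,
           l.parent
    FROM RawLines r
    JOIN LookupTable l ON l.type = r.type
)
SELECT c.id,
       pa.id AS parent_id,
       c.type,
       c.info
FROM numbered c
OUTER APPLY (
    -- closest preceding row whose type is this row's parent
    SELECT TOP 1 p.id
    FROM numbered p
    WHERE p.type = c.parent
      AND p.line_no < c.line_no
    ORDER BY p.line_no DESC
) pa
ORDER BY c.type, c.id;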