How to search star query on AgensGraph?

I want to find a star pattern on AgensGraph using Cypher, but a Cypher pattern seems to express only a linear path.
How can I find a star pattern on AgensGraph?
I want to get the following result from a query.
agens=# create (:v{id:1})-[:e]->(:v{id:2})-[:e]->(:v{id:3});
GRAPH WRITE (INSERT VERTEX 3, INSERT EDGE 2)
agens=# match (n:v{id:2}) create (n)-[:e]->(:v{id:4});
GRAPH WRITE (INSERT VERTEX 1, INSERT EDGE 1)
agens=# match some-pattern
agens-# return n1, r1, n2, r2, n3, r3, n4;
n1 | r1 | n2 | r2 | n3 | r3 | n4
-----------------+-------------------+-----------------+-------------------+-----------------+-------------------+-----------------
v[3.1]{"id": 1} | e[4.1][3.1,3.2]{} | v[3.2]{"id": 2} | e[4.2][3.2,3.3]{} | v[3.3]{"id": 3} | e[4.3][3.2,3.4]{} | v[3.4]{"id": 4}
(1 row)

You can reuse a variable to search for a star pattern.
First, write the linear pattern in the first MATCH clause.
Then add the branching pattern in a second MATCH clause, reusing the variable bound in the first one (n2 here).
agens=# match (n1:v{id:1})-[r1:e]->(n2:v{id:2})-[r2:e]->(n3:v{id:3})
agens-# match (n2)-[r3:e]->(n4:v{id:4})
agens-# return n1, r1, n2, r2, n3, r3, n4;
n1 | r1 | n2 | r2 | n3 | r3 | n4
-----------------+-------------------+-----------------+-------------------+-----------------+-------------------+-----------------
v[3.1]{"id": 1} | e[4.1][3.1,3.2]{} | v[3.2]{"id": 2} | e[4.2][3.2,3.3]{} | v[3.3]{"id": 3} | e[4.3][3.2,3.4]{} | v[3.4]{"id": 4}
(1 row)


PostgreSQL / TypeORM: search array in array column - return only the highest arrays' intersection

Let's say we have 2 edges in a graph. Each edge has many events observed on it, and each event has one or several tags associated with it.
Let's say the first edge had 7 events with these tags: ABC, ABC, AC, BC, A, A, B.
The second edge had 3 events: BC, BC, C.
We want the user to be able to search how many events occurred on every edge by a set of given tags, which are neither mutually exclusive nor strictly hierarchical.
We represent this schema with 2 pre-aggregated tables:
Edges table:
+----+
| id |
+----+
| 1 |
| 2 |
+----+
EdgeStats table (related to the Edges table via edge_id):
+------+---------+-----------+---------------+
| id | edge_id | tags | metric_amount |
+------+---------+-----------+---------------+
| 1 | 1 | [A, B, C] | 7 |
| 2 | 1 | [A, B] | 7 |
| 3 | 1 | [B, C] | 5 |
| 4 | 1 | [A, C] | 6 |
| 5 | 1 | [A] | 5 |
| 6 | 1 | [B] | 4 |
| 7 | 1 | [C] | 4 |
| 8 | 1 | null | 7 | //null represents aggregated stats for given edge, not important here.
| 9 | 2 | [B, C] | 3 |
| 10 | 2 | [B] | 2 |
| 11 | 2 | [C] | 3 |
| 12 | 2 | null | 3 |
+------+---------+-----------+---------------+
Note that when the table has tags [A, B], for example, the amount is the number of events that had at least one of these tags associated with them: A or B, or both.
Because user can filter by any combination of these tags, DataTeam populated EdgeStats table with all permutations of tags observed per given edge (edges are completely independent of each other, however I am looking for way to query all edges by one query).
I need to filter this table by tags that user selected, let's say [A, C, D]. Problem is we don't have tag D in the data. The expected return is:
+------+---------+-----------+---------------+
| id | edge_id | tags | metric_amount |
+------+---------+-----------+---------------+
| 4 | 1 | [A, C] | 6 |
| 11 | 2 | [C] | 3 |
+------+---------+-----------+---------------+
i.e. for each edge, the largest matching subset between what the user searched for and what we have in the tags column. Rows with id 5 and 7 are not returned because their information is already contained in row 4.
Why return [A, C] for an [A, C, D] search? Because there is no data on edge 1 with tag D, the metric amount for [A, C] equals the one for [A, C, D].
How do I write a query to return this?
If you can just answer the question above, you can ignore what's below:
If I needed to filter by [A], [B], or [A, B], problem would be trivial - I could just search for exact array match:
query.where("edge_stats.tags = :filter",
{
filter: ["A", "B"],
}
)
However in EdgeStats table I don't have all tags combination user can search by (because it would be too many), so I need to find more clever solution.
Here is list of few possible solutions, all imperfect:
1) Try an exact match for all subsets of the user's search term: if the user searches by tags [A, C, D], first try querying for [A, C, D]; if there is no exact match, try [C, D], [A, D], [A, C], and voilà, we have a match!
2) Use the <@ operator:
.where(
"edge_stats.tags <@ :tags",
{
tags: ["A", "C", "D"],
}
)
This returns all rows whose tags are a subset of [A, C, D], so rows 4, 5, 7 and 11. It would then be possible to filter out all but the largest subset match in code. But with this approach we couldn't use SUM and similar functions, and returning too many rows is not good practice.
3) An approach built on 2) and inspired by this answer:
.where(
"edge_stats.tags <@ :tags",
{
tags: ["A", "C", "D"],
}
)
.addOrderBy("edge.id")
.addOrderBy("CARDINALITY(edge_stats.tags)", "DESC")
.distinctOn(["edge.id"]);
For every edge, this finds all rows whose tags are a subset of [A, C, D] and keeps the largest match (the longest array), thanks to ordering by cardinality and selecting only one row per edge.
So the returned rows are indeed 4 and 11.
This approach is great, but when I use it as one filtering step of a much larger query, I need to add a bunch of groupBy statements, and it adds a bit more complexity than I would like.
I wonder if there could be a simpler approach which simply gets the highest match between the array in the table's column and the array in the query argument?
Your approach #3 should be fine, especially if you have an index on CARDINALITY(edge_stats.tags). However,
DataTeam populated EdgeStats table with all permutations of tags observed per given edge
If you're using a pre-aggregation approach instead of running your queries on the raw data, I would recommend also recording the "tags observed per given edge" in the Edges table (as the observed_tags array column used below).
That way, you can run:
SELECT s.edge_id, s.tags, s.metric_amount
FROM "EdgeStats" s
JOIN "Edges" e ON s.edge_id = e.id
WHERE s.tags = array_intersect(e.observed_tags, $1)
using the array_intersect function from here.
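As a rough illustration of the lookup that query performs, here is a minimal Python sketch over the sample rows from the question. The observed_tags mapping stands in for the suggested new column on the Edges table, and best_match is a hypothetical helper name, not part of TypeORM or PostgreSQL:

```python
# Sample EdgeStats rows from the question: (id, edge_id, tags, metric_amount).
EDGE_STATS = [
    (1, 1, ["A", "B", "C"], 7),
    (2, 1, ["A", "B"], 7),
    (3, 1, ["B", "C"], 5),
    (4, 1, ["A", "C"], 6),
    (5, 1, ["A"], 5),
    (6, 1, ["B"], 4),
    (7, 1, ["C"], 4),
    (8, 1, None, 7),
    (9, 2, ["B", "C"], 3),
    (10, 2, ["B"], 2),
    (11, 2, ["C"], 3),
    (12, 2, None, 3),
]

# Assumption: tags observed per edge, as they would be stored on the Edges table.
OBSERVED_TAGS = {1: ["A", "B", "C"], 2: ["B", "C"]}


def best_match(edge_stats, observed_tags, search_tags):
    """For each edge, return the row whose tags equal (observed tags ∩ searched tags)."""
    wanted = set(search_tags)
    results = []
    for edge_id, observed in observed_tags.items():
        target = set(observed) & wanted
        for row in edge_stats:
            row_id, row_edge, tags, amount = row
            # Rows with NULL tags (the per-edge totals) never match.
            if row_edge == edge_id and tags is not None and set(tags) == target:
                results.append(row)
    return results


matches = best_match(EDGE_STATS, OBSERVED_TAGS, ["A", "C", "D"])
for row in matches:
    print(row)
```

For the search [A, C, D] this picks row 4 for edge 1 and row 11 for edge 2, matching the expected result in the question.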

Pyspark loop over dataframe and decrement column value

I need help with looping row by row over a PySpark dataframe:
E.g.:
df1
+---+-----+
| id|value|
+---+-----+
|  a|  100|
|  b|  100|
|  c|  100|
+---+-----+
I need to loop and decrease the value based on another dataframe:
df2
+---+-----+----------------+
| id|value|       timestamp|
+---+-----+----------------+
|  a|   20|2020-01-02 01:30|
|  a|   50|2020-01-02 05:30|
|  b|   50|2020-01-15 07:30|
|  b|   80|2020-02-01 09:30|
|  c|   50|2020-02-01 09:30|
+---+-----+----------------+
Expected output, based on a UDF or function:
customFunction(df1(row_n))
df1
+---+-----+
| id|value|
+---+-----+
|  a|   30|  (100 - 20), then (80 - 50)
|  b|   50|  (100 - 50), then skip (50 - 80) since lhs < rhs
|  c|   50|  (100 - 50)
+---+-----+
How do I achieve this in PySpark? Also, the dataframes will have > 50k rows.
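You can achieve this by joining both dataframes and using groupBy to aggregate the values from df2, then comparing the original value against the aggregate.
Combining DataFrames
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Assumption: `sql` used below is the active SparkSession
sql = SparkSession.builder.getOrCreate()
input_str1 = """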
|a|100
|b|100
|c|100
""".split("|")
input_values1 = list(map(lambda x:x.strip(),input_str1))[1:]
input_list1 = [(x,y) for x,y in zip(input_values1[0::2],input_values1[1::2])]
sparkDF1 = sql.createDataFrame(input_list1,['id','value'])
input_str2 = """
|a|20 |2020-01-02 01:30
|a|50 |2020-01-02 05:30
|b|50 |2020-01-15 07:30
|b|80 |2020-02-01 09:30
|c|50 |2020-02-01 09:30
""".split("|")
input_values2 = list(map(lambda x:x.strip(),input_str2))[1:]
input_list2 = [(x,y,z) for x,y,z in zip(input_values2[0::3]
,input_values2[1::3],input_values2[2::3])]
sparkDF2 = sql.createDataFrame(input_list2,['id','value','timestamp'])
finalDF = (sparkDF1.join(sparkDF2
,sparkDF1['id'] == sparkDF2['id']
,'inner'
).select(sparkDF1["*"],sparkDF2['value'].alias('value_right')))
finalDF.show()
+---+-----+-----------+
| id|value|value_right|
+---+-----+-----------+
| c| 100| 50|
| b| 100| 50|
| b| 100| 80|
| a| 100| 20|
| a| 100| 50|
+---+-----+-----------+
GroupBy
agg_lst = [
F.first(F.col('value')).alias('value')
,F.sum(F.col('value_right')).alias('sum_value_right')
,F.min(F.col('value_right')).alias('min_value_right')
]
finalDF = finalDF.groupBy('id').agg(*agg_lst).orderBy('id')
finalDF = finalDF.withColumn('final_value'
,F.when(F.col('value') > F.col('sum_value_right')
,F.col('value') - F.col('sum_value_right'))\
.otherwise(F.col('value') - F.col('min_value_right'))
)
finalDF.show()
+---+-----+---------------+---------------+-----------+
| id|value|sum_value_right|min_value_right|final_value|
+---+-----+---------------+---------------+-----------+
| a| 100| 70.0| 20| 30.0|
| b| 100| 130.0| 50| 50.0|
| c| 100| 50.0| 50| 50.0|
+---+-----+---------------+---------------+-----------+
Note: if the above logic does not work on your entire dataset, implementing a UDF with your custom logic, along with groupBy, would be the ideal solution.
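If the rule really is sequential (apply each df2 row in timestamp order and skip a row whenever it is larger than what is left), it can help to have a plain-Python reference to compare results against. The sketch below is only that: apply_decrements is a hypothetical name, the inputs are plain dicts and tuples rather than DataFrames, and it assumes that a decrement equal to the remaining value is still applied.

```python
from collections import defaultdict


def apply_decrements(start_values, decrements):
    """Apply decrements per id in timestamp order, skipping any that exceed the remaining value."""
    by_id = defaultdict(list)
    for row_id, amount, timestamp in decrements:
        by_id[row_id].append((timestamp, amount))

    result = {}
    for row_id, value in start_values.items():
        remaining = value
        # "YYYY-MM-DD HH:MM" strings sort chronologically as plain text.
        for _, amount in sorted(by_id[row_id]):
            if remaining >= amount:
                remaining -= amount
        result[row_id] = remaining
    return result


df1 = {"a": 100, "b": 100, "c": 100}
df2 = [
    ("a", 20, "2020-01-02 01:30"),
    ("a", 50, "2020-01-02 05:30"),
    ("b", 50, "2020-01-15 07:30"),
    ("b", 80, "2020-02-01 09:30"),
    ("c", 50, "2020-02-01 09:30"),
]

final_values = apply_decrements(df1, df2)
print(final_values)
```

On the sample data this gives a=30, b=50, c=50. For a starting value of 100 with decrements 60, 50, 30 it gives 10, whereas the sum/min expression above would give 70, so the two only agree on some inputs.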

Search for a value in multiple columns and return all results that are in the right column of the found column

Search for the value "thisisit" in column A, then look in that row for every "de" column and return whatever is in the column to its right. If there are multiple results, return them in one cell, separated by line breaks.
A1 | B1 | C1 | D1 | E1 | F1 | G1
thisisit | de | Bicycle | en | Car | de | Boot
A3
Bicycle (line break)
Boot
try:
=TEXTJOIN(CHAR(10), 1, QUERY({TRANSPOSE(FILTER(
INDIRECT(MATCH("thisisit", A1:A)&":"&MATCH("thisisit", A1:A)),
MOD(COLUMN(1:1), 2)=0)), {QUERY(TRANSPOSE(FILTER(
INDIRECT(MATCH("thisisit", A1:A)&":"&MATCH("thisisit", A1:A)),
MOD(COLUMN(1:1)-1, 2)=0)), "offset 1", 0); ""}},
"select Col2 where Col1 = 'de'", 0))

How to define range in importxml formula in Google Sheets

In my Google Sheet, I have a URL in A1 and other data in B1, C1, D1, E1, etc. I want to use the IMPORTXML formula, but I want it to work automatically. In B2 I tried this:
=ARRAYFORMULA( IMPORTXML($A1&$B1:$CZ1, "//suggestion/@data"))
so that it does the rest of the job automatically: fetch data for A1 and B1 in B2, for A1 and C1 in C2, for A1 and D1 in D2, and so on. Is this possible? Right now I have to enter the formula manually in every column. I hope you understand what I mean.
Column A | Column B | Column C
=================================
site.com | Name 1 | Name 2
| |
| |
| |
Column A | Column B | Column C
=================================
site.com | Name 1 | Name 2
| =ARRAYFORMULA( IMPORTXML($A1&$B1:$CZ1, "//name/@data"))
| |
| |
IMPORTXML is not supported in ARRAYFORMULA the way you expect. Your options are:
paste this in B2 and drag the formula to the right:
=IMPORTXML($A1&B1, "//suggestion/@data")
paste in B2 the hardcoded formula for all columns:
={IMPORTXML(A1&B1, "//suggestion/@data"),
IMPORTXML(A1&C1, "//suggestion/@data"),
IMPORTXML(A1&D1, "//suggestion/@data")}
or scripted solution:
function onOpen() {
var sheet = SpreadsheetApp.getActiveSheet();
var length = sheet.getRange("B2:2").getValues().filter(String).length;
sheet.getRange("C3:3").clearContent();
sheet.getRange("C3:3" + length).setFormula(sheet.getRange("B3").getFormula());
sheet.getRange("C4:1000").clearContent();
}
where cell B2 is:
=ARRAYFORMULA(A1&B1:D1)
and cell B3 is:
=ARRAYFORMULA(IFERROR(IMPORTXML(B2:2, "//suggestion/@data")))

Normalizing a table by rank

I have a table which looks like this:
+------+---------+
| Rank | Account |
+------+---------+
| 1    | A1      |
| 2    | A2      |
| 3    | A3      |
| 1    | A4      |
| 2    | A5      |
| 1    | A6      |
+------+---------+
In it, all the accounts which are linked together are ranked; this was done in a previous stored procedure. I now want to break this table down so that whenever the rank goes back to 1, the preceding rows are written into a new table. For example, accounts A1, A2, A3 are written into one separate table, as are accounts A4, A5, and account A6. How would I do this?
The eventual goal is to write each table into a CSV file. Thanks!
In order to do this, you will need to add a third column to your table. This column would need to either be:
A grouping column that shows which Accounts are linked together, so that it will have the same value for A1, A2, A3. And another common value for A4 & A5, and another value for A6. Or
An ordinal column, such as an IDENTITY column, that you can use to tell in what order the Accounts were inserted, so that you can use that to figure out which items are grouped together.
Without a third column, there is NO WAY to know that A2 goes with A1 and not A4 or A6.
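If you do have the ordinal column from the second option, the split itself is simple. A minimal Python sketch, assuming each row is an (ordinal, rank, account) tuple and that a rank of 1 always starts a new group; split_on_rank_reset and to_csv are hypothetical helper names:

```python
import csv
import io


def split_on_rank_reset(rows):
    """Group (ordinal, rank, account) rows; a new group starts whenever rank is 1."""
    groups = []
    # Sorting by the ordinal restores insertion order before grouping.
    for _, rank, account in sorted(rows):
        if rank == 1 or not groups:
            groups.append([])
        groups[-1].append((rank, account))
    return groups


def to_csv(group):
    """Render one group as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Rank", "Account"])
    writer.writerows(group)
    return buffer.getvalue()


rows = [
    (1, 1, "A1"),
    (2, 2, "A2"),
    (3, 3, "A3"),
    (4, 1, "A4"),
    (5, 2, "A5"),
    (6, 1, "A6"),
]

groups = split_on_rank_reset(rows)
for group in groups:
    print(to_csv(group))
```

Each printed block could be written to its own file instead. Without the ordinal there is nothing to sort on, which is the point made above.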
