Iterating through tables with map function and function that queries other dataframes - loops

I have two tables: Table A
|Group ID | User ids in group|
| -------- | -------------- |
| 11 | [45,46,47,48] |
| 20 | [49,10,11,12] |
| 31 | [55,7,48,43] |
and Table B:
| User ids| Related Id |
| ------- | -------------- |
| 1 | [5,6,7,8] |
| 2 | [6, 9, 10,11] |
| 3 | [1, 2, 5, 7] |
And I have a reference table that has the info: Reference table:
| User ids | Group ID |
| -------- | -------------- |
| 1 | 11 |
| 2 | 20 |
| 3 | 31 |
This is just a minimal sample; I have this situation with millions of rows in each table. I am trying to use PySpark (or SQL, but I haven't figured out a way to do it there) to iterate through the User ids column in the reference table and get the intersection between the lists in User ids in group from Table A and Related Id from Table B.
So in the end, I would like to have a table of the form:
| User ids | Intersection |
| -------- | -------------- |
| 2 | [10, 11] |
| 3 | [7] |
In PySpark I'd have a function of the form:
def test_function(user_id, ref_df, tableB_df, tableA_df):
    group_id = int(ref_df.filter(ref_df.userID == user_id).collect()[0][1])
    group_list = tableA_df.filter(tableA_df.groupID == group_id)
    related_id_list = tableB_df.filter(tableB_df.userID == user_id)
    return group_list.intersection(related_id_list)

abc = ref_df.rdd.map(lambda x: test_function(x, ref_df, tableB_df, tableA_df))
However, when I run this function I am getting the following error:
An error was encountered:
Could not serialize object: TypeError: can't pickle _thread.RLock objects
Can anyone suggest how to solve this, or how to modify my approach? Since my tables have millions of rows, I want to use PySpark's parallelization abilities as much as possible. Thanks for all your help.

First join the reference table with Table A on Group ID, then join the resulting table with Table B on User ids. This will give you a dataframe that looks like this:
+--------+--------+-----------------+--------------+
|User ids|Group ID|User ids in group| Related Id|
+--------+--------+-----------------+--------------+
| 1| 11| [45, 46, 47, 48]| [5, 6, 7, 8]|
| 2| 20| [49, 10, 11, 12]|[6, 9, 10, 11]|
| 3| 31| [55, 7, 48, 43]| [1, 2, 5, 7]|
+--------+--------+-----------------+--------------+
Then compute the intersection of the columns User ids in group and Related Id. This gives you the column you want; you just need to filter out rows where the intersection is empty.
The code snippet below does all of that in pyspark:
import pyspark.sql.functions as F

# Init example tables
table_a = spark.createDataFrame(
    [(11, [45, 46, 47, 48]), (20, [49, 10, 11, 12]), (31, [55, 7, 48, 43])],
    ["Group ID", "User ids in group"],
)
table_b = spark.createDataFrame(
    [(1, [5, 6, 7, 8]), (2, [6, 9, 10, 11]), (3, [1, 2, 5, 7])],
    ["User ids", "Related Id"],
)
reference_table = spark.createDataFrame(
    [(1, 11), (2, 20), (3, 31)], ["User ids", "Group ID"]
)

# Relevant code
joined_df = reference_table.join(table_a, on="Group ID").join(table_b, on="User ids")
intersected_df = joined_df.withColumn("Intersection", F.array_intersect("User ids in group", "Related Id"))
intersected_df.select("User ids", "Intersection").filter(F.size("Intersection") > 0).show()
output:
+--------+------------+
|User ids|Intersection|
+--------+------------+
| 2| [10, 11]|
| 3| [7]|
+--------+------------+
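Since you also mentioned SQL: once the dataframes are registered as temp views, the same join-plus-intersect logic runs as a Spark SQL query (a sketch, assuming Spark 2.4+ for array_intersect; the view names are just illustrative):
# Expose the dataframes to Spark SQL (the view names here are illustrative)
table_a.createOrReplaceTempView("table_a")
table_b.createOrReplaceTempView("table_b")
reference_table.createOrReplaceTempView("reference_table")

spark.sql("""
    SELECT r.`User ids`,
           array_intersect(a.`User ids in group`, b.`Related Id`) AS Intersection
    FROM reference_table r
    JOIN table_a a ON r.`Group ID` = a.`Group ID`
    JOIN table_b b ON r.`User ids` = b.`User ids`
    WHERE size(array_intersect(a.`User ids in group`, b.`Related Id`)) > 0
""").show()
Either way, the joins let Spark parallelize the work across executors instead of collecting rows on the driver, which is what the rdd.map over DataFrames was attempting and why it failed to pickle.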

Related

PostgreSQL query to iterate through array and perform product calculation on each cell

I have a cell with an array of integers (column created using the syntax: some_arrays integer[])
| id | some_arrays    |
|----|----------------|
| 0  | 10, 15, 20, 25 |
| 1  | 1, 3, 5, 7     |
| 2  | 33, 674, 2, 4  |
For any given row, I want to iterate through each array and for each element return the product of all the elements in the array other than itself.
ex:
some_arrays[0]
-> [10, 15, 20, 25]
-> [15 * 20 * 25, 10 * 20 * 25, 10 * 15 * 25, 10 * 15 * 20]
-> [7500, 5000, 3750, 3000]
In Python we could do something like:
def get_prod(array, n):
    prod = 1
    # calculate the product of all elements in the array
    for i in range(n):
        prod *= array[i]
    # replace each element by the product divided by that element
    for i in range(n):
        array[i] = prod // array[i]

array = [10, 15, 20, 25]
n = len(array)
get_prod(array, n)
for i in range(n):
    print(array[i], end=" ")
In PostgreSQL, what's the correct query to select a cell, iterate through each element, and obtain the product of the other elements?
Summing the elements, or doubling each element and displaying the result in a new column, is possible:
SELECT id, (SELECT SUM(a) FROM unnest(some_arrays) AS a) AS total FROM this_table;

SELECT some_arrays,
       ARRAY(SELECT x * 2
             FROM unnest(some_arrays) AS x) AS doubles
FROM this_table;
This yields:
| some_arrays    | doubles        |
|----------------|----------------|
| 10, 15, 20, 25 | 20, 30, 40, 50 |
| 1, 3, 5, 7     | 2, 6, 10, 14   |
| 33, 674, 2, 4  | 66, 1348, 4, 8 |
(Using Postgres 11.10)
Thank you for any tips on writing a comprehensive for loop!
Explicit looping would be one way to approach this task, but I prefer to use relational calculus. PostgreSQL doesn't know how to do a product aggregate, but you can teach it:
create aggregate product(integer)
    (stype = bigint, sfunc = int84mul, initcond = 1);
Here int84mul is a built-in function that multiplies an 8-byte integer by a 4-byte integer. It's not described in the documentation, but it is visible via introspection, e.g. with the psql command \dfS+ *mul*
Then:
select cell.*, product(a)
from cell,
     lateral unnest(some_arrays) as a
group by 1, 2
order by 1;
will get you the result:
id | some_arrays | product
----+---------------+---------
0 | {10,15,20,25} | 75000
1 | {1,3,5,7} | 105
2 | {33,674,2,4} | 177936
This is close. If you then divide the product by each element:
with b as (
    select cell.*, product(a)
    from cell,
         lateral unnest(some_arrays) as a
    group by 1, 2
)
select id, some_arrays,
       ARRAY(SELECT product / x FROM unnest(some_arrays) AS x) AS exclusive_products
from b
order by 1;
you get the result:
id | some_arrays | exclusive_products
----+---------------+------------------------
0 | {10,15,20,25} | {7500,5000,3750,3000}
1 | {1,3,5,7} | {105,35,21,15}
2 | {33,674,2,4} | {5392,264,88968,44484}
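One caveat: dividing the total product by each element errors out if an array contains a zero. If zeros can occur, here is a sketch that multiplies everything except the current element instead, still using the product aggregate defined above and WITH ORDINALITY to identify the current element:
-- for each element position i, multiply all elements at the other positions
select id, some_arrays,
       ARRAY(select (select product(y)
                     from unnest(some_arrays) with ordinality as t(y, j)
                     where j <> i)
             from unnest(some_arrays) with ordinality as s(x, i)) as exclusive_products
from cell
order by id;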

Append column to an array in a pyspark dataframe

I have a Dataframe containing 2 columns
+------+--------+
| VPN  | UPC    |
+------+--------+
| 1    | [4, 2] |
| 2    | [1, 2] |
| null | [4, 7] |
+------+--------+
I need a result column with the values of VPN (string) appended to the array UPC. The result should look something like this:
+-----------+
| result    |
+-----------+
| [4, 2, 1] |
| [1, 2, 2] |
| [4, 7,]   |
+-----------+
One option is to use concat + array. First use array to convert the VPN column to an array type, then concatenate the two array columns with the concat method:
df = spark.createDataFrame([(1, [4, 2]), (2, [1, 2]), (None, [4, 7])], ['VPN', 'UPC'])
df.show()
+----+------+
| VPN| UPC|
+----+------+
| 1|[4, 2]|
| 2|[1, 2]|
|null|[4, 7]|
+----+------+
df.selectExpr('concat(UPC, array(VPN)) as result').show()
+---------+
| result|
+---------+
|[4, 2, 1]|
|[1, 2, 2]|
| [4, 7,]|
+---------+
Or more pythonic:
from pyspark.sql.functions import array, concat
df.select(concat('UPC', array('VPN')).alias('result')).show()
+---------+
| result|
+---------+
|[4, 2, 1]|
|[1, 2, 2]|
| [4, 7,]|
+---------+
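If you happen to be on Spark 3.4 or later, array_append may be even more direct (a sketch; double-check how your version handles the null VPN row, since the null handling could differ from the concat approach):
from pyspark.sql.functions import array_append, col

# append the VPN value as the last element of the UPC array
df.select(array_append('UPC', col('VPN')).alias('result')).show()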

Postgres Insert based on JSON data

Trying to do an insert based on whether a value already exists or not in a JSON blob in a Postgres table.
| a | b | c | d | metadata                     |
|---|---|---|---|------------------------------|
| 1 | 2 | 3 | 4 | {"other-key": 1}             |
| 2 | 1 | 4 | 4 | {"key": 99}                  |
| 3 | 1 | 4 | 4 | {"key": 99, "other-key": 33} |
Currently I'm trying to use something like this:
INSERT INTO mytable (a, b, c, d, metadata)
SELECT :a, :b, :c, :d, :metadata::JSONB
WHERE
(:metadata->>'key'::TEXT IS NULL
OR :metadata->>'key'::TEXT NOT IN (SELECT :metadata->>'key'::TEXT
FROM mytable));
But keep getting an ERROR: operator does not exist: unknown ->> boolean
Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.
It was just a simple casting issue:
INSERT INTO mytable (a, b, c, d, metadata)
SELECT :a, :b, :c, :d, :metadata::JSONB
WHERE
((:metadata::JSONB->>'key') is NULL or
:metadata::JSONB->>'key' NOT IN (
SELECT metadata->>'key' FROM mytable
WHERE
metadata->>'key' = :metadata::JSONB->>'key'));
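For what it's worth, the root cause is that the bare :metadata parameter arrives as untyped text, so it needs the explicit ::JSONB cast before ->> can be applied to it. With that in place, the check can also be written as a single NOT EXISTS (a sketch, assuming the same named parameters):
INSERT INTO mytable (a, b, c, d, metadata)
SELECT :a, :b, :c, :d, :metadata::JSONB
WHERE NOT EXISTS (
    -- skip the insert only when some existing row already has the same 'key'
    SELECT 1 FROM mytable
    WHERE metadata->>'key' = :metadata::JSONB->>'key'
);
A null 'key' compares as unknown, so NOT EXISTS stays true and the insert still happens, matching the IS NULL branch of the original.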

Querying array using list in postgreSQL

I have a table in PostgreSQL like this, which uses a one-dimensional integer array column:
+----+--------+-----------+
| id | name | array |
+----+--------+-----------+
| 1 | apple | {1, 2, 3} |
| 2 | mango | {2, 3, 4} |
| 3 | banana | {4, 5, 6} |
+----+--------+-----------+
and I want to do a query to find every row whose array column contains at least one number from my list of numbers. For example, search each array to see if it contains 2 or 4.
Is there any better solution than using a query like this?
SELECT * FROM table WHERE 2 = ANY (array) OR 4 = ANY (array)
You can try the following, using the array overlap operator &&:
SELECT *
FROM yourTable
WHERE '{2,4}' && array
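The && operator is true when the two arrays share at least one element, and on a large table it can be backed by a GIN index. A sketch, keeping the names from the question (note that array is a reserved word in Postgres, so a column literally named that has to be quoted):
-- a GIN index on the array column accelerates && / @> / <@ lookups
CREATE INDEX idx_yourtable_array ON yourTable USING GIN ("array");

SELECT *
FROM yourTable
WHERE "array" && '{2,4}';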

Does this require recursive CTE, just creative window functions, a loop?

I cannot for the life of me figure out how to get a weighted ranking for scores across X categories. For example, the student needs to answer 10 questions across 3 categories (both the number of questions and the number of categories will be variable eventually). To get a total score, the top score in each of the X (3) categories is added to whatever is left to add up to 10 total question scores.
Here is the data. I used a CASE WHEN with ROW_NUMBER() to get the TopInCat column.
http://sqlfiddle.com/#!6/e6e9f/1
The fiddle has more students.
| Question | Student | Category | Score | TopInCat |
|----------|---------|----------|-------|----------|
| 120149 | 125 | 6 | 1 | 1 |
| 120127 | 125 | 6 | 0.9 | 0 |
| 120124 | 125 | 6 | 0.8 | 0 |
| 120125 | 125 | 6 | 0.7 | 0 |
| 120130 | 125 | 6 | 0.6 | 0 |
| 120166 | 125 | 6 | 0.5 | 0 |
| 120161 | 125 | 6 | 0.4 | 0 |
| 120138 | 125 | 4 | 0.15 | 1 |
| 120069 | 125 | 4 | 0.15 | 0 |
| 120022 | 125 | 4 | 0.15 | 0 |
| 120002 | 125 | 4 | 0.15 | 0 |
| 120068 | 125 | 2 | 0.01 | 1 |
| 120050 | 125 | 3 | 0.05 | 1 |
| 120139 | 125 | 2 | 0 | 0 |
| 120156 | 125 | 2 | 0 | 0 |
This is how I envision it needs to look, but it doesn't have to be exactly this. I just need the 10-questions-by-3-categories detail data in a way that would allow me to sum and average the Sort 1-10 column below. The 999s could be null or whatever, as long as I can sum what's important and present the details.
| Question | Student | Category | Score | TopInCat | Sort |
|----------|---------|----------|-------|----------|------|
| 120149 | 125 | 6 | 1 | 1 | 1 |
| 120138 | 125 | 4 | 0.15 | 1 | 2 |
| 120068 | 125 | 2 | 0.01 | 1 | 3 |
| 120127 | 125 | 6 | 0.9 | 0 | 4 |
| 120124 | 125 | 6 | 0.8 | 0 | 5 |
| 120125 | 125 | 6 | 0.7 | 0 | 6 |
| 120130 | 125 | 6 | 0.6 | 0 | 7 |
| 120166 | 125 | 6 | 0.5 | 0 | 8 |
| 120161 | 125 | 6 | 0.4 | 0 | 9 |
| 120069 | 125 | 4 | 0.15 | 0 | 10 |
| 120022 | 125 | 4 | 0.15 | 0 | 999 |
| 120002 | 125 | 4 | 0.15 | 0 | 999 |
| 120050 | 125 | 3 | 0.05 | 1 | 999 |
| 120139 | 125 | 2 | 0 | 0 | 999 |
| 120156 | 125 | 2 | 0 | 0 | 999 |
One last thing, the category no longer matters once the X (3) threshold is met. So a 4th category would just sort normally.
| Question | Student | Category | Score | TopInCat | Sort |
|----------|---------|----------|-------|----------|------|
| 120149 | 126 | 6 | 1 | 1 | 1 |
| 120138 | 126 | 4 | 0.75 | 1 | 2 |
| 120068 | 126 | 2 | 0.50 | 1 | 3 |
| 120127 | 126 | 6 | 0.9 | 0 | 4 |
| 120124 | 126 | 6 | 0.8 | 0 | 5 |
| 120125 | 126 | 6 | 0.7 | 0 | 6 |
| 120130 | 126 | 6 | 0.6 | 0 | 7 |
| 120166 | 126 | 6 | 0.5 | 0 | 8 |
| 120050 | 126 | 3 | 0.45 | 1 | 9 |********
| 120161 | 126 | 6 | 0.4 | 0 | 10 |
| 120069 | 126 | 4 | 0.15 | 0 | 999 |
| 120022 | 126 | 4 | 0.15 | 0 | 999 |
| 120002 | 126 | 4 | 0.15 | 0 | 999 |
| 120139 | 126 | 2 | 0 | 0 | 999 |
| 120156 | 126 | 2 | 0 | 0 | 999 |
I really appreciate any help. Been banging my head on this for a few days.
With such matters I like to proceed with a 'building blocks' approach. Following the maxim of "first make it work, then make it fast if you need to", this first step is often enough.
So, given
CREATE TABLE WeightedScores
([Question] int, [Student] int, [Category] int, [Score] dec(3,2))
;
and your sample data
INSERT INTO WeightedScores
([Question], [Student], [Category], [Score])
VALUES
(120161, 123, 6, 1), (120166, 123, 6, 0.64), (120138, 123, 4, 0.57), (120069, 123, 4, 0.5),
(120068, 123, 2, 0.33), (120022, 123, 4, 0.18), (120061, 123, 6, 0), (120002, 123, 4, 0),
(120124, 123, 6, 0), (120125, 123, 6, 0), (120137, 123, 6, 0), (120154, 123, 6, 0),
(120155, 123, 6, 0), (120156, 123, 6, 0), (120139, 124, 2, 1), (120156, 124, 2, 1),
(120050, 124, 3, 0.88), (120068, 124, 2, 0.87), (120161, 124, 6, 0.87), (120138, 124, 4, 0.85),
(120069, 124, 4, 0.51), (120166, 124, 6, 0.5), (120022, 124, 4, 0.43), (120002, 124, 4, 0),
(120130, 124, 6, 0), (120125, 124, 6, 0), (120124, 124, 6, 0), (120127, 124, 6, 0),
(120149, 124, 6, 0), (120149, 125, 6, 1), (120127, 125, 6, 0.9), (120124, 125, 6, 0.8),
(120125, 125, 6, 0.7), (120130, 125, 6, 0.6), (120166, 125, 6, 0.5), (120161, 125, 6, 0.4),
(120138, 125, 4, 0.15), (120069, 125, 4, 0.15), (120022, 125, 4, 0.15), (120002, 125, 4, 0.15),
(120068, 125, 2, 0.01), (120050, 125, 3, 0.05), (120139, 125, 2, 0), (120156, 125, 2, 0),
(120149, 126, 6, 1), (120138, 126, 4, 0.75), (120068, 126, 2, 0.50), (120127, 126, 6, 0.9),
(120124, 126, 6, 0.8), (120125, 126, 6, 0.7), (120130, 126, 6, 0.6), (120166, 126, 6, 0.5),
(120050, 126, 3, 0.45), (120161, 126, 6, 0.4), (120069, 126, 4, 0.15), (120022, 126, 4, 0.15),
(120002, 126, 4, 0.15), (120139, 126, 2, 0), (120156, 126, 2, 0)
;
let's proceed.
The complicated part here is identifying the top three top-in-category questions; the others of the ten questions of interest per student are simply sorted by score, which is easy. So let's start with identifying the top three top-in-category questions.
First, assign to each row a row number giving the ordering of that score within the category, for the student:
;WITH Numbered1 ( Question, Student, Category, Score, SeqInStudentCategory ) AS
(
SELECT Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student, Category ORDER BY Score DESC) SeqInStudentCategory
FROM WeightedScores
)
Now we are only interested in rows where SeqInStudentCategory is 1. Considering only such rows, let's order them by score within student, and number those rows:
-- within the preceding WITH
, Numbered2 ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT
Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student ORDER BY Score DESC) SeqInStudent
FROM
Numbered1
WHERE
SeqInStudentCategory = 1
)
Now we are only interested in rows where SeqInStudent is at most 3. Let's pull them out, so that we know to include them (and exclude them from the simple sort by score that we will use to make up the remaining seven rows):
-- within the preceding WITH
, TopInCat ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT Question, Student, Category, Score, SeqInStudent FROM Numbered2 WHERE SeqInStudent <= 3
)
Now we have the three top-in-category questions for each student. We now need to identify and order by score the not top-in-category questions for each student:
-- within the preceding WITH
, NotTopInCat ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT
Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student ORDER BY Score DESC) SeqInStudent
FROM
WeightedScores WS
WHERE
NOT EXISTS ( SELECT 1 FROM TopInCat T WHERE T.Question = WS.Question AND T.Student = WS.Student )
)
Finally we combine TopInCat with NotTopInCat, applying an appropriate offset and restriction to NotTopInCat.SeqInStudent - we need to add 3 to the raw value, and take the top 7 (which is 10 - 3):
-- within the preceding WITH
, Combined ( Question, Student, Category, Score, CombinedSeq ) AS
(
SELECT
Question, Student, Category, Score, SeqInStudent AS CombinedSeq
FROM
TopInCat
UNION
SELECT
Question, Student, Category, Score, SeqInStudent + 3 AS CombinedSeq
FROM
NotTopInCat
WHERE
SeqInStudent <= 10 - 3
)
To get our final results:
SELECT * FROM Combined ORDER BY Student, CombinedSeq
;
You can see the results on sqlfiddle.
Note that here I have assumed that every student will always have answers from at least three categories. Also, the final output doesn't have a TopInCat column, but hopefully you will see how to regain that if you want it.
Also, "(both # of questions and # of categories will be variable eventually)" should be relatively straightforward to deal with here. But watch out for my assumption that (in this case) 3 categories will definitely be present in the answers of each student.
