I have a cell with an array of integers (the column was created using the syntax: some_arrays integer[]):
| id | some_arrays |
|----+-----------------|
| 0 | 10, 15, 20, 25 |
|----+-----------------|
| 1 | 1, 3, 5, 7 |
|----+-----------------|
| 2 | 33, 674, 2, 4 |
|----+-----------------|
For any given row, I want to iterate through each array and for each element return the product of all the elements in the array other than itself.
ex:
some_arrays[0]
-> [10, 15, 20, 25]
-> [15 * 20 * 25, 10 * 20 * 25, 10 * 15 * 25, 10 * 15 * 20]
-> [7500, 5000, 3750, 3000]
In Python we could do something like:
def get_prod(array, n):
    prod = 1
    # calculate prod all elements in array
    for i in range(n):
        prod *= array[i]
    # replace each element by the product divided by that element
    for i in range(n):
        array[i] = prod // array[i]

array = [10, 15, 20, 25]
n = len(array)
get_prod(array, n)
for i in range(n):
    print(array[i], end=" ")
In PostgreSQL, what's the correct query to select a cell, iterate through each element, and obtain the product of the other elements?
Summing the elements of each cell, or doubling each element and displaying the result in a new column, is possible:
SELECT id, (SELECT SUM(a) FROM unnest(some_arrays) AS a) AS total FROM this_table;
SELECT some_arrays,
       ARRAY(SELECT x * 2
             FROM unnest(some_arrays) AS x) AS doubled
FROM this_table;
The second query yields:
| some_arrays | doubled |
|-----------------+-----------------|
| 10, 15, 20, 25 | 20, 30, 40, 50 |
|-----------------+-----------------|
| 1, 3, 5, 7 | 2, 6, 10, 14 |
|-----------------+-----------------|
| 33, 674, 2, 4 | 66, 1348, 4, 8 |
|-----------------+-----------------|
(Using Postgres 11.10)
Thank you for any tips on writing a comprehensive for loop!
Explicit looping would be one way to approach this task, but I prefer to use relational calculus.
PostgreSQL doesn't know how to do a product aggregate, but you can teach it:
create aggregate product(integer)
(stype=bigint,sfunc=int84mul,initcond=1);
Here int84mul is a built-in function that multiplies an 8-byte integer by a 4-byte integer. It's not described in the documentation, but it is visible via introspection, e.g. with
the psql command \dfS+ *mul*
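If it helps to think about the aggregate in Python terms, it can be read as a fold: sfunc is the step function applied to the running state and each input value, and initcond is the starting state. A minimal sketch of that idea (product_aggregate is a made-up name for illustration, not a PostgreSQL function):

```python
from functools import reduce
from operator import mul


def product_aggregate(values, initcond=1):
    """Fold the values into one product, starting from initcond."""
    # reduce() stands in for the aggregate machinery: it passes the running
    # state and the next value to the step function (mul, in place of int84mul).
    return reduce(mul, values, initcond)


print(product_aggregate([10, 15, 20, 25]))  # 75000
```

Python integers are unbounded, so this sketch says nothing about the range limits of the bigint state in the real aggregate.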
Then use the new aggregate:
select cell.*, product(a)
from cell,
lateral unnest(some_arrays) as a group by 1,2 order by 1;
This will get you the result:
id | some_arrays | product
----+---------------+---------
0 | {10,15,20,25} | 75000
1 | {1,3,5,7} | 105
2 | {33,674,2,4} | 177936
This is close.
If you then divide the product by each element of the cell:
with b as (
select cell.*, product(a)
from cell,
lateral unnest(some_arrays) as a group by 1,2 )
select id,some_arrays,
ARRAY(SELECT product / x FROM unnest(some_arrays) AS x)
AS exclusive_products from b order by 1;
you get the result:
id | some_arrays | exclusive_products
----+---------------+------------------------
0 | {10,15,20,25} | {7500,5000,3750,3000}
1 | {1,3,5,7} | {105,35,21,15}
2 | {33,674,2,4} | {5392,264,88968,44484}
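One caveat with the divide-by-each-element approach (in the Python from the question as well as in the query above): it fails with a division-by-zero error as soon as an array contains a 0. A minimal sketch of an alternative that avoids division by multiplying running prefix and suffix products, written in Python to match the question's example (products_except_self is a name I made up):

```python
def products_except_self(values):
    """For each position, return the product of all the other elements."""
    n = len(values)
    result = [1] * n

    # First pass: result[i] holds the product of everything left of i.
    prefix = 1
    for i in range(n):
        result[i] = prefix
        prefix *= values[i]

    # Second pass: multiply in the product of everything right of i.
    suffix = 1
    for i in range(n - 1, -1, -1):
        result[i] *= suffix
        suffix *= values[i]

    return result


print(products_except_self([10, 15, 20, 25]))  # [7500, 5000, 3750, 3000]
print(products_except_self([0, 2, 3]))         # [6, 0, 0]
```

If the arrays are known never to contain zeros, the aggregate-and-divide query is simpler and does the job.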
I have a DataFrame containing two columns:
| VPN | UPC |
+--------+-----------------+
| 1 | [4,2] |
| 2 | [1,2] |
| null | [4,7] |
I need a result column with the values of VPN (string) appended to the array UPC. The result should look something like this:
| result |
+--------+
| [4,2,1]|
| [1,2,2]|
| [4,7,] |
One option is to use concat + array. First use array to wrap the VPN column in a single-element array, then concatenate the two array columns with concat:
df = spark.createDataFrame([(1, [4, 2]), (2, [1, 2]), (None, [4, 7])], ['VPN', 'UPC'])
df.show()
+----+------+
| VPN| UPC|
+----+------+
| 1|[4, 2]|
| 2|[1, 2]|
|null|[4, 7]|
+----+------+
df.selectExpr('concat(UPC, array(VPN)) as result').show()
+---------+
| result|
+---------+
|[4, 2, 1]|
|[1, 2, 2]|
| [4, 7,]|
+---------+
Or, more Pythonic, with the DataFrame functions instead of an expression string:
from pyspark.sql.functions import array, concat
df.select(concat('UPC', array('VPN')).alias('result')).show()
+---------+
| result|
+---------+
|[4, 2, 1]|
|[1, 2, 2]|
| [4, 7,]|
+---------+
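Note the trailing comma in the last row: the null VPN appears to be kept as a null element of the array rather than dropped. Here is a plain-Python model of what happens per row, in case you want a different treatment of nulls. This is not Spark code; append_scalar and keep_null are hypothetical names, and returning None for a null UPC array is my assumption about how concat treats a null input array.

```python
def append_scalar(upc, vpn, keep_null=True):
    """Model concat(UPC, array(VPN)) for a single row, with None as null."""
    if upc is None:
        # Assumption: a null array input makes the whole result null.
        return None
    if vpn is None and not keep_null:
        # Optional variant: leave the array unchanged instead of appending null.
        return list(upc)
    return list(upc) + [vpn]


rows = [(1, [4, 2]), (2, [1, 2]), (None, [4, 7])]
print([append_scalar(upc, vpn) for vpn, upc in rows])
# [[4, 2, 1], [1, 2, 2], [4, 7, None]]
```

With keep_null=False the last row would come out as [4, 7], which may be closer to what is wanted if the null carries no meaning.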
I cannot for the life of me figure out how to get a weighted ranking for scores across X categories. For example, the student needs to answer 10 questions across 3 categories (both # of questions and # of categories will be variable eventually). To get a total score the top 1 score in each of the X (3) categories will be added to whatever is left to add up to 10 total question scores.
Here is the data. I used a CASE WHEN Row_Number() to get the TopInCat
http://sqlfiddle.com/#!6/e6e9f/1
The fiddle has more students.
| Question | Student | Category | Score | TopInCat |
|----------|---------|----------|-------|----------|
| 120149 | 125 | 6 | 1 | 1 |
| 120127 | 125 | 6 | 0.9 | 0 |
| 120124 | 125 | 6 | 0.8 | 0 |
| 120125 | 125 | 6 | 0.7 | 0 |
| 120130 | 125 | 6 | 0.6 | 0 |
| 120166 | 125 | 6 | 0.5 | 0 |
| 120161 | 125 | 6 | 0.4 | 0 |
| 120138 | 125 | 4 | 0.15 | 1 |
| 120069 | 125 | 4 | 0.15 | 0 |
| 120022 | 125 | 4 | 0.15 | 0 |
| 120002 | 125 | 4 | 0.15 | 0 |
| 120068 | 125 | 2 | 0.01 | 1 |
| 120050 | 125 | 3 | 0.05 | 1 |
| 120139 | 125 | 2 | 0 | 0 |
| 120156 | 125 | 2 | 0 | 0 |
This is how I envision it needs to look, but it doesn't have to be exactly this. I just need the detail data for 10 questions across 3 categories in a form that lets me sum and average the rows with Sort 1-10 below. The 999s could be null or whatever, as long as I can sum what's important and present the details.
| Question | Student | Category | Score | TopInCat | Sort |
|----------|---------|----------|-------|----------|------|
| 120149 | 125 | 6 | 1 | 1 | 1 |
| 120138 | 125 | 4 | 0.15 | 1 | 2 |
| 120068 | 125 | 2 | 0.01 | 1 | 3 |
| 120127 | 125 | 6 | 0.9 | 0 | 4 |
| 120124 | 125 | 6 | 0.8 | 0 | 5 |
| 120125 | 125 | 6 | 0.7 | 0 | 6 |
| 120130 | 125 | 6 | 0.6 | 0 | 7 |
| 120166 | 125 | 6 | 0.5 | 0 | 8 |
| 120161 | 125 | 6 | 0.4 | 0 | 9 |
| 120069 | 125 | 4 | 0.15 | 0 | 10 |
| 120022 | 125 | 4 | 0.15 | 0 | 999 |
| 120002 | 125 | 4 | 0.15 | 0 | 999 |
| 120050 | 125 | 3 | 0.05 | 1 | 999 |
| 120139 | 125 | 2 | 0 | 0 | 999 |
| 120156 | 125 | 2 | 0 | 0 | 999 |
One last thing, the category no longer matters once the X (3) threshold is met. So a 4th category would just sort normally.
| Question | Student | Category | Score | TopInCat | Sort |
|----------|---------|----------|-------|----------|------|
| 120149 | 126 | 6 | 1 | 1 | 1 |
| 120138 | 126 | 4 | 0.75 | 1 | 2 |
| 120068 | 126 | 2 | 0.50 | 1 | 3 |
| 120127 | 126 | 6 | 0.9 | 0 | 4 |
| 120124 | 126 | 6 | 0.8 | 0 | 5 |
| 120125 | 126 | 6 | 0.7 | 0 | 6 |
| 120130 | 126 | 6 | 0.6 | 0 | 7 |
| 120166 | 126 | 6 | 0.5 | 0 | 8 |
| 120050 | 126 | 3 | 0.45 | 1 | 9 |********
| 120161 | 126 | 6 | 0.4 | 0 | 10 |
| 120069 | 126 | 4 | 0.15 | 0 | 999 |
| 120022 | 126 | 4 | 0.15 | 0 | 999 |
| 120002 | 126 | 4 | 0.15 | 0 | 999 |
| 120139 | 126 | 2 | 0 | 0 | 999 |
| 120156 | 126 | 2 | 0 | 0 | 999 |
I really appreciate any help. Been banging my head on this for a few days.
With such matters I like to proceed with a 'building blocks' approach. Following the maxim "first make it work, then, if you need to, make it fast", this first step is often enough.
So, given
CREATE TABLE WeightedScores
([Question] int, [Student] int, [Category] int, [Score] dec(3,2))
;
and your sample data
INSERT INTO WeightedScores
([Question], [Student], [Category], [Score])
VALUES
(120161, 123, 6, 1), (120166, 123, 6, 0.64), (120138, 123, 4, 0.57), (120069, 123, 4, 0.5),
(120068, 123, 2, 0.33), (120022, 123, 4, 0.18), (120061, 123, 6, 0), (120002, 123, 4, 0),
(120124, 123, 6, 0), (120125, 123, 6, 0), (120137, 123, 6, 0), (120154, 123, 6, 0),
(120155, 123, 6, 0), (120156, 123, 6, 0), (120139, 124, 2, 1), (120156, 124, 2, 1),
(120050, 124, 3, 0.88), (120068, 124, 2, 0.87), (120161, 124, 6, 0.87), (120138, 124, 4, 0.85),
(120069, 124, 4, 0.51), (120166, 124, 6, 0.5), (120022, 124, 4, 0.43), (120002, 124, 4, 0),
(120130, 124, 6, 0), (120125, 124, 6, 0), (120124, 124, 6, 0), (120127, 124, 6, 0),
(120149, 124, 6, 0), (120149, 125, 6, 1), (120127, 125, 6, 0.9), (120124, 125, 6, 0.8),
(120125, 125, 6, 0.7), (120130, 125, 6, 0.6), (120166, 125, 6, 0.5), (120161, 125, 6, 0.4),
(120138, 125, 4, 0.15), (120069, 125, 4, 0.15), (120022, 125, 4, 0.15), (120002, 125, 4, 0.15),
(120068, 125, 2, 0.01), (120050, 125, 3, 0.05), (120139, 125, 2, 0), (120156, 125, 2, 0),
(120149, 126, 6, 1), (120138, 126, 4, 0.75), (120068, 126, 2, 0.50), (120127, 126, 6, 0.9),
(120124, 126, 6, 0.8), (120125, 126, 6, 0.7), (120130, 126, 6, 0.6), (120166, 126, 6, 0.5),
(120050, 126, 3, 0.45), (120161, 126, 6, 0.4), (120069, 126, 4, 0.15), (120022, 126, 4, 0.15),
(120002, 126, 4, 0.15), (120139, 126, 2, 0), (120156, 126, 2, 0)
;
let's proceed.
The complicated part here is identifying the top three top-in-category questions; the others of the ten questions of interest per student are simply sorted by score, which is easy. So let's start with identifying the top three top-in-category questions.
First, assign to each row a row number giving the ordering of that score within the category, for the student:
;WITH Numbered1 ( Question, Student, Category, Score, SeqInStudentCategory ) AS
(
SELECT Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student, Category ORDER BY Score DESC) SeqInStudentCategory
FROM WeightedScores
)
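If ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) is unfamiliar, here is a rough Python model of what it computes, using a few of student 123's rows from the sample data. It is only a sketch: row_number is a helper name I made up, and ties are broken by input order here, whereas SQL Server makes no promise about the order of tied rows.

```python
from collections import defaultdict


def row_number(rows, partition_key, order_key):
    """Return (row, n) pairs, numbering rows 1..k inside each partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)

    numbered = []
    for part in partitions.values():
        # sorted() is stable, so tied rows keep their input order.
        for n, row in enumerate(sorted(part, key=order_key), start=1):
            numbered.append((row, n))
    return numbered


# (Question, Student, Category, Score)
rows = [
    (120161, 123, 6, 1.0), (120166, 123, 6, 0.64), (120138, 123, 4, 0.57),
    (120069, 123, 4, 0.5), (120068, 123, 2, 0.33),
]

numbered = row_number(
    rows,
    partition_key=lambda r: (r[1], r[2]),  # PARTITION BY Student, Category
    order_key=lambda r: -r[3],             # ORDER BY Score DESC
)
seq_in_student_category = {row[0]: n for row, n in numbered}
print(seq_in_student_category)
```

Questions 120161, 120138 and 120068 each get number 1 here, because each is the best score in its category for that student.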
Now we are only interested in rows where SeqInStudentCategory is 1. Considering only such rows, let's order them by score within student, and number those rows:
-- within the preceding WITH
, Numbered2 ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT
Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student ORDER BY Score DESC) SeqInStudent
FROM
Numbered1
WHERE
SeqInStudentCategory = 1
)
Now we are only interested in rows where SeqInStudent is at most 3. Let's pull them out, so that we know to include them (and exclude them from the simple sort by score that we will use to make up the remaining seven rows):
-- within the preceding WITH
, TopInCat ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT Question, Student, Category, Score, SeqInStudent FROM Numbered2 WHERE SeqInStudent <= 3
)
Now we have the three top-in-category questions for each student. We now need to identify the questions that are not top-in-category for each student, and order them by score:
-- within the preceding WITH
, NotTopInCat ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT
Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student ORDER BY Score DESC) SeqInStudent
FROM
WeightedScores WS
WHERE
NOT EXISTS ( SELECT 1 FROM TopInCat T WHERE T.Question = WS.Question AND T.Student = WS.Student )
)
Finally we combine TopInCat with NotTopInCat, applying an appropriate offset and restriction to NotTopInCat.SeqInStudent - we need to add 3 to the raw value, and take the top 7 (which is 10 - 3):
-- within the preceding WITH
, Combined ( Question, Student, Category, Score, CombinedSeq ) AS
(
SELECT
Question, Student, Category, Score, SeqInStudent AS CombinedSeq
FROM
TopInCat
UNION
SELECT
Question, Student, Category, Score, SeqInStudent + 3 AS CombinedSeq
FROM
NotTopInCat
WHERE
SeqInStudent <= 10 - 3
)
To get our final results:
SELECT * FROM Combined ORDER BY Student, CombinedSeq
;
You can see the results on sqlfiddle.
Note that here I have assumed that every student will always have answers from at least three categories. Also, the final output doesn't have a TopInCat column, but hopefully you will see how to regain that if you want it.
Also, "(both # of questions and # of categories will be variable eventually)" should be relatively straightforward to deal with here. But watch out for my assumption that (in this case) 3 categories will definitely be present in the answers of each student.
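To sanity-check the logic outside the database, the same building blocks can be modelled in a few lines of Python for a single student. This is a sketch, not a translation of the query: rank_for_student and its parameters are names I made up, ties are broken by input order, and (unlike the SQL, which always offsets by 3) it offsets by however many category leaders were actually found, so it also copes with a student who has fewer than three categories.

```python
def rank_for_student(rows, top_categories=3, total_questions=10):
    """rows: (question, category, score) tuples for one student.

    Returns (seq, question, category, score) tuples for the counted questions.
    """
    # Numbered1: the best-scoring row in each category.
    best_per_category = {}
    for row in sorted(rows, key=lambda r: -r[2]):
        best_per_category.setdefault(row[1], row)

    # Numbered2 / TopInCat: the highest-scoring category leaders.
    leaders = sorted(best_per_category.values(), key=lambda r: -r[2])
    top = leaders[:top_categories]

    # NotTopInCat: every other question, by score.
    top_questions = {r[0] for r in top}
    rest = sorted((r for r in rows if r[0] not in top_questions),
                  key=lambda r: -r[2])

    # Combined: leaders first, then fill up to the total with the rest.
    combined = top + rest[:total_questions - len(top)]
    return [(seq, *row) for seq, row in enumerate(combined, start=1)]


# Student 126 from the sample data: (Question, Category, Score)
STUDENT_126 = [
    (120149, 6, 1.0), (120138, 4, 0.75), (120068, 2, 0.50), (120127, 6, 0.9),
    (120124, 6, 0.8), (120125, 6, 0.7), (120130, 6, 0.6), (120166, 6, 0.5),
    (120050, 3, 0.45), (120161, 6, 0.4), (120069, 4, 0.15), (120022, 4, 0.15),
    (120002, 4, 0.15), (120139, 2, 0.0), (120156, 2, 0.0),
]

ranked = rank_for_student(STUDENT_126)
for seq, question, category, score in ranked:
    print(seq, question, category, score)
```

For student 126 this reproduces the Sort 1 to 10 order shown in the question, with the fourth category's leader (120050) landing at position 9 on score alone. Making the number of categories or questions variable is then just a matter of changing the two parameters.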