Spark dataframe create explode with order - arrays

I have data like below
Input Df
+----------+---------------+--------+--------+--------+
| SALES_NO | SALE_LINE_NUM | CODE_1 | CODE_3 | CODE_2 |
+----------+---------------+--------+--------+--------+
| 123      | 1             | ABC    | E456   | GHF989 |
| 123      | 2             | EDF    | EFHJ   | WAEWA  |
| 234      | 1             | 2345   | 985E   | AWW    |
| 234      | 2             | WERWE  |        |        |
| 234      | 3             | ERC    | AERER  |        |
| 456      | 1             | WER    | AWER   |        |
+----------+---------------+--------+--------+--------+
The output should be built like this: for each unique SALES_NO and SALE_LINE_NUM, create one new row per code column whose code is not null, together with an order value for that code.
For code_1, order will be 1.
For code_2, order will be 2.
Output df
SALES_NO SALES_LINE_NUM CODE ORDER
123 1 ABC 1
123 1 E456 2
123 1 GHF989 3
123 2 EDF 1
123 2 EFHJ 2
123 2 WAEWA 3
234 1 2345 1
234 1 985E 2
234 1 AWW 3
234 2 WERWE 1
234 3 ERC 1
234 3 AERER 2
456 1 WER 1
456 1 AWER 2
Can anyone please help? Thanks in advance

For this dataset:
var ds = spark.sparkContext.parallelize(Seq(
(123, 1, "ABC", "E456", "GHF989"),
(123, 2, "EDF", "EFHJ", "WAEWA"),
(234, 1, "2345", "985E", "AWW"),
(234, 2, "WERWE", "", ""),
(234, 3, "ERC", "AERER", ""),
(456, 1, "WER", "AWER", ""),
)).toDF("SALES_NO", "SALE_LINE_NUM", "CODE_1", "CODE_3", "CODE_2")
We need to unpivot through stack as below:
ds = ds.selectExpr(
"SALES_NO",
"SALE_LINE_NUM",
"stack(3, CODE_1, '1', CODE_2, '2', CODE_3, '3') as (CODE, ORDER)"
)
Which gives the following (first 10 rows shown); rows with an empty CODE are still present and need to be filtered out separately:
+--------+-------------+------+-----+
|SALES_NO|SALE_LINE_NUM|CODE |ORDER|
+--------+-------------+------+-----+
|123 |1 |ABC |1 |
|123 |1 |GHF989|2 |
|123 |1 |E456 |3 |
|123 |2 |EDF |1 |
|123 |2 |WAEWA |2 |
|123 |2 |EFHJ |3 |
|234 |1 |2345 |1 |
|234 |1 |AWW |2 |
|234 |1 |985E |3 |
|234 |2 |WERWE |1 |
+--------+-------------+------+-----+
More about unpivoting can be found here.
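The desired output in the question also drops empty codes and numbers the remaining ones in column order (CODE_1, CODE_3, CODE_2). As a rough illustration of that row logic only, here is a plain-Python sketch. `unpivot_codes` is a hypothetical helper, not a Spark API, and the tuples simply mirror the sample data above.

```python
# Sample rows in the question's column order: SALES_NO, SALE_LINE_NUM, CODE_1, CODE_3, CODE_2
rows = [
    (123, 1, "ABC", "E456", "GHF989"),
    (123, 2, "EDF", "EFHJ", "WAEWA"),
    (234, 1, "2345", "985E", "AWW"),
    (234, 2, "WERWE", "", ""),
    (234, 3, "ERC", "AERER", ""),
    (456, 1, "WER", "AWER", ""),
]


def unpivot_codes(rows):
    """Hypothetical helper: one output row per non-empty code, numbered in column order."""
    out = []
    for sales_no, line_num, *codes in rows:
        order = 0
        for code in codes:
            if code:  # skip empty strings (and None)
                order += 1
                out.append((sales_no, line_num, code, order))
    return out


result = unpivot_codes(rows)
for row in result:
    print(row)
```

In the sample data the empty codes are always the trailing ones, so numbering only the non-empty codes and numbering by column position give the same ORDER values here.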
Good luck!

Related

PostgreSQL filter entity with intermediate table

I would like to create a query that filters entities through an intermediate table.
Like this:
FIRST_TABLE
---------------------
|A_ID | TITLE |
|--------------------
|1 | TEST1 |
|2 | TEST2 |
|3 | TEST3 |
|4 | TEST4 |
---------------------
SECOND_TABLE
---------------------
|B_ID | NAME |
|--------------------
|1 | NAME1 |
|2 | NAME2 |
|3 | NAME3 |
|4 | NAME4 |
---------------------
INTERMEDIATE_TABLE
-----------------
|A_FK | B_FK|
|----------------
|2 | 1 |
|2 | 2 |
|2 | 3 |
|3 | 1 |
-----------------
QUERY
SELECT * FROM FIRST_TABLE ft
JOIN INTERMEDIATE_TABLE it
ON ft.A_ID = it.A_FK
WHERE it.B_FK = 1
AND it.B_FK = 2
Then it should only show the entity 2 from first_table because this entity has a relation with NAME1 and NAME2.
How can I make this work?
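A single joined row can never have B_FK equal to both 1 and 2, which is why the query above returns nothing. One common pattern is to keep the rows matching either value, group per entity, and require both values to be present. The sketch below runs that idea through SQLite via Python's `sqlite3` as a stand-in for PostgreSQL; the statement uses only standard SQL, but it has not been run against PostgreSQL here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE first_table (a_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE intermediate_table (a_fk INTEGER, b_fk INTEGER);
    INSERT INTO first_table VALUES (1, 'TEST1'), (2, 'TEST2'), (3, 'TEST3'), (4, 'TEST4');
    INSERT INTO intermediate_table VALUES (2, 1), (2, 2), (2, 3), (3, 1);
""")

# Keep links to B 1 or B 2, then require that an entity has both of them.
query = """
    SELECT ft.a_id, ft.title
    FROM first_table ft
    JOIN intermediate_table it ON ft.a_id = it.a_fk
    WHERE it.b_fk IN (1, 2)
    GROUP BY ft.a_id, ft.title
    HAVING COUNT(DISTINCT it.b_fk) = 2
"""
matches = conn.execute(query).fetchall()
print(matches)
```

Entity 3 is linked only to B 1, so its distinct count is 1 and it is filtered out by the HAVING clause.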

SQL Move the data from different rows with same ID in same rows but different column

How to move the data from different rows with same ID in same rows but different column?
For example,
I have this table
tblEx
--------------
|ID|Buy |Sell|
|--+----+----|
|1 |10 | |
|1 | |11 |
|2 |20 | |
|2 | |0 |
|3 |0 | |
|3 | |30 |
--------------
Desired Output:
--------------
|ID|Buy |Sell|
|--+----+----|
|1 |10 |11 |
|2 |20 |0 |
|3 |0 |30 |
--------------
Based on the given example and desired result, you can use MAX():
SELECT ID, MAX(Buy) AS Buy, MAX(Sell) AS Sell
FROM tblEx
GROUP BY ID
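This works because MAX() ignores NULLs, so each group keeps its single non-NULL Buy and Sell value. A small check of that, assuming the blank cells in tblEx are NULL (not empty strings) and using SQLite through Python's `sqlite3` as a stand-in database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblEx (ID INTEGER, Buy INTEGER, Sell INTEGER);
    INSERT INTO tblEx VALUES
        (1, 10, NULL), (1, NULL, 11),
        (2, 20, NULL), (2, NULL, 0),
        (3, 0, NULL),  (3, NULL, 30);
""")

# MAX() skips NULLs, so each ID collapses to its one Buy and one Sell value.
merged = conn.execute("""
    SELECT ID, MAX(Buy) AS Buy, MAX(Sell) AS Sell
    FROM tblEx
    GROUP BY ID
    ORDER BY ID
""").fetchall()
print(merged)
```

If an ID could have several non-NULL Buy or Sell rows, MAX() would silently pick the largest one, so this relies on there being at most one value per column per ID.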

Rank by 2 different levels of partitioning/grouping

I have this set of data using Microsoft SQL Server Management Studio
Category|pet name| date |food price|vet expenses|vat
A | jack |2017-08-28| 12.98 | 2424 |23
A | jack |2017-08-29| 2339 | 2424 |23
A | smithy |2017-08-28| 22.35 | 2324 |12
A | smithy |2017-08-29| 123.35 | 2432 |23
B | casio |2017-08-28| 11.38 | 44324 |32
B | casio |2017-08-29| 2.24 | 3232 |43
B | lala |2017-08-28| 343.36 | 42342 |54
B | lala |2017-08-29| 34.69 | 22432 |54
C | blue |2017-08-28| 223.02 | 534654 |78
C | blue |2017-08-29| 321.01 | 6654 |67
C | collie |2017-08-28| 232.05 | 4765 |43
C | collie |2017-08-29| 233.03 | 4654 |65
What I want to do is rank by food price, then by vet expenses, then by vat. Each ranking should be grouped by category and ordered by category, pet name, date.
I'm thinking this will be a join statement for the table above?
Something exactly like below:
Category|pet name| date |food price|vet expenses|vat|Rankfp|Rankve|Rankvat
A | jack |2017-08-28| 12.98 | 2424 |23 | 2 | 1 |1
A | jack |2017-08-29| 2339 | 2424 |23 | 1 | 2 |1
A | smithy |2017-08-28| 22.35 | 2324 |12 | 1 | 2 |2
A | smithy |2017-08-29| 123.35 | 2432 |22 | 2 | 1 |2
B | casio |2017-08-28| 11.38 | 44324 |32 | 2 | 1 |2
B | casio |2017-08-29| 2.24 | 3232 |43 | 2 | 2 |2
B | lala |2017-08-28| 343.36 | 42342 |54 | 1 | 2 |1
B | lala |2017-08-29| 34.69 | 22432 |54 | 1 | 1 |1
C | blue |2017-08-28| 223.02 | 534654 |78 | 2 | 1 |1
C | blue |2017-08-29| 321.01 | 6654 |67 | 1 | 1 |1
C | collie |2017-08-28| 232.05 | 4765 |43 | 1 | 2 |2
C | collie |2017-08-29| 233.03 | 4654 |65 | 2 | 2 |2
NB: this is not needed in the final output, but to make it more readable I have ordered the outcome by category, date, pet name:
Category|pet name| date |food price|vet expenses|vat|Rankfp|Rankve|Rankvat
A | jack |2017-08-28| 12.98 | 2424 |23 | 2 | 1 |1
A | smithy |2017-08-28| 22.35 | 2324 |12 | 1 | 2 |2
A | jack |2017-08-29| 2339 | 2424 |23 | 1 | 2 |1
A | smithy |2017-08-29| 123.35 | 2432 |22 | 2 | 1 |2
B | casio |2017-08-28| 11.38 | 44324 |32 | 2 | 1 |2
B | lala |2017-08-28| 343.36 | 42342 |54 | 1 | 2 |1
B | casio |2017-08-29| 2.24 | 3232 |43 | 2 | 2 |2
B | lala |2017-08-29| 34.69 | 22432 |54 | 1 | 1 |1
C | blue |2017-08-28| 223.02 | 534654 |78 | 2 | 1 |1
C | collie |2017-08-28| 232.05 | 4765 |43 | 1 | 2 |2
C | blue |2017-08-29| 321.01 | 6654 |67 | 1 | 1 |1
C | collie |2017-08-29| 233.03 | 4654 |65 | 2 | 2 |2
The code I have below only ranks by category, but does not group by food price, vet expenses and vat.
RANK ()OVER(PARTITION BY [Category], [Date] order by [Category] ,[Pet Name],[Date]) as 'Rank'
Would it be a case of grouping the costs separately then left joining the rankings on to the original data?
(I will be using pivots and slicers in excel so want to have all the data on one table/query)
After walking away for some time to refresh my brain, I had a eureka moment and solved this. It was actually easy when I thought about it.
The code to get the desired table goes something like this:
select *
, rank ()OVER(PARTITION BY [Category], [date] order by [food price], [Category] ,[pet name],[date]) as 'Rankfp'
, rank ()OVER(PARTITION BY [Category], [date] order by [vet expenses], [Category] ,[pet name], [date]) as 'Rankve'
, rank ()OVER(PARTITION BY [Category], [date] order by [vat], [Category] ,[pet name], [date]) as 'Rankvat'
from petcost
order by [category], [pet name]
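To see what the PARTITION BY [Category], [date] window does, here is a small sketch on the category A and B rows, run in SQLite via Python's `sqlite3` rather than SQL Server (window functions need SQLite 3.25 or newer). The underscore column names and the `rank_by_food_price` helper are my own assumptions for the example. Note that the default ascending order ranks the cheapest row first, while the Rankfp column in the question matches a descending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE petcost (category TEXT, pet_name TEXT, date TEXT, food_price REAL);
    INSERT INTO petcost VALUES
        ('A', 'jack',   '2017-08-28', 12.98),  ('A', 'jack',   '2017-08-29', 2339),
        ('A', 'smithy', '2017-08-28', 22.35),  ('A', 'smithy', '2017-08-29', 123.35),
        ('B', 'casio',  '2017-08-28', 11.38),  ('B', 'casio',  '2017-08-29', 2.24),
        ('B', 'lala',   '2017-08-28', 343.36), ('B', 'lala',   '2017-08-29', 34.69);
""")


def rank_by_food_price(direction):
    """Rank food_price within each (category, date); direction must be 'ASC' or 'DESC'."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    query = f"""
        SELECT category, pet_name, date,
               RANK() OVER (PARTITION BY category, date
                            ORDER BY food_price {direction}) AS rankfp
        FROM petcost
        ORDER BY category, pet_name, date
    """
    return conn.execute(query).fetchall()


asc = rank_by_food_price("ASC")
desc = rank_by_food_price("DESC")
for row in desc:
    print(row)
```

The same pattern repeats for vet expenses and vat: one RANK() per cost column, each with its own ORDER BY inside the same partition.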

How to join two tables using a third table when NULLS are involved

I have two time card tables that I need to join. The two tables should be joined by the week ID, and employee resource code if applicable. However, with the exception of one week, the data that the two tables contain is from different time frames (i.e. in most cases there will not be matching data in both tables).
The first table (dt5) has that week’s ID, the employee's resource code, that employee's capacity for that week, and their actual hours worked for that week.
dt5:
+---------------+---------------+----------+---------------+
| id | Resource_code | capacity | time_reported |
+---------------+---------------+----------+---------------+
| 1 | 555 | 40 | 40 |
| 1 | 333 | 25 | 20 |
| 2 | 555 | 40 | 40 |
| 2 | 333 | 25 | 20 |
| 3 | 555 | 40 | 40 |
| 3 | 333 | 25 | 20 |
| 4 | 555 | 40 | 39 |
| 4 | 333 | 25 | 24 |
+---------------+---------------+----------+---------------+
The second table (dt4) has the week’s ID, the employee's resource code, and the employee's planned hours for that week.
dt4:
+---------------+---------------+---------------+
| id | Resource_code | planned_hours |
+---------------+---------------+---------------+
| 4 | 555 | 30 |
| 4 | 333 | 20 |
| 5 | 555 | 30 |
| 5 | 333 | 20 |
| 6 | 555 | 30 |
| 6 | 333 | 20 |
+---------------+---------------+---------------+
When an employee completes their time card, the planned hours data is removed; before this occurs, there is a short period of time when the data overlaps (when both tables have data for the same period, like period 4 in my example tables). Because the two tables will only have one period in common at any given time, I am using a third table (gtd) that contains each week's ID to help join them.
gtd:
+----+------------+----------+
| id | start_date | end_date |
+----+------------+----------+
| 1 | 10 | 20 |
| 2 | 30 | 40 |
| 3 | 50 | 60 |
| 4 | 70 | 80 |
| 5 | 90 | 100 |
| 6 | 110 | 120 |
| 7 | 130 | 140 |
| 8 | 150 | 160 |
| 9 | 170 | 180 |
| 10 | 190 | 200 |
+----+------------+----------+
dates changed to integers in this example for simplification
My result should look like this:
Note that the week 4 rows contain data from both dt4 and dt5 (capacity, time reported, planned hours), because week 4 is the only overlapping week.
+----+---------------+----------+---------------+---------------+---------------+
| id | Resource_code | capacity | time_reported | Resource_code | planned_hours |
+----+---------------+----------+---------------+---------------+---------------+
| 1 | 555 | 40 | 40 | NULL | NULL |
| 1 | 333 | 25 | 20 | NULL | NULL |
| 2 | 555 | 40 | 40 | NULL | NULL |
| 2 | 333 | 25 | 20 | NULL | NULL |
| 3 | 555 | 40 | 40 | NULL | NULL |
| 3 | 333 | 25 | 20 | NULL | NULL |
| 4 | 555 | 40 | 39 | 555 | 30 |
| 4 | 333 | 25 | 24 | 333 | 20 |
| 5 | NULL | NULL | NULL | 555 | 30 |
| 5 | NULL | NULL | NULL | 333 | 20 |
| 6 | NULL | NULL | NULL | 555 | 30 |
| 6 | NULL | NULL | NULL | 333 | 20 |
| 7 | NULL | NULL | NULL | NULL | NULL |
| 8 | NULL | NULL | NULL | NULL | NULL |
| 9 | NULL | NULL | NULL | NULL | NULL |
| 10 | NULL | NULL | NULL | NULL | NULL |
+----+---------------+----------+---------------+---------------+---------------+
Here is the SQL I have so far:
SELECT
gtd.id,
dt5.resource_code,
dt5.capacity,
dt5.time_reported,
dt4.resource_code,
dt4.planned_hours
FROM gtd
LEFT JOIN dt5 ON gtd.id = dt5.id
LEFT OUTER JOIN dt4 ON gtd.id = dt4.id
My (incorrect) results are shown below:
The errors are occurring in the week 4 rows. In two of the week 4 rows, the resource code and planned hours information from dt4 does not match up with the resource code from dt5.
+----+---------------+----------+---------------+---------------+---------------+
| id | resource_code | capacity | time_reported | resource_code | planned_hours |
+----+---------------+----------+---------------+---------------+---------------+
| 1 | 555 | 40 | 40 | NULL | NULL |
| 1 | 333 | 25 | 20 | NULL | NULL |
| 2 | 555 | 40 | 40 | NULL | NULL |
| 2 | 333 | 25 | 20 | NULL | NULL |
| 3 | 555 | 40 | 40 | NULL | NULL |
| 3 | 333 | 25 | 20 | NULL | NULL |
| 4 | 555 | 40 | 39 | 555 (Correct) | 30 |
| 4 | 555 | 40 | 39 | 333 (Wrong) | 20 |
| 4 | 333 | 25 | 24 | 555 (Wrong) | 30 |
| 4 | 333 | 25 | 24 | 333 (Correct) | 20 |
| 5 | NULL | NULL | NULL | 555 | 30 |
| 5 | NULL | NULL | NULL | 333 | 20 |
| 6 | NULL | NULL | NULL | 555 | 30 |
| 6 | NULL | NULL | NULL | 333 | 20 |
| 7 | NULL | NULL | NULL | NULL | NULL |
| 8 | NULL | NULL | NULL | NULL | NULL |
| 9 | NULL | NULL | NULL | NULL | NULL |
| 10 | NULL | NULL | NULL | NULL | NULL |
+----+---------------+----------+---------------+---------------+---------------+
Based off my research, I think that I am either incorrectly using JOINS, or that I need a CASE statement somewhere. I’ve also tried joining the tables on resource code, but that eliminated a lot of my data. Any solutions or pointers in the right direction would be much appreciated.
I am using tsql.
*Edited my question to fix inconsistencies with the column names (period_number changed to id)
There's no doubt a simpler and more elegant solution than my answer, but since I'm very tired here's a brute force approach:
Use UNION to mash the two tables together. You'll need to manufacture dummy information that is only present in one table (such as Capacity).
Take the combined table and organise the data using GROUP BY:
SELECT f1.Period, f1.RC, SUM(f1.PlanTime) AS PlanTime, SUM(f1.ActTime) AS ActTime
FROM
(SELECT
dt5.period_number AS 'Period',
dt5.resource_code AS 'RC',
dt5.capacity AS 'ActCap',
0 AS 'PlanTime',
dt5.time_reported AS 'ActTime'
FROM dt5
UNION ALL
SELECT
dt4.period_number AS 'Period',
dt4.resource_code AS 'RC',
0 AS 'ActCap',
dt4.planned_hours AS 'PlanTime',
0 AS 'ActTime'
FROM dt4) AS f1
GROUP BY f1.Period, f1.RC
I do not think you need the gtd table. Please try this and see if it works for you. Please correct me if my understanding of your request is incorrect.
SELECT COALESCE(dt5.period_number, dt4.period_number) AS period_number,
dt5.Resource_code,
dt5.capacity,
dt5.time_reported,
dt4.Resource_code,
dt4.planned_hours
FROM dt5
FULL OUTER JOIN (
SELECT *
FROM dt4 a
WHERE NOT EXISTS (
SELECT 1
FROM dt5 b
WHERE b.period_number = a.period_number
AND b.Resource_code = a.Resource_code
)
) dt4
ON dt5.period_number = dt4.period_number
AND dt4.Resource_code = dt5.Resource_code
ORDER BY COALESCE(dt5.period_number, dt4.period_number) ASC
Test Data
;WITH cte_dt5(period_number,Resource_code,capacity,time_reported) AS
(
SELECT 1, 555, 40, 40 UNION ALL
SELECT 1, 333, 25, 20 UNION ALL
SELECT 2, 555, 40, 40 UNION ALL
SELECT 2, 333, 25, 20 UNION ALL
SELECT 3, 555, 40, 40 UNION ALL
SELECT 3, 333, 25, 20 UNION ALL
SELECT 4, 555, 40, 39 UNION ALL
SELECT 4, 333, 25, 24
)
,cte_dt4 (period_number, Resource_code, planned_hours) AS
(
SELECT 4, 555, 30 UNION ALL
SELECT 4, 333, 20 UNION ALL
SELECT 5, 555, 30 UNION ALL
SELECT 5, 333, 20 UNION ALL
SELECT 6, 555, 30 UNION ALL
SELECT 6, 333, 20
)
SELECT COALESCE(dt5.period_number, dt4.period_number) AS period_number,
dt5.Resource_code,
dt5.capacity,
dt5.time_reported,
dt4.Resource_code,
dt4.planned_hours
FROM cte_dt5 AS dt5
FULL OUTER JOIN (
SELECT *
FROM cte_dt4 a
WHERE NOT EXISTS (
SELECT 1
FROM cte_dt5 b
WHERE b.period_number = a.period_number
AND b.Resource_code = a.Resource_code
)
) dt4
ON dt5.period_number = dt4.period_number
AND dt4.Resource_code = dt5.Resource_code
ORDER BY COALESCE(dt5.period_number, dt4.period_number) ASC
Result
+---------------------------------------------------------------------------------+
|period_number|Resource_code|capacity |time_reported|Resource_code|planned_hours|
+-------------|-------------|-----------|-------------|-------------|-------------+
|1 |555 |40 |40 |NULL |NULL |
|1 |333 |25 |20 |NULL |NULL |
|2 |555 |40 |40 |NULL |NULL |
|2 |333 |25 |20 |NULL |NULL |
|3 |555 |40 |40 |NULL |NULL |
|3 |333 |25 |20 |NULL |NULL |
|4 |555 |40 |39 |NULL |NULL |
|4 |333 |25 |24 |NULL |NULL |
|5 |NULL |NULL |NULL |333 |20 |
|5 |NULL |NULL |NULL |555 |30 |
|6 |NULL |NULL |NULL |333 |20 |
|6 |NULL |NULL |NULL |555 |30 |
+---------------------------------------------------------------------------------+
Code changes as per OP's request below. Commenting out the NOT EXISTS clause gives the desired result.
user7571220: Thank you for your help! Everything is correct except for
the planned hours and resource code (which come from dt4) in week 4. I
am trying to include data from both tables in the week that they
overlap (week 4). I'm essentially trying to get the data for week 4 to
look like the comments I've posted below.
| 4 | 555 | 40 | 39 | 555 | 30 |
| 4 | 333 | 25 | 24 | 333 | 20 |
SELECT COALESCE(dt5.period_number, dt4.period_number) AS period_number,
dt5.Resource_code,
dt5.capacity,
dt5.time_reported,
dt4.Resource_code,
dt4.planned_hours
FROM cte_dt5 AS dt5
FULL OUTER JOIN (
SELECT *
FROM cte_dt4 a
--WHERE NOT EXISTS (
-- SELECT 1
-- FROM cte_dt5 b
-- WHERE b.period_number = a.period_number
-- AND b.Resource_code = a.Resource_code
-- )
) dt4
ON dt5.period_number = dt4.period_number
AND dt4.Resource_code = dt5.Resource_code
ORDER BY COALESCE(dt5.period_number, dt4.period_number) ASC
Result
+---------------------------------------------------------------------------------+
|period_number|Resource_code|capacity |time_reported|Resource_code|planned_hours|
+-------------|-------------|-----------|-------------|-------------|-------------+
|1 |555 |40 |40 |NULL |NULL |
|1 |333 |25 |20 |NULL |NULL |
|2 |555 |40 |40 |NULL |NULL |
|2 |333 |25 |20 |NULL |NULL |
|3 |555 |40 |40 |NULL |NULL |
|3 |333 |25 |20 |NULL |NULL |
|4 |555 |40 |39 |555 |30 |
|4 |333 |25 |24 |333 |20 |
|5 |NULL |NULL |NULL |333 |20 |
|5 |NULL |NULL |NULL |555 |30 |
|6 |NULL |NULL |NULL |333 |20 |
|6 |NULL |NULL |NULL |555 |30 |
+---------------------------------------------------------------------------------+
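The wrong week 4 rows in the question come from joining on id alone, which pairs every dt5 row with every dt4 row of that week. If the empty weeks 7 to 10 from gtd should also appear, one option is to keep gtd as the driving table and match dt5 and dt4 on both id and resource code. The sketch below is an assumption-laden illustration: it uses the question's `id` column name, omits start_date and end_date, introduces a derived table `k` of my own, and runs in SQLite through Python's `sqlite3` rather than T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gtd (id INTEGER);
    CREATE TABLE dt5 (id INTEGER, resource_code INTEGER, capacity INTEGER, time_reported INTEGER);
    CREATE TABLE dt4 (id INTEGER, resource_code INTEGER, planned_hours INTEGER);
    INSERT INTO gtd VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
    INSERT INTO dt5 VALUES
        (1, 555, 40, 40), (1, 333, 25, 20), (2, 555, 40, 40), (2, 333, 25, 20),
        (3, 555, 40, 40), (3, 333, 25, 20), (4, 555, 40, 39), (4, 333, 25, 24);
    INSERT INTO dt4 VALUES
        (4, 555, 30), (4, 333, 20), (5, 555, 30), (5, 333, 20), (6, 555, 30), (6, 333, 20);
""")

# k holds every (week, resource) pair that appears in either table.
# dt5 and dt4 are then attached per pair, so resource codes can no longer cross-match.
query = """
    SELECT g.id, d5.resource_code, d5.capacity, d5.time_reported,
           d4.resource_code, d4.planned_hours
    FROM gtd g
    LEFT JOIN (
        SELECT id, resource_code FROM dt5
        UNION
        SELECT id, resource_code FROM dt4
    ) k ON k.id = g.id
    LEFT JOIN dt5 d5 ON d5.id = k.id AND d5.resource_code = k.resource_code
    LEFT JOIN dt4 d4 ON d4.id = k.id AND d4.resource_code = k.resource_code
    ORDER BY g.id, k.resource_code DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On this sample data the query returns 16 rows: two per week for weeks 1 to 6, with week 4 filled from both tables, and one all-NULL row for each of weeks 7 to 10.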

Select top n records based on ordinal and attribute data

I have a case where I need to show only the top rows based on a setting in a table and the ordinal set.
Example dataset below shows two customers; each of the customers have a different product.
Since NumRowsToShow is "1" I only want to show one row (the top row based on ordinal) for EACH Customer.
| CustomerID | ProductID | Ordinal | NumRowsToShow |
+------------+-----------+---------+---------------+
| 1 |A |1 |1 |
| 1 |B |2 |1 |
| 1 |C |3 |1 |
| 5 |D |1 |1 |
| 5 |E |2 |1 |
| 5 |F |3 |1 |
The result set after query is run should be
| CustomerID | ProductID |
+------------+-----------+
| 1 |A |
| 5 |D |
In the same scenario if NumRowsToShow were 1 for customerID 1 and 2 for CustomerID 5 I would see something like.
| CustomerID | ProductID | Ordinal | NumRowsToShow |
+------------+-----------+---------+---------------+
| 1 |A |1 |1 |
| 1 |B |2 |1 |
| 1 |C |3 |1 |
| 5 |D |1 |2 |
| 5 |E |2 |2 |
| 5 |F |3 |2 |
The result set after query is run should be
| CustomerID | ProductID |
+------------+-----------+
| 1 |A |
| 5 |D |
| 5 |E |
How can this be done?
Including a screen cap of actual result set with highlights of what I'm trying to filter down to which may be a little helpful.
(source: harpernet.net)
It feels like "cheating in the exams":
SELECT CustomerID, ProductID
FROM tableX
WHERE Ordinal <= NumRowsToShow
If, as comments suggest, the Ordinal can have 10, 20, 30 values and not only 1, ..., n values, then this will work:
SELECT t.CustomerID, t.ProductID
FROM tableX AS t
JOIN tableX AS tt
ON tt.CustomerID = t.CustomerID
AND tt.Ordinal <= t.Ordinal
GROUP BY t.CustomerID
, t.ProductID
, t.NumRowsToShow
HAVING COUNT(*) <= t.NumRowsToShow
or, even better, with ROW_NUMBER():
SELECT CustomerID, ProductID
FROM
( SELECT CustomerID, ProductID, NumRowsToShow
, ROW_NUMBER() OVER( PARTITION BY CustomerID
ORDER BY Ordinal
) AS Rn
FROM tableX
) AS tmp
WHERE Rn <= NumRowsToShow ;
Test in: SQL-Fiddle
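To illustrate why the ROW_NUMBER() version is the more robust one, here is a small sketch with gapped ordinals (10, 20, 30). Those ordinal values are hypothetical, taken from the scenario mentioned above and not from the question's data, and the queries run in SQLite via Python's `sqlite3` (window functions need SQLite 3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableX (CustomerID INTEGER, ProductID TEXT, Ordinal INTEGER, NumRowsToShow INTEGER);
    INSERT INTO tableX VALUES
        (1, 'A', 10, 1), (1, 'B', 20, 1), (1, 'C', 30, 1),
        (5, 'D', 10, 2), (5, 'E', 20, 2), (5, 'F', 30, 2);
""")

# The simple comparison breaks once ordinals are not 1..n.
naive = conn.execute("""
    SELECT CustomerID, ProductID
    FROM tableX
    WHERE Ordinal <= NumRowsToShow
""").fetchall()

# ROW_NUMBER() renumbers each customer's rows 1..n by Ordinal before comparing.
top_n = conn.execute("""
    SELECT CustomerID, ProductID
    FROM (
        SELECT CustomerID, ProductID, NumRowsToShow,
               ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY Ordinal) AS Rn
        FROM tableX
    ) AS tmp
    WHERE Rn <= NumRowsToShow
    ORDER BY CustomerID, Rn
""").fetchall()

print(naive)
print(top_n)
```

With these ordinals the simple comparison returns no rows at all, while the windowed query still returns the first row for customer 1 and the first two for customer 5.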
Your table does not look normalized. The NumRowsToShow column holds duplicate information, and that can lead to update anomalies. This:
| CustomerID | ProductID | Ordinal | NumRowsToShow |
+------------+-----------+---------+---------------+
| 1 |A |1 |1 |
| 1 |B |2 |1 |
| 1 |C |3 |1 |
| 5 |D |1 |2 |
| 5 |E |2 |2 |
| 5 |F |3 |2 |
could be normalized to 2 tables:
| CustomerID | ProductID | Ordinal |
+------------+-----------+---------+
| 1 |A |1 |
| 1 |B |2 |
| 1 |C |3 |
| 5 |D |1 |
| 5 |E |2 |
| 5 |F |3 |
and:
| CustomerID | NumRowsToShow |
+------------+---------------+
| 1 |1 |
| 5 |2 |
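With that split, the top-n query only needs one extra join to pick up the per-customer setting. A minimal sketch, assuming the hypothetical table names `customer_products` and `customer_settings` (the answer does not name the two tables) and using SQLite via Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table names are assumptions made for this example.
    CREATE TABLE customer_products (CustomerID INTEGER, ProductID TEXT, Ordinal INTEGER);
    CREATE TABLE customer_settings (CustomerID INTEGER PRIMARY KEY, NumRowsToShow INTEGER);
    INSERT INTO customer_products VALUES
        (1, 'A', 1), (1, 'B', 2), (1, 'C', 3),
        (5, 'D', 1), (5, 'E', 2), (5, 'F', 3);
    INSERT INTO customer_settings VALUES (1, 1), (5, 2);
""")

# Number each customer's products by Ordinal, then compare against that customer's setting.
shown = conn.execute("""
    SELECT p.CustomerID, p.ProductID
    FROM (
        SELECT CustomerID, ProductID,
               ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY Ordinal) AS Rn
        FROM customer_products
    ) AS p
    JOIN customer_settings AS s ON s.CustomerID = p.CustomerID
    WHERE p.Rn <= s.NumRowsToShow
    ORDER BY p.CustomerID, p.Rn
""").fetchall()
print(shown)
```

Changing how many rows a customer sees then means updating a single row in the settings table.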

Resources