Convert Spark SQL to Scala using Window function partitioned by aggregate - sql-server

I have the following Spark SQL query:
val subquery =
"( select garment_group_name , prod_name, " +
"row_number() over (partition by garment_group_name order by count(prod_name) desc) as seqnum " +
"from articles a1 " +
"group by garment_group_name, prod_name )"
val query = "SELECT garment_group_name, prod_name " +
"FROM " + subquery +
" WHERE seqnum = 1 "
val query3 = spark.sql(query)
I am trying to do exactly the same thing with the DataFrame API. I wanted to concentrate on the subquery part first, and wrote something like this:
import org.apache.spark.sql.expressions.Window // imports the needed Window object
import org.apache.spark.sql.functions.row_number
val windowSpec = Window.partitionBy("garment_group_name")
articlesDF.withColumn("row_number", row_number.over(windowSpec))
.show()
However I get the following error
org.apache.spark.sql.AnalysisException: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowOrder$$anonfun$apply$33.applyOrElse(Analyzer.scala:2207)......... and so on.
I see that I need to include an orderBy clause, but how can I do this when the ordering depends on a count that comes from grouping by two columns first?
The error message gives the example: SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table, but I do not know how to express this with the DataFrame API and I can't find it online.

The solution is to first compute count("prod_name") over a Window partitioned by both "garment_group_name" and "prod_name", and then order windowSpec by that count column.
Starting with some example data:
import spark.implicits._ // provides toDF; already in scope inside spark-shell
val df = List(
("a", "aa1"), ("a", "aa2"), ("a", "aa3"), ("b", "bb")
)
.toDF("garment_group_name", "prod_name")
df.show(false)
gives:
+------------------+---------+
|garment_group_name|prod_name|
+------------------+---------+
|a                 |aa1      |
|a                 |aa2      |
|a                 |aa3      |
|b                 |bb       |
+------------------+---------+
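and the two window specifications we need:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, row_number}
// counts occurrences of each (garment_group_name, prod_name) pair
val countWindowSpec = Window.partitionBy("garment_group_name", "prod_name")
// ranks products within each garment group by that count, highest first
val windowSpec = Window.partitionBy(col("garment_group_name")).orderBy(col("count").desc)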
We can then use them:
df
// create the `count` column to be used by `windowSpec`
.withColumn("count", count(col("prod_name")).over(countWindowSpec))
.withColumn("seqnum", row_number().over(windowSpec))
// take only the first row of each partition
.filter(col("seqnum") === 1)
// select only the columns we care about
.select("garment_group_name", "prod_name")
.show(false)
which gives:
+------------------+---------+
|garment_group_name|prod_name|
+------------------+---------+
|a                 |aa1      |
|b                 |bb       |
+------------------+---------+
Comparing this to your SQL implementation, using the same df:
df.createOrReplaceTempView("a1")
val subquery =
"( select garment_group_name , prod_name, " +
"row_number() over (partition by garment_group_name order by count(prod_name) desc) as seqnum " +
"from a1 " +
"group by garment_group_name, prod_name )"
val query = "SELECT garment_group_name, prod_name " +
"FROM " + subquery +
" WHERE seqnum = 1 "
spark.sql(query).show(false)
we get the same result of:
+------------------+---------+
|garment_group_name|prod_name|
+------------------+---------+
|a                 |aa1      |
|b                 |bb       |
+------------------+---------+
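To make the logic of both versions easier to follow, here is a minimal plain-Python sketch (not Spark code) of the same steps: count each (garment_group_name, prod_name) pair, rank within each group by descending count, and keep the first. The tie-break on prod_name is an assumption added here to make the result deterministic; row_number() in Spark does not guarantee which tied row comes first.

```python
from collections import Counter

rows = [("a", "aa1"), ("a", "aa2"), ("a", "aa3"), ("b", "bb")]

# Step 1: count occurrences of each (garment_group_name, prod_name) pair.
pair_counts = Counter(rows)

# Step 2: order by group, then count descending, then prod_name (assumed tie-break).
ranked = sorted(pair_counts.items(), key=lambda kv: (kv[0][0], -kv[1], kv[0][1]))

# Step 3: keep the first product seen per group (the seqnum = 1 row).
top_product = {}
for (group, product), _count in ranked:
    top_product.setdefault(group, product)

print(top_product)
```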

Related

Remove duplicate rows from query based on value in one column

I have a table (SQL Server) where I store the run status of all of our automated test cases and I'm trying to come up with an SQL query to retrieve the status per runid, but I'm running into some problems with it.
Example of data within the table:
KRUNID | KTIME | FK_TC_ID | NOPART | STATUS | ENV
-------+----------------+------------+--------+--------+-----
4180-2 | 20190109080000 | TC0001 | 123456 | Passed | INT
4180-2 | 20190109080100 | TC0002 | 123457 | Failed | INT
4180-2 | 20190109080200 | TC0003 | 123458 | Passed | INT
4180-2 | 20190109080400 | TC0002 | 123459 | Passed | INT
Right now, I have this query (the join statements are used to display the actual test case name and business domain):
SELECT KRUNID, TD_NAME, TS_NAME, FK_TC_ID, TC_DISPLAYNAME, NOPARTENAIRE,
ENV, STATUS FROM RU_RUNSTATUS
INNER JOIN TC_TESTCASES ON K_TC_ID = FK_TC_ID
INNER JOIN TS_TCSUBDOMAINS ON K_TS_ID = FK_TS_ID
INNER JOIN TD_TCDOMAINS on K_TD_ID = FK_TD_ID
WHERE KRUNID = '418-2'
ORDER BY FK_TS_ID, K_TC_ID
The query is basic and it works fine except that I will have 2 lines for TC0002 when I only want to have the last one based on KTIME (for various reasons I don't want to filter based on STATUS).
I haven't found the right way to modify my query to get the result I want. How can I do that?
Thanks
I think this question answers it: Get top 1 row of each group
Here is your query limited to one row per test case, partitioned by FK_TC_ID and ordered by KTIME descending:
;WITH cte AS
(
SELECT FK_TS_ID, KRUNID, TD_NAME, TS_NAME, FK_TC_ID , TC_DISPLAYNAME, NOPARTENAIRE,
ENV, STATUS, ROW_NUMBER() OVER (PARTITION BY FK_TC_ID ORDER BY KTIME DESC) AS rn
FROM RU_RUNSTATUS
INNER JOIN TC_TESTCASES ON K_TC_ID = FK_TC_ID
INNER JOIN TS_TCSUBDOMAINS ON K_TS_ID = FK_TS_ID
INNER JOIN TD_TCDOMAINS on K_TD_ID = FK_TD_ID
WHERE KRUNID = '418-2'
)
SELECT *
FROM cte
WHERE rn = 1
ORDER BY FK_TS_ID, FK_TC_ID -- K_TC_ID is not exposed by the CTE; FK_TC_ID holds the same value via the join
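What ROW_NUMBER() ... WHERE rn = 1 does here can be sketched in plain Python, using the four sample rows from the question. This is only an illustration of the "latest row per FK_TC_ID" rule, not a replacement for the query; the tuple layout is an assumption made for the example.

```python
# (KRUNID, KTIME, FK_TC_ID, NOPART, STATUS, ENV), copied from the sample table
runs = [
    ("4180-2", "20190109080000", "TC0001", "123456", "Passed", "INT"),
    ("4180-2", "20190109080100", "TC0002", "123457", "Failed", "INT"),
    ("4180-2", "20190109080200", "TC0003", "123458", "Passed", "INT"),
    ("4180-2", "20190109080400", "TC0002", "123459", "Passed", "INT"),
]

latest = {}
for row in runs:
    test_case, ktime = row[2], row[1]
    # Keep the row with the greatest KTIME per test case (rn = 1 when ordered DESC).
    if test_case not in latest or ktime > latest[test_case][1]:
        latest[test_case] = row

for test_case in sorted(latest):
    print(test_case, latest[test_case][4])
```

KTIME compares correctly as a string here only because every value has the same fixed-width yyyymmddhhmmss layout.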

SQL Server - Each GROUP BY expression must contain at least one column that is not an outer reference

I need to identify all records that have MostRecent=-1, OilWell=-1, plus are duplicate records with the same Api, and join these to get the associated CompanyName.
With the query:
SELECT
BLMAPDCONTACT.CompanyName, APD.Api, APD.ID, APD.MostRecent,
APD.Project_Nu, APD.Unit_Lease, APD.Well_Nu, APD.OilWell
FROM
APD
INNER JOIN
BLMAPDCONTACT ON APD.BLM_APD_Cont = BLMAPDCONTACT.OBJECTID
WHERE
(APD.Api IN (SELECT APD.Api
FROM APD AS Tmp
WHERE APD.MostRecent = -1 AND APD.OilWell = -1
GROUP BY APD.Api
HAVING Count(APD.Api) > 1))
ORDER BY
APD.Api DESC;
I get this error:
Each GROUP BY expression must contain at least one column that is not an outer reference.
This error appeared after I added the JOIN clause; without it, it worked.
Example desired output will match on the following records from the APD table:
APD.Api | APD.MostRecent | APD.OilWell
--------------------------------------
123 | -1 | -1
123 | -1 | -1
And not:
APD.Api | APD.MostRecent | APD.OilWell
--------------------------------------
321 | 0 | -1
321 | -1 | -1
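Did you try this? Your subquery aliases APD as Tmp but still refers to its columns as APD.*, which resolves to the outer query's APD, so the GROUP BY contains only outer references. Qualify the columns with the alias instead: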
SELECT BLMAPDCONTACT.CompanyName, APD.Api, APD.ID, APD.MostRecent, APD.Project_Nu, APD.Unit_Lease, APD.Well_Nu, APD.OilWell
FROM APD INNER JOIN BLMAPDCONTACT ON APD.BLM_APD_Cont = BLMAPDCONTACT.OBJECTID
WHERE (APD.Api IN
(SELECT tmp.Api
FROM APD As Tmp
WHERE tmp.MostRecent=-1 AND tmp.OilWell=-1
GROUP BY tmp.Api HAVING Count(tmp.Api)>1))
ORDER BY APD.Api DESC;
The aliases and table names are confusing me a bit. If you run the following, do you still get the same error?
SELECT b.CompanyName
, a.Api
, a.ID
, a.MostRecent
, a.Project_Nu
, a.Unit_Lease
, a.Well_Nu
, a.OilWell
FROM APD a INNER JOIN BLMAPDCONTACT b
ON a.BLM_APD_Cont = b.OBJECTID
WHERE a.Api IN (
SELECT tmp.Api
FROM APD As Tmp
WHERE tmp.MostRecent = -1 AND tmp.OilWell = -1
GROUP BY tmp.Api
HAVING Count(tmp.Api) > 1
)
ORDER BY a.Api DESC;
Also, just double check that I've translated tables to aliases correctly.
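As a quick check of the aliased form, here is a small sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for SQL Server, the tables are cut down to the columns involved, and the ID values, contact IDs and company names ('Acme', 'Globex') are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE APD (ID INTEGER, Api INTEGER, MostRecent INTEGER,
                      OilWell INTEGER, BLM_APD_Cont INTEGER);
    CREATE TABLE BLMAPDCONTACT (OBJECTID INTEGER, CompanyName TEXT);
    -- Api 123 should match (two rows with -1/-1); Api 321 should not.
    INSERT INTO APD VALUES (1, 123, -1, -1, 10), (2, 123, -1, -1, 10),
                           (3, 321,  0, -1, 20), (4, 321, -1, -1, 20);
    INSERT INTO BLMAPDCONTACT VALUES (10, 'Acme'), (20, 'Globex');
""")

duplicate_rows = conn.execute("""
    SELECT b.CompanyName, a.Api, a.ID
    FROM APD a
    INNER JOIN BLMAPDCONTACT b ON a.BLM_APD_Cont = b.OBJECTID
    WHERE a.Api IN (
        SELECT tmp.Api
        FROM APD AS tmp
        WHERE tmp.MostRecent = -1 AND tmp.OilWell = -1
        GROUP BY tmp.Api
        HAVING COUNT(tmp.Api) > 1
    )
    ORDER BY a.Api DESC, a.ID
""").fetchall()

print(duplicate_rows)
```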

pg removing common part in array

Suppose a PostgreSQL query outputs rows of two columns, each an array of type int[][2]:
track0 track1
{{1,2},{5847,5848},{5845,5846}......} {{1,2},{5847,5848},{10716,10715}........}
{{13,14},{1,2},{5847,5848},{284,285}........} {{13,14},{1,2},{5847,5848},{1284,1285}................}
How can we remove the leading arrays common to both columns except the last one?
In the first row {1,2} should be removed from the two columns.
In the second row {13,14},{1,2} should be removed from the two columns.
Can it be done in SQL, or is it necessary to use PL/pgSQL?
I could manage the PL/pgSQL version but would prefer an SQL solution.
Given your sample, with no gaps in the sequence of equal pairs:
t=# with d(t0,t1) as (values
('{{1,2},{5847,5848},{5845,5846}}'::int[][2],'{{1,2},{5847,5848},{10716,10715}}'::int[][2])
, ('{{13,14},{1,2},{5847,5848},{284,285}}'::int[][2],'{{13,14},{1,2},{5847,5848},{1284,1285}}'::int[][2])
)
, u as (
select *
from d
join unnest(t0) with ordinality t(e0,o0) on true
left outer join unnest(t1) with ordinality t1(e1,o1) on o0=o1
)
, p as (
select *
, case when (
not lead(e0=e1) over w
or not lead(e0=e1,2) over w
or e0!=e1
) AND (o1%2) = 1
then ARRAY[e0,lead(e0) over w] end r0
, case when (
not lead(e0=e1) over w
or not lead(e0=e1,2) over w
or e0!=e1
) AND (o1%2) = 1
then ARRAY[e1,lead(e1) over w] end r1
from u
window w as (partition by t0,t1 order by o0)
)
select t0,t1,array_agg(r0),array_agg(r1)
from p
where r0 is not null or r1 is not null
group by t0,t1
;
t0 | t1 | array_agg | array_agg
---------------------------------------+-----------------------------------------+---------------------------+-----------------------------
{{13,14},{1,2},{5847,5848},{284,285}} | {{13,14},{1,2},{5847,5848},{1284,1285}} | {{5847,5848},{284,285}} | {{5847,5848},{1284,1285}}
{{1,2},{5847,5848},{5845,5846}} | {{1,2},{5847,5848},{10716,10715}} | {{5847,5848},{5845,5846}} | {{5847,5848},{10716,10715}}
(2 rows)
You can skip some of this complexity if you use a trick for unnesting multidimensional arrays; look here and here
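For comparison, the rule the question describes (drop the leading pairs common to both columns, but keep the last common one) is short when written procedurally. The following plain-Python sketch is not SQL and only illustrates the intended result; the function name strip_common_prefix is made up for the example.

```python
def strip_common_prefix(track0, track1):
    """Drop leading pairs shared by both tracks, keeping the last shared pair."""
    common = 0
    for pair0, pair1 in zip(track0, track1):
        if pair0 != pair1:
            break
        common += 1
    drop = max(common - 1, 0)  # keep one common pair; drop nothing if none match
    return track0[drop:], track1[drop:]


row1 = strip_common_prefix(
    [[1, 2], [5847, 5848], [5845, 5846]],
    [[1, 2], [5847, 5848], [10716, 10715]],
)
row2 = strip_common_prefix(
    [[13, 14], [1, 2], [5847, 5848], [284, 285]],
    [[13, 14], [1, 2], [5847, 5848], [1284, 1285]],
)
print(row1)
print(row2)
```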

MSSQL/TSQL Joining against a subquery

I'm analyzing IIS log files from SharePoint and need to match each entry to its SPWeb.
This SQL code works for a single value (@var1):
DECLARE @var1 varchar(128);
set @var1 = '/sites/Site1/Subsite1/Subsite2/Documents/marketing.docx';
select
TOP 1 *,
charindex(urlstub, @var1) as found
from
spwebs
where
charindex(urlstub, @var1) = 1
order by
urlstub DESC;
I'm looking for a way to get this to work for a table's worth of data instead of just the single variable @var1.
Example data
SPwebs:
/sites/Site1
/sites/Site1/Subsite1
/sites/Site1/Subsite1/Subsite2
/sites/Site2
etc..
IISlog: (this is the table I'd like to take the place of @var1 above)
/sites/Site1/Subsite1/Subsite2/Documents/marketing.docx
/sites/Site1/Subsite1/Subsite2/Documents/sales.docx
/sites/Site1/Subsite1/Subsite2/Documents/hr.docx
/sites/Site1/research/funding.docx
The expected outcome of the above would be:
Foreach record in the IISLog table:
Find the best/deepest matching record from the spwebs table:
|table | matchingSPweb |
|---------------------------------------------------------| --------------------------------|
| /sites/Site1/Subsite1/Subsite2/Documents/marketing.docx | /sites/Site1/Subsite1/Subsite2/ |
| /sites/Site1/Subsite1/Subsite2/Documents/sales.docx | /sites/Site1/Subsite1/Subsite2/ |
| /sites/Site1/Subsite1/Subsite2/Documents/hr.docx | /sites/Site1/Subsite1/Subsite2/ |
| /sites/Site1/research/funding.docx | /sites/Site1 |
I've tried
select iislogs2.*, spwebs.urlstub
from
iislogs2
inner join
(
select TOP 1 urlstub, csURIStem as found
from spwebs
where charindex(urlstub, iislogs2.csUriStem) = 1
order by urlstub DESC
) as x
on x.csuristem = iislogs2.csUriStem
but this just errors, it doesn't seem to understand csUriStem in the context of the subselect statement.
The easiest ways to fix your issue are either to change your current query to use a subquery in the select statement, e.g.:
SELECT iislogs2.*,
urlstub = (SELECT TOP 1 urlstub FROM spwebs WHERE CHARINDEX(urlstub, iislogs2.csUriStem) = 1 ORDER BY urlstub DESC)
from iislogs2;
... or change your current join to a cross apply, e.g.:
SELECT iislogs2.*, x.urlstub
from iislogs2
cross apply (SELECT TOP 1 urlstub FROM spwebs WHERE CHARINDEX(urlstub, iislogs2.csUriStem) = 1 ORDER BY urlstub DESC) AS x;
EDIT:
The query optimiser might do all sorts of weird sorts and spools, so one option to avoid that might be to use an explicit join with a CTE and then left join this back to your original table. For example:
;WITH CTE AS
(
SELECT i.csUriStem, s.urlstub, RN = ROW_NUMBER() OVER (PARTITION BY i.csUriStem ORDER BY s.urlstub DESC)
FROM iislogs2 AS i
JOIN spwebs AS s
ON i.csUriStem LIKE s.urlstub + '%'
)
SELECT i.*, c.urlstub
FROM iislogs2 AS i
LEFT JOIN CTE AS c
ON c.csUriStem = i.csUriStem
AND c.RN = 1;
Unfortunately, with strings and substrings, it's hard to get an execution plan that is really optimal for what you want to do, but I expect this sort of query will perform better with indexes than the other two.
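All three queries implement the same rule: for each log entry, keep the longest urlstub that the path starts with, or NULL when nothing matches (in the subquery and LEFT JOIN forms). A minimal plain-Python sketch of that rule, using the sample data from the question, looks like this; deepest_web is a name invented for the example.

```python
def deepest_web(path, url_stubs):
    """Return the longest stub that is a prefix of path, or None if none match."""
    matches = [stub for stub in url_stubs if path.startswith(stub)]
    return max(matches, key=len) if matches else None


sp_webs = [
    "/sites/Site1",
    "/sites/Site1/Subsite1",
    "/sites/Site1/Subsite1/Subsite2",
    "/sites/Site2",
]
iis_log = [
    "/sites/Site1/Subsite1/Subsite2/Documents/marketing.docx",
    "/sites/Site1/Subsite1/Subsite2/Documents/sales.docx",
    "/sites/Site1/Subsite1/Subsite2/Documents/hr.docx",
    "/sites/Site1/research/funding.docx",
]

matched = {path: deepest_web(path, sp_webs) for path in iis_log}
for path, web in matched.items():
    print(path, "->", web)
```

Note that a pure prefix test, in this sketch as in the SQL, would also match /sites/Site1 against a path under a site named /sites/Site10, so stubs may need a trailing slash if such names exist.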

EF6 - Generating unneeded nested queries

I have the following tables:
MAIN_TBL:
Col1 | Col2 | Col3
------------------
A | B | C
D | E | F
And:
REF_TBL:
Ref1 | Ref2 | Ref3
------------------
A | G1 | Foo
D | G1 | Bar
Q | G2 | Xyz
I wish to write the following SQL query:
SELECT M.Col1
FROM MAIN_TBL M
LEFT JOIN REF_TBL R
ON R.Ref1 = M.Col1
AND R.Ref2 = 'G1'
WHERE M.Col3 = 'C'
I wrote the following LINQ query:
from main in dbContext.MAIN_TBL
join refr in dbContext.REF_TBL
on "G1" equals refr.Ref2
into refrLookup
from refr in refrLookup.DefaultIfEmpty()
where main.Col1 == refr.Col1
select main.Col1
And the generated SQL was:
SELECT
[MAIN_TBL].[Col1]
FROM (SELECT
[MAIN_TBL].[Col1] AS [Col1],
[MAIN_TBL].[Col2] AS [Col2],
[MAIN_TBL].[Col3] AS [Col3]
FROM [MAIN_TBL]) AS [Extent1]
INNER JOIN (SELECT
[REF_TBL].[Ref1] AS [Ref1],
[REF_TBL].[Ref2] AS [Ref2],
[REF_TBL].[Ref3] AS [Ref3]
FROM [REF_TBL]) AS [Extent2] ON [Extent1].[Col1] = [Extent2].[Ref1]
WHERE ('G1' = [Extent2].[DESCRIPTION]) AND ([Extent2].[Ref1] IS NOT NULL) AND CAST( [Extent1].[Col3] AS VARCHAR) = 'C') ...
Looks like it is nesting a query within another query, while I just want it to pull from the table. What am I doing wrong?
I may be wrong, but it looks like your LINQ query and your SQL query don't do the same thing, especially in the left join clause: the key comparison belongs in the join condition, not in a where after DefaultIfEmpty().
I would go for this if you want something similar to your SQL query:
from main in dbContext.MAIN_TBL.Where(x => x.Col3 == "C")
join refr in dbContext.REF_TBL
on new{n = "G1", c = main.Col1} equals new{n = refr.Ref2, c = refr.Ref1}
into refrLookup
from r2 in refrLookup.DefaultIfEmpty()
select main.Col1
By the way, it doesn't make much sense to left join on a table which is not present in the select clause : you will just get multiple identical Col1 if there's more than one related item in the left joined table...
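The difference between putting the key comparison in the join condition and putting it in a where clause can be shown with a small plain-Python model of a left join. This is a sketch of the semantics only, not of what EF6 generates; the left_join helper is invented for the example, and 'G2' is used instead of 'G1' purely because it exposes the difference on the sample data (with 'G1' both variants return ['A']).

```python
main_tbl = [("A", "B", "C"), ("D", "E", "F")]                     # Col1, Col2, Col3
ref_tbl = [("A", "G1", "Foo"), ("D", "G1", "Bar"), ("Q", "G2", "Xyz")]  # Ref1, Ref2, Ref3


def left_join(main_rows, ref_rows, on):
    """Pair each main row with its matching ref rows, or with None if there are none."""
    joined = []
    for m in main_rows:
        matches = [r for r in ref_rows if on(m, r)]
        joined.extend((m, r) for r in (matches or [None]))
    return joined


# Both conditions in ON, as in the SQL: unmatched main rows are kept.
in_on = [
    m[0]
    for m, r in left_join(main_tbl, ref_tbl, lambda m, r: r[0] == m[0] and r[1] == "G2")
    if m[2] == "C"
]

# Key comparison moved to WHERE, as in the original LINQ: rows without a
# matching key fail the filter, so the join behaves like an inner join.
in_where = [
    m[0]
    for m, r in left_join(main_tbl, ref_tbl, lambda m, r: r[1] == "G2")
    if m[2] == "C" and r is not None and r[0] == m[0]
]

print(in_on, in_where)
```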
