Spark SQL eliminate non repeating values across tables

Spark SQL eliminate non repeating values across tables - sql-server

I have three tables and expected output highlighted below in green.
A ,2 4 adhoc values should not come in the output as it is non repeating in other tables.
I need to show only values that are repeating in tables
// package com.allaboutscala.chapter.one.tutorial_04
// val spark = SparkSession.builder.master("local[2]").appName("kafkaConsumer").getOrCreate()
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions._
import spark.implicits._
case class Person(T1:String,T2:String,T3:String,T4:String)
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// import spark.implicits._
import org.apache.spark.sql.functions._
val Table1=Seq(Person("A","1","3","adhoc"),Person("A","2","4","adhoc"),Person("A","3","5","adhoc")).toDF()
val Table2=Seq(Person("A","1","3","adhoc")).toDF()
val Table3=Seq(Person("A","3","5","adhoc")).toDF()
// val a=Table1.toDF()
// val b=Table2.toDF()
// val c=Table3.toDF()
Table1.createOrReplaceTempView("a")
Table2.createOrReplaceTempView("b")
Table3.createOrReplaceTempView("c")
val b1=spark.sql(""" select distinct
a.T1 as tbl1_ta , a.T2 as tbl1_t2 ,a.T3 as tbl1_t3 ,a.T4 as tbl1_t4,
b.T1 as tbl2_ta , b.T2 as tbl2_t2 ,b.T3 as tbl2_t3 ,b.T4 as tbl2_t4,
c.T1 as tbl3_ta , c.T2 as tbl3_t2 ,c.T3 as tbl3_t3 ,c.T4 as tbl3_t4 from a left join b on a.T1=b.T1 and a.T2=b.T2 and a.T3=b.T3 and a.T4=b.T4 left join c
on a.T1=c.T1 and a.T2=c.T2 and a.T3=c.T3 and a.T4=c.T4 """)
b1.printSchema
b1.createOrReplaceTempView("b1c")
spark.sql(""" select tbl1_ta,tbl1_t2,tbl1_t3,tbl1_t4,tbl2_ta,tbl2_t2,tbl2_t3,tbl2_t4,tbl3_ta,tbl3_t2,tbl3_t4,tbl3_t4 from b1c group by
tbl1_ta,tbl1_t2,tbl1_t3,tbl1_t4,tbl2_ta,tbl2_t2,tbl2_t3,tbl2_t4,tbl3_ta,tbl3_t2,tbl3_t4,tbl3_t4 having count(*) >1 """).show(false)
left join gives all rows, tried with group by count > 1 and exclude?
used databricks community edition cluster,can anyone give someidea

Related

Parameterized raw sql query much slower than query with actual values

I am trying to execute a raw sql query with connections[].cursor() in a Django app that connects to SQL server. The query executes much faster (<1s) when I provide the actual vlaues in the query string.
from django.db import connections
with connections['default'].cursor() as cursor:
cursor.execute("""
select c.column1 as c1
, ve.column2 as c2
from view_example c
left join view_slow_view ve on c.k1 = ve.k2
where c.column_condition = value_1 and c.column_cd_2 = value2
""")
result = dictfetchall(cursor)
But when I provide the values as params in the cursor.execute() method, the query becomes much slower (2 minutes).
from django.db import connections
with connections['default'].cursor() as cursor:
cursor.execute("""
select c.column1 as c1
, ve.column2 as c2
from view_example c
left join view_slow_view ve on c.k1 = ve.k2
where c.column_condition = %s and c.column_condition_2 = %s
""", [value_1, value_2])
contracts_dict_lst = dictfetchall(cursor)
I should also mention that the query is actually slow when executed on SSMS ONLY IF a condition is NOT provided:
where c.column_condition = value_1 and c.column_cd_2 = value2
It is as if when Django sends the query, it is executed without the parameters (hence the long response time) and then the parameters are provided so the result is filtered.
The values in question are provided by the user, so they change and have to be passed as params and not directly in the query to avoid sql injection.
The query is also much more complex than the example given above and doesn't map cleanly to a model so I have to use connection[].cursor()

This is probably parameter sniffing issue. If that's the case, there are couple of solutions. The easiest solution is using query hint.
Option 1:
from django.db import connections
with connections['default'].cursor() as cursor:
cursor.execute("""
select c.column1 as c1
, ve.column2 as c2
from view_example c
left join view_slow_view ve on c.k1 = ve.k2
where c.column_condition = %s and c.column_condition_2 = %s
OPTION(RECOMPILE) -- add this line to your query
""", [value_1, value_2])
contracts_dict_lst = dictfetchall(cursor)
Option 2:
from django.db import connections
with connections['default'].cursor() as cursor:
cursor.execute("""
declare v1 varchar(100) = %s -- declare variable and use them
declare v2 varchar(100) = %s
select c.column1 as c1
, ve.column2 as c2
from view_example c
left join view_slow_view ve on c.k1 = ve.k2
where c.column_condition = v1 and c.column_condition_2 = v2
""", [value_1, value_2])
contracts_dict_lst = dictfetchall(cursor)
This is a good link for more reading.

SQL Server : nested looping over two Selects

I have the following two queries that produce the results I need. Now the final output I truly need I would usually use python for after the results are returned, but unfortunately only SQL can be used.
Query A:
SELECT *
FROM openquery(PROD, 'SELECT `status`, computer_name, device_type
FROM assets
WHERE (device_type="SERVER")
AND (status="ACTIVE")')
Query B:
SELECT *
FROM openquery(AppMap, 'SELECT `t1`.`uaid` AS `uaid`, `t3`.`computer_name`,
FROM ((`applications` `t1`
JOIN `app_infrastructure` `t2` ON (((`t1`.`uaid` = `t2`.`uaid`))))
JOIN `infrastructure` `t3` ON ((`t2`.`infrastructure_id` = `t3`.`infrastructure_id`)));')
How I would want to process the results:
if a computer_name is in both A and B:
final_row = ['computer_name', 1]
elseif a computer_name is in A but not B:
final_row = ['computer_name', 0]
elseif a computer_name is in B but not A:
final_row = ['computer_name', 2]
So my final query results need to look like those rows, does that make sense?

In a stored procedure, use both queries to load table variables.
Then do a FULL OUTER JOIN query, joining the two table variables on computer_name, and use a CASE expression to get your final_row value for each computer name.

Exclude value from query where already returned in rows above

I am writing a query which will be used to populate a report used for putting away stock in a warehouse.
The report has 3 parameters, a source stock location, source bin number and a destination stock location.
The stock will be currently held in source stock location and bin number.
The destination stock location is where the stock needs to be moved to.
Each Variant has a default bin number in each stock location.
Multiple variants can have the same default bin number.
Bin numbers may not be alphabetical around the warehouse, and so each bin number is assigned a walk route, as the most efficient walking route around the warehouse.
The report will look at the default bin number associated with the item in the destination stock location, and if empty, offer that as the put away suggestion.
If the default bin number is not empty, it will look for the next available empty bin in the destination stock location (where walk route is higher than default bin), and then offer that as the put away suggestion.
The query works fine, and does exactly that, however it is reporting the same bin as previous "NextBinNo" suggestions in rows above.
How can I get the OUTER APPLY NextBinNo to filter out any previously suggested bins in higher rows of data? Also if two items have the same default bin number, it should use the NextBinNo for the second row with this default bin no.
My current query:
Select
row_number() Over(Order by DestSL.sl_id) as RowNo,
Stock_location.sl_name,
bin_number.bn_bin_number,
variant_detail.vad_variant_code,
variant_detail.vad_description,
variant_transaction_header.vth_current_quantity,
variant_transaction_header.vth_batch_number,
purchase_order_header.poh_order_number,
supplier_detail.sd_ow_account,
DestSL.sl_id as 'DestinationSLID',
DestSL.sl_name as 'DestinationStockLocation',
DestDefaultBin.bn_bin_number as 'DestinationDefaultBin',
DestDefaultBin.bn_walk_route as 'DestinationDefaultWalkRoute',
isnull(DestDefaultBinQty.BinQty,0) as QtyInDefaultBin,
NextBinNo.NextBinNo as 'NextBinNo',
NextBinNo.NextWalkRoute as 'NextBinWalkRoute',
isnull(NextBinNo.BinQty,0) as 'NextBinQty',
case when DestDefaultBin.bn_bin_number is null
then 'Not Stocked in This Location'
Else
case when isnull(DestDefaultBinQty.BinQty,0) > 0
then
case when NextBinNo.NextBinNo is NULL
then 'No Free Bin'
Else NextBinNo.NextBinNo
End
Else DestDefaultBin.bn_bin_number
End
End as 'Put Away Destination'
From variant_transaction_header
join bin_number on bin_number.bn_id = variant_transaction_header.vth_bn_id
join stock_location on stock_location.sl_id = variant_transaction_header.vth_sl_id
join variant_detail on variant_detail.vad_id = variant_transaction_header.vth_vad_id
join transaction_type on transaction_Type.tt_id = variant_transaction_header.vth_tt_id
left join purchase_order_line on purchase_order_line.pol_id = variant_transaction_header.vth_pol_id
left join purchase_order_header on purchase_order_header.poh_id = purchase_order_line.pol_poh_id
left join supplier_detail on supplier_detail.sd_id = purchase_order_header.poh_sd_id
join stock_location DestSL on DestSL.sl_id = #DestinationStockLoc
left join variant_stock_location DestVSL on DestVSL.vsl_vad_id = variant_detail.vad_id and DestVSL.vsl_sl_id = DestSL.sl_id
left join bin_number DestDefaultBin on DestDefaultBin.bn_id = DestVSL.vsl_bn_id
left join
(select sum(variant_transaction_header.vth_current_quantity) as BinQty,
variant_transaction_header.vth_bn_id
from variant_transaction_header
join transaction_type on transaction_Type.tt_id = variant_transaction_header.vth_tt_id
Where variant_transaction_header.vth_current_quantity > 0
and transaction_type.tt_transaction_type = 'IN' and transaction_Type.tt_update_current_qty = 1
Group by variant_transaction_header.vth_bn_id) as DestDefaultBinQty on DestDefaultBinQty.vth_bn_id = DestDefaultBin.bn_id
Outer Apply
(select top 1
row_number() Over(Order by NextBin.bn_bin_number) as RowNo,
NextBin.bn_bin_number as NextBinNo,
NextBin.bn_walk_route as NextWalkRoute,
BinQty.BinQty
from
Stock_location DestSL
Join bin_number NextBin on NextBin.bn_sl_id = DestSL.sl_id
left join
(select sum(variant_transaction_header.vth_current_quantity) as BinQty,
variant_transaction_header.vth_bn_id
from variant_transaction_header
join transaction_type on transaction_Type.tt_id = variant_transaction_header.vth_tt_id
Where variant_transaction_header.vth_current_quantity > 0
and transaction_type.tt_transaction_type = 'IN' and transaction_Type.tt_update_current_qty = 1
Group by variant_transaction_header.vth_bn_id) as BinQty on BinQty.vth_bn_id = NextBin.bn_id
Where NextBin.bn_sl_id = #DestinationStockLoc
and NextBin.bn_walk_route > DestDefaultBin.bn_walk_route
And isnull(BinQty.BinQty,0) = 0
order by NextBin.bn_walk_route, nextbin.bn_bin_number) as NextBinNo
where variant_transaction_header.vth_current_quantity > 0
and transaction_type.tt_transaction_type = 'IN' and transaction_Type.tt_update_current_qty = 1
and stock_location.sl_id = #SourceStockLoc and bin_number.bn_id = #SourceBinNo
You can see my current results below:
Row2 is using the NextBinNo as the default bin has stock.
Row3 is also suggesting using AA08A2 as the next bin.
Row 6 is currently suggesting AA01A2 but that has already been suggested in Row1.

Just to answer this, in the end I could not achieve what I wanted directly in SQL.
The data is ultimately being returned to a report writer which can run Visual Basic code.
I had to execute the NextBinNo sub query separately in VB.
In VB I can define a string which contains a list of all the "used" bin numbers and so on each row this is referenced in the WHERE of the query to check it has not been used in a row above.
I could then return this value as a new column dynamically inserted in the dataset at run time.

ibm 1 sql creating - need to add 2 tables

I have following view which is working but not sure how to add 2 tables to join.
This table is adres1 and it will join on the IDENT# and IDSFX# to table
prodta.adres1 called adent# and adsfx#, there I need a col. ads15.
then i also need to get the ship to, row in this adres1. this we get first from the order table, prodta. oeord1 in col. odgrc#. This grc# is 11 pos and is combined 8 and 3 of the ent and suf. these 2 represent the ship to record and looking in same table adres1 (we do have many logical views on them if it's easier, like adres15) we can get col. ADSTTC for the ship to state.
Not sure if can included these 2 new parts to the current view created code below. Please ask if something not clear, it's an old system and somewhat developed convoluted.
CREATE VIEW Prolib.SHPWEIGHTP AS SELECT
T01.IDORD#,
T01.IDDOCD,
T01.IDPRT#,
t01.idsfx#,
T01.IDSHP#,
T01.IDNTU$,
T01.IDENT#,
(T01.IDNTU$ * T01.IDSHP#) AS LINTOT,
T02.IAPTWT,
T02.IARCC3,
T02.IAPRLC,
T03.PHVIAC,
T03.PHORD#,
PHSFX#,
T01.IDORDT,
T01.IDHCD3
FROM PRODTA.OEINDLID T01
INNER JOIN PRODTA.ICPRTMIA T02 ON T01.IDPRT# = T02.IAPRT#
INNER JOIN
(SELECT DISTINCT
PHORD#,
PHSFX#,
PHVIAC,
PHWGHT
FROM proccdta.pshippf) AS T03 ON t01.idord# = T03.phord#
WHERE T01.IDHCD3 IN ('MDL','TRP')

I'm not exactly clear on what you're asking, and it looks like some of the column-names are missing from your description, but this should get you pretty close:
CREATE VIEW Prolib.SHPWEIGHTP AS
SELECT T01.IDORD#,
T01.IDDOCD,
T01.IDPRT#,
t01.idsfx#,
T01.IDSHP#,
T01.IDNTU$,
T01.IDENT#,
( T01.IDNTU$ * T01.IDSHP# ) AS LINTOT ,
T02.IAPTWT,
T02.IARCC3,
T02.IAPRLC,
T03.PHVIAC,
T03.PHORD#,PHSFX#,
T01.IDORDT,
T01.IDHCD3,
t04.ads15
FROM PRODTA.OEINDLID T01
INNER JOIN PRODTA.ICPRTMIA T02
ON T01.IDPRT# = T02.IAPRT#
INNER JOIN (SELECT DISTINCT
PHORD#,
PHSFX#,
PHVIAC,
PHWGHT
FROM proccdta.pshippf) AS T03
ON t01.idord# = T03.phord#
JOIN prodta.adres1 as t04
on t04.adent# = t01.adent#
and t04.adsfx# = t01.adsfx#
JOIN prodta.oeord1 t05
on t05.odgrc# = T01.IDENT# || T01.SUFFIX
WHERE T01.IDHCD3 IN ('MDL','TRP')
Let me know if you need more details.
HTH !

Converting complex sql stored proc into linq

I'm using Linq to Sql and have a stored proc that won't generate a class. The stored proc draws data from multiple tables into a flat file resultset.
The amount of data returned must be as small as possible, the number of round trips to the Sql Server need to be limited, and the amount of server-side processing must be limited as this is for an ASP.NET MVC project.
So, I'm trying to write a Linq to Sql Query however am struggling to both replicate and limit the data returned.
Here's the stored proc that I'm trying to convert:
SELECT AdShops.shop_id as ID, Users.image_url_75x75, AdShops.Advertised,
Shops.shop_name, Shops.title, Shops.num_favorers as hearts, Users.transaction_sold_count as sold,
(select sum(L4.num_favorers) from Listings as L4 where L4.shop_id = L.shop_id) as listings_hearts,
(select sum(L4.views) from Listings as L4 where L4.shop_id = L.shop_id) as listings_views,
L.title AS listing_title, L.price as price, L.listing_id AS listing_id, L.tags, L.materials, L.currency_code,
L.url_170x135 as listing_image_url_170x135, L.url AS listing_url, l.views as listing_views, l.num_favorers as listing_hearts
FROM AdShops INNER JOIN
Shops ON AdShops.shop_id = Shops.shop_id INNER JOIN
Users ON Shops.user_id = Users.user_id INNER JOIN
Listings AS L ON Shops.shop_id = L.shop_id
WHERE (Shops.is_vacation = 0 AND
L.listing_id IN
(
SELECT listing_id
FROM (SELECT l2.user_id , l2.listing_id, RowNumber = ROW_NUMBER() OVER (PARTITION BY l2.user_id ORDER BY NEWID())
FROM Listings l2
INNER JOIN (
SELECT user_id
FROM Listings
GROUP BY
user_id
HAVING COUNT(*) >= 3
) cnt ON cnt.user_id = l2.user_id
) l2
WHERE l2.RowNumber <= 3 and L2.user_id = L.user_id
)
)
ORDER BY Shops.shop_name
Now, so far I can return a flat file but am not able to limit the number of listings. Here's where I'm stuck:
Dim query As IEnumerable = From x In db.AdShops
Join y In (From y1 In db.Shops
Where y1.Shop_name Like _Search + "*" AndAlso y1.Is_vacation = False
Order By y1.Shop_name
Select y1) On y.Shop_id Equals x.shop_id
Join z In db.Users On x.user_id Equals z.User_id
Join l In db.Listings On l.Shop_id Equals y.Shop_id
Select New With {
.shop_id = y.Shop_id,
.user_id = z.user_id,
.listing_id = l.Listing_id
} Take 24 ' Fields ommitted for briefity...
I assume to select a random set of 3 listings per shop, I'd need to use a lambda expression however am not sure how to do this. Also, need to add in somewhere consolidated totals for listing fieelds against individual shops...
Anyone have any thoughts?
UPDATE:
Here's the current solution that I'm looking at:
Result class wrapper:
Public Class NewShops
Public Property Shop_id As Integer
Public Property listing_id As Integer
Public Property tl_listing_hearts As Integer?
Public Property tl_listing_views As Integer?
Public Property listing_creation As Date
End Class
Linq + code:
Using db As New Ads.DB(Ads.DB.Conn)
Dim query As IEnumerable(Of IGrouping(Of Integer, NewShops)) =
(From x In db.AdShops
Join y In (From y1 In db.Shops
Where (y1.Shop_name Like _Search + "*" AndAlso y1.Is_vacation = False)
Select y1
Skip ((_Paging.CurrentPage - 1) * _Paging.ItemsPerPage)
Take (_Paging.ItemsPerPage))
On y.Shop_id Equals x.shop_id
Join z In db.Users On x.user_id Equals z.User_id
Join l In db.Listings On l.Shop_id Equals y.Shop_id
Join lt In (From l2 In db.Listings _
Group By id = l2.Shop_id Into Hearts = Sum(l2.Num_favorers), Views = Sum(l2.Views), Count() _
Select New NewShops With {.tl_listing_views = Views,
.tl_listing_hearts = Hearts,
.Shop_id = id})
On lt.Shop_id Equals y.Shop_id
Select New NewShops With {.Shop_id = y.Shop_id,
.tl_listing_views = lt.tl_listing_views,
.tl_listing_hearts = lt.tl_listing_hearts,
.listing_creation = l.Creation,
.listing_id = l.Listing_id
}).GroupBy(Function(s) s.Shop_id).OrderByDescending(Function(s) s(0).tl_listing_views)
Dim Shops as New Dictionary(Of String, List(Of NewShops))
For Each item As IEnumerable(Of NewShops) In query
Shops.Add(item(0).shop_name, (From i As NewShops In item
Order By i.listing_creation Descending
Select i Take 3).ToList)
Next
End Using
Anyone have any other suggestions?

From the looks of that SQL and code, I'd not be turning it into LINQ queries. It'll just obfuscate the logic and probably take you days to get it correct.
If SQLMetal doesn't generate it properly, have you considered using the ExecuteQuery method of the DataContext to return a list of the items you're after?
Assuming that your sproc you're trying to convert is called sp_complicated, and takes in one parameter, something like the following should do the trick
Protected Class TheResults
Public Property ID as Integer
Public Property image_url_75x75 as String
'... and so on and so forth for all the returned columns. Be careful with nulls
End Class
'then, when you want to use it
Using db As New Ads.DB(Ads.DB.Conn)
dim results = db.ExecuteQuery(Of TheResults)("exec sp_complicated {0}", _Search)
End Using
Before you freak out, that's not susceptible to SQL Injection. L2SQL uses proper SQLParameters, as long as you use the squigglies and don't just concatenate the strings yourself.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Spark SQL eliminate non repeating values across tables - sql-server

Related

Parameterized raw sql query much slower than query with actual values

SQL Server : nested looping over two Selects

Exclude value from query where already returned in rows above

ibm 1 sql creating - need to add 2 tables

Converting complex sql stored proc into linq

Categories

Resources