Optimizing Large Table Join in PySpark - query-optimization

I have a large fact table, roughly 500M rows per day. The table is partitioned by region_date.
Every day I have to scan through 6 months of data, left outer join it with another, much smaller subset (about 1M rows) on an id and date column, and calculate two aggregate values: sum(duration_secs) for rows whose id exists in the right table, and the total sum(duration_secs).
My SparkSQL looks like this:
SELECT
a.region_date,
SUM(case
when t4.id is null then 0
else a.duration_secs
end) matching_duration_secs,
SUM(a.duration_secs) total_duration_secs
FROM fact_table a LEFT OUTER JOIN id_lookup t4
ON a.id = t4.id
and a.region_date = t4.region_date
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
AND a.is_test = 0
AND a.desc = 'VIDEO'
GROUP BY a.region_date
What is the best way to optimize and distribute/partition the data? The query runs for more than 3 hours now. I tried spark.sql.shuffle.partitions = 700
If I roll up the daily data at the "id" level, it's about 5M rows per day. Should I roll up the data first and then do the join?
Thanks,
Ram.

Because there are some filter conditions in your query, you can split it into two queries to reduce the amount of data first:
table1 = select * from fact_table a
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
AND a.is_test = 0
AND a.desc = 'VIDEO'
Then you can use this new table, which is much smaller than the original, to join against the id_lookup table.
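Put together, a rough sketch of both steps in SparkSQL (column names are taken from your query; the BROADCAST hint is an assumption worth testing, since id_lookup at about 1M rows will often fit in executor memory):
CREATE OR REPLACE TEMPORARY VIEW fact_filtered AS
SELECT f.region_date, f.id, f.duration_secs
FROM fact_table f
WHERE f.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE, -180), 'yyyyMMdd') AS BIGINT)
  AND f.is_test = 0
  AND f.desc = 'VIDEO';

SELECT /*+ BROADCAST(t4) */
  a.region_date,
  SUM(CASE WHEN t4.id IS NULL THEN 0 ELSE a.duration_secs END) AS matching_duration_secs,
  SUM(a.duration_secs) AS total_duration_secs
FROM fact_filtered a
LEFT OUTER JOIN id_lookup t4
  ON a.id = t4.id
  AND a.region_date = t4.region_date
GROUP BY a.region_date;
If the broadcast join is picked up, the 500M-rows-per-day side never has to be shuffled for the join, which is usually where most of the time goes.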

Related

SQL query to insert stats of students to a table

I would like some help on how to write a SQL Server query in order to insert the monthly stats of students into a table.
My monthly stats table is something like this:
| StudentID | Year | Month | Grade1 | Grade2 | Absences |
Now I have another table with the student details like StudentID, name, etc., and also multiple other tables with grades, attendance, etc.
My goal is to select all StudentIDs from StudentDetails and insert them into the Monthly Stats table while I calculate Grade1, Grade2, and Absences from the other tables.
What is the best way to write such a query?
Do I first insert the StudentIDs, Year, and Month columns with an INSERT ... SELECT query, and after that somehow iterate through every StudentID that was inserted and run update queries (to calculate the rest of the columns) for each StudentID for the specified month and year?
I just need an example or some logic on how to achieve this.
For the first part of inserting the StudentIDs I have this:
declare @maindate date = '20230101';
insert into Monthly_Stats (StudentID, Year, Month)
(select StudentID, AllocatedYear, AllocatedMonth
from Students_Allocation
where AllocatedMonth = DATEPART(MONTH, @maindate)
and AllocatedYear = DATEPART(YEAR, @maindate)
and Active = 1)
After insertion I would like somehow to update every other column (Grade1, Grade2,Absences...) from multiple other tables for each StudentID for the aforementioned Month and Year.
Any ideas?
This is how I usually perform a batch update:
UPDATE MS
SET
    MS.GRADE1 = T1.Somedata + T2.Somedata + T3.Somedata
FROM
    Monthly_Stats MS
    INNER JOIN TABLE_1 AS T1
        ON MS.StudentID = T1.StudentID AND MS.Year = T1.Year AND MS.Month = T1.Month
    LEFT JOIN TABLE_2 AS T2
        ON T1.StudentID = T2.StudentID AND T1.Year = T2.Year AND T1.Month = T2.Month
    LEFT JOIN TABLE_3 AS T3
        ON T1.StudentID = T3.StudentID AND T1.Year = T3.Year AND T1.Month = T3.Month;
Be careful with the two left joins. Depending on your database normalization, you may need more conditions in the ON clauses to ensure the join output is as expected, and a NULL coming from the outer side of either join will make the whole sum NULL unless you wrap those columns in ISNULL()/COALESCE().
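If each grade is actually built up from several rows per student per month, it is usually safer to pre-aggregate in a derived table so the joins don't multiply rows. A sketch, using a hypothetical Grades_1 source table with a Score column:
UPDATE MS
SET MS.Grade1 = G.TotalScore
FROM Monthly_Stats MS
INNER JOIN (
    SELECT StudentID, [Year], [Month], SUM(Score) AS TotalScore
    FROM Grades_1               -- hypothetical source table for Grade1
    GROUP BY StudentID, [Year], [Month]
) AS G
    ON G.StudentID = MS.StudentID
   AND G.[Year] = MS.[Year]
   AND G.[Month] = MS.[Month];
The same pattern works for Grade2 and Absences with their own source tables.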
Hope it helps

In sql, how to filter records in two joined tables

I output invoices that are made up of info in two separate data tables linked by a unique ID #. I need to update the service provided in a group of invoices (service info contained in table_B) for only a certain date period (date info contained in table_A).
Here are the two tables I am joining:
Table_A
ID | Name        | Date   | Total
1  | ABC Company | 1/1/17 | $50
2  | John Smith  | 3/1/17 | $240
3  | Mary Jones  | 2/1/16 | $320
1  | ABC Company | 8/1/16 | $500
Table_B
Table_ID (= ID in Table_A) | Service   | Unit Price | Qty
1                          | Service A | $50.00     | 10
2                          | Service B | $20.00     | 12
3                          | Service B | $20.00     | 16
1                          | Service A | $50.00     | 10
I am able to join the two tables using:
Select * from Table_B b inner join Table_A a on b.Table_ID = a.ID
which results in following:
Results
Table_ID | Service   | Unit Price | Qty | ID | Name    | Date   | Total
1        | Service A | $50.00     | 10  | 1  | ABC Co. | 1/1/17 | $500
2        | Service B | $20.00     | 12  | 2  | John S. | 3/1/17 | $240
3        | Service B | $20.00     | 16  | 3  | Mary J. | 2/1/16 | $320
1        | Service A | $50.00     | 10  | 1  | ABC Co. | 8/1/16 | $500
Now, I want only rows that are for dates greater than 12/31/16. However, when I add a WHERE clause for the date (see below) my results don't change.
Select * from Table_B b inner join Table_A a on b.Table_ID = a.ID where date > 12/31/16
I would expect just two rows for services on 1/1/17 and 3/1/17. How can I filter for just rows with a particular date value in this newly joined table?
Assuming your date is contained in a column intended for storing dates, and not a string, try making sure that the date you're passing in really is being interpreted as a date:
SELECT
*
FROM
table_b b
INNER JOIN
table_a a
on b.Table_ID = a.ID
WHERE
a.date > CONVERT(datetime , '20161231' , 112 )
I suspect that SQL Server is interpreting your 12/31/16 not as a date at all but as arithmetic: twelve divided by thirty-one divided by sixteen. Since those are integer literals, integer division kicks in and the whole expression evaluates to 0.
When an integer is compared to a datetime column, it is implicitly converted to a date counted in days from the base date of 1 January 1900, so 0 means 1900-01-01. That's why your results aren't filtering: every date in the table satisfies the WHERE clause, because they are all later than 1900-01-01.
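You can see what the literal actually evaluates to with a quick check (SQL Server syntax):
SELECT 12/31/16 AS not_a_date;                          -- integer division: 0
SELECT CONVERT(datetime, 12/31/16) AS what_it_compares; -- 1900-01-01 00:00:00.000
SELECT CONVERT(datetime, '20161231', 112) AS intended;  -- 2016-12-31, what the WHERE should use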
What result are you getting with your current implementation? I don't see any issue with your current query.
Please test the script below; it may give you the expected output.
Select *
from Table_B b
inner join Table_A a
on b.Table_ID = a.ID
and date > convert(date , '20161231' , 112 )
Select *
from Table_B b
inner join Table_A a
on b.Table_ID = a.ID
where date > '12/31/16'
Can you try using quotes around your date?
Or, better, use:
Select *
from Table_B b
inner join Table_A a
on b.Table_ID = a.ID
where Date BETWEEN '12/31/2016' and '1/1/2018'
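If you do stick with string literals, an unambiguous yyyymmdd format avoids any dependence on the server's language and DATEFORMAT settings (a minor safety tweak on the query above):
Select *
from Table_B b
inner join Table_A a
on b.Table_ID = a.ID
where a.[Date] >= '20170101'   -- yyyymmdd is read the same way regardless of DATEFORMAT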

How to add column A (a date column) to column B (a number of business days) in Teradata to get the new date?

Here's my data;
table A.pickup_date is a date column
table A.biz_days is the number of business days I want to add to A.pickup_date
table B.date
table B.is_weekend (Y or N)
table B. is_holiday (Y or N)
Basically, from table B I know whether each date is a business day or not. Now I want a third column in table A with the exact date I get after adding A.biz_days business days to A.pickup_date.
Can anyone provide me with either a case when statement or procedure statement for this? Unfortunately we are not allowed to write our own functions in Teradata.
This is pretty darned ugly, but I think it should get you started.
First I created a volatile table to represent your table a:
CREATE VOLATILE TABLE vt_pickup AS
(SELECT CURRENT_DATE AS pickup_date,
8 AS Biz_Days) WITH DATA PRIMARY INDEX(pickup_date)
ON COMMIT PRESERVE ROWS;
INSERT INTO vt_pickup VALUES ('2015-02-24',5);
Then I joined that with sys_calendar.calendar to get the days of the week:
CREATE VOLATILE TABLE VT_Days AS
(
SELECT
p.pickup_date,
day_of_week
FROM
vt_pickup p
INNER JOIN sys_calendar.CALENDAR c
ON c.calendar_date >= p.pickup_date
AND c.calendar_date < (p.pickup_date + Biz_Days)
) WITH DATA
PRIMARY INDEX(pickup_date)
ON COMMIT PRESERVE ROWS
Then I can use all that to generate the actual delivery date:
SELECT
p.pickup_date,
p.biz_days,
biz_days + COUNT(sundays.day_of_week) + COUNT (saturdays.day_of_week) AS TotalDays,
COUNT (sundays.day_of_week) AS Suns,
COUNT (saturdays.day_of_week) AS Sats,
p.pickup_date + totaldays AS Delivery_Date
FROM
vt_pickup p
LEFT JOIN vt_days AS Sundays ON
p.pickup_date = sundays.pickup_date
AND sundays.day_of_week = 1
LEFT JOIN vt_days AS saturdays ON
p.pickup_date = saturdays.pickup_date
AND saturdays.day_of_week = 7
GROUP BY 1,2
You should be able to use the logic with another alias for your holidays.
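For example (a sketch; holiday_dates is a hypothetical table holding one row per holiday date), you can build a third volatile table the same way and count it in:
CREATE VOLATILE TABLE vt_holidays AS
(
SELECT
p.pickup_date,
h.calendar_date AS holiday_date
FROM
vt_pickup p
INNER JOIN holiday_dates h        -- hypothetical: one row per holiday
ON h.calendar_date >= p.pickup_date
AND h.calendar_date < (p.pickup_date + p.Biz_Days)
) WITH DATA
PRIMARY INDEX(pickup_date)
ON COMMIT PRESERVE ROWS;
Then LEFT JOIN vt_holidays in the final query exactly like the saturdays and sundays aliases, and add its COUNT into TotalDays.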
The easiest way to do this is to calculate a sequential business-day number (add it as a new column to your calendar table if this is a recurring operation, otherwise compute it in a WITH clause):
SUM(CASE WHEN is_weekend = 'Y' OR is_holiday = 'Y' THEN 0 ELSE 1 END)
OVER (ORDER BY calendar_date
ROWS UNBOUNDED PRECEDING) AS biz_day#
Then you need two joins:
SELECT ..., c2.calendar_date
FROM tableA AS a
JOIN tableB AS c1
ON a.pickup_date = c1.calendar_date
JOIN tableB AS c2
ON c2.biz_day# = c1.biz_day# + a.biz_days
AND c2.is_weekend = 'N'
AND c2.is_holiday = 'N'
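Put together as a single statement (a sketch, assuming tableB's date column is called calendar_date as above and that your Teradata release supports WITH):
WITH biz_cal AS (
SELECT
calendar_date,
is_weekend,
is_holiday,
SUM(CASE WHEN is_weekend = 'Y' OR is_holiday = 'Y' THEN 0 ELSE 1 END)
OVER (ORDER BY calendar_date
ROWS UNBOUNDED PRECEDING) AS biz_day#
FROM tableB
)
SELECT
a.pickup_date,
a.biz_days,
c2.calendar_date AS delivery_date
FROM tableA AS a
JOIN biz_cal AS c1
ON a.pickup_date = c1.calendar_date
JOIN biz_cal AS c2
ON c2.biz_day# = c1.biz_day# + a.biz_days
AND c2.is_weekend = 'N'
AND c2.is_holiday = 'N';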

sql cross table calculations

Hi, I need to write a query that does multiple things. I made it so it can get the details of orders within a certain time frame, for customers aged between 20 and 30, but I also need to check whether the order's products cost more than a set amount.
However, that data is in multiple tables.
One table has the order id, the product code and the quantity; another has the product information such as code and price; and the customer details come from a third table.
So I need to use the product's price together with the product code and quantity to do a cross-table calculation, see if the order total is above 100, and I'm trying to do this with an AND condition in the WHERE clause.
So I have 3 tables:
Orderplaced table with oid, odate, custno, paid
ordered table with oid, itemid, quant
items table with itemid, itemname, price
and I need to do a calculation across those tables in my query:
SELECT DISTINCT Orderplaced.OID, Orderplaced.odate, Orderplaced.custno, Orderplaced.paid
FROM Cust, Orderplaced, items, Ordered
WHERE Orderplaced.odate BETWEEN '01-JUL-14' AND '31-DEC-14'
AND Floor((sysdate-Cust.DOB) / 365.25) Between '20' AND '30'
AND Cust.SEX='M'
AND items.itemid=ordered.itemid
AND $sum(ordered.quan*item.PRICE) >100;
No matter which way I try to get the calculation to work, it doesn't seem to work; it always returns the same result, even on orders under 100 dollars.
Any advice on this would be good, as it's for my studies but is troubling me a lot.
I think this is what you want. (I'm not familiar with $sum, so I've replaced it with SUM().)
SELECT
Orderplaced.OID,
Orderplaced.odate,
Orderplaced.custno,
Orderplaced.paid,
sum(ordered.quant * items.PRICE)
FROM
Cust
JOIN Orderplaced ON Cust.CustNo = Orderplaced.custno
JOIN Ordered ON Ordered.Oid = Orderplaced.Oid
JOIN items ON items.itemid = ordered.itemid
WHERE
Orderplaced.odate BETWEEN DATE '2014-07-01' AND DATE '2014-12-31'
AND Floor((sysdate-Cust.DOB) / 365.25) Between 20 AND 30
AND Cust.SEX = 'M'
GROUP BY
Orderplaced.OID,
Orderplaced.odate,
Orderplaced.custno,
Orderplaced.paid
HAVING
sum(ordered.quant * items.PRICE) > 100;
I think you want to try something like this...
SELECT DISTINCT Orderplaced.OID, Orderplaced.odate, Orderplaced.custno, Orderplaced.paid
FROM Cust
JOIN Orderplaced ON
Cust.<SOMEID> = OrderPlaced.<CustId>
AND Orderplaced.odate BETWEEN '01-JUL-14' AND '31-DEC-14'
WHERE Floor((sysdate-Cust.DOB) / 365.25) Between 20 AND 30
AND Cust.SEX='M'
AND (
SELECT SUM(Ordered.quan*Item.PRICE)
FROM Ordered
JOIN Item ON Item.ItemId = Ordered.ItemId
WHERE Ordered.<SomeId> = OrderPlaced.<SomeId>) > 100
A couple of pointers:
1. FLOOR returns a number; you are comparing it to strings ('20' and '30').
2. Typically, when referencing a table in a query, the table has to be joined on its key columns. In your query you're referencing items and ordered without joining those tables to the others on any key columns.
Hope that helps

Reads are not getting lower after adding an index

The requirement is to load 50 records at a time (paging) with all 65 columns of table "empl" with minimum IO. There are 280,000+ records in the table. There is only one clustered index, over the PK.
Pagination query is as following:
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY e.[uon] DESC ) AS [row_number], e.*
FROM
empl e with (NOLOCK)
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid
WHERE
e.del = 0 AND e.pub = 1 AND e.sid = 2
AND e.md = 0
AND e.tid = 3
AND e.coid = 2
AND (e.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1))
SELECT
*
FROM
result_set
WHERE
[row_number] BETWEEN 0 AND 50
Following are the stats after running the above query from profiler:
CPU: 1500, Reads: 25576, Duration: 25704
Then I put the following index over the table empl:
CREATE NONCLUSTERED INDEX [ci_empl]
ON [dbo].[empl] ([del],[md],[pub],[tid],[coid],[sid],[ptid],[cid],[uon])
GO
After adding the index, CPU and reads are still high. I don't know whether something is wrong with the index or with the query.
Edit:
The following query is also producing high reads after adding the index, and it selects only 3 columns and 1 count.
SELECT TOP (2147483647)
ame.aid ID, ame.name name,
COUNT(empl.pid) [Count], ps.uff uff FROM ame with (NOLOCK)
JOIN pam AS pa WITH (NOLOCK) ON pa.aid = ame.aid
JOIN empl WITH (NOLOCK) ON empl.pid = pa.pid
LEFT JOIN psam AS ps
ON ps.psid = 1001
AND ps.aid = ame.aid
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = empl.ptid
WHERE
empl.del = 0 AND empl.pub = 1 AND empl.sid = 2
AND empl.md = 0
AND (empl.tid = 3)
AND (empl.coid = 2)
AND (empl.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1)
AND ame.pub = 1 AND ame.del = 0
GROUP BY ame.aid, ame.name, ps.uff
ORDER BY ame.name ASC
Second Edit:
Now I have put the following index on the "uon" column:
CREATE NONCLUSTERED INDEX [ci_empl_uon]
ON [dbo].[empl] (uon)
GO
But CPU and reads are still high.
Third Edit:
DTA suggested an index with all columns included for the first query, so I altered the suggested index and converted it into a filtered index on the basic equality filters to make it more effective.
I added the line below, after the INCLUDE clause, while creating the index:
Where e.del = 0 AND e.pub = 1 AND e.sid = 2 AND e.md = 0 AND e.coid = 2
But the reads are still high on both the development and production machines.
Fourth Edit:
Now I have come to a solution that improves the performance, but it is still not up to the goal. The key is that it no longer fetches all the data up front.
The query is as follows:
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY e.[uon] DESC ) AS [row_number], e.pID pID
FROM
empl e with (NOLOCK)
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid
WHERE
e.del = 0 AND e.pub = 1 AND e.sid = 2
AND e.md = 0
AND e.tid = 3
AND e.coid = 2
AND (e.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1))
SELECT
*
FROM
result_set join empl on result_set.pID = empl.pID
WHERE
[row_number] BETWEEN @start AND @end
And recreated the index with key column alterations, include and filter:
CREATE NONCLUSTERED INDEX [ci_empl]
ON [dbo].[empl] ([ptid],[cid],[tid],[uon])
INCLUDE ([pID])
Where
[coID] = 2 and
[sID] = 2 and
[pub] = 1 and
[del] = 0 and
[md] = 0
GO
It improves the performance, but not up to the goal.
You are selecting the top 50 rows ordered by e.uon desc. An index that starts with uon will speed up the query:
create index IX_Empl_Uon on dbo.empl (uon)
The index will allow SQL Server to scan the top N rows of this index. N is the highest number in your pagination: for the 3rd page of 50 elements, N equals 150. SQL Server then does 50 key lookups to retrieve the full rows from the clustered index. As far as I know, this is a textbook example of where an index can make a big difference.
Not every query optimizer is smart enough to notice that row_number() over ... as rn combined with where rn between 1 and 50 means the top 50 rows. But SQL Server 2012 is. It uses the index both for the first page and for subsequent pages, like row_number() between 50 and 99.
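An easy way to check whether the index is actually being used this way is to compare logical reads before and after creating it (standard SQL Server session options):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the pagination query once and note the "logical reads" for empl in the Messages tab

CREATE INDEX IX_Empl_Uon ON dbo.empl (uon);

-- run the same query again: if the optimizer picks the new index, logical reads on empl
-- should drop sharply, and the plan should show a seek on IX_Empl_Uon plus ~50 key lookups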
You are trying to find the X through X+Nth row from a dataset, based on an order specified by column uon.
I’m assuming here that uon is the mentioned primary key. If not, without an index where uon is the first (if not only) column, a table scan is inevitable.
Next wrinkle: you don't want that direct span of rows, you want that span of rows as filtered by a fairly extensive assortment of filters. The clustered index might pull the first 50 rows, but the WHERE may filter none, some, or all of those out. More rows will almost certainly have to be read in order to "fill your span".
More fun: you perform a left outer join on table empl_add (i.e. retaining the empl row even if there is no matching empl_add), and then filter out all rows where empl_add.ptgid is not found in the subquery. You might as well make this an inner join; it may speed things up and certainly will not make them slower. It is also a "filtering factor" that cannot be addressed with an index on table empl.
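As a sketch of that change (same tables and filters as the CTE body above, with the left join tightened to an inner join and the IN subquery written as EXISTS; worth comparing plans for both forms):
SELECT ROW_NUMBER() OVER (ORDER BY e.[uon] DESC) AS [row_number], e.*
FROM empl e
INNER JOIN empl_add ea
    ON ea.ptid = e.ptid
WHERE e.del = 0 AND e.pub = 1 AND e.sid = 2
  AND e.md = 0
  AND e.tid = 3
  AND e.coid = 2
  AND e.cid = 102
  AND EXISTS (SELECT 1
              FROM empl_dep d
              WHERE d.ptgid = ea.ptgid
                AND d.psid = 1001
                AND d.ib = 1)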
So: as I see it (i.e. I’m not testing it all out locally), SQL has to first assemble the data, filter out the invalid rows (which involves table joins), order what remains, and return that span of rows you are interested in. I believe that, with or without the index on uon, SQL is identifying a need to read all the data and filter/sort before it can pick out the desired range.
(Your new index would appear to be insufficient on its own. This raises lots of questions about data cardinality and whatnot, at which point I defer to @Aaron's point that we have insufficient information on the overall problem set for a full analysis.)
