I have the following huge query:
SELECT TOP (100000)
B.UserId AS [UserId], B.ProfileId AS [ProfileId]
FROM
tBounce AS B
JOIN
tBounceType AS BT ON BT.Id = B.BounceTypeId
WHERE
BT.IsHard = 1 AND B.IsProcessedByAutoUnsubscriber = 0
GROUP BY
B.UserId, B.ProfileId
HAVING
count(*) > 2
The tBounce table contains billions of records. Which columns do I need to include in an index? Do I need to add the grouping fields as well?
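For illustration, one candidate shape is a filtered covering index (a sketch only; the index name is made up, and the final choice should be validated against the actual execution plan). The grouping columns go into the key so the GROUP BY can be satisfied from the index alone:

CREATE NONCLUSTERED INDEX IX_tBounce_HardBounces -- hypothetical name
ON dbo.tBounce (BounceTypeId, UserId, ProfileId)
WHERE IsProcessedByAutoUnsubscriber = 0; -- filter matches the query's predicate

Every column the query touches is then either in the key or in the filter, so no lookups into the base table are needed.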
I have a table in a report. I want to show the records in three tables on every page, with each table containing only 20 records, like this:
Page1:
Page2:
How can I achieve this type of pattern?
I can think of two ways to do this: as a MATRIX-style report where the column group supplies your columns, or as a normal table where you JOIN the data to produce three copies of name, ID, and any other fields you want. The MATRIX style is definitely more elegant and flexible, but the normal table might be easier for customers to modify if you're turning the report over to power users.
Both solutions start with tagging the data with PAGE, ROW, and COLUMN information. Note that I'm sorting on NAME, but you could sort on any field. Also note that this solution does not depend on your ID being sequential or in the order you want; it generates its own sequence numbers based on NAME or whatever else you choose.
In this demo I'm setting @RowsPerPage and @Cols as hard-coded constants, but they could easily be user-selected parameters if you use the MATRIX format.
DECLARE @RowsPerPage INT = 20
DECLARE @Cols INT = 3
;with
--Fake data generation BEGIN
cteSampleSize as (SELECT TOP 70 ROW_NUMBER () OVER (ORDER BY O.name) as ID
FROM sys.objects as O
), cteFakeData as (
SELECT N.ID, CONCAT(CHAR(65 + N.ID / 26), CHAR(65 + ((N.ID -1) % 26))
--, CHAR(65 + ((N.ID ) % 26))
) as Name
FROM cteSampleSize as N
),
--Fake data generation END, real processing begins below
cteNumbered as ( -- We can't count on ID being sequential and in the order we want!
SELECT D.*, ROW_NUMBER () OVER (ORDER BY D.Name) as SeqNum
--Replace ORDER BY D.Name with ORDER BY D.{Whatever field}
FROM cteFakeData as D --Replace cteFakeData with your real data source
), ctePaged as (
SELECT D.*
, 1+ FLOOR((D.SeqNum -1) / (@RowsPerPage*@Cols)) as PageNum
, 1+ ((D.SeqNum -1) % @RowsPerPage) as RowNum
, 1+ FLOOR(((D.SeqNum-1) % (@RowsPerPage*@Cols) ) / @RowsPerPage) as ColNum
FROM cteNumbered as D
)
--FINAL - use this for MATRIX reports (best)
SELECT * FROM ctePaged ORDER BY SeqNum
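As a worked example of those formulas: with @RowsPerPage = 20 and @Cols = 3 (60 records per page), SeqNum 45 gives PageNum = 1 + FLOOR(44 / 60) = 1, RowNum = 1 + (44 % 20) = 5, and ColNum = 1 + FLOOR((44 % 60) / 20) = 3, i.e. the 45th name lands on page 1, row 5, column 3.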
If you want to use the JOIN method to allow this in a normal table, replace the --FINAL query above with the one below. Note that it's pretty finicky, so test it with several degrees of fullness in the final report. I tested with 70 and 90 rows of sample data, giving me a partial first column on the last page in one case, and a full first column plus a partial second in the other.
--FINAL - use this for TABLE reports (simpler)
SELECT C1.PageNum , C1.RowNum , C1.ID as C1_ID, C1.Name as C1_Name
, C2.ID as C2_ID, C2.Name as C2_Name
, C3.ID as C3_ID, C3.Name as C3_Name
FROM ctePaged as C1 LEFT OUTER JOIN ctePaged as C2
ON C1.PageNum = C2.PageNum AND C1.RowNum = C2.RowNum
AND C1.ColNum = 1 AND (C2.ColNum = 2 OR C2.ColNum IS NULL)
LEFT OUTER JOIN ctePaged as C3 ON C1.PageNum = C3.PageNum
AND C1.RowNum = C3.RowNum AND (C3.ColNum = 3 OR C3.ColNum IS NULL)
WHERE C1.ColNum = 1
1) Add a dataset with the query below to get the page number and table number. You can change the numbers 20 and 60 as per your requirement. In my case I need 20 records per section and have 3 sections, so the total number of records per page is 60.
SELECT *, (ROW_NUMBER() OVER (PARTITION BY PageNumber ORDER BY Id) - 1) / 20 AS TableNumber
FROM (
    SELECT (ROW_NUMBER() OVER (ORDER BY Id) - 1) / 60 AS PageNumber, *
    FROM Numbers
) Src
2) Add a table with one column and select the prepared dataset.
3) Add PageNumber to the Group expression of the Details group.
4) Add a column parent group by right-clicking on the detail row, and group by TableNumber.
5) Delete the first two rows, choosing "Delete rows only".
6) Add one more table and select the ID and Name.
7) Drag this newly created table into the cell of the previously created table, and increase the size of the table.
Result:
Each table section contains 20 records, and the pattern continues on the following pages.
I was using the query below in SQL Server to update the table "TABLE" using the same table "TABLE". In SQL Server the query works fine, but in DB2 it fails. I am not sure whether I need to change anything in this query to make it work in DB2.
The error I am getting in DB2 is
ExampleExceptionFormatter: exception message was: DB2 SQL Error:
SQLCODE=-204, SQLSTATE=42704
This is my input data, and there you can see that ENO 679 repeats in both round 3 and round 4.
My expected output is given below. Here I am taking the ID and round value from round 4 and updating row number 3 with the ID value from row number 4.
My requirement is to find the ENOs which exist in both round 3 and round 4 and update the values accordingly.
UPDATE TGT
SET TGT.ROUND = SRC.ROUND,
TGT.ID = SRC.ID
FROM TABLE TGT INNER JOIN TABLE SRC
ON TGT.ROUND='3' and SRC.ROUND='4' and TGT.ENO = SRC.ENO
Could someone help here, please? I tried something like the following, but it's not working:
UPDATE TABLE
SET ID = (SELECT t.ID
          FROM TABLE t, TABLE t2
          WHERE t.ENO = t2.ENO AND t.ROUND='4' AND t2.ROUND='3'
         ),
    ROUND = (SELECT t.ROUND
             FROM TABLE t, TABLE t2
             WHERE t.ENO = t2.ENO AND t.ROUND='4' AND t2.ROUND='3')
WHERE ROUND='3'
You may try this. I think the issue is that you are not relating your inner subquery to the outer main table:
UPDATE TABLE TB
SET TB.ID = (SELECT t.ID
             FROM TABLE t, TABLE t2
             WHERE TB.ENO = t.ENO ---- added this
               AND t.ENO = t2.ENO AND t.ROUND='4' AND t2.ROUND='3'
            ),
    TB.ROUND = (SELECT t.ROUND
                FROM TABLE t, TABLE t2
                WHERE TB.ENO = t.ENO --- added this
                  AND t.ENO = t2.ENO AND t.ROUND='4' AND t2.ROUND='3')
WHERE TB.ROUND='3'
Try this:
UPDATE MY_SAMPLE TGT
SET (ID, ROUND) = (SELECT ID, ROUND FROM MY_SAMPLE WHERE ENO = TGT.ENO AND ROUND = 4)
WHERE ROUND = 3 AND EXISTS (SELECT 1 FROM MY_SAMPLE WHERE ENO = TGT.ENO AND ROUND = 4);
The difference from yours is that the correlated subquery has to be a row subselect: it has to guarantee zero or one rows (and will assign NULLs in the case of returning zero rows). The EXISTS subquery excludes the rows for which the correlated subquery would return no rows.
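If ENO could ever repeat within round 4, one way to guarantee a single row per correlation is to aggregate (a sketch; MAX is an arbitrary tie-breaker, adjust if you need a specific row):

UPDATE MY_SAMPLE TGT
SET (ID, ROUND) = (SELECT MAX(ID), MAX(ROUND) -- aggregates return exactly one row
                   FROM MY_SAMPLE
                   WHERE ENO = TGT.ENO AND ROUND = 4)
WHERE ROUND = 3
  AND EXISTS (SELECT 1 FROM MY_SAMPLE WHERE ENO = TGT.ENO AND ROUND = 4);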
I have a large fact table, roughly 500M rows per day, partitioned by region_date.
Every day I have to scan through 6 months of data, left outer join it with another, much smaller subset (1M rows) on an id and a date column, and calculate two aggregate values: sum(fact) where the id exists in the right table, and sum(fact) overall.
My SparkSQL looks like this:
SELECT
a.region_date,
SUM(case
when t4.id is null then 0
else a.duration_secs
end) matching_duration_secs,
SUM(a.duration_secs) total_duration_secs
FROM fact_table a LEFT OUTER JOIN id_lookup t4
ON a.id = t4.id
and a.region_date = t4.region_date
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
AND a.is_test = 0
AND a.desc = 'VIDEO'
GROUP BY a.region_date
What is the best way to optimize this and to distribute/partition the data? The query currently runs for more than 3 hours. I tried spark.sql.shuffle.partitions = 700.
If I roll up the daily data to the "id" level, it's about 5M rows per day. Should I roll up the data first and then do the join?
Because there are some filter conditions in your query, you can split it into two queries to decrease the amount of data first:
table1 = select * from fact_table a
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
AND a.is_test = 0
AND a.desc = 'VIDEO'
Then you can join this new table, which is much smaller than the original one, to the id_lookup table.
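A sketch of that two-step rewrite in Spark SQL (filtered_fact is a made-up view name, and the column list assumes only id, region_date, and duration_secs are needed downstream):

CREATE OR REPLACE TEMPORARY VIEW filtered_fact AS
SELECT a.id, a.region_date, a.duration_secs
FROM fact_table a
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
  AND a.is_test = 0
  AND a.desc = 'VIDEO';

-- The aggregation then joins the much smaller view instead of the raw fact table
SELECT f.region_date,
       SUM(CASE WHEN t4.id IS NULL THEN 0 ELSE f.duration_secs END) AS matching_duration_secs,
       SUM(f.duration_secs) AS total_duration_secs
FROM filtered_fact f
LEFT OUTER JOIN id_lookup t4
  ON f.id = t4.id AND f.region_date = t4.region_date
GROUP BY f.region_date;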
How can I rewrite this query for SQL Server?
DELETE FROM student
WHERE id=2
AND list_column_name='lastName'
OR list_column_name='firstName'
LIMIT 3;
There is no ORDER BY in your original query.
If you just want to delete an arbitrary three records matching your WHERE, you can use:
DELETE TOP(3) FROM student
WHERE id = 2
AND list_column_name = 'lastName'
OR list_column_name = 'firstName';
For greater control of TOP (as ordered by what?) you need to use a CTE or similar.
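For example, a sketch of that CTE approach with an explicit ORDER BY (the ordering column Studentname is an assumption; substitute whichever column defines which three rows go first):

;WITH ToDelete AS (
    SELECT TOP (3) *
    FROM student
    WHERE id = 2
      AND (list_column_name = 'lastName' OR list_column_name = 'firstName') -- parentheses make the OR explicit
    ORDER BY Studentname
)
DELETE FROM ToDelete;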
Using a CTE with TOP 3, which is the equivalent of LIMIT 3:
;WITH CTE
AS (SELECT TOP 3 Studentname
FROM student
WHERE id = 2
AND list_column_name = 'lastName'
OR list_column_name = 'firstName')
DELETE FROM CTE
The requirement is to load 50 records per page, with all 65 columns of the table "empl", using minimal IO. There are 280,000+ records in the table. There is only one clustered index, on the PK.
The pagination query is as follows:
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY e.[uon] DESC ) AS [row_number], e.*
FROM
empl e with (NOLOCK)
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid
WHERE
e.del = 0 AND e.pub = 1 AND e.sid = 2
AND e.md = 0
AND e.tid = 3
AND e.coid = 2
AND (e.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1))
SELECT
*
FROM
result_set
WHERE
[row_number] BETWEEN 0 AND 50
Following are the stats after running the above query from profiler:
CPU: 1500, Reads: 25576, Duration: 25704
Then I put the following index over the table empl:
CREATE NONCLUSTERED INDEX [ci_empl]
ON [dbo].[empl] ([del],[md],[pub],[tid],[coid],[sid],[ptid],[cid],[uon])
GO
After adding the index, CPU and reads are still high. I don't know whether something is wrong with the index or with the query.
Edit:
The following query is also producing high reads after adding the index, even though it selects only 3 columns and 1 count.
SELECT TOP (2147483647)
ame.aid ID, ame.name name,
COUNT(empl.pid) [Count], ps.uff uff FROM ame with (NOLOCK)
JOIN pam AS pa WITH (NOLOCK) ON pa.aid = ame.aid
JOIN empl WITH (NOLOCK) ON empl.pid = pa.pid
LEFT JOIN psam AS ps
ON ps.psid = 1001
AND ps.aid = ame.aid
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = empl.ptid
WHERE
empl.del = 0 AND empl.pub = 1 AND empl.sid = 2
AND empl.md = 0
AND (empl.tid = 3)
AND (empl.coid = 2)
AND (empl.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1)
AND ame.pub = 1 AND ame.del = 0
GROUP BY ame.aid, ame.name, ps.uff
ORDER BY ame.name ASC
Second Edit:
Now I have put the following index on the "uon" column:
CREATE NONCLUSTERED INDEX [ci_empl_uon]
ON [dbo].[empl] (uon)
GO
But CPU and reads are still high.
Third Edit:
The DTA suggested an index with all columns included for the first query, so I altered the suggested index and converted it to a filtered index on the basic filters to make it more effective.
I added the line below after INCLUDE while creating the index:
Where e.del = 0 AND e.pub = 1 AND e.sid = 2 AND e.md = 0 AND e.coid = 2
But the reads are still high on both the development and the production machine.
Fourth Edit:
I have now arrived at a solution that improves the performance, but still not up to the goal. The key is that it no longer goes for ALL THE DATA.
The query is as follows:
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY e.[uon] DESC ) AS [row_number], e.pID pID
FROM
empl e with (NOLOCK)
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid
WHERE
e.del = 0 AND e.pub = 1 AND e.sid = 2
AND e.md = 0
AND e.tid = 3
AND e.coid = 2
AND (e.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1))
SELECT
*
FROM
result_set join empl on result_set.pID = empl.pID
WHERE
[row_number] BETWEEN @start AND @end
And I recreated the index with altered key columns, an INCLUDE, and a filter:
CREATE NONCLUSTERED INDEX [ci_empl]
ON [dbo].[empl] ([ptid],[cid],[tid],[uon])
INCLUDE ([pID])
Where
[coID] = 2 and
[sID] = 2 and
[pub] = 1 and
[del] = 0 and
[md] = 0
GO
It improves the performance, but not up to the goal.
You are selecting the top 50 rows ordered by e.uon desc. An index that starts with uon will speed up the query:
create index IX_Empl_Uon on dbo.empl (uon)
The index will allow SQL Server to scan the top N rows of this index. N is the highest number in your pagination: for the 3rd page of 50 elements, N equals 150. SQL Server then does 50 key lookups to retrieve the full rows from the clustered index. As far as I know, this is a textbook example of where an index can make a big difference.
Not all query optimizers are smart enough to notice that row_number() over ... as rn combined with where rn between 1 and 50 means the top 50 rows. But SQL Server 2012 is. It uses the index both for the first page and for subsequent pages, like row_number() between 51 and 100.
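For comparison, a minimal sketch of the same pattern using the OFFSET/FETCH syntax introduced in SQL Server 2012 (the extra filters are omitted for clarity; this fetches page 3 of 50):

SELECT e.*
FROM dbo.empl AS e
ORDER BY e.uon DESC
OFFSET 100 ROWS FETCH NEXT 50 ROWS ONLY; -- rows 101-150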
You are trying to find the Xth through (X+N)th rows from a dataset, based on an order specified by the column uon.
I'm assuming here that uon is the mentioned primary key. If not, then without an index where uon is the first (if not the only) column, a table scan is inevitable.
Next wrinkle: you don't want that direct span of rows, you want that span of rows as filtered by a fairly extensive assortment of filters. The clustered index might pull the first 50 rows, but the WHERE may filter none, some, or all of those out. More rows will almost certainly have to be read in order to "fill your span".
More fun: you perform a left outer join on table empl_add (i.e. retaining the empl row even if no empl_add row is found), and then filter out all rows where empl_add.ptgid is not found in the subquery. You might as well make this an inner join; it may speed things up and it certainly will not make them slower. It is also a "filtering factor" that cannot be addressed with an index on table empl.
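A sketch of that change, with everything else in the query staying the same:

FROM empl e with (NOLOCK)
INNER JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid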
So, as I see it (i.e. I'm not testing it all out locally), SQL Server has to first assemble the data, filter out the invalid rows (which involves the table joins), order what remains, and return the span of rows you are interested in. I believe that, with or without the index on uon, SQL Server is identifying a need to read all the data and filter/sort it before it can pick out the desired range.
(Your new index would appear to be insufficient. The sixth column is sid, but sid is not referenced in the query, so it might only be able to help "so far". This raises lots of questions about data cardinality and whatnot, at which point I defer to @Aaron's point that we have insufficient information on the overall problem set for a full analysis.)