Please help me with this query (SQL Server 2008)

ALTER PROCEDURE ReadNews
    @CategoryID INT,
    @Culture TINYINT = NULL,
    @StartDate DATETIME = NULL,
    @EndDate DATETIME = NULL,
    @Start BIGINT, -- for paging
    @Count BIGINT -- for paging
AS
BEGIN
    SET NOCOUNT ON;
    -- ItemType for news is 0
    ;WITH Paging AS
    (
        SELECT news.ID,
               news.Title,
               news.Description,
               news.Date,
               news.Url,
               news.Vote,
               news.ResourceTitle,
               news.UserID,
               ROW_NUMBER() OVER(ORDER BY news.[Rank] DESC) AS RowNumber,
               TotalCount = COUNT(*) OVER()
        FROM dbo.News news
        JOIN ItemCategory itemCat ON itemCat.ItemID = news.ID
        WHERE itemCat.ItemType = 0 -- news item
          AND itemCat.CategoryID = @CategoryID
          AND (
                (@StartDate IS NULL OR news.Date >= @StartDate) AND
                (@EndDate IS NULL OR news.Date <= @EndDate)
              )
          AND news.Culture = @Culture
          AND news.[Status] = 1
    )
    SELECT * FROM Paging WHERE RowNumber >= @Start AND RowNumber <= (@Start + @Count - 1)
    OPTION (OPTIMIZE FOR (@CategoryID UNKNOWN, @Culture UNKNOWN))
END
Here is the structure of News and ItemCategory tables:
CREATE TABLE [dbo].[News](
[ID] [bigint] NOT NULL,
[Url] [varchar](300) NULL,
[Title] [nvarchar](300) NULL,
[Description] [nvarchar](3000) NULL,
[Date] [datetime] NULL,
[Rank] [smallint] NULL,
[Vote] [smallint] NULL,
[Culture] [tinyint] NULL,
[ResourceTitle] [nvarchar](200) NULL,
[Status] [tinyint] NULL
CONSTRAINT [PK_News] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [ItemCategory](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[ItemID] [bigint] NOT NULL,
[ItemType] [tinyint] NOT NULL,
[CategoryID] [int] NOT NULL,
CONSTRAINT [PK_ItemCategory] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
This query reads news of a specific category (sport, politics, ...).
The @Culture parameter specifies the language of the news, like 0 (English), 1 (French), etc.
ItemCategory table relates a news record to one or more categories.
The ItemType column in the ItemCategory table specifies what kind of item ItemID refers to. For now, we have only ItemType 0, indicating that ItemID refers to a record in the News table.
Currently, I have the following index on ItemCategory table:
CREATE NONCLUSTERED INDEX [IX_ItemCategory_ItemType_CategoryID__ItemID] ON [ItemCategory]
(
[ItemType] ASC,
[CategoryID] ASC
)
INCLUDE ( [ItemID])
and the following index for News table (suggested by query analyzer):
CREATE NONCLUSTERED INDEX [_dta_index_News_8_1734000549__K1_K7_K13_K15] ON [dbo].[News]
(
[ID] ASC,
[Date] ASC,
[Culture] ASC,
[Status] ASC
)
With these indexes, when I execute the query, it runs in less than a second for some parameters, while for other parameters (e.g. a different @Culture or @CategoryID) it may take up to 2 minutes! I have used OPTIMIZE FOR (@CategoryID UNKNOWN, @Culture UNKNOWN) to prevent parameter sniffing on the @CategoryID and @Culture parameters, but it does not seem to work for some parameter values.
There are currently around 2,870,000 records in News table and 4,740,000 in ItemCategory table.
Now I greatly appreciate any advice on how to optimize this query or its indexes.
Update:
Execution plan (in this image, ItemNetwork is what I referred to as ItemCategory; they are the same):

Have you had a look at some of the built-in SQL Server tools to help you with this?
E.g. from the Management Studio menu:
'Query'->'Display Estimated Execution Plan'
'Query'->'Include Actual Execution Plan'
'Tools'->'Database Engine Tuning Advisor'
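Alongside the graphical plans, session statistics give concrete timing and IO numbers per run, which helps compare the fast and slow parameter sets. A minimal sketch, assuming you run it from an SSMS query window (the parameter values below are placeholders):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the procedure with the parameter values that are slow for you
EXEC ReadNews @CategoryID = 1, @Culture = 0, @Start = 1, @Count = 20;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;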

Shouldn't the OPTION OPTIMIZE clause be part of the inner SQL, rather than of the SELECT on the CTE?

You should look at indexing the culture field in the news table, and the itemid and categoryid fields in the item category table. You may not need all these indexes - I would try them one at a time, then in combination until you find something that works. Your existing indexes do not seem to help your query very much.
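As a rough sketch (index names made up, and you would keep only what the plans actually use), the individual indexes being suggested could look like:
CREATE NONCLUSTERED INDEX IX_News_Culture ON dbo.News (Culture);
CREATE NONCLUSTERED INDEX IX_ItemCategory_ItemID_CategoryID ON ItemCategory (ItemID, CategoryID);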

We really need to see the query plan. One thing of note: you put the clustered index for News on News.ID, but it is not an identity field; it is the key that ItemCategory references, so inserts will not be strictly sequential and the News table will fragment over time, which is less than ideal.
I suspect the underlying problem is your paging is causing the table to scan.
Updated:
Those Sorts are costing you 68% of the query execution time in the plan, and that makes sense: at least one of them must be there to support the ranking function you are using, which orders by news.Rank DESC, and you have no index that can support that ordering natively.
Getting an index in to support it will take some experimentation; you can try a simple nonclustered index on news.Rank first, and SQL Server may choose to combine indexes and avoid the sort.
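A first attempt at that supporting index might be as simple as the following (a sketch; the name is made up, and you may later want Culture, Status and Date in the key or as included columns once you see the plan):
CREATE NONCLUSTERED INDEX IX_News_Rank ON dbo.News ([Rank] DESC);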

Try a nonclustered index on (ItemID, CategoryID) on the ItemCategory table, and a nonclustered index on (Rank, Culture) on the News table.
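In CREATE INDEX form that suggestion would be roughly (a sketch; index names made up):
CREATE NONCLUSTERED INDEX IX_ItemCategory_ItemID_CategoryID ON ItemCategory (ItemID, CategoryID);
CREATE NONCLUSTERED INDEX IX_News_Rank_Culture ON dbo.News ([Rank], Culture);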

I have finally come up with the following indexes, which are working great, and the stored procedure now executes in less than a second. I removed TotalCount = COUNT(*) OVER() from the query because I couldn't find any good index for it. Maybe I will write a separate stored procedure to calculate the total number of records. I may even decide to use a "more" button like Twitter and Facebook, without pagination buttons.
for news table:
CREATE NONCLUSTERED INDEX [IX_News_Rank_Culture_Status_Date] ON [dbo].[News]
(
[Rank] DESC,
[Culture] ASC,
[Status] ASC,
[Date] ASC
)
for ItemNetwork table:
CREATE NONCLUSTERED INDEX [IX_ItemNetwork_ItemID_NetworkID] ON ItemNetwork
(
[ItemID] ASC,
[NetworkID] ASC
)
I just don't know whether ItemNetwork needs a clustered index on ID column. I am never retrieving a record from this table using the ID column. Do you think it's better to have a clustered index on (ItemID, NetworkID) columns?
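If you do want to experiment with that layout, the change would look something like this (a sketch only; the constraint name is assumed, and dropping and recreating the clustered index rebuilds the table's storage, so try it on a copy first):
ALTER TABLE ItemNetwork DROP CONSTRAINT PK_ItemNetwork; -- name assumed; check sys.key_constraints for the real one
CREATE CLUSTERED INDEX CIX_ItemNetwork_ItemID_NetworkID ON ItemNetwork (ItemID, NetworkID);
ALTER TABLE ItemNetwork ADD CONSTRAINT PK_ItemNetwork PRIMARY KEY NONCLUSTERED (ID);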

Please try to change
FROM dbo.News news
JOIN ItemCategory itemCat ON itemCat.ItemID = news.ID
to
FROM dbo.News news
INNER HASH JOIN ItemCategory itemCat ON itemCat.ItemID = news.ID
or
FROM dbo.News news
INNER LOOP JOIN ItemCategory itemCat ON itemCat.ItemID = news.ID
I don't really know what is in your data, but the join between these tables may be the bottleneck.
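An alternative that leaves the FROM clause untouched is a query-level hint, which applies to every join in the statement, so treat it as a testing tool rather than a fix (a sketch based on the procedure's final SELECT):
SELECT * FROM Paging
WHERE RowNumber >= @Start AND RowNumber <= (@Start + @Count - 1)
OPTION (HASH JOIN, OPTIMIZE FOR (@CategoryID UNKNOWN, @Culture UNKNOWN));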

Related

How to write T-SQL Recursive Query to return hierarchical data using common self-joining table to store tree data

Every time I try to revisit recursive queries I feel like I am starting over. I want to query data from a table that stores hierarchical data using a common method: a table that self-joins.
First of all, there is this table that stores "Groups", which I think of as "folders" using a Windows Explorer analogy. The ID is the PK and there is an associated group Name.
CREATE TABLE [dbo].[BPAGroup](
[id] [varchar](10) NOT NULL,
[name] [nvarchar](255) NOT NULL,
CONSTRAINT [PK_BPAGroup] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
Second and lastly, there is a relationship table that associates a group with its parent group. "GroupID" identifies "MemberID"'s group. For example, if you join BPAGroupGroup.MemberID to [BPAGroup].ID, [BPAGroup].Name will contain the name of the group. If you join BPAGroupGroup.GroupID to [BPAGroup].ID, [BPAGroup].Name will contain the name of the PARENT group.
CREATE TABLE [dbo].[BPAGroupGroup](
[memberid] [varchar](10) NOT NULL,
[groupid] [varchar](10) NOT NULL,
CONSTRAINT [PK_BPAGroupGroup] PRIMARY KEY CLUSTERED
(
[memberid] ASC,
[groupid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
-------------------------
ALTER TABLE [dbo].[BPAGroupGroup] WITH CHECK ADD CONSTRAINT [FK_BPAGroupGroup_BPAGroup_groupid] FOREIGN KEY([groupid])
REFERENCES [dbo].[BPAGroup] ([id])
GO
ALTER TABLE [dbo].[BPAGroupGroup] CHECK CONSTRAINT [FK_BPAGroupGroup_BPAGroup_groupid]
GO
ALTER TABLE [dbo].[BPAGroupGroup] WITH CHECK ADD CONSTRAINT [FK_BPAGroupGroup_BPAGroup_memberid] FOREIGN KEY([memberid])
REFERENCES [dbo].[BPAGroup] ([id])
GO
ALTER TABLE [dbo].[BPAGroupGroup] CHECK CONSTRAINT [FK_BPAGroupGroup_BPAGroup_memberid]
GO
Here's some sample data in the form
Level1
Level2
Level3
and the SQL for the data
INSERT [dbo].[BPAGroup] ([id], [name]) VALUES (N'A', N'Level1')
INSERT [dbo].[BPAGroup] ([id], [name]) VALUES (N'B', N'Level2')
INSERT [dbo].[BPAGroup] ([id], [name]) VALUES (N'C', N'Level3')
INSERT [dbo].[BPAGroupGroup] ([memberid], [groupid]) VALUES (N'B', N'A')
INSERT [dbo].[BPAGroupGroup] ([memberid], [groupid]) VALUES (N'C', N'B')
How can I write a recursive T-SQL query that returns all of the group names, the recursion "level number", and the IDs for all of the groups? Of course, the root of the tree would have a NULL ParentID and ParentName.
For example, these fields would be in the result set.
Level, GroupID, GroupName, ParentId, ParentName
I realize that there are multiple ways to store this type of data. I don't have flexibility to change the db design.
Ideally, the result should show all group names, even the root node, that doesn't have a parent.
Based on the latest data, this appears to get you the result you want:
WITH rCTE AS(
SELECT 1 AS Level,
id AS GroupID,
[name] AS GroupName,
CONVERT(varchar(10),NULL) AS ParentID, -- matches the sample id type so the anchor and recursive parts agree; this'll be uniqueidentifier in your real version
CONVERT(nvarchar(255),NULL) AS ParentName
FROM BPAGroup G
WHERE NOT EXISTS (SELECT 1
FROM BPAGroupGroup e
WHERE e.memberid = G.id)
UNION ALL
SELECT r.Level + 1,
G.id AS GroupID,
G.[name] AS GroupName,
r.GroupID AS ParentID,
r.[GroupName] AS ParentName
FROM BPAGroup G
JOIN BPAGroupGroup GG ON G.id = GG.memberid
JOIN rCTE r ON GG.groupid = r.GroupID)
SELECT *
FROM rCTE;
db<>fiddle
It's important you understand how this works though. As you said in your post, you seem to need to revisit these each time. There's nothing wrong with needing to check the syntax for something (there are some things I fail miserably at remembering sometimes, especially the new OPENJSON stuff), but do you understand how this works? If not, which bit don't you understand?

What indexes should I create to speed up my queries on this table in SQL Server?

I have a table with the following definition:
CREATE TABLE [dbo].[Transactions]
(
[ID] [varchar](18) NOT NULL,
[TIME_STAMP] [datetime] NOT NULL,
[AMT] [decimal](18, 4) NOT NULL,
[CID] [varchar](90) NOT NULL,
[DEPARTMENT] [varchar](4) NULL,
[SOURCE] [varchar](14) NULL,
PRIMARY KEY NONCLUSTERED
(
[ID] ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The table has 75 million rows in it. Somehow, it takes up 20 GB of disk space!
The following 2 queries...
SELECT
SUM(AMT)
FROM
Transactions
WHERE
TIME_STAMP >= '2017-11-11 00:00:00' AND
TIME_STAMP < '2017-11-12 00:00:00' AND
DEPARTMENT = 'Shoes' AND
SOURCE = 'Website'
SELECT
COUNT(DISTINCT(CID))
FROM
Transactions
WHERE
TIME_STAMP >= '2017-11-11 00:00:00' AND
TIME_STAMP < '2017-11-12 00:00:00' AND
DEPARTMENT = 'Accessories' AND
SOURCE = 'Mobile'
...each take about 2 minutes to run!
The DEPARTMENT and SOURCE fields are of low cardinality; they contain only a few distinct values.
Please advise on what I need to do, which indexes I need to create with which settings to optimize performance of these queries.
Thank you!
The best way to solve this specific query would be a composite index (one index with multiple columns) in this order:
Department
Source
Timestamp
Try to put the most selective column first, so if SOURCE has more distinct values than DEPARTMENT, put it first. The date goes last because it is a range predicate, and key columns after a range predicate cannot be used for seeking.
CREATE INDEX IX_Transactions ON Transactions(DEPARTMENT, SOURCE, TIME_STAMP) INCLUDE(AMT, CID)
I would create an index using Timestamp, Department and Source. I would also add AMT and CID as included columns. This means both your queries could be satisfied by reading the index and not having to hit the parent table at all.
CREATE INDEX IX_Transactions ON Transactions(TIME_STAMP,DEPARTMENT,SOURCE) INCLUDE(AMT,CID)
One additional option to consider is to look at the execution plan and see whether it recommends an index. I do this a lot when considering indexes, because I have seen plan-recommended indexes outperform indexes that I thought were good but that were not intuitive.
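A related option is to query the server's accumulated missing-index suggestions directly instead of reading them off the graphical plan. A sketch (the DMVs reset on restart, and their suggestions deserve the same scrutiny as any other recommendation):
SELECT d.[statement] AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;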

Clustered index on varchar on a small table

Hi guys,
I inherited a database with the following table, which has only 200 rows:
CREATE TABLE [MyTable](
[Id] [uniqueidentifier] NOT NULL,
[Name] [varchar](255) NULL,
[Value] [varchar](8000) NULL,
[EffectiveStartDate] [datetime] NULL,
[EffectiveEndDate] [datetime] NULL,
[Description] [varchar](2000) NOT NULL DEFAULT (''),
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
As you can see there is a Clustered PK on a UniqueIdentifier column. I was doing some performance checks and the most expensive query so far (CPU and IO) is the following:
SELECT @Result = Value
FROM MyTable
WHERE @EffectiveDate BETWEEN EffectiveStartDate AND EffectiveEndDate
AND Name = @VariableName
The query above is encapsulated in a UDF. Usually the UDF is not called in a select list or WHERE clause; instead, its return value is assigned to a variable.
The execution plan shows a Clustered Index Scan
Our system is based on a large number of aggregations and math processing in real time. Every time our web application refreshes the main page, it calls a bunch of stored procedures and UDFs, and the query above runs around 500 times per refresh per user.
My question is: should I change the PK to nonclustered and create a clustered index on (Name, EffectiveStartDate, EffectiveEndDate) in such a small table?
No, you should not. You can just add another index, which will be a covering index:
CREATE INDEX [IDX_Covering] ON dbo.MyTable(Name, EffectiveStartDate, EffectiveEndDate)
INCLUDE(Value)
If @VariableName and @EffectiveDate are variables with the correct types, you should now see an index seek.
I am not sure this will help, but you need to try: an index scan of 200 rows is nothing by itself, but calling it 500 times may be a problem. By the way, if those 200 rows fit in one page, I suspect this will not help; the problem may be somewhere else, like opening a connection 500 times or something like that...
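If the per-call overhead of the scalar UDF itself turns out to matter, another option is rewriting it as an inline table-valued function, which the optimizer can expand into the calling query. This is only a sketch under the assumption that the original UDF simply wraps the SELECT shown above; the function and parameter names are made up:
CREATE FUNCTION dbo.fn_GetConfigValue (@VariableName varchar(255), @EffectiveDate datetime)
RETURNS TABLE
AS
RETURN
(
    -- same predicate as the original query, so the covering index above can be seeked
    SELECT Value
    FROM dbo.MyTable
    WHERE @EffectiveDate BETWEEN EffectiveStartDate AND EffectiveEndDate
      AND Name = @VariableName
);
It would then be called as SELECT @Result = Value FROM dbo.fn_GetConfigValue(@VariableName, @EffectiveDate), or joined with CROSS APPLY in set-based code.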

How to improve large table paging speed in SQL Server 2008

I have a table called Users with 10 million records in it. This is the table structure:
CREATE TABLE [dbo].[Users](
[UsersID] [int] IDENTITY(100000,1) NOT NULL,
[LoginUsersName] [nvarchar](50) NOT NULL,
[LoginUsersPwd] [nvarchar](50) NOT NULL,
[Email] [nvarchar](80) NOT NULL,
[IsEnable] [int] NOT NULL,
[CreateTime] [datetime] NOT NULL,
[LastLoginTime] [datetime] NOT NULL,
[LastLoginIp] [nvarchar](50) NOT NULL,
[UpdateTime] [datetime] NOT NULL,
CONSTRAINT [PK_Users] PRIMARY KEY CLUSTERED
(
[UsersID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
I have a nonclustered index on the UpdateTime column.
The paging sql:
;WITH UserCTE AS (
SELECT * FROM
(SELECT
ROW_NUMBER() OVER (ORDER BY UpdateTime DESC) AS row,UsersID as rec_id -- select primary key only
FROM
dbo.Users WITH (NOLOCK)
) A WHERE row BETWEEN 9700000 AND 9700020
)
SELECT
*
FROM
dbo.Users WITH (NOLOCK) WHERE UsersID IN (SELECT UserCTE.rec_id FROM UserCTE)
The query above:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 3 ms.
(21 row(s) affected)
SQL Server Execution Times:
CPU time = 2574 ms, elapsed time = 3549 ms.
Any suggestions on how to improve the paging speed would be appreciated. Thanks!
That looks about as good as it is going to get without changing the way it works or doing some sort of pre-calculation.
The index used to locate the UserIDs on the page is as narrow as it can be (the leaf pages will contain just the UpdateTime and the clustered index key, UsersID). You could make the index slightly narrower by changing to datetime2, but this won't make a significant difference. Also, you could check that this index doesn't have excessive fragmentation.
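Checking fragmentation can be done with the physical-stats DMF, roughly like this (a sketch; 'LIMITED' mode keeps the scan cheap):
SELECT i.name, ps.avg_fragmentation_in_percent, ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Users'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i ON i.object_id = ps.object_id AND i.index_id = ps.index_id;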
If you had an indexed sequential integer column of UpdateTimeOrder then you could just do
SELECT *
FROM dbo.Users
WHERE UpdateTimeOrder BETWEEN 9700000 AND 9700020
But maintaining such a column along with concurrent INSERTS/UPDATES/DELETES will be difficult. One easier but less effective precalculation would be to create an indexed view.
CREATE VIEW dbo.UserCount
WITH SCHEMABINDING
AS
SELECT COUNT_BIG(*) AS Count
FROM [dbo].[Users]
GO
CREATE UNIQUE CLUSTERED INDEX IX ON dbo.UserCount(Count)
Then retrieve the pre-calculated count and, if looking for rows more than halfway through the index, call a different query with ROW_NUMBER() OVER (ORDER BY UpdateTime ASC), subtracting the original row numbers from the count of course.
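That reversed query would look roughly like this (a sketch using the page boundaries from your example; @TotalRows comes from the indexed view, and NOEXPAND makes non-Enterprise editions use the view's index):
DECLARE @TotalRows bigint;
SELECT @TotalRows = [Count] FROM dbo.UserCount WITH (NOEXPAND);
;WITH UserCTE AS (
    SELECT ROW_NUMBER() OVER (ORDER BY UpdateTime ASC) AS row, UsersID AS rec_id
    FROM dbo.Users
)
SELECT *
FROM dbo.Users
WHERE UsersID IN (SELECT rec_id FROM UserCTE
                  WHERE row BETWEEN @TotalRows - 9700020 + 1 AND @TotalRows - 9700000 + 1)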
But why do you actually need this anyway? Do you actually get people visiting page 485,000?

How to speed up Xpath performance in SQL Server when searching for an element with specific text

I am supposed to remove whole rows and part of XML-documents from a table with an XML column based on a specific value in the XML column. However the table contains millions of rows and gets locked when I perform the operation. Currently it will take almost a week to clean it up, and the system is too critical to be taken offline for so long.
Are there any ways to optimize the xpath expressions in this script:
declare @slutdato datetime = '2012-03-01 00:00:00.000'
declare @startdato datetime = '2000-02-01 00:00:00.000'
declare @lev varchar(20) = 'suppliername'
declare @todelete varchar(10) = '~~~~~~~~~~'
CREATE TABLE #ids (selId int NOT NULL PRIMARY KEY)
INSERT into #ids
select id from dbo.proevesvar
WHERE leverandoer = @lev
and proevedato <= @slutdato
and proevedato >= @startdato
begin transaction /* delete whole rows */
delete from dbo.proevesvar
where id in (select selId from #ids)
and ProeveSvarXml.exist('/LaboratoryReport/LaboratoryResults/Result[Value=sql:variable(''@todelete'')]') = 1
and Proevesvarxml.exist('/LaboratoryReport/LaboratoryResults/Result[Value!=sql:variable(''@todelete'')]') = 0
commit
go
begin transaction /* delete single results */
UPDATE dbo.proevesvar SET ProeveSvarXml.modify('delete /LaboratoryReport/LaboratoryResults/Result[Value=sql:variable(''@todelete'')]')
where id in (select selId from #ids)
commit
go
The table definitions is:
CREATE TABLE [dbo].[ProeveSvar](
[ID] [int] IDENTITY(1,1) NOT NULL,
[CPRnr] [nchar](10) NOT NULL,
[ProeveDato] [datetime] NOT NULL,
[ProeveSvarXml] [xml] NOT NULL,
[Leverandoer] [nvarchar](50) NOT NULL,
[Proevenr] [nvarchar](50) NOT NULL,
[Lokationsnr] [nchar](13) NOT NULL,
[Modtaget] [datetime] NOT NULL,
CONSTRAINT [PK_ProeveSvar] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
CONSTRAINT [IX_ProeveSvar_1] UNIQUE NONCLUSTERED
(
[CPRnr] ASC,
[Lokationsnr] ASC,
[Proevenr] ASC,
[ProeveDato] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The first insert statement is very fast. I believe I can handle the locking by committing 50 rows at a time, so other requests can be handled in between my transactions.
The total number of rows for this supplier is about 5.5 million and the total rowcount in the table is around 13 million.
I've not really used XPath within SQL Server before, but something which stands out is that you're doing lots of reads and writes in the same command (in the second statement). If possible, change your queries to:
CREATE TABLE #ids (selId int NOT NULL PRIMARY KEY)
INSERT into #ids
select id from dbo.proevesvar
WHERE leverandoer = @lev
and proevedato <= @slutdato
and proevedato >= @startdato
and ProeveSvarXml.exist('/LaboratoryReport/LaboratoryResults/Result[Value=sql:variable(''@todelete'')]') = 1
and Proevesvarxml.exist('/LaboratoryReport/LaboratoryResults/Result[Value!=sql:variable(''@todelete'')]') = 0
begin transaction /* delete whole rows */
delete from dbo.proevesvar
where id in (select selId from #ids)
This means that the first query will only create the new temporary table, and not write anything back, which will take slightly longer than your original, but the key thing is that your second query will ONLY be deleting records based on what's in your temporary table.
What you'll probably find is that because it's deleting records, it's constantly rebuilding indexes, which also slows down the reads.
I'd also delete/disable any indices/constraints that don't actually help your query run.
Also, you're creating your clustered primary key on the ID, which isn't always the best thing to do, especially if you're doing lots of date scans.
Can you also view the estimated execution plan for the top query? It would be interesting to see the order in which it checks the conditions. If it's doing the date first, then that's fine, but if it's doing the XPath before it checks the date, you might have to separate it into 3 queries, or add a new clustered index on (proevedato, id). This should force the query to only run the XPath for records which actually match the date.
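Following your idea of committing 50 rows at a time so other sessions can get in between, the row-level delete could also be run in small batches, along these lines (a sketch; the batch size is arbitrary, and it assumes @todelete and the #ids temp table exist in the same batch):
WHILE 1 = 1
BEGIN
    -- delete up to 50 matching rows, then commit implicitly (autocommit) before the next batch
    DELETE TOP (50) FROM dbo.proevesvar
    WHERE id IN (SELECT selId FROM #ids)
      AND ProeveSvarXml.exist('/LaboratoryReport/LaboratoryResults/Result[Value=sql:variable(''@todelete'')]') = 1
      AND ProeveSvarXml.exist('/LaboratoryReport/LaboratoryResults/Result[Value!=sql:variable(''@todelete'')]') = 0;
    IF @@ROWCOUNT = 0 BREAK;
END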
Hope this helps.
