Optimize INNER JOIN statements through column indexes - sql-server

I'm trying to optimize my INNER JOIN statements on the following tables:
[articlegroups] contains ~700 rows
[products] contains ~150.000 rows
[products_category_mapping] contains 1 up to 3 rows for each product in [products] (so anywhere between 150.000 and 450.000 rows)
Here's my current query:
SELECT ga.label_sp,ga.label_en,ga.slug_sp,ga.slug_en,ga.pagetitle_sp,ga.pagetitle_en,ga.image_sp,ga.image_en,ga.description_sp,ga.description_en,ga.metadescription_sp,ga.metadescription_en
FROM articlegroups ga WITH (NOLOCK)
INNER JOIN products_category_mapping pcm on pcm.articlegroup_id=ga.id
INNER JOIN products gp on gp.id=pcm.artikelid
WHERE gp.id=<PRODUCTID> AND ga.catlevel=0
I read here http://www.sql-server-performance.com/2006/tuning-joins/ that a thing I can do is to add indexes to the columns on which the tables are joined.
Now I wonder what would result in the best performance:
Adding an index to products_category_mapping.artikelid and/or to products_category_mapping.articlegroup_id and what kind of index? Should I add indexes to both columns? Should I make one of them clustered and if so which one?
I have now added indexes to both columns and a clustered index on products_category_mapping.artikelid, since I thought that column would have the most distinct values and would need the most speed. I'm not sure whether what I'm doing now is correct, though.

ARTICLEGROUPS has just 700 rows. This is a small table, so you can also try not indexing it at all. The columns used here are GA.ID and GA.CATLEVEL. Maybe you can try the index below.
Create Index IX_id on ARTICLEGROUPS (id) Include (CATLEVEL); -- included columns take no ASC/DESC
PRODUCTS has 150000 rows, and column used is GP.ID. If LABEL and PAGE are not from PRODUCTS, try
Create Clustered Index IX_id on PRODUCTS (id);
else create
Create Index IX_id on PRODUCTS (id) Include (..); -- Please fill in the include part.
PRODUCTS_CATEGORY_MAPPING has more than 150000 rows. The columns used are PCM.ARTICLEGROUP_ID and PCM.ARTIKELID. Try the below and see.
Create Index IX_agid on PRODUCTS_CATEGORY_MAPPING (ARTICLEGROUP_ID) Include (..) -- If LABEL or PAGE is from this table, add those columns in the include part
Create Index IX_aid on PRODUCTS_CATEGORY_MAPPING (ARTIKELID) Include (..) -- If LABEL or PAGE is from this table, add those columns in the include part
Check the execution plan for the query after adding the indexes. I am posting this as an answer because it was very clumsy to write as a comment. I hope this helps.
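As a rough sketch of how the suggestions above could be tried out, the mapping table could get a single composite index instead of two single-column ones, and the effect could be measured with SET STATISTICS. The index name and the @ProductId value are placeholders, not something from the question, and whether this beats the existing clustered index on artikelid is something only the actual plan can show.

```sql
-- Hypothetical index name; the key order puts the filtered column first.
-- May be redundant if artikelid is already the clustered key.
CREATE NONCLUSTERED INDEX IX_pcm_artikelid_articlegroup
    ON products_category_mapping (artikelid, articlegroup_id);

-- Measure logical reads and CPU/elapsed time before and after the change.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DECLARE @ProductId int = 1;  -- placeholder value for <PRODUCTID>

SELECT ga.label_sp, ga.label_en, ga.slug_sp, ga.slug_en
FROM articlegroups ga
INNER JOIN products_category_mapping pcm ON pcm.articlegroup_id = ga.id
INNER JOIN products gp ON gp.id = pcm.artikelid
WHERE gp.id = @ProductId AND ga.catlevel = 0;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The Messages tab then lists logical reads per table, which is usually a more stable comparison than elapsed time alone.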

Related

How do indexes work behind the scenes

I'm a beginner. I know indexes are necessary for performance boosts, but I want to know how they actually work behind the scenes. I used to think that we should create indexes on the columns that appear in the WHERE clause (which I have since realized is wrong).
For example, SELECT * from MARKS where marks_obtained > 50
Consider that there's a clustered index on primary key of this table and I created a non-clustered index on marks_obtained column as its there in my where clause.
My understanding: the leaf nodes of the non-clustered index will contain pointers to the clustered index, and since the clustered index points to the actual rows, it will select entire rows (because of the asterisk in my query).
Scenario
I came across the following query (from the AdventureWorks DB, on a table with a non-clustered index). It worked fine and took less than a second to process 3200000 rows until a new column was added to the table:
Query
SELECT x.*
INTO #X
FROM dbo.bigProduct AS p
CROSS APPLY
(
SELECT TOP 1000 *
FROM dbo.bigTransactionHistory AS bth
WHERE
bth.ProductId = p.ProductId
ORDER BY
TransactionDate DESC
) AS x
WHERE
p.ProductId BETWEEN 1000 AND 7500
GO
NEW INSERTED COLUMN
ALTER TABLE dbo.bigTransactionHistory
ADD CustomerId INT NULL
After adding the above column it took 17 seconds, which is at least 17 times slower! The non-clustered index was now missing the CustomerId column. As soon as CustomerId was included in the index, the problem was gone.
Question CustomerId seemed to be the culprit until it was added to the index. BUT HOW???
The execution plan would answer this but I'll make a guess: The non-clustered index was no longer enough to satisfy the query after the additional column had been added. This can cause the index to not be used anymore. It also can cause one clustered index seek per row.
Learn to read execution plans. Turn on the "actual execution plan" feature routinely for each query that you test.
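To make the guess above concrete: because the inner query uses SELECT *, the index only satisfies it if it carries every column of the table. A minimal sketch of such a covering index follows. The index name is hypothetical, and Quantity and ActualCost are assumed column names for bigTransactionHistory (the question only shows ProductId, TransactionDate and CustomerId).

```sql
-- Hypothetical covering index: the key supports the seek on ProductId and
-- the ORDER BY TransactionDate DESC; INCLUDE carries the remaining columns
-- so that SELECT * needs no lookup into the clustered index.
CREATE NONCLUSTERED INDEX IX_bth_ProductId_TransactionDate
    ON dbo.bigTransactionHistory (ProductId, TransactionDate DESC)
    INCLUDE (Quantity, ActualCost, CustomerId);  -- assumed non-key columns
```

If a later ALTER TABLE adds another column, the same problem would come back until that column is included as well, which is one reason to list columns explicitly instead of using SELECT *.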

Index Scan with PROBE instead of an Index Seek

I have a query that looks like this:
--Updated To remove Distinct per Aaron Bertrand's suggestion in the comments
SELECT TOP 100 ord.OrderId
FROM Customer cust
JOIN CustomerOrder ord
ON ord.CustomerId = cust.CustomerId
WHERE cust.FirstName LIKE (@firstName + '%')
ORDER BY ord.CreatedWhen DESC
And I have an index like this:
CREATE NONCLUSTERED INDEX [IX_MyIndex] ON CustomerOrder
(
OrderId DESC,
CustomerId DESC,
CreatedWhen Desc
)
GO
When I run my query, the index gets used, but it is an index scan. And it gives this message:
PROBE([Bitmap1011],[MyDatabase].[order].[CustomerOrder].[OrderId] as [ord].[OrderId],N'[IN ROW]')
The output list consists of the OrderId and CreatedWhen.
What is this PROBE doing and why I don't get an Index Seek?
UPDATE:
The FirstName column on the Customer table does have an index that is being used in an IndexSeek.
CREATE NONCLUSTERED INDEX [IX_Customer_FirstName] ON Customer
(
[FirstName] ASC
)
GO
The reason an Index Scan is used is that the predicate on CustomerOrder (the join condition) is based on CustomerId, but CustomerId is only the SECOND key column of your non-clustered index [IX_MyIndex]. An index can only be seeked on its leading key column(s).
If you want an Index Seek to be performed, you would need to specify a new non-clustered index just on column CustomerId.
And that would essentially be a good practice - have two separate NC indices for OrderId and CustomerId. So when you join Customer and CustomerOrder tables, it will use the NC Index for CustomerId, and when you join Order and CustomerOrder tables, it will use the NC index for OrderId.
Refer to this article to read more about the difference between a multi-column non-clustered index (which you currently have) and multiple non-clustered indexes (which I proposed using).
[UPDATE]
But creating separate non-clustered indexes is not sufficient to get an Index Seek every time. That depends on the columns being selected in the query and on the amount of data being read; based on those, the query optimizer decides whether to use an Index Seek or an Index Scan. See this answer for more information.
[UPDATE Feb 8, 2021]
At a high level, the PROBE function in question essentially tries to verify whether the CustomerOrder.OrderId column value is present in the Customer table. This is achieved internally through the use of bitmaps and hash keys, and you can read about it in detail here.
Note that a PROBE is not specific to an Index Scan or an Index Seek. It is simply a function that is utilized for verifying matches (based on a certain hash keyed column(s)) between two tables in a join.
Simple reason: your FirstName column isn't in the index. It must scan every row to see if the row matches the pattern you want.
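As a sketch of the index the first answer describes, the following puts CustomerId first so the join can seek on it, and adds the other two columns the query touches. The index name is hypothetical, and adding CreatedWhen and OrderId is my assumption for covering the query, not something either answer spells out. As noted above, the optimizer may still prefer a scan depending on how many customers match the LIKE pattern.

```sql
-- Hypothetical index: CustomerId leads so that the join predicate
-- ord.CustomerId = cust.CustomerId can be resolved with a seek.
-- CreatedWhen is a key column for the ORDER BY; OrderId is only returned,
-- so it is carried as an included column.
CREATE NONCLUSTERED INDEX IX_CustomerOrder_CustomerId
    ON CustomerOrder (CustomerId, CreatedWhen DESC)
    INCLUDE (OrderId);
```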

Tuning Select statement to obtain faster results

I have benefited from this website for a long time now. This is my first question on the site. It is regarding performance tuning a reporting query. Here it goes.
1.
SELECT Count(b1.primkey)
from tableA b1 --WITH (NOLOCK)
join tableA b2 --WITH (NOLOCK)
on b1.email = b2.email
and DateDiff(day, b2.BookedDate , b1.BookedDate) > 1
tableA has around 7 million rows. Email is a varchar(100) field. Bookeddate is a datetime field. primkey is a primary key column that is an int.
My purpose in writing this query is to count the entries that have the same email id but came in more than a day later. This query takes about 45 minutes to run. I really want to reduce the time it takes to execute.
Since this is for reporting, I tried in vain to use the WITH (NOLOCK) option to improve the read time. I have a columnstore index on tableA and I know that it is being used by the SQL optimizer; I can see it in the execution plan. I am using SQL Server 2012.
Can someone tell me in such a case, what would be better? Using a nonclustered index on email or a nonclustered columnstore index on tableA?
Please help me.
Your query is relatively complex. You are essentially joining two tables that have 7 million records each on a column that is not unique.
How about the following query instead:
select Email
from TableA
group by Email
having MAX(BookedDate) > MIN(BookedDate) + 1
Also make sure you have an index with Email and BookedDate.
Hope this helps.
You have 3 options here:
1. Create a clustered index on the email field, at least for the larger table. But I suppose there are other queries running on these tables, and the clustered index is needed on other fields.
2. Move the emails to another table, and store email ids in TableA and TableB; a join on an int field would be much faster than on varchar fields.
3. Create indexes on the email fields with BookedDate as an included column (no need to include primkey; you can count on another field, or use count(*)). Code: create index idx_email on TableA (Email) include (BookedDate)
I think that third option is the one you should go with. There's not much work to be done, and there will be great performance gain. The only problem is that index on varchar field will take a lot of space and impact insert/update operations; but you said that this is a reporting db, so I think you can allow that.
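Putting the two answers together, a sketch might look like the following. The index name is hypothetical. Note that this counts distinct emails whose latest booking is more than one day after their earliest, which is not the same number as the row-pair count produced by the original self-join, so it only fits if that is what the report actually needs.

```sql
-- Hypothetical index: both columns in the key, so each email's bookings
-- are stored together and ordered by date.
CREATE NONCLUSTERED INDEX IX_tableA_Email_BookedDate
    ON tableA (email, BookedDate);

-- One pass over the index instead of a 7M x 7M self-join.
SELECT COUNT(*) AS EmailsWithLaterBooking
FROM (
    SELECT email
    FROM tableA
    GROUP BY email
    HAVING MAX(BookedDate) > DATEADD(day, 1, MIN(BookedDate))
) AS d;
```

DATEADD compares full date and time values, whereas DATEDIFF(day, ...) in the original counts calendar-day boundaries, so edge cases near midnight can differ between the two.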

Need some assistance understanding a SQL Server 2012 query plan

I have the following query:
Select TOP 5000
CdCl.SubId
From dbo.PanelCdCl CdCl WITH (NOLOCK)
Inner Join dbo.PanelHistory PH ON PH.SubId = CdCl.SubId
Where CdCl.PanelCdClStatusId IS NULL And PH.LastProcessNumber >= 1605
Order By CdCl.SubId
The query plan looks as follows:
Both the PanelCdCl and PanelHistory tables have a clustered index / primary key on SubId, and it's the only column in the index. There is exactly one row for each SubId in each table. Both tables have ~35M total rows in them.
I'm curious why the query plan is showing a clustered index scan on PanelHistory when the join is being done on the clustered index column.
It's not scanning PanelHistory's clustered index(SubId) to find a SubId, it's scanning on it to find all rows where LastProcessNumber >= 1605. This is the first logical step.
Then it likewise scans PanelCdCl to find all rows where PanelCdClStatusId is null. Since both tables have the same index key (SubId), both inputs are already sorted on the join column, so it can do a Merge-Join without an additional sort. (Merge-Join is almost always the most efficient choice if it doesn't have to re-sort the input rows.)
Then it doesn't have to do a Sort for the ORDER BY, because it's already in SubId order.
And finally, it does the TOP, which has to be after everything else (by the rules of SQL clause logical execution ordering).
So the only place it tests SubId values is in the Merge-Join, it never pushes it down to the scans. This would probably remain true if it did a Hash-Join instead. Only for a Nested-Loop Join would it have to push the SubId test down as a seek on a table, and that should only be the lower branch, not the upper one.
The merge join operator needs two sorted inputs. The clustered key is SubId in both tables which means that the scan in PanelHistory will give the rows in correct order. The clustered key is included in all non clustered key indexes so because of that you will have all rows in NCI IX_PanelCdCl_PanelCdClStatusId where PanelCdClStatusId is null ordered by SubId as well so that can also be used directly by the merge join.
What you see here is actually two scans: one of the clustered key in PanelHistory with a residual predicate on LastProcessNumber >= 1605, and one index range scan in IX_PanelCdCl_PanelCdClStatusId for as long as PanelCdClStatusId is null.
They will, however, not scan the entire table/index. The query is executed from left to right in the query plan, where the select asks for one row at a time until there are no more rows to be had. That means that the top operator will stop asking for new rows from the merge join once it has the required 5000 rows.
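One way to check the early-stop behaviour described above is to compare the logical reads of the query with the total page count of the clustered index. This is only a sketch: the exact numbers depend on the data, and the page-count query assumes the clustered index has index_id = 1, which is the usual case.

```sql
-- Logical reads actually performed by the query (see the Messages tab).
SET STATISTICS IO ON;

SELECT TOP 5000 CdCl.SubId
FROM dbo.PanelCdCl CdCl WITH (NOLOCK)
INNER JOIN dbo.PanelHistory PH ON PH.SubId = CdCl.SubId
WHERE CdCl.PanelCdClStatusId IS NULL AND PH.LastProcessNumber >= 1605
ORDER BY CdCl.SubId;

SET STATISTICS IO OFF;

-- Total pages in the clustered index of PanelHistory, for comparison.
SELECT SUM(used_page_count) AS TotalPages
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.PanelHistory')
  AND index_id = 1;
```

If the reads reported for PanelHistory are well below TotalPages, the scan stopped before reaching the end of the index.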

delete duplicate rows

Does anyone know how I can delete duplicate rows by rewriting the script below in a way that improves performance?
DELETE lt1 FROM #listingsTemp lt1, #listingsTemp lt2
WHERE lt1.code = lt2.code and lt1.classification_id > lt2.classification_id and (lt1.fap < lt2.fap or lt1.fap = lt2.fap)
Delete Duplicate Rows in a SQL Table :
delete table_a
where rowid not in
(select min(rowid) from table_a
group by column1, column2);
1 - Create an Identity Column (ID) for your table (t1)
2 - Do a Group by on your table with your conditions and get IDs of duplicated records.
3 - Now, simply Delete records from t1 where IDs IN duplicated IDs set.
Look into BINARY_CHECKSUM. You could possibly use it when creating your temp tables to determine more quickly whether the data is the same. For example, create a new field in both temp tables storing the binary_checksum value, then just delete where those fields are equal.
The odiseh answer seems to be valid (+1), but if for some reason you can't alter the structure of the table (because you don't have the code of the applications that use it, or something similar), you could write a job that runs every night and deletes the duplicates (using the Moayad Mardini code).
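Another common pattern, assuming this is SQL Server (the #listingsTemp temp table suggests so), is to number the rows within each duplicate group and delete everything after the first. The sketch below keeps, for each code, the row with the lowest classification_id. It is not an exact rewrite of the original DELETE, which also compares fap, so the ORDER BY would need adjusting to match the real rule for which row to keep.

```sql
-- Number rows per code; rn = 1 is the row to keep.
WITH ranked AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY code
               ORDER BY classification_id
           ) AS rn
    FROM #listingsTemp
)
-- Deleting through the CTE removes the underlying rows.
DELETE FROM ranked
WHERE rn > 1;
```

This avoids the self-join entirely and needs no identity column or schema change.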

Resources