Nonclustered index functionality relative to clustered index seek - sql-server

The question is quite simple, but we've had so many issues with index/statistics updates not always resulting in the proper new execution plans in low-load environments that I need to check this with you guys to be sure.
Say that you have the following tables:
/*
TABLES:
TABLE_A (PK_ID INT, [random columns], B_ID INT (INDEXED, and references TABLE_B.PK_ID))
TABLE_B (PK_ID INT, [random columns], C_ID INT (INDEXED, and references TABLE_C.PK_ID))
TABLE_C (PK_ID INT, [random columns])
*/
SELECT *
FROM TABLE_A A
JOIN TABLE_B B ON B.PK_ID = A.B_ID
JOIN TABLE_C C ON C.PK_ID = B.C_ID
WHERE A.randcolumn1 = 'asd' AND B.randcolumn2 <> 5
Now, since B is joined to A with its clustered PK column, shouldn't that mean that the index on B.C_ID will not be used as the information is already returned through the B.PK_ID clustered index? In fact, is it not true that the index on B.C_ID will never be used unless the query specifically targets the ID values on that index?
This may seem like a simple, even stupid question, but I want to make absolutely sure I'm getting this right. I'm thinking of adjusting our indexing, since we have a lot of unused indexes inherited from an old data model, and they're taking up quite a bit of space in a DB this size. Experience has shown that we cannot fully trust execution plans in any environment other than production, thanks to its extreme load compared to the testing environments, which makes it difficult to test this reliably.
Thanks!

The query optimizer is free to do as it pleases. It could execute the second join by scanning the C table, and for each row, looking up the matching row in B. The index you describe would help with that lookup.
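For intuition, here is the same query reordered to match that shape, with the access pattern sketched in comments (an illustration of one possible plan, not a guarantee):
-- 1. scan TABLE_C (or seek, if some predicate allows it)
-- 2. for each C row, seek the nonclustered index on B.C_ID
-- 3. key-lookup into B's clustered index to test randcolumn2 <> 5
-- 4. seek the index on A.B_ID, then filter A.randcolumn1 = 'asd'
SELECT *
FROM TABLE_C C
JOIN TABLE_B B ON B.C_ID = C.PK_ID
JOIN TABLE_A A ON A.B_ID = B.PK_ID
WHERE A.randcolumn1 = 'asd' AND B.randcolumn2 <> 5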
SQL Server keeps usage statistics that tell you whether an index is actually used:
select db_name(ius.database_id) as Db
, object_name(ius.object_id) as [Table]
, max(ius.last_user_lookup) as LastLookup
, max(ius.last_user_scan) as LastScan
, max(ius.last_user_seek) as LastSeek
, max(ius.last_user_update) as LastUpdate
from sys.dm_db_index_usage_stats as ius
where ius.[database_id] = db_id()
and ius.[object_id] = object_id('YourTableName')
group by
ius.database_id
, ius.object_id
If the index hasn't been used for more than 2 months, it is usually safe to drop it.
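Note that sys.dm_db_index_usage_stats is cleared when the instance restarts, so first confirm the server has been up long enough for a two-month window to be meaningful:
SELECT sqlserver_start_time
FROM sys.dm_os_sys_info;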

Related

Can joins effectively ignore field indexes if they are a constant?

It is much easier to explain this with an example.
Table A has PK on (store,line).
Table B has PK on (id,store,line).
[id] is int, [store] is nvarchar(100) and [line] is int in both cases.
If I run:
select *
from A inner join B
on A.store=B.store and A.line=B.line
where B.id=0
will the engine be able to make a fast (I'm thinking merge) join? Or would it help to add a dummy column id valued 0 in A?
Your statement will work, but the optimizer will be more effective if you write it like this:
select *
from A
inner join B on A.store=B.store and A.line=B.line and B.id=0
Here it is able to exclude rows where B.id does not equal zero before it does the merge. Depending on table size, topology, etc., this can be quite significant.
For example, consider the case where table B has 50 million rows spread across 5 nodes and table A sits on 1 node: with your code all records would have to be moved to table A's node, while with the code above only the records with id = 0 would need to be moved.
This can be very non-intuitive when A is a small table (small tables are often on only one node).

Too many parameter values slowing down query

I have a query that runs fairly fast under normal circumstances, but it is running very slowly (at least 20 minutes in SSMS) because of how many values are in the filter.
Here's the generic version of it; you can see that one part filters on over 8,000 values, which makes it slow.
SELECT DISTINCT
column
FROM
table_a a
JOIN
table_b b ON (a.KEY = b.KEY)
WHERE
a.date BETWEEN @Start AND @End
AND b.ID IN (... over 8,000 values)
AND b.place IN ( ... 20 values)
ORDER BY
a.column ASC
It's to the point where it's too slow to use in the production application.
Does anyone know how to fix this, or optimize the query?
To make a query fast, you need indexes.
You need a separate index for the following columns: a.KEY, b.KEY, a.date, b.ID, b.place.
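As a sketch, those single-column indexes would look like this (table and column names are the generic placeholders from the question; in practice one composite index per table may serve this query better):
CREATE INDEX IX_a_key ON table_a ([KEY]);
CREATE INDEX IX_a_date ON table_a ([date]);
CREATE INDEX IX_b_key ON table_b ([KEY]);
CREATE INDEX IX_b_id ON table_b ([ID]);
CREATE INDEX IX_b_place ON table_b (place);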
As gotqn wrote, putting your 8,000 items into a temp table and inner joining it will make the query faster still, but without indexes on the other side of the join it will be slow even then.
What you need is to put the filtering values into a temporary table, then apply the filter with an INNER JOIN instead of WHERE IN. For example:
IF OBJECT_ID('tempdb..#FilterDataSource') IS NOT NULL
BEGIN;
DROP TABLE #FilterDataSource;
END;
CREATE TABLE #FilterDataSource
(
[ID] INT PRIMARY KEY
);
INSERT INTO #FilterDataSource ([ID])
SELECT ... ; -- you need to split your 8,000 values here

SELECT DISTINCT column
FROM table_a a
INNER JOIN table_b b
ON (a.KEY = b.KEY)
INNER JOIN #FilterDataSource FS
ON b.id = FS.ID
WHERE a.date BETWEEN @Start AND @End
AND b.place IN ( ... 20 values)
ORDER BY column ASC;
A few important notes:
- we use a temporary table (rather than a table variable) so that parallel execution plans remain possible
- if you have a fast splitting routine (for example, a CLR function), you can join the function itself (a sketch follows)
- it is not good to use IN with many values; SQL Server is not always able to build the execution plan, which may lead to timeouts/internal errors
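If you are on SQL Server 2016 or later, the built-in STRING_SPLIT can do the splitting; a minimal sketch, assuming a comma-separated parameter (the @idList name is illustrative):
DECLARE @idList NVARCHAR(MAX) = N'...'; -- your 8,000 values, comma-separated
INSERT INTO #FilterDataSource ([ID])
SELECT DISTINCT CAST(value AS INT)
FROM STRING_SPLIT(@idList, N',');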

How to optimize T-SQL UI queries

I have a UI form which shows the user different aggregate information (fact, plan, etc.; 6 different T-SQL queries run in parallel). Execution of the pure SQL queries takes up to 3 seconds.
I use stored procedures with parameters, but there is no problem there: calling the SPs takes exactly the same time.
Here I use an example of one table and one query; the other 5 queries and tables have the same structure. I use MS SQL Server 2012, and an upgrade to 2014 is possible if there is an optimization reason for it.
Now I am trying to find all possible ways to improve this, and it should be SQL-only.
Aggregate table structure:
create table dbo.plan_Total(
VersionId int not null,
WarehouseId int not null,
ChannelUnitId int not null,
ProductId int not null,
Month date not null,
Volume float not null,
constraint PK_Total primary key clustered
(VersionId asc, WarehouseId asc, ChannelUnitId asc, ProductId asc, Month asc)) on [PRIMARY]
SP query structure:
ALTER PROCEDURE dbo.plan_GetTotals
@versionId INT,
@geoIds ID_LIST READONLY, -- lists from UI filters
@productIds ID_LIST READONLY,
@channelUnitIds ID_LIST READONLY
AS
begin
SELECT Id INTO #geos
FROM @geoIds
SELECT Id INTO #products
FROM @productIds
SELECT Id INTO #channels
FROM @channelUnitIds
CREATE CLUSTERED INDEX IDX_Geos ON #geos(Id)
CREATE CLUSTERED INDEX IDX_Products ON #products(Id)
CREATE CLUSTERED INDEX IDX_ChannelUnits ON #channels(Id)
SELECT Month, SUM(Volume) AS Volume
FROM plan_Total t
JOIN #geos g ON t.WarehouseId = g.Id
JOIN #products p ON t.ProductId = p.Id
JOIN #channels cu ON t.ChannelUnitId = cu.Id
WHERE VersionId = @versionId
GROUP BY Month
ORDER BY Month -- no performance impact at all
END
Approximate execution time is 600-800 ms; the other queries take almost the same.
How can I dramatically decrease execution time? Is it possible?
What I've done already:
- Tried columnstore indexes (clustered is not an option because of the foreign key problem);
- Disabling the non-clustered columnstore index is not a solution, because some tables need their data updated online (users can change the information);
- Rebuilt all current indexes;
- I can't gather all the tables into one.
Here is the actual plan link: Actual execution plan (for this plan I put the real tables in the joins instead of the temp tables).
BR, thanks for any help!
Have you considered not joining channel, product, etc. at all?
At least for channels: if you do not have 10,000 of them, you can just load them "on demand" or "on application start" and cache them. This is a client-side dictionary lookup.
Also, for Month, SUM(Volume): consider precalculating this, making a materialized view. Calculating it on demand is not what reporting should do, and it goes against data warehousing best practices.
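In SQL Server the materialized view is an indexed view; a minimal sketch, assuming a coarser grain of version and month (view and index names are illustrative; indexed views require SCHEMABINDING and a COUNT_BIG(*), and impose further restrictions that may need adjusting for this schema):
CREATE VIEW dbo.plan_TotalByMonth
WITH SCHEMABINDING
AS
SELECT VersionId, [Month], SUM(Volume) AS Volume, COUNT_BIG(*) AS RowCnt
FROM dbo.plan_Total
GROUP BY VersionId, [Month];
GO
-- the unique clustered index is what materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_plan_TotalByMonth
ON dbo.plan_TotalByMonth (VersionId, [Month]);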
None of your solutions will change that; they do not address the real problem: too much processing in the query.
See if this approach works better:
- Create the TABLE type with a PRIMARY KEY
- Specify OPTION (RECOMPILE): forces the compiler to take the cardinality of the TABLE variables into account
- Specify OPTIMIZE FOR UNKNOWN: prevents parameter sniffing on @versionId
CREATE TYPE dbo.ID_LIST AS TABLE (
Id INT PRIMARY KEY
);
GO
CREATE PROCEDURE dbo.plan_GetTotals
@versionId INT,
@geoIds ID_LIST READONLY,
@productIds ID_LIST READONLY,
@channelUnitIds ID_LIST READONLY
AS
SELECT
Month,
SUM(Volume) AS Volume
FROM
plan_Total AS t
INNER JOIN @geoIds AS g ON g.Id=t.WarehouseId
INNER JOIN @productIds AS p ON p.Id=t.ProductId
INNER JOIN @channelUnitIds AS c ON c.Id=t.ChannelUnitId
WHERE
t.VersionId=@versionId
GROUP BY
Month
ORDER BY
Month
OPTION(RECOMPILE, OPTIMIZE FOR UNKNOWN);
GO
OK, here I'll just show what I found and how I increased the speed of my query.
List of changes:
- The best improvement is adding a clustered columnstore index. For that you need to drop the FKs, but you can use triggers instead, for example. This speeds the query up 3-4 times (a DDL sketch follows the final query below).
- As you can see, I use temp tables in the query joins. I changed one join (it doesn't matter which) to an IN predicate, like "and t.productid in (select id from #productids)", which doubled the raw query speed.
These two changes had the most impact on the query. Here is the final query:
select [month], sum(volume) as volume
from #geos g
left join dbo.plan_Total t on t.warehouseid = g.id
join #channels cu on t.channelunitid = cu.id
where versionid = @versionid
and t.productid in (select id from #productids)
group by [month]
order by [Month]
With these changes I decreased the query execution time from 0.8 to 0.2 seconds.
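A DDL sketch of the clustered columnstore change from the first item above (the FK constraint name is hypothetical; on SQL Server 2014 the table's primary key and foreign keys must be dropped before the index can be created):
ALTER TABLE dbo.plan_Total DROP CONSTRAINT FK_plan_Total_Version; -- hypothetical name; repeat for each FK
ALTER TABLE dbo.plan_Total DROP CONSTRAINT PK_Total;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_plan_Total ON dbo.plan_Total;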

Query optimization to avoid matching on a column with only a few distinct values

I have two tables in SQL Server.
System{
AUTO_ID -- identity auto increment primary key
SYSTEM_GUID -- index created, unique key
}
File{
AUTO_ID -- identity auto increment primary key
File_path
System_Guid -- foreign key refers system.System_guid, index is created
-- on this column
}
System table has 100,000 rows.
File table has 200,000 rows.
File table has only 10 distinct values for System_guid.
My Query is as follows:
Select * from
File left outer join System
on file.system_guid = system.system_guid
SQL Server is using a hash match join to produce the result, which takes a long time.
I want to optimize this query to make it faster. The fact that there are only 10 distinct system_guid values probably means the hash match wastes effort. How can I use this knowledge to speed up the query?
When an indexed column has very few distinct values, the index loses its purpose. If all you want is to extract the records from System whose system_guid is one of those in File, then you may be better off (in your case) with a query like:
select * from System
where system_guid in (select distinct system_guid from File)
Is the LEFT join really necessary? How does the query perform as an INNER join? Do you get a different plan?
I doubt the hash join is much of a problem with this amount of I/O.
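For comparison, the INNER JOIN variant would be:
Select * from
File inner join System
on file.system_guid = system.system_guid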
You could do a UNION like this... maybe coax a different plan out of it.
Select File.*, NULL AS AUTO_ID, NULL AS SYSTEM_GUID -- pad with NULLs so both branches return the same columns
from File
WHERE System_Guid NOT IN (SELECT system_guid from system)
union all
Select * from
File inner join System
on file.system_guid = system.system_guid

What's the difference between these T-SQL queries using OR?

I use Microsoft SQL Server 2008 (SP1, x64). I have two queries that do the same thing, or so I think, but they have completely different query plans and performance.
Query 1:
SELECT c_pk
FROM table_c
WHERE c_b_id IN (SELECT b_id FROM table_b WHERE b_z = 1)
OR c_a_id IN (SELECT a_id FROM table_a WHERE a_z = 1)
Query 2:
SELECT c_pk
FROM table_c
LEFT JOIN (SELECT b_id FROM table_b WHERE b_z = 1) AS b ON c_b_id = b_id
LEFT JOIN (SELECT a_id FROM table_a WHERE a_z = 1) AS a ON c_a_id = a_id
WHERE b_id IS NOT NULL
OR a_id IS NOT NULL
Query 1 is fast as I would expect, whereas query 2 is very slow. The query plans look quite different.
I would like query 2 to be as fast as query 1. I have software that uses query 2, and I cannot change that into query 1. I can change the database.
Some questions:
why are the query plans different?
can I "teach" SQL Server somehow that query 2 is equal to query 1?
All tables have (clustered) primary keys and proper indexes on all columns:
CREATE TABLE table_a (
a_pk int NOT NULL PRIMARY KEY,
a_id int NOT NULL UNIQUE,
a_z int
)
GO
CREATE INDEX IX_table_a_z ON table_a (a_z)
GO
CREATE TABLE table_b (
b_pk int NOT NULL PRIMARY KEY,
b_id int NOT NULL UNIQUE,
b_z int
)
GO
CREATE INDEX IX_table_b_z ON table_b (b_z)
GO
CREATE TABLE table_c (
c_pk int NOT NULL PRIMARY KEY,
c_a_id int,
c_b_id int
)
GO
CREATE INDEX IX_table_c_a_id ON table_c (c_a_id)
GO
CREATE INDEX IX_table_c_b_id ON table_c (c_b_id)
GO
The tables are not modified after the initial load. I'm the only one querying them. They contain millions of records (table_a: 5M, table_b: 4M, table_c: 12M), but using only 1% of the data gives similar results.
Edit: I tried adding FOREIGN KEYs for c_a_id and c_b_id, but that only made query 1 slower...
I hope someone can have a look at the query plans and explain the difference.
Joins are slower, let me say, by design. The first query uses a sub-query (cacheable) to filter records, so it produces less data (and fewer accesses to each table).
Did you read these:
http://www.sql-server-performance.com/2006/tuning-joins/
http://blogs.msdn.com/b/craigfr/archive/2006/12/04/semi-join-transformation.aspx
What I mean is that with IN the DB can do better optimizations, like removing duplicates or stopping at the first match (these are from school memories, so I'm sure it does much better now). So I guess the question isn't why the query plans are different, but how smart and how deep the optimizations can go.
You are comparing non-equivalent queries, and you are using LEFT JOIN in quite an unusual way.
Generally, if your intention was to select all entries in table_c that have linked records in either table_a or table_b, you should use EXISTS:
SELECT c_pk
FROM table_c
WHERE Exists(
SELECT 1
FROM table_b
WHERE b_z = 1 and c_b_id = b_id
) OR Exists(
SELECT 1
FROM table_a
WHERE a_z = 1 and c_a_id = a_id
)
Since you can't change the query, at least you can improve the query's environment.
Highlight your query, right-click on it in SSMS and select "Analyze Query in Database Engine Tuning Advisor."
Run the analysis to find out if you need any additional indexes or statistics built.
Heed SQL Server's advice.
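If you would rather stay in a query window, the missing-index DMVs give a similar, if rougher, signal than the Tuning Advisor; a sketch (treat the suggestions as hints, not gospel):
SELECT TOP (10)
mid.statement, mid.equality_columns, mid.inequality_columns,
mid.included_columns, migs.avg_user_impact, migs.user_seeks
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
ON migs.group_handle = mig.index_group_handle
ORDER BY migs.avg_user_impact DESC;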
