Why is columnstore index not being used - sql-server

I have a non-clustered columnstore index on all columns a 40m record non-memory optimized table on SQL Server 2016 Enterprise Edition.
A query forcing the use of the columnstore index will perform significantly faster but the optimizer continues to choose to use the clustered index and other non-clustered indexes. I have lots of available RAM and am using appropriate queries against a dimensional model.
Why won't the optimizer choose the columnstoreindex? And how can I encourage its use (without using a hint)?
Here is a sample query not using columnstore:
SELECT
COUNT(*),
SUM(TradeTurnover),
SUM(TradeVolume)
FROM DWH.FactEquityTrade e
--with (INDEX(FactEquityTradeNonClusteredColumnStoreIndex))
JOIN DWH.DimDate d
ON e.TradeDateId = d.DateId
JOIN DWH.DimInstrument i
ON i.instrumentid = e.instrumentid
WHERE d.DateId >= 20160201
AND i.instrumentid = 2
It takes 7 seconds without hint and a fraction of a second with the hint.
The query plan without the hint is here.
The query plan with the hint is here.
The create statement for the columnstore index is:
CREATE NONCLUSTERED COLUMNSTORE INDEX [FactEquityTradeNonClusteredColumnStoreIndex] ON [DWH].[FactEquityTrade]
(
[EquityTradeID],
[InstrumentID],
[TradingSysTransNo],
[TradeDateID],
[TradeTimeID],
[TradeTimestamp],
[UTCTradeTimeStamp],
[PublishDateID],
[PublishTimeID],
[PublishedDateTime],
[UTCPublishedDateTime],
[DelayedTradeYN],
[EquityTradeJunkID],
[BrokerID],
[TraderID],
[CurrencyID],
[TradePrice],
[BidPrice],
[OfferPrice],
[TradeVolume],
[TradeTurnover],
[TradeModificationTypeID],
[InColumnStore],
[TradeFileID],
[BatchID],
[CancelBatchID]
)
WHERE ([InColumnStore]=(1))
WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0) ON [PRIMARY]
GO
Update. Plan using Count(EquityTradeID) instead of Count(*)
and with hint included

You're asking SQL Server to choose a complicated query plan over a simple one. Note that when using the hint, SQL Server has to concatenate the columnstore index with a rowstore non-clustered index (IX_FactEquiteTradeInColumnStore). When using just the rowstore index, it can do a seek (I assume TradeDateId is the leading column on that index). It does still have to do a key lookup, but it's simpler.
I can see two options to get this behavior without a hint:
First, remove InColumnStore from the columnstore index definition and cover the entire table. That's what you're asking from the columnstore - to cover everything.
If that's not possible, you can use a UNION ALL to explicitly split the data:
WITH workaround
AS (
SELECT TradeDateId
, instrumentid
, TradeTurnover
, TradeVolume
FROM DWH.FactEquityTrade
WHERE InColumnStore = 1
UNION ALL
SELECT TradeDateId
, instrumentid
, TradeTurnover
, TradeVolume
FROM DWH.FactEquityTrade
WHERE InColumnStore = 0 -- Assuming this is a non-nullable BIT
)
SELECT COUNT(*)
, SUM(TradeTurnover)
, SUM(TradeVolume)
FROM workaround e
JOIN DWH.DimDate d
ON e.TradeDateId = d.DateId
JOIN DWH.DimInstrument i
ON i.instrumentid = e.instrumentid
WHERE d.DateId >= 20160201
AND i.instrumentid = 2;

Your index is a filtered index (it has a WHERE predicate).
Optimizer would use such index only when the query's WHERE matches the index's WHERE. This is true for classic indexes and most likely true for columnstore indexes. There can be other limitations when optimizer would not use filtered index.
So, either add WHERE ([InColumnStore]=(1)) to your query, or remove it from the index definition.
You said in the comments: "the InColumnStore filter is for efficiency when loading data. For all tests so far the filter covers 100% of all rows". Does "all rows" here mean "all rows of the whole table" or just "all rows of the result set"? Anyway, most likely optimizer doesn't know that (even though it could have derived that from statistics), which means that the plan which uses such index has to explicitly do extra checks/lookups, which optimizer considers too expensive.
Here are few articles on this topic:
Why isn’t my filtered index being used? by
Rob Farley
Optimizer Limitations with Filtered Indexes by Paul White.
An Unexpected Side-Effect of Adding a Filtered Index by Paul White.
How filtered indexes could be a more powerful feature by Aaron Bertrand, see the section Optimizer Limitations.

Try this one:
Bridge your query
Select *
Into #DimDate
From DWH.DimDate
WHERE DateId >= 20160201
Select COUNT(1), SUM(TradeTurnover), SUM(TradeVolume)
From DWH.FactEquityTrade e
Inner Join DWH.DimInstrument i ON i.instrumentid = e.instrumentid
And i.instrumentid = 2
Left Join #DimDate d ON e.TradeDateId = d.DateId
How fast this query running ?

Related

Indexes turn SQL query too slow

I'm having a huge issue on a SQL query, after I added an index.
declare #DateFromCT date, #DateToCT date;
declare #DateFromCT2 date, #DateToCT2 date;
set dateformat dmy;
set #DateFromCT= '1/1/2015'; set #DateToCT= '31/3/2015';
set #DateFromCT2= '1/4/2015'; set #DateToCT2= '30/4/2015';
Select distinct CT.CodCliente,ct.codacesso FROM CT_Contabilidade CT
Inner join CD_PlanoContas PC ON CT.CodAcesso = PC.Cod
WHERE NOT exists (
SELECT 1 FROM ct_contabilidade CT2
WHERE CT2.CodAcesso = CT.CodAcesso
and CT2.Data between #DateFromCT2 and #DateToCT2
And ( CT2.CodEmpresa = 1) And CT2.codcliente = ct.codcliente )
and CT.Data between #DateFromCT and #DateToCT
AND PC.subgrupo = 'C'
And ( CT.CodEmpresa = 1 ) And ct.codCliente > 0
The CT_Contabilidade's PK is a Sequential (bigint identity), clustered index.
It has 1.5 million records.
Without other non-clustered indexes, it performs well, took less than 1 second. That's OK for me.
I create an index over the CodAcesso to match CD_PlanoContas key (cod);
The CD_PlanoContas PK (clustered index) is Cod.
It's still performing well. No notable difference...
So I create a index over the codCliente (since it also refers another table)
... And after this, the query is TOO slow; it is taking 7 or 8 MINUTES.
If I drop the CodAcesso index, it turn to be ok.
If I drop the CodCliente index, it is ok too.
If I let them both, but change the query , taking of the Inner Join with CD_Planocontas (and consequently , the filter "AND PC.subgrupo = 'C'") it is OK.
I can't imagine the indexes are causing the query to behave that way.
It's a HUGE difference, not just a "loss of performance". I try some other things, as take out each filter... not changed.
The execution plan suggests an index:
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[CT_Contabilidade] ([CodEmpresa],[Data],[CodCliente])
INCLUDE ([CodAcesso])
I created it, and the query works fine, even with the 2 other indexes (codCliente and codAcesso)
But I didn't like to create a specific index to this query (it's just one of many queries that uses these tables).
If runs well without no index, I think it should runs at least EQUAL with this 2 indexes.
What causes the performance to change so drastically? What do I need to change to speed things up?
try using an index optimizer hint to control which index is being used.
for example:
select *
from titles with (index (titleind))
where title = 'The Gourmet Microwave'
use the 'set statistics io on' command to see the number of pages being scanned with each query/index combo and use the 'rightclick/show execution plan' option to see how the query is being executed
It is not always good idea to follow the suggestion of execution plan.
I suggest you compare the execution plan before and after adding index and see the difference. Maybe that index cause SQL engine to choose a bad plan.
Also try update statistics on your table and index and see how if affects.

How do indexes work behind the scenes

Im a begginer. I know indexes are necessary for performance boosts, but i want to know how they actually work behind the scenes. Beforehand, I used to think that we should make indexes on those columns which are included in where clause (which I realized is wrong)
For example, SELECT * from MARKS where marks_obtained > 50
Consider that there's a clustered index on primary key of this table and I created a non-clustered index on marks_obtained column as its there in my where clause.
My perception: So the leaf nodes will be containing pointers to clustered index and as clustered index points to actual rows, it will select entire rows (due to asteric in my query)
Scenario
I came across following query (from AdventureWorks DB on which a non-clustered index was created) which works fine and took less than a second to execute 3200000 rows until a new column was inserted into it:
Query
SELECT x.*
INTO#X
FROM dbo.bigProduct AS p
CROSS APPLY
(
SELECT TOP 1000 *
FROM dbo.bigTransactionHistory AS bth
WHERE
bth.ProductId = p.bth.ProductId
ORDER BY
TransactionDate DESC
) AS x
WHERE
p.ProductId BETWEEN 1000 AND 7500
GO
NEW INSERTED COLUMN
ALTER TABLE dbo.bigTransactionHistory
ADD CustomerId INT NULL
After insertion of above column it took 17 seconds! means 17 times slower. A non-clusered index was now missing CustomerId column in the index. Just after including CustomerId, problem was gone.
Question CustomerId seemed to be the culprit until it was added to the index. BUT HOW???
The execution plan would answer this but I'll make a guess: The non-clustered index was no longer enough to satisfy the query after the additional column had been added. This can cause the index to not be used anymore. It also can cause one clustered index seek per row.
Learn to read execution plans. Turn on the "actual execution plan" feature routinely for each query that you test.

Performance in SQL Server when there is PK and UQ index on the same column

In SQL Server 2005 I have encountered a table with unique ID column (with unique index on it) and primary key clustered index on it too (so there are explicitly 2 indexes on this column). Is this a major performance factor on insert/update/delete on this table?
I'm trying to boost performance on a database created long ago and I wonder if removing such reduntant unique indexes could help. Does the database check/rebuild both of these indexes on every content modification? Or would the performance gain be too small to even bother with this?
Here is a sample index usage output:
INDEX UserSeeks UserScans UserLookups UserUpdates
--------------------------------------------------------
1_PK 45517046 42911 245353 0
1_UQ 45517046 42911 245353 0
1_Other 45517046 42911 245353 0
--------------------------------------------------------
2_PK 21538111 5685 231030 1121
2_UQ 21538111 5685 231030 1121
3_other 21538111 5685 231030 1121
And here is the query I used to get that data:
SELECT OBJECT_NAME(I.OBJECT_ID) AS ObjectName,
I.NAME AS IndexName,
S.user_seeks AS UserSeeks,
S.user_scans AS UserScans,
S.user_lookups AS UserLookups,
S.user_updates AS UserUpdates
FROM sys.indexes I
JOIN sys.dm_db_index_usage_stats S
ON (S.OBJECT_ID = I.OBJECT_ID)
WHERE(database_id = DB_ID())
And fixed join condition:
SELECT OBJECT_NAME(I.OBJECT_ID) AS ObjectName,
I.NAME AS IndexName,
S.user_seeks AS UserSeeks,
S.user_scans AS UserScans,
S.user_lookups AS UserLookups,
S.user_updates AS UserUpdates
FROM sys.indexes I
JOIN sys.dm_db_index_usage_stats S
ON (S.OBJECT_ID = I.OBJECT_ID)
AND(S.index_id = I.index_id)
WHERE(database_id = DB_ID())
That might not be a redundant index.
Having a much narrower non clustered index which just contains the IDs may well have been put there as a deliberate strategy to benefit certain queries and/or to make foreign key validation more efficient.
I suggested that approach in my answer here to (successfully) resolve a deadlock problem the OP was having.
Indexes generally trade update/insert/delete performance for better select performance. That's almost always a good thing.
Does the database check/rebuild both of these indexes on every content
modification?
It must, or the indexes would be out of date, and return wrong results.
Or would the performance gain be too small to even bother with this?
That depends on the activity on the table.
A unique constraint also has a function: it keeps a column unique. Since you can't have a unique constraint without an index, removing it does not seem like an option in your case.

Suitable indexes for sorting in ranking functions

I have a table which keeps parent-child-relations between items. Those can be changed over time, and it is necessary to keep a complete history so that I can query how the relations were at any time.
The table is something like this (I removed some columns and the primary key etc. to reduce noise):
CREATE TABLE [tblRelation](
[dtCreated] [datetime] NOT NULL,
[uidNode] [uniqueidentifier] NOT NULL,
[uidParentNode] [uniqueidentifier] NOT NULL
)
My query to get the relations at a specific time is like this (assume #dt is a datetime with the desired date):
SELECT *
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY r.uidNode ORDER BY r.dtCreated DESC) ix, r.*
FROM [tblRelation] r
WHERE (r.dtCreated < #dt)
) r
WHERE r.ix = 1
This query works well. However, the performance is not yet as good as I would like. When looking at the execution plan, it basically boils down to a clustered index scan (36% of cost) and a sort (63% of cost).
What indexes should I use to make this query faster? Or is there a better way altogether to perform this query on this table?
The ideal index for this query would be with key columns uidNode, dtCreated and included columns all remaining columns in the table to make the index covering as you are returning r.*. If the query will generally only be returning a relatively small number of rows (as seems likely due to the WHERE r.ix = 1 filter) it might not be worthwhile making the index covering though as the cost of the key lookups might not outweigh the negative effects of the large index on CUD statements.
The window/rank functions on SQL Server 2005 are not that optimal sometimes (based on answers here). Apparently better in SQL Server 2008
Another alternative is something like this. I'd have a non-clustered index on (uidNode, dtCreated) INCLUDE any other columns required by SELECT. Subject to what Martin Smith said about lookups.
WITH MaxPerUid AS
(
SELECT
MAX(r.dtCreated) AS MAXdtCreated, r.uidNode
FROM
MaxPerUid
WHERE
r.dtCreated < #dt
GROUP BY
r.uidNode
)
SELECT
...
FROM
MaxPerUid M
JOIN
MaxPerUid R ON M.uidNode = R.uidNode AND M.MAXdtCreated = R.dtCreated

SQL Server won't use my index

I have a fairly simple query:
SELECT
col1,
col2…
FROM
dbo.My_Table
WHERE
col1 = #col1 AND
col2 = #col2 AND
col3 <= #col3
It was performing horribly, so I added an index on col1, col2, col3 (int, bit, and datetime). When I checked the query plan it was ignoring my index. I tried reordering the columns in the index in every possible configuration and it always ignored the index. When I run the query it does a clustered index scan (table size is between 700K and 800K rows) and takes 10-12 seconds. When I force it to use my index it returns instantly. I was careful to clear the cache and buffers between tests.
Other things I’ve tried:
UPDATE STATISTICS dbo.My_Table
CREATE STATISTICS tmp_stats ON dbo.My_Table (col1, col2, col3) WITH FULLSCAN
Am I missing anything here? I hate to put an index hint in a stored procedure, but SQL Server just can’t seem to get a clue on this one. Anyone know any other things that might prevent SQL Server from recognizing that using the index is a good idea?
EDIT: One of the columns being returned is a TEXT column, so using a covering index or an INCLUDE won't work :(
You have 800k rows indexed by col1, col2, col3. Col2 is a bit, so its selectivity is 50%. Col3 is a checked on a range (<=), so it's selectivity will be roughly at about 50% too. Which leaves col1. The query is compiled for the generic, parametrized plan, so it has to account for the general case. If you have 10 distinct values of col1, then your index will return approximately 800k /10 * 25% that is about ~20k keys to lookup in the clustered index to retrieve the '...' part. If you have 10k distinct col1 values then the index will return just 20 keys to look up. As you can see, what matters is not how you build your index in this case, but the actual data. Based on the selectivity of col1, the optimizer will choose a plan based on a clustered index scan (as better than 20k key lookups, each lookup at a cost of at least 3-5 page reads) or one based on the non-clustered index (if col1 is selective enough). In real life the distribution of col1 also plays a role, but going into that would complicate the explanation too much.
You can come with the benefit of hindsight and claim the plan is wrong, but the plan is the best cost estimate based on the data available at compile time. You can influence it with hints (index hint as you suggests, or optimize for hints as Quassnoi suggests) but then your query may perform better for your test set, and far worse for a different set of data, say for the case when #col1 = <the value that matches 500k records>. You can also make the index covering, thus eliminating the '...' in the projection list that require the clustered index lookup necessary, in which case the non-clustered index is always a better cost match than the clustered scan.
Kimberley Tripp has a blog article covering this subject, she calls it the 'index tipping point' which explains how come an apparently perfect candidate index is being ignored: a non-clustered index that does not cover the projection list and has poor selectivity will be seen as more costly than a clustered scan.
SQL Server optimizer is not good in optimizing queries that use variables.
If you are sure that you will always benefit from using the index, just put a hint.
If you will put the literal values to the query instead of variables, it will pick the correct statistics and will use the index.
You may also try to put a more light hint:
OPTION (OPTIMIZE FOR (#col1 = 1, #col2 = 0, #col3 = '2009-07-09'))
, which will calculate the best execution plan for these values of the variables, using statistics, and won't stick to using index no matter what.
The order of the index is important for this query:
CREATE INDEX MyIndex ON MyTable (col3 DESC, col2 ASC, col1 ASC)
It's not so much the ASC/DESC as that when sql server goes to match that where clause, it can match on col3 first and walk the index along that value.
Have you tried tossing out the bit from the index?
create index ix1 on My_Table(Col3, Col1) INCLUDE(Col2)
-- include other columns from the select list if needed
Also, you've left out the rest of the columns from the select list. You might want to consider including those if there aren't many either in the index or as INCLUDE statement to create a covering index for the query.
Try masking your parameters to prevent paramter sniffing:
CREATE PROCEDURE MyProc AS
#Col1 INT
-- etc...
AS
DECLARE #MaskedCol1 INT
SET #MaskedCol1 = #Col1
-- etc...
SELECT
col1,
col2…
FROM
dbo.My_Table
WHERE
col1 = #MaskecCol1 AND
-- etc...
Sounds stupid but I've seen SQL server do some weird things because of parameter sniffing.
I bet SQL Server thinks the price of getting the rest of the columns (designated by ... in your example) from the clustered index outweighs the benefit of the index so it just scans the clustered key. If so, see if you can make this a covering index.
Or does it use another index instead?
Are the columns nullable? Sometimes Sql Server thinks it has to scan the table to find NULL values.
Try adding "and col1 is not null" to the query, it mgiht make sqlserver use the index wtihout hint.
Also, check if the statistics are really up to date:
SELECT
object_name = Object_Name(ind.object_id),
IndexName = ind.name,
StatisticsDate = STATS_DATE(ind.object_id, ind.index_id)
FROM SYS.INDEXES ind
order by STATS_DATE(ind.object_id, ind.index_id) desc
If your SELECT is returning columns that aren't in your index SQL my find that its more efficient to scan the clustered index instead of having to do a key lookup to find the other values that you are requesting.
If you have a TEXT column try switching the data type to VARCHAR(MAX) then including the values in the nonclustered index.

Resources