Context: SQL Server 2008. There are 2 tables to inner join.
The fact table, which has 40 million rows, contains the patient key and the medications administered and other facts. There is a unique index (nonclustered) on medication key and patient key combined in that order.
The dimension table is the medication list (70 rows).
The join is to get the medication code (business code) based on medication key (surrogate key).
Query:
SELECT a.PKey, a.SomeFact, b.MCode
FROM tblFact a
JOIN tblDIM b ON a.MKey = b.MKey
All the columns returned are integer.
The above query runs in 7 minutes and its execution plan shows the index on (MKey,PKey) is used. The index was rebuilt right before the run.
When I disable the index on the fact table (or copy the data to a new table with the same structure but no index), the same query takes only 1 minute 40 seconds.
The IO statistics are also striking.
With index: Table 'tblFACT'. Scan count 70, logical reads 190296338, physical reads 685138, read-ahead reads 98713
Without index: Table 'tblFACT_copy'. Scan count 17, logical reads 468891, physical reads 0, read-ahead reads 419768
Question: why does it try to use the index and head down the inefficient path?
You need to add SomeFact as an INCLUDE on the tblFact index to make it covering.
Currently, the table will be accessed twice: once for the index and then again for a lookup to get SomeFact, either as a RID lookup or a key lookup (depending on whether there is a clustered index).
This doesn't apply to tblDIM because I assume that MKey is its clustered index, which makes it covering implicitly.
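For example, a hedged sketch of that change (the index name below is hypothetical; substitute the real name of the existing index so it is rebuilt in place rather than duplicated):
CREATE UNIQUE NONCLUSTERED INDEX IX_tblFact_MKey_PKey   -- hypothetical name; use the existing index's real name
    ON tblFact (MKey, PKey)
    INCLUDE (SomeFact)                                   -- SomeFact in the leaf pages makes the index covering
    WITH (DROP_EXISTING = ON);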
In rare cases, the database chooses an incorrect execution plan. In this case, the index is used for the join, but since all data is fetched from both tables, it would be faster to just scan the whole table.
The indexed version will be much faster if you add a WHERE clause to the query, because without the index SQL Server would still need to scan the whole table instead of grabbing just the handful of records it needs.
There may be directives (hints) to encourage the database not to use an index, or to use a different one, but I don't know SQL Server that well.
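In SQL Server those directives are table hints. Purely as an illustration (not necessarily the right fix here), the following forces a scan of the base table instead of the nonclustered index:
SELECT a.PKey, a.SomeFact, b.MCode
FROM tblFact a WITH (INDEX(0))   -- INDEX(0) forces a heap or clustered index scan
JOIN tblDIM b ON a.MKey = b.MKey;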
Are your statistics up to date? Check with:
SELECT object_name = Object_Name(ind.object_id)
, IndexName = ind.name
, StatisticsDate = STATS_DATE(ind.object_id, ind.index_id)
FROM SYS.INDEXES ind
order by
STATS_DATE(ind.object_id, ind.index_id) desc
Update with:
exec sp_updatestats;
I have a table [Documents] with the following columns:
Name (string)
Status (string)
DateCreated (datetime)
This table has around 1 million records. All three of these columns have an index (a single index for each one).
When I run this query:
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New';
Execution is really fast (300 ms.)
If I run the same query but with the ORDER BY clause, it's really slow (3000 ms)
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New'
order by DateCreated;
I understand that it's searching in another index (DateCreated), but should it really be that much slower? If so, why? Is there anything I can do to speed this query up (a composite index)?
Thanks
BTW: All Indexes including DateCreated have really low fragmentation, in fact I ran a reorganize and it didn't change a thing.
As far as why the query is slower, the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.
Using the index with a leading column of DateCreated, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether each row is to be returned, looking at the values in the Status and Name columns.
If the optimizer chooses not to use the index with DateCreated as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set, and do sufficient sorting to guarantee that it's got the "first fifty" that need to be returned.)
NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfy the predicates, but not necessarily the 50 rows with the lowest values of DateCreated, you could restructure/rewrite your query to get (at most) 50 rows first, and then sort just those.
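A minimal sketch of that rewrite, assuming any 50 matching rows are acceptable:
-- Pick any 50 rows that satisfy the predicates first, then sort only those 50.
-- Note: these are not necessarily the 50 rows with the lowest DateCreated overall.
SELECT *
FROM (
    SELECT TOP 50 *
    FROM [Documents]
    WHERE (Name = 'None' OR Name IS NULL OR Name = '')
      AND Status = 'New'
) AS d
ORDER BY DateCreated;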
A couple of ideas to improve performance
Adding a composite index (as other answers have suggested) may offer some improvement, for example:
ON Documents (Status, DateCreated, Name)
SQL Server might be able to use that index to satisfy the equality predicate on Status, and also return the rows in DateCreated order without a sort operation. SQL Server may also be able to satisfy the predicate on Name from the index, limiting the lookups into the underlying table to just the rows that will actually be returned (those lookups are still needed to get "all" of the columns for each row).
For SQL Server 2008 or later, I'd consider a filtered index... depending on the cardinality of Status='New' (that is, if the rows that satisfy the predicate Status='New' are a relatively small subset of the table).
CREATE NONCLUSTERED INDEX Documents_FIX
ON Documents (Status, DateCreated, Name)
WHERE Status = 'New'
I would also modify the query to specify ORDER BY Status, DateCreated, Name
so that the ORDER BY clause matches the index; it doesn't really change the order in which the rows are returned.
As a more complicated alternative, I would consider adding a persisted computed column and adding a filtered index on that
ALTER TABLE Documents
ADD new_none_date_created AS
CASE
WHEN Status = 'New' AND COALESCE(Name,'') IN ('','None') THEN DateCreated
ELSE NULL
END
PERSISTED
;
CREATE NONCLUSTERED INDEX Documents_FIXP
ON Documents (new_none_date_created)
WHERE new_none_date_created IS NOT NULL
;
Then the query could be re-written:
SELECT TOP 50 *
FROM Documents
WHERE new_none_date_created IS NOT NULL
ORDER BY new_none_date_created
;
If the DateCreated field represents the insertion time into the table, you can create an integer id column and order by that integer column instead.
You need an index on two columns: (Name, DateCreated). The order of the columns in the index is important. So, replace your index on just Name with a new two-column index on (Name, DateCreated).
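A sketch of that change (both index names here are assumptions; use the actual names from your database):
DROP INDEX IX_Documents_Name ON Documents;               -- the existing single-column index on Name
CREATE NONCLUSTERED INDEX IX_Documents_Name_DateCreated
    ON Documents (Name, DateCreated);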
I'm a beginner. I know indexes are necessary for performance boosts, but I want to know how they actually work behind the scenes. Previously, I thought we should create indexes on the columns that appear in the WHERE clause (which I realized is wrong).
For example, SELECT * from MARKS where marks_obtained > 50
Consider that there's a clustered index on the primary key of this table and I created a non-clustered index on the marks_obtained column as it's there in my WHERE clause.
My perception: the leaf nodes will contain pointers to the clustered index, and since the clustered index points to the actual rows, it will select entire rows (due to the asterisk in my query).
Scenario
I came across the following query (against an AdventureWorks-based DB on which a non-clustered index had been created), which worked fine and took less than a second against 3,200,000 rows until a new column was added to the table:
Query
SELECT x.*
INTO #X
FROM dbo.bigProduct AS p
CROSS APPLY
(
    SELECT TOP 1000 *
    FROM dbo.bigTransactionHistory AS bth
    WHERE
        bth.ProductId = p.ProductId
    ORDER BY
        TransactionDate DESC
) AS x
WHERE
    p.ProductId BETWEEN 1000 AND 7500
GO
NEWLY ADDED COLUMN
ALTER TABLE dbo.bigTransactionHistory
ADD CustomerId INT NULL
After the above column was added, the query took 17 seconds! That's roughly 17 times slower. The non-clustered index was now missing the CustomerId column. As soon as I included CustomerId in the index, the problem was gone.
Question: CustomerId seemed to be the culprit until it was added to the index. BUT HOW???
The execution plan would answer this, but I'll make a guess: the non-clustered index was no longer enough to satisfy the query after the additional column had been added. This can cause the index not to be used anymore. It can also cause one clustered index lookup per row.
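A hedged sketch of the fix the question describes, assuming the original index was keyed on ProductId and TransactionDate and already included the other selected columns (the name and column list below are guesses, not the actual DDL):
CREATE NONCLUSTERED INDEX IX_bigTransactionHistory_Product_Date  -- assumed name; use the real index's name
    ON dbo.bigTransactionHistory (ProductId, TransactionDate)
    INCLUDE (Quantity, ActualCost, CustomerId)   -- keep the original included columns, adding CustomerId
    WITH (DROP_EXISTING = ON);                   -- rebuilds the existing index of that name in place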
Learn to read execution plans. Turn on the "actual execution plan" feature routinely for each query that you test.
I have a reporting table that is populated from various fact tables in my Data Warehouse. The issue is that for one customer in that reporting table, it takes 46 seconds to pull his data. The one customer has 4,232,424 records. In total, the table has 5,336,393 records in it, and has 4 columns. I'll post the table structure and the query I'm running. I need to get the result time on this down to as low as possible. I've tried In Memory Tables, various Indexes, and Indexed Views.
TABLE STRUCTURE
CREATE TABLE cache.Tree
(
CustomerID INT NOT NULL PRIMARY KEY NONCLUSTERED,
RelationA_ID INT NOT NULL,
RelationB_ID INT NOT NULL,
NestedLevel INT NOT NULL,
lft INT NOT NULL,
rgt INT NOT NULL,
INDEX IX_LEGS CLUSTERED (lft, rgt),
INDEX IX_LFT NONCLUSTERED (lft)
)
The Report Query
SELECT
tp.CustomerID AS DLine,
t.CustomerID,
t.RelationA_ID,
Level = t.NestedLevel - tp.NestedLevel,
IndentedSort = t.lft
FROM cache.UnilevelTreeWithLC2 tp
INNER JOIN cache.UnilevelTreeWithLC2 t
ON t.lft between tp.lft AND tp.rgt
WHERE tp.CustomerID = 7664
Any help or guidance would be greatly appreciated.
UPDATE 1: Query Execution Plan
UPDATE 2: Solved
I was able to get permission to filter out inactive people in the tree. This has cut the query execution time almost in half, as long as I keep the indexes I put on the table.
Try FORCESCAN. For a query that pulls 80% of a narrow table I would expect SQL Server to scan, but it might not because of bad stats or one of the various cardinality estimation bugs (ones that are fixed but require trace flags to enable).
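For example (FORCESCAN is a table hint available from SQL Server 2008 R2 SP1 onward; this is only a sketch of where the hint goes):
SELECT
    tp.CustomerID AS DLine,
    t.CustomerID,
    t.RelationA_ID,
    Level = t.NestedLevel - tp.NestedLevel,
    IndentedSort = t.lft
FROM cache.UnilevelTreeWithLC2 tp
INNER JOIN cache.UnilevelTreeWithLC2 t WITH (FORCESCAN)   -- force a scan of the inner side
    ON t.lft BETWEEN tp.lft AND tp.rgt
WHERE tp.CustomerID = 7664;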
I would also ditch the Celko-style nested sets. A single parent_id column will make your table even narrower, which should speed up these throughput-bound cases, spare you the lft/rgt maintenance, and still be fast with recursive queries; see the sketch below.
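A minimal sketch of that adjacency-list alternative; the table name, parent_id column, and query are illustrative only, not the poster's actual schema:
CREATE TABLE cache.TreeAdjacency
(
    CustomerID   INT NOT NULL PRIMARY KEY,
    parent_id    INT NULL,            -- NULL for the root of the tree
    RelationA_ID INT NOT NULL,
    NestedLevel  INT NOT NULL
);

-- All customers in the downline of customer 7664, via a recursive CTE.
WITH Downline AS
(
    SELECT CustomerID, RelationA_ID, NestedLevel
    FROM cache.TreeAdjacency
    WHERE CustomerID = 7664
    UNION ALL
    SELECT c.CustomerID, c.RelationA_ID, c.NestedLevel
    FROM cache.TreeAdjacency AS c
    INNER JOIN Downline AS d ON c.parent_id = d.CustomerID
)
SELECT * FROM Downline;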
I have the following query:
Select TOP 5000
CdCl.SubId
From dbo.PanelCdCl CdCl WITH (NOLOCK)
Inner Join dbo.PanelHistory PH ON PH.SubId = CdCl.SubId
Where CdCl.PanelCdClStatusId IS NULL And PH.LastProcessNumber >= 1605
Order By CdCl.SubId
The query plan looks as follows:
Both the PanelCdCl and PanelHistory tables have a clustered index / primary key on SubId, and it's the only column in the index. There is exactly one row for each SubId in each table. Both tables have ~35M total rows in them.
I'm curious why the query plan is showing a clustered index scan on PanelHistory when the join is being done on the clustered index column.
It's not scanning PanelHistory's clustered index (SubId) to find a particular SubId; it's scanning it to find all rows where LastProcessNumber >= 1605. This is the first logical step.
Then it likewise scans PanelCdCl to find all rows where PanelCdClStatusId is null. Then, since both tables are clustered on the same key (SubId), the inputs are already sorted on the join column, so it can do a Merge-Join without an additional sort. (Merge-Join is almost always the most efficient join if it doesn't have to re-sort the input rows.)
Then it doesn't have to do a Sort for the ORDER BY, because it's already in SubId order.
And finally, it does the TOP, which has to be after everything else (by the rules of SQL clause logical execution ordering).
So the only place it tests SubId values is in the Merge-Join; it never pushes them down to the scans. This would probably remain true if it did a Hash-Join instead. Only for a Nested-Loops Join would it have to push the SubId test down as a seek on a table, and that should only be the lower branch, not the upper one.
The merge join operator needs two sorted inputs. The clustered key is SubId in both tables, which means the scan of PanelHistory returns the rows in the correct order. The clustered key is also included in every nonclustered index, so the rows in the NCI IX_PanelCdCl_PanelCdClStatusId where PanelCdClStatusId is null are ordered by SubId as well, and that index can therefore be used directly by the merge join.
What you see here is actually two scans: one of the clustered key in PanelHistory with a residual predicate on LastProcessNumber >= 1605, and one index range scan of IX_PanelCdCl_PanelCdClStatusId over the range where PanelCdClStatusId is null.
They will however not scan the entire table/index. The query plan is executed from left to right, with each operator asking the one to its right for one row at a time until there are no more rows to be had. That means the TOP operator will stop asking for new rows from the merge join once it has the required 5000 rows.
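For reference, the nonclustered index discussed above probably looks something like this; the definition is inferred from the plan and is an assumption, not the question's actual DDL:
CREATE NONCLUSTERED INDEX IX_PanelCdCl_PanelCdClStatusId
    ON dbo.PanelCdCl (PanelCdClStatusId);
-- SubId, the clustered key, is carried in every nonclustered index row automatically,
-- so the rows where PanelCdClStatusId IS NULL come back already ordered by SubId.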
I've been trying to make some hands-on learning about indices, since I plan to give some lectures about them next semester. I have read the important chapters in Ramakrishnan & Gehrke, and some pages on the internet, including SQL Server documentation. I thought I had a good enough theoretical understanding of the subject, but when I began making experiments with SQL Server 2008 R2 I had some trouble to verify them.
At this point, I still want to reread the chapters from R&G on query evaluation, namely chapters 12-15, but I wanted to run these tests now to see if I'm getting it right.
I am using the AdventureWorks database, but I altered it somewhat to create examples. My goal is to give the students an empirical exploration of the subject, by giving them similar queries to compare over similar tables and deduce the influence of the indices in their performance. Towards that, I created three tables with data based on Sales.SalesOrderDetail:
table 1 (newDetailsTable) has a clustered index on the primary key (SalesOrderId, SalesOrderDetailId);
table 2 (newDetailsTable_noIndex) has no index; in a second phase of the tests I create a non-clustered index on it on ProductId, including OrderQty and UnitPrice
table 3 (newDetailsTable_sortedInsert) has the same clustered index as table 1 and in a second phase the same non-clustered index as table 2 (both at the same time)
In table 1 and 3, the clustered index is created through a primary key constraint. Table 2 does not have a primary key.
I removed every check, default and foreign key constraints, and the automatic calculations for certain columns, so that the time spent in the queries should be only from finding the right records and not checking integrity.
These tables are loaded with the rows of a staging table that multiplies the initial rows of Sales.SalesOrderDetail. They hold 3,882,144 rows. The staging table is created by
select * into stagingTable from Sales.SalesOrderDetail
and then a series of inserts like
INSERT INTO stagingTable (...) SELECT (...) FROM stagingTable,
where the SELECT supplies, in the position of [SalesOrderDetailID], the value [SalesOrderDetailID] + 1000000.
Successive iterations double that added offset, until the last pass adds 16000000.
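A hedged sketch of that load pattern (the column list is abbreviated here; the real script names every column of SalesOrderDetail):
-- Each pass doubles the row count by inserting the table onto itself,
-- shifting SalesOrderDetailID so the new values stay unique.
INSERT INTO stagingTable (SalesOrderID, SalesOrderDetailID, ProductID, OrderQty, UnitPrice /*, ... */)
SELECT SalesOrderID, SalesOrderDetailID + 1000000, ProductID, OrderQty, UnitPrice /*, ... */
FROM stagingTable;
-- Later passes use + 2000000, + 4000000, + 8000000 and finally + 16000000.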
I also created three copies of Sales.salesOrderHeader. None of them have primary keys. I also eliminated foreign keys, check and default constraints as before.
table 1 (newHeaderTable) has no index at all
table 2 (newHeaderTable_withIndex) has a non-clustered index on SalesOrderId
table 3 (newHeaderTable_withClustIndex) has a clustered index on SalesOrderId.
I ran each of the following queries twice:
/* Details Table Clustered*/
/*Header Table Heap - No Indexes*/
select d.SalesOrderDetailID, d.SalesOrderID from NewDetailsTable d join newHeaderTable h on d.SalesOrderID = h.SalesOrderID
/*Header Table Clustered*/
select d.SalesOrderDetailID, d.SalesOrderID from NewDetailsTable d join newHeaderTable_withClustIndex h on d.SalesOrderID = h.SalesOrderID
/*Header Table Heap with NCI*/
select d.SalesOrderDetailID, d.SalesOrderID from NewDetailsTable d join newHeaderTable_withIndex h on d.SalesOrderID = h.SalesOrderID
/* Details Table Heap*/
/*Header Table Heap - No Indexes*/
select d.SalesOrderDetailID, d.SalesOrderID from NewDetailsTable_noIndex d join newHeaderTable h on d.SalesOrderID = h.SalesOrderID
/*Header Table Clustered*/
select d.SalesOrderDetailID, d.SalesOrderID from NewDetailsTable_noIndex d join newHeaderTable_withClustIndex h on d.SalesOrderID = h.SalesOrderID
/*Header Table Heap with NCI*/
select d.SalesOrderDetailID, d.SalesOrderID from NewDetailsTable_noIndex d join newHeaderTable_withIndex h on d.SalesOrderID = h.SalesOrderID
I expected that the fourth line, having no usable indices in either of the tables, would be much slower than the others. But in fact, they've all been around the same time more or less, 2m12s on my computer.
I checked the execution plans for line 1 and I see a merge join after a clustered index scan and a table scan
I checked the execution plans for line 4 and I see a hash match after two table scans.
So the plans seem consistent with what is expected, but the times do not differ in much. I also checked the statistics (I'm using both statistics time on and statistics io on) and the physical reads in line 4 are at 0! Also, I've ran all queries immediately after executing
dbcc freesystemcache ('All')
go
dbcc dropcleanbuffers
go
So, why are the times the same? Do I need more or less rows? Is it coincidence, or should I be filtering rows before the join? I remember that some years ago I was playing with a MySql database where I had retailer data similar to AW. I had about 1m rows in details, and 100k rows in headers. Initially the database had no indices and when I put them in the chages were dramatic. Why am I not getting that behaviour here?
Thanks all.
And Merry Holidays for you.
P.S: I can provide scripts as needed. I only didn't because this is long already.
Ok, I have some trouble editing a previous comment of mine. Anyway, I uploaded the scripts. They're here:
https://gist.github.com/1514951
Also, I followed Marc_s's suggestion, but I didn't get the results I expected.
I reduced the columns, and instead of
SELECT *
I made
SELECT d.SalesOrderDetailID, d.SalesOrderID FROM NewDetailsTable d JOIN newHeaderTable h ON d.SalesOrderID = h.salesOrderId
and also
select d.SalesOrderDetailID, d.SalesOrderID from NewDetailsTable_noIndex d join newHeaderTable h on d.SalesOrderID = h.salesOrderId
I updated stats. To note: the table newHeaderTable does not have indices. NewDetailsTable has a clustered primary key index on SalesOrderId, SalesOrderDetailId. Table NewDetailsTable_noIndex does not have any index.
On the first query I got: 31 seconds, 117 physical reads on the details table, 5 reads on the header table;
on the second query: 28 seconds, 5 physical reads on the details table, 2 reads on the header table.
I still don't understand this, I'm afraid.
I don't think you need to change the size of your data-set. I agree with Marc_s that you should try to reduce the number of columns you select and see if that makes a difference.
Have you updated the table statistics? You can see in Management Studio how old they are (or use STATS_DATE) and if any of them are from before you did your inserts from the staging table, they should certainly be updated.
Given that this is a test environment, you can safely run:
UPDATE STATISTICS NewDetailsTable WITH FULLSCAN
UPDATE STATISTICS newHeaderTable WITH FULLSCAN
....
In a live environment, especially on large tables, you might not want to do FULLSCAN but rather go for SAMPLE <num> PERCENT
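For example (the 10 percent sampling rate here is just an illustration):
UPDATE STATISTICS NewDetailsTable WITH SAMPLE 10 PERCENT
UPDATE STATISTICS newHeaderTable WITH SAMPLE 10 PERCENT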
When I created the tables using the script you supplied it created objects of the following sizes.
NewDetailsTable_noIndex 41485 data pages in the heap
NewDetailsTable 41624 leaf level pages and 2 upper levels
newHeaderTable 799 data pages in the heap
newHeaderTable_withClustIndex 801 leaf level data pages
newHeaderTable_withIndex 799 data pages in the heap, 59 leaf level NCI pages
You are selecting all 31,465 rows from the OrderHeader table and all related rows from the OrderDetail table without applying any filter.
If the execution plan were to use a nested loops join then this would require 31,465 index seeks into the OrderDetails table. Even for the NewDetailsTable case, where an index does exist that would allow this, it would be very inefficient to seek into each value individually. Each seek needs to navigate the index hierarchy, meaning a minimum of 3 reads. When I forced this plan I ended up with 174,392 logical reads, showing that the average number of reads per seek was actually over 5. This is about 4 times as many reads as scanning the table in its entirety (and non-sequential IO too).
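One way to force that nested loops plan yourself, if you want to reproduce the comparison (a suggested hint; the original answer doesn't say how the plan was forced):
SELECT d.SalesOrderDetailID, d.SalesOrderID
FROM NewDetailsTable d
JOIN newHeaderTable h ON d.SalesOrderID = h.SalesOrderID
OPTION (LOOP JOIN);   -- forces a nested loops join; the optimizer will then typically seek into NewDetailsTable once per header row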
So having an index on OrderDetails that can be seeked into is of no particular benefit to this query; both hash join and merge join will outperform the nested loops join that uses the seek.
Both hash and merge join need to scan each input once, which means that all 6 of the execution plans end up with very similar IO. There are some minor differences across the various versions of the header table, but it is so much smaller than the Details table that they don't make any significant difference to the overall cost.
You might notice that for the version of the Details table with the Clustered Index logical reads are a bit higher than the actual number of leaf pages because SQL Server reads ahead the second level of the index too.
The clustered index version is able to use the merge join strategy because the index is already sorted by the joining column. Whilst this doesn't appear to lead to any measurable benefit in terms of execution time in this case, it does provide a benefit in not needing a memory grant and not risking a spill to tempdb.
Probably the most useful index you could add to the Details table would be one on (SalesOrderID, SalesOrderDetailID). This would provide a narrower index for SQL Server to scan, thus reducing the IO requirement, while still covering both of the columns used in your revised query (the one without *).
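A sketch of such an index (the name is illustrative; on the heap version of the table you would create the equivalent index with its table name):
CREATE NONCLUSTERED INDEX IX_NewDetailsTable_SalesOrderID_SalesOrderDetailID   -- illustrative name
    ON NewDetailsTable (SalesOrderID, SalesOrderDetailID);
-- A much narrower structure to scan than the clustered index (or heap),
-- yet it still covers both columns the revised query selects.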