SQL Query is slow when ORDER BY statement added - sql-server

I have a table [Documents] with the following columns:
Name (string)
Status (string)
DateCreated [datetime]
This table has around 1 million records. All three of these columns have an index (a single index for each one).
When I run this query:
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New';
Execution is really fast (300 ms.)
If I run the same query but with the ORDER BY clause, it's really slow (3000 ms)
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New'
order by DateCreated;
I understand that its searching in another index (DateCreated), but should it really be that much slower? If so, why? Anything I can do to speed this query up (a composite index)?
Thanks
BTW: All Indexes including DateCreated have really low fragmentation, in fact I ran a reorganize and it didn't change a thing.

As far as why the query is slower, the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.
Using the index with a leading column of CreatedDate, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether the row is to be returned, looking at the values in Status and Name columns.
If the optimizer chooses not to use the index with CreatedDate as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set, and do sufficient sorting to guarantee that it's got the "first fifty" that need to be returned.
NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfied the predicates, but not necessarily the 50 rows with the lowest values of DateCreated,you could restructure/rewrite your query, to get (at most) 50 rows, and then perform the sort of just those.
A couple of ideas to improve performance
Adding a composite index (as other answers have suggested) may offer some improvement, for example:
ON Documents (Status, DateCreated, Name)
SQL Server might be able to use that index to satisfy the equality predicate on Status, and also return the rows in DateCreated order without a sort operation. SQL server may also be able to satisfy the predicate on Name from the index, limiting the number of lookups to pages in the underlying table, which it needs to do for rows to be returned, to get "all" of the columns for the row.
For SQL Server 2008 or later, I'd consider a filtered index... dependent on the cardinality of Status='New' (that is, if rows that satisfy the predicate Status='New' is a relatively small subset of the table.
CREATE NONCLUSTERED INDEX Documents_FIX
ON Documents (Status, DateCreated, Name)
WHERE Status = 'New'
I would also modify the query to specify ORDER BY Status, DateCreated, Name
so that the order by clause matches the index, it doesn't really change the order that the rows are returned in.
As a more complicated alternative, I would consider adding a persisted computed column and adding a filtered index on that
ALTER TABLE Documents
ADD new_none_date_created AS
CASE
WHEN Status = 'New' AND COALESCE(Name,'') IN ('','None') THEN DateCreated
ELSE NULL
END
PERSISTED
;
CREATE NONCLUSTERED INDEX Documents_FIXP
ON Documents (new_none_date_created)
WHERE new_none_date_created IS NOT NULL
;
Then the query could be re-written:
SELECT TOP 50 *
FROM Documents
WHERE new_none_date_created IS NOT NULL
ORDER BY new_none_date_created
;

If DateCreated field means insertion time to table, you can create an integer id field and order by that integer field.

You need an index by 2 columns: (Name, DateCreated). The order of fields in the index is important. So, replace your index for just Name with a new index for two columns (Name, DateCreated).

Related

Select query takes long time even for 10 records from a very big records table

I have a very big records table about 6.5 million records. When I try to select some records from it even 10 records I have to wait a long random time.
SELECT [Column1], [Column2], [Column3], [Column4], [Column5]
FROM [table]
WHERE deviceDataId = '640'
ORDER BY id ASC OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY
My database is deployed on azure I also download and deploy it local system but it takes same time.
Query execution plan:
Now that we have your real query we can see that it appears that you have no index on your column deviceDataId; meaning that the entire table needs to be scanned. As such, even though you only want 10 rows, all 6.5M rows need to be scanned and the value of deviceDataId checked.
If you create an index on your column deviceDataId and minimally INCLUDE the others in your query, then you'll have a covering index which'll greatly help. This also assumes id is the column your CLUSTERED INDEX is ordered on.
CREATE NONCLUSTERED INDEX IX_Table_DeviceTableID_Cols1_5
ON dbo.[table] (deviceDataId)
INCLUDE ([Column1], [Column2], [Column3], [Column4], [Column5]);
Also, as I note in the comments, if deviceDataId is an int, use an int value in your WHERE clause. Don't wrap numerical values in single quotes, that is for literal strings.
WHERE deviceDataId = 640

SQL Actual Execution Plan with Sort took high cost

I have a table named DocumentItem with Id column was clustered index (primary key).
Please see these two query strings:
Query 1 (not use order by):
select *
from DocumentItem
where (HistoryCreateDate >= '2019-09-04 05:00:00' AND HistoryCreateDate <= '2019-12-04 05:00:00') and ActNodeState>140100
The result took: 00:00:09 with 168.357 rows.
Query 2 (used order by):
select *
from DocumentItem
where (HistoryCreateDate >= '2019-09-04 05:00:00' AND HistoryCreateDate <= '2019-12-04 05:00:00') and ActNodeState>140100 order by Id
The result took: 00:02:41 with 168.357 rows.
Here is the actual execution plan:
Why it took so long in the 2nd query?
SQL Server has decided that your index IX_HistoryCreateDate (not sure of the full name) is sufficiently selective that it will use it to find the rows that it needs. However, that index isn't sorted on the ID column. It does include the ID column already (whether you specified it or not) because it's the clustering key.
I'd suggest recreating your IX_HistoryCreateDate index like this:
CREATE INDEX IX_HistoryCreateDate ON DocumentItem
( HistoryCreateDate, ID)
INCLUDE (ActNodeState);
And I think you'll be fine. It's still not going to be great and it will have to do a large number of lookups, because your query uses SELECT *. Do you really need all columns returned? If so, and you do this all the time, you might consider reclustering the table in the order that you need.

How handle table updates/inserts if there is an index on table

Hi i am a bit confused about the handling of indexes in postgres. I use the 9.6 version. From my understanding after reading postgres docs and answers from stackoverflow i want to verify the following:
postgres does not support indexes with the classic notion
all indexes in postgres are non-clustered indexes
indexes does not allocate any new space but apply a sort on the table
thats way after create index a CLUSTER command shall follow.
in the docs it is stated that after updates/inserts on table the index is updated automatically
Show i created a table with col1,col2,col3,col4 and the an index based on col2, col3. Selects that have to do with col2, col3 became 15 times faster.
When i execute select * from table then results are displayed first sorted based on col2 and then based on col3.
When i add a new row in the table (with a col2 value (test_value) that already existed), this row went at the end of the table (this was checked with : select * from table).
1) Did the index got updated with this new entry automatically even if the select all showed the row at the end?
2) If a execute a query will all the rows that have the test_value on col2 what will happen? Will i get all the results through the index?
There are some wrong assumptions here.
The most important is: The order of the rows in a select is indeterminate unless you include ORDER BY. So you can get any result db engine decide is the faster way to get the data. So if select * from table return the last inserted element at the end, that doesn't tell you anything regarding the index.
How Rows are stored and Index information are separate things
1) Yes, index was updated after insert.
2) Yes, because index was already update.

How do indexes work behind the scenes

Im a begginer. I know indexes are necessary for performance boosts, but i want to know how they actually work behind the scenes. Beforehand, I used to think that we should make indexes on those columns which are included in where clause (which I realized is wrong)
For example, SELECT * from MARKS where marks_obtained > 50
Consider that there's a clustered index on primary key of this table and I created a non-clustered index on marks_obtained column as its there in my where clause.
My perception: So the leaf nodes will be containing pointers to clustered index and as clustered index points to actual rows, it will select entire rows (due to asteric in my query)
Scenario
I came across following query (from AdventureWorks DB on which a non-clustered index was created) which works fine and took less than a second to execute 3200000 rows until a new column was inserted into it:
Query
SELECT x.*
INTO#X
FROM dbo.bigProduct AS p
CROSS APPLY
(
SELECT TOP 1000 *
FROM dbo.bigTransactionHistory AS bth
WHERE
bth.ProductId = p.bth.ProductId
ORDER BY
TransactionDate DESC
) AS x
WHERE
p.ProductId BETWEEN 1000 AND 7500
GO
NEW INSERTED COLUMN
ALTER TABLE dbo.bigTransactionHistory
ADD CustomerId INT NULL
After insertion of above column it took 17 seconds! means 17 times slower. A non-clusered index was now missing CustomerId column in the index. Just after including CustomerId, problem was gone.
Question CustomerId seemed to be the culprit until it was added to the index. BUT HOW???
The execution plan would answer this but I'll make a guess: The non-clustered index was no longer enough to satisfy the query after the additional column had been added. This can cause the index to not be used anymore. It also can cause one clustered index seek per row.
Learn to read execution plans. Turn on the "actual execution plan" feature routinely for each query that you test.

Index Scan with PROBE instead of an Index Seek

I have a query that looks like this:
--Updated To remove Distinct per Aaron Bertrand's suggestion in the comments
SELECT TOP 100 ord.OrderId
FROM Customer cust
JOIN CustomerOrder ord
ON ord.CustomerId = cust.CustomerId
WHERE cust.FirstName LIKE (#firstName + '%')
ORDER BY ord.CreatedWhen DESC
And I have an index like this:
CREATE NONCLUSTERED INDEX [IX_MyIndex] ON CustomerOrder
(
OrderId DESC,
CustomerId DESC,
CreatedWhen Desc
)
GO
When I run my query, the index gets used, but it is an index scan. And it gives this message:
PROBE([Bitmap1011],[MyDatabase].[order].[CustomerOrder].[OrderId] as [ord].[OrderId],N'[IN ROW]')
The output list consists of the OrderId and CreatedWhen.
What is this PROBE doing and why I don't get an Index Seek?
UPDATE:
The FirstName column on the Customer table does have an index that is being used in an IndexSeek.
CREATE NONCLUSTERED INDEX [IX_Customer_FirstName] ON Customer
(
[FirstName] ASC
)
GO
The reason that an Index Scan gets used is because your WHERE clause predicate is based on CustomerId, but it appears as the SECOND column in the list of columns in your non-clustered index [IX_MyIndex].
If you want an Index Seek to be performed, you would need to specify a new non-clustered index just on column CustomerId.
And that would essentially be a good practice - have two separate NC indices for OrderId and CustomerId. So when you join Customer and CustomerOrder tables, it will use the NC Index for CustomerId, and when you join Order and CustomerOrder tables, it will use the NC index for OrderId.
Refer to this article to read more about the difference between a multi-column non-clustered index (which you currently have) and multiple non-clustered indexes (which I proposed using).
[UPDATE]
But creating separate non-clustered indexes is not sufficient in getting an Index Seek everytime. That will depend on the columns being selected in the query, and the size of the data being read - based on that the query optimizer will accordingly make a decision on whether to use an Index Seek or an Index Scan. See this answer for more information.
[UPDATE Feb 8, 2021]
At a high-level, the PROBE function in question would essentially try to verify whether the CustomerOrder.OrderId column value is present in the Customer table. This is achieved internally through the using of bitmaps and hash keys, and you can read in detail about it here.
Note that a PROBE is not specific to an Index Scan or an Index Seek. It is simply a function that is utilized for verifying matches (based on a certain hash keyed column(s)) between two tables in a join.
Simple reason: your FirstName column isn't in the index. It must scan every row to see if the row matches the pattern you want.

Resources