Query equals check with 1-3 row result prefers NonClustered Index Scan - sql-server

Here is my table, which has an order_number column. The table has fewer than 500 rows at the moment. A non-clustered index on order_number has been created.
CREATE TABLE [outbound_service].[shipment_line]
(
[id] [uniqueidentifier] NOT NULL,
[shipment_id] [uniqueidentifier] NOT NULL,
[order_number] [varchar](255) NOT NULL,
.... 18 other columns
CONSTRAINT [PK_SHIPMENT_LINE]
PRIMARY KEY CLUSTERED ([id] ASC)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY],
CONSTRAINT [uk_order_order_line_number]
UNIQUE NONCLUSTERED ([order_number] ASC, [order_line_number] ASC)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX IX_shipment_line_order
ON outbound_service.shipment_line(order_number ASC)
Here is my simple equals check query, which might return at most 5 rows.
DECLARE @P0 nvarchar(400) = 'LG-ORD-002';
SELECT TOP 1 sl.order_number
FROM outbound_service.shipment_line sl
WHERE sl.order_number = @P0
I expected a nonclustered index seek, but I see an index scan happening, even though there are at most 5 rows per order_number.
If I run the query without bind parameters, I see the index seek.
I have another database where I expect millions of rows, and I am worried about this scan: it is driving 100% CPU on this query under high concurrency and slowing down the rest of the workflows.
What could be the reason SQL Server seems to prefer a scan over a seek here, when the data to return from the index is so minimal?

There is a CONVERT_IMPLICIT in your first execution plan; I would always align the parameter type with the column type and size. In addition, if you want to pick a result with TOP, I would suggest adding ORDER BY order_number, for two reasons:
The non-clustered index IX_shipment_line_order can then be used without a sorting cost.
Without ORDER BY, TOP does not guarantee which row you get back.
For example:
DECLARE @P0 [varchar](255) = 'LG-ORD-002';
SELECT TOP 1 sl.order_number
FROM outbound_service.shipment_line sl
WHERE sl.order_number = @P0
ORDER BY sl.order_number

Here is the final answer, after digging a bit into Hibernate and the SQL Server driver code and using SQL Profiler to understand the impact on queries. Hibernate logs give no clue about the VARCHAR-to-NVARCHAR conversion. By default, the Microsoft SQL Server JDBC driver passes every String property of a JPA entity as an nvarchar parameter into the SQL WHERE clause, even though the underlying mapped columns are plain varchar. This was forcing the SQL engine to implicitly convert the column data to nvarchar as well.
Solution:
Passing sendStringParametersAsUnicode=false in the connection URL changes the driver's default behavior, so entity Strings are sent as varchar in the query WHERE clause. If a particular column does need nvarchar and the driver no longer handles it automatically, the corresponding entity field can be annotated with Hibernate's @Nationalized annotation; Hibernate then sets the right expectation for the SQL driver.
Setting sendStringParametersAsUnicode to false is helpful when you have many varchar columns that never need to store international characters and those columns are referenced heavily in WHERE clauses; it avoids the implicit conversions that lead to index scans.
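To see the same effect directly in T-SQL, here is a minimal sketch of the two parameter declarations against the table above (the exact plan shape depends on collation and SQL Server version, so treat this as illustrative only):
-- Driver default: the value arrives as nvarchar, so the varchar column is
-- implicitly converted before the comparison, which can prevent a plain
-- index seek on IX_shipment_line_order.
DECLARE @P0_unicode nvarchar(400) = N'LG-ORD-002';
SELECT TOP 1 sl.order_number
FROM outbound_service.shipment_line sl
WHERE sl.order_number = @P0_unicode; -- CONVERT_IMPLICIT on the column
-- With sendStringParametersAsUnicode=false (or an explicit varchar parameter),
-- the types match and the seek is available.
DECLARE @P0_ansi varchar(255) = 'LG-ORD-002';
SELECT TOP 1 sl.order_number
FROM outbound_service.shipment_line sl
WHERE sl.order_number = @P0_ansi; -- index seek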

Related

Selecting rows first by Id then by datetime - with or without a subquery?

I need to create statistics from several log tables, most of the time every hour, but sometimes more frequently, every 5 minutes.
Selecting rows only by datetime isn't fast enough for larger logs, so I thought I would select only rows that are new since the last query, by storing the max Id and reusing it next time:
SELECT TOP(1000) * -- so that it's not too much
FROM [dbo].[Log]
WHERE Id > lastId AND [Timestamp] >= timestampMin
ORDER BY [Id] DESC
My question: is SQL Server smart enough to
first filter the rows by Id and then by Timestamp, even if I change the order of the conditions, or does the condition order matter, or
do I need a subquery to first select the rows by Id and then filter them by Timestamp?
with subquery:
SELECT *
FROM (
SELECT TOP(1000) * FROM [dbo].[Log]
WHERE Id > lastId
ORDER BY [Id] DESC
) t
WHERE t.[TimeStamp] >= timestampMin
The table schema is:
CREATE TABLE [dbo].[Log](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Timestamp] [datetime2](7) NOT NULL,
-- other columns
CONSTRAINT [PK_dbo_Log] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
I tried to use the query plan to find out how it works but it turns out that I cannot read it and I don't understand it.
In your case you don't have an index on TimeStamp, so SQL Server will always use the clustered index (Id) first (the clustered index seek you see in the query plan) to find the first row matching Id > lastId, and then scan the remaining rows with the residual predicate [Timestamp] >= timestampMin (actually it is the other way around, since you are sorting in reverse order with DESC).
If you were to add an index on TimeStamp, SQL Server might use it based on:
the cardinality of the predicate [Timestamp] >= timestampMin. Please note that cardinality is always an estimate based on statistics (see https://msdn.microsoft.com/en-us/library/ms190397.aspx) and the cardinality estimator (it changed from SQL Server 2012 to 2014+, see https://msdn.microsoft.com/en-us/library/dn600374.aspx).
how covering the non-clustered index is (since you are using the wildcard, it hardly matters anyway). If the non-clustered index is not covering, SQL Server would have to add a Key Lookup operator (see https://technet.microsoft.com/en-us/library/bb326635(v=sql.105).aspx) in order to retrieve all the fields (or perform a join). This will likely make the index not worthwhile for this query.
Also note that your two queries, the one with the subquery and the one without, are functionally different. The first will give you the first 1000 rows that have both Id > lastId AND [Timestamp] >= timestampMin. The second will give you only the rows having [Timestamp] >= timestampMin from the first 1000 rows having Id > lastId. So, for example, you might get 1000 rows from the first query but fewer than that from the second.
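If you do decide to test an index on [Timestamp], here is a minimal sketch of what was described above (the index name is hypothetical):
-- Because the query uses SELECT *, this index is not covering, so SQL Server
-- would add a Key Lookup back to the clustered index for the other columns.
CREATE NONCLUSTERED INDEX IX_Log_Timestamp
ON [dbo].[Log] ([Timestamp]);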

Partitioning in SQL Server Standard Edition with billions of rows

Hi, I would like to ask how to partition the following table (see below). The problem I'm having is not in the retrieval of history records, which was resolved by the clustered index. But as you can see, the index is based on HistoryParameterID and then SourceTimeStamp; this is needed because rows are retrieved based on those columns.
The problem is that when the table reaches ~1 billion records, inserts slow down, since the scenario is 15k rows/second to be inserted (note this can be 30k - 100k), and each row corresponds to a HistoryParameterID.
Basically, HistoryParameterID is not unique; it has a one-to-many relationship with the other columns of the table below.
My hunch is that the index slows down the inserts because inserts are not always at the end of the table, as it is arranged by HistoryParameterID.
I did some testing using the timestamp as the index key, but to no avail, since query performance was then unacceptable.
Is there any way to partition this by HistoryParameterID? I tried it: I created 15k tables for a partitioned view, but when I created the view it never finished executing. Any tips, or is there any other way to partition? Please note that I'm using Standard Edition and using Enterprise Edition is not an option.
CREATE TABLE [dbo].[HistorySampleValues]
(
[HistoryParameterID] [int] NOT NULL,
[SourceTimeStamp] [datetime2](7) NOT NULL,
[ArchiveTimestamp] [datetime2](7) NOT NULL CONSTRAINT [DF__HistorySa__Archi__2A164134] DEFAULT (getutcdate()),
[ValueStatus] [int] NOT NULL,
[ArchiveStatus] [int] NOT NULL,
[IntegerValue] [bigint] SPARSE NULL,
[DoubleValue] [float] SPARSE NULL,
[StringValue] [varchar](100) SPARSE NULL,
[EnumNamedSetName] [varchar](100) SPARSE NULL,
[EnumNumericValue] [int] SPARSE NULL,
[EnumTextualValue] [varchar](256) SPARSE NULL
) ON [PRIMARY]
CREATE CLUSTERED INDEX [Source_HistParameterID_Index] ON [dbo].[HistorySampleValues]
(
[HistoryParameterID] ASC,
[SourceTimeStamp] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
I was trying it so I created 15k tables for a partitioned view. But when I created the view it didn't finish executing. Any tips? Or is there any way to partition? Please note that I'm using Standard Edition and using Enterprise Edition is not an option.
If you go down the partitioned view path (http://technet.microsoft.com/en-us/library/ms190019.aspx), I suggest fewer tables (under one hundred). Without partitioned tables, the optimizer must go through a lot of work since each table of the view could be indexed differently.
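A minimal sketch of such a local partitioned view, assuming a reduced column list and hypothetical member names and ranges (the real tables would carry all of the original columns):
CREATE TABLE dbo.HistorySampleValues_P1
(
[HistoryParameterID] [int] NOT NULL
CONSTRAINT CK_HSV_P1 CHECK (HistoryParameterID BETWEEN 1 AND 1000),
[SourceTimeStamp] [datetime2](7) NOT NULL,
[DoubleValue] [float] SPARSE NULL
);
CREATE TABLE dbo.HistorySampleValues_P2
(
[HistoryParameterID] [int] NOT NULL
CONSTRAINT CK_HSV_P2 CHECK (HistoryParameterID BETWEEN 1001 AND 2000),
[SourceTimeStamp] [datetime2](7) NOT NULL,
[DoubleValue] [float] SPARSE NULL
);
GO
-- The CHECK constraints let the optimizer skip member tables that cannot
-- contain the requested HistoryParameterID range.
CREATE VIEW dbo.HistorySampleValues_PV
AS
SELECT HistoryParameterID, SourceTimeStamp, DoubleValue FROM dbo.HistorySampleValues_P1
UNION ALL
SELECT HistoryParameterID, SourceTimeStamp, DoubleValue FROM dbo.HistorySampleValues_P2;
GO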
I would not expect inserts to slow down with table size if HistoryParameterID is incremental. However, in the case of a random value, inserts will become progressively slower as the table size grows due to lower buffer cache efficiency. That problem will exist with a single table, partitioned table, or partitioned view. See http://www.dbdelta.com/improving-uniqueidentifier-performance/ for an example using a guid but the issue applies to any random key value.
You might try a single table with SourceTimeStamp alone as the clustered index key and a non-clustered index on HistoryParameterID and SourceTimeStamp. That would provide the best insert performance, and the non-clustered index (maybe with included columns) might be good enough for your select queries.
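A sketch of that single-table alternative (index names and the INCLUDE list are hypothetical; the existing clustered index Source_HistParameterID_Index would have to be dropped first):
CREATE CLUSTERED INDEX CIX_HistorySampleValues_SourceTimeStamp
ON dbo.HistorySampleValues (SourceTimeStamp ASC);

CREATE NONCLUSTERED INDEX IX_HistorySampleValues_HistParam_SourceTS
ON dbo.HistorySampleValues (HistoryParameterID ASC, SourceTimeStamp ASC)
INCLUDE (ValueStatus, DoubleValue); -- add whatever columns your SELECTs actually need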
Everything you need is here. I hope you can figure it out.
http://msdn.microsoft.com/en-us/library/ms188730.aspx
And for Standard Edition, alternative solutions exist, like this answer.
And this is an interesting article too.
We also implemented this in our enterprise automation application, with custom indexing around the users table, and it worked well.
Here are the pros and cons of a custom implementation:
Pros:
Higher performance than a partitioned table, because of the application's awareness of its own logic.
Cons:
You have to implement the routing method and update the indexes yourself.
Decentralized data.

Table needs indexes to improve performance

I was having a timeout issue when specifying a long DateTime range in the query below (the query runs from a C# application). The table had 30 million rows, with a non-clustered index on ID (not a primary key).
I found that there was no primary key, so I recently made ID the primary key, and it's not giving me a timeout now. Can anyone help me create an index on more than one key for the query below, for the future? And should I remove the non-clustered index from this table and create one on more than one column? Data is increasing rapidly and performance needs improvement.
select
ID, ReferenceNo, MinNo, DateTime, DataNo from tbl1
where
DateTime BETWEEN '04/09/2013' AND '20/11/2013'
and ReferenceNo = 4 and MinNo = 3 and DataNo = 14 Order by ID
This is the CREATE script:
CREATE TABLE [dbo].[tbl1](
[ID] [int] IDENTITY(1,1) NOT NULL,
[ReferenceNo] [int] NOT NULL,
[MinNo] [int] NOT NULL,
[DateTime] [datetime] NOT NULL,
[DataNo] [int] NOT NULL,
CONSTRAINT [tbl1_pk] PRIMARY KEY CLUSTERED ([ID] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
It's hard to tell which index you should use without knowing more about your database and how it's used.
You may want to change the ID column to a clustered index. If ID is an identity column, you will get very few page splits while inserting new data. It will, however, require you to rebuild the table, and that may be a problem depending on your usage of the database; you will be looking at some downtime.
If you want a covering index it should look something like this:
CREATE NONCLUSTERED INDEX [MyCoveringIndex] ON tbl1
(
[ReferenceNo] ASC,
[MinNo] ASC,
[DataNo] ASC,
[DateTime] ASC
)
There is no need to include ID as a key column, as it is already in the clustered index (clustered index key columns are included in all other indexes). This will, however, use up a fair amount of space (somewhere in the range of 1 GB if the columns above are of types int and datetime). It will also affect your insert, update and delete performance on the table, in most cases negatively.
You can create the index in online mode if you are using Enterprise Edition of SQL Server. In all other cases there will be a lock on the table while creating the index.
It's also hard to know what other queries are made against the table. You may want to tweak the order of the columns in the index to better match other queries.
Indexing all fields would be fastest, but would likely waste a ton of space. I would guess that a date index would provide the most benefit with the least storage cost, because the data is probably spread evenly over a long period of time. If the MIN() and MAX() dates are close together, this will not be as effective:
CREATE NONCLUSTERED INDEX [IDX_1] ON [dbo].[tbl1] (
[DateTime] ASC
)
GO
As a side note, you can use SSMSE's "Display Estimated Execution Plan", which will show you what the DB needs to do to get your data. It will suggest missing indexes and also provide CREATE INDEX statements. These suggestions can be quite wasteful, but they will give you an idea of what is taking so long. This option is in the Standard toolbar, four icons to the right of "Execute".
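If you prefer a query over the GUI, the same missing-index information can be read from the missing-index DMVs; this is only a sketch, and the suggestions deserve the same scepticism as the SSMS ones:
SELECT d.statement AS table_name,
d.equality_columns,
d.inequality_columns,
d.included_columns,
s.user_seeks,
s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.avg_user_impact DESC;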

SQL Server 2005 XML data type

UPDATE: This issue is not related to the XML; I duplicated the table using an nvarchar(MAX) column instead and still have the same issue. I will repost a new topic.
I have a table with about a million records; the table has an XML field. The query is running extremely slowly, even when selecting just an ID. Is there anything I can do to increase the speed of this? I have tried setting "text in row" on, but SQL Server will not allow me to; I receive the error "Cannot switch to in row text in table".
I would appreciate any help in a fix or knowledge that I seem to be missing.
Thanks
TABLE
/****** Object: Table [dbo].[Audit] Script Date: 08/14/2009 09:49:01 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Audit](
[ID] [int] IDENTITY(1,1) NOT NULL,
[ParoleeID] [int] NOT NULL,
[Page] [int] NOT NULL,
[ObjectID] [int] NOT NULL,
[Data] [xml] NOT NULL,
[Created] [datetime] NULL,
CONSTRAINT [PK_Audit] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
QUERY
DECLARE @ID int
SET @ID = NULL
DECLARE @ParoleeID int
SET @ParoleeID = 158
DECLARE @Page int
SET @Page = 2
DECLARE @ObjectID int
SET @ObjectID = 93
DECLARE @Created datetime
SET @Created = NULL
SET NOCOUNT ON;
SELECT TOP 1 [Audit].* FROM [Audit]
WHERE
(@ID IS NULL OR Audit.ID = @ID) AND
(@ParoleeID IS NULL OR Audit.ParoleeID = @ParoleeID) AND
(@Page IS NULL OR Audit.Page = @Page) AND
(@ObjectID IS NULL OR Audit.ObjectID = @ObjectID) AND
(@Created IS NULL OR (Audit.Created > @Created AND Audit.Created < DATEADD(d, 1, @Created)))
You need to create a primary XML index on the column. Above anything else, having this will assist all your queries.
Once you have this, you can create secondary XML indexes on the XML data.
From experience, though, if you can store some information in relational columns, SQL Server is much better at searching and indexing those than XML. I.e., any key columns and commonly searched data should be stored relationally where possible.
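For the Audit table above, that would look roughly like this (the index names are hypothetical; the primary XML index requires the clustered primary key the table already has):
CREATE PRIMARY XML INDEX PXML_Audit_Data
ON dbo.Audit (Data);
GO
-- Optional secondary XML index; FOR PATH is one of PATH / VALUE / PROPERTY,
-- chosen to match how you query the XML.
CREATE XML INDEX IXML_Audit_Data_Path
ON dbo.Audit (Data)
USING XML INDEX PXML_Audit_Data
FOR PATH;
GO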
Sql Server 2005 – Twelve Tips For Optimizing Query Performance by Tony Wright
Turn on the execution plan, and statistics
Use Clustered Indexes
Use Indexed Views
Use Covering Indexes
Keep your clustered index small.
Avoid cursors
Archive old data
Partition your data correctly
Remove user-defined inline scalar functions
Use APPLY
Use computed columns
Use the correct transaction isolation level
http://tonesdotnetblog.wordpress.com/2008/05/26/twelve-tips-for-optimising-sql-server-2005-queries/
I had the very same scenario - and the solution in our case is computed columns.
For those bits of information that you need frequently from your XML, we created a computed column on the "hosting" table, which basically reaches into the XML and pulls out the necessary value from the XML using XPath. In most cases, we're even able to persist this computed column, so that it becomes part of the table and can be queried and even indexed and query speed is absolutely no problem anymore (on those columns).
We also tried XML indices in the beginning, but their disadvantage is the fact that they're absolutely HUGE on disk - this may or may not be a problem. Since we needed to ship back and forth the whole database frequently (as a SQL backup), we eventually gave up on them.
OK, to set up a computed column that retrieves those bits of information from your XML, you first need to create a stored function, which will take the XML as a parameter, extract whatever information you need, and then pass that back - something like this:
CREATE FUNCTION dbo.GetShopOrderID(@ShopOrder XML)
RETURNS VARCHAR(100)
AS BEGIN
DECLARE @ShopOrderID VARCHAR(100)
SELECT
@ShopOrderID = @ShopOrder.value('(ActivateOrderRequest/ActivateOrder/OrderHead/OrderNumber)[1]', 'varchar(100)')
RETURN @ShopOrderID
END
Then, you'll need to add a computed column to your table and connect it to this stored function:
ALTER TABLE dbo.YourTable
ADD ShopOrderID AS dbo.GetShopOrderID(ShopOrderXML) PERSISTED
Now, you can easily select data from your table using this new column, as if it were a normal column:
SELECT (fields) FROM dbo.YourTable
WHERE ShopOrderID LIKE 'OSA%'
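Since the column is persisted, it can also be indexed like any ordinary column (the index name is hypothetical; note that persisting or indexing a column computed from a user-defined function generally requires the function to be created WITH SCHEMABINDING):
CREATE NONCLUSTERED INDEX IX_YourTable_ShopOrderID
ON dbo.YourTable (ShopOrderID);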
Best of all - whenever you update your XML, all the computed columns are updated as well - they're always in sync, no triggers or other black magic needed!
Marc
Some information like the query you run, the table structure, the XML content etc would definitely help. A lot...
Without any info, I will guess: the query is running slow when selecting just an ID because you don't have an index on ID.
Updated
There are at least a few serious problems with your query.
Unless an ID is provided, the table can only be scanned end-to-end, because there are no other indexes.
Even if an ID is provided, the condition (@ID IS NULL OR ID = @ID) is not guaranteed to be SARGable, so it may still result in a table scan.
And most importantly: the query will generate a plan 'optimized' for the first set of parameters it sees. It will reuse this plan for any combination of parameters, no matter which are NULL or not. That would make a difference if there were some variation in the access path to choose from (i.e. indexes), but as it is now, the query can only choose between a scan and a seek if @ID is present. Due to the way it is constructed, it will pretty much always choose a scan because of the OR.
With this table design your query will run slow today, slower tomorrow, and impossibly slow next week as the size increases. You must look back at your requirements, decide which fields are important to query on, index them, and provide separate queries for them. OR-ing together all possible filters like this is not going to work.
The XML you're trying to retrieve has absolutely nothing to do with the performance problem. You are simply brute-forcing a table scan and expecting SQL Server to magically find the records you want.
So if you want to retrieve a specific ParoleeID, Page and ObjectID, you index the fields you search on and run a query for those and only those:
CREATE INDEX idx_Audit_ParoleeID ON Audit(ParoleeID);
CREATE INDEX idx_Audit_Page ON Audit(Page);
CREATE INDEX idx_Audit_ObjectID ON Audit(ObjectID);
GO
DECLARE @ParoleeID int
SET @ParoleeID = 158
DECLARE @Page int
SET @Page = 2
DECLARE @ObjectID int
SET @ObjectID = 93
SET NOCOUNT ON;
SELECT TOP 1 [Audit].* FROM [Audit]
WHERE Audit.ParoleeID = @ParoleeID
AND Audit.Page = @Page
AND Audit.ObjectID = @ObjectID;

SQL Server speed

I had previously posted a question about my query speed with an XML column. After some further investigation, I have found that it is not the XML as previously thought. The table schema and query are very simple. There are over 800K rows; everything was running smoothly, but now with the increase in records it is taking almost a minute to run.
The Table:
/****** Object: Table [dbo].[Audit] Script Date: 08/14/2009 09:49:01 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Audit](
[ID] [int] IDENTITY(1,1) NOT NULL,
[PID] [int] NOT NULL,
[Page] [int] NOT NULL,
[ObjectID] [int] NOT NULL,
[Data] [nvarchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Created] [datetime] NULL,
CONSTRAINT [PK_Audit] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The query:
SELECT *
FROM Audit
WHERE PID = 158
AND Page = 2
AND ObjectID = 93
The query only returns 26 records, and the interesting thing is that if I add "TOP 26" the query executes in less than a second, but if I change it to "TOP 27" it takes a minute. Even if I change the query to SELECT ID only, it does not matter.
Any help is appreciated
You have an index on ID, but your query is using other columns instead. Therefore, you're probably getting a full table scan. Changing to SELECT ID makes no difference because it's not anywhere in the WHERE clause.
It's quick when you ask for TOP 26 because it can quit once it finds 26 rows, since you don't have any ORDER BY clause. Changing it to TOP 27 means that, once it finds the first 26 (which are all the matches, according to your post), it can't quit looking; it has to continue searching until it either finds a 27th matching row or reaches the end of the data.
A SHOW PLAN would have shown you the problem pretty quickly.
Add an index for the PID, Page and ObjectID fields.
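For example, something along these lines (the index name is hypothetical, and the column order assumes all three are always supplied as equality predicates):
CREATE NONCLUSTERED INDEX IX_Audit_PID_Page_ObjectID
ON dbo.Audit (PID, Page, ObjectID);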
Why not add a covering index for the Page and Object ID columns and call it a day?
I think you should add non-unique indexes to the columns you want to search on. Indexing will certainly reduce the search time. Whether you request a single column or multiple columns in the SELECT list makes no difference; it is the row-by-row comparison time that indexing reduces.
The 26 rows are probably near the start of the table: when you scan, you find them fast and abort the rest of the scan; when looking for a 27th that doesn't exist, you scan the entire table, which is slow!
When looking for these types of problems, try this from Management Studio:
Run: SET SHOWPLAN_ALL ON
Then run your query and look for the word "SCAN"; that is most likely where your query is running slow. Figure out why no index is being used.
In this case you need to add an index. I generally add an index based on how I query the data: if you always have one of the three (PID, Page, and ObjectID), add an index with that column first; add another column to that index if you sometimes have that value too, etc.
