Why sort on a sorted non-clustered index field? - sql-server

Say I have a table with ID, Name, and Date.
And I have a non-clustered index like,
CREATE NONCLUSTERED INDEX IX_Test_NameDate ON [dbo].[Test] (Name, Date)
When I run the query,
select
[Name], [Date]
from
[dbo].[Test] WITH (INDEX(IX_Test_NameDate))
where
[Name] like 'A%'
order by
[Date] asc
I get in SQL Server's execution plan,
Select <-- Sort <-- Index Seek (NonClustered)
Why the sort? Isn't the date already sorted in the non-clustered index? What would a better non-clustered index look like that doesn't require a sort (only an index seek)?
(Can't use a clustered index as this example is a condensed version of a bigger example with multiple rows/indexes).
For example, I get the execution plan (with sort) for a table that looks like this,
ID Name Date
1 A 2014-01-01
2 A 2014-02-01
3 A 2014-03-01
4 A 2014-04-01
5 B 2014-01-01
6 B 2014-02-01
7 B 2014-03-01
8 B 2014-04-01
9 B 2014-05-01
10 B 2014-06-01
Shouldn't the dates be sorted in this case?

No, the Date column is not "already sorted in the non-clustered index", at least, not by itself. It is sorted after Name.
Consider the following trivial table data:
Name Date
----- --------
Allen 1/1/2014
Barb 1/1/2013
Charlie 1/1/2015
Darlene 1/1/2012
Ernie 1/1/2016
Faith 1/1/2011
Once you've sorted by Name, the Date columns are potentially out of order. Dates are guaranteed in order only for rows that have the same Name.
Your goals are at cross-purposes to each other. You want multiple names--so the data is best ordered by name so that the seek is possible, but then you want to sort by Date. How would you propose storing the above six-row table so that it is sorted by Date for every possible range of names?
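For instance, if the predicate were an equality on a single name, the existing index could in principle return the rows already ordered by Date, so no sort operator should be needed (a sketch using the table and index from the question):
SELECT [Name], [Date]
FROM [dbo].[Test]
WHERE [Name] = 'A'
ORDER BY [Date] ASC;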
If there is some kind of regularity or pattern about the ranges of names (perhaps, for example, you always pull names by first letter only) then there is a possible workaround.
ALTER TABLE dbo.Test ADD NamePrefix AS (Left(Name, 1)) PERSISTED;
CREATE NONCLUSTERED INDEX IX_Test_NamePrefix_Date ON dbo.Test (NamePrefix, Date);
Now this query theoretically should not need to perform the sort:
SELECT Name, Date
FROM dbo.Test
WHERE NamePrefix = 'A'
ORDER BY Date;
Be aware that there are some likely gotchas with adding a persisted computed column like this: increased data size, the fact that such a design is rarely the right choice, and the problems a proliferation of computed columns would bring, among others.
P.S. It is generally not best practice to force indexes manually--let the optimizer choose.

Related

Cassandra best practice to ORDER BY using PRIMARY KEY

Originally I had a cassandra table like this:
CREATE TABLE table (
open_time timestamp,
open double,
close double,
high double,
low double,
volume bigint,
PRIMARY KEY(open_time));
open_time | close | high | low | open | volume
---------------------------------+--------+--------+-------+--------+--------
2020-08-05 06:00:00.000000+0000 | 181.53 | 184.32 | 181.1 | 184.32 | 100
2020-08-04 06:00:00.000000+0000 | 181.53 | 184.32 | 181.1 | 184.32 | 100
I need to perform a query to get the latest open_time. After noticing that queries like
SELECT open_time FROM table ORDER BY open_time DESC LIMIT 1;
are not allowed, I wonder what's the best practice here.
My idea is to add an id column so that I can use open_time as the clustering order. Something like:
CREATE TABLE table (
id int,
open_time timestamp,
open double,
close double,
high double,
low double,
volume bigint,
PRIMARY KEY(id, open_time)
)
WITH CLUSTERING ORDER BY (open_time DESC);
Is this a valid solution to get the job done, or are there better ways, e.g. something without an extra id column, because I would never query over the id itself?
Most queries would be something like:
SELECT * FROM table WHERE open_time >= '2013-01-01 00:00:00+0200' AND open_time <= '2013-08-13 23:59:00+0200';
Thanks!
CLUSTERING ORDER enforces the on-disk sort order within each partition. So ordering by the same key that you're partitioning on isn't possible. Partitioning by id will face a similar challenge, in that the CLUSTERING ORDER BY open_time will only be enforced within each id.
I wonder what's the best practice here.
Models like these are usually solved by time bucketing, as I mentioned in an answer to a similar question earlier today. To select the best "bucket," you'll need to understand your business case (such as the number of entries per day) as well as the query requirements.
For the sake of example, let's say that month would work the best. If each row contained a value of 'YEAR-MONTH', the PK definition would look like this:
PRIMARY KEY (month_bucket,open_time))
WITH CLUSTERING ORDER BY (open_time DESC);
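For illustration, a fuller sketch of a bucketed table along these lines (the table name ohlc_by_month and the text type for month_bucket are assumptions; the other columns come from the original definition):
CREATE TABLE ohlc_by_month (
    month_bucket text,
    open_time timestamp,
    open double,
    close double,
    high double,
    low double,
    volume bigint,
    PRIMARY KEY (month_bucket, open_time)
) WITH CLUSTERING ORDER BY (open_time DESC);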
Then, you could support a query like this:
SELECT * FROM table
WHERE month_bucket = '2013-08'
AND open_time >= '2013-08-01 00:00:00+0200' AND open_time <= '2013-08-13 23:59:00+0200';
Likewise, querying the most recent entry would only require the most recent (current?) month as a parameter:
SELECT * FROM table
WHERE month_bucket = '2020-08'
LIMIT 1;
As the results are stored within each month_bucket sorted by open_time in descending order, that query would return the most-recent entry.
I wrote an article on this for DataStax (several years ago) which is relevant to this problem. It's been moved to a new part of their site, which hosed the formatting, but the content is definitely there. Give it a read; hope it helps: We Shall Have Order!
If id is the partition key, it must be included in the WHERE clause; otherwise the query would need ALLOW FILTERING.
You can try querying with SELECT MAX(open_time) ..., or you can use id as above, incremented with every record; as a result, the id with the highest value will always have the latest record.

SQL Optimize Group By Query

I have a table here with following fields:
Id, Name, Kind, Date
Data:
id name kind date
1 Thomas 1 2015-01-01
2 Thomas 1 2015-01-01
3 Thomas 2 2014-01-01
4 Kevin 2 2014-01-01
5 Kevin 2 2014-01-01
6 Kevin 2 2014-01-01
7 Kevin 2 2014-01-01
8 Sasha 1 2014-01-01
I have an SQL statement like this:
Select name,kind,Count(*) AS RecordCount
from mytable
group by kind, name
I want to know how many records there are for each name and kind. Expected results:
name kind count
Thomas 1 2
Thomas 2 1
Kevin 2 4
Sasha 1 1
The problem is that it is a big table, with more than 50 Million records.
Also I'd like to know the result within the last hour, last day, last week, and so on, for which I need to add a WHERE clause like this:
Select name,kind,Count(*) AS RecordCount
from mytable
WHERE Date > '2015-07-26'
group by kind, name
I use T-SQL with the SQL Server Management Studio. All of the relevant columns have a non clustered index and the primary key is a clustered index.
Does somebody have ideas how to make this faster?
Update:
The execution plan says:
Select, Compute Scalar, Stream Aggregate, Sort, Parallelism: 0% costs.
Hash Match (Partial Aggregate): 12%.
Clustered Index Scan: 88%
Sorry, I forgot to check the SQL-statements.
50 million is just a lot of rows.
There's not much you can do to optimize that query that I can see.
Possibly a composite index on kind, name.
Or try name, kind (see the sketch below).
Or name only.
I think the query optimizer is smart enough for this not to be a factor, but switch the GROUP BY to name, kind, since name is more selective.
If kind is not very selective (just 1 and 2) then you may be better off with no index on it.
I'd also defragment the indexes you have.
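For illustration, a sketch of one of the suggested composite indexes (the index name is an assumption; the table and column names come from the question):
CREATE NONCLUSTERED INDEX IX_mytable_name_kind ON mytable (name, kind);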
To query the last day is no big deal because you already have a date column on which you can put an index.
For last week I would create a separate date table (sketched below) which contains one row per day, with columns id, date, week.
You have to pre-calculate the week. And now if you want to query a specific week, you can look in the date table, get the dates, and query only those dates from your table mytable.
You should test whether it is more performant to join on the date columns or to put the id column in mytable and join on id. For big tables, id might be the better choice.
To query the last hour you could add the column [hour] to mytable and query it in combination with the date.
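For illustration, a minimal sketch of such a date table and a week query (the table name, column definitions, and week value are assumptions):
CREATE TABLE DateDim (
    id INT NOT NULL PRIMARY KEY,
    [date] DATE NOT NULL,
    [week] INT NOT NULL
);

SELECT m.name, m.kind, COUNT(*) AS RecordCount
FROM mytable AS m
INNER JOIN DateDim AS d ON d.[date] = m.[Date]
WHERE d.[week] = 30 -- hypothetical week number
GROUP BY m.kind, m.name;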

Why this query is running so slow?

This query runs very fast (<100 msec):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer does not seem to use it. I tried using WITH (INDEX(Ix_Sms_Time)), but to my surprise it takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is the same for both queries. The tables mentioned here have 5M to 8M rows (indexes are < 1% fragmented and stats are updated). I am using MS SQL Server 2008 R2 on a 16-core, 32 GB memory Windows 2008 R2 machine.
Does it help when you force the time filter to kick in only after the client filter has run?
For instance, like in this example:
;WITH ClientData AS (
SELECT
[E2].[CompanyId]
,[E2].[Time]
,[E1].[Id]
,[E1].[Status]
FROM [dbo].[SplittedSms] AS [E1]
INNER JOIN [dbo].[Sms] AS [E2]
ON [E1].[SmsId] = [E2].[Id]
WHERE [E2].[CompanyId] = 4563
AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT TOP 10
[CompanyId]
,[Id]
,[Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
CompanyID
Time
You may or may not need to add Id as an Included Column.
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' into equivalent data-type, so that it can use the index.
DECLARE @test datetime
SET @test = '2015-04-10'
Then modify your condition:
[Extent2].Time > @test
SQL Server implicitly casts to the matching data type if there is a data-type mismatch, and any function or cast operation on the column can prevent it from using indexes.
I'm on the same track as @JonTirjan; the index with just Time results in a lot of key lookups, so you should try at least the following:
create index xxx on Sms (Time, CompanyId) include (Id)
or
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index, then it's not needed in the INCLUDE clause. If a significant part of your data belongs to CompanyId 4563, it might be OK to have it as an included column too.
The percentages you see in the actual plan are just estimates based on row count assumptions, so they are sometimes totally wrong. Looking at the actual number of rows / executions plus STATISTICS IO output should give you an idea of what's actually happening.
Two things come to mind:
By adding an extra restriction it will be 'harder' for the database to find the first 10 items that match your restrictions. Finding the first 10 rows from, let's say, 10,000 matching items (out of a total of 1 million) is easier than finding the first 10 rows from maybe 100 matching items (out of a total of 1 million).
The index is probably not being used because it is created on a datetime column, which is not very efficient if you are also storing the time of day in it. You might want to create a clustered index on the [Time] column (but then you would have to remove the clustered index which is now on the [CompanyId] column), or you could create a computed column which stores the date part of the [Time] column, create an index on this computed column, and filter on this column.
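For illustration, a sketch of the computed-column variant from the last suggestion (the column and index names are assumptions):
ALTER TABLE dbo.Sms ADD TimeDate AS (CONVERT(date, [Time])) PERSISTED;
CREATE NONCLUSTERED INDEX IX_Sms_TimeDate ON dbo.Sms (TimeDate);
The query could then filter on TimeDate instead of applying a function to [Time].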
I found out that there was no index on the foreign key column (SmsId) on the SplittedSms table. I made one and it seems the second query is almost as fast as the first one now.
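For reference, a sketch of such an index (the index name is an assumption):
CREATE NONCLUSTERED INDEX IX_SplittedSms_SmsId ON dbo.SplittedSms (SmsId);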
The execution plan now:
Thanks everyone for the effort.

Putting clustered index on a join used column vs heavily scanned column?

I have these simple tables:
Table Users
userId | name
---------------------
1 'a1'
2 'a2'
3 'a3'
4 'a4'
5 'a5'
Table Cities
cityId | name
---------------------
1 'c1'
2 'c2'
3 'c3'
4 'c4'
5 'c5'
Each user can be in more than one city.
So the mapping table is:
userId | CityId
------------------------------------
1 4
1 4
1 4
2 5
5 6
Table Users is heavily scanned by name.
Question:
For the mapping table I have no issues: both columns together form the primary/clustered index.
But I'm struggling with the first two tables:
I think that Users should have the userId column as the primary key. Why? Because it is used through the join to the mapping table.
But I also need a clustered index on the name column, because this table is heavily scanned by name.
(Leave aside the uniqueness problem; let's say all columns are unique.)
What is the best-practice decision for this case?
The best decision depends on how exactly you use the data returned by a query.
A clustered index means that the data in the data pages is ordered based on this index.
A regular index has its own pages to order the index plus a pointer to the physical row.
Thus a clustered index serves better for queries that return a range of values instead of unique rows.
So, unless you do a lot of queries with LIKE operations on the Name column, you would be better off keeping your clustered index on the ID column, as this index will be constantly used to return record sets that support your join operations.
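For illustration, a sketch of that arrangement (the column types and index names are assumptions):
CREATE TABLE Users (
    userId INT NOT NULL,
    name NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (userId)
);
CREATE NONCLUSTERED INDEX IX_Users_Name ON Users (name);
A non-clustered index on name still lets the name scans use an index, while the join to the mapping table benefits from the clustered key on userId.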

T-SQL rolling twelve month per day performance

I have checked similar problems, but none have worked well for me. The most useful was http://forums.asp.net/t/1170815.aspx/1, but the performance makes my query run for hours and hours.
I have 1.5 million records based on product sales (about 10k products) over 4 years. I want to have a table that contains date, product and rolling twelve months' sales.
This query (from the link above) works, and shows what I want, but the performance makes it useless:
select day_key, product_key, price, (select sum(price) as R12 from #ORDER_TURNOVER as tb1 where tb1.day_key <= a.day_key and tb1.day_key > dateadd(mm, -12, a.day_key) and tb1.product_key = a.product_key) as RSum into #hejsan
from #ORDER_TURNOVER as a
I tried a rolling sum cursor function for all records, which was fast as lightning, but I couldn't get it to sum only the sales over the last 365 days.
Any ideas on how to solve this problem are much appreciated.
Thank you.
I'd change your setup slightly.
First, have a table that lists all the product keys that are of interest...
CREATE TABLE product (
product_key INT NOT NULL,
price INT,
some_fact_data VARCHAR(MAX),
what_ever_else SOMEDATATYPE,
PRIMARY KEY CLUSTERED (product_key)
)
Then, I'd have a calendar table, with each individual date that you could ever need to report on...
CREATE TABLE calendar (
date SMALLDATETIME,
is_bank_holiday INT,
what_ever_else SOMEDATATYPE,
PRIMARY KEY CLUSTERED (date)
)
Finally, I'd ensure that your data table has a covering index on all the relevant fields...
CREATE INDEX IX_product_day ON #ORDER_TURNOVER (product_key, day_key)
This would then allow the following query...
SELECT
product.product_key,
product.price,
calendar.date,
SUM(data.price) AS RSum
FROM
product
CROSS JOIN
calendar
INNER JOIN
#ORDER_TURNOVER AS data
ON data.product_key = product.product_key
AND data.day_key > dateadd(mm, -12, calendar.date)
AND data.day_key <= calendar.date
GROUP BY
product.product_key,
product.price,
calendar.date
By doing everything in this way, each product/calendar_date combination will then relate to a set of records in your data table that are all consecutive to each other. This will make looking up the data to be aggregated much, much simpler for the optimiser.
[Requires a single index, specifically in the order (product, date).]
If you have the index the other way around, it is actually much harder...
Example data:
(product, date) index          (date, product) index
product | date                 date       | product
--------+------------          -----------+---------
A       | 01/01/2012           01/01/2012 | A
A       | 02/01/2012           01/01/2012 | B
A       | 03/01/2012           02/01/2012 | A
B       | 01/01/2012           02/01/2012 | B
B       | 02/01/2012           03/01/2012 | A
B       | 03/01/2012           03/01/2012 | B
On the left you just get all the records that are next to each other in a 365-day block.
On the right you have to search for each date before you can aggregate. Each search is relatively simple, but you do 365 of them, much more work than the version on the left.
This is how one does "running totals" / "sums over subsets" in SQL Server 2005-2008. In SQL 2012 there is native support for running totals, but we are all still working with 2005-2008 databases.
SELECT day_key ,
product_key ,
price ,
( SELECT SUM(price) AS R12
FROM #ORDER_TURNOVER AS tb1
WHERE tb1.day_key <= a.day_key
AND tb1.day_key > DATEADD(mm, -12, a.day_key)
AND tb1.product_key = a.product_key
) AS RSum
INTO #hejsan
FROM #ORDER_TURNOVER AS a
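For comparison, a hedged sketch of the SQL Server 2012+ windowed syntax mentioned above; note this computes a cumulative running total per product, not the 12-month rolling window itself:
SELECT day_key,
       product_key,
       price,
       SUM(price) OVER (PARTITION BY product_key
                        ORDER BY day_key
                        ROWS UNBOUNDED PRECEDING) AS running_total
FROM #ORDER_TURNOVER;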
A few suggestions.
You could pre-calculate the running totals so that they are not calculated again and again. What you are doing in the above select is a disguised loop and not a set-based query (unless the optimizer can convert the subquery to a join).
The above solution requires a few changes to the code.
Another solution that you can certainly try is to create a clustered index on your #ORDER_TURNOVER temp table. This is safer because it's a local change.
CREATE CLUSTERED INDEX IndexName
ON #ORDER_TURNOVER (day_key, product_key)
All three expressions in your WHERE clause are SARGs, so chances are good that the optimizer will now do a seek instead of a scan.
If the index solution does not give enough performance gains, it's well worth investing in solution 1.
