Using indexes when comparing datetimes - sql-server

I have two tables, both of which contain millions of rows of data.
tbl_one:
purchasedtm DATETIME,
userid INT,
totalcost INT
tbl_two:
id BIGINT,
eventdtm DATETIME,
anothercol INT
The first table has a clustered index on the first two columns: CLUSTERED INDEX tbl_one_idx ON(purchasedtm, userid)
The second one has a primary key on its ID column, and also a non-clustered index on the eventdtm column.
I want to run a query which looks for rows in which purchasedtm and eventdtm are on the same day.
Originally, I wrote my query as:
WHERE CAST(tbl_one.purchasedtm AS DATE) = CAST(tbl_two.eventdtm AS DATE)
But this was not going to use either of the two indexes.
Later, I changed my query to this:
WHERE tbl_one.purchasedtm >= CAST(tbl_two.eventdtm AS DATE)
AND tbl_one.purchasedtm < DATEADD(DAY, 1, CAST(tbl_two.eventdtm AS DATE))
This way, because only one side of the comparison is wrapped in a function, the other side can still use its index. Correct?
I also have some additional questions:
I can write the query the other way around too, i.e. keeping tbl_two.eventdtm untouched and wrapping tbl_one.purchasedtm in CAST() instead (see the sketch after this list). Would that make a difference in performance?
If the answer to the previous question is yes, is it because eventdtm has its own dedicated index, while looking up purchasedtm would only be a partial index match?
Are there other factors I can take into consideration for deciding which of the two choices is better? (For example, if there are millions of rows in tbl_one but billions of rows in tbl_two, would that impact which column I should CAST and which one I should not?)
In general, if we compare two columns that are both indexed, would we gain any performance compared to a similar scenario in which only one of them is indexed?
And lastly, can I perform my original task without using CAST?
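For reference, the "other way around" variant from the first follow-up question would look like this (just a sketch mirroring the rewrite above):
WHERE tbl_two.eventdtm >= CAST(tbl_one.purchasedtm AS DATE)
AND tbl_two.eventdtm < DATEADD(DAY, 1, CAST(tbl_one.purchasedtm AS DATE))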
Note: I do not have the ability to create or modify indexes, add columns, etc.

A little late after commenting, but...
As discussed in the comments, code such as CAST(DateTimeColumn AS date) is actually SARGable. Rob Farley posted an article on some of the SARGable and non-SARGable functionality here; however, I'll cover a few things anyway.
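For example, against a table shaped like the OP's tbl_one, a predicate such as the one below can still produce a seek on the clustered index (purchasedtm, userid), because casting DATETIME to DATE preserves ordering and the optimizer builds a range seek for it (a minimal sketch; the variable and value are illustrative):
DECLARE @day DATE = '20230601'; -- illustrative value
SELECT userid, totalcost
FROM tbl_one
WHERE CAST(purchasedtm AS DATE) = @day;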
Firstly, applying a function to a column will normally make your query non-SARGable, especially if it changes the order of the values or makes their order meaningless. Take something like:
SELECT *
FROM TABLE
WHERE RIGHT(COLUMN,5) = 'value';
The order of the values in the column is utterly unhelpful here, as we're focusing on the right-hand characters. Unfortunately, as Rob also discusses:
SELECT *
FROM TABLE
WHERE LEFT(COLUMN,5) = 'value';
This is also non-SARGable. However, what about the following?
SELECT *
FROM TABLE
WHERE Column LIKE 'value%';
This one is SARGable, as the logic isn't applied to the column and the order doesn't change. If the value were '%value%', then that too would be non-SARGable.
When applying logic that adds to (or subtracts from) what you want to find, you always want to apply that logic to the literal value (or function, like GETDATE()). For example, one of these expressions is SARGable, the other is not:
Column + 1 = @Variable --non-SARGable
Column = @Variable - 1 --SARGable
The same applies to things like DATEADD:
@DateVariable BETWEEN DateColumn AND DATEADD(DAY, 30, DateColumn) --non-SARGable
DateColumn BETWEEN DATEADD(DAY, -30, @DateVariable) AND @DateVariable --SARGable
Changing the datatype (other than to a date) will rarely keep a query SARGable. CONVERT(date, varchardate, 112) will not be SARGable, even though the order of the column is unchanged. Converting a decimal to an int, however, has the same result as converting a datetime to a date, and keeps SARGability:
CREATE TABLE testtab (n decimal(2,1) PRIMARY KEY CLUSTERED);
INSERT INTO testtab
VALUES(0.1),
(0.3),
(1.1),
(1.7),
(2.4);
GO
SELECT n
FROM testtab
WHERE CONVERT(int,n) = 2;
GO
DROP TABLE testtab;
Hopefully, that gives you enough to go on, but please do ask if you want me to add anything further.

Related

T-SQL Order By - data type precedence not working as expected

I've been puzzling over this problem for days now, and have just identified the source of my woes - an order by clause is not working as expected.
The script goes like this:
select * from my_table
order by change_effective_date, unique_id desc
change_effective_date is a datetime field, and unique_id is an int field.
I had expected this to give me the most recent row first (i.e. the row with the highest value in change_effective_date). However, it was giving the oldest row first, and the unique_id was also in ascending order (these IDs are normally sequential, so I would generally expect them to follow the same order as the dates anyway, though this is not completely reliable).
Puzzled, I turned to Google and found that data type precedence can affect order by clauses, with lower-ranking datatypes being converted to the higher-ranking datatype: https://blog.sqlauthority.com/2010/10/08/sql-server-simple-explanation-of-data-type-precedence/
However, datetime takes precedence over int, so it shouldn't be affected in this way.
More curiously, if I take unique_id out of the order by clause, it sorts the data in descending date order perfectly. I do want to add a unique identifier to the order by clause, though, as there could be multiple rows with the same date and further on in the script I want to identify the most recent (in this case, the unique_id would be the tie-breaker as I would assume it to be sequential).
If anyone can explain what's happening here, I'd really appreciate it!
Thanks.
ASC/DESC applies to each column in the ORDER BY list individually, so your original query sorted change_effective_date ascending (the default) and only unique_id descending. Specify DESC for both columns:
select * from my_table
order by change_effective_date desc, unique_id desc

Convert Date Stored as VARCHAR into INT to compare to Date Stored as INT

I'm using SQL Server 2014. My request I believe is rather simple. I have one table containing a field holding a date value that is stored as VARCHAR, and another table containing a field holding a date value that is stored as INT.
The date value in the VARCHAR field is stored like this: 2015M01
The data value in the INT field is stored like this: 201501
I need to compare these tables against each other using EXCEPT. My thought process was to somehow extract or TRIM the "M" out of the VARCHAR value and see if it would let me compare the two. If anyone has a better idea such as using CAST to change the date formats or something feel free to suggest that as well.
I am also concerned that even extracting the "M" out of the VARCHAR may still prevent the comparison since one will still remain VARCHAR and the other is INT. If possible through a T-SQL query to convert on the fly that would be great advice as well. :)
REPLACE the string and then CONVERT to integer
SELECT A.*, B.*
FROM TableA A
INNER JOIN
(SELECT intField
FROM TableB
) as B
ON CONVERT(INT, REPLACE(A.varcharField, 'M', '')) = B.intField
Since you say you already have the query and are using EXCEPT, you can simply change the definition of that one "date" field in the query containing the VARCHAR value so that it matches the INT format of the other query. For example:
SELECT Field1, CONVERT(INT, REPLACE(VarcharDateField, 'M', '')) AS [DateField], Field3
FROM TableA
EXCEPT
SELECT Field1, IntDateField, Field3
FROM TableB
HOWEVER, while I realize that this might not be feasible, your best option, if you can make this happen, would be to change how the data in the table with the VARCHAR field is stored so that it is actually an INT in the same format as the table with the data already stored as an INT. Then you wouldn't have to worry about situations like this one.
Meaning:
Add an INT field to the table with the VARCHAR field.
Do an UPDATE of that table, setting the INT field to the string value with the M removed.
Update any INSERT and/or UPDATE stored procedures used by external services (app, ETL, etc) to do that same M removal logic on the way in. Then you don't have to change any app code that does INSERTs and UPDATEs. You don't even need to tell anyone you did this.
Update any "get" / SELECT stored procedures used by external services (app, ETL, etc) to do the opposite logic: convert the INT to VARCHAR and add the M on the way out. Then you don't have to change any app code that gets data from the DB. You don't even need to tell anyone you did this.
This is one of many reasons that having a Stored Procedure API to your DB is quite handy. I suppose an ORM can just be rebuilt, but you still need to recompile, even if all of the code references are automatically updated. Making a datatype change (or even moving a field to a different table, or replacing a field with a simple CASE statement) "behind the scenes", and masking it so that any code outside of your control doesn't know that a change happened, is not nearly as difficult as most people might think. I have done all of these operations (datatype change, moving a field to a different table, replacing a field with simple logic, etc.) and it buys you a lot of time until the app code can be updated. That might be handled by another team; maybe their schedule won't allow for making any changes in that area (plus testing) for three months. Ok, it will be there waiting for them when they are ready. And if there are several areas to update, they can be done one at a time. You can even create new stored procedures to run in parallel so that any updated app code can pass the proper INT datatype as the input parameter, and once all references to the VARCHAR value are gone, delete the original versions of those stored procedures.
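As a rough sketch of the first two steps (add the INT column and backfill it), using the TableA / VarcharDateField names from the EXCEPT example above; the new column name IntDateField is only an assumption for illustration:
-- Step 1: add the INT column alongside the existing VARCHAR one (assumed name).
ALTER TABLE dbo.TableA
ADD IntDateField INT NULL;
GO
-- Step 2: backfill it by stripping the 'M' and converting, same logic as above.
UPDATE dbo.TableA
SET IntDateField = CONVERT(INT, REPLACE(VarcharDateField, 'M', ''));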
If you want everything in the first table that is not in the second, you might consider something like this:
select t1.*
from t1
where not exists (select 1
from t2
where cast(replace(t1.varcharfield, 'M', '') as int) = t2.intfield
);
This should be close enough to EXCEPT for your purposes.
I should add that you might need to include other columns in the where statement. However, the question only mentions one column, so I don't know what those are.
You could create a persisted view on the table with the char column, with a calculated column where the M is removed. Then you could JOIN the view to the table containing the INT column.
CREATE VIEW dbo.PersistedView
WITH SCHEMABINDING
AS
SELECT ConvertedDateCol = CONVERT(INT, REPLACE(VarcharCol, 'M', ''))
--, other columns including the PK, etc
FROM dbo.TablewithCharColumn;
CREATE UNIQUE CLUSTERED INDEX IX_PersistedView
ON dbo.PersistedView(<the PK column>);
SELECT *
FROM dbo.PersistedView pv
INNER JOIN dbo.TableWithIntColumn ic ON pv.ConvertedDateCol = ic.IntDateCol;
If you provide the actual details of both tables, I will edit my answer to make it clearer.
A persisted view with a computed column will perform far better on the SELECT statement where you join the two columns compared with doing the CONVERT and REPLACE every time you run the SELECT statement.
However, a persisted view will slightly slow down inserts into the underlying table(s), and will prevent you from making DDL changes to the underlying tables.
If you're looking to not persist the values via a schema-bound view, you could create a non-persisted computed column on the table itself, then create a non-clustered index on that column. If you are using the computed column in WHERE or JOIN clauses, you may see some benefit.
By way of example:
CREATE TABLE dbo.PCT
(
PCT_ID INT NOT NULL
CONSTRAINT PK_PCT
PRIMARY KEY CLUSTERED
IDENTITY(1,1)
, SomeChar VARCHAR(50) NOT NULL
, SomeCharToInt AS CONVERT(INT, REPLACE(SomeChar, 'M', ''))
);
CREATE INDEX IX_PCT_SomeCharToInt
ON dbo.PCT(SomeCharToInt);
INSERT INTO dbo.PCT(SomeChar)
VALUES ('2015M08');
SELECT SomeCharToInt
FROM dbo.PCT;
Results:
201508

How can I speed up this sql server query?

-- Holds the last 30 valuation dates
create table #valdates(
date int
)
insert into #valdates
select distinct top (30) valuation_date
from tbsm.tbl_key_rates_summary
where valuation_date <= 20150529
order by valuation_date desc
select
sum(fv_change), sc_group, valuation_date
from
(select *
from tbsm.tbl_security_scorecards_summary
where valuation_date in (select date from #valdates)) as fact
join
(select *
from tbsm.tbl_security_classification
where sc_book = 'UC' ) as dim on fact.classification_id = dim.classification_id
group by
valuation_date, sc_group
drop table #valdates
This query takes around 40 seconds to return because the fact table has almost 13 million rows. Can I do anything about this?
Based on the fact that there's no proper index that supports the fetch, adding one is probably the easiest (or only) option to really improve the performance. Most likely an index like this would improve the situation a lot:
create index idx_security_scorecards_summary_1 on
tbl_security_scorecards_summary (valuation_date, classification_id)
include (fv_change)
Everything depends, of course, on how good the selectivity of the valuation_date and classification_id fields is (i.e. how big a portion of the table needs to be fetched), and it might work better with the fields in the opposite order. The fv_change field is in the include section so that it's included in the index structure and there's no need to fetch it from the base table.
Included fields help if SQL Server has to fetch a lot of rows from the table. If the number of rows this touches is small, they might not help at all. As always with indexing, this of course slows down inserts / updates and is optimized for this case only, so you should look at the bigger picture too.
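If classification_id turns out to be the more selective filter, the opposite key order mentioned above would be the variant to test (a sketch only; the index name is arbitrary and which order wins depends on your data):
create index idx_security_scorecards_summary_2 on
tbl_security_scorecards_summary (classification_id, valuation_date)
include (fv_change)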
The select is written in a slightly strange way; not sure if that makes any difference, but you could also try the normal way to do this:
select
sum(fact.fv_change), dim.sc_group, fact.valuation_date
from
tbsm.tbl_security_scorecards_summary fact
join tbsm.tbl_security_classification dim
on fact.classification_id = dim.classification_id
where
fact.valuation_date in (select date from #valdates) and
dim.sc_book = 'UC'
group by
fact.valuation_date,
dim.sc_group
Looking at "statistics io" output should give you a good idea which table is causing the slowness, and looking at query plan to see if there's any strange operators might help to understand the situation better.

Computed column performance

I can't find any answer for my problem on the web.
When exactly are computed columns computed? (not persisted ones)
When I select TOP 100 from thousands of records, are they calculated for only those selected rows?
What if I add a WHERE clause for the computed column? Does this change?
The main problem is that I have a one-to-many relationship, but I want to have information on the parent side about... let's say MAX(somecolumn) of the child table.
I'm using Entity Framework. I decided to make a computed column.
Is this a good idea? Are there any others? Any help appreciated. Tnx
EDIT:
My column is defined like this:
[ComputedNextClassDate] as [dbo].[ComputeNextClassDate]([Id]),
And my function:
CREATE FUNCTION [dbo].[ComputeNextClassDate](@id INT)
RETURNS DATETIME
AS
BEGIN
DECLARE @nextDate DATETIME;
DECLARE @now DATETIME = GETUTCDATE();
SELECT @nextDate = MIN(Start) FROM [dbo].[Events] WHERE [Start] > @now AND [GroupClassId] = @id
RETURN @nextDate;
END;
For calculated columns with no persistence, the calculation result is never stored.
On query execution, the SQL Server engine searches for an execution plan. If your query has been well written, the value will be calculated only once, even if it is used in many places in your query.
My opinion: I never use calculated columns with no persistence. The calculation should be done at insertion time or when reading. SQL Server, and others, are usually inefficient at calculation.
Calling the CLR is catastrophic in terms of performance. Avoid it.
Prefer multiple tables with joins, like:
SELECT p.product_name
, SUM(ISNULL(sales,0))
FROM product p
LEFT OUTER JOIN sales s ON p.product_id = s.product_id
GROUP BY p.product_name

Sql Server Column with Auto-Generated Data

I have a customer table, and my requirement is to add a new varchar column that automatically obtains a random unique value each time a new customer is created.
I thought of writing an SP that randomizes a string, then check and re-generate if the string already exists. But to integrate the SP into the customer record creation process would require transactional SQL stuff at code level, which I'd like to avoid.
Help please?
edit:
I should've emphasized, the varchar has to be 5 characters long with numeric values between 1000 and 99999, and if the number is less than 10000, pad 0 on the left.
If it has to be varchar, you can cast a uniqueidentifier to varchar.
To get a random uniqueidentifier, use NEWID().
Here's how you cast it:
CAST(NewId() as varchar(36))
EDIT
As per your comment to @Brannon:
Are you saying you'll NEVER have over 99k records in the table? If so, just make your PK an identity column, seed it with 1000, and take care of the "0" left padding in your business logic.
This question gives me the same feeling I get when users won't tell me what they want done, or why, they only want to tell me how to do it.
"Random" and "Unique" are conflicting requirements unless you create a serial list and then choose randomly from it, deleting the chosen value.
But what's the problem this is intended to solve?
With your edit/update, sounds like what you need is an auto-increment and some padding.
Below is an approach that uses a bogus table, then adds an IDENTITY column (assuming that you don't have one) starting at 1000, and then uses a computed column to give you the padding needed to make everything work out as you requested.
CREATE TABLE Customers (
CustomerName varchar(20) NOT NULL
)
GO
INSERT INTO Customers
SELECT 'Bob Thomas' UNION
SELECT 'Dave Winchel' UNION
SELECT 'Nancy Davolio' UNION
SELECT 'Saded Khan'
GO
ALTER TABLE Customers
ADD CustomerId int IDENTITY(1000,1) NOT NULL
GO
ALTER TABLE Customers
ADD SuperId AS right(replicate('0',5)+ CAST(CustomerId as varchar(5)),5)
GO
SELECT * FROM Customers
GO
DROP TABLE Customers
GO
I think Michael's answer with the auto-increment should work well - your customer will get "01000" and then "01001" and then "01002" and so forth.
If you want to, or have to, make it more random, in this case I'd suggest you create a table that contains all possible values, from "01000" through "99999". When you insert a new customer, use a technique (e.g. randomization) to pick one of the existing rows from that table (your pool of still-available customer IDs), use it, and remove it from the table.
Anything else will become really bad over time. Imagine you've used up 90% or 95% of your available customer IDs - trying to randomly find one of the few remaining possibilities could lead to an almost endless retry of "is this one taken? Yes -> try the next one".
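A sketch of that pool approach (all table and column names here are assumptions for illustration):
-- One-time setup: a pool holding every still-available code, '01000' through '99999'.
CREATE TABLE dbo.AvailableCustomerIds (CustomerCode varchar(5) PRIMARY KEY);
;WITH n AS (
    SELECT TOP (99000) 1000 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS num
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.AvailableCustomerIds (CustomerCode)
SELECT RIGHT('0000' + CAST(num AS varchar(5)), 5)
FROM n;
-- When creating a customer: grab one code at random and remove it from the pool.
DELETE dbo.AvailableCustomerIds
OUTPUT deleted.CustomerCode
WHERE CustomerCode = (SELECT TOP (1) CustomerCode
                      FROM dbo.AvailableCustomerIds
                      ORDER BY NEWID());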
Marc
Does the random string data need to be a certain format? If not, why not use a uniqueidentifier?
insert into Customer ([Name], [UniqueValue]) values (@Name, NEWID())
Or use NEWID() as the default value of the column.
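If the format doesn't matter, a default constraint on the column is enough (a sketch; the constraint name is an assumption, and UniqueValue is the column from the INSERT above):
ALTER TABLE Customer
ADD CONSTRAINT DF_Customer_UniqueValue
DEFAULT (CONVERT(varchar(36), NEWID())) FOR UniqueValue;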
EDIT:
I agree with @rm: use a numeric value in your database, and handle the conversion to string (with padding, etc.) in code.
Try this:
ALTER TABLE Customer ADD AVarcharColumn varchar(50)
CONSTRAINT DF_Customer_AVarcharColumn DEFAULT CONVERT(varchar(50), GETDATE(), 109)
It returns a date and time up to milliseconds, which would be enough in most cases.
Do you really need a unique value?
