Can someone please explain the following behavior to me, let me know if table definitions etc. would help.
I have a query executed on SQL Server 2016 SP2, all tables have the default clustered index on the primary key column, which is an IDENTITY column:
SELECT a.smallint_col
FROM TableA a
INNER JOIN TableB b ON b.int_col = a.int_col
WHERE a.int_col = 123 AND b.varchar_col = 12345;
This query returns an error:
Conversion failed when converting the varchar value 'ABC123' to data type int.
'ABC123' is the value of a row in TableB.varchar_col.
I understand this query forces SQL Server to perform implicit conversion on TableB.varchar_col because it is passed without single quotes.
I can see from Include Live Query Statistics that this query is trying to use a non-clustered index scan on an index defined on TableB as:
CREATE NONCLUSTERED INDEX [varchar_col]
ON [TableB] ([varchar_col])
and an index seek on an index defined on TableA as:
CREATE NONCLUSTERED INDEX [int_col]
ON [TableA] ([int_col])
If I force the query to use the clustered index on each table by using WITH (INDEX(1)), the query returns successfully. I know that if I quote the value correctly, '12345', the query also returns successfully (for some unknown reason our code passes it without the quotes), and I think that is the real solution.
However, I'd like to understand the behavior of SQL Server here. Why is the clustered index scan able to perform the implicit conversion without throwing the error but the non clustered index scan can't?
What's likely happening here comes down to the order in which SQL Server applies the predicates.
When you get the failure, the engine is likely applying the predicate b.varchar_col = 12345 first. As a result, when a row holds a value that can't be (implicitly) converted to an int, the comparison fails and the query errors out.
For the times it works, the predicate a.int_col = 123 is likely being evaluated first. When this is applied, any rows that remain contain values in b.varchar_col that can be implicitly converted, and thus there is no failure.
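If you are stuck with the unquoted literal in the short term, one defensive rewrite is to make the conversion explicit on the column side with TRY_CONVERT (available since SQL Server 2012, so your 2016 SP2 qualifies), so that non-numeric rows simply fail to match instead of raising an error. A sketch:
SELECT a.smallint_col
FROM TableA a
INNER JOIN TableB b ON b.int_col = a.int_col
WHERE a.int_col = 123
  AND TRY_CONVERT(int, b.varchar_col) = 12345; -- 'ABC123' becomes NULL, no error
Note that wrapping the column in a function makes the predicate non-sargable, so the index on varchar_col can no longer be used for a seek; quoting the literal remains the better fix.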
Like you said, however, the real solution here is to correct your application layer. Likely you should be using a parametrised query, rather than (injecting?) raw values. Then you control the datatypes:
SELECT a.smallint_col
FROM TableA a
INNER JOIN TableB b ON b.int_col = a.int_col
WHERE a.int_col = @IntParam and b.varchar_col = @VarcharParam;
In your application code, you can then define @IntParam as an int and @VarcharParam as a varchar, meaning that no implicit conversion can happen, apart perhaps from your application passing 12345 to the parameter @VarcharParam.
How you parametrise your application is a different question though (seeing as we don't even know what language your application uses, we can't even provide an example, I'm afraid).
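That said, the database-side shape of a parametrised call is the same regardless of language. Purely as an illustrative sketch, via sp_executesql (the varchar(50) length is an assumption; match it to your column):
EXEC sp_executesql
    N'SELECT a.smallint_col
      FROM TableA a
      INNER JOIN TableB b ON b.int_col = a.int_col
      WHERE a.int_col = @IntParam AND b.varchar_col = @VarcharParam;',
    N'@IntParam int, @VarcharParam varchar(50)', -- length assumed
    @IntParam = 123,
    @VarcharParam = '12345';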
I'm trying to insert records into a table in a certain (and simple) order, as the table has an IDENTITY column (e.g. MyTbl (ID INT IDENTITY(1,1), Sale_Date DATE, Product_ID INT, Sales INT)).
The query is quite simple (this is just a simplified example):
INSERT INTO MyTbl (Sale_Date, Product_ID, Sales)
SELECT Sale_Date, Product_ID,COUNT(*) as sales
FROM Fact_tbl
GROUP BY Sale_Date,Product_ID
ORDER BY Sale_Date,Product_ID
The expected behavior is that when I select the highest values of the identity ID column, I should see the latest Sale_Date. However, this is not the case: the order of the ID column in the table has nothing to do with the dates. To make things even worse, if I recreate the table and run the same INSERT statement again and again, I get a different order of insertion each time for the same data.
I'm getting this behavior even if I wrap the query in an outer query and put the ORDER BY inside or outside the wrapper.
I never saw this behavior in any other SQL platform. Is this the expected behavior in Snowflake?
It's expected. Let me explain the reason:
AUTOINCREMENT and IDENTITY are synonymous. If either is specified for a column, Snowflake utilizes a sequence to generate the values for the column.
https://docs.snowflake.com/en/sql-reference/sql/create-table.html#optional-parameters
There is no guarantee that values from a sequence are contiguous (gap-free) or that the sequence values are assigned in a particular order. There is, in fact, no way to assign values from a sequence to rows in a specified order other than to use single-row statements (this still provides no guarantee about gaps).
https://docs.snowflake.com/en/user-guide/querying-sequences.html#sequence-semantics
"With Snowflake each INSERT has completely different order than the same INSERT that ran a couple of minutes ago"
No, it should insert the data in the expected order, because you use the ORDER BY clause. The issue is that the sequence values are not assigned in any particular order!
It's not easy to verify whether the data is sorted when you use "INSERT ... SELECT ... ORDER BY", unless you have access to the underlying metadata. For testing, you may define clustering keys on the table into which you ingested the "sorted" data.
Anyway, if you want to assign IDs matching the order when inserting bulk data, you need to use ROW_NUMBER instead of using an IDENTITY column or any sequence values.
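A minimal sketch of that ROW_NUMBER approach, reusing the tables from the question (this assumes ID is a plain INT, or that you supply explicit values for the autoincrement column, which Snowflake permits):
INSERT INTO MyTbl (ID, Sale_Date, Product_ID, Sales)
SELECT ROW_NUMBER() OVER (ORDER BY Sale_Date, Product_ID) AS ID, -- ID derived from the data itself
       Sale_Date,
       Product_ID,
       COUNT(*) AS Sales
FROM Fact_tbl
GROUP BY Sale_Date, Product_ID;
Now the highest ID is guaranteed to belong to the latest Sale_Date, because the ID is computed from the data rather than drawn from a sequence.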
This is not expected behavior in Snowflake; however, the way you insert data into your table (with the ORDER BY) doesn't determine the order in which the data is stored inside the table. You can leave the ORDER BY out of the INSERT, but you should include it in your SELECT.
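In other words, treat order as a property of the query, not of the stored data. For example:
SELECT Sale_Date, Product_ID, Sales
FROM MyTbl
ORDER BY Sale_Date, Product_ID; -- order imposed at read time, not insert time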
I'm using SQL Server 2014. My request I believe is rather simple. I have one table containing a field holding a date value that is stored as VARCHAR, and another table containing a field holding a date value that is stored as INT.
The date value in the VARCHAR field is stored like this: 2015M01
The data value in the INT field is stored like this: 201501
I need to compare these tables against each other using EXCEPT. My thought process was to somehow extract or TRIM the "M" out of the VARCHAR value and see if it would let me compare the two. If anyone has a better idea such as using CAST to change the date formats or something feel free to suggest that as well.
I am also concerned that even extracting the "M" out of the VARCHAR may still prevent the comparison, since one value will still be VARCHAR and the other INT. If it's possible to convert on the fly through a T-SQL query, that would be great advice as well. :)
REPLACE the string and then CONVERT to integer
SELECT A.*, B.*
FROM TableA A
INNER JOIN (SELECT intField
            FROM TableB) AS B
    ON CONVERT(INT, REPLACE(A.varcharField, 'M', '')) = B.intField
Since you say you already have the query and are using EXCEPT, you can simply change the definition of that one "date" field in the query containing the VARCHAR value so that it matches the INT format of the other query. For example:
SELECT Field1, CONVERT(INT, REPLACE(VarcharDateField, 'M', '')) AS [DateField], Field3
FROM TableA
EXCEPT
SELECT Field1, IntDateField, Field3
FROM TableB
HOWEVER, while I realize that this might not be feasible, your best option, if you can make this happen, would be to change how the data in the table with the VARCHAR field is stored so that it is actually an INT in the same format as the table with the data already stored as an INT. Then you wouldn't have to worry about situations like this one.
Meaning:
Add an INT field to the table with the VARCHAR field.
Do an UPDATE of that table, setting the INT field to the string value with the M removed (see the sketch after this list).
Update any INSERT and/or UPDATE stored procedures used by external services (app, ETL, etc) to do that same M removal logic on the way in. Then you don't have to change any app code that does INSERTs and UPDATEs. You don't even need to tell anyone you did this.
Update any "get" / SELECT stored procedures used by external services (app, ETL, etc) to do the opposite logic: convert the INT to VARCHAR and add the M on the way out. Then you don't have to change any app code that gets data from the DB. You don't even need to tell anyone you did this.
This is one of many reasons that having a Stored Procedure API to your DB is quite handy. I suppose an ORM can just be rebuilt, but you still need to recompile, even if all of the code references are automatically updated. But making a datatype change (or even moving a field to a different table, or even replacing a field with a simple CASE statement) "behind the scenes", and masking it so that any code outside of your control doesn't know that a change happened, is not nearly as difficult as most people might think.
I have done all of these operations (datatype change, move a field to a different table, replace a field with simple logic, etc.), and it buys you a lot of time until the app code can be updated. That might be another team who handles that. Maybe their schedule won't allow for making any changes in that area (plus testing) for 3 months. OK. It will be there waiting for them when they are ready. And if there are several areas to update, they can be done one at a time. You can even create new stored procedures, running in parallel, for any updated app code that passes the proper INT datatype as the input parameter. And once all references to the VARCHAR value are gone, delete the original versions of those stored procedures.
If you want everything in the first table that is not in the second, you might consider something like this:
select t1.*
from t1
where not exists (select 1
from t2
where cast(replace(t1.varcharfield, 'M', '') as int) = t2.intfield
);
This should be close enough to except for your purposes.
I should add that you might need to include other columns in the where statement. However, the question only mentions one column, so I don't know what those are.
You could create a persisted view on the table with the char column, with a calculated column where the M is removed. Then you could JOIN the view to the table containing the INT column.
CREATE VIEW dbo.PersistedView
WITH SCHEMABINDING
AS
SELECT ConvertedDateCol = CONVERT(INT, REPLACE(VarcharCol, 'M', ''))
--, other columns including the PK, etc
FROM dbo.TablewithCharColumn;
CREATE UNIQUE CLUSTERED INDEX IX_PersistedView
ON dbo.PersistedView(<the PK column>);
SELECT *
FROM dbo.PersistedView pv
INNER JOIN dbo.TableWithIntColumn ic ON pv.ConvertedDateCol = ic.IntDateCol;
If you provide the actual details of both tables, I will edit my answer to make it clearer.
A persisted view with a computed column will perform far better on the SELECT statement where you join the two columns compared with doing the CONVERT and REPLACE every time you run the SELECT statement.
However, a persisted view will slightly slow down inserts into the underlying table(s), and the schema binding will prevent you from making DDL changes to the columns the view references.
If you're looking to not persist the values via a schema-bound view, you could create a non-persisted computed column on the table itself, then create a non-clustered index on that column. If you are using the computed column in WHERE or JOIN clauses, you may see some benefit.
By way of example:
CREATE TABLE dbo.PCT
(
    PCT_ID INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_PCT
        PRIMARY KEY CLUSTERED
    , SomeChar VARCHAR(50) NOT NULL
    , SomeCharToInt AS CONVERT(INT, REPLACE(SomeChar, 'M', ''))
);
CREATE INDEX IX_PCT_SomeCharToInt
ON dbo.PCT(SomeCharToInt);
INSERT INTO dbo.PCT(SomeChar)
VALUES ('2015M08');
SELECT SomeCharToInt
FROM dbo.PCT;
Results:
SomeCharToInt
-------------
201508
Often you join two tables following their foreign key, so that the row in the RHS table will always be found. Adding the join does not change the number of rows returned by the query. For example:
create table a (x int not null primary key)
create table b (x int not null primary key, y int not null)
alter table a add foreign key (x) references b (x)
Now, assuming you set up some data in these two tables, you can get a certain number of rows from a:
select x from a
Adding a join to b following the foreign key does not change this:
select a.x from a join b on a.x = b.x
However, that is not true of joins in general, which may filter out some rows or (by Cartesian product) add more:
select a.x from a join b on a.x = b.x and b.y != 42 -- probably gives fewer rows
select a.x from a join b on a.x != b.y -- probably gives more rows
When reading SQL code there is no obvious way to tell whether a join is the key-preserving kind, which may add extra columns but does not change the number of rows returned, or whether it has other effects. Over time I have developed a coding convention which I mostly stick to:
if a key-preserving join, use join
if wanting to filter rows, put the filter condition in the where clause
if wanting more rows, sometimes cross join for Cartesian product is the clearest way
These are usually just style issues, since you can often put a predicate into either the join clause or the where clause, for example.
My question
Is there some way to have these key-preserving joins statically checked by the database server when the query is compiled? I understand that the query optimizer already knows that a join on a foreign key will always find exactly one row in the table pointed to by the foreign key. But I would like to tag it in my SQL code for the benefit of human readers. For example, suppose the new syntax fkjoin is used for a join following a foreign key. Then the following SQL fragments will give errors or not:
a fkjoin b on a.x = b.x -- OK
a fkjoin b on a.x = b.x and b.y = 42 -- "Error, join can fail due to extra predicate"
a fkjoin b on a.x = b.y -- "Error, no foreign key from a.x to b.y"
This would be a useful check for me when writing the SQL, and also when returning to read it later. I understand and accept that changing the foreign keys in the database would change what SQL is legal under this scheme - to me, that is a desired outcome, since if a necessary FK ceases to exist then the key-preserving semantics of the query are no longer guaranteed, and I'd like to find out about it.
Potentially, there could be some external SQL static checker tool that does the work, and special comment syntax could be used rather than a new keyword. The checker tool would need access to the database schema to see what foreign keys exist, but it would not need to actually execute the query.
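For what it's worth, the schema lookup such a checker tool would need is straightforward against SQL Server's catalog views. A minimal sketch (the column aliases are my own) listing every single-column foreign key, which the tool could match against the join predicates it parses:
SELECT fk.name                               AS fk_name,
       OBJECT_NAME(fkc.parent_object_id)     AS referencing_table,
       pc.name                               AS referencing_column,
       OBJECT_NAME(fkc.referenced_object_id) AS referenced_table,
       rc.name                               AS referenced_column
FROM sys.foreign_keys fk
JOIN sys.foreign_key_columns fkc
  ON fkc.constraint_object_id = fk.object_id
JOIN sys.columns pc  -- referencing side
  ON pc.object_id = fkc.parent_object_id
 AND pc.column_id = fkc.parent_column_id
JOIN sys.columns rc  -- referenced side
  ON rc.object_id = fkc.referenced_object_id
 AND rc.column_id = fkc.referenced_column_id;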
Is there something that does what I want? I am using MSSQL 2008 R2. (Microsoft SQL Server for the pedantic)
I realize that you are interested in indicating whether a particular join on particular columns is on a FK, or is a restriction, or perhaps is of some other case, or none of the preceding. (And it's not clear what you mean by "success" or "failure" of a join, or its relevance.) But focusing on that information, as explained below, misses more important and fundamental things.
A base table has a "meaning" or "predicate (expression)": a fill-in-the-(named-)blanks statement given by the DBA. The names of the blanks of the statement are the columns of the table. Rows that fill in the blanks to make a true proposition about the world go in the table; rows that fill in the blanks to make a false proposition about the world are left out. I.e. a table holds the rows that satisfy its statement. You cannot set a base table to a certain value without knowing its statement, observing the world, and putting the appropriate rows into the table. You cannot know about the world from a base table except by knowing its statement and taking present-row propositions to be true and absent-row propositions to be false. I.e. you need its statement to use the database.
Notice that the typical syntax for a table declaration looks like a shorthand for its statement:
-- employee [eid] is named [name] and lives at [address] in ...
EMPLOYEE(eid,name,address,...)
You can make bigger statements by putting logic operators AND, OR, AND NOT, EXISTS name, AND condition, etc between/around other statements. If you translate a statement to a relation/SQL expression by converting
a table's statement to its name
AND to JOIN
OR to UNION
AND NOT to EXCEPT/MINUS
EXISTS C,... [...] to SELECT all columns but C,... FROM ...
AND condition to ON/WHERE condition
IMPLIES to SUBSETOF
IFF to =
then you get a relation/SQL expression that calculates the rows that make the statement true. (Arguments of UNION & EXCEPT/MINUS need the same columns.) So just as every table holds the rows satisfying its statement, a query expression holds the rows that satisfy its statement. You cannot know about the world from a query result except by knowing its statement and taking its present-row propositions to be true and absent-row propositions to be false. I.e. you need its statement to compose or interpret a query. (Observe that this is true regardless of what constraints hold.)
This is the foundation of the relational model: table expressions calculate rows satisfying corresponding statements. (To the extent that SQL differs, it is literally illogical.)
E.g.: if table T holds the rows that make statement T(...,T.Ci,...) true and table U holds the rows that make statement U(...,U.Cj,...) true, then table T JOIN U holds the rows that make statement T(...,T.Ci,...) AND U(...,U.Cj,...) true. That is the semantics of JOIN that is important to using a database. You can always join, a join always has a meaning, and it is always the AND of its operands' meanings. Whether any tables happen to have FKs to others just isn't particularly helpful for reasoning about updates or queries. (The DBMS uses constraints for when you make mistakes.)
A constraint expression just corresponds to a proposition, aka an always-true statement, about the world, and simultaneously to one about base tables. E.g. for C UNIQUE NOT NULL in U, the following three expressions are equivalent to each other:
FOREIGN KEY T (C) REFERENCES U (C)
EXISTS columns other than C T(...,C,...)
IMPLIES EXISTS columns other than C U(...,C,...)
(SELECT C FROM T) SUBSETOF (SELECT C FROM U)
It is true that this implies that SELECT C FROM T JOIN U ON T.C = U.C = SELECT C FROM U, i.e. a join on a FK returns the same number of rows. But so what? The join's meaning is still the same function of its arguments'.
Whether a particular join on a particular column set involves a foreign key is just not germane to understanding the meaning of a query.
I work with SQL Server 2008, but can use a later version if it would matter.
I have 2 tables with pretty similar data about some people but in different formats (no intersections between these 2 sets of people).
Table 1:
int personID
bit IsOldPerson -- this field is indexed
Table 2:
int PersonID
int Age
I want to have a combined view that has the same structure as the Table 1. So I write the following script (a simplified version):
CREATE FUNCTION CombinedView(@date date)
RETURNS TABLE
AS
RETURN
select personID as PID, IsOldPerson as IOP
from Table1
union all
select personID as PID, dbo.CheckIfOld(Age, @date) as IOP
from Table2
GO
The function "CheckIfOld" returns yes/no depending on the input age at the date #date.
So I have 2 questions here:
A. If I try select * from CombinedView(TODAY) where IOP=true, will SQL Server do the following separately: 1) for Table1, use the index on the field IsOldPerson and do a "clever" index-based selection of results; 2) for Table2, calculate CheckIfOld for all the rows, accepting or rejecting rows on a row-by-row basis during the calculation?
B. How can I check the execution plan in this particular case, to understand whether my guess in question (A) is correct or not?
Any help is greatly appreciated! Thanks!
Yes, if the query isn't too complex, the query optimizer should "see through" the view into its constituent UNION-ed SELECT statements, evaluate them separately, and concatenate the results. If there is an index on Table1, it should be able to use it. I tested this using tables we had and the same function concepts you presented. I reviewed the query plans of the raw SELECT to Table1 and the SELECT to the inline table-valued function with the UNION and the portion of the query plan relevant to Table1 was the same-- and it used the index.
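Regarding question B: in SSMS you can toggle "Include Actual Execution Plan" (Ctrl+M) before running the query, or request the plan as XML from any session. A sketch (GETDATE() stands in for the question's TODAY placeholder):
DECLARE @today date = GETDATE();

SET STATISTICS XML ON;

SELECT *
FROM dbo.CombinedView(@today)
WHERE IOP = 1; -- bit columns compare against 1/0, not true/false

SET STATISTICS XML OFF;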
Now if performance is a concern, I suggest you do one of two things:
If (a) Table2 is read-heavy rather than write-heavy, (b) you have the space, and (c) you can write CheckIfOld as a single CASE statement (as its name and the context in your question imply), then you should consider creating a persisted calculated field in Table2 holding the IsOldPerson calculation and applying an index to it.
If Table2 is write-heavy, or you have no space for additional fields, you should at least consider converting CheckIfOld into an inline function. You will likely reap performance gains, depending on how it is used. In your case, it would be used like this:
select personID as PID, IOP.IsOldPerson from Table2 CROSS APPLY dbo.CheckIfOld(Age, @date) AS IOP
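For concreteness, the inline version of CheckIfOld might look like the sketch below; the age test itself is yours to supply, so the threshold here is made up:
CREATE FUNCTION dbo.CheckIfOld (@Age int, @date date)
RETURNS TABLE
AS
RETURN
    SELECT IsOldPerson = CASE WHEN @Age >= 65 -- illustrative threshold; adjust to your rule
                              THEN CAST(1 AS bit)
                              ELSE CAST(0 AS bit)
                         END;
Because it is an inline TVF, the optimizer can expand it into the calling query instead of invoking it row by row as it would a scalar UDF.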
I have an update statement in SQL server where there are four possible values that can be assigned based on the join. It appears that SQL has an algorithm for choosing one value over another, and I'm not sure how that algorithm works.
As an example, say there is a table called Source with two columns (Match and Data) structured as below:
(The match column contains only 1's, the Data column increments by 1 for every row)
Match    Data
-----    ----
1        1
1        2
1        3
1        4
That table will update another table called Destination with the same two columns structured as below:
Match    Data
-----    ----
1        NULL
If you want to update the Data field in Destination in the following way:
UPDATE Destination
SET Data = Source.Data
FROM Destination
INNER JOIN Source
    ON Destination.Match = Source.Match
there will be four possible values that Destination.Data could be set to after this query is run. I've found that messing with the indexes on Source has an impact on what Destination is set to, and it appears that SQL Server just updates the Destination table with the first value it finds that matches.
Is that accurate? Or is it possible that SQL Server updates the Destination with every possible value sequentially, so that I end up with the same kind of result as if it were updating with the first value it finds? It seems potentially problematic that it seemingly randomly chooses one row to update, as opposed to throwing an error when presented with this situation.
Thank you.
P.S. I apologize for the poor formatting. Hopefully, the intent is clear.
It sets Data to each of the matching results in turn. Which one you end up with after the query depends on the order in which the results are returned (whichever one it sets last wins).
Since there's no ORDER BY clause, you're left with whatever order Sql Server comes up with. That will normally follow the physical order of the records on disk, and that in turn typically follows the clustered index for a table. But this order isn't set in stone, particularly when joins are involved. If a join matches on a column with an index other than the clustered index, it may well order the results based on that index instead. In the end, unless you give it an ORDER BY clause, Sql Server will return the results in whatever order it thinks it can do fastest.
You can play with this by turning your update query into a select query, so you can see the results. Notice which record comes first and which comes last in the source table for each record of the destination table. Compare that with the results of your update query. Then play with your indexes again and check the results once more to see what you get.
Of course, it can be tricky here because UPDATE statements are not allowed to use an ORDER BY clause, so regardless of what you find, you should really write the join so it matches the destination table 1:1. You may find the APPLY operator useful in achieving this goal: you can use it to effectively JOIN to another table while guaranteeing that the join matches only one record.
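A sketch of that APPLY pattern against the question's tables (the ORDER BY Data DESC tie-break is an arbitrary choice; use whatever rule defines the "right" source row for you):
UPDATE d
SET Data = s.Data
FROM Destination AS d
CROSS APPLY (SELECT TOP (1) src.Data
             FROM Source AS src
             WHERE src.Match = d.Match
             ORDER BY src.Data DESC -- deterministic pick: the largest Data
            ) AS s;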
The choice is not deterministic and it can be any of the source rows.
You can try
DECLARE @Source TABLE(Match INT, Data INT);
INSERT INTO @Source
VALUES
(1, 1),
(1, 2),
(1, 3),
(1, 4);
DECLARE @Destination TABLE(Match INT, Data INT);
INSERT INTO @Destination
VALUES
(1, NULL);
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN @Source Source
ON Destination.Match = Source.Match;
SELECT *
FROM @Destination;
And look at the actual execution plan. I see the following.
The output columns from @Destination are Bmk1000, Match. Bmk1000 is an internal row identifier (used here due to the lack of a clustered index in this example) and would be different for each row emitted from @Destination (if there were more than one).
The single row is then joined onto the four matching rows in @Source, and the resultant four rows are passed into a stream aggregate.
The stream aggregate groups by Bmk1000 and collapses the multiple matching rows down to one. The operation performed by this aggregate is ANY(@Source.[Data]).
The ANY aggregate is an internal aggregate function not available in TSQL itself. No guarantees are made about which of the four source rows will be chosen.
Finally the single row per group feeds into the UPDATE operator to update the row with whatever value the ANY aggregate returned.
If you want deterministic results then you can use an aggregate function yourself...
WITH GroupedSource AS
(
SELECT Match,
MAX(Data) AS Data
FROM @Source
GROUP BY Match
)
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN GroupedSource Source
ON Destination.Match = Source.Match;
Or use ROW_NUMBER...
WITH RankedSource AS
(
SELECT Match,
Data,
ROW_NUMBER() OVER (PARTITION BY Match ORDER BY Data DESC) AS RN
FROM @Source
)
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN RankedSource Source
ON Destination.Match = Source.Match
WHERE RN = 1;
The latter form is generally more useful, as in the event you need to set multiple columns, it ensures that all values used come from the same source row. In order to be deterministic, the combination of PARTITION BY and ORDER BY columns should be unique.