Select clause with limit and order by - snowflake-cloud-data-platform

If you do a select query with limit and order by a column, does that guarantee the results are deterministic? Even if the column has a lot of same values, like a boolean column? Or is the determinism guaranteed only if each row has a unique value on that column?

The sort has to be stable to get replicable results. For instance for:
CREATE TABLE t(id INT, col VARCHAR);
INSERT INTO t
VALUES (1, 'a'), (2, 'a'), (3, 'b');
Query:
SELECT *
FROM t
ORDER BY col
LIMIT 1;
It could return either (1, 'a') or (2, 'a'). The tie on col is not resolved, so another column should be added to the ORDER BY to provide a stable sort.
To easily check if columns provide stable sort the following query could be used:
SELECT *
FROM t
QUALIFY COUNT(*) OVER(PARTITION BY col_list_here) > 1
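As a rough way to try the idea outside Snowflake, here is a minimal sketch using Python's built-in sqlite3 module. SQLite has no QUALIFY, so the tie check is written with GROUP BY ... HAVING instead; the helper name tied_keys is made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(id INT, col VARCHAR)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "a"), (3, "b")])


def tied_keys(conn, columns):
    """Return the key values (with their counts) that occur on more than one row.

    Hypothetical helper: column names are trusted and pasted into the SQL text,
    which is acceptable for a sketch but not for untrusted input.
    """
    col_list = ", ".join(columns)
    sql = (
        f"SELECT {col_list}, COUNT(*) FROM t "
        f"GROUP BY {col_list} HAVING COUNT(*) > 1"
    )
    return conn.execute(sql).fetchall()


# Ordering by col alone leaves a tie between ids 1 and 2.
ties_on_col = tied_keys(conn, ["col"])

# Adding id as a tie-breaker makes every sort key unique.
ties_on_col_id = tied_keys(conn, ["col", "id"])

# With a unique sort key, LIMIT 1 can only return one possible row.
first_row = conn.execute(
    "SELECT id, col FROM t ORDER BY col, id LIMIT 1"
).fetchone()
```

An empty result from the tie check means the chosen columns form a unique sort key, which is the condition under which ORDER BY ... LIMIT becomes repeatable.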

I did a quick test of this and it returned non-deterministic results (i.e. the result changed on every run).
Adding more columns to the ORDER BY clause to increase the cardinality still produced inconsistent results.
I was only able to get deterministic results when the ORDER BY columns produced unique combinations of values.

If a non-unique column is used in the ORDER BY clause, the result order is non-deterministic.
Moreover, for the LIMIT / TOP, please note that ORDER BY must be at the same level.
Another example of non-deterministic order behavior is if the combination of the keys in the OVER clause of a window function doesn’t form a composite unique key in the table.

Related

Order of XML nodes from document preserved in insert?

If I do:
INSERT INTO dst
SELECT blah
FROM src
CROSS APPLY xmlcolumn.nodes('blah')
where dst has an identity column, can one say for certain that the identity column order matches the order of the nodes from the original xml document?
I think the answer is no, there are no guarantees and that to ensure the ordering is able to be retained, some ordering information needs to also be extracted from the XML at the same time the nodes are enumerated.
There's no way to see it explicitly in an execution plan, but the id column returned by the nodes() method is a varbinary(900) OrdPath, which does encapsulate the original xml document order.
The solution offered by Mikael Eriksson on the related question Does the `nodes()` method keep the document order? relies on the OrdPath to provide an ORDER BY clause necessary to determine how identity values are assigned for the INSERT.
A slightly more compact usage follows:
CREATE TABLE #T
(
ID integer IDENTITY,
Fruit nvarchar(10) NOT NULL
);
DECLARE @xml xml =
N'
<Fruits>
<Apple />
<Banana />
<Orange />
<Pear />
</Fruits>
';
INSERT #T
(Fruit)
SELECT
N.n.value('local-name(.)', 'nvarchar(10)')
FROM @xml.nodes('/Fruits/*') AS N (n)
ORDER BY
ROW_NUMBER() OVER (ORDER BY N.n);
SELECT
T.ID,
T.Fruit
FROM #T AS T
ORDER BY
T.ID;
Using the OrdPath this way is presently undocumented, but the technique is sound in principle:
The OrdPath reflects document order.
The ROW_NUMBER computes sequence values ordered by OrdPath*.
The ORDER BY clause uses the row number sequence.
Identity values are assigned to rows as per the ORDER BY.
To be clear, this holds even if parallelism is employed. As Mikael says, the dubious aspect is using id in the ROW_NUMBER since id is not documented to be the OrdPath.
* The ordering is not shown in plans, but optimizer output using TF 8607 contains:
ScaOp_SeqFunc row_number order[CALC:QCOL: XML Reader with XPath filter.id ASC]
Under the current implementation of .nodes, the XML nodes are generated in document order. The result is always joined to the original data using a nested loops join, which also always runs in order.
Furthermore, inserts are generally serial (they go parallel only under very specific circumstances, usually when the table is empty, and never when an IDENTITY value is being generated).
Therefore there is no reason why the server would ever return rows in a different order than the document order. You can see from this fiddle that that is what happens.
That being said, there is no guarantee that the implementation of .nodes won't change, or that inserts may in future go parallel, as neither of these is documented anywhere as being guaranteed. So I wouldn't rely on it without an explicit ORDER BY, and you do not have a column to order it on.
Using an ORDER BY would guarantee it. The docs state: "INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted."
Even using ROW_NUMBER as some have recommended is also not guaranteed. The only real solution is to get the document order directly from XQuery.
The problem is that SQL Server's version of XQuery does not allow using position(.) as a result, only as a predicate. Instead, you can use a hack involving the << positional operator.
For example:
SELECT T.X.value('text()[1]', 'nvarchar(100)') as RowLabel,
T.X.value('let $i := . return count(../*[. << $i]) + 1', 'int') as RowNumber
FROM src
CROSS APPLY xmlcolumn.nodes('blah') as T(X);
What this does is:
Assigns the current node . to the variable $i
Takes all the nodes in ../* i.e. all children of the parent of this node
... [. << $i] keeps only those that precede $i in document order
Counts them
Adds 1 to make the result one-based
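The counting logic behind that XQuery expression can be sketched outside SQL Server. This is only an illustration in Python using xml.etree.ElementTree, not the XQuery engine itself; the helper name position_of and the sample document (borrowed from the Fruits example earlier) are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

doc = "<Fruits><Apple /><Banana /><Orange /><Pear /></Fruits>"
root = ET.fromstring(doc)

# ElementTree iterates child elements in document order.
children = list(root)


def position_of(node, siblings):
    """Count the siblings that precede `node`, then add 1 (one-based position).

    Mirrors count(../*[. << $i]) + 1: "preceding" here means earlier in
    document order among the children of the same parent.
    """
    preceding = 0
    for sibling in siblings:
        if sibling is node:
            break
        preceding += 1
    return preceding + 1


rows = [(child.tag, position_of(child, children)) for child in children]
```

Like the XQuery version, this recomputes the count for every node, so the work grows quadratically with the number of siblings.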

Snowflake INSERT inconsistent behavior using ORDER BY

I'm trying to insert records into a table in a certain (and simple) order, as the table has an IDENTITY column (e.g. MyTbl (ID INT IDENTITY(1,1), Sale_Date DATE, Product_ID INT, Sales INT)).
The query is quite simple (this is just a simplified example):
INSERT INTO MyTbl (Sale_Date, Product_ID, Sales)
SELECT Sale_Date, Product_ID,COUNT(*) as sales
FROM Fact_tbl
GROUP BY Sale_Date,Product_ID
ORDER BY Sale_Date,Product_ID
The expected behavior is that when I select the highest values of the identity ID column, I should see the latest Sale_Date. However, this is not the case. The order of the ID column in the table has nothing to do with the dates. To make things even worse, if I recreate the table and run the same INSERT statement again and again and again, I'm getting different order of insertion each time for the same data.
I'm getting this behavior even if I encase the query and put the ORDER BY in or out of the casing.
I never saw this behavior in any other SQL platform. Is this the expected behavior in Snowflake?
It's expected. Let me explain the reason:
AUTOINCREMENT and IDENTITY are synonymous. If either is specified for a column, Snowflake utilizes a sequence to generate the values for the column.
https://docs.snowflake.com/en/sql-reference/sql/create-table.html#optional-parameters
There is no guarantee that values from a sequence are contiguous (gap-free) or that the sequence values are assigned in a particular order. There is, in fact, no way to assign values from a sequence to rows in a specified order other than to use single-row statements (this still provides no guarantee about gaps).
https://docs.snowflake.com/en/user-guide/querying-sequences.html#sequence-semantics
"With Snowflake each INSERT has completely different order than the same INSERT that ran a couple of minutes ago"
No, it should insert the data in expected order because you use "ORDER BY" clause. The issue is, the sequence values are not assigned in a particular order!
It's not easy to verify if the data is sorted when you use "INSERT/SELECT ORDER BY", unless you have access to underlying metadata. For testing, you may define clustering keys on a table that you ingested "sorted" data.
Anyway, if you want to assign IDs matching the order when inserting bulk data, you need to use ROW_NUMBER instead of using an IDENTITY column or any sequence values.
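A minimal sketch of the ROW_NUMBER approach, run against SQLite through Python's sqlite3 module rather than Snowflake (so it only demonstrates the shape of the query, and it assumes an SQLite build with window function support, 3.25 or later). The sample rows are invented for illustration, and ID is a plain INT column here instead of an IDENTITY column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Fact_tbl (Sale_Date TEXT, Product_ID INT);
    CREATE TABLE MyTbl (ID INT, Sale_Date TEXT, Product_ID INT, Sales INT);
    -- Invented sample data, deliberately inserted out of order.
    INSERT INTO Fact_tbl VALUES
        ('2024-01-02', 7), ('2024-01-01', 9), ('2024-01-02', 7),
        ('2024-01-01', 3), ('2024-01-03', 3);
    """
)

# The ID is computed from the data itself, so it no longer depends on
# the order in which a sequence happens to hand out values.
conn.execute(
    """
    INSERT INTO MyTbl (ID, Sale_Date, Product_ID, Sales)
    SELECT ROW_NUMBER() OVER (ORDER BY Sale_Date, Product_ID),
           Sale_Date, Product_ID, COUNT(*)
    FROM Fact_tbl
    GROUP BY Sale_Date, Product_ID
    """
)

rows = conn.execute(
    "SELECT ID, Sale_Date, Product_ID, Sales FROM MyTbl ORDER BY ID"
).fetchall()
```

Because (Sale_Date, Product_ID) is unique after the GROUP BY, the row numbers are fully determined by the data. For later batches, the numbering would need an offset such as the current maximum ID, which this sketch does not cover.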
This is not expected behavior in Snowflake. However the way you insert data into your table (with the order by) doesn't affect the order in which the data is stored inside the table. You can leave the order by out in the insert, but you should include it in your select.

Using indexes when comparing datetimes

I have two tables, both of which contain millions of rows of data.
tbl_one:
purchasedtm DATETIME,
userid INT,
totalcost INT
tbl_two:
id BIGINT,
eventdtm DATETIME,
anothercol INT
The first table has a clustered index on the first two columns: CLUSTERED INDEX tbl_one_idx ON(purchasedtm, userid)
The second one has a primary key on its ID column, and also a non-clustered index on the eventdtm column.
I want to run a query which looks for rows in which purchasedtm and eventdtm are on the same day.
Originally, I wrote my query as:
WHERE CAST(tbl_one.purchasedtm AS DATE) = CAST(tbl_two.eventdtm AS DATE)
But this was not going to use either of the two indexes.
Later, I changed my query to this:
WHERE tbl_one.purchasedtm >= CAST(tbl_two.eventdtm AS DATE)
AND tbl_one.purchasedtm < DATEADD(DAY, 1, CAST(tbl_two.eventdtm AS DATE))
This way, because only one side of the comparison is wrapped in a function, the other side can still use its index. Correct?
I also have some additional questions:
I can write the query the other way around too, i.e. keeping tbl_two.eventdtm untouched and wrapping tbl_one.purchasedtm in CAST(). Would that make a difference in performance?
If the answer to the previous question is yes, is it because eventdtm has its own dedicated index, while looking up purchasedtm would only be a partial index match?
Are there other factors I can take into consideration for deciding which of the two choices is better? (For example, if there are millions of rows in tbl_one but billions of rows in tbl_two, would that impact which column I should CAST and which one I should not?)
In general, if you compare two columns that are both indexed, would we gain any performance compared to a similar scenario in which only one of them is indexed?
And lastly, can I perform my original task without using CAST?
Note: I do not have the ability to create or modify indexes, add columns, etc.
A little late after commenting, but...
As discussed in the comments, code such as CAST(DateTimeColumn AS date) is actually SARGable. Rob Farley posted an article on some of the SARGable and non-SARGable functionality here, however, I'll cover a few things off anyway.
Firstly, applying a function to a column will normally make your query non-SARGable, and especially if it changes the order of the values or the order of them is meaningless. Take something like:
SELECT *
FROM TABLE
WHERE RIGHT(COLUMN,5) = 'value';
The order of the values in the column is utterly unhelpful here, as we're focusing on the right hand characters. Unfortunately, as Rob also discusses:
SELECT *
FROM TABLE
WHERE LEFT(COLUMN,5) = 'value';
This is also non-SARGable. However what about the following?
SELECT *
FROM TABLE
WHERE Column LIKE 'value%';
This is, as the logic isn't applied to the column and the order doesn't change. If the value were '%value%' then that too would be non-SARGable.
When applying logic that adds (or subtracts) what you want to find, you always want to apply that to the literal value (or function, like GETDATE()). For example one of these expressions is SARGable, the other is not:
Column + 1 = @Variable --non-SARGable
Column = @Variable - 1 --SARGable
The same applies to things like DATEADD
@DateVariable BETWEEN DateColumn AND DATEADD(DAY, 30,DateColumn) --non-SARGable
DateColumn BETWEEN DATEADD(DAY, -30, @DateVariable) AND @DateVariable --SARGable
Changing the datatype (other than to a date) will rarely keep a query SARGable. CONVERT(date,varchardate,112) will not be SARGable, even though the order of the column is unchanged. Converting a decimal to an int, however, had the same result as converting a datetime to a date, and kept SARGability:
CREATE TABLE testtab (n decimal(2,1) PRIMARY KEY CLUSTERED);
INSERT INTO testtab
VALUES(0.1),
(0.3),
(1.1),
(1.7),
(2.4);
GO
SELECT n
FROM testtab
WHERE CONVERT(int,n) = 2;
GO
DROP TABLE testtab;
Hopefully, that gives you enough to go on, but please do ask if you want me to add anything further.

"order by newid()" - how does it work?

I know that If I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?
I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?
Yes, this is pretty much exactly correct (except it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.
SQL Server doesn't actually need to sort the entire set from position 100 down, so it uses a TOP N sort operator, which attempts to perform the entire sort operation in memory (for small values of N).
In general it works like this:
All rows from mytable are "looped"
NEWID() is executed for each row
The rows are sorted according to the random value from NEWID()
The first 100 rows are selected
As MSDN says:
NewID() Creates a unique value of type uniqueidentifier.
Your table will be sorted by these random values.
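The steps above can be imitated in a few lines of Python. This is a conceptual sketch, not what SQL Server executes: uuid.uuid4() stands in for NEWID(), and heapq.nsmallest plays the role of a Top N sort that keeps only the n best rows instead of sorting everything. The function name random_top_n is made up for the example.

```python
import heapq
import uuid


def random_top_n(rows, n):
    """Tack a fresh random key onto each row, then keep the n smallest keys."""
    keyed = ((uuid.uuid4().hex, row) for row in rows)  # one key per row
    smallest = heapq.nsmallest(n, keyed, key=lambda pair: pair[0])
    return [row for _, row in smallest]


# 2506 rows, matching the row count mentioned for the example table.
rows = list(range(2506))
sample = random_top_n(rows, 100)
```

The random key never has to appear in the output, which is why newid() does not need to be in the select list: it only exists to decide the order.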
Run this to see the generated value alongside each row:
select top 100 randid = newid(), * from mytable order by randid
That should make it clear.
I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be ok in such cases where performance is not too bad & does not have a huge impact. But, newId() is bad for large tables.
Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.
From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.
4 ways to get a random row from a large table:
Method 1, Bad: ORDER BY NEWID() > Bad performance!
Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really random!
Method 3, Best but Requires Code: Random Primary Key > Fastest, but won't work for negative numbers.
Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered index.
More on method 3:
Get the top ID field in the table, generate a random number, and look for that ID. For top N rows, call the code below N times or generate N random numbers and use in an IN clause.
/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;
/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;
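A rough Python sketch of the same lookup, using a sorted list of ids as a stand-in for the clustered index on dbo.Users. The id values and the function name random_row_id are invented; the sketch also assumes the smallest qualifying id is the one returned, which the T-SQL above does not strictly promise without an ORDER BY.

```python
import random
from bisect import bisect_left


def random_row_id(sorted_ids, rng):
    """Pick a random target below the top id, then take the first id >= target."""
    max_id = sorted_ids[-1]
    target = rng.randrange(max_id)  # 0 <= target < max_id, like ... % @maxid
    return sorted_ids[bisect_left(sorted_ids, target)]


# Hypothetical ids with gaps (3, 4 and 6 to 8 are missing).
user_ids = [1, 2, 5, 9, 10]
rng = random.Random(42)
picks = [random_row_id(user_ids, rng) for _ in range(1000)]
```

Two properties of the method show up here: ids that follow a gap are chosen more often, because every target inside the gap lands on them, and in this data the top id 10 is never returned, because the target is always below it and id 9 catches the last possible target.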

How does sql server choose values in an update statement where there are multiple options?

I have an update statement in SQL server where there are four possible values that can be assigned based on the join. It appears that SQL has an algorithm for choosing one value over another, and I'm not sure how that algorithm works.
As an example, say there is a table called Source with two columns (Match and Data) structured as below:
(The match column contains only 1's, the Data column increments by 1 for every row)
Match    Data
--------------------------
1        1
1        2
1        3
1        4
That table will update another table called Destination with the same two columns structured as below:
Match    Data
--------------------------
1        NULL
If you want to update the Data field in Destination in the following way:
UPDATE
Destination
SET
Data = Source.Data
FROM
Destination
INNER JOIN
Source
ON
Destination.Match = Source.Match
there will be four possible options that Destination.Data will be set to after this query is run. I've found that messing with the indexes of Source will have an impact on what Destination is set to, and it appears that SQL Server just updates the Destination table with the first value it finds that matches.
Is that accurate? Is it possible that SQL Server is updating the Destination with every possible value sequentially and I end up with the same kind of result as if it were updating with the first value it finds? It seems to be possibly problematic that it will seemingly randomly choose one row to update, as opposed to throwing an error when presented with this situation.
Thank you.
P.S. I apologize for the poor formatting. Hopefully, the intent is clear.
It sets all of the results to the Data. Which one you end up with after the query depends on the order of the results returned (which one it sets last).
Since there's no ORDER BY clause, you're left with whatever order Sql Server comes up with. That will normally follow the physical order of the records on disk, and that in turn typically follows the clustered index for a table. But this order isn't set in stone, particularly when joins are involved. If a join matches on a column with an index other than the clustered index, it may well order the results based on that index instead. In the end, unless you give it an ORDER BY clause, Sql Server will return the results in whatever order it thinks it can do fastest.
You can play with this by turning your update query into a select query, so you can see the results. Notice which record comes first and which record comes last in the source table for each record of the destination table. Compare that with the results of your update query. Then play with your indexes again and check the results once more to see what you get.
Of course, it can be tricky here because UPDATE statements are not allowed to use an ORDER BY clause, so regardless of what you find, you should really write the join so it matches the destination table 1:1. You may find the APPLY operator useful in achieving this goal, and you can use it to effectively JOIN to another table and guarantee the join only matches one record.
The choice is not deterministic and it can be any of the source rows.
You can try
DECLARE @Source TABLE(Match INT, Data INT);
INSERT INTO @Source
VALUES
(1, 1),
(1, 2),
(1, 3),
(1, 4);
DECLARE @Destination TABLE(Match INT, Data INT);
INSERT INTO @Destination
VALUES
(1, NULL);
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN @Source Source
ON Destination.Match = Source.Match;
SELECT *
FROM @Destination;
And look at the actual execution plan. I see the following.
The output columns from @Destination are Bmk1000, Match. Bmk1000 is an internal row identifier (used here due to lack of clustered index in this example) and would be different for each row emitted from @Destination (if there was more than one).
The single row is then joined onto the four matching rows in @Source and the resultant four rows are passed into a stream aggregate.
The stream aggregate groups by Bmk1000 and collapses the multiple matching rows down to one. The operation performed by this aggregate is ANY(@Source.[Data]).
The ANY aggregate is an internal aggregate function not available in TSQL itself. No guarantees are made about which of the four source rows will be chosen.
Finally the single row per group feeds into the UPDATE operator to update the row with whatever value the ANY aggregate returned.
If you want deterministic results then you can use an aggregate function yourself...
WITH GroupedSource AS
(
SELECT Match,
MAX(Data) AS Data
FROM @Source
GROUP BY Match
)
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN GroupedSource Source
ON Destination.Match = Source.Match;
Or use ROW_NUMBER...
WITH RankedSource AS
(
SELECT Match,
Data,
ROW_NUMBER() OVER (PARTITION BY Match ORDER BY Data DESC) AS RN
FROM @Source
)
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN RankedSource Source
ON Destination.Match = Source.Match
WHERE RN = 1;
The latter form is generally more useful as in the event you need to set multiple columns this will ensure that all values used are from the same source row. In order to be deterministic the combination of partition by and order by columns should be unique.
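For anyone who wants to experiment without a SQL Server instance, here is a sketch of the ROW_NUMBER idea using SQLite through Python's sqlite3 module. It is an adaptation, not the T-SQL above: it uses a correlated subquery instead of UPDATE ... FROM, quotes "Match" because MATCH is an SQLite keyword, and assumes an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Source ("Match" INT, Data INT);
    CREATE TABLE Destination ("Match" INT, Data INT);
    INSERT INTO Source VALUES (1, 1), (1, 2), (1, 3), (1, 4);
    INSERT INTO Destination VALUES (1, NULL);
    """
)

# The ranking is only deterministic if (partition columns, order columns)
# is unique, so check that no ("Match", Data) pair repeats.
duplicates = conn.execute(
    'SELECT "Match", Data FROM Source GROUP BY "Match", Data HAVING COUNT(*) > 1'
).fetchall()

# Pick exactly one source row per destination row: the one ranked first.
conn.execute(
    """
    UPDATE Destination
    SET Data = (
        SELECT Ranked.Data
        FROM (
            SELECT "Match", Data,
                   ROW_NUMBER() OVER (PARTITION BY "Match" ORDER BY Data DESC) AS RN
            FROM Source
        ) AS Ranked
        WHERE Ranked."Match" = Destination."Match" AND Ranked.RN = 1
    )
    WHERE EXISTS (
        SELECT 1 FROM Source WHERE Source."Match" = Destination."Match"
    )
    """
)

result = conn.execute('SELECT "Match", Data FROM Destination').fetchall()
```

The WHERE EXISTS clause keeps the inner join semantics of the original statement, so destination rows without any matching source row are left untouched rather than set to NULL.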
