Order of XML nodes from document preserved in insert? - sql-server

If I do:
INSERT INTO dst
SELECT blah
FROM src
CROSS APPLY xmlcolumn.nodes('blah')
where dst has an identity column, can one say for certain that the identity column order matches the order of the nodes from the original xml document?
I think the answer is no, there are no guarantees and that to ensure the ordering is able to be retained, some ordering information needs to also be extracted from the XML at the same time the nodes are enumerated.

There's no way to see it explicitly in an execution plan, but the id column returned by the nodes() method is a varbinary(900) OrdPath, which does encapsulate the original xml document order.
The solution offered by Mikael Eriksson on the related question Does the `nodes()` method keep the document order? relies on the OrdPath to provide an ORDER BY clause necessary to determine how identity values are assigned for the INSERT.
A slightly more compact usage follows:
CREATE TABLE #T
(
ID integer IDENTITY,
Fruit nvarchar(10) NOT NULL
);
DECLARE #xml xml =
N'
<Fruits>
<Apple />
<Banana />
<Orange />
<Pear />
</Fruits>
';
INSERT #T
(Fruit)
SELECT
N.n.value('local-name(.)', 'nvarchar(10)')
FROM #xml.nodes('/Fruits/*') AS N (n)
ORDER BY
ROW_NUMBER() OVER (ORDER BY N.n);
SELECT
T.ID,
T.Fruit
FROM #T AS T
ORDER BY
T.ID;
db<>fiddle
Using the OrdPath this way is presently undocumented, but the technique is sound in principle:
The OrdPath reflects document order.
The ROW_NUMBER computes sequence values ordered by OrdPath*.
The ORDER BY clause uses the row number sequence.
Identity values are assigned to rows as per the ORDER BY.
To be clear, this holds even if parallelism is employed. As Mikael says, the dubious aspect is using id in the ROW_NUMBER since id is not documented to be the OrdPath.
* The ordering is not shown in plans, but optimizer output using TF 8607 contains:
ScaOp_SeqFunc row_number order[CALC:QCOL: XML Reader with XPath filter.id ASC]

Under the current implementation of .nodes, the XML nodes are generated in document order. The result of that is always joined to the original data using a nested loops, which always runs in order also.
Furthermore, inserts are generally serial (except under very specific circumstances that it goes parallel, usually when you have an empty table, and never with an IDENTITY value being generated).
Therefore there is no reason why the server would ever return rows in a different order than the document order. You can see from this fiddle that that is what happens.
That being said, there is no guarantee that the implementation of .nodes won't change, or that inserts may in future go parallel, as neither of these is documented anywhere as being guaranteed. So I wouldn't rely on it without an explicit ORDER BY, and you do not have a column to order it on.
Using an ORDER BY would guarantee it. The docs state: "INSERT queries that use SELECT with ORDER BY to populate rows guarantees how identity values are computed but not the order in which the rows are inserted."
Even using ROW_NUMBER as some have recommended is also not guaranteed. The only real solution is to get the document order directly from XQuery.
The problem is that SQL Server's version of XQuery does not allow using position(.) as a result, only as a predicate. Instead, you can use a hack involving the << positional operator.
For example:
SELECT T.X.value('text()[1]', 'nvarchar(100)') as RowLabel,
T.X.value('let $i := . return count(../*[. << $i]) + 1', 'int') as RowNumber
FROM src
CROSS APPLY xmlcolumn.nodes('blah') as T(X);
What this does is:
Assign the current node . to the variable $i
Takes all the nodes in ../* i.e. all children of the parent of this node
... [. << $i] that are previous to $i
and counts them
Then add 1 to make it one-based

Related

Snowflake INSERT inconsistent behavior using ORDER BY

I'm trying to insert records into a table in a certain (and simple) order, as the table have an IDENTITY column (e.g. MyTbl (ID INT IDENTITY(1,1), Sale_Date DATE, Product_ID INT, Sales INT).
The query is quite simple (this is just a simplified example):
INSERT INTO MyTbl (Sale_Date, Product_ID, Sales)
SELECT Sale_Date, Product_ID,COUNT(*) as sales
FROM Fact_tbl
GROUP BY Sale_Date,Product_ID
ORDER BY Sale_Date,Product_ID
The expected behavior is that when I select the highest values of the identity ID column, I should see the latest Sale_Date. However, this is not the case. The order of the ID column in the table has nothing to do with the dates. To make things even worse, if I recreate the table and run the same INSERT statement again and again and again, I'm getting different order of insertion each time for the same data.
I'm getting this behavior even if I encase the query and put the ORDER BY in or out of the casing.
I never saw this behavior in any other SQL platform. Is this the expected behavior in Snowflake?
It's expected. Let me explain the reason:
AUTOINCREMENT and IDENTITY are synonymous. If either is specified for a column, Snowflake utilizes a sequence to generate the values for the column.
https://docs.snowflake.com/en/sql-reference/sql/create-table.html#optional-parameters
There is no guarantee that values from a sequence are contiguous (gap-free) or that the sequence values are assigned in a particular order. There is, in fact, no way to assign values from a sequence to rows in a specified order other than to use single-row statements (this still provides no guarantee about gaps).
https://docs.snowflake.com/en/user-guide/querying-sequences.html#sequence-semantics
With Snowflake each INSERT has completely different order than the
same INSERT that ran a couple of minutes ago
No, it should insert the data in expected order because you use "ORDER BY" clause. The issue is, the sequence values are not assigned in a particular order!
It's not easy to verify if the data is sorted when you use "INSERT/SELECT ORDER BY", unless you have access to underlying metadata. For testing, you may define clustering keys on a table that you ingested "sorted" data.
Anyway, if you want to assign IDs matching the order when inserting bulk data, you need to use ROW_NUMBER instead of using an IDENTITY column or any sequence values.
This is not expected behavior in Snowflake. However the way you insert data into your table (with the order by) doesn't affect the order in which the data is stored inside the table. You can leave the order by out in the insert, but you should include it in your select.

Highlight Duplicate Values in a NetSuite Saved Search

I am looking for a way to highlight duplicates in a NetSuite saved search. The duplicates are in a column called "ACCOUNT" populated with text values.
NetSuite permits adding fields (columns) to the search using a stripped down version of SQL Server. It also permits conditional highlighting of entire rows using the same code. However I don't see an obvious way to compare values between rows of data.
Although duplicates can be grouped together in a summary report and identified by a count of 2 or more, I want to show duplicate lines separately and highlight each.
The closest thing I found was a clever formula that calculates a running total here:
sum/* comment */({amount})
OVER(PARTITION BY {name}
ORDER BY {internalid}
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
I wonder if it's possible to sort results by the field being checked for duplicates and adapt this code to identify changes in the "ACCOUNT" field between a row and the previous row.
Any ideas? Thanks!
This post has been edited. I have left the progression as a learning experience about NetSuite.
Original - plain SQL way - not suitable for NetSuite
Does something like this meet your needs? The test data assumes looking for duplicates on id1 and id2. Note: This does not work in NetSuite as it supports limited SQL functions. See comments for links.
declare #table table (id1 int, id2 int, value int);
insert #table values
(1,1,11),
(1,2,12),
(1,3,13),
(2,1,21),
(2,2,22),
(2,3,23),
(1,3,1313);
--select * from #table order by id1, id2;
select t.*,
case when dups.id1 is not null then 1 else 0 end is_dup --identify dups when there is a matching dup record
from #table t
left join ( --subquery to find duplicates
select id1, id2
from #table
group by id1, id2
having count(1) > 1
) dups
on dups.id1 = t.id1
and dups.id2 = t.id2
order by t.id1, t.id2;
First Edit - NetSuite target but in SQL.
This was a SQL test based on the example available syntax provided in the question since I do not have NetSuite to test against. This will give you a value greater than 1 on each duplicate row using a similar syntax. Note: This will give the appropriate answer but not in NetSuite.
select t.*,
sum(1) over (partition by id1, id2)
from #table t
order by t.id1, t.id2;
Second Edit - Working NetSuite version
After some back and forth here is a version that works in NetSuite:
sum/* comment */(1) OVER(PARTITION BY {name})
This will also give a value greater than 1 on any row that is a duplicate.
Explanation
This works by summing the value 1 on each row included in the partition. The partition column(s) should be what you consider a duplicate. If only one column makes a duplicate (e.g. user ID) then use as above. If multiple columns make a duplicate (e.g. first name, last name, city) then use a comma-separated list in the partition. SQL will basically group the rows by the partition and add up the 1s in the sum/* comment */(1). The example provided in the question sums an actual column. By summing 1 instead we will get the value 1 when there is only 1 ID in the partition. Anything higher is a duplicate. I guess you could call this field duplicate count.

Hierarchical SQL select-query

I'm using MS SqlServer 2008. And I have a table 'Users'. This table has the key field ID of bigint. And also a field Parents of varchar which encodes all chain of user's parent IDs.
For example:
User table:
ID | Parents
1 | null
2 | ..
3 | ..
4 | 3,2,1
Here user 1 has no parents and user 4 has a chain of parents 3->2->1. I created a function which parses the user's Parents field and returns result table with user IDs of bigint.
Now I need a query which will select and join IDs of some requested users and theirs parents (order of users and theirs parents is not important). I'm not an SQL expert so all I could come up with is the following:
WITH CTE AS(
SELECT
ID,
Parents
FROM
[Users]
WHERE
(
[Users].Name = 'John'
)
UNION ALL
SELECT
[Users].Id,
[Users].Parents
FROM [Users], CTE
WHERE
(
[Users].ID in (SELECT * FROM GetUserParents(CTE.ID, CTE.Parents) )
))
SELECT * FROM CTE
And basically it works. But performance of this query is very poor. I believe WHERE .. IN .. expression here is a bottle neck. As I understand - instead of just joining the first subquery of CTE (ID's of found users) with results of GetUserParents (ID's of user parents) it has to enumerate all users in the Users table and check whether the each of them is a part of the function's result (and judging on execution plan - Sql Server does distinct order of the result to improve performance of WHERE .. IN .. statement - which is logical by itself but in general is not required for my goal. But this distinct order takes 70% of execution time of the query). So I wonder how this query could be improved or perhaps somebody could suggest some another approach to solve this problem at all?
Thanks for any help!
The recursive query in the question looks redundant since you already form the list of IDs needed in GetUserParents. Maybe change this into SELECT from Users and GetUserParents() with WHERE/JOIN.
select Users.*
from Users join
(select ParentId
from (SELECT * FROM Users where Users.Name='John') as U
cross apply [GetDocumentParents](U.ID, U.Family, U.Parents))
as gup
on Users.ID = gup.ParentId
Since GetDocumentParents expects scalars and select... where produces a table, we need to apply the function to each row of the table (even if we "know" there's only one). That's what apply does.
I used indents to emphasize the conceptual parts of the query. (select...) as gup is the entity Users is join'd with; (select...) as U cross apply fn() is the argument to FROM.
The key knowledge to understanding this query is to know how the cross apply works:
it's a part of a FROM clause (quite unexpectedly; so the syntax is at FROM (Transact-SQL))
it transforms the table expression left of it, and the result becomes the argument for the FROM (i emphasized this with indent)
The transformation is: for each row, it
runs a table expression right of it (in this case, a call of a table-valued function), using this row
adds to the result set the columns from the row, followed by the columns from the call. (In our case, the table returned from the function has a single column named ParentId)
So, if the call returns multiple rows, the added records will be the same row from the table appended with each row from the function.
This is a cross apply so rows will only be added if the function returns anything. If this was the other flavor, outer apply, a single row would be added anyway, followed by a NULL in the function's column if it returned nothing.
This "parsing" thing violates even the 1NF. Make Parents field contain only the immediate parent (preferably, a foreign key), then an entire subtree can be retrieved with a recursive query.

SQL Server 2005 SELECT TOP 1 from VIEW returns LAST row

I have a view that may contain more than one row, looking like this:
[rate] | [vendorID]
8374 1234
6523 4321
5234 9374
In a SPROC, I need to set a param equal to the value of the first column from the first row of the view. something like this:
DECLARE #rate int;
SET #rate = (select top 1 rate from vendor_view where vendorID = 123)
SELECT #rate
But this ALWAYS returns the LAST row of the view.
In fact, if I simply run the subselect by itself, I only get the last row.
With 3 rows in the view, TOP 2 returns the FIRST and THIRD rows in order. With 4 rows, it's returning the top 3 in order. Yet still top 1 is returning the last.
DERP?!?
This works..
DECLARE #rate int;
CREATE TABLE #temp (vRate int)
INSERT INTO #temp (vRate) (select rate from vendor_view where vendorID = 123)
SET #rate = (select top 1 vRate from #temp)
SELECT #rate
DROP TABLE #temp
.. but can someone tell me why the first behaves so fudgely and how to do what I want? As explained in the comments, there is no meaningful column by which I can do an order by. Can I force the order in which rows are inserted to be the order in which they are returned?
[EDIT] I've also noticed that: select top 1 rate from ([view definition select]) also returns the correct values time and again.[/EDIT]
That is by design.
If you don't specify how the query should be sorted, the database is free to return the records in any order that is convenient. There is no natural order for a table that is used as default sort order.
What the order will actually be depends on how the query is planned, so you can't even rely on the same query giving a consistent result over time, as the database will gather statistics about the data and may change how the query is planned based on that.
To get the record that you expect, you simply have to specify how you want them sorted, for example:
select top 1 rate
from vendor_view
where vendorID = 123
order by rate
I ran into this problem on a query that had worked for years. We upgraded SQL Server and all of a sudden, an unordered select top 1 was not returning the final record in a table. We simply added an order by to the select.
My understanding is that SQL Server normally will generally provide you the results based on the clustered index if no order by is provided OR off of whatever index is picked by the engine. But, this is not a guarantee of a certain order.
If you don't have something to order off of, you need to add it. Either add a date inserted column and default it to GETDATE() or add an identity column. It won't help you historically, but it addresses the issue going forward.
While it doesn't necessarily make sense that the results of the query should be consistent, in this particular instance they are so we decided to leave it 'as is'. Ultimately it would be best to add a column, but this was not an option. The application this belongs to is slated to be discontinued sometime soon and the database server will not be upgraded from SQL 2005. I don't necessarily like this outcome, but it is what it is: until it breaks it shall not be fixed. :-x

How does sql server choose values in an update statement where there are multiple options?

I have an update statement in SQL server where there are four possible values that can be assigned based on the join. It appears that SQL has an algorithm for choosing one value over another, and I'm not sure how that algorithm works.
As an example, say there is a table called Source with two columns (Match and Data) structured as below:
(The match column contains only 1's, the Data column increments by 1 for every row)
Match Data
`--------------------------
1 1
1 2
1 3
1 4
That table will update another table called Destination with the same two columns structured as below:
Match Data
`--------------------------
1 NULL
If you want to update the ID field in Destination in the following way:
UPDATE
Destination
SET
Data = Source.Data
FROM
Destination
INNER JOIN
Source
ON
Destination.Match = Source.Match
there will be four possible options that Destination.ID will be set to after this query is run. I've found that messing with the indexes of Source will have an impact on what Destination is set to, and it appears that SQL Server just updates the Destination table with the first value it finds that matches.
Is that accurate? Is it possible that SQL Server is updating the Destination with every possible value sequentially and I end up with the same kind of result as if it were updating with the first value it finds? It seems to be possibly problematic that it will seemingly randomly choose one row to update, as opposed to throwing an error when presented with this situation.
Thank you.
P.S. I apologize for the poor formatting. Hopefully, the intent is clear.
It sets all of the results to the Data. Which one you end up with after the query depends on the order of the results returned (which one it sets last).
Since there's no ORDER BY clause, you're left with whatever order Sql Server comes up with. That will normally follow the physical order of the records on disk, and that in turn typically follows the clustered index for a table. But this order isn't set in stone, particularly when joins are involved. If a join matches on a column with an index other than the clustered index, it may well order the results based on that index instead. In the end, unless you give it an ORDER BY clause, Sql Server will return the results in whatever order it thinks it can do fastest.
You can play with this by turning your upate query into a select query, so you can see the results. Notice which record comes first and which record comes last in the source table for each record of the destination table. Compare that with the results of your update query. Then play with your indexes again and check the results once more to see what you get.
Of course, it can be tricky here because UPDATE statements are not allowed to use an ORDER BY clause, so regardless of what you find, you should really write the join so it matches the destination table 1:1. You may find the APPLY operator useful in achieving this goal, and you can use it to effectively JOIN to another table and guarantee the join only matches one record.
The choice is not deterministic and it can be any of the source rows.
You can try
DECLARE #Source TABLE(Match INT, Data INT);
INSERT INTO #Source
VALUES
(1, 1),
(1, 2),
(1, 3),
(1, 4);
DECLARE #Destination TABLE(Match INT, Data INT);
INSERT INTO #Destination
VALUES
(1, NULL);
UPDATE Destination
SET Data = Source.Data
FROM #Destination Destination
INNER JOIN #Source Source
ON Destination.Match = Source.Match;
SELECT *
FROM #Destination;
And look at the actual execution plan. I see the following.
The output columns from #Destination are Bmk1000, Match. Bmk1000 is an internal row identifier (used here due to lack of clustered index in this example) and would be different for each row emitted from #Destination (if there was more than one).
The single row is then joined onto the four matching rows in #Source and the resultant four rows are passed into a stream aggregate.
The stream aggregate groups by Bmk1000 and collapses the multiple matching rows down to one. The operation performed by this aggregate is ANY(#Source.[Data]).
The ANY aggregate is an internal aggregate function not available in TSQL itself. No guarantees are made about which of the four source rows will be chosen.
Finally the single row per group feeds into the UPDATE operator to update the row with whatever value the ANY aggregate returned.
If you want deterministic results then you can use an aggregate function yourself...
WITH GroupedSource AS
(
SELECT Match,
MAX(Data) AS Data
FROM #Source
GROUP BY Match
)
UPDATE Destination
SET Data = Source.Data
FROM #Destination Destination
INNER JOIN GroupedSource Source
ON Destination.Match = Source.Match;
Or use ROW_NUMBER...
WITH RankedSource AS
(
SELECT Match,
Data,
ROW_NUMBER() OVER (PARTITION BY Match ORDER BY Data DESC) AS RN
FROM #Source
)
UPDATE Destination
SET Data = Source.Data
FROM #Destination Destination
INNER JOIN RankedSource Source
ON Destination.Match = Source.Match
WHERE RN = 1;
The latter form is generally more useful as in the event you need to set multiple columns this will ensure that all values used are from the same source row. In order to be deterministic the combination of partition by and order by columns should be unique.

Resources