Preferred way to access data within XML columns in SQL Server - sql-server

Background
Recently I've started to use XML a lot more as a column in SQL Server 2005. During a bit of downtime yesterday, I noticed that two of the link tables I use are really just in the way, and it bores me to tears having to write yet more supporting structure code for a couple of joins.
To actually generate the data for these two link tables, I pass two XML fields into my stored procedure, which writes the main record, breaks the two XML variables down into temporary (#) tables, and inserts those into the actual tables with the new SCOPE_IDENTITY() from the master record.
After some thought, I decided to just do away with those tables altogether and store the XML in XML fields. Now I understand there are some pitfalls here, like general querying performance and the fact that GROUP BY doesn't work on XML data, and the queries are generally a bit of a mess, but overall I like that I can now work with XElement when I get the data back.
Also, this stuff isn't going to get changed. It's a one-shot affair, so I don't have to worry about modification.
I am wondering about the best way to actually get at this data. A lot of my queries involve getting a master record based on the criteria of a child or even a sub-child record. Most of the sprocs in the database do this, but on a far more elaborate scale, usually requiring UDFs and subqueries to work effectively, but I have knocked up a trivial example to test querying some data...
INSERT INTO Customers VALUES ('Tom', '', '<PhoneNumbers><PhoneNumber Type="1" Value="01234 456789" /><PhoneNumber Type="2" Value="01746 482954" /></PhoneNumbers>')
INSERT INTO Customers VALUES ('Andy', '', '<PhoneNumbers><PhoneNumber Type="2" Value="07948 598348" /></PhoneNumbers>')
INSERT INTO Customers VALUES ('Mike', '', '<PhoneNumbers><PhoneNumber Type="3" Value="02875 482945" /></PhoneNumbers>')
INSERT INTO Customers VALUES ('Steve', '', '<PhoneNumbers></PhoneNumbers>')
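For reference, a table layout consistent with these inserts and the queries below would be roughly as follows (the exact column types are an assumption, not part of the original schema):
CREATE TABLE Customers
(
    ID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName VARCHAR(100),
    Notes VARCHAR(MAX),
    PhoneNumbers XML
)
CREATE TABLE PhoneTypes
(
    ID INT PRIMARY KEY,
    PhoneType VARCHAR(50)
)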
Now I can see two ways of grabbing it.
Method 1
DECLARE @PhoneType INT
SET @PhoneType = 2
SELECT ct.*
FROM Customers ct
WHERE ct.PhoneNumbers.exist('/PhoneNumbers/PhoneNumber[@Type=sql:variable("@PhoneType")]') = 1
Really? sql:variable feels a bit unwholesome, but it does work. However, it makes it distinctly more difficult to access the data in a more meaningful way.
Method 2
SELECT ct.*, pt.PhoneType
FROM Customers ct
CROSS APPLY ct.PhoneNumbers.nodes('/PhoneNumbers/PhoneNumber') AS nums(pn)
INNER JOIN PhoneTypes pt ON pt.ID = nums.pn.value('./@Type[1]', 'int')
WHERE nums.pn.value('./@Type[1]', 'int') = @PhoneType
This is more like it. Already I can easily expand it to do joins and all the other good stuff. I've used CROSS APPLY before on a table-valued function, and it worked very well. The execution plan for this, as opposed to the previous query, is considerably more involved. Admittedly I haven't done any indexing and whatnot on these tables, but it accounts for 97% of the entire batch cost.
Method 2 (expanded)
SELECT ct.ID, ct.CustomerName, ct.Notes, pt.PhoneType
FROM Customers ct
CROSS APPLY ct.PhoneNumbers.nodes('/PhoneNumbers/PhoneNumber') AS nums(pn)
INNER JOIN PhoneTypes pt ON pt.ID = nums.pn.value('./@Type[1]', 'int')
WHERE nums.pn.value('./@Type[1]', 'int') IN (SELECT ID FROM PhoneTypes)
Nice IN clause here. I can also do something like pt.PhoneType = 'Work'
Finally
So I'm essentially obtaining the results that I want, but is there anything I should be aware of when using this mechanism to interrogate small amounts of XML data? Will it fall down on performance during elaborate searches? And is the storage of such markup style data too much of an overhead?
Side note
I've used things like sp_xml_preparedocument and OPENXML in the past just to pass lists into sprocs, but this is like a breath of fresh air in comparison!

One approach we've taken for some of our key items of information stored inside an XML column is to "surface" them as computed, persisted properties on the "parent" table. This is done using a little stored function.
It works great because the value is recomputed only when the XML changes; as long as it isn't changing, there's no recomputation, and the value is stored on the table like any other column.
It's also great since it can be indexed! So if you're searching and/or joining on such a field, that works like a charm!
So you basically need a stored function along the lines of this:
CREATE FUNCTION [dbo].[GetPhoneNo1](@DataXML XML)
RETURNS VARCHAR(50)
WITH SCHEMABINDING
AS BEGIN
DECLARE @result VARCHAR(50)
SELECT
@result = @DataXML.value('(/PhoneNumbers/PhoneNumber[@Type="1"]/@Value)[1]', 'VARCHAR(50)')
RETURN @result
END
If you don't have a phone number of type 1, you'll just get back a NULL.
Then, you need to extend your parent table with a computed, persisted column:
ALTER TABLE dbo.Customers
ADD PhoneNumberType1 AS dbo.GetPhoneNo1(PhoneNumbers) PERSISTED
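And because the column is persisted, it can be indexed like any other column, e.g. (the index name here is just illustrative):
CREATE INDEX IX_Customers_PhoneNumberType1 ON dbo.Customers (PhoneNumberType1)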
As you can see, it works just fine for single entries, but unfortunately you cannot surface a whole list of properties this way. But if you have some key items, like IDs, that you expect most of your rows to have, this can be a very nice and slick way to get at that information more easily and more efficiently.

Related

'Multiple' values for a variable

A bit of background. There are multiple tables from multiple databases that have the same schemas. So, when I query to select all columns having the same master code (in the tables, the master code is in the column called CATMASTRCAT), the same code will have multiple rows; the only thing they share is the CATMASTRCAT value. This works for a single master code (in the script below, if I set the variable to 031325-002-70 it will show multiple rows with different organizations and the same data for the rest, which is the desired result).
The question is: is there a way to pass multiple master codes as input through the variable? I'm planning to create this as a stored procedure.
This is my SQL script:
DECLARE @ProductNumber AS VARCHAR(1000)
SET @ProductNumber = ('031325-002-70')
SELECT ITEMS
,ORGANIZATION
FROM [EU].[dbo].[SOMETHING14]
WHERE ITEMS IN (@ProductNumber)
UNION
SELECT ITEMS
,ORGANIZATION
FROM [EU].[dbo].[SOMETHING12]
WHERE ITEMS IN (@ProductNumber)
UNION
SELECT ITEMS
,ORGANIZATION
FROM [EU].[dbo].[SOMETHING11]
WHERE ITEMS IN (@ProductNumber)
Feel free to clarify any other needed data. I'm fairly new to SQL, just self-learning. You can also lecture me about the wrong code haha and how to do this better.
Thanks!
P.S. I've attached a picture of the query result.
Yes, the best way to do this is to use a table-valued parameter and then change the WHERE clause to say
WHERE catmastrcat IN (SELECT catmastrcat FROM @tablevaluename)
or you could use an inner join, which might be faster depending on indexes and other issues; the code for that would look like this (a fuller sketch follows below):
JOIN @tablevaluename tv ON AJF_CATMASTER.catmastrcat = tv.catmastrcat
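As a rough sketch of the table-valued parameter approach (the type, procedure and parameter names are illustrative, and I've kept the WHERE ITEMS filter from your script):
CREATE TYPE dbo.MasterCodeList AS TABLE (catmastrcat VARCHAR(50) NOT NULL)
GO
CREATE PROCEDURE dbo.GetItemsByMasterCode
    @Codes dbo.MasterCodeList READONLY
AS
BEGIN
    SELECT ITEMS, ORGANIZATION
    FROM [EU].[dbo].[SOMETHING14]
    WHERE ITEMS IN (SELECT catmastrcat FROM @Codes)
    UNION
    SELECT ITEMS, ORGANIZATION
    FROM [EU].[dbo].[SOMETHING12]
    WHERE ITEMS IN (SELECT catmastrcat FROM @Codes)
    UNION
    SELECT ITEMS, ORGANIZATION
    FROM [EU].[dbo].[SOMETHING11]
    WHERE ITEMS IN (SELECT catmastrcat FROM @Codes)
END
GO
-- calling it with multiple master codes:
DECLARE @Codes dbo.MasterCodeList
INSERT INTO @Codes VALUES ('031325-002-70'), ('031325-002-71')
EXEC dbo.GetItemsByMasterCode @Codes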

Convert Date Stored as VARCHAR into INT to compare to Date Stored as INT

I'm using SQL Server 2014. My request I believe is rather simple. I have one table containing a field holding a date value that is stored as VARCHAR, and another table containing a field holding a date value that is stored as INT.
The date value in the VARCHAR field is stored like this: 2015M01
The data value in the INT field is stored like this: 201501
I need to compare these tables against each other using EXCEPT. My thought process was to somehow extract or TRIM the "M" out of the VARCHAR value and see if it would let me compare the two. If anyone has a better idea such as using CAST to change the date formats or something feel free to suggest that as well.
I am also concerned that even extracting the "M" out of the VARCHAR may still prevent the comparison, since one value will still be VARCHAR and the other INT. If it's possible to convert on the fly in a T-SQL query, that would be great advice as well. :)
REPLACE the string and then CONVERT to integer
SELECT A.*, B.*
FROM TableA A
INNER JOIN
(SELECT intField
FROM TableB
) as B
ON CONVERT(INT, REPLACE(A.varcharField, 'M', '')) = B.intField
Since you say you already have the query and are using EXCEPT, you can simply change the definition of that one "date" field in the query containing the VARCHAR value so that it matches the INT format of the other query. For example:
SELECT Field1, CONVERT(INT, REPLACE(VarcharDateField, 'M', '')) AS [DateField], Field3
FROM TableA
EXCEPT
SELECT Field1, IntDateField, Field3
FROM TableB
HOWEVER, while I realize that this might not be feasible, your best option, if you can make this happen, would be to change how the data in the table with the VARCHAR field is stored so that it is actually an INT in the same format as the table with the data already stored as an INT. Then you wouldn't have to worry about situations like this one.
Meaning:
Add an INT field to the table with the VARCHAR field.
Do an UPDATE of that table, setting the INT field to the string value with the M removed.
Update any INSERT and/or UPDATE stored procedures used by external services (app, ETL, etc) to do that same M removal logic on the way in. Then you don't have to change any app code that does INSERTs and UPDATEs. You don't even need to tell anyone you did this.
Update any "get" / SELECT stored procedures used by external services (app, ETL, etc) to do the opposite logic: convert the INT to VARCHAR and add the M on the way out. Then you don't have to change any app code that gets data from the DB. You don't even need to tell anyone you did this.
This is one of many reasons that having a stored procedure API to your DB is quite handy. I suppose an ORM can just be rebuilt, but you still need to recompile, even if all of the code references are automatically updated. But making a datatype change (or even moving a field to a different table, or even replacing a field with a simple CASE statement) "behind the scenes" and masking it so that any code outside of your control doesn't know that a change happened is not nearly as difficult as most people might think. I have done all of these operations (datatype change, move a field to a different table, replace a field with simple logic, etc, etc) and it buys you a lot of time until the app code can be updated. That might be another team who handles that. Maybe their schedule won't allow for making any changes in that area (plus testing) for 3 months. OK. It will be there waiting for them when they are ready. And if there are several areas to update, then they can be done one at a time. You can even create new stored procedures to run in parallel for any updated app code to have the proper INT datatype as the input parameter. And once all references to the VARCHAR value are gone, then delete the original versions of those stored procedures.
If you want everything in the first table that is not in the second, you might consider something like this:
select t1.*
from t1
where not exists (select 1
from t2
where cast(replace(t1.varcharfield, 'M', '') as int) = t2.intfield
);
This should be close enough to except for your purposes.
I should add that you might need to include other columns in the where statement. However, the question only mentions one column, so I don't know what those are.
You could create a persisted view on the table with the char column, with a calculated column where the M is removed. Then you could JOIN the view to the table containing the INT column.
CREATE VIEW dbo.PersistedView
WITH SCHEMABINDING
AS
SELECT ConvertedDateCol = CONVERT(INT, REPLACE(VarcharCol, 'M', ''))
--, other columns including the PK, etc
FROM dbo.TablewithCharColumn;
CREATE UNIQUE CLUSTERED INDEX IX_PersistedView
ON dbo.PersistedView(<the PK column>);
SELECT *
FROM dbo.PersistedView pv
INNER JOIN dbo.TableWithIntColumn ic ON pv.ConvertedDateCol = ic.IntDateCol;
If you provide the actual details of both tables, I will edit my answer to make it clearer.
A persisted view with a computed column will perform far better on the SELECT statement where you join the two columns compared with doing the CONVERT and REPLACE every time you run the SELECT statement.
However, a persisted view will slightly slow down inserts into the underlying table(s), and will prevent you from making DDL changes to the underlying tables.
If you're looking to not persist the values via a schema-bound view, you could create a non-persisted computed column on the table itself, then create a non-clustered index on that column. If you are using the computed column in WHERE or JOIN clauses, you may see some benefit.
By way of example:
CREATE TABLE dbo.PCT
(
PCT_ID INT NOT NULL
CONSTRAINT PK_PCT
PRIMARY KEY CLUSTERED
IDENTITY(1,1)
, SomeChar VARCHAR(50) NOT NULL
, SomeCharToInt AS CONVERT(INT, REPLACE(SomeChar, 'M', ''))
);
CREATE INDEX IX_PCT_SomeCharToInt
ON dbo.PCT(SomeCharToInt);
INSERT INTO dbo.PCT(SomeChar)
VALUES ('2015M08');
SELECT SomeCharToInt
FROM dbo.PCT;
Results: SomeCharToInt comes back as 201508 for the row inserted above.

Parse XML for a column value in a SQL query

I've set up a SQL fiddle to mimic the tables that I currently have which can be found here: http://sqlfiddle.com/#!6/7675e/5
I have two tables that I would like to join (Things and ThingData), which is easy enough, but I would like one of the columns to come from a value pulled out of the parsed XML in one of the columns in ThingData.
Ideally the output would look something like this:
thingID | thingValue | xmlValue
1 | aaa | a
As you can see in the fiddle, I'm able to parse a single XML string at a time, but I'm unsure of how to go from here to using the parsing stuff as a column in a join. Any help would be greatly appreciated.
I updated your SQL Fiddle to demonstrate; the number one issue you're facing is that you're not using an XML type for your XML column. That's going to cause headaches down the road as people shove crap in that column :P
http://sqlfiddle.com/#!6/4c674/2
I have made a change to the query that gets executed.
DECLARE @xml xml
SET @xml = (select thingDataXML from ThingData where thingDataID = 3)
;WITH XMLNAMESPACES(DEFAULT 'http://www.testing.org/a/b/c/d123')
SELECT
t.[thingID]
, t.[thingValue]
, CONVERT(XML,td.[thingDataXML]).value('(/anItem/a1/b1/c1/text())[1]','Varchar(1)') as xmlValue
FROM Things t
join ThingData td
on td.[thingDataID] = t.[thingDataID]
This should join from your main table into the table containing the XML and then retrieve the one value from the XML for you. Note that if you would like to return multiple values for the one row, you will need to either concatenate the result or perhaps use a cross join (CROSS APPLY with nodes()) to turn the XML into a data set you can perform some logic with.
EDIT:
The answer to your question would be that both have a use case: the cross join shown above creates an accessible table to query, which uses a tiny bit more memory while the query runs but can be accessed many times, whereas the normal XQuery value() function just reads a node, which only works for reading a single value from the XML.
Cross join would be the solution if you would like to construct tabular data out of your XML (a rough sketch follows below).
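As a sketch of that tabular approach (reusing the element path and namespace from the query above; the varchar(10) output type is an assumption), a CROSS APPLY over nodes() might look like this:
;WITH XMLNAMESPACES(DEFAULT 'http://www.testing.org/a/b/c/d123')
SELECT
t.[thingID]
, t.[thingValue]
, n.c.value('(./text())[1]', 'varchar(10)') AS xmlValue
FROM Things t
JOIN ThingData td ON td.[thingDataID] = t.[thingDataID]
CROSS APPLY (SELECT CONVERT(XML, td.[thingDataXML]) AS x) AS d
CROSS APPLY d.x.nodes('/anItem/a1/b1/c1') AS n(c)
This returns one row per matching c1 node rather than just the first value.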

MAX keyword taking a lot of time to select a value from a column

Well, I have a table with 40,000,000+ records, but when I try to execute a simple query, it takes ~3 minutes to finish. Since I am using the same query in my C# solution, which needs to execute it 100+ times, the overall performance of the solution is deeply hit.
This is the query that I am using in a proc
DECLARE @Id bigint
SELECT @Id = MAX(ExecutionID) FROM ExecutionLog WHERE TestID = 50881
SELECT @Id
Any help to improve the performance would be great. Thanks.
What indexes do you have on the table? It sounds like you don't have anything even close to useful for this particular query, so I'd suggest trying to do:
CREATE INDEX IX_ExecutionLog_TestID ON ExecutionLog (TestID, ExecutionID)
...at the very least. Your query is filtering by TestID, so this needs to be the leading column in the composite index: if you have no indexes on TestID, then SQL Server will resort to scanning the entire table in order to find rows where TestID = 50881.
It may help to think of indexes on SQL tables in the same way as the index you'd find in the back of a big book: hierarchical and multi-level. If you were looking for something, you'd manually look under 'T' for TestID, and then there'd be a sub-heading under TestID for ExecutionID. Without an index entry for TestID, you'd have to read through the entire book looking for TestID, then see if there's a mention of ExecutionID with it. This is effectively what SQL Server has to do.
If you don't have any indexes, then you'll find it useful to review all the queries that hit the table, and ensure that one of those indexes is a clustered index (rather than non-clustered).
Try to re-work everything into something that works in a set based manner.
So, for instance, you could write a select statement like this:
;WITH OrderedLogs AS (
SELECT ExecutionID, TestID,
ROW_NUMBER() OVER (PARTITION BY TestID ORDER BY ExecutionID DESC) AS rn
FROM ExecutionLog
)
SELECT * FROM OrderedLogs WHERE rn = 1 AND TestID IN (50881, 50882, 50883)
This would then find the maximum ExecutionID for 3 different tests simultaneously.
You might need to store that result in a table variable/temp table, but hopefully you can instead continue building up a larger, single query that processes all of the results in parallel.
This is the sort of processing that SQL is meant to be good at - don't cripple the system by iterating through the TestIDs in your code.
If you need to pass many test IDs into a stored procedure for this sort of query, look at Table Valued Parameters.
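A rough sketch of that, combined with the ROW_NUMBER query above (the type and parameter names here are illustrative):
CREATE TYPE dbo.TestIdList AS TABLE (TestID INT NOT NULL PRIMARY KEY)
GO
-- inside a stored procedure that accepts @TestIDs dbo.TestIdList READONLY:
;WITH OrderedLogs AS (
SELECT el.ExecutionID, el.TestID,
ROW_NUMBER() OVER (PARTITION BY el.TestID ORDER BY el.ExecutionID DESC) AS rn
FROM ExecutionLog el
JOIN @TestIDs t ON t.TestID = el.TestID
)
SELECT * FROM OrderedLogs WHERE rn = 1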

The Best Way to shred XML data into SQL Server database columns

What is the best way to shred XML data into various database columns? So far I have mainly been using the nodes and value functions like so:
INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value('(@column1)[1]', 'varchar(20)'),
Rows.n.value('(@column2)[1]', 'nvarchar(100)'),
Rows.n.value('(@column3)[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)
However I find that this is getting very slow for even moderate size xml data.
Stumbled across this question whilst having a very similar problem. I'd been running a query processing a 7.5MB XML file (approximately 10,000 nodes) for around 3.5-4 hours before finally giving up.
However, after a little more research I found that having typed the XML using a schema and created an XML index (I'd bulk inserted into a table), the same query completed in ~0.04ms.
How's that for a performance improvement!
Code to create a schema:
IF EXISTS ( SELECT * FROM sys.xml_schema_collections where [name] = 'MyXmlSchema')
DROP XML SCHEMA COLLECTION [MyXmlSchema]
GO
DECLARE @MySchema XML
SET @MySchema =
(
SELECT * FROM OPENROWSET
(
BULK 'C:\Path\To\Schema\MySchema.xsd', SINGLE_CLOB
) AS xmlData
)
CREATE XML SCHEMA COLLECTION [MyXmlSchema] AS @MySchema
GO
Code to create the table with a typed XML column:
CREATE TABLE [dbo].[XmlFiles] (
[Id] [uniqueidentifier] NOT NULL,
-- Data from CV element
[Data] xml(CONTENT dbo.[MyXmlSchema]) NOT NULL,
CONSTRAINT [PK_XmlFiles] PRIMARY KEY NONCLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Code to create Index
CREATE PRIMARY XML INDEX PXML_Data
ON [dbo].[XmlFiles] (Data)
There are a few things to bear in mind though. SQL Server's implementation of Schema doesn't support xsd:include. This means that if you have a schema which references other schema, you'll have to copy all of these into a single schema and add that.
Also I would get an error:
XQuery [dbo.XmlFiles.Data.value()]: Cannot implicitly atomize or apply 'fn:data()' to complex content elements, found type 'xs:anyType' within inferred type 'element({http://www.mynamespace.fake/schemas}:SequenceNumber,xs:anyType) ?'.
if I tried to navigate above the node I had selected with the nodes function. E.g.
SELECT
C.value('CVElementId[1]', 'INT') AS [CVElementId]
,C.value('../SequenceNumber[1]', 'INT') AS [Level]
FROM
[dbo].[XmlFiles]
CROSS APPLY
[Data].nodes('/CVSet/Level/CVElement') AS T(C)
Found that the best way to handle this was to use the OUTER APPLY to in effect perform an "outer join" on the XML.
SELECT
C.value('CVElementId[1]', 'INT') AS [CVElementId]
,B.value('SequenceNumber[1]', 'INT') AS [Level]
FROM
[dbo].[XmlFiles]
CROSS APPLY
[Data].nodes('/CVSet/Level') AS T(B)
OUTER APPLY
B.nodes ('CVElement') AS S(C)
Hope that that helps someone as that's pretty much been my day.
In my case I'm running SQL 2005 SP2 (9.0).
The only thing that helped was adding OPTION ( OPTIMIZE FOR ( @your_xml_var = NULL ) ).
Explanation is on the link below.
Example:
INSERT INTO #tbl (Tbl_ID, Name, Value, ParamData)
SELECT 1,
tbl.cols.value('name[1]', 'nvarchar(255)'),
tbl.cols.value('value[1]', 'nvarchar(255)'),
tbl.cols.query('./paramdata[1]')
FROM @xml.nodes('//root') as tbl(cols) OPTION ( OPTIMIZE FOR ( @xml = NULL ) )
https://connect.microsoft.com/SQLServer/feedback/details/562092/an-insert-statement-using-xml-nodes-is-very-very-very-slow-in-sql2008-sp1
I'm not sure what the best method is. I used the OPENXML construction:
INSERT INTO Test
SELECT Id, Data
FROM OPENXML (@XmlDocument, '/Root/blah', 2)
WITH (Id int '@ID',
Data varchar(10) '@DATA')
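For completeness (this part isn't shown above), OPENXML expects a document handle produced by sp_xml_preparedocument rather than the raw XML text; roughly like this (variable names here are illustrative):
DECLARE @XmlDocument INT
DECLARE @XmlText NVARCHAR(MAX)
SET @XmlText = N'<Root><blah ID="1" DATA="abc" /></Root>'
EXEC sp_xml_preparedocument @XmlDocument OUTPUT, @XmlText
-- ... run the OPENXML INSERT above here, using @XmlDocument as the handle ...
EXEC sp_xml_removedocument @XmlDocument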
To speed it up, you can create XML indexes; in particular, a secondary VALUE index is aimed specifically at value() function performance optimization. You can also use typed XML columns, which perform better.
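For example, something along these lines (the table, column and index names here are hypothetical; a secondary index requires a primary XML index to exist first, and the table needs a clustered primary key):
-- assumes a table dbo.SomeTable with a clustered primary key and an XML column DataXml
CREATE PRIMARY XML INDEX PXML_SomeTable_DataXml
ON dbo.SomeTable (DataXml);
CREATE XML INDEX IXML_SomeTable_DataXml_Value
ON dbo.SomeTable (DataXml)
USING XML INDEX PXML_SomeTable_DataXml FOR VALUE;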
We had a similar issue here. Our DBA (SP, you the man) took a look at my code, made a little tweak to the syntax, and we got the speed we had been expecting. It was unusual because my select from XML was plenty fast, but the insert was way slow. So try this syntax instead:
INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value(N'(@column1/text())[1]', 'varchar(20)'),
Rows.n.value(N'(@column2/text())[1]', 'nvarchar(100)'),
Rows.n.value(N'(@column3/text())[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)
So specifying the text() node really seems to make a difference in performance. It took our insert of 2K rows from 'I must have written that wrong - let me stop it' to about 3 seconds, which was 2x faster than the raw insert statements we had been running through the connection.
I wouldn't claim this is the "best" solution, but I've written a generic SQL CLR procedure for this exact purpose - it takes a "tabular" Xml structure (such as that returned by FOR XML RAW) and outputs a resultset.
It does not require any customization / knowledge of the structure of the "table" in the Xml, and turns out to be extremely fast / efficient (although this wasn't a design goal). I just shredded a 25MB (untyped) xml variable in under 20 seconds, returning 25,000 rows of a pretty wide table.
Hope this helps someone:
http://architectshack.com/ClrXmlShredder.ashx
This isn't an answer, more an addition to this question - I have just come across the same problem and I can give figures as edg asks for in the comment.
My test has xml which results in 244 records being inserted - so 244 nodes.
The code that I am rewriting takes on average 0.4 seconds to run (10 test runs, ranging from 0.344 to 0.56 seconds). Performance is not the main reason the code is being rewritten, but the new code needs to perform as well or better. The old code loops over the XML nodes, calling a stored procedure to insert once per loop.
The new code is pretty much just a single sp; pass the xml in; shred it.
Tests with the new code switched in show the new sp takes on average 3.7 seconds - almost 10 times slower.
My query is in the form posted in this question;
INSERT INTO some_table (column1, column2, column3)
SELECT
Rows.n.value('(@column1)[1]', 'varchar(20)'),
Rows.n.value('(@column2)[1]', 'nvarchar(100)'),
Rows.n.value('(@column3)[1]', 'int')
FROM @xml.nodes('//Rows') Rows(n)
The execution plan appears to show that for each column, SQL Server is doing a separate "Table Valued Function [XMLReader]" returning all 244 rows, then joining everything back up with Nested Loops (Inner Join). So in my case, where I am shredding from / inserting into about 30 columns, this appears to happen separately 30 times.
I am going to have to dump this code; I don't think any optimisation is going to get over this method being inherently slow. I am going to try the sp_xml_preparedocument/OPENXML method and see if the performance is better for that. If anyone comes across this question from a web search (as I did), I would highly advise you to do some performance testing before using this type of shredding in SQL Server.
There is an XML Bulk load COM object (.NET Example)
From MSDN:
You can insert XML data into a SQL Server database by using an INSERT statement and the OPENXML function; however, the Bulk Load utility provides better performance when you need to insert large amounts of XML data.
My current solution for large XML sets (> 500 nodes) is to use SQL Bulk Copy (System.Data.SqlClient.SqlBulkCopy), using a DataSet to load the XML into memory and then passing the table to SqlBulkCopy (defining an XML schema helps).
Obviously there are pitfalls, such as needlessly using a DataSet and loading the whole document into memory first. I would like to go further in the future and implement my own IDataReader to bypass the DataSet method, but currently the DataSet is "good enough" for the job.
Basically I never found a solution to my original question regarding the slow performance of that type of XML shredding. It could be slow due to the typed XML queries being inherently slow, or something to do with transactions and the SQL Server log. I guess the typed XML functions were never designed for operating on non-trivial node sizes.
XML Bulk Load: I tried this and it was fast, but I had trouble getting the COM DLL to work under 64-bit environments and I generally try to avoid COM DLLs that no longer appear to be supported.
sp_xml_preparedocument/OPENXML: I never went down this road so would be interested to see how it performs.
