Baseball XML to SQL query - optimize

The source data comes from the following freely available XML files describing Major League Baseball games.
http://gd2.mlb.com/components/game/mlb/year_2013/month_04/day_09/gid_2013_04_09_atlmlb_miamlb_1/inning/
I have created a SQL Server table that contains a row for every GamePK/inning, with an XML column named PBP. Each file in the folder above becomes a row in this table. The query below is my attempt to parse the XML into a record set. It works, but it is very slow for a large number of rows and very repetitive. It seems like there should be a better way to do this without the UNION clause. Any help in improving/optimizing it is appreciated.
select
i.GamePK
,inn.value('@num', 'int') as inning
,itop.value('1', 'int') as IsTop
,itop.value('@num', 'int') as abNum
,itop.value('@batter', 'int') as batter
-- clip
,itoppit.value('@des', 'varchar(32)') as pitdesc
,itoppit.value('@id', 'int') as seq
,itoppit.value('@type', 'varchar(8)') as pittype
-- clip
from tblInnings i
cross apply PBP.nodes('/inning') as inn(inn)
cross apply inn.nodes('top/atbat') as itop(itop)
cross apply itop.nodes('pitch') as itoppit(itoppit)
union
select
i.GamePK
,inn.value('@num', 'int') as inning
,ibot.value('0', 'int') as IsTop
,ibot.value('@num', 'int') as abNum
,ibot.value('@batter', 'int') as batter
-- clip
,ibotpit.value('@des', 'varchar(32)') as pitdesc
,ibotpit.value('@id', 'int') as seq
,ibotpit.value('@type', 'varchar(8)') as pittype
--clip
from tblInnings i
cross apply PBP.nodes('/inning') as inn(inn)
cross apply inn.nodes('bottom/atbat') as ibot(ibot)
cross apply ibot.nodes('pitch') as ibotpit(ibotpit)

If you're using a recent version of SQL Server, there's a new column data type (XML).
You can apply xpath to it, making querying the column much easier.
Instead of trying to store the XML as a string in your DB, I'd suggest you actually store it as XML, and treat it as XML.
There is a learning curve. You'll need to be familiar with XPATH, but it's not rocket science.
An example:
SELECT Id, PartitionMonth, EmailAddress, AcquisitionCodeId, FieldValues.value('
declare namespace s="http://domain.com/FieldValues.xsd";
data(/s:FieldValues/s:item/@value)[1]', 'varchar(200)')
FROM Leads.Leads WITH (NOLOCK)
WHERE Id = 190708
Another example retrieving values by key:
SELECT r.EmailAddress, ar.Ip, ar.DateLog,
ar.FieldValues.value('
declare namespace s="http://domain.com/FieldValues.xsd";
data(/s:FieldValues/s:item[@key="First Name"]/@value)[1]', 'varchar(20)') FirstName,
ar.FieldValues.value('
declare namespace s="http://domain.com/FieldValues.xsd";
data(/s:FieldValues/s:item[@key="Last Name"]/@value)[1]', 'varchar(20)') LastName
FROM Records.Records r WITH (NOLOCK)
JOIN Records.AcquisitionRecords ar WITH (NOLOCK) ON r.Id = ar.Id
WHERE ar.AcquisitionCodeId IN (19, 21, 30, 34, 36)
AND ar.DateLog BETWEEN '1-mar-09' AND '31-mar-09'
A good place to get started on XML in SQL Server
http://msdn.microsoft.com/en-US/library/ms189887(v=sql.90).aspx
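For the query in the question specifically, one way to drop the UNION might be to shred both half-innings in a single pass and derive IsTop from the element name. The following is a minimal sketch, not a tested solution: the table and column names (tblInnings, GamePK, PBP) and the element and attribute names are taken from the question, while the aliases half, ab and pit are made up for illustration. It assumes every inning element contains top and/or bottom children holding atbat/pitch elements.

```sql
-- Sketch only: names come from the question, nothing here was run
-- against the real feed.
SELECT
    i.GamePK
   ,inn.n.value('@num', 'int')          AS inning
    -- local-name(.) returns 'top' or 'bottom' for the current half-inning
   ,CASE WHEN half.n.value('local-name(.)', 'varchar(10)') = 'top'
         THEN 1 ELSE 0 END              AS IsTop
   ,ab.n.value('@num', 'int')           AS abNum
   ,ab.n.value('@batter', 'int')        AS batter
   ,pit.n.value('@des', 'varchar(32)')  AS pitdesc
   ,pit.n.value('@id', 'int')           AS seq
   ,pit.n.value('@type', 'varchar(8)')  AS pittype
FROM tblInnings AS i
CROSS APPLY i.PBP.nodes('/inning') AS inn(n)
-- one row per half-inning element, whichever of the two it is
CROSS APPLY inn.n.nodes('*[local-name() = "top" or local-name() = "bottom"]') AS half(n)
CROSS APPLY half.n.nodes('atbat') AS ab(n)
CROSS APPLY ab.n.nodes('pitch') AS pit(n);
```

Besides shredding each XML document once instead of twice, this avoids the duplicate-removal step that UNION (as opposed to UNION ALL) performs. Whether it is actually faster should be checked against the real data.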

Related

SQL Server 2012 XML Flatten and reduce fan out / duplicates

I'm having some issues trying to get some XML data (stored as text in an old MS SQL Server 2012) parsed into a usable format.
The XML data is a string, but when I convert it to XML, it looks like this:
<?xml version="1.0" encoding="utf-8"?>
<header1>
<header2>
<OrderFormHeader>
<AccountNum>123456</AccountNum>
<OrderNum>000123987</OrderNum>
<OrderDetails>
<CompanyName>Biznez1</CompanyName>
<CompAddressInfo>
<City>Phoenix</City>
<State>AZ</State>
</CompAddressInfo>
<ShipTo>TRUE</ShipTo>
<BillTo>FALSE</BillTo>
</OrderDetails>
</OrderFormHeader>
<OrderFormDetails>
<OrderFormLines>
<ItemNum>000001</ItemNum>
<InventoryNum>INV-001-000001</InventoryNum>
<OtherDetails>
<QtyOrdered>1</QtyOrdered>
<ItemDesc>Bandaids</ItemDesc>
<UnitofMeasure>Box</UnitofMeasure>
<ItemCode>
<CodeType>UPC</CodeType>
<CodeID>123456789123</CodeID>
</ItemCode>
</OtherDetails>
</OrderFormLines>
</OrderFormDetails>
<OrderFormLines>
<ItemNum>000002</ItemNum>
<InventoryNum>INV-001-000002</InventoryNum>
<OtherDetails>
<QtyOrdered>1</QtyOrdered>
<ItemDesc>QTips</ItemDesc>
<UnitofMeasure>Box</UnitofMeasure>
<ItemCode>
<CodeType>UPC</CodeType>
<CodeID>123456789987</CodeID>
</ItemCode>
</OtherDetails>
</OrderFormLines>
<OrderFormLines>
<ItemNum>000003</ItemNum>
<InventoryNum>INV-003-000001</InventoryNum>
<OtherDetails>
<QtyOrdered>1</QtyOrdered>
<ItemDesc>Scissors</ItemDesc>
<UnitofMeasure>Each</UnitofMeasure>
<ItemCode>
<CodeType>UPC</CodeType>
<CodeID>123456987321</CodeID>
</ItemCode>
</OtherDetails>
</OrderFormLines>
</header2>
</header1>
Needless to say, it's a crazy XML (at least to me).
(Note: There are multiple sets of OrderFormDetails nested within the object and parsing them via my code seems to fan out on the ItemNum and InventoryNum. I've removed the UPC code stuff as that was causing additional fan out, but wouldn't mind bringing that back into my code)
With that said, my current SQL code uses a table variable to take the data from the table, correct the UTF-8 and put it into an XML format. From there, I use the CROSS APPLY functions to get the data out, but it has severe fan-out issues where it will show the data multiple times rather than just 1 row each:
DECLARE @xml TABLE (IMPORTED_XML xml)
INSERT INTO @xml
SELECT
CAST(REPLACE(mxt.XML_FIELD,'encoding="UTF-8"','encoding="UTF-16"') AS XML) AS IMPORTED_XML
FROM MyXMLTable as mxt
;with temp1 AS (
SELECT DISTINCT
sales_order.value('(./AccountNum/text())[1]','nvarchar(max)') AS ACCOUNT_NUM
, sales_order.value('(./OrderNum/text())[1]','nvarchar(max)') AS ORDER_NUM
, extra_so.value('(./CompanyName/text())[1]','nvarchar(max)') AS COMPANY_NAME
, base.value('(./ItemNum/text())[1]','nvarchar(max)') AS ITEM_ID
, base.value('(./InventoryNum/text())[1]','nvarchar(max)') AS INVENTORY_NUM
, sales.value('(./QtyOrdered/text())[1]','nvarchar(max)') AS QTY_ORDERED
, sales.value('(./UnitofMeasure/text())[1]','nvarchar(max)') AS ITEM_UOM
, sales.value('(./ItemDesc/text())[1]','nvarchar(max)') AS ITEM_DESC
FROM @xml
CROSS APPLY IMPORTED_XML.nodes('/header1/header2') AS core(core)
CROSS APPLY core.nodes('//OrderFormDetails/OrderFormLines') as base(base)
CROSS APPLY core.nodes('//OrderFormHeader') AS sales_order(sales_order)
CROSS APPLY base.nodes('//OtherDetails') as sales(sales)
CROSS APPLY sales_order.nodes('//OrderDetails') AS extra_so(extra_so)
CROSS APPLY sales.nodes('//ItemCode') as itmcode(itmcode)
)
select * from temp1 order by item_desc asc
This seems to mostly work, but it ends up with multiple rows of data for the same stuff... I'm used to using the lateral flatten function in Snowflake, but not this XML parsing in SQL Server 2012. Any insight into this? Thank you in advance for your help

Your issue is that you are cross-joining each nested node all the way back from the root, because you are using //.
There are other points to note:
You don't need temporary tables; you can CROSS APPLY everything together in one query.
You don't need REPLACE if the column is already varchar, only if it's nvarchar.
You don't need to use .nodes on every level of nesting; you only need it where you want multiple items from a single level.
Pick your data types carefully: does everything have to be nvarchar(max)?
SELECT
sales_order.value('(AccountNum/text())[1]','varchar(50)') AS ACCOUNT_NUM
, sales_order.value('(OrderNum/text())[1]','varchar(50)') AS ORDER_NUM
, sales_order.value('(OrderDetails/CompanyName/text())[1]','nvarchar(200)') AS COMPANY_NAME
, base.value('(ItemNum/text())[1]','varchar(50)') AS ITEM_ID
, base.value('(InventoryNum/text())[1]','varchar(50)') AS INVENTORY_NUM
, sales.value('(QtyOrdered/text())[1]','int') AS QTY_ORDERED
, sales.value('(UnitofMeasure/text())[1]','varchar(20)') AS ITEM_UOM
, sales.value('(ItemDesc/text())[1]','nvarchar(max)') AS ITEM_DESC
, itmcode.value('(CodeType/text())[1]','varchar(20)') AS itemcodetype
, itmcode.value('(CodeID/text())[1]','varchar(50)') AS itemcodeID
FROM MyXMLTable as mxt
CROSS APPLY (VALUES( CAST(REPLACE(mxt.XML_FIELD,'encoding="UTF-8"','encoding="UTF-16"') AS xml) )) v(IMPORTED_XML)
CROSS APPLY IMPORTED_XML.nodes('/header1/header2') AS core(core)
CROSS APPLY core.nodes('OrderFormHeader') AS sales_order(sales_order)
CROSS APPLY core.nodes('OrderFormDetails/OrderFormLines') as base(base)
CROSS APPLY base.nodes('OtherDetails') as sales(sales)
CROSS APPLY sales.nodes('ItemCode') as itmcode(itmcode);
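To see the effect of // in isolation, here is a minimal sketch using a made-up two-line document (the names r, line and d are hypothetical and not from the question). The first query restarts the search from the document root for every line, so each line is paired with every d element in the document; the second uses a relative path, so each line is paired only with its own child.

```sql
-- Hypothetical miniature document: two <line> elements, one <d> each.
DECLARE @x xml = N'<r><line><d>a</d></line><line><d>b</d></line></r>';

-- '//d' ignores the current <line> and matches every <d> in the document,
-- so the rows fan out (each line is combined with both details).
SELECT l.n.value('(d/text())[1]', 'varchar(10)') AS line_detail,
       d.n.value('(text())[1]',   'varchar(10)') AS matched_detail
FROM @x.nodes('/r/line') AS l(n)
CROSS APPLY l.n.nodes('//d') AS d(n);

-- 'd' is relative to the current <line>, so each line matches only its own child.
SELECT l.n.value('(d/text())[1]', 'varchar(10)') AS line_detail,
       d.n.value('(text())[1]',   'varchar(10)') AS matched_detail
FROM @x.nodes('/r/line') AS l(n)
CROSS APPLY l.n.nodes('d') AS d(n);
```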

Retrieving xml attribute using Xquery

I am using the query below to select values from the attributes and elements of an XML file, but I am not able to read the seq, id, and ReportedDate attributes.
Can anyone suggest how to get the attribute values using this query?
select a_node.value('(./text())[1]', 'varchar(50)') AS c_val,
c1_node.value('(./text())[1]', 'varchar(50)') AS c_val2,
ca_node.value('(./text())[1]', 'varchar(50)') AS c_val3,
d_node.value('(./text())[1]', 'varchar(50)') ,
e_node.value('(./text())[1]', 'varchar(50)') ,
f_node.value('(./text())[1]', 'varchar(50)')
FROM @xmlData.nodes('/Reports/x:InquiryResponse/x:ReportData/x:AccountDetails/x:Account') AS b(b_node)
outer APPLY b.b_node.nodes('./x:primarykey') AS pK_InquiryResponse (a_node)
outer APPLY b.b_node.nodes('./x:seq') AS CustomerCode (c1_node)
outer APPLY b.b_node.nodes('./x:id') AS amount (ca_node)
outer APPLY b.b_node.nodes('./x:ReportedDate') AS CustRefField (d_node)
outer APPLY b.b_node.nodes('./x:AccountNumber') AS ReportOrderNO (e_node)
outer apply b.b_node.nodes('./x:CurrentBalance') as additional_id (f_node);
Edit: Xml Snippets Provided in Comments
<sch:Account seq="2" id="345778174" ReportedDate="2014-01-01">
<sch:AccountNumber>TSTC1595</sch:AccountNumber>
<sch:CurrentBalance>0</sch:CurrentBalance>
<sch:Institution>Muthoot Fincorp Limited</sch:Institution>
<sch:PastDueAmount>0</sch:PastDueAmount>
<sch:DisbursedAmount>12000</sch:DisbursedAmount>
<sch:LoanCategory>JOG Group</sch:LoanCategory>
</sch:Account>
<sch:Account seq="2" id="345778174" ReportedDate="2014-01-01">
<sch:BranchIDMFI>THRISSUR ROAD</sch:BranchIDMFI>
<sch:KendraIDMFI>COSTCO/RECENT-107</sch:KendraIDMFI>
</sch:Account>

Parsing XQuery with an Xml Loose @Variable
Assuming an Xml document similar to this (viz with all the attributes on one element):
DECLARE @xmlData XML =
N'<Reports xmlns:x="http://foo">
<x:InquiryResponse>
<x:ReportData>
<x:AccountDetails>
<x:Account x:primarykey="pk" x:seq="sq" x:id="id"
x:ReportedDate="2014-01-01T00:00:00" />
</x:AccountDetails>
</x:ReportData>
</x:InquiryResponse>
</Reports>';
You can scrape the attributes out as follows:
WITH XMLNAMESPACES('http://foo' AS x)
select
Nodes.node.value('(@x:primarykey)[1]', 'varchar(50)') AS c_val,
Nodes.node.value('(@x:seq)[1]', 'varchar(50)') AS c_val2,
Nodes.node.value('(@x:id)[1]', 'varchar(50)') AS c_val3,
Nodes.node.value('(@x:ReportedDate)[1]', 'DATETIME') as someDateTime
FROM
@xmlData.nodes('/Reports/x:InquiryResponse/x:ReportData/x:AccountDetails/x:Account')
AS Nodes(node);
Attributes don't need text() as they are automatically strings
It is fairly unusual to have attributes in a namespace - drop the xmlns alias prefix if they aren't.
Edit - Parsing Xml Column
Namespace dropped from the attributes
Assumed that you have the data in a table, not a variable, hence the APPLY requirement. Note that OUTER APPLY will return nulls, i.e. it is useful only if you have rows with empty Xml or missing Xml elements. CROSS APPLY is the norm (viz applying the xpath to each row selected on the LHS table)
Elements are accessed similarly to attributes, just without @
WITH XMLNAMESPACES('http://foo' AS x)
select
Nodes.node.value('(@seq)[1]', 'varchar(50)') AS c_val2,
Nodes.node.value('(@id)[1]', 'varchar(50)') AS c_val3,
Nodes.node.value('(@ReportedDate)[1]', 'DATETIME') as someDateTime,
Nodes.node.value('(x:AccountNumber)[1]', 'VARCHAR(50)') as accountNumber
FROM
MyXmlData z
CROSS APPLY
z.XmlColumn.nodes('/Reports/x:InquiryResponse/x:ReportData/x:AccountDetails/x:Account')
AS Nodes(node);
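The difference between CROSS APPLY and OUTER APPLY mentioned above can be illustrated with a small sketch. The table variable @t and the element names a and b are hypothetical and exist only for this illustration: row 2 has no b element, so CROSS APPLY drops that row, while OUTER APPLY keeps it with a NULL value.

```sql
-- Hypothetical data: row 1 has a <b> element, row 2 does not.
DECLARE @t TABLE (id int, doc xml);
INSERT INTO @t (id, doc)
VALUES (1, N'<a><b v="1"/></a>'),
       (2, N'<a/>');

-- CROSS APPLY: rows whose XML yields no matching node are filtered out.
SELECT t.id, n.x.value('@v', 'int') AS v
FROM @t AS t
CROSS APPLY t.doc.nodes('/a/b') AS n(x);

-- OUTER APPLY: every row of @t is kept; v is NULL where nothing matched.
SELECT t.id, n.x.value('@v', 'int') AS v
FROM @t AS t
OUTER APPLY t.doc.nodes('/a/b') AS n(x);
```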
Edit Xml File off Disk
Here's the same thing for an xml file read from disk. Note that once you have the data in an XML variable (@MyXmlData) you don't need to CROSS APPLY to anything - just supply xpath to select the appropriate node, and then scrape out the elements and attributes.
DECLARE @MyXmlData XML;
SET @MyXmlData =
( SELECT * FROM OPENROWSET ( BULK N'c:\temp\file3098.xml', SINGLE_CLOB ) AS MyXmlData );
-- Assuming all on the one element, no need for all the applies
-- attributes don't have a text axis (they are automatically strings)
WITH XMLNAMESPACES('http://foo' AS x)
select
Nodes.node.value('(@seq)[1]', 'varchar(50)') AS c_val2,
Nodes.node.value('(@id)[1]', 'varchar(50)') AS c_val3,
Nodes.node.value('(@ReportedDate)[1]', 'DATETIME') as someDateTime,
Nodes.node.value('(x:AccountNumber)[1]', 'VARCHAR(50)') as accountNumber
FROM
@MyXmlData.nodes('/Reports/x:InquiryResponse/x:ReportData/x:AccountDetails/x:Account')
AS Nodes(node);

Converting a string to XML datatype before querying in T-SQL

How do I convert a string into an XML datatype so that I can query the data as XML:
For example (thanks to "mellamokb the Wise" who provided the original SQL for this)
The code below works fine if xmlstring is of the type XML (see DEMO)
select id, name
from Data
cross apply (
select Destination.value('data(@Name)', 'varchar(50)') as name
from [xmlstring].nodes('/Holidays/Summer/Regions/Destinations/Destination') D(Destination)
) Destinations(Name)
However, if xmlString is of type varchar I get an error even though I'm converting the string to XML (DEMO):
select id, name
from Data
cross apply (
select Destination.value('data(@Name)', 'varchar(50)') as name
from CONVERT(xml,[xmlstring]).nodes('/Holidays/Summer/Regions/Destinations/Destination') D(Destination)
) Destinations(Name)

You can do the cast in one extra cross apply.
select id,
T.N.value('@Name', 'varchar(50)') as name
from Data
cross apply (select cast(xmlstring as xml)) as X(X)
cross apply X.X.nodes('/Holidays/Summer/Regions/Destinations/Destination') T(N)
There might be performance issues with casting to XML.

XPath in T-SQL query

I have two tables, XMLtable and filterTable.
I need all the XMLtable.ID values from XMLtable where the data in Col_X contains MyElement, the contents of which matches filterColumn in filterTable.
The XML for each row in Col_X may contain multiple MyElement's, and I want that ID in case ANY of those elements match ANY of the values in filterColumn.
The problem is that those columns are actually of varchar(max) datatype, and the table itself is huge (like 50GB huge). So this query needs to be as optimized as possible.
Here's an example for where I am now, which merely returns the row where the first matching element equals one of the ones I'm looking for. Due to a plethora of different error messages I can't seem to be able to change this to compare to all of the same named elements as I want to.
SELECT ID,
CAST(Col_X AS XML).value('(//*[local-name()=''MyElement''])[1]', N'varchar(25)')
FROM XMLtable
...and then compare the results to filterTable. This already takes 5+ minutes.
What I'm trying to achieve is something like:
SELECT ID
FROM XMLtable
WHERE CAST(Col_X AS XML).query('(//*[local-name()=''MyElement''])')
IN (SELECT filterColumn FROM filterTable)
The only way I can currently achieve this is to use the LIKE operator, which takes like a thousand times longer.
Now, obviously it's not an option to start changing the datatypes of the columns or anything else. This is what I have to work with. :)

Try this:
SELECT
ID,
MyElementValue
FROM
(
SELECT ID, myE.value('(./text())[1]', N'VARCHAR(25)') AS 'MyElementValue'
FROM XMLTable
CROSS APPLY (SELECT CAST(Col_X AS XML)) as X(Col_X)
CROSS APPLY X.Col_X.nodes('(//*[local-name()="MyElement"])') as T2(myE)
) T1
WHERE MyElementValue IN (SELECT filterColumn FROM filterTable)
and this:
SELECT
ID,
MyElementValue
FROM
(
SELECT ID, myE.value('(./text())[1]', N'VARCHAR(25)') AS 'MyElementValue'
FROM XMLTable
CROSS APPLY (SELECT CAST(Col_X AS XML)) as X(Col_X)
CROSS APPLY X.Col_X.nodes('//MyElement') as T2(myE)
) T1
WHERE MyElementValue IN (SELECT filterColumn FROM filterTable)
Update
I think that you are experiencing what is described in "Compute Scalars, Expressions and Execution Plan Performance": the cast to XML is deferred to each call to the value function. The test you should make is to change the datatype of Col_X to XML.
If that is not an option you could query the rows you need from XMLTable into a temporary table that has an XML column and then do the query above against the temporary table without the need to cast to XML.
CREATE TABLE #XMLTable
(
ID int,
Col_X xml
)
INSERT INTO #XMLTable(ID, Col_X)
SELECT ID, Col_X
FROM XMLTable
SELECT
ID,
MyElementValue
FROM
(
SELECT ID, myE.value('(./text())[1]', N'varchar(25)') AS 'MyElementValue'
FROM #XMLTable
CROSS APPLY Col_X.nodes('//MyElement') as T2(myE)
) T1
WHERE MyElementValue IN (SELECT filterColumn FROM filterTable)
DROP TABLE #XMLTable

You could try something like this. It does at least functionally do what you want, I believe. You'll have to explore its performance with your data set empirically.
SELECT ID
FROM
(
SELECT xt.ID, CAST(xt.Col_X AS XML) [content] FROM XMLTable AS xt
) AS src
INNER JOIN FilterTable AS f
ON f.filterColumn IN
(
SELECT
elt.value('.', 'varchar(25)')
FROM src.content.nodes('//MyElement') AS T(elt)
)

I finally got this working, and with far better performance than I expected. Below is the script that finally produced the correct result in 5 - 6 minutes.
SELECT ID, myE.value('.', N'VARCHAR(25)') AS 'MyElementValue'
FROM (SELECT ID, CAST(Col_X AS XML) AS Col_X
FROM XMLTable) T1
CROSS APPLY Col_X.nodes('(//*[local-name()=''MyElement''])') T2(myE)
WHERE myE.value('.', N'varchar(25)') IN (SELECT filterColumn FROM filterTable)
Thanks for the help tho people!

Matching one attribute to another using XPath/XQuery in SQL Server 2008

Consider the XML and SQL:
declare @xml xml = '
<root>
<person id="11272">
<notes for="107">Some notes!</notes>
<item id="107" selected="1" />
</person>
<person id="77812">
<notes for="107"></notes>
<notes for="119">Hello</notes>
<item id="107" selected="0" />
<item id="119" selected="1" />
</person>
</root>'
select Row.Person.value('data(../@id)', 'int') as person_id,
Row.Person.value('data(@id)', 'int') as item_id,
Row.Person.value('data(../notes[@for=data(@id)][1])', 'varchar(max)') as notes
from @xml.nodes('/root/person/item') as Row(Person)
I end up with:
person_id item_id notes
----------- ----------- -------
77812 107 NULL
77812 119 NULL
11272 107 NULL
What I want is for the 'notes' column to be pulled based on the @id attribute of the current item. If I replace [@for=data(@id)] in the selector with [@for=107], of course I get the value Some notes! in the last record. Is it possible to do this with XPath/XQuery, or am I barking up the wrong tree here? I think the problem is that
The XML is a bit awkward, yes, but I can't really change it I'm afraid.
I found one solution that works, but it feels awfully heavy for something like this.
select Item.Person.value('data(../@id)', 'int') as person_id,
Item.Person.value('data(@id)', 'int') as item_id,
Notes.Person.value('text()[1]', 'varchar(max)') as notes
from @xml.nodes('/root/person/item') as Item(Person)
inner join @xml.nodes('/root/person/notes') as Notes(Person) on
Notes.Person.value('data(@for)', 'int') = Item.Person.value('data(@id)', 'int')
and
Notes.Person.value('data(../@id)', 'int') = Item.Person.value('data(../@id)', 'int')
Update!
I figured it out! I'm new at XQuery but this works, so I'm calling it job done :) I changed the query for the notes to:
Item.Person.value('
let $id := data(@id)
return data(../notes[@for=$id])[1]
', 'varchar(max)') as notes
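For reference, here is a sketch of the complete query with that change applied to the sample document from the question. The reason the original predicate found nothing is that inside notes[...] the context node is the notes element, so a bare @id there refers to an id attribute on notes (which does not exist) rather than on the current item. Binding the item's id to a variable first sidesteps this. The let clause requires SQL Server 2008 or later.

```sql
DECLARE @xml xml = N'
<root>
  <person id="11272">
    <notes for="107">Some notes!</notes>
    <item id="107" selected="1" />
  </person>
  <person id="77812">
    <notes for="107"></notes>
    <notes for="119">Hello</notes>
    <item id="107" selected="0" />
    <item id="119" selected="1" />
  </person>
</root>';

SELECT
    Item.Person.value('data(../@id)', 'int') AS person_id,
    Item.Person.value('data(@id)', 'int')    AS item_id,
    -- $id captures the id of the current <item> before the path
    -- steps over to the sibling <notes> elements
    Item.Person.value('
        let $id := data(@id)
        return data(../notes[@for = $id])[1]
    ', 'varchar(max)') AS notes
FROM @xml.nodes('/root/person/item') AS Item(Person);
```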

I would suggest that you do a cross apply instead of doing ../ to find a parent node. According to query plan it is a lot faster.
select P.X.value('data(@id)', 'int') as person_id,
I.X.value('data(@id)', 'int') as item_id,
I.X.value('let $id := data(@id)
return data(../notes[@for=$id])[1]', 'varchar(max)') as notes
from @xml.nodes('/root/person') as P(X)
cross apply P.X.nodes('item') as I(X)
You can even remove the ../ in the FLWOR expression with one extra cross apply, gaining a bit more.
select P.X.value('@id', 'int') as person_id,
TI.id as item_id,
P.X.value('(notes[@for = sql:column("TI.id")])[1]', 'varchar(max)') as notes
from @xml.nodes('/root/person') as P(X)
cross apply P.X.nodes('item') as I(X)
cross apply (select I.X.value('@id', 'int')) as TI(id)
Comparing the queries against each other, I got 67% on your query, 17% on my first and 16% on the second. Note: these figures only give you a hint as to which query will actually be faster in reality. Test them against your data to know for sure.
