Extract data from XML document using t-sql - sql-server

I have been trying to extract data from the following XML document using T-SQL on SQL Server 2019.
XML:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>https://www.URL1.com/1</loc>
<image:image>
<image:loc>https://www.URL1.com/11</image:loc>
</image:image>
<image:image>
<image:loc>https://www.URL1.com/12</image:loc>
</image:image>
<image:image>
<image:loc>https://www.URL1.com/13</image:loc>
</image:image>
</url>
<url>
<loc>https://www.URL1.com/2</loc>
<image:image>
<image:loc>https://www.URL1.com/21</image:loc>
</image:image>
<image:image>
<image:loc>https://www.URL1.com/22</image:loc>
</image:image>
</url>
<url>
<loc>https://www.URL1.com/3</loc>
<image:image>
<image:loc>https://www.URL1.com/32</image:loc>
</image:image>
</url>
</urlset>
I would like to extract the data from the XML document into a SQL Server table. My desired output is shown below.
Desired output:
+------------------------+-------------------------+
| Loc | ImageLoc |
+------------------------+-------------------------+
| https://www.URL1.com/1 | https://www.URL1.com/11 |
| https://www.URL1.com/1 | https://www.URL1.com/12 |
| https://www.URL1.com/1 | https://www.URL1.com/13 |
| https://www.URL1.com/2 | https://www.URL1.com/21 |
| https://www.URL1.com/2 | https://www.URL1.com/22 |
| https://www.URL1.com/3 | https://www.URL1.com/32 |
+------------------------+-------------------------+
My attempts have failed miserably so far. I have tried many things, but the query below is the only one that returned even the Loc element. I have tried using OUTER APPLY/CROSS APPLY to get the ImageLoc, with no luck.
My Attempt:
DECLARE @xml XML
SELECT @xml = BulkColumn
FROM OPENROWSET(BULK 'M:\Files\MyXML.xml', SINGLE_BLOB) x
SELECT
t.c.value('(text())[1]', 'VARCHAR(max)') URLs
, t2.i.value('(text())[1]', 'VARCHAR(max)') URLs
FROM @xml.nodes('*:urlset/*:url/*:loc') t(c)
OUTER APPLY @xml.nodes('*:urlset/*:url/*:loc/*:image/*:loc') t2(i)
Could you please help? Thanks in advance

This answer was posted by lptr in the comments as just a link to a fiddle. As the OP has said that it answers their question, and lptr doesn't wish/respond to posting answers, I have migrated it to the answer section.
Here they use the * wildcard rather than defining the namespace to get the values from the XML:
dbfiddle.uk/...
SELECT
t.c.value('(*:loc/text())[1]', 'VARCHAR(max)') URLs
, t2.i.value('(text())[1]', 'VARCHAR(max)') URLs
FROM @xml.nodes('*:urlset/*:url') t(c)
OUTER APPLY t.c.nodes('*:image/*:loc') t2(i);

You need to define your namespaces in your SQL as well. You can do this by putting WITH XMLNAMESPACES at the start of your query. Once the image namespace is declared there, you can use its prefix in your path expressions and return the values from the nodes:
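-- The URIs must match the xmlns declarations in the document exactly.
WITH XMLNAMESPACES (DEFAULT 'http://www.sitemaps.org/schemas/sitemap/0.9',
'http://www.google.com/schemas/sitemap-image/1.1' AS image)
SELECT u.i.value('(../loc/text())[1]','varchar(500)') AS Loc,
u.i.value('(image:loc/text())[1]','varchar(500)') AS ImageLoc
FROM @xml.nodes('urlset/url/image:image') u(i);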
db<>fiddle
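For readers who want something they can paste and run, here is a minimal self-contained sketch that combines the two answers: it declares both namespaces and shreds per url element. The inline XML is a shortened copy of the question's document, and the second url without any image is a hypothetical addition, included only to show that OUTER APPLY keeps such rows with a NULL ImageLoc.

```sql
-- Shortened sample; the image-less second <url> is a made-up case.
DECLARE @xml XML =
N'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://www.URL1.com/1</loc>
    <image:image><image:loc>https://www.URL1.com/11</image:loc></image:image>
    <image:image><image:loc>https://www.URL1.com/12</image:loc></image:image>
  </url>
  <url>
    <loc>https://www.URL1.com/3</loc>
  </url>
</urlset>';

-- DEFAULT covers the unprefixed elements (urlset, url, loc);
-- the image prefix covers image:image and image:loc.
WITH XMLNAMESPACES (DEFAULT 'http://www.sitemaps.org/schemas/sitemap/0.9',
                    'http://www.google.com/schemas/sitemap-image/1.1' AS image)
SELECT u.c.value('(loc/text())[1]', 'varchar(500)')       AS Loc
     , i.c.value('(image:loc/text())[1]', 'varchar(500)') AS ImageLoc
FROM @xml.nodes('/urlset/url') AS u(c)
OUTER APPLY u.c.nodes('image:image') AS i(c);
```

Switching OUTER APPLY to CROSS APPLY would drop url elements that have no image children.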

Related

T-SQL, how to parse this XML?

I've spent hours trying to parse this XML (bus stop schedule) and produce a recordset with , . Is there a way to convert XML to JSON, which I find is easier to handle?
Anyone willing to help? (Azure SQL Server)
<?xml version="1.0" encoding="UTF-8"?>
<Trias xmlns="http://www.vdv.de/trias" version="1.1">
<ServiceDelivery>
<ResponseTimestamp xmlns="http://www.siri.org.uk/siri">2021-11-25T17:52:12Z</ResponseTimestamp>
<DeliveryPayload>
<StopEventResponse>
<StopEventResult>
<StopEvent>
<ThisCall>
<CallAtStop>
<ServiceDeparture>
<TimetabledTime>2021-11-25T17:53:00Z</TimetabledTime>
<EstimatedTime>2021-11-25T17:53:00Z</EstimatedTime>
</ServiceDeparture>
</CallAtStop>
</ThisCall>
<Service>
<PublishedLineName>
<Text>58</Text>
<Language>de</Language>
</PublishedLineName>
</Service>
</StopEvent>
</StopEventResult>
<StopEventResult>
<StopEvent>
<ThisCall>
<CallAtStop>
<ServiceDeparture>
<TimetabledTime>2021-11-25T17:58:00Z</TimetabledTime>
<EstimatedTime>2021-11-25T17:58:00Z</EstimatedTime>
</ServiceDeparture>
</CallAtStop>
</ThisCall>
<Service>
<PublishedLineName>
<Text>60</Text>
<Language>de</Language>
</PublishedLineName>
</Service>
</StopEvent>
</StopEventResult>
</StopEventResponse>
</DeliveryPayload>
</ServiceDelivery>
</Trias>
A minimal reproducible example was not provided.
So shooting from the hip.
There is no need for manual XML parsing: SQL Server has built-in XQuery support for the XML data type.
The only nuance is that the input XML has namespaces.
The default namespace is declared with the WITH XMLNAMESPACES clause.
Two XQuery methods are used: .nodes() and .value()
SQL
DECLARE @xml XML =
N'<Trias xmlns="http://www.vdv.de/trias" version="1.1">
<ServiceDelivery>
<ResponseTimestamp xmlns="http://www.siri.org.uk/siri">2021-11-25T17:52:12Z</ResponseTimestamp>
<DeliveryPayload>
<StopEventResponse>
<StopEventResult>
<StopEvent>
<ThisCall>
<CallAtStop>
<ServiceDeparture>
<TimetabledTime>2021-11-25T17:53:00Z</TimetabledTime>
<EstimatedTime>2021-11-25T17:53:00Z</EstimatedTime>
</ServiceDeparture>
</CallAtStop>
</ThisCall>
<Service>
<PublishedLineName>
<Text>58</Text>
<Language>de</Language>
</PublishedLineName>
</Service>
</StopEvent>
</StopEventResult>
<StopEventResult>
<StopEvent>
<ThisCall>
<CallAtStop>
<ServiceDeparture>
<TimetabledTime>2021-11-25T17:58:00Z</TimetabledTime>
<EstimatedTime>2021-11-25T17:58:00Z</EstimatedTime>
</ServiceDeparture>
</CallAtStop>
</ThisCall>
<Service>
<PublishedLineName>
<Text>60</Text>
<Language>de</Language>
</PublishedLineName>
</Service>
</StopEvent>
</StopEventResult>
</StopEventResponse>
</DeliveryPayload>
</ServiceDelivery>
</Trias>';
;WITH XMLNAMESPACES(DEFAULT 'http://www.vdv.de/trias')
SELECT c.value('(ThisCall/CallAtStop/ServiceDeparture/TimetabledTime/text())[1]', 'DATETIMEOFFSET(0)') AS TimetabledTime
, c.value('(ThisCall/CallAtStop/ServiceDeparture/EstimatedTime/text())[1]', 'DATETIMEOFFSET(0)') AS EstimatedTime
, c.value('(Service/PublishedLineName/Text/text())[1]', 'VARCHAR(100)') AS [Text]
, c.value('(Service/PublishedLineName/Language/text())[1]', 'CHAR(2)') AS [Language]
FROM @xml.nodes('/Trias/ServiceDelivery/DeliveryPayload/StopEventResponse/StopEventResult/StopEvent') AS t(c);
Output
+----------------------------+----------------------------+------+----------+
| TimetabledTime | EstimatedTime | Text | Language |
+----------------------------+----------------------------+------+----------+
| 2021-11-25 17:53:00 +00:00 | 2021-11-25 17:53:00 +00:00 | 58 | de |
| 2021-11-25 17:58:00 +00:00 | 2021-11-25 17:58:00 +00:00 | 60 | de |
+----------------------------+----------------------------+------+----------+
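The ResponseTimestamp element in this document sits in a second namespace (the siri one), so the default namespace alone will not reach it. A minimal sketch of reading it, assuming a prefix named siri that is chosen here purely for illustration and a trimmed copy of the XML:

```sql
-- Trimmed copy of the document; only the parts needed for this example.
DECLARE @xml XML =
N'<Trias xmlns="http://www.vdv.de/trias" version="1.1">
  <ServiceDelivery>
    <ResponseTimestamp xmlns="http://www.siri.org.uk/siri">2021-11-25T17:52:12Z</ResponseTimestamp>
  </ServiceDelivery>
</Trias>';

-- The prefix name "siri" is arbitrary; only the URI has to match the document.
WITH XMLNAMESPACES (DEFAULT 'http://www.vdv.de/trias',
                    'http://www.siri.org.uk/siri' AS siri)
SELECT @xml.value('(/Trias/ServiceDelivery/siri:ResponseTimestamp/text())[1]',
                  'DATETIMEOFFSET(0)') AS ResponseTimestamp;
```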

Processing XML prolog by SQL Server XML functions

I have a large database table with an XML column. The XML content is a document like the one below:
<?int-dov version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{F8484AF4-73BF-45CA-A524-0D796F244F37}" xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"><ds:schemaRefs><ds:schemaRef ds:uri="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/></ds:schemaRefs></ds:datastoreItem>
I'm looking for a function or a fast way to fetch the value of the standalone attribute in a T-SQL query. When I run the query below:
select XmlContent.query('@standalone') from XmlDocuments
I get this error message:
Msg 2390, Level 16, State 1, Line 4
XQuery [XmlDocuments.XmlContent.query()]: Top-level attribute nodes are not supported
So, I would appreciate it if anybody could give me a solution to this problem.
You can use the processing-instruction() function to get that.
SELECT @xml.value('./processing-instruction("int-dov")[1]','nvarchar(max)')
Result
version="1.0" encoding="UTF-8" standalone="no"
If you want to get just the standalone part, the only way I've found is to construct an XML node from it:
SELECT CAST(
N'<x ' +
@xml.value('./processing-instruction("int-dov")[1]','nvarchar(max)') +
N' />' AS xml).value('x[1]/@standalone','nvarchar(10)')
Result
no
db<>fiddle
Just to complement @Charlieface's answer. All credit goes to him.
SQL
DECLARE @xml XML =
N'<?int-dov version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{F8484AF4-73BF-45CA-A524-0D796F244F37}"
xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml">
<ds:schemaRefs>
<ds:schemaRef ds:uri="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/>
</ds:schemaRefs>
</ds:datastoreItem>';
SELECT col.value('x[1]/@standalone','nvarchar(10)') AS [standalone]
, col.value('x[1]/@encoding','nvarchar(10)') AS [encoding]
, col.value('x[1]/@version','nvarchar(10)') AS [version]
FROM (VALUES(CAST(N'<x ' +
@xml.value('./processing-instruction("int-dov")[1]','nvarchar(max)') +
N' />' AS xml))
) AS tab(col);
Output
+------------+----------+---------+
| standalone | encoding | version |
+------------+----------+---------+
| no | UTF-8 | 1.0 |
+------------+----------+---------+
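Since the question is about a table column rather than a variable, the same technique can be applied per row with CROSS APPLY. This is only a sketch: the table variable @XmlDocuments, its Id column, and the two sample rows are stand-ins for the real XmlDocuments table.

```sql
-- Hypothetical stand-in for the XmlDocuments table from the question.
DECLARE @XmlDocuments TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, XmlContent XML);

INSERT INTO @XmlDocuments (XmlContent) VALUES
(N'<?int-dov version="1.0" encoding="UTF-8" standalone="no"?><root/>'),
(N'<?int-dov version="1.0" encoding="UTF-8" standalone="yes"?><root/>');

-- For each row, wrap the processing-instruction text in a dummy <x /> element
-- so that its pseudo-attributes can be read as real attributes.
SELECT d.Id
     , p.col.value('x[1]/@standalone', 'nvarchar(10)') AS [standalone]
FROM @XmlDocuments AS d
CROSS APPLY (VALUES (CAST(N'<x ' +
        d.XmlContent.value('./processing-instruction("int-dov")[1]', 'nvarchar(max)') +
        N' />' AS xml))) AS p(col);
```

Rows without the processing instruction yield NULL, because the string concatenation and the cast both propagate NULL.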

How do I select the value of an ID element in XPath?

Given an XML with the following set up:
<eCrow.CrowGroup CorrelationID="ec367934-e7bd-4213-b0e5-d149c57eec61" >
<eCrow.01>fu</eCrow.01>
<eCrow.02>bar</eCrow.02>
<eCrow.03 CorrelationID="bfe7d35b-bbc1-4591-8d0d-9d42252039bc" >03003</eCrew.03>
</eCrow.CrowGroup>
How do I manage to get XPath to select the CorrelationID from within the node header: <eCrow.CrowGroup CorrelationID="ec367934-e7bd-4213-b0e5-d149c57eec61" >, NOT the CorrelationID from eCrow.03.
In regards to the link suggestion, I am probably doing something wrong but //eCrew.CrewGroup/*@CorrelationID just selects the entire node.
Please try the following.
As already mentioned in the comments, I had to fix your XML to make it well-formed.
XQuery .value() method gives you the answer.
SQL
DECLARE @xml XML =
N'<eCrow.CrowGroup CorrelationID="ec367934-e7bd-4213-b0e5-d149c57eec61">
<eCrow.01>fu</eCrow.01>
<eCrow.02>bar</eCrow.02>
<eCrow.03 CorrelationID="bfe7d35b-bbc1-4591-8d0d-9d42252039bc">03003</eCrow.03>
</eCrow.CrowGroup>';
SELECT @xml.value('(/eCrow.CrowGroup/eCrow.03/@CorrelationID)[1]', 'VARCHAR(30)') AS CorrelationID
, @xml.value('(/eCrow.CrowGroup/eCrow.03/@CorrelationID)[1]', 'uniqueidentifier') AS CorrelationID2;
Output
+--------------------------------+--------------------------------------+
| CorrelationID | CorrelationID2 |
+--------------------------------+--------------------------------------+
| bfe7d35b-bbc1-4591-8d0d-9d4225 | BFE7D35B-BBC1-4591-8D0D-9D42252039BC |
+--------------------------------+--------------------------------------+
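Note that the query above reads the attribute of the eCrow.03 child, and that VARCHAR(30) cuts the 36-character GUID short. The question asked for the attribute on the group element itself. A minimal sketch of that variant, where the column aliases are illustrative only: the attribute step is attached directly to the parent element in the path.

```sql
DECLARE @xml XML =
N'<eCrow.CrowGroup CorrelationID="ec367934-e7bd-4213-b0e5-d149c57eec61">
  <eCrow.01>fu</eCrow.01>
  <eCrow.02>bar</eCrow.02>
  <eCrow.03 CorrelationID="bfe7d35b-bbc1-4591-8d0d-9d42252039bc">03003</eCrow.03>
</eCrow.CrowGroup>';

-- /eCrow.CrowGroup/@CorrelationID addresses the attribute of the group element;
-- adding the eCrow.03 step addresses the child's attribute instead.
-- VARCHAR(36) is wide enough to hold a full GUID string.
SELECT @xml.value('(/eCrow.CrowGroup/@CorrelationID)[1]', 'VARCHAR(36)')          AS GroupCorrelationID
     , @xml.value('(/eCrow.CrowGroup/eCrow.03/@CorrelationID)[1]', 'VARCHAR(36)') AS ChildCorrelationID;
```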

Selecting data from XML

I'm trying to insert a row based on data extracted from a chunk of XML. Some columns need to be initialized to node values a couple of nodes "deep" in the XML structure.
I can't seem to get the query right. Here's what I've got:
declare @xmlRaw xml = '
<LogEntry>
<SummaryMessage>Something bad happened</SummaryMessage>
<Exception>
<Type>System.ApplicationException</Type>
<Message>A test of the error handling</Message>
</Exception>
</LogEntry>'
select
LogEntryColumn.value('SummaryMessage[1]', 'varchar(10)') as SummaryMessage, -- works fine
LogEntryColumn.query('Exception[1]').value('Message[1]', 'varchar(10)') as ExMessage -- not working
from
@xmlRaw.nodes('LogEntry[1]') as LogEntryTable(LogEntryColumn)
This outputs:
SummaryMessage ExMessage
-------------- ----------
Something NULL
I've tried a raft of variations for the "ExMessage" column query but no joy.
Note that I'm using "LogEntryColumn.query(...).value(...)" because I want to check how that form performs versus something like:
select
LogEntryColumn.value('SummaryMessage[1]', 'varchar(10)') as SummaryMessage, -- works fine
ExceptionEntryColumn.value('Message[1]', 'varchar(10)') as ExMessage -- not working
from
@xmlRaw.nodes('LogEntry[1]') as LogEntryTable(LogEntryColumn)
outer apply @xmlData.nodes('LogEntry[1]/Exception') as ExceptionTable(ExceptionEntryColumn)
Basically I'm wondering if multiple "outer apply" from clauses is better/worse than multiple .query(...) invocations.
Here is what you need.
SQL
DECLARE @xmlRaw XML =
N'<LogEntry>
<SummaryMessage>Something bad happened</SummaryMessage>
<Exception>
<Type>System.ApplicationException</Type>
<Message>A test of the error handling</Message>
</Exception>
</LogEntry>';
SELECT c.value('(SummaryMessage/text())[1]', 'varchar(100)') AS SummaryMessage
, c.value('(Exception/Message/text())[1]', 'varchar(100)') AS ExMessage
FROM @xmlRaw.nodes('/LogEntry') AS t(c);
Output
+------------------------+------------------------------+
| SummaryMessage | ExMessage |
+------------------------+------------------------------+
| Something bad happened | A test of the error handling |
+------------------------+------------------------------+
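As for why the original .query(...).value(...) form returned NULL: .query('Exception[1]') returns a new XML fragment whose top-level element is Exception, so the path passed to the following .value() has to start at Exception rather than at Message. The varchar(10) in the question also truncated the text. A sketch of the chained form with both points addressed, reusing the variable name from the question:

```sql
DECLARE @xmlRaw XML =
N'<LogEntry>
  <SummaryMessage>Something bad happened</SummaryMessage>
  <Exception>
    <Type>System.ApplicationException</Type>
    <Message>A test of the error handling</Message>
  </Exception>
</LogEntry>';

-- The fragment returned by .query() looks like <Exception>...</Exception>,
-- so the .value() path is evaluated from that fragment's root.
SELECT c.query('Exception[1]')
        .value('(Exception/Message/text())[1]', 'varchar(100)') AS ExMessage
FROM @xmlRaw.nodes('/LogEntry') AS t(c);
```

The single .value() call with the full path, as in the answer above, avoids building the intermediate fragment at all.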

Multiple values from XML column

I am trying to figure out how to get multiple values from multiple nodes of an XML field in a table (actually it's XML stored as text).
I've seen several methods that involve declaring the XML as a variable and using it as a table but I don't see how that would work for me. How to Extract data from xml column in sql 2008
I am currently using .value to get some fields but I don't see how to make it work since there can be multiple LX01_AssignedNumber and I need to get all of the ProcedureModifier from each.
SELECT CAST(xmldata as xml).value('declare namespace ns1="http://schemas.microsoft.com/BizTalk/EDI/EDIFACT/2006/EnrichedMessageXML";declare namespace ns0="http://schemas.microsoft.com/BizTalk/EDI/X12/2006";
(/ns1:X12EnrichedMessage/TransactionSet/ns0:X12_00501_837_P/ns0:TS837_2000A_Loop/ns0:TS837_2000B_Loop/ns0:TS837_2300_Loop/ns0:TS837_2400_Loop/ns0:SV1_ProfessionalService/ns0:C003_CompositeMedicalProcedureIdentifier/C00303_ProcedureModifier) [1]', 'varchar(20)') AS RendAttendNPI
FROM EDI_DATA
How do I get all the Line Numbers and all of the Procedure Modifiers from each record?
XML:
<ns1:X12EnrichedMessage xmlns:ns1="http://schemas.microsoft.com/BizTalk/EDI/EDIFACT/2006/EnrichedMessageXML">
...
<TransactionSet>
<!-- ProcessLogID=PLG0005169955 ;ProcessLogDetailID=PLG0005173285 ;EnvID=1;RetryCount=1 -->
<ns0:X12_00501_837_P xmlns:ns0="http://schemas.microsoft.com/BizTalk/EDI/X12/2006">
<ns0:TS837_2000A_Loop xmlns:ns0="http://schemas.microsoft.com/BizTalk/EDI/X12/2006">
<ns0:TS837_2000B_Loop xmlns:ns0="http://schemas.microsoft.com/BizTalk/EDI/X12/2006">
<ns0:TS837_2300_Loop xmlns:ns0="http://schemas.microsoft.com/BizTalk/EDI/X12/2006">
<ns0:TS837_2400_Loop>
<ns0:LX_ServiceLineNumber>
<LX01_AssignedNumber>1</LX01_AssignedNumber>
</ns0:LX_ServiceLineNumber>
<ns0:SV1_ProfessionalService>
<ns0:C003_CompositeMedicalProcedureIdentifier>
<C00301_ProductorServiceIDQualifier>HC</C00301_ProductorServiceIDQualifier>
<C00302_ProcedureCode>26340</C00302_ProcedureCode>
<C00303_ProcedureModifier>AG</C00303_ProcedureModifier>
<C00304_ProcedureModifier>58</C00304_ProcedureModifier>
<C00305_ProcedureModifier>51</C00305_ProcedureModifier>
<C00306_ProcedureModifier>XS</C00306_ProcedureModifier>
</ns0:C003_CompositeMedicalProcedureIdentifier>
<SV102_LineItemChargeAmount>8918</SV102_LineItemChargeAmount>
<SV103_UnitorBasisforMeasurementCode>UN</SV103_UnitorBasisforMeasurementCode>
<SV104_ServiceUnitCount>13</SV104_ServiceUnitCount>
<ns0:C004_CompositeDiagnosisCodePointer>
<C00401_DiagnosisCodePointer>1</C00401_DiagnosisCodePointer>
<C00402_DiagnosisCodePointer>2</C00402_DiagnosisCodePointer>
</ns0:C004_CompositeDiagnosisCodePointer>
</ns0:SV1_ProfessionalService>
<ns0:DTP_SubLoop_2>
<ns0:DTP_Date_ServiceDate>
<DTP01_DateTimeQualifier>472</DTP01_DateTimeQualifier>
<DTP02_DateTimePeriodFormatQualifier>D8</DTP02_DateTimePeriodFormatQualifier>
<DTP03_ServiceDate>20160104</DTP03_ServiceDate>
</ns0:DTP_Date_ServiceDate>
</ns0:DTP_SubLoop_2>
<ns0:REF_SubLoop_7>
<ns0:REF_LineItemControlNumber>
<REF01_ReferenceIdentificationQualifier>6R</REF01_ReferenceIdentificationQualifier>
<REF02_LineItemControlNumber>11453481</REF02_LineItemControlNumber>
</ns0:REF_LineItemControlNumber>
</ns0:REF_SubLoop_7>
</ns0:TS837_2400_Loop>
<ns0:TS837_2400_Loop>
<ns0:LX_ServiceLineNumber>
<LX01_AssignedNumber>2</LX01_AssignedNumber>
</ns0:LX_ServiceLineNumber>
<ns0:SV1_ProfessionalService>
<ns0:C003_CompositeMedicalProcedureIdentifier>
<C00301_ProductorServiceIDQualifier>HC</C00301_ProductorServiceIDQualifier>
<C00302_ProcedureCode>20680</C00302_ProcedureCode>
<C00303_ProcedureModifier>58</C00303_ProcedureModifier>
</ns0:C003_CompositeMedicalProcedureIdentifier>
<SV102_LineItemChargeAmount>1277</SV102_LineItemChargeAmount>
<SV103_UnitorBasisforMeasurementCode>UN</SV103_UnitorBasisforMeasurementCode>
<SV104_ServiceUnitCount>1</SV104_ServiceUnitCount>
<ns0:C004_CompositeDiagnosisCodePointer>
<C00401_DiagnosisCodePointer>3</C00401_DiagnosisCodePointer>
</ns0:C004_CompositeDiagnosisCodePointer>
</ns0:SV1_ProfessionalService>
</ns0:TS837_2400_Loop>
</ns0:TS837_2300_Loop>
</ns0:TS837_2000B_Loop>
</ns0:TS837_2000A_Loop>
</ns0:X12_00501_837_P>
</TransactionSet>
</ns1:X12EnrichedMessage>
Look into SQL Server's CROSS APPLY, which you can use to shred a single XML value into multiple rows. For example:
;WITH XMLNAMESPACES ('http://schemas.microsoft.com/BizTalk/EDI/X12/2006' as ns0
,'http://schemas.microsoft.com/BizTalk/EDI/EDIFACT/2006/EnrichedMessageXML' as ns1)
SELECT
TS837_2400_Loop.value('(.//LX01_AssignedNumber)[1]', 'int') 'line_number'
,C00303_ProcedureModifier.value('.', 'varchar(100)') 'procedure_modifier'
FROM EDI_DATA
CROSS APPLY (select CONVERT(XML, xmldata)) as P(X)
CROSS APPLY X.nodes('.//ns0:TS837_2400_Loop') AS Q(TS837_2400_Loop)
CROSS APPLY TS837_2400_Loop.nodes('.//C00303_ProcedureModifier') AS R(C00303_ProcedureModifier)
sqlfiddle demo
output :
| line_number | procedure_modifier |
|-------------|--------------------|
| 1 | AG |
| 2 | 58 |
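The query above only picks up C00303_ProcedureModifier, while the sample also has C00304 to C00306 modifiers on line 1. One possible way to return all of them is to match child elements by name pattern with local-name(). This is a sketch on a trimmed copy of the XML; the element_name column and the aliases l and m are illustrative additions.

```sql
-- Trimmed copy of one service line from the question's XML.
DECLARE @xml XML =
N'<ns0:TS837_2400_Loop xmlns:ns0="http://schemas.microsoft.com/BizTalk/EDI/X12/2006">
  <ns0:LX_ServiceLineNumber>
    <LX01_AssignedNumber>1</LX01_AssignedNumber>
  </ns0:LX_ServiceLineNumber>
  <ns0:SV1_ProfessionalService>
    <ns0:C003_CompositeMedicalProcedureIdentifier>
      <C00302_ProcedureCode>26340</C00302_ProcedureCode>
      <C00303_ProcedureModifier>AG</C00303_ProcedureModifier>
      <C00304_ProcedureModifier>58</C00304_ProcedureModifier>
    </ns0:C003_CompositeMedicalProcedureIdentifier>
  </ns0:SV1_ProfessionalService>
</ns0:TS837_2400_Loop>';

-- One row per child element whose name contains "ProcedureModifier".
WITH XMLNAMESPACES ('http://schemas.microsoft.com/BizTalk/EDI/X12/2006' AS ns0)
SELECT l.c.value('(ns0:LX_ServiceLineNumber/LX01_AssignedNumber/text())[1]', 'int') AS line_number
     , m.c.value('local-name(.)', 'varchar(50)')  AS element_name
     , m.c.value('(./text())[1]', 'varchar(100)') AS procedure_modifier
FROM @xml.nodes('//ns0:TS837_2400_Loop') AS l(c)
CROSS APPLY l.c.nodes('ns0:SV1_ProfessionalService/ns0:C003_CompositeMedicalProcedureIdentifier/*[contains(local-name(.), "ProcedureModifier")]') AS m(c);
```

Against the EDI_DATA table, the same two nodes() calls would follow the CROSS APPLY that converts xmldata to XML, as in the answer above.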
