I have over 500,000 XML files stored in a MS SQL database, such as the one below (which has been edited to save space in the question).
<?xml version="1.0"?>
<PROJECTS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<APPLICATION_ID>7000518</APPLICATION_ID>
<ACTIVITY>C06</ACTIVITY>
<ADMINISTERING_IC>RR</ADMINISTERING_IC>
<APPLICATION_TYPE>1</APPLICATION_TYPE>
<BUDGET_START>09/01/2009</BUDGET_START>
<BUDGET_END>09/30/2013</BUDGET_END>
<FULL_PROJECT_NUM>1C06RR020539-01A1</FULL_PROJECT_NUM>
<FY>2009</FY>
<ORG_STATE>CA</ORG_STATE>
<ORG_ZIPCODE>900952000</ORG_ZIPCODE>
<PIS>
<PI>
<PI_NAME>JONES,MARY</PI_NAME>
<PI_ID>9876543</PI_ID>
</PI>
<PI>
<PI_NAME>DOE, JOHN</PI_NAME>
<PI_ID>1234567</PI_ID>
</PI>
</PIS>
<PROJECT_TERMSX>
<TERM>Extramural Activities</TERM>
<TERM>Extramural Research Facilities Construction Project</TERM>
</PROJECT_TERMSX>
<PROJECT_TITLE>The Center for Oral/Research</PROJECT_TITLE>
<SUPPORT_YEAR>1</SUPPORT_YEAR>
</row>
</PROJECTS>
I can search for any of the single nodes using something like:
SELECT nref.value('(APPLICATION_ID)[1]', 'Int') APPLICATION_ID,
nref.value('(ACTIVITY)[1]', 'varchar(3)') ACTIVITY
FROM [XML_2010] cross apply XMLData.nodes('//PROJECTS/row') as R(nref)
WHERE nref.value('(CORE_PROJECT_NUM)[1]', 'varchar(25)') LIKE '%CA187342%'
But how can I find the data (such as APPLICATION_ID, BUDGET_START, etc.) associated with all XML files that have DOE, JOHN as a PI, which is a sub-node of PIS?
Thanks for the help
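For reference, one direct way to filter on the nested PI node, sketched against the same [XML_2010] table and XMLData column used in the query above, is the exist() method with an XQuery predicate:
SELECT nref.value('(APPLICATION_ID)[1]', 'int')       AS APPLICATION_ID,
       nref.value('(BUDGET_START)[1]', 'varchar(10)') AS BUDGET_START
FROM [XML_2010]
CROSS APPLY XMLData.nodes('/PROJECTS/row') AS R(nref)
WHERE nref.exist('PIS/PI[PI_NAME = "DOE, JOHN"]') = 1;
exist() returns 1 for every row whose PIS block contains a PI with that exact PI_NAME, so any of the first-level values can be selected alongside it.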
XML is great for archives and data exchange, but it is the wrong container for data that is actively used, filtered and searched. Therefore I'd strongly suggest transferring all your data into classical, indexed tables like this:
Note: I reduced your XML to a few example elements per level; the rest follows the same approach and is up to you. The declared table variable mocks up a test scenario:
DECLARE @YourTable TABLE(ID INT IDENTITY,YourXml XML);
INSERT INTO @YourTable VALUES
('<PROJECTS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<APPLICATION_ID>7000518</APPLICATION_ID>
<ACTIVITY>C06</ACTIVITY>
<!-- more first level elements like above -->
<!-- Here there are multiple PIs -->
<PIS>
<PI>
<PI_NAME>JONES,MARY</PI_NAME>
<PI_ID>9876543</PI_ID>
</PI>
<PI>
<PI_NAME>DOE, JOHN</PI_NAME>
<PI_ID>1234567</PI_ID>
</PI>
</PIS>
<!-- Here there are multiple PROJECT_TERMS -->
<PROJECT_TERMSX>
<TERM>Extramural Activities</TERM>
<TERM>Extramural Research Facilities Construction Project</TERM>
</PROJECT_TERMSX>
<!-- These are normal first level elements again -->
<PROJECT_TITLE>The Center for Oral/Research</PROJECT_TITLE>
<SUPPORT_YEAR>1</SUPPORT_YEAR>
</row>
</PROJECTS>');
--This SELECT reads all first-level-data together with the partial XMLs into a temp table #Projects:
SELECT r.value('(APPLICATION_ID/text())[1]','bigint') AS APPLICATION_ID
,r.value('(ACTIVITY/text())[1]','nvarchar(max)') AS ACTIVITY
--more columns like above
,r.query('PIS') AS AllPis
,r.query('PROJECT_TERMSX') AS AllProjectTerms
--more first level columns
INTO #Projects
FROM @YourTable AS t
OUTER APPLY t.YourXml.nodes('/PROJECTS/row') AS A(r);
--This SELECT reads from #Projects and stores all related PI-data in another temp table #PIs
SELECT APPLICATION_ID
,p.value('(PI_ID/text())[1]','bigint') AS PI_ID
,p.value('(PI_NAME/text())[1]','nvarchar(max)') AS PI_NAME
INTO #PIs
FROM #Projects AS prj
OUTER APPLY prj.AllPis.nodes('PIS/PI') AS A(p);
--Same with #Terms
SELECT APPLICATION_ID
,t.value('(./text())[1]','nvarchar(max)') AS TERM
INTO #Terms
FROM #Projects AS p
OUTER APPLY p.AllProjectTerms.nodes('PROJECT_TERMSX/TERM') AS A(t);
--This is now the content of your temp tables
SELECT * FROM #Projects;
SELECT * FROM #PIs;
SELECT * FROM #Terms;
--Clean up
GO
DROP TABLE #Projects;
DROP TABLE #PIs;
DROP TABLE #Terms;
Before the clean-up you would add some code that writes your data out of these staging tables into real tables. The IDs that define the relations are stored together with the data, so this should be easy. You will need INSERT INTO or MERGE, depending on whether you have to deal with already existing data.
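As a minimal sketch of that load step (assuming a permanent target table dbo.Projects with matching columns already exists; it is not part of the script above):
--Load new first-level rows from the staging table into the real table
INSERT INTO dbo.Projects (APPLICATION_ID, ACTIVITY)
SELECT p.APPLICATION_ID, p.ACTIVITY
FROM #Projects AS p
WHERE NOT EXISTS (SELECT 1 FROM dbo.Projects AS d
                  WHERE d.APPLICATION_ID = p.APPLICATION_ID);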
Hint
You might think about an m:n relation between projects and PIs, and between projects and terms. For this you'd create a separate PI table and a separate Term table with a mapping table in between (holding the application_id and the second id, both as foreign keys).
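A minimal sketch of such a mapping table (dbo.Project and dbo.PI are assumed target tables, not shown above):
CREATE TABLE dbo.ProjectPI
(
    APPLICATION_ID bigint NOT NULL REFERENCES dbo.Project (APPLICATION_ID),
    PI_ID          bigint NOT NULL REFERENCES dbo.PI (PI_ID),
    PRIMARY KEY (APPLICATION_ID, PI_ID) --one row per project/PI pair
);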
I'm trying to parse XML data in SQL Server. I have an XML column in a table; the XML stored in it can vary by type, but all the types inherit from the same base type.
Row 1: has XML like so:
<Form>
<TaskType>1</TaskType>
<!-- Other Properties ... -->
</Form>
Row 2: has XML like so:
<License>
<TaskType>2</TaskType>
<!-- Other Properties ... -->
</License>
Normally I might parse XML with this T-SQL code snippet:
SELECT
xmlData.A.value('.', 'INT') AS Animal
FROM
@XMLToParse.nodes('License/TaskType') xmlData(A)
This doesn't work in a view, though, since I'm dependent on the root element's name to find the node.
How can I always find the TaskType XML element in my XML content?
Please try the following solution.
XPath uses the asterisk (*) as a wildcard:
http://www.tizag.com/xmlTutorial/xpathwildcard.php
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, xmldata XML);
INSERT @tbl (xmldata) VALUES
(N'<Form>
<TaskType>1</TaskType>
<TaskName>Clone</TaskName>
<!--Other XML elements-->
</Form>'),
(N'<License>
<TaskType>2</TaskType>
<TaskName>Copy</TaskName>
<!--Other XML elements-->
</License>');
-- DDL and sample data population, end
SELECT ID
, c.value('(TaskType/text())[1]', 'INT') AS TaskType
, c.value('(TaskName/text())[1]', 'VARCHAR(20)') AS TaskName
FROM @tbl
CROSS APPLY xmldata.nodes('/*') AS t(c);
Output

ID  TaskType  TaskName
--  --------  --------
1   1         Clone
2   2         Copy
Apparently you can just iterate the nodes like so without being aware of their name:
SELECT xmlData.A.value('.', 'INT') AS Animal
FROM @XMLToParse.nodes('node()/TaskType') xmlData(A)
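An equivalent form using the XPath wildcard from the other answer (a sketch, assuming @XMLToParse is an XML variable as above) would be:
SELECT xmlData.A.value('.', 'INT') AS TaskType
FROM @XMLToParse.nodes('/*/TaskType') xmlData(A)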
I posted this question
INSERT Statement Expensive Queries on Activity Monitor
As you will see the XML structure has different levels.
I have created different tables
Organisation = organisation_id (PRIMARY_KEY)
Contacts = organisation_id (FOREIGN_KEY)
Roles = organisation_id (FOREIGN_KEY)
Rels = organisation_id (FOREIGN_KEY)
Succs = organisation_id (FOREIGN_KEY)
What I want is to generate the organisation_id and do the insert on each table in a cascading manner. At the moment the process takes almost 2 hours for 300k records. I have 3 approaches:
Convert the XML to a list object, batch it (1,000 rows), and send it as JSON text to a stored procedure that uses OPENJSON (sketched below).
Convert the XML to a list object, batch it (1,000 rows), save each batch as a JSON file that SQL Server can read, and pass the file path to a stored procedure which then opens the JSON file using OPENROWSET and OPENJSON.
Send the path to the XML file to a stored procedure, then use OPENROWSET and OPENXML.
All processes (1-3) insert the data into a flat temp table and then iterate over each row, calling a different INSERT stored procedure for each table. Approach #3 seems to fail with errors on 300k records but works on 4 records.
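For reference, a minimal sketch of the OPENJSON step from approach 1 (SQL Server 2016+; the JSON shape and property names here are assumptions, not taken from the real batches):
DECLARE @json nvarchar(max) = N'[
  {"name":"Org A","contact_type":"tel","contact_value":"01234 567890"},
  {"name":"Org B","contact_type":"email","contact_value":"info@example.org"}
]';

--Shred the JSON batch into rows that a stored procedure could insert
SELECT j.name, j.contact_type, j.contact_value
FROM OPENJSON(@json)
     WITH (name          nvarchar(20),
           contact_type  nvarchar(50),
           contact_value nvarchar(50)) AS j;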
The other question is: will it be much faster if I use a physical table instead of a temp table?
-------UPDATE-------
As explained in the linked question, I was using a WHILE loop. Someone suggested in the comments doing a batch insert on each of the tables. The problem is that, for Contacts for example, I can only do this if I know the organisation_id:
select
organisation_id = IDENTITY( bigint ) -- IF I CAN GENERATE THE ACTUAL ORGANISATION ID
,name = Col.value('.','nvarchar(20)')
,contact_type = c.value('(./@type)[1]','nvarchar(50)')
,contact_value= c.value('(./@value)[1]','nvarchar(50)')
into
#temporganisations
from
@xml.nodes('ns1:OrgRefData/Organisations/Organisation') as Orgs(Col)
outer apply Orgs.Col.nodes('Contacts/Contact') as Cs(c)
Then when I do the batch insert
insert into contacts
(
organisation_id,type,value
)
select
torg.organisation_id -- if this is the actual id then perfect
,torg.type
,torg.value
from #temporganisations torg
I would suggest that you shred the XML client-side and switch over to doing some kind of bulk copy; this will generally perform much better.
At the moment, you cannot do a normal bcp or SqlBulkCopy, because you also need the foreign key. You need a way to uniquely identify Organisation within the batch, and you say that is difficult owing to the number of columns needed for that.
Instead, you need to generate some kind of unique ID client-side, an incrementing integer will do. You then assign this ID to the child objects as you are shredding the XML into Datatables / IEnumerables / CSV files.
You have two options:
The easiest in many respects is to not use IDENTITY for OrganisationId and just directly insert your generated ID. This means you can leverage standard SqlBulkCopy procedures.
The downside is that you lose the benefit of automatic IDENTITY assignment, but you could instead just use the SqlBulkCopyOptions.KeepIdentity option which only applies to this insert, and carry on with IDENTITY for other inserts. You would need to estimate a correct batch of IDs that won't clash.
A variation on this is to use GUIDs, which are always unique. I don't really recommend this option.
If you don't want to do this, then it gets quite a bit more complex.
You need to define equivalent Table Types for each of the tables. Each has a column for the temporary primary key of the Organisation
CREATE TYPE OrganisationType AS TABLE
(TempOrganisationID int PRIMARY KEY,
SomeData varchar...
Pass through the shredded XML as Table-Valued Parameters. You would have @Organisations, @Contacts etc.
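For example, a sketch of filling and passing one such TVP from T-SQL (dbo.ImportOrganisations is a hypothetical procedure that takes the type as a READONLY parameter; the client would normally fill the parameter directly):
DECLARE @Organisations OrganisationType;

INSERT INTO @Organisations (TempOrganisationID, SomeData)
VALUES (1, 'First organisation'),
       (2, 'Second organisation');

EXEC dbo.ImportOrganisations @Organisations = @Organisations; --hypothetical procedure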
Then you would have SQL along the following lines:
-- This stores the real IDs
DECLARE @OrganisationIDs TABLE
(TempOrganisationID int PRIMARY KEY, OrganisationId int NOT NULL);
-- We need a hack to get OUTPUT to work with non-inserted columns, so we use a weird MERGE
MERGE INTO Organisation t
USING @Organisations s
ON 1 = 0 -- never match
WHEN NOT MATCHED THEN
INSERT (SomeData, ...)
VALUES (s.SomeData, ...)
OUTPUT
s.TempOrganisationID, inserted.OrganisationID
INTO @OrganisationIDs
(TempOrganisationID, OrganisationID);
-- We now have each TempOrganisationID matched up with a real OrganisationID
-- Now we can insert the child tables
INSERT Contact
(OrganisationID, [Type], [Value]...)
SELECT o.OrganisationID, c.[Type], c.[Value]
FROM @Contacts c
JOIN @OrganisationIDs o ON o.TempOrganisationID = c.TempOrganisationID;
-- and so on for all the child tables
Instead of saving the IDs to a table variable, you could stream the OUTPUT back to the client, have the client join the IDs to the child tables, and then bulk copy them back again as part of the child tables.
This makes the SQL simpler; however, you still need the MERGE, and you risk complicating the client code significantly.
You can try to use the following conceptual example.
SQL
-- DDL and sample data population, start
USE tempdb;
GO
DROP TABLE IF EXISTS #city;
DROP TABLE IF EXISTS #state;
-- parent table
CREATE TABLE #state (
stateID INT IDENTITY PRIMARY KEY,
stateName VARCHAR(30),
abbr CHAR(2),
capital VARCHAR(30)
);
-- child table (1-to-many)
CREATE TABLE #city (
cityID INT IDENTITY,
stateID INT NOT NULL FOREIGN KEY REFERENCES #state(stateID),
city VARCHAR(30),
[population] INT,
PRIMARY KEY (cityID, stateID, city)
);
-- mapping table to preserve IDENTITY ids
DECLARE @idmapping TABLE (GeneratedID INT PRIMARY KEY,
NaturalID VARCHAR(20) NOT NULL UNIQUE);
DECLARE @xml XML =
N'<root>
<state>
<StateName>Florida</StateName>
<Abbr>FL</Abbr>
<Capital>Tallahassee</Capital>
<cities>
<city>
<city>Miami</city>
<population>470194</population>
</city>
<city>
<city>Orlando</city>
<population>285713</population>
</city>
</cities>
</state>
<state>
<StateName>Texas</StateName>
<Abbr>TX</Abbr>
<Capital>Austin</Capital>
<cities>
<city>
<city>Houston</city>
<population>2100263</population>
</city>
<city>
<city>Dallas</city>
<population>5560892</population>
</city>
</cities>
</state>
</root>';
-- DDL and sample data population, end
;WITH rs AS
(
SELECT stateName = p.value('(StateName/text())[1]', 'VARCHAR(30)'),
abbr = p.value('(Abbr/text())[1]', 'CHAR(2)'),
capital = p.value('(Capital/text())[1]', 'VARCHAR(30)')
FROM @xml.nodes('/root/state') AS t(p)
)
MERGE #state AS o
USING rs ON 1 = 0
WHEN NOT MATCHED THEN
INSERT(stateName, abbr, capital)
VALUES(rs.stateName, rs.Abbr, rs.Capital)
OUTPUT inserted.stateID, rs.stateName
INTO #idmapping (GeneratedID, NaturalID);
;WITH Details AS
(
SELECT NaturalID = p.value('(StateName/text())[1]', 'VARCHAR(30)'),
city = c.value('(city/text())[1]', 'VARCHAR(30)'),
[population] = c.value('(population/text())[1]', 'INT')
FROM @xml.nodes('/root/state') AS A(p) -- parent
CROSS APPLY A.p.nodes('cities/city') AS B(c) -- child
)
INSERT #city (stateID, city, [Population])
SELECT m.GeneratedID, d.city, d.[Population]
FROM Details AS d
INNER JOIN #idmapping AS m ON d.NaturalID = m.NaturalID;
-- test
SELECT * FROM #state;
SELECT * FROM @idmapping;
SELECT * FROM #city;
As per my requirement, I have to find out which tables and columns contain a value like xyz@test.com. The database is very large, with more than 2,500 tables.
Can anyone please suggest an optimal way to find this type of value in the database? I've created a loop query which took more than 9 hours to run.
9 hours is clearly a long time. Furthermore, 2,500 tables seems close to insanity to me.
Here is one approach that will run one query per table, not one per column. Now, I have no idea how this will perform against 2,500 tables; I suspect it may be horrible. That said, I would strongly suggest a test filter first, like Table_Name like 'OD%'.
Example
Declare @Search varchar(max) = 'cappelletti' -- Exact match '"cappelletti"'
Create Table #Temp (TableName varchar(500),RecordData xml)
Declare @SQL varchar(max) = ''
Select @SQL = @SQL+ ';Insert Into #Temp Select TableName='''+concat(quotename(Table_Schema),'.',quotename(table_name))+''',RecordData = (Select A.* for XML RAW) From '+concat(quotename(Table_Schema),'.',quotename(table_name))+' A Where (Select A.* for XML RAW) like ''%'+@Search+'%'''+char(10)
From INFORMATION_SCHEMA.Tables
Where Table_Type ='BASE TABLE'
and Table_Name like 'OD%' -- **** Would REALLY Recommend a REASONABLE Filter *** --
Exec(@SQL)
Select A.TableName
,B.*
,A.RecordData
From #Temp A
Cross Apply (
Select ColumnName = a.value('local-name(.)','varchar(100)')
,Value = a.value('.','varchar(max)')
From A.RecordData.nodes('/row') as C1(n)
Cross Apply C1.n.nodes('./@*') as C2(a)
Where a.value('.','varchar(max)') Like '%'+@Search+'%'
) B
Drop Table #Temp
If it helps, the individual generated queries would look like this:
Select TableName='[dbo].[OD]'
,RecordData= (Select A.* for XML RAW)
From [dbo].[OD] A
Where (Select A.* for XML RAW) like '%cappelletti%'
On a side-note, you can search numeric data and even dates.
Make a procedure that gathers every VARCHAR column together with its table name from the system tables and stores them in a temp table. Then build one dynamic query that loops over each record, comparing the column against the email-address input parameter with an = condition. Whenever the condition is matched (checked with an IF EXISTS statement), store that table name and column name in another temp table, and retrieve the list of matches from that temp table at the end of the execution.
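A rough sketch of that loop (assuming an exact-match comparison and only string columns; variable names and sizes are illustrative):
DECLARE @SearchValue nvarchar(256) = N'xyz@test.com';
DECLARE @Matches TABLE (TableName nvarchar(300), ColumnName sysname);

DECLARE @Schema sysname, @Table sysname, @Column sysname,
        @Sql nvarchar(max), @Hit bit;

DECLARE col_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS AS c
    JOIN INFORMATION_SCHEMA.TABLES AS t
      ON t.TABLE_SCHEMA = c.TABLE_SCHEMA
     AND t.TABLE_NAME   = c.TABLE_NAME
    WHERE t.TABLE_TYPE = 'BASE TABLE'
      AND c.DATA_TYPE IN ('varchar', 'nvarchar', 'char', 'nchar');

OPEN col_cursor;
FETCH NEXT FROM col_cursor INTO @Schema, @Table, @Column;

WHILE @@FETCH_STATUS = 0
BEGIN
    --Build one IF EXISTS style check per column
    SET @Sql = N'SELECT @Hit = CASE WHEN EXISTS (SELECT 1 FROM '
             + QUOTENAME(@Schema) + N'.' + QUOTENAME(@Table)
             + N' WHERE ' + QUOTENAME(@Column) + N' = @Val) THEN 1 ELSE 0 END;';

    EXEC sp_executesql @Sql,
                       N'@Val nvarchar(256), @Hit bit OUTPUT',
                       @Val = @SearchValue, @Hit = @Hit OUTPUT;

    IF @Hit = 1
        INSERT INTO @Matches (TableName, ColumnName)
        VALUES (QUOTENAME(@Schema) + N'.' + QUOTENAME(@Table), @Column);

    FETCH NEXT FROM col_cursor INTO @Schema, @Table, @Column;
END

CLOSE col_cursor;
DEALLOCATE col_cursor;

--List of tables and columns that contain the value
SELECT TableName, ColumnName FROM @Matches;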
In my SQL Server DB I have a table with an XML column. The XML that goes in it is like the sample below:
<Rows>
<Row>
<Name>John</Name>
</Row>
<Row>
<Name>Debbie</Name>
</Row>
<Row>
<Name>Annie</Name>
</Row>
<Row>
<Name>John</Name>
</Row>
</Rows>
I have a requirement to find all rows where the XML data has duplicate <Name> entries. For example, above we have 'John' twice in the XML.
I can use the exist() XML method to find one occurrence, but how can I tell if there is more than one? Thanks.
To identify any table row that has duplicate <Name> values in its XML, you can use exist as well:
exist('//Name[. = preceding::Name]')
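Used as a row filter it looks like this (MyTable and MyXmlColumn are placeholder names, matching the query below):
SELECT t.*
FROM MyTable AS t
WHERE t.MyXmlColumn.exist('//Name[. = preceding::Name]') = 1;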
To identify which names are the duplicates, you need nodes() and CROSS APPLY:
SELECT
t.id,
x.Name.value('.', 'varchar(100)') AS DuplicateName
FROM
MyTable t
CROSS APPLY t.MyXmlColumn.nodes('//Name[. = preceding::Name]') AS x(Name)
WHERE
t.MyXmlColumn.exist('//Name[. = preceding::Name]') = 1
Try this:
;with cte as
(SELECT tbl.col.value('.[1]', 'varchar(100)') as name
FROM yourtable
CROSS APPLY xmlcol.nodes('/Rows/Row/Name') as tbl(col))
select name
from cte
group by name
having count(name) > 1
We first use the nodes() function to convert the XML to relational data, then use value() to get the text inside the Name node. We then put the result of the previous step into a CTE, and use a simple GROUP BY to get the values with multiple occurrences.
Not sure if this approach makes for poor performance down the track, but it at least feels like "a better way" right now.
What I am trying to do is this:
I have a table called CONTACTS which amongst other things has a primary key field called memberID
I also have an XML field which contains the IDs of your friends (for example), like:
<root><id>2</id><id>6</id><id>14</id></root>
So what I am trying to do via a stored proc is pass in, say, your member ID, and return all of your friends' info, for example:
select name, address, age, dob from contacts
where id... xml join stuff...
The previous way I had it working (well sort of!) selected all the XML nodes (/root/id) into a temp table, and then did a join from that temp table to the contact table to get the contact fields...
Any help much appreciated. I'm just a bit overloaded by all the .query()/.nodes() examples, and unsure which is the better way of doing this.
THANKS IN ADVANCE!
<-- EDIT -->
I did get something working, but it looks like a SQL Frankenstein statement!
Basically I needed to get the friends' contact IDs from the XML field and populate them into a temp table like so:
Declare @contactIDtable TABLE (ID int)
INSERT INTO @contactIDtable (ID)
SELECT CONVERT(INT,CAST(T2.memID.query('.') AS varchar(100))) AS friendsID
FROM dbo.members
CROSS APPLY memberContacts.nodes('/root/id/text()') AS T2(memID)
But crikey! The CONVERT/CAST thing looks serious, as I need to get an INT for the next bit, which is the actual join to return the contact data, as follows:
SELECT memberID, memberName, memberAddress1
FROM members
INNER JOIN @contactIDtable cid
ON members.memberID = cid.ID
ORDER BY memberName
RESULT...
Well, it works. In my case, my memberContacts XML field had 3 nodes (ids in this case), and the above query returned 3 rows of data (memberID, memberName, memberAddress1).
The whole point of this, of course, was to try to avoid creating a many-to-many join table (i.e. a list of all my friends' IDs). I'm just not sure whether the above actually makes this quicker and easier.
Any more ideas / more efficient ways of trying to do this?
SQL Server's syntax for reading XML is one of the least intuitive around. Ideally, you'd want to:
select f.name
from friends f
join @xml x
on x.id = f.id
Instead, SQL Server requires you to spell out everything. To turn an XML variable or column into a "rowset", you have to spell out the exact path and think up two aliases:
@xml.nodes('/root/id') as table_alias(column_alias)
Now you have to explain to SQL Server how to turn <id>1</id> into an int:
table_alias.column_alias.value('.', 'int')
So you can see why most people prefer to decode XML on the client side :)
A full example:
declare @friends table (id int, name varchar(50))
insert @friends (id, name)
select 2, 'Locke Lamorra'
union all select 6, 'Calo Sanzo'
union all select 10, 'Galdo Sanzo'
union all select 14, 'Jean Tannen'
declare @xml xml
set @xml = ' <root><id>2</id><id>6</id><id>14</id></root>'
select f.name
from @xml.nodes('/root/id') as table_alias(column_alias)
join @friends f
on table_alias.column_alias.value('.', 'int') = f.id
In order to get your XML contents as rows from a "pseudo-table", you need to use the .nodes() method on the XML column - something like:
DECLARE @xmlfield XML
SET #xmlfield = '<root><id>2</id><id>6</id><id>14</id></root>'
SELECT
ROOT.ID.value('(.)[1]', 'int')
FROM
@xmlfield.nodes('/root/id') AS ROOT(ID)
and then you can join this pseudo-table against your Contacts table:
SELECT
(list of fields)
FROM
dbo.Contacts c
INNER JOIN
@xmlfield.nodes('/root/id') AS ROOT(ID) ON c.ID = Root.ID.value('(.)[1]', 'INT')
Basically, the .nodes() defines a pseudo-table ROOT with a single column ID, that will contain one row for each node in the XPath expression, and the .value() selects a specific value out of that XML fragment.
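Putting this together as the stored procedure described in the question (a sketch: the members table and the memberID / memberName / memberAddress1 / memberContacts names come from the question, while the procedure name is made up):
CREATE PROCEDURE dbo.GetMemberFriends
    @memberID int
AS
BEGIN
    SET NOCOUNT ON;

    --Shred the caller's memberContacts XML and join straight back to members
    SELECT f.memberID, f.memberName, f.memberAddress1
    FROM dbo.members AS m
    CROSS APPLY m.memberContacts.nodes('/root/id') AS T(id)
    INNER JOIN dbo.members AS f
            ON f.memberID = T.id.value('.', 'int')
    WHERE m.memberID = @memberID
    ORDER BY f.memberName;
END;
This avoids both the temp table and the CONVERT/CAST step from the question's version.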