SSIS ETL: Xml column into varchar destination - sql-server

What would be the most efficient way to do the following?
Requirement
Copy an xml column into a staging DB, and then from staging to the DWH.
Column types
Source: Xml
Staging DB: Xml or Varchar?
Data warehouse: Varchar(8000)
I need to decide the most performant way to copy an xml column into the staging DB. Is copying the XML from the source into an xml column in the staging DB the best approach, or is XML to Varchar(max) better, considering that millions of rows will be transferred?

If you don't need the XML-specific features and you just want to store the whole document inside a Varchar(8000) column in the data warehouse, it is easier to read the XML as plain text (less validation required, faster).
Note that you can read the file using a Script Task or another component instead of the XML Source.
So it is up to you: if you need to extract XML nodes or use XML functions or properties, use the Xml type; otherwise use Varchar.
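As a minimal sketch of the trade-off (table and column names are made up and the XPath is illustrative), node-level access only comes naturally once the value is typed as xml:
-- Staging as plain text: no validation on insert, just a pass-through
CREATE TABLE dbo.Stage_AsVarchar (payload varchar(8000));

-- Staging as xml: validated on insert and directly queryable
CREATE TABLE dbo.Stage_AsXml (payload xml);

-- Node-level access works directly against the xml column
SELECT payload.value('(/Order/Customer/@id)[1]', 'int') AS customer_id
FROM dbo.Stage_AsXml;

-- With varchar you would have to cast first, paying the conversion cost on every query
SELECT CAST(payload AS xml).value('(/Order/Customer/@id)[1]', 'int') AS customer_id
FROM dbo.Stage_AsVarchar;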

Related

How to Import/export SQL Server tables via XML files

I am receiving data files in XML format from a public agency. The structure of the files and the notations within make me think that what I am seeing is a bulk table dump from SQL Server. Each XML file starts with a schema definition. This is followed by multiple elements, each element looking like it contains one table row of data. The structure of the files makes me think that they were created by some facility within SQL Server itself.
I have looked at the SQL Server docs and online articles, but I cannot find information on how this would be accomplished. I am thinking that if there is a built in facility to export the data in this format, there must be a built in facility to import the data back into SQL Server.
A snippet of one XML file showing the opening schema section and a couple of data elements can be found here;
sample data XML file
Am I correct in thinking that these files are SQL Server dumps? If so, how are they exported and then imported?
You can use the XML functionality of SQL Server. Something like this:
Import
declare @x xml = N'--your xml here'
; with xmlnamespaces('urn:schemas-microsoft-com:sql:SqlRowSet1' as p) --required, as the xml has an xmlns attribute
--insert mytable
select t.v.value('p:pool_idn[1]','int') pool_idn, --note the p: prefix
       t.v.value('p:eff_dte[1]','datetime') eff_dte,
       t.v.value('p:pool_nam[1]','varchar(50)') pool_nam
       --and so on
from @x.nodes('root/p:pool') t(v)
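If the XML is sitting in a file on disk, one way to load it into the @x variable first could look like the following (the file path is illustrative, and the account running the query needs read access to it):
DECLARE @x xml;
SELECT @x = CAST(BulkColumn AS xml)            -- SINGLE_BLOB preserves the file's encoding/BOM
FROM OPENROWSET(BULK N'C:\data\export.xml', SINGLE_BLOB) AS f;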
Export
select * from mytable
for xml auto, elements, root, xmlschema('urn:my:schema')

How can I insert 100k+ rows of XML into SQL Server 2012 in a fast way

I have the following scenario and requirements, and I am not sure of the best way to address them in the fastest way possible. I am looking for guidance on which features to use, and examples of them if available.
I will receive anywhere between 10k and 100k entities (in XML format) from a web service that I want to upsert (some rows might exist, others might not).
Here are some of the requirements:
The source of the XML is a web service that I'm calling from C# code; actually two different methods. For one of the methods, the return schema will be something flat that I can map directly to one of my tables. For the other, it will return an XML representation that I might need to work with in C# in order to map it to flat entities for my tables. In that scenario, would it be best to do the modifications needed and then write the result to an XML file to use as the source?
The returned XML can contain up to 150k entities, which may or may not already exist in my tables, so I'm looking to upsert them. The files, when written to disk, can weigh up to 20 megabytes. I asked if they could do JSON instead of XML, but apparently that's not an option.
The SQL database is on a different server than my IIS server, so I would rather avoid having SQL Server retrieve the XML from a file; I would rather pass it from C# as a string or as a table-valued parameter.
The tables are rather simple and don't have indexes other than the PK ones.
I've never been big on XML, although it got way easier with LINQ to XML. I was initially using it to parse each record and send individual inserts, but the performance was just bad. Based on some research I've been doing, I'm thinking I could use:
Upserts from SQL Server through MERGE statements.
Pass the whole XML as a parameter and use OPENXML as the source in the MERGE statement.
Or, somehow generate a table-valued parameter in C# and pass that to SQL Server to use in the MERGE.
I read in this similar question (which didn't have access to upsert/merge) that instead of trying to upsert directly from the XML, it might be better to insert everything into a temporary table and do the merge/upsert against the temporary table.
Would this work and be considerably fast?
If anyone has had a similar scenario, can you share your thoughts/ideas about what combination of features would be best?
Thanks.
You are on the right track. I have a similar setup using XML to transfer data between an online portal and the client-server application. The rest of the setup is very similar to what you have.
The fact that your tables are not indexed is a bit of a concern if you are matching on any fields that are not PK fields, regardless of how you index the temp tables. It is important to have either one index covering all of the fields used in the MERGE match clause, or an index on each of them - I find the former yields better performance. Beyond that, using an XML parameter, OpenXML and temp tables is the way to go.
The following code has not been tested, so it may need a bit of debugging, but it will put you on the right track. A couple of notes: if all of the fields in the OpenXML WITH clause are attributes, then you can drop the last parameter (i.e. ", 2") and the field source specifiers (e.g. '@id' for the detail table). The data in your description is flat, in which case you will only need one table, but I often need to import into linked records, so I have included a simple master-detail relationship example in the code below for the sake of completeness.
CREATE PROCEDURE usp_ImportFromXML (@data XML) AS
BEGIN
/*
Expected document shape:
<root>
  <data>
    <match_field_1>1</match_field_1>
    <match_field_2>val2</match_field_2>
    <data_1>val3</data_1>
    <data_2>val4</data_2>
    <detail_records>
      <detail_data id="detailID1">
        <detail_1>blah1</detail_1>
        <detail_2>blah2</detail_2>
      </detail_data>
      <detail_data id="detailID2">
        <detail_1>blah3</detail_1>
        <detail_2>blah4</detail_2>
      </detail_data>
    </detail_records>
  </data>
  <data>
    ...
  </data>
</root>
*/
    DECLARE @iDoc INT

    -- Parse the document once; OpenXML reads from the prepared handle
    EXEC sp_xml_preparedocument @iDoc OUTPUT, @data

    -- Shred the master rows into a temp table (flags = 2 -> element-centric mapping)
    SELECT * INTO #temp
    FROM OpenXML(@iDoc, '/root/data', 2) WITH (
        match_field_1 INT,
        match_field_2 VARCHAR(50),
        data_1 VARCHAR(50),
        data_2 VARCHAR(50)
    )

    -- Shred the detail rows, pulling the parent match fields from the ancestor <data> element
    SELECT * INTO #detail
    FROM OpenXML(@iDoc, '/root/data/detail_records/detail_data', 2) WITH (
        match_field_1 INT '../../match_field_1',
        match_field_2 VARCHAR(50) '../../match_field_2',
        detail_id VARCHAR(50) '@id',
        detail_1 VARCHAR(50),
        detail_2 VARCHAR(50)
    )

    EXEC sp_xml_removedocument @iDoc

    -- Index the temp tables on the merge match fields
    CREATE INDEX IX_temp ON #temp(match_field_1, match_field_2)
    CREATE INDEX IX_detail ON #detail(match_field_1, match_field_2, detail_id)

    -- Upsert the master rows
    MERGE data_table a
    USING #temp ta
        ON ta.match_field_1 = a.match_field_1 AND ta.match_field_2 = a.match_field_2
    WHEN MATCHED THEN
        UPDATE SET data_1 = ta.data_1, data_2 = ta.data_2
    WHEN NOT MATCHED THEN
        INSERT (match_field_1, match_field_2, data_1, data_2)
        VALUES (ta.match_field_1, ta.match_field_2, ta.data_1, ta.data_2);

    -- Upsert the detail rows, resolving the parent key via the match fields
    MERGE detail_table a
    USING (SELECT d.*, p._key
           FROM #detail d, data_table p
           WHERE d.match_field_1 = p.match_field_1 AND d.match_field_2 = p.match_field_2) ta
        ON a.id = ta.detail_id AND a.parent_key = ta._key
    WHEN MATCHED THEN
        UPDATE SET detail_1 = ta.detail_1, detail_2 = ta.detail_2
    WHEN NOT MATCHED THEN
        INSERT (parent_key, id, detail_1, detail_2)
        VALUES (ta._key, ta.detail_id, ta.detail_1, ta.detail_2);

    DROP TABLE #temp
    DROP TABLE #detail
END
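A call would then look roughly like this, passing a document shaped like the comment at the top of the procedure:
EXEC usp_ImportFromXML @data = N'<root>
  <data>
    <match_field_1>1</match_field_1>
    <match_field_2>val2</match_field_2>
    <data_1>val3</data_1>
    <data_2>val4</data_2>
    <detail_records>
      <detail_data id="detailID1">
        <detail_1>blah1</detail_1>
        <detail_2>blah2</detail_2>
      </detail_data>
    </detail_records>
  </data>
</root>';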
Use (3). Process the data ready for upsert in C#. C# is made for this kind of algorithmic work; it is both the right tool and the faster one for the job. T-SQL is not the right tool. You do not want to use XML with T-SQL for very high-performance work because it burns CPU like crazy. Instead, use the fast TDS protocol to send a TVP or bulk data.
Then, send the data to the server using either a TVP or a bulk-insert (SqlBulkCopy) to a temp table. The latter technique is great for very many rows (>10k?). Bulk insert uses special TDS features. It does not use SQL batches to transfer the data. It does not get faster than this.
Then use the MERGE statement as you described. Use big batch sizes, potentially all rows in one batch.
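A rough sketch of the SQL side of the TVP route (the table type, procedure, target table, and column names are all made up for illustration; the C# code would fill a DataTable with the parsed entities and pass it as a structured parameter):
-- Hypothetical table type holding the rows already shredded in C#
CREATE TYPE dbo.EntityTableType AS TABLE (
    entity_id INT NOT NULL PRIMARY KEY,   -- keyed so the MERGE match avoids an extra sort
    name      NVARCHAR(100),
    value     NVARCHAR(2000)
);
GO
-- Upsert everything passed in the TVP with one set-based MERGE
CREATE PROCEDURE dbo.usp_UpsertEntities (@rows dbo.EntityTableType READONLY) AS
BEGIN
    MERGE dbo.Entities AS t
    USING @rows AS s
        ON t.entity_id = s.entity_id
    WHEN MATCHED THEN
        UPDATE SET name = s.name, value = s.value
    WHEN NOT MATCHED THEN
        INSERT (entity_id, name, value)
        VALUES (s.entity_id, s.name, s.value);
END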
The best way I've found is to bulk insert into a temp table from your C# code, then issue the merge once the data is in SQL Server. I have an example here on my blog SQL Server Bulk Upsert
I use this in production to insert millions of rows daily, and have yet to find a faster way to do it. Give it a try, I think you will be impressed with the performance of the solution.

Bulk import of XML data into SQL Server

I have a set of xml files that I want to parse the data of and import in to a sql server 2012 database. The provided xml files will be validated against a schema.
I am looking for the best method of doing this. I have found this: http://msdn.microsoft.com/en-us/library/ms171878.aspx
I am wondering if this is the best way or if there are others?
You have several options:
SSIS XML Source. This does not validate against the schema. If you want to detect and properly handle invalid XML files, create a script task to validate the schema in C#.
Parse the XML in a stored procedure.
Insert the entire XML file in one column. Depending on your schema validation requirements, you can use an untyped or typed XML column, or both (a typed-column sketch follows the code below).
Then parse the XML using the xml data type's XQuery methods such as value(). This is actually very fast.
INSERT INTO SomeTable (Column1, Column2, Column3, Column4)
SELECT
    YourXmlColumn.value('(/root/col1)[1]','int'),
    YourXmlColumn.value('(/root/col2)[1]','nvarchar(10)'),
    YourXmlColumn.value('(/root/col3)[1]','nvarchar(2000)'),
    YourXmlColumn.value('(/root/col4)[1]','datetime2(0)')
FROM YourXmlTable
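For the typed-column option mentioned above, a minimal sketch of what that could look like (the schema collection, table, and element names are made up; in practice you would register the XSD the files are validated against):
CREATE XML SCHEMA COLLECTION dbo.ImportFileSchema AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="root">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="col1" type="xs:int" />
          <xs:element name="col2" type="xs:string" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>';
GO
CREATE TABLE dbo.XmlStaging (
    id  INT IDENTITY PRIMARY KEY,
    doc XML(dbo.ImportFileSchema)   -- typed column: inserts failing schema validation are rejected
);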

Slow performance for package with XML destination column

I have done several SSIS packages over the past few months to move data from a legacy database to a SQL Server database. It normally takes 10-20 minutes to process around 5 million records, depending on the transformation.
The issue I am experiencing with one of my package is a very poor performance because one of the columns in my destination is of the SQL Server XML data type.
Data comes in like this: 5
A script creates a Unicode string like this: <XmlData><Value>5</Value></XmlData>
Destination is simply a column with XML data type
This is really slow. Any advice?
I did a SQL trace and noticed that behind the scenes SSIS executes a convert on each row before the insert:
declare @p as xml
set @p=convert(xml,N'<XmlData><Value>5</Value></XmlData>')
Try using a temporary table to store the resulting 5 million records without the XML transformation and then use SQL Server itself to move them from tempDB to the final destination:
INSERT INTO final_destination (...)
SELECT cast(N'<XmlData><Value>5</Value></XmlData>' AS XML) AS batch_converted_xml, col1, col2, colX
FROM #tempTable
If 5,000,000 rows turn out to be too much data for a single batch, you can do it in smaller batches (100k rows should work like a charm).
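A rough sketch of such a batched move (it assumes the staged string sits in a column called xml_string and that #tempTable was loaded with an IDENTITY column row_id; these names, like the destination columns, are made up):
DECLARE @batch_size INT = 100000;
DECLARE @last_id INT = 0, @max_id INT;

SELECT @max_id = MAX(row_id) FROM #tempTable;

WHILE @last_id < @max_id
BEGIN
    -- Move one slice of row_ids, converting to XML set-based rather than row by row
    INSERT INTO final_destination (xml_data, col1, col2)
    SELECT CAST(xml_string AS XML), col1, col2
    FROM #tempTable
    WHERE row_id > @last_id AND row_id <= @last_id + @batch_size;

    SET @last_id = @last_id + @batch_size;
END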
The statements captured by the profiler look like an OLE DB transformation issuing one command per row.

Concatenating rows from different tables into one field

In a project using an MSSQL 2005 database we are required to log all data-manipulating actions in a logging table. One field in that table is supposed to contain the row as it was before it was changed. We have a lot of tables, so I was trying to write a stored procedure that would gather up all the fields in one row of a given table, concatenate them somehow, and then write a new log entry with that information.
I already tried using FOR XML PATH and it worked, but the client doesn't like the XML notation, they want a csv field.
Here's what I had with FOR XML PATH:
DECLARE @foo varchar(max);
SET @foo = (SELECT * FROM table WHERE id = 5775 FOR XML PATH(''));
The values for "table", "id" and the actual id (here: 5775) would later be passed in via the call to the stored procedure.
Is there any way to do this without getting XML notation and without knowing in advance which fields are going to be returned by the SELECT statement?
We used FOR XML PATH and, as you've discovered, it works very well. Since one of SQL Server's features is to store XML properly, the CSV requirement makes little sense to me.
What you could try is a stored proc that reads out the XML in CSV format (fake it). I would. Since you won't likely be reading the data that much compared to saving it, the overhead is negligible.
How about:
Set @Foo = Stuff(
    ( Select ',' + MyCol1 + ',' + MyCol2 ...
      From Table
      Where Id = 5775
      For Xml Path('')
    ), 1, 1, '')
This will produce a CSV line (presuming the inner SQL returns a single row). That solves the second part of your question. As to the first part, "without knowing in advance which fields", there is no way to do this without dynamic SQL, i.e. you have to build the SQL statement as a string on the fly. If you are going to do that, you might as well build the entire CSV result on the fly.
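A rough sketch of that dynamic-SQL route (purely illustrative: it assumes the target table has an id column, and column types that do not convert cleanly to varchar would need extra handling):
DECLARE @table sysname = N'MyTable', @id int = 5775;
DECLARE @cols nvarchar(max), @sql nvarchar(max), @foo varchar(max);

-- Build the concatenation expression from the catalog, so the column list need not be known in advance
SELECT @cols = STUFF((
    SELECT N' + '','' + ISNULL(CONVERT(varchar(max), ' + QUOTENAME(name) + N'), '''')'
    FROM sys.columns
    WHERE object_id = OBJECT_ID(@table)
    ORDER BY column_id
    FOR XML PATH('')), 1, 9, N'');

SET @sql = N'SELECT @out = ' + @cols + N' FROM ' + QUOTENAME(@table) + N' WHERE id = @id';
EXEC sp_executesql @sql, N'@out varchar(max) OUTPUT, @id int', @out = @foo OUTPUT, @id = @id;

SELECT @foo;  -- the CSV line for the requested row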
