I'm trying to use SQL Server Integration Services (SSIS) to parse an XML file and map the elements to database columns across several tables in a single database.
However, when I use the Data Flow Task->XML Source to try and parse an example XML file (that file is located here, XSD is located here), it says
"http://www.exchangenetwork.net/schema/TRI/4:TransferWasteQuantity" has multiple members named "http://www.exchangenetwork.net/schema/TRI/F:WasteQuantityCatastrophicMeasure"
Is there any way to get SSIS to parse XML data such as this? This schema changes regularly so I'd prefer to do as little parsing code outside of the data mappings as possible. Also if there's a better way to do this outside of SSIS (say, by using SQL Server Analysis Services) then that would work too.
Apparently, due to the complexity of the XML, this is not possible with current versions of SQL Server, at least not without significant modification of the incoming XML submission and XSD, which would likely lead to data corruption. The conclusion is that this is not feasible to implement.
Related
We use SharePoint 2013 as a library holding thousands of Excel files, almost none of them consistently formatted, to manage projects occurring on servers. Somewhere in these files, possibly formatted as table objects, is a common set of server names.
Somehow, without being able to change this process in the short term, I need to pull data from all these files to identify how many projects are targeting a particular server.
I've got access to SQL Server 2016 Enterprise, and I'm wondering if something like PolyBase could help with this? I also wonder about SSIS, but I don't expect any one table to look exactly like another.
Other tools may be an option, but I'm not sure what can handle this scale and variety. I think daily updates to the data would be enough, but even so it's still a mess.
How do I pull thousands of varied excel tables into a database? Is this even possible?
Any longer term solution that doesn't allow them to format and annotate like excel is unlikely to actually be adopted.
The less you know in advance, the more difficult it will be...
Some ideas:
Technology
read about OPENROWSET, which lets you read from an Excel file directly in T-SQL (see the sketch after this list)
read about linked servers
Use Excel itself and its considerable VBA abilities to iterate through all your Excel sheets, open them, analyse them and fill proper tables. Within Excel you know the most about your messy data...
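To illustrate the OPENROWSET idea from the first bullet, here is a minimal sketch. It assumes the ACE OLE DB provider is installed on the SQL Server machine and that ad hoc distributed queries are enabled; the file path and sheet name are placeholders:

    -- One-time setup (requires sysadmin): allow ad hoc OPENROWSET queries.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
    RECONFIGURE;

    -- Read one sheet of one workbook as a rowset. The path and 'Sheet1$'
    -- are placeholders; HDR=YES treats the first row as column headers.
    SELECT *
    FROM OPENROWSET(
            'Microsoft.ACE.OLEDB.12.0',
            'Excel 12.0;Database=C:\Data\ProjectPlan.xlsx;HDR=YES',
            'SELECT * FROM [Sheet1$]');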
Target structure
You might create thousands of tables, each representing one single sheet in all your Excel files. You could query these tables with dynamically created SQL (using the metadata in INFORMATION_SCHEMA) or think about Full-Text Search.
You might import each sheet into one single XML structure (SELECT * ... FOR XML PATH('...')). In this case you'd need a target table with columns for the path and name of your Excel file, the name of the sheet, and an XML column for your data. Another approach would be to represent each file as one XML document and include all its sheets there. Try to define common naming for all your data. Querying XML allows you to query columns without knowing their actual names (XQuery with XPath using *); see the sketch after this list.
If your files are xlsx already, they are ZIP archives internally, so you might unzip them and take the existing XML as-is.
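A minimal sketch of the XML-per-sheet idea above; all table, column and server names are invented for illustration:

    -- Hypothetical target table: one row per imported sheet.
    CREATE TABLE dbo.ImportedSheet
    (
        SheetId   int IDENTITY PRIMARY KEY,
        FilePath  nvarchar(400) NOT NULL,
        SheetName nvarchar(128) NOT NULL,
        SheetData xml NOT NULL
    );

    -- Wildcard XQuery: find sheets mentioning a given server without
    -- knowing the actual element (column) names in each sheet's XML.
    SELECT SheetId, FilePath, SheetName
    FROM dbo.ImportedSheet
    WHERE SheetData.exist('//*[. = "SRV-APP-042"]') = 1;

The wildcard is what buys you tolerance of the inconsistent formatting: you search for values, not for names.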
To be honest: I do not think that any tool can do the magic to import such a wide range of mess automatically...
I have a very large XML file (with an xsd file) which has many different objects that need to be imported into SQL Server tables. When I mean objects, I mean the highest level XML wrapping tag e.g. Products, Orders, Locations etc
My current method using SSIS has been to:
Create a Data Flow Task
Set up a new XML source in the data flow task
Drag a new OLE DB Destination into the task
Link the XML output to the input of the OLE DB Destination
Open the OLE DB Destination task and ask it to create a new table for the data to go into
I have to repeat steps 3-5 for all the objects in the XML file, which could run into the hundreds. There's no way that I can do this manually.
Is there any way to get SSIS to just create new tables for all the different objects in SQL Server and import the data into those? So it would automatically create dbo.Products, dbo.Locations, dbo.Customers and put the correct XML data into those tables.
I can't see any other feasible way of doing this.
Is there any way to get SSIS to just create new tables for all the different objects in SQL Server and import the data into those?
No :(
There are really two problems here. You haven't hit the second one yet: SSIS is probably going to choke reading a very large XML file, because the XML Source component loads the entire file into memory when it reads it.
There are a couple of alternatives that I can think of:
- use XSLT transforms
- roll your own SAX parser and use a script component source
For the XSLT method, you would transform each object into a flat file, i.e. parse just your customer data into a CSV format, and then add data flows to read in each flat file. The drawback is that SSIS uses an earlier version of XSLT which also loads the whole file into memory rather than streaming it. However, I have seen this perform very well on files that were 0.5 GB in size. Also, XSLT can be challenging to learn if you are not already familiar with it, but it is very powerful for getting data into a relational form.
The SAX parser method would allow you to stream the file and pull out the parts that you want into a relational form. In a script component, you could direct different objects to different outputs.
I'm looking to query an xml file that's in excess of 80GB (and insert the results into a pre-existing db). This prevents me from simply declaring it as an xml variable and using openrowset. I am NOT looking to use a CLR and would prefer an entirely TSQL approach if possible (looking to do this on SQL Server 2012/Windows Server 2008)
With the 2GB limit on the XML datatype, I realize the obvious approach is to split the file into, say, 1GB pieces. However, it would simply be too messy to be worthwhile (elements in the document are of varying sizes, and not all elements have the same sub-elements; I'm only looking to keep some common elements).
Anyone have a suggestion?
If it is a one-off thing I would use the SQL Import/Export tool. If it's something that will need to be scheduled, create an Integration Services package.
With a simple schema (sorry, should have mentioned it) bulk commands would probably have been the way to go. With something messier, and given the size of the file, a SAX parser seems to provide a better solution. It reads the file without loading it all into memory, and I can pick out only the elements I want without having to worry about the schema at all.
I was trying to understand the basic advantage of using the XML datatype in SQL Server 2005. I went through the article here, which says that if you want to delete multiple records, you can serialize the IDs as XML, send that to the database, and delete them all with a single query (a reconstruction of the pattern is sketched below).
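The article's query isn't reproduced here, but the pattern is commonly along these lines (a hedged reconstruction; the table and element names are made up):

    -- The client serializes the ids to delete into one XML value...
    DECLARE @ids xml = N'<ids><id>3</id><id>7</id><id>12</id></ids>';

    -- ...and the server deletes all matching rows in a single statement
    -- by shredding the XML with nodes()/value().
    DELETE t
    FROM dbo.MyTable AS t                         -- hypothetical table
    JOIN @ids.nodes('/ids/id') AS x(n)
        ON t.Id = x.n.value('.', 'int');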
I was curious to look into any other advantage of using this DataType...
EDIT
Reasons for Storing XML Data in SQL Server 2005
Here are some reasons for using native XML features in SQL Server 2005 as opposed to managing your XML data in the file system:
You want to use administrative functionality of the database server for managing your XML data (for example, backup, recovery and replication).
My Understanding - Can you share some knowledge over it to make it clear?
You want to share, query, and modify your XML data in an efficient and transacted way. Fine-grained data access is important to your application. For example, you may want to insert a new section without replacing your whole document.
My Understanding - The XML lives in one specific row's column, so to add a new section to it an UPDATE is required, meaning the whole document gets rewritten. Right?
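For reference, this is exactly where XML DML comes in: the modify() method expresses the change as a partial, in-place update, so the client never resends the whole document. A minimal sketch with invented table and element names:

    -- Insert a new section into an existing document server-side.
    UPDATE dbo.Documents                          -- hypothetical table
    SET Doc.modify('
        insert <section id="9"><title>New section</title></section>
        as last into (/document)[1]')
    WHERE DocId = 42;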
You want the server to guarantee well-formed data, and optionally validate your data according to XML schemas.
My Understanding - Can you share some knowledge over it to make it clear?
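Roughly: the xml type rejects anything that is not well-formed, and a typed xml column additionally validates against an XML SCHEMA COLLECTION. A minimal sketch (the schema and all names are invented):

    -- Register a schema once...
    CREATE XML SCHEMA COLLECTION dbo.OrderSchema AS
    N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="order">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="qty" type="xs:int"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:schema>';

    -- ...then bind it to a column. Inserts that do not validate fail.
    CREATE TABLE dbo.Orders
    (
        OrderId int IDENTITY PRIMARY KEY,
        Body    xml(dbo.OrderSchema)
    );

    -- Rejected by the server: qty is not an int.
    -- INSERT dbo.Orders (Body) VALUES (N'<order><qty>lots</qty></order>');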
You want indexing of XML data for efficient query processing and good scalability, and the use of a first-rate query optimizer.
My Understanding - The same can be achieved by adding individual columns, so why an XML column?
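Individual columns work when you know the shape up front; XML indexes are the answer when you don't. They shred the XML into an internal node table the optimizer can use. Continuing the illustrative dbo.Orders table from the sketch above:

    -- Primary XML index: persists a shredded form of the column so
    -- XQuery predicates need not re-parse each document on every row.
    CREATE PRIMARY XML INDEX PXML_Orders_Body ON dbo.Orders (Body);

    -- Optional secondary index tuned for path-style predicates.
    CREATE XML INDEX IXML_Orders_Body_Path ON dbo.Orders (Body)
        USING XML INDEX PXML_Orders_Body FOR PATH;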
Pros:
Allows storage of xml data that can be automatically controlled by an xml schema - thereby guaranteeing a certain level of data quality
Many web/desktop apps store data in xml form, these can then be easily stored and queried in the database - so it is a great place to store xml data that an app may need to use (e.g. for configuration settings)
Cons:
Be careful about using xml fields: they may start off as innocent storage but can become a performance nightmare if you want to search, analyse and report on many records.
Also, if xml fields will be added to, changed or deleted, this can be slow and leads to complex T-SQL.
In replication, the whole xml gets updated even if only one node changes - therefore you could have many more conflicts that cannot easily be resolved.
I would say of the 4 advantages you've listed, these two are critical:
You want to share, query, and modify your XML data in an efficient and transacted way
SQL Server stores the XML in an optimised way that it wouldn't for plain strings, and lets you query the XML efficiently, rather than requiring you to bring the entire XML document back to the client. Think how inefficient it is if you want to query 10,000 rows of an XML column, each containing 1KB of data: for a small XPath query you would otherwise need to return roughly 10MB of data across the wire, for each client, each time.
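Concretely, the point is that the XPath evaluates server-side and only the scalars come back (reusing the illustrative dbo.Orders table from the question above; names are placeholders):

    -- Only the matching ids and the extracted qty cross the wire,
    -- not 10,000 full documents.
    SELECT OrderId,
           Body.value('(/order/qty)[1]', 'int') AS Qty
    FROM dbo.Orders
    WHERE Body.exist('/order[qty > 100]') = 1;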
You want indexing of XML data for efficient query processing and good scalability, and the use of a first-rate query optimizer
This ties into what I said above, it's far more efficiently stored than a plain text column which would also run into page fragmentation issues.
We wish to import data into a SQL Server database from a source location located elsewhere in the company WAN in another country.
We are to be using SSIS to perform the import but wonder where would be the best place to perform the extract and transform. We could create a view on the source SQL Server and have SSIS retrieve data directly from it. The alternative would be to drop a file out of the source and have SSIS import the data from that file.
I am thinking the former is a cleaner solution but would be interested to know whether there are any benefits in using files or potential issues with grabbing the data direct?
Thanks
I would avoid using files if possible, especially if your starting point is a database. By extracting to a file, you would be adding an unnecessary layer to the process that would increase the possibility of errors. Typical issues with extracted files include unwittingly using an old / incomplete file (if an extract failed) and masking manual user edits (made directly in the file to patch data issues).
If you have a SQL Server database, then creating a stored procedure or view, or entering SQL into SSIS, would give you a defined interface between the source and SSIS (a minimal sketch follows). Including the transform with the extract does blur the interface a little, but is quite common for simple transformations that do not depend on any target (or secondary source) data for the load.
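As a minimal sketch of such an interface (all object names invented), a view on the source server bakes in the simple transforms, and SSIS just selects from it:

    -- SSIS's source query is simply: SELECT * FROM dbo.vw_CustomerExtract
    CREATE VIEW dbo.vw_CustomerExtract
    AS
    SELECT  c.CustomerId,
            UPPER(c.CountryCode)       AS CountryCode,  -- simple transform
            CONVERT(date, c.CreatedAt) AS CreatedDate
    FROM dbo.Customers AS c
    WHERE c.IsDeleted = 0;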
An issue you may need to consider when grabbing data (with either approach) is the transactional state of data. Depending on your source, you may need to handle data in various states of completeness and act appropriately.