query/shred large xml document into sql server - sql-server

I'm looking to query an XML file that's in excess of 80 GB (and insert the results into a pre-existing database). The size prevents me from simply declaring it as an XML variable and using OPENROWSET. I am NOT looking to use the CLR and would prefer an entirely T-SQL approach if possible (looking to do this on SQL Server 2012/Windows Server 2008).
With the 2 GB limit on the XML data type, I realize the obvious approach is to split the file into, say, 1 GB pieces. However, that would simply be too messy to be worthwhile (elements in the document are of varying sizes and not all elements have the same sub-elements; I'm only looking to keep some common elements).
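For reference, this is the pattern the 2 GB limit rules out for the whole file; it only works when each piece fits in an xml variable. The file path, element names, and dbo.TargetTable below are placeholders:

    -- Works only for files (or split pieces) under the 2 GB xml limit.
    DECLARE @doc xml;

    SELECT @doc = CONVERT(xml, BulkColumn)
    FROM OPENROWSET(BULK 'C:\xml\piece01.xml', SINGLE_BLOB) AS src;

    -- Shred the common elements into a pre-existing table.
    INSERT INTO dbo.TargetTable (Id, Name)
    SELECT n.value('(Id/text())[1]',   'int'),
           n.value('(Name/text())[1]', 'nvarchar(200)')
    FROM @doc.nodes('/Root/Item') AS t(n);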
Anyone have a suggestion?

If it is a one-off thing, I would use the SQL Import/Export tool. If it's something that will need to be scheduled, create an Integration Services package.

With a simple schema (sorry, should have mentioned it) bulk commands would probably have been the way to go. With something messier, and given the size of the file, a SAX parser seems to provide a better solution. It reads the file without loading it all into memory, so I can pick out only the elements I want and not worry about the schema at all.

Related

Extract data from thousands of Excel files into database

We use SharePoint 2013 as a library to hold thousands of Excel files, with almost no consistent formatting, to manage projects occurring on servers. Somewhere in these files, possibly formatted as table objects, is a common set of server names.
Somehow, without being able to change this process in the short term, I need to pull data from all these files to identify how many projects are targeting a particular server.
I've got access to SQL Server 2016 Enterprise, and I'm wondering whether something like PolyBase could help with this. I also wonder about SSIS, but I don't expect any two tables to look exactly alike.
Other tools may be an option, but I'm not sure what can handle this scale and variety. I think daily updates to the data would be enough, but even so it's still a mess.
How do I pull thousands of varied Excel tables into a database? Is this even possible?
Any longer-term solution that doesn't allow them to format and annotate like Excel is unlikely to actually be adopted.
The less you know in advance, the more difficult it will be...
Some ideas:
Technology
Read about FROM OPENROWSET, which allows you to read from an Excel file (see the sketch at the end of this answer)
Read about linked servers
Use Excel and its great abilities through VBA to iterate through all your Excel sheets, open them, analyse them and fill proper tables. Within Excel you know the most about your messy data...
Target structure
You might create thousands of tables, each representing one single sheet from all your Excel files. You could query these tables with dynamically created SQL (using the metadata in INFORMATION_SCHEMA) or think about Full-Text Search
You might import each sheet into one single XML structure (SELECT * ... FOR XML PATH('...')). In this case you'd need a target table with columns for the path and name of your Excel file, the name of the sheet, and an XML column for your data. Another approach would be to represent each file as one XML document and include all sheets there. Try to define a common naming scheme for all your data. Querying XML lets you query columns without knowing their actual names (XQuery with XPath using *).
If your Excel files are already .xlsx, you might open them with an unzip tool and take the existing XML as-is.
To be honest: I do not think that any tool can do the magic to import such a wide range of mess automatically...
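A minimal sketch of the OPENROWSET idea mentioned above, assuming the ACE OLE DB provider is installed on the server and 'Ad Hoc Distributed Queries' is enabled; the file path and sheet name are placeholders:

    -- Ad-hoc read of one sheet from one workbook.
    SELECT src.*
    FROM OPENROWSET(
            'Microsoft.ACE.OLEDB.12.0',
            'Excel 12.0 Xml;HDR=YES;Database=C:\Projects\Plan001.xlsx',
            'SELECT * FROM [Sheet1$]') AS src;

From there the rows can be inserted into a staging table and matched against your list of server names.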

Retrieving text from FileTable SQL Server

Is it possible to retrieve the actual text from a File Table in SQL Server 2014?
I want to implement some hit-highlighting functionality, but in order to do so, I need to retrieve the actual text in the file I indexed, since the content is in a varbinary column.
If it's not possible, I suppose the only alternative is to forget about FileTables and implement an application-side "document reader", so that I'll have real text inside my "file_stream" column instead of the varbinary. Or maybe even define a UDF that uses IFilters behind some C# code, right?
Please, any advice would be really useful.
Before you start on your own implementation, take a look at this very similar question:
SQL Server 2012 FTS with Hit-Highlighting?
Also this blog entry from 2012 is still current:
Hit-Highlighting in Full-Text Search
I would take a look at the HitHighlight function mentioned there (which is actually part of a commercial product, ThinkHighlight). Most likely it's not worth the effort to build your own solution. But if you do, tell me ;)
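One small, hedged addition: if some of the indexed files are plain text, you can inspect their content by casting the FileTable's varbinary column directly; richer formats (Word, PDF) still need an IFilter-style extraction step. The table name below is a placeholder; name, file_stream and file_type are the standard FileTable columns:

    -- Only meaningful for plain-text files; binary formats need real extraction.
    SELECT name,
           CONVERT(varchar(max), file_stream) AS file_text
    FROM   dbo.DocumentStore
    WHERE  file_type = 'txt';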

Sql Server XML columns substitute for Document DB?

Is it possible to use SQL Server XML columns as a substitute for a real document DB (such as Couch or Mongo)?
Say I were to create a table with a GUID PK Id and an XML column for the document.
What would be the main problems compared to using a document DB?
SQL Server supports indexing over XML columns, so querying should not be completely horrible?
You've got several questions in here:
Is it possible to use SQL Server XML columns as a substitute for a real document DB (such as Couch or Mongo)? Yes, you can use it as a substitute, but no, you probably wouldn't be satisfied with the performance if you're exclusively storing XML and not leveraging any of SQL Server's relational tools.
If I were to create a table with a GUID PK Id and an XML column for the document, what would be the main problems compared to using a document DB? In a nutshell: scaling out. SQL Server doesn't scale this kind of thing out well. You can do it with replication, but it's painful to manage relative to a "real" document DB.
SQL Server supports indexing over XML columns, so querying should not be completely horrible? The problem is that SQL Server's XML indexes can take several times the storage space of the original data, and they can't be maintained online (as in defrags), so you end up with locking issues during maintenance windows.
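For context, here is a minimal sketch of the kind of table and XML indexes being discussed (all names are placeholders):

    CREATE TABLE dbo.Documents
    (
        Id  uniqueidentifier NOT NULL PRIMARY KEY,
        Doc xml              NOT NULL
    );

    -- The primary XML index shreds every document into an internal node
    -- table, which is where the large storage overhead comes from.
    CREATE PRIMARY XML INDEX PXML_Documents_Doc
        ON dbo.Documents (Doc);

    -- Optional secondary index to speed up path-based queries.
    CREATE XML INDEX SXML_Documents_Doc_Path
        ON dbo.Documents (Doc)
        USING XML INDEX PXML_Documents_Doc
        FOR PATH;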
I'm doing some experimenting with this on:
http://rogeralsing.com/2011/03/02/linq-to-sqlxml-projections/
Query speed is 'decent', but it's nothing I'd use for scaling.
But the joy of schema free storage running on standard infrastructure is quite nice.
Yes, you can. Storing a document inside a SQL Server XML column will work, and if you use standard XML serialization it will leave you with a decent ACID-compliant key/value store. It will also allow you to query the data with relative ease, and you can join the results to data that you store in a more relational way. We do so; it works. If you store content in XML fields, storage demands are a lot lower than using NTEXT, and querying it is more flexible and faster.
What SQL Server will not get you (compared to Mongo) is the seamless failover of replica sets and the auto-sharding. Also, atomic operations like incrementing a specific property deep inside a document are hard (though not impossible with the XQuery modify() function). Updates tend to be faster on most NoSQL databases because they are more relaxed about the "data is only safe on disk" principle.
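A hedged sketch of that kind of in-place update using XML DML, reusing the hypothetical dbo.Documents table from the sketch above; the document shape (/customer/stats/visits) is an assumption:

    DECLARE @Id uniqueidentifier = NEWID();   -- placeholder: key of the document to touch

    -- Increment a counter buried inside one stored document.
    UPDATE dbo.Documents
    SET Doc.modify('
        replace value of (/customer/stats/visits/text())[1]
        with xs:int((/customer/stats/visits/text())[1]) + 1')
    WHERE Id = @Id;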
Yes, it is possible. As to whether it's a good idea, this is just my 2 cents...
Before the XML data type came along I worked on a system storing XML in an NTEXT column - that wasn't pleasant, and to get any real use out of the data meant shredding some of it out into relational form.
OK, the XML data type now makes it easier to query an XML blob and to extract certain values or index them. But personally, in general, I wouldn't. I'm not saying never use XML, as there are scenarios for it - rather, if that's all you're planning on doing, I'd be asking "is this the right tool for the job?". Using an RDBMS as a document database makes me feel a bit uneasy, whereas something like MongoDB has been built from the ground up as a document database.
In all honesty, I haven't done any performance testing on storing data as XML so I can't give you an indication of what performance would be like. Would be interested to know how this performs at scale.

Fast batch insert/update with SQL Server 2008 and BCP

I'm not a good SQL programmer, I know only the basics, but I've heard of some BCP thing for fast data loading. I've searched the internet and it seems to be a command-line-only utility, not something you can use in code.
The thing is, I want to be able to make very fast inserts and updates in a SQL Server 2008 database. I would like to have a function in the database that would accept:
The name of the table I want to execute an insert/update operation against
The names of the columns I'll be feeding data to
The data in a CSV format or something that SQL can read stupid-fast
A flag indicating whether the function should perform an insert or update operation
This function would then read the CSV string and generate the necessary code for inserting into/updating the table.
I would then write code in C# to call that function passing it the table name, column names, a list of objects serialized as a CSV string and the insert/update flag.
As you can see, this is intended to be both fast and generic, suitable for any project dealing with large amounts of data, and thus a candidate for my company's framework.
Am I thinking right? Is this a good idea? Can I use that BCP thing, and is it suitable to every case?
As you can see, I need some directions on this... thanks in advance for any help!
In C#, look at SqlBulkCopy. It's what SSIS uses in the background.
For true bcp/BULK INSERT you'd need bulkadmin rights, which may not be allowed.
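For comparison, a minimal BULK INSERT sketch of the server-side route; the staging table, file path, and CSV layout are placeholders, and the file must be readable by the SQL Server service account:

    -- Load a CSV file straight into a staging table, then merge from there.
    BULK INSERT dbo.Staging_Load
    FROM 'C:\loads\data.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,       -- skip the header row
        TABLOCK                    -- allows minimal logging under the right conditions
    );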
Have you considered using SQL Server Integration Services (SSIS)? It's designed to do exactly what you describe. It is very fast, you can insert data on a transactional basis, you can set it up to run on a schedule, and much more.

SQL Server XML (multi-namespace) Parsing question

I'm trying to use SQL Server Integration Services (SSIS) to parse an XML file and map the elements to database columns across several tables in a single database.
However, when I use the Data Flow Task->XML Source to try and parse an example XML file (that file is located here, XSD is located here), it says
"http://www.exchangenetwork.net/schema/TRI/4:TransferWasteQuantity" has multiple members named "http://www.exchangenetwork.net/schema/TRI/F:WasteQuantityCatastrophicMeasure"
Is there any way to get SSIS to parse XML data such as this? This schema changes regularly so I'd prefer to do as little parsing code outside of the data mappings as possible. Also if there's a better way to do this outside of SSIS (say, by using SQL Server Analysis Services) then that would work too.
Apparently, due to the complexity of the XML, this is not possible with the current releases of SQL Server, at least not without significant modification of the incoming XML submission and XSD, which would likely lead to data corruption. The conclusion is that it is not feasible to implement.
