Extract data from thousands of Excel files into a database - sql-server

We use SharePoint 2013 as a library to hold thousands of Excel files, with almost no consistent formatting, to manage projects occurring on servers. Somewhere in each of these files, possibly formatted as a table object, is a common set of server names.
Somehow, without being able to change this process in the short term, I need to pull data from all these files to identify how many projects are targeting a particular server.
I've got access to SQL Server 2016 Enterprise, and I'm wondering if something like PolyBase could help with this? I also wonder about SSIS, but I don't expect any two tables to look exactly alike.
Other tools may be an option, but I'm not sure what can handle this scale and variety. I think daily updates to the data would be enough, but even so it's still a mess.
How do I pull thousands of varied Excel tables into a database? Is this even possible?
Any longer-term solution that doesn't let people format and annotate the way they can in Excel is unlikely to actually be adopted.

The less you know in advance, the more difficult it will be...
Some ideas:
Technology
Read about OPENROWSET (used in the FROM clause), which allows you to read from an Excel file (see the sketch after this list)
Read about linked servers
Use Excel itself and its great VBA abilities to iterate through all your Excel sheets, open them, analyse them, and fill proper tables. Within Excel you know the most about your messy data...
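A minimal sketch of the OPENROWSET approach, assuming the Microsoft ACE OLE DB provider is installed on the server; the file path and sheet name are placeholders:

```sql
-- Ad hoc distributed queries must be enabled once per instance.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'ad hoc distributed queries', 1;
RECONFIGURE;

-- Read one sheet of one workbook into a result set.
SELECT *
FROM OPENROWSET(
    'Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0 Xlsx;Database=C:\Data\Project001.xlsx;HDR=YES',  -- placeholder path
    'SELECT * FROM [Sheet1$]'                                    -- placeholder sheet
);
```

From there, an INSERT ... SELECT can land the rows in a staging table.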
Target structure
You might create thousands of tables, each representing one single sheet of all your Excel files. You could query these tables with dynamically created SQL (using the metadata in INFORMATION_SCHEMA) or think about Full-Text Search
You might import each sheet into one single XML structure (SELECT * ... FOR XML PATH('...')). In this case you'd need a target table with columns for the path and name of your Excel file, the name of the sheet, and an XML column for your data (see the sketch after this list). Another approach would be to represent each file as one XML and include all sheets there. Try to define a common naming scheme for all your data. Querying XML allows you to query columns without knowing their actual names (XQuery with XPath using *).
If your Excel files are xlsx already, you might open them with an unzip tool and take the existing XML as-is.
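A sketch of that XML target structure, with hypothetical table, column, and server names; the wildcard in the XQuery shows how to search values without knowing column names:

```sql
-- One row per imported sheet; all names here are hypothetical.
CREATE TABLE dbo.ImportedSheets
(
    Id        INT IDENTITY PRIMARY KEY,
    FilePath  NVARCHAR(400) NOT NULL,
    FileName  NVARCHAR(200) NOT NULL,
    SheetName NVARCHAR(200) NOT NULL,
    SheetData XML           NOT NULL
);

-- Find every sheet mentioning a given server, whatever the column is called:
-- the * wildcard matches any element name at any depth.
SELECT FilePath, FileName, SheetName
FROM dbo.ImportedSheets
WHERE SheetData.exist('//*[text() = "SERVER042"]') = 1;
```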
To be honest: I do not think that any tool can do the magic to import such a wide range of mess automatically...

Related

How can I import multiple CSV files from a folder into SQL, into their own separate tables

I would like some advice on the best way to go about doing this. I have multiple files, all with different layouts, and I would like to create a procedure to import them into new tables in SQL.
I have written a procedure which uses xp_cmdshell to get the list of file names in a folder and then uses a cursor to loop through those file names and a bulk insert to get them into SQL, but I don't know the best way to create a new table with a new layout each time.
I thought that if I could import just the header row into a temp table, I could use that to create a new table to do my bulk insert into, but I couldn't get that to work.
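For reference, here is a hedged sketch of that header-row idea; it assumes SQL Server 2022 (for the ordinal argument of STRING_SPLIT), a comma-delimited file, and hypothetical paths and table names:

```sql
-- Load only the header line into a one-column temp table.
CREATE TABLE #Header (HeaderLine NVARCHAR(MAX));

BULK INSERT #Header
FROM 'C:\Import\file1.csv'          -- hypothetical path
WITH (LASTROW = 1, ROWTERMINATOR = '\n');

-- Build a CREATE TABLE statement from the header, typing every
-- column as NVARCHAR(255) for the initial load.
DECLARE @sql NVARCHAR(MAX);

SELECT @sql = 'CREATE TABLE dbo.file1 ('
            + STRING_AGG(QUOTENAME(LTRIM(s.value)) + ' NVARCHAR(255)', ', ')
              WITHIN GROUP (ORDER BY s.ordinal)
            + ');'
FROM #Header
CROSS APPLY STRING_SPLIT(HeaderLine, ',', 1) AS s;  -- 1 = return ordinals

EXEC sys.sp_executesql @sql;

-- A second BULK INSERT with FIRSTROW = 2 can then load the data rows.
```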
So what's the best way to do this using SQL? I am not that familiar with .NET either. I have thought about doing this in SSIS; I know it's easy enough to load multiple files which have the same layout in SSIS, but can it be done with variable layouts?
thanks
You could use BimlScript to make the whole process automated where you just point it at the path of interest and it writes all the SSIS and T-SQL DDL for you, but for the effort involved in writing the C# you'd need, you may as well just put the data dump into SQL Server in the C#, too.
You can use SSIS to solve this issue, though, and there are a few levels of effort to pick from.
The easiest is to use the SQL Server Import and Export Wizard to create SSIS packages from your Excel spreadsheets that will dump the sheet into its own table. You'd have to run this wizard every time you had a new spreadsheet you wanted to import, but you could save the package(s) so that you could re-import that spreadsheet again.
The next level would be to edit a saved SSIS package (or write one from scratch) to parameterize the file path and the destination table names, and you could then re-use that package for any spreadsheets that followed the same format.
Further along would be to write a package that determined which of the packages from the previous level to call. If you can query the header rows effectively, you could probably write an SSIS package that accepted a path as an input parameter, found all the Excel sheets in that path, queried the header rows to determine the spreadsheet format, and then passed that information to the parameterized package for that format type.
SSIS development is, of course, its own topic - Integration Services Features and Tasks on MSDN is a good place to start. SSIS has its quirks, and I highly recommend learning BimlScript if you want to do a lot of SSIS development. If you'd like to talk over what the ideas above would require in more detail, please feel free to message me.

Automated file import with SSIS package

I am very new to SSIS and its capabilities. I am busy building a new project that will upload files to a database. The problem I am facing is that the files and tables differ from one another.
So what I have done is create a separate mapping table that maps each file's columns to the specific table columns the data needs to be stored in. I want the user to manage this part when they receive a new file or the file layout changes somehow.
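A sketch of what such a mapping table might look like; all names are hypothetical:

```sql
-- Maps columns of incoming files to the destination table columns.
CREATE TABLE dbo.ColumnMapping
(
    MappingId    INT IDENTITY PRIMARY KEY,
    FilePattern  NVARCHAR(200) NOT NULL,  -- e.g. 'Sales_%.csv'
    SourceColumn NVARCHAR(128) NOT NULL,  -- column header in the file
    TargetTable  SYSNAME       NOT NULL,  -- destination table
    TargetColumn SYSNAME       NOT NULL   -- destination column
);
```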
As far as I know, SSIS lets you map each file to a table, and it can be scheduled as a task.
My question is will SSIS be able to handle this or should I handle this process in code?
Many thanks in advance
I would say it all depends on the amount of data that would be imported into your SQL Server. For large data sets (normally 10,000+ rows) it becomes a necessity to utilize SSIS, as you would receive performance gains in your application. Here is a simple example of creating an SSIS package using code. For smaller data operations I would suggest using a combination of this and this. Or, to create a dynamic table on your SQL Server based on the file format, look at this.
SSIS can be very picky about file formats, so if the files are completely different, then it probably isn't the tool for the job. For flat files, SSIS requires the ordering of columns to be the same.
If you know that your files will only ever arrive in one of 5 formats (for example), it wouldn't be much trouble to write 5 packages to import them. If any new file could have a totally different schema, I don't think SSIS would be the right tool for the job.

Best way to import large excel file into SQL Server

We are trying to devise an optimal method for importing very large Excel files into a SQL database. Using SSIS is somewhat troublesome because it scans the top X records to determine the format of the file, but rows further down may be different, so it takes a lot of trial and error, with us having to bring the unusual columns to the top so SSIS can "learn".
When we get new file formats to import, they conform to specification in terms of row formatting etc., so we can say we know the schema in advance. The SQL destination tables have the same schema, with a couple of extra columns such as date inserted and original filename.
Is there an easier way to create format definitions for new files we are going to insert? We don't have to use SSIS; we are open to any other tool, with a view to as much automation as possible. There's also the question of testing the sanity of the data we will import; we were planning on running basic queries against staging datasets, such as "less than 1% of records can miss postal code" etc.
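A sketch of the kind of staging sanity check described above; the staging table and column names are hypothetical:

```sql
-- Reject the batch if 1% or more of the staged rows lack a postal code.
IF (SELECT CAST(SUM(CASE WHEN PostalCode IS NULL THEN 1 ELSE 0 END) AS FLOAT)
           / NULLIF(COUNT(*), 0)
    FROM staging.CustomerImport) >= 0.01
BEGIN
    RAISERROR('Too many staged rows are missing a postal code.', 16, 1);
END;
```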
Many thanks
Maybe you can import the data as text and afterwards convert it using a Derived Column transformation. You can read data from Excel as text using the IMEX option in the connection string. You can find more information about this parameter here.
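For illustration, here is the IMEX=1 flag in an ACE connection string, shown via OPENROWSET with a placeholder path and sheet name; IMEX=1 tells the driver to treat mixed-type columns as text instead of guessing the type from the first rows:

```sql
SELECT *
FROM OPENROWSET(
    'Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0 Xlsx;Database=C:\Data\BigImport.xlsx;HDR=YES;IMEX=1',
    'SELECT * FROM [Sheet1$]'
);
```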

SQL Server XML (multi-namespace) Parsing question

I'm trying to use SQL Server Integration Services (SSIS) to parse an XML file and map the elements to database columns across several tables in a single database.
However, when I use the Data Flow Task->XML Source to try and parse an example XML file (that file is located here, XSD is located here), it says
"http://www.exchangenetwork.net/schema/TRI/4:TransferWasteQuantity" has multiple members named "http://www.exchangenetwork.net/schema/TRI/F:WasteQuantityCatastrophicMeasure"
Is there any way to get SSIS to parse XML data such as this? This schema changes regularly so I'd prefer to do as little parsing code outside of the data mappings as possible. Also if there's a better way to do this outside of SSIS (say, by using SQL Server Analysis Services) then that would work too.
Apparently, due to the complexity of the XML, this is not possible with the latest iterations of SQL Server, at least not without significant modification of the incoming XML submission and XSD, which would likely lead to data corruption. The conclusion is that it is not feasible to implement.
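If shredding the document in T-SQL is ever an option instead of SSIS, WITH XMLNAMESPACES can at least handle the namespace prefixes. A minimal sketch, assuming the submission has been loaded into an XML variable and borrowing one namespace URI and the element names from the error message above (the real document may spread these across several namespaces):

```sql
DECLARE @xml XML;  -- assume the TRI submission has been loaded here

WITH XMLNAMESPACES ('http://www.exchangenetwork.net/schema/TRI/4' AS tri)
SELECT n.value('(tri:WasteQuantityCatastrophicMeasure/text())[1]',
               'NVARCHAR(100)') AS WasteQuantityCatastrophicMeasure
FROM @xml.nodes('//tri:TransferWasteQuantity') AS t(n);
```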

Run a query from two data sets programmatically

I am trying to reconcile data from a website and a database programmatically. Right now my process is manual. I download data from the website, download data from my database, and reconcile using an Excel vlookup. Within Excel, I am only reconciling 1 date for many items.
I'd like to programmatically reconcile the data for multiple dates and multiple items. The problem is that I have to download the data from the website manually. I have heard of people doing "outer joins" and "table joins" but I do not know where to begin. Is this something that I code in VBA or notepad?
Generally I do this by bulk inserting the website data into a staging table and then writing SELECT statements to join that table to my data in the database. You may need to do some cleanup first to be able to match the records if they are stored differently.
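A minimal sketch of that staging-and-join approach; all table and column names are hypothetical:

```sql
-- Staging table for the data downloaded from the website.
CREATE TABLE staging.WebsiteData
(
    ItemCode NVARCHAR(50)   NOT NULL,
    AsOfDate DATE           NOT NULL,
    Amount   DECIMAL(18, 2) NULL
);

-- After a BULK INSERT loads the download, a full outer join surfaces
-- rows that exist on only one side or whose amounts disagree.
SELECT COALESCE(w.ItemCode, d.ItemCode) AS ItemCode,
       COALESCE(w.AsOfDate, d.AsOfDate) AS AsOfDate,
       w.Amount AS WebsiteAmount,
       d.Amount AS DatabaseAmount
FROM staging.WebsiteData AS w
FULL OUTER JOIN dbo.MyItemData AS d
  ON d.ItemCode = w.ItemCode
 AND d.AsOfDate = w.AsOfDate
WHERE w.ItemCode IS NULL        -- in the database only
   OR d.ItemCode IS NULL        -- on the website only
   OR w.Amount <> d.Amount;     -- present in both but different
```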
Python is a scripting language. http://www.python.org
There are tools to allow you to read Excel spreadsheets. For example:
http://michaelangela.wordpress.com/2008/07/06/python-excel-file-reader/
You can also use Python to talk to your database server.
http://pymssql.sourceforge.net/
http://www.oracle.com/technology/pub/articles/devlin-python-oracle.html
http://sourceforge.net/projects/pydb2/
Probably the easiest way to automate this is to save the Excel files you get to disk and use Python to read them, comparing that data with what is in your database.
This will not be a trivial project, but it is very flexible and straightforward. Trying to do it all in SQL will be, IMHO, a recipe for frustration, especially if you are new to SQL.
Alternatively:
You could also do this by using VBA to read in your Excel files and generate SQL INSERT statements that are compatible with your DB schema. Then use SQL to compare them.