In this scenario I am dealing with very large Excel files (~150 MB) that contain multiple sheets, an unknown number of them (we know it is fewer than 20). The number of sheets depends on the number of records in the file; once a sheet reaches its maximum of 1,048,576 rows, another is created.
I would like to use SSIS to import all the data in these files into SQL Server tables.
Previously, instead of using SSIS, we had a stored procedure that read the Excel file using OPENROWSET, imported the data from the first sheet, and, if that sheet had hit the row limit, probed for the next sheet name. The approach is not well optimized, as it relies on the Microsoft ACE OLEDB provider (or the older Microsoft Jet provider), which is very slow compared to a bulk insert.
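For reference, a rough sketch of the kind of OPENROWSET loop that procedure used (the file path, the dbo.ExcelStaging table, and the Sheet1/Sheet2 naming are placeholders, and it assumes the ACE provider and ad hoc distributed queries are enabled):

```sql
-- Probe each sheet in turn; when a sheet does not exist the provider raises an error,
-- which the CATCH block treats as "no more sheets in this file".
DECLARE @sheet int = 1;
WHILE @sheet <= 20
BEGIN
    BEGIN TRY
        DECLARE @sql nvarchar(max) = N'
            INSERT INTO dbo.ExcelStaging
            SELECT * FROM OPENROWSET(
                ''Microsoft.ACE.OLEDB.12.0'',
                ''Excel 12.0;Database=C:\Imports\BigFile.xlsx;HDR=YES;IMEX=1'',
                ''SELECT * FROM [Sheet' + CAST(@sheet AS nvarchar(2)) + N'$]'');';
        EXEC sp_executesql @sql;   -- dynamic SQL so the error is catchable
    END TRY
    BEGIN CATCH
        BREAK;   -- the sheet does not exist: stop probing this file
    END CATCH;
    SET @sheet += 1;
END
```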
Importing Excel data with SSIS seems much faster for simple Excel files. Extracting the sheet names, however, appears to require treating the file as a rowset collection, which slows the process down. My suggested approach is an outer "Foreach Loop" container to handle the Excel files and an inner "For Loop" container to handle the worksheets within each file. When the next worksheet is requested and does not exist, the inner loop should break, but the outer loop should move on to the next file (something similar to a try/catch block).
My question is: is this possible, and if so, how?
(We have an alternative solution that uses CLR to convert the Excel file into CSV files and then import the data, but we would like to keep the job within the SSIS package if possible.)
Related
I've been tasked with importing data from an Excel spreadsheet into a table in SQL Server 2012. The spreadsheet will have data added to it monthly.
My plan is to use SSIS to create a workflow to do this; I will then use SQL Server Agent to execute the workflow at the beginning of every month to add the new data.
One problem I can see with this plan is that the spreadsheet is going to become huge and eventually exceed Excel's maximum row count. Instead of adding to the one spreadsheet, I could have a new spreadsheet for each month, though I'm not sure how the workflow could pick the newest spreadsheet to add to the table.
I'm a complete novice with SSIS, and there might be a more practical way of doing this whole process, so please feel free to offer suggestions.
Why insert data into one Excel worksheet?
Inserting data into one Excel worksheet, or even one workbook (Excel file), is not good practice at all. Think about it differently: you can create a new Excel file each time new data arrives and save the historical data in another repository or directory (if you need to). Or, as @TabAlleman suggested, if you can use flat files, that is preferable, since reading data from Excel is more difficult. But also make sure you do not store all the data in one flat file.
I have 40 Excel sheets in a single folder. I want to load them all into different tables in a SQL Server database through an SSIS package. The difficulty I am having is that each Excel sheet has a different number of columns and different column names.
Can this task be achieved with a single package?
Another option, if you want to do it in one data flow, is to write a custom C# source component with multiple outputs. In the script component you figure out the file type and send the data to the proper output.
The NPOI library (https://npoi.codeplex.com/) is a good way to read Excel files in C#.
But if you have fixed file formats, I would prefer to create N data flows inside a Foreach Loop container. Use regular Excel source components and just ignore errors in each data flow. This lets you take a file and try to load it in each data flow, one by one. On an error you do not fail the package; you simply move on to the next data flow until you find the proper file format.
It can only be done by adding multiple sources, or by using a script component that writes a flag indicating which sheet each row came from. Then you can use a Conditional Split and route rows to multiple destinations.
We use SharePoint 2013 as a library to hold thousands of Excel files, with almost never consistent formatting, to manage projects occurring on servers. Somewhere in these files, possibly formatted as table objects, is a common set of server names.
Somehow, without being able to change this process in the short term, I need to pull data from all these files to identify how many projects are targeting a particular server.
I've got access to SQL Server 2016 Enterprise, and I'm wondering if something like PolyBase could help with this. I also wonder about SSIS, but I don't expect any two tables to look exactly alike.
Other tools may be an option, but I'm not sure what can handle this scale and variety. I think daily updates to the data would be enough, but even so it's still a mess.
How do I pull thousands of varied Excel tables into a database? Is this even possible?
Any longer-term solution that doesn't allow them to format and annotate as they do in Excel is unlikely to actually be adopted.
The less you know in advance, the more difficult it will be...
Some ideas:
Technology
Read about FROM OPENROWSET, which allows you to read from an Excel file.
Read about linked servers (there is a small sketch after this list).
Use Excel itself and its considerable abilities through VBA to iterate through all your Excel sheets, open them, analyse them, and fill the proper tables. Within Excel you know the most about your messy data...
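As a rough illustration of the linked server idea, something like the following would expose one workbook for querying; the server name, file path, and sheet name are placeholders, and it assumes the ACE provider is installed:

```sql
-- Register one workbook as a linked server (path and names are hypothetical).
EXEC master.dbo.sp_addlinkedserver
    @server     = N'XL_PROJECTS',
    @srvproduct = N'Excel',
    @provider   = N'Microsoft.ACE.OLEDB.12.0',
    @datasrc    = N'C:\Library\ProjectPlan.xlsx',
    @provstr    = N'Excel 12.0;HDR=YES;IMEX=1';

EXEC master.dbo.sp_addlinkedsrvlogin
    @rmtsrvname = N'XL_PROJECTS', @useself = N'false';

-- Query one sheet through the linked server.
SELECT * FROM OPENQUERY(XL_PROJECTS, 'SELECT * FROM [Sheet1$]');
```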
Target structure
You might create thousands of tables, each representing one single sheet from all your Excel files. You could query these tables with dynamically created SQL (using the metadata in INFORMATION_SCHEMA), or think about full-text search.
You might import each sheet into one single XML structure (SELECT * ... FOR XML PATH('...')). In this case you'd need a target table with columns for the path and name of your Excel file, the name of the sheet, and an XML column for your data (see the sketch after this list). Another approach would be to represent each file as one XML document and include all its sheets there. Try to define common naming for all your data. Querying XML allows you to query columns without knowing their actual names (XQuery with XPath using *).
If your Excel files are already .xlsx, you might open them with UNZIP and take the existing XML as-is.
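A rough sketch of what such an XML target could look like, together with a wildcard XQuery that finds a server name without knowing any column names; the table and the value 'SRV042' are invented for illustration:

```sql
-- One row per imported sheet; the sheet's rows are stored as untyped XML.
CREATE TABLE dbo.ImportedSheets
(
    ExcelPath  nvarchar(400) NOT NULL,
    ExcelName  nvarchar(260) NOT NULL,
    SheetName  nvarchar(128) NOT NULL,
    SheetData  xml           NOT NULL
);

-- Count how many distinct files mention a given server, whatever the column was called.
SELECT COUNT(DISTINCT ExcelName) AS ProjectsTargetingServer
FROM dbo.ImportedSheets
WHERE SheetData.exist('//*/text()[contains(., "SRV042")]') = 1;
```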
To be honest: I do not think that any tool can do the magic to import such a wide range of mess automatically...
I would like some advice on the best way to go about doing this. I have multiple files, all with different layouts, and I would like to create a procedure to import them into new tables in SQL Server.
I have written a procedure that uses xp_cmdshell to get the list of file names in a folder, then uses a cursor to loop through those file names and a BULK INSERT to get them into SQL Server, but I don't know the best way to create a new table with a new layout each time.
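A stripped-down sketch of what my current procedure does (the folder path, file pattern, and staging table are placeholders; it assumes xp_cmdshell is enabled, and it only works while every file fits the same layout, which is exactly my problem):

```sql
DECLARE @files TABLE (FileName nvarchar(260));

-- List the files in the import folder.
INSERT INTO @files (FileName)
EXEC master..xp_cmdshell 'dir /b "C:\Imports\*.csv"';

DECLARE @file nvarchar(260), @sql nvarchar(max);
DECLARE file_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT FileName FROM @files WHERE FileName IS NOT NULL;

OPEN file_cur;
FETCH NEXT FROM file_cur INTO @file;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Only works while every file matches the dbo.ImportStaging layout.
    SET @sql = N'BULK INSERT dbo.ImportStaging
                 FROM ''C:\Imports\' + @file + N'''
                 WITH (FIRSTROW = 2, FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'');';
    EXEC sp_executesql @sql;
    FETCH NEXT FROM file_cur INTO @file;
END
CLOSE file_cur;
DEALLOCATE file_cur;
```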
I thought that if I could import just the header row into a temp table, I could use that to create a new table to bulk insert into, but I couldn't get that to work.
So what's the best way to do this using SQL? I am not that familiar with .NET either. I have thought about doing this in SSIS; I know it's easy enough to load multiple files that have the same layout in SSIS, but can it be done with variable layouts?
thanks
You could use BimlScript to automate the whole process, so that you just point it at the path of interest and it writes all the SSIS and T-SQL DDL for you; but for the effort involved in writing the C# you'd need, you may as well just do the data dump into SQL Server in the C# too.
You can use SSIS to solve this issue, though, and there are a few levels of effort to pick from.
The easiest is to use the SQL Server Import and Export Wizard to create SSIS packages from your Excel spreadsheets that will dump the sheet into its own table. You'd have to run this wizard every time you had a new spreadsheet you wanted to import, but you could save the package(s) so that you could re-import that spreadsheet again.
The next level would be to edit a saved SSIS package (or write one from scratch) to parameterize the file path and the destination table names, and you could then re-use that package for any spreadsheets that followed the same format.
Further along would be to write a package that determined which of the packages from the previous level to call. If you can query the header rows effectively, you could probably write an SSIS package that accepted a path as an input parameter, found all the Excel sheets in that path, queried the header rows to determine the spreadsheet format, and then passed that information to the parameterized package for that format type.
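As a hedged sketch, "querying the header rows" could be as simple as reading only the first row with HDR=NO, for example from an Execute SQL Task; the path, range, and provider settings here are assumptions:

```sql
-- Read just the first row of the first sheet; with HDR=NO the header comes back as data,
-- so you can fingerprint the layout before deciding which package to run.
SELECT *
FROM OPENROWSET(
    'Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0;Database=C:\Drop\Incoming.xlsx;HDR=NO;IMEX=1',
    'SELECT TOP 1 * FROM [Sheet1$A1:Z1]');
```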
SSIS development is, of course, its own topic - Integration Services Features and Tasks on MSDN is a good place to start. SSIS has its quirks, and I highly recommend learning BimlScript if you want to do a lot of SSIS development. If you'd like to talk over what the ideas above would require in more detail, please feel free to message me.
We are trying to devise an optimal method for importing very large Excel files into a SQL Server database. Using SSIS is somewhat troublesome because it scans the top X records to determine the format of the file, but rows further down may be different, so it takes a lot of trial and error, with us having to bring the rows with unusual values to the top so SSIS can "learn".
When we get new file formats to import, they conform to a specification in terms of row formatting etc., so we can say we know the schema in advance. The SQL destination tables have the same schema, with a couple of extra columns such as date inserted and original filename.
Is there an easier way to create format definitions for the new files we are going to import? We don't have to use SSIS; we are open to any other tool, with a view to as much automation as possible. There's also the question of sanity-checking the data we import; we were planning on running basic queries against staging datasets, such as "less than 1% of records may be missing a postal code", etc.
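For illustration, the kind of check we had in mind looks roughly like this (the staging.ImportBatch table and PostalCode column are just placeholders):

```sql
-- Fail the batch if more than 1% of staged rows are missing a postal code.
DECLARE @missingRatio float =
    (SELECT CAST(SUM(CASE WHEN PostalCode IS NULL OR PostalCode = '' THEN 1 ELSE 0 END) AS float)
            / NULLIF(COUNT(*), 0)
     FROM staging.ImportBatch);

IF @missingRatio > 0.01
BEGIN
    THROW 50001, 'More than 1% of staged records are missing a postal code.', 1;
END;
```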
Many thanks
Maybe you can import the data as text and then convert it using a Derived Column transformation. You can read data from Excel as text using the IMEX option in the connection string. You can find more information about this parameter here.
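For reference, IMEX goes into the Extended Properties part of the Excel connection string, roughly like this (the file path is a placeholder):

```
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Data\Input.xlsx;
Extended Properties="Excel 12.0 XML;HDR=YES;IMEX=1";
```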