SSIS - Ultra large file (500MB) fails with OutOfMemory exception

I have an SSIS package which loads data from a non-standard XML file into a database table with an xml datatype column. I call it a non-standard file because it has some invalid characters, like tabs, which I remove using a Script Task, and because it has hierarchy tags that can be present for some keys and not for others. I had tried using XSLT, but it did not work because all the attributes (tags) appear as separate outputs in the SSIS XML Source rather than a single output. So I read the whole XML file as a single column and single row from a Flat File Source.

The package runs fine when loading small files (up to 8 MB) but fails when the size is large. When trying to load a 500 MB file, the Script Task failed with an OutOfMemory error, so the file was sent in smaller chunks. Now the Script Task works for a 90 MB file, but the Data Flow Task fails at the destination because SSIS reads only part of the XML rather than the whole file. I adjusted DefaultBufferMaxRows to 1 and DefaultBufferSize to 100 MB from the defaults of 10,000 rows and 10 MB respectively. I found that the Flat File Source is reading 8193 KB of data (8388609 characters).
Please advise.
Note: I am running the SSIS package from Citrix. I am storing the whole XML document in a table and then using .nodes() to extract the relevant information into the relevant stage tables.

In your Script Task, open a StreamReader to process the file incrementally, and stream it into an nvarchar(max) or xml column in SQL Server. There is no need to load the whole thing in SSIS.
See SqlClient Streaming Support
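A minimal sketch of that approach, assuming .NET 4.5+ (where SqlClient streaming support was added) and a hypothetical staging table dbo.XmlStage with an nvarchar(max) column named Doc; the table and column names are placeholders to adjust to your schema, and the same pattern works when the destination column is xml:

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;

    // Sketch of a helper you might add to the Script Task's ScriptMain class and call
    // from Main(). It streams a large file into SQL Server without buffering it in memory.
    public static void StreamFileIntoTable(string filePath, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var fileReader = new StreamReader(filePath))   // reads the file incrementally
        using (var command = new SqlCommand(
            "INSERT INTO dbo.XmlStage (Doc) VALUES (@doc)", connection))
        {
            // Passing a TextReader with Size = -1 tells SqlClient to stream the value
            // to the server instead of loading the whole document into memory first.
            command.Parameters.Add("@doc", SqlDbType.NVarChar, -1).Value = fileReader;

            connection.Open();
            command.ExecuteNonQuery();
        }
    }

Because the parameter value is a TextReader rather than a string, the document is pushed to the server in chunks and never fully materialized in the Script Task; clean-up such as stripping tabs could be done in a custom TextReader wrapped around the StreamReader.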
Edit your question to include a pared-down file, including the transformations you need to make if you need an example.

Related

Parsing CEF files in snowflake

We have staged the log files in an external S3 stage. The staged log files are in CEF format. How can we parse the CEF files from the stage and move the data into Snowflake?
If the files have a fixed format (i.e. there are record and field delimiters and each record has the same number of columns) then you can just treat them as text files and create an appropriate file format.
If the files have a semi-structured format then you should be able to load them into a variant column - whether you can create multiple rows per file or only one depends on the file structure. If you can only create one record per file then you may run into issues with file size, as a variant column has a maximum size.
Once the data is in a variant column you should be able to process it to extract usable data from it. If there is a structure Snowflake can process (e.g. xml or json) then you can use the native capabilities. If there is no recognisable structure then you'd have to write your own parsing logic in a stored procedure.
Alternatively, you could try and find another tool that will convert your files to an xml/json format and then Snowflake can easily process those files.

Can SSIS support loading of files with varying column lengths in each row?

Currently I receive a daily file of around 750k rows and each row has a 3 character identifier at the start.
For each identifier, the number of columns can change but are specific to the identifier (e.g. SRH will always have 6 columns, AAA will always have 10 and so on).
I would like to be able to automate loading this file into a SQL table through SSIS.
The current solution is built in MS Access using VBA, which loops through recordsets with a CASE statement and writes each record to the relevant table.
I have been reading up on BULK INSERT, BCP (with a format file) and Conditional Split in SSIS; however, I always get stuck at the first hurdle of even loading the file, as SSIS errors out because of the variable column layouts.
The data file is pipe delimited and looks similar to the below.
AAA|20180910|POOL|OPER|X|C
SRH|TRANS|TAB|BARKING|FORM|C|1.026
BHP|1
*BPI|10|16|18|Z
BHP|2
*BPI|18|21|24|A
(* I have added the * to show that these are child records of the parent record, in this case BHP can have multiple BPI records underneath it)
I would like to be able to load the TXT file into a staging table, and then I can write the TSQL to loop through the records and parse them to their relevant tables (AAA - tblAAA, SRH - tblSRH...)
I think you should read each row as a single column of type DT_WSTR with length 4000, then implement the same logic you wrote in VBA inside a Script Component (VB.NET / C#); a rough sketch follows the links below. These similar posts may give you some insights:
SSIS ragged file not recognized CRLF
SSIS reading LF as terminator when its set as CRLF
How to load mixed record type fixed width file? And also file contain two header
SSIS Flat File - CSV formatting not working for multi-line fileds
how to skip a bad row in ssis flat file source
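As a rough illustration of that Script Component idea, here is a C# sketch of the per-row logic. The output buffer classes (AaaOutputBuffer, SrhOutputBuffer) and their column names are placeholders: they are generated from whatever outputs and columns you define in the component designer, so treat this as a pattern rather than a drop-in implementation.

    // Sketch of Input0_ProcessInputRow inside the Script Component's ScriptMain class.
    // The component receives each line in a single DT_WSTR input column named Line and
    // routes it to an identifier-specific output. Buffer and column names are placeholders.
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        string[] fields = Row.Line.Split('|');

        switch (fields[0])
        {
            case "AAA":
                AaaOutputBuffer.AddRow();
                AaaOutputBuffer.FileDate = fields[1];
                AaaOutputBuffer.Pool = fields[2];
                // ... map the remaining AAA columns here ...
                break;

            case "SRH":
                SrhOutputBuffer.AddRow();
                SrhOutputBuffer.TransType = fields[1];
                // ... map the remaining SRH columns here ...
                break;

            default:
                // BHP/BPI child records (or unknown identifiers) could be routed to a
                // catch-all staging output and parsed later in T-SQL, as the question suggests.
                break;
        }
    }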

Piping Data from CSV File to OLEDB Destination in SSIS

I have an SSIS package in which I use a ForEach Container to loop through a folder destination and pull a single .csv file.
The Container takes the file it finds and uses the file name for the ConnectionString of a Flat File Connection Manager.
Within the Container, I have a Data Flow Task to move row data from the .csv file (using the Flat File Connection Manager) into an OLEDB destination (this has another OLEDB Connection Manager it uses).
When I try to execute this container, it can grab the file name, load it into the Flat File Connection Manager, and begin to transfer row data; however, it continually errors out before moving any data - namely over two issues:
Error: 0xC02020A1 at Move Settlement File Data Into Temp Table, SettlementData_YYYYMM [1143]: Data conversion failed. The data conversion for column "MONTHS_REMAIN" returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
Error: 0xC02020A1 at Move Settlement File Data Into Temp Table, Flat File Source [665]: Data conversion failed. The data conversion for column "CUST_NAME" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
In my research so far, I know that you can set what conditions to force an error-out failure and choose to ignore failures from Truncation in the Connection Manager; however, because the Flat File Connection Manager's ConnectionString is re-made each time the Container executes, it does not seem to hold on to those option settings. It also, in my experience, should be picking the largest value from the dataset when the Connection Manager chooses the OutputColumnWidth for each column, so I don't quite understand how it is truncating names there (the DB is set up as VARCHAR(255) so there's plenty of room there).
As for the failed data conversions, I also do not understand how that can happen when the column referenced contains simple Int values, and both the Connection Manager and the receiving DB use floats, which should encompass the Int data (or am I wrong that Int converts to Float?).
It's been my experience that some .csv files don't play well in SSIS when going directly into a DB destination; so, would it be better to transform the .csv into a .xlsx file, which plays much nicer going into a DB, or is there something else I am missing to easily move massive amounts of data from a .csv file into a DB - OR, am I just being stupid and turning a trivial matter into something bigger than it is?
Note: The reason I am dynamically setting the file in the Flat File Connection Manager is that the .csv file will have a set name appended with the month/year it was produced as part of a repeating process, and so I use the constant part of the name to grab it regardless of the date info
EDIT:
Here is a screen cap of my Flat File Connection Manager previewing some of the data that it will try to pipe through. I noticed some of these rows have quotes around them, and wanted to make sure that wouldn't affect anything adversely - the column having issues is the MONTHS_REMAIN one
Is it possible that one of the csv files in the suite you are processing is malformed? For instance, if one of the files had an extra column/comma, then that could force a varchar column into an integer column, producing errors similar to the ones you have described. Have you tried using error row redirection to confirm that all of your csv files are formed correctly?
To use error row redirection, update your Flat File Source and adjust the Error Output settings to redirect rows. Your Flat File Source component will now have an extra red arrow which you can connect to a destination. Drag the red arrow from your source component to a new Conditional Split. Next, right-click the red line and add a data viewer. Now, when error rows are processed, they will flow over the red line into the data viewer so you can examine them. Last, execute the package and wait for the data viewer to capture the errant rows for examination.
Do the data values captured by the data viewer look correct? Good luck!

How can I use SQL*Loader to load data into my database tables directly from a tar.gz file?

I am trying to load data into my Oracle database table from an external tar.gz file. I can load data easily from a standard text file using SQL*Loader, but I'm not sure how to do the same if I have a tar.gz file instead of a plain text file.
I found the following link somewhat helpful:
http://www.simonecampora.com/blog/2010/07/09/how-to-extract-and-load-a-whole-table-in-oracle-using-sqlplus-sqlldr-named-pipes-and-zipped-dumps-on-unix/
However, the author of the link is using .dat.gz instead of .tar.gz. Is there any way to load data into my Oracle database table using SQL*Loader from a tar.gz file instead of a text file?
Also, part of the problem for me is that I'm supposed to load data from a new tar.gz file every hour into the same table. For example, in hour 1 I have file1.tar.gz and I load all its 10 rows of data into TABLE in my Oracle database. In hour 2 I have file2.tar.gz and I have to load its 10 rows of data into the same TABLE. But the 10 rows extracted by SQL*Loader from file2.tar.gz keep replacing the first 10 rows extracted from file1.tar.gz. Is there any way I can keep the rows from file1.tar.gz as rows 1-10 and the file2.tar.gz rows as rows 11-20 using SQL*Loader?
The magic is in the "zcat" part. zcat can write the decompressed contents of gzipped files, including tar.gz archives, to standard output.
For example, try: zcat yourfile.tar.gz and you will see output. In the example URL you provided, they are redirecting the output of zcat into a named pipe that SQLLDR can read from.

SSIS using too much memory to load large (40GB+) XML file into SQL Server table

I need to load a single large (40GB+) XML file into an SQL Server 2012 database table using SSIS. I'm having problems because SSIS seems to be trying to load the entire document in memory instead of streaming it.
Here's more details of my SSIS package.
I've created an XML Source with the following properties:
Data access mode: XML file from variable (but could be XML File Location)
Variable name: variable that specifies the XML file path in my computer.
XSD location: the path to the XSD that defines the XML being read.
The XML structure is simple, with only 3 hierarchical levels:
Root element with header information
One level defining collections of objects
The leaf level defining individual objects (each with a fixed set of fields)
I need to insert one database record per leaf element, repeating fields from the higher hierarchy levels. In other words, I need to flatten the XML hierarchy.
How can I make SSIS stream load the data, instead of trying to load the entire document in memory?
The XML source always loads the entire file. It uses XmlDocument to do so (last I checked).
The only thing you can do is split up the file somehow, then iteratively run each piece through your data flow.
Beyond that, you're looking at creating a custom data source, which is not trivial. It also represents a serious piece of code to maintain.
There may be third-party data sources which can do this. I had to write my own about five years ago.
Have you considered processing the files in smaller chunks?
I had the same issue before, so I created a Script Component to split that one big XML file into hundreds of smaller XML files, then used a Foreach Loop to iterate over all of the smaller files and process them.
Rather than reading the file with StreamReader.ReadLine, I used System.IO.MemoryMappedFiles, a class designed for working with very large files.
Have a look here http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx
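For what it's worth, here is a hedged sketch of that splitting idea, combining a memory-mapped view stream with a forward-only XmlReader so that only one leaf element is held in memory at a time. The leaf element name, the wrapper root, the chunk naming, and the batch size are all illustrative assumptions to adapt to your schema.

    using System.IO;
    using System.IO.MemoryMappedFiles;
    using System.Xml;

    // Sketch: split one huge XML file into many smaller files of batchSize leaf elements
    // each, so a Foreach Loop can feed them through the data flow one by one.
    // leafName (e.g. "Object") and the "chunk_NNNN.xml" naming are illustrative assumptions.
    public static void SplitLargeXml(string sourcePath, string outputDir,
                                     string leafName, int batchSize)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(sourcePath, FileMode.Open))
        using (var view = mmf.CreateViewStream(0, 0, MemoryMappedFileAccess.Read))
        using (var reader = XmlReader.Create(view))
        {
            XmlWriter writer = null;
            int fileIndex = 0, count = 0;

            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != leafName)
                    continue;

                if (writer == null)
                {
                    string chunkPath = Path.Combine(outputDir,
                        string.Format("chunk_{0:D4}.xml", fileIndex++));
                    writer = XmlWriter.Create(chunkPath);
                    writer.WriteStartElement("Root");          // wrapper root for the chunk
                }

                using (XmlReader leaf = reader.ReadSubtree())  // copy one leaf element
                {
                    writer.WriteNode(leaf, false);
                }

                if (++count >= batchSize)
                {
                    writer.WriteEndElement();
                    writer.Close();
                    writer = null;
                    count = 0;
                }
            }

            if (writer != null) { writer.WriteEndElement(); writer.Close(); }
        }
    }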
