SSIS - Export table data to flat file in chunks - sql-server

I have a requirement to export table data into flat files (.csv format), with each file containing only 5000 records. If the table has 10000 records, then it has to create 2 files with 5000 records each. The number of records in the table will increase on a daily basis, so basically I am looking for a dynamic solution that will export any number of records into however many files are needed, with 5000 records per file.
A simple visualization:
Assume the table has 10230 records. What I need is:
File_1.csv - 1 to 5000 records
File_2.csv - 5001 to 10000 records
File_3.csv - 10001 to 10230 records
I have tried the BCP command for the above-mentioned logic. Can this be done using a Data Flow Task?

No, that is not something SSIS is going to support well natively.
A Script Task, or Script Component acting as a destination, could accomplish this but you'd be re-inventing a significant portion of the wheel with all the file handling required.
The first step would be to add a row number to all the rows coming from the source in a repeatable fashion. That could be as simple as SELECT *, ROW_NUMBER() OVER (ORDER BY MyTablesKey) AS RN FROM dbo.MyTable
Now that you have a monotonically increasing value associated with each row, you could use the referenced answer to pull the data for a given range if you take the ForEach approach.
If you could put a reasonable upper bound on how many buckets/files of data you'd ever have, then you could use the analytic/window functions to assign each row to a fixed-size bucket. All of the data is then fed into the data flow, and a conditional split with that upper bound's worth of outputs routes each bucket to its own flat file destination.
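For example, a source query along these lines would tag every row with a bucket number that the conditional split can route on (a sketch only; dbo.MyTable and MyTablesKey are placeholders for your own table and key):

-- integer division puts rows 1-5000 in bucket 0, rows 5001-10000 in bucket 1, and so on
SELECT t.*,
       ROW_NUMBER() OVER (ORDER BY t.MyTablesKey) AS RN,
       (ROW_NUMBER() OVER (ORDER BY t.MyTablesKey) - 1) / 5000 AS BucketNumber
FROM dbo.MyTable AS t;

The conditional split then needs one output (and one flat file destination) per possible BucketNumber, which is why this only works when you can put a ceiling on the number of files.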
An alternative approach would be to export the whole file as-is and then use something like PowerShell to split it into smaller files. Unix is nice here, as it has split as a native utility for just this sort of thing.

Well, it can be done with standard SSIS components and SQL Server 2012+. The idea is the following: use SELECT ... ORDER BY ... OFFSET <row offset> ROWS FETCH NEXT <row count> ROWS ONLY as the bucket source, and combine it with a For Loop container and a Flat File Destination driven by expressions.
More details:
Create a package with an int variable Iterator initialized to 0, and a Flat File Destination whose connection string is defined by an expression such as "\\Filename_" + (DT_WSTR,20)[User::Iterator] + ".csv". Also define a Bucket_Size int variable or parameter.
Create a For Loop container. Leave its properties empty for now; the next steps go inside the For Loop.
At the For Loop (or package) scope - up to you - create a string variable SQL_rowcount with the expression "SELECT COUNT(*) FROM (SELECT <key column> FROM ... ORDER BY ... OFFSET " + (DT_WSTR,20)([User::Iterator]*[User::Bucket_Size]) + " ROWS) AS remaining". The count is wrapped around a derived table so the OFFSET applies to the detail rows rather than the aggregate; the query gives you the number of rows remaining from the current offset onward.
Create an Execute SQL Task with its command taken from the SQL_rowcount variable, storing the single-row result into a variable Bucket_Rowcount.
Create a string variable SQL_bucket with the expression "SELECT ... FROM ... ORDER BY ... OFFSET " + (DT_WSTR,20)([User::Iterator]*[User::Bucket_Size]) + " ROWS FETCH NEXT " + (DT_WSTR,20)[User::Bucket_Size] + " ROWS ONLY".
Create a simple Data Flow Task - an OLE DB Source with its command from the SQL_bucket variable, and the Flat File Destination from step 1.
Now a little trick - we have to define the loop conditions. We do it based on the current bucket row count: the last bucket has no more than Bucket_Size rows. The continuation condition (checked before each iteration) is that the previous count was more than Bucket_Size, i.e. at least one row is left for the next iteration.
So, define the following properties on the For Loop container:
InitExpression - @Bucket_Rowcount = @Bucket_Size + 1
EvalExpression - @Bucket_Rowcount > @Bucket_Size
AssignExpression - @Iterator = @Iterator + 1
This is it.
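For illustration, with a Bucket_Size of 5000 and Iterator = 1, the statements these two variables build would resolve to something like this (dbo.MyTable and MyTablesKey are placeholders for your own table and sort key):

-- SQL_rowcount: rows remaining from the current offset onward;
-- the COUNT sits around a derived table so the OFFSET applies to the detail rows
SELECT COUNT(*)
FROM (SELECT MyTablesKey
      FROM dbo.MyTable
      ORDER BY MyTablesKey
      OFFSET 5000 ROWS) AS remaining;

-- SQL_bucket: the slice of rows that goes into the current file
SELECT *
FROM dbo.MyTable
ORDER BY MyTablesKey
OFFSET 5000 ROWS FETCH NEXT 5000 ROWS ONLY;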
You can optimize this if the source table is not modified during the export: first (before the For Loop) fetch the total number of rows and figure out the number of buckets, then do exactly that many iterations. That way you avoid repeating the SELECT COUNT(*) statement inside the loop.
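A sketch of that up-front calculation (placeholder table name again): run it once in an Execute SQL Task, store the result in a variable, and drive a plain For Loop from 0 to that value minus one.

-- number of files needed = total rows divided by the bucket size, rounded up;
-- the ? parameter is assumed to be mapped to User::Bucket_Size in the Execute SQL Task
SELECT CEILING(COUNT(*) * 1.0 / ?) AS BucketCount
FROM dbo.MyTable;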

Related

COPY INTO: is there a way to show number of records skipped during loading data into Snowflake?

I am using COPY INTO <table> from an external location, and there is an option to continue loading the data in case a row has corrupted data. Is there an option to show how many rows were skipped while loading, like there is in Teradata TPT?
Assuming that you are not doing transformations in your COPY INTO command, you can leverage the VALIDATE() function after the load and get the records skipped and the reason why they were not loaded:
https://docs.snowflake.com/en/sql-reference/functions/validate.html
Example where t1 is your table being loaded. You can also specify a specific query_id if you know it:
select * from table(validate(t1, job_id => '_last'));
The COPY INTO outputs the following columns:
ROWS_PARSED: Number of rows parsed from the source file
ROWS_LOADED: Number of rows loaded from the source file
ERROR_LIMIT: If the number of errors reaches this limit, then abort
ERRORS_SEEN: Number of error rows in the source file
The number of rows skipped can be calculated as ROWS_PARSED - ROWS_LOADED. I am using pyodbc; the parsing of these columns might differ depending on how you are scripting it.
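A hedged sketch of that calculation, run immediately after the load (my_table and my_stage are placeholders, and the quoted column names follow the COPY INTO output listed above - adjust the quoting if your results differ):

COPY INTO my_table FROM @my_stage ON_ERROR = 'CONTINUE';

-- read the COPY result back and compute how many rows were skipped
SELECT "rows_parsed" - "rows_loaded" AS rows_skipped
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));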

Is it possible to load two flat files with different columns, perform calculations on the data, and then upload one data set to the database?

I have two files that look sorta like this:
I need to upload the data from File 1. But before I do, I need to subtract the quantity in File 2 from the quantity in File 1. In this case, the final value would be 80. But in some cases, it might be negative, and I would need to show 0 instead.
Is it possible to import both files and do the calculations as needed? I would think I could use a script block, but I'm unsure if it would be possible to pull the data from the second file as I'm iterating through the first.
I probably could upload to a staging table and call a proc, but I'd like to avoid that.
Yes, it is possible. You can use MERGE JOIN, DATA CONVERSION, and DERIVED COLUMN transformations.
File1 Input
Name,Quantity
A123,100
A234,40
File2 Input
Name,Quantity
A123,80
A234,80
DataFlow Task
MERGE JOIN
The Data Conversion is simply there to convert the Quantity columns to DT_NUMERIC.
Add the expression below in the Derived Column:
[Copy of Quantity_File1] - [Copy of Quantity_File2] > 0 ? [Copy of Quantity_File1] - [Copy of Quantity_File2] : 0
Output
As mentioned in the comment, the sources for the MERGE JOIN must be sorted. You can do that either by using a SORT transformation, or in the source's Advanced Editor under Input and Output Properties: change IsSorted to True, then select your Name column and set SortKeyPosition to 1.
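For reference, the calculation the data flow performs is equivalent to this T-SQL, shown only to make the clamp-at-zero rule explicit (File1 and File2 here stand in for the two inputs; the actual package keeps everything in the data flow, as the question requires):

SELECT f1.Name,
       -- subtract File2's quantity and floor the result at zero
       CASE WHEN f1.Quantity - ISNULL(f2.Quantity, 0) > 0
            THEN f1.Quantity - ISNULL(f2.Quantity, 0)
            ELSE 0
       END AS Quantity
FROM File1 AS f1
LEFT JOIN File2 AS f2
    ON f2.Name = f1.Name;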
You can use a MERGE JOIN transformation in SSIS, and then add a DERIVED COLUMN transformation to do the calculation after the two inputs have been joined.

SSIS Script Component - get raw row data in data flow

I am processing a flat file in SSIS and one of the requirements is that if a given row contains an incorrect number of delimiters, fail the row but continue processing the file.
My plan is to load the rows into a single column in SQL server, but during the load, I’d like to test each row during the data flow to see if it has the right number of delimiters, and add a derived column value to store the result of that comparison.
I’m thinking I could do that with a Script Component, but I’m wondering if anyone has done that before and what the best method would be. If a Script Component is the way to go, how do I access the raw row, with its delimiters, inside the script?
SOLUTION:
I ended up going with a modified version of Holder's answer as I found that TOKENCOUNT() will not count null values per this SO answer. When two delimiters are not separated by a value, it will result in an incorrect count (at least for my purposes).
I used the following expression instead:
LEN(EntireRow) - LEN(REPLACE(EntireRow, "|", ""))
This results in the correct count of delimiters in the row, regardless of whether there's a value in a given field or not.
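The same check is easy to sanity-test in T-SQL, since LEN and REPLACE behave the same way there (a standalone sketch; the sample row assumes a 5-field layout, i.e. 4 pipe delimiters):

DECLARE @EntireRow nvarchar(4000) = N'A|B||D|E';  -- the empty third field is still counted

SELECT LEN(@EntireRow) - LEN(REPLACE(@EntireRow, '|', '')) AS DelimiterCount;  -- returns 4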
My suggestion is to use a Derived Column to do your test,
And then add a Conditional Split to decide if you want to insert the rows or not.
Something like this:
Use the TOKENCOUNT function in the Derived Column expression to get the number of columns, like this: TOKENCOUNT(EntireRow,"|")

Auto-generating destinations of split files in SSIS

I am working on my first SSIS package. I have a view with data that looks something like:
Loc Data
1 asd
1 qwe
2 zxc
3 jkl
And I need all of the rows to go to different files based on the Loc value. So all of the data rows where Loc = 1 should end up in the file named Loc1.txt, and the same for each other Loc.
It seems like this can be accomplished with a conditional split to flat file, but that would require a destination for each Location. I have a lot of Locations, and they all will be handled the same way other than being split in to different files.
Is there a built-in way to do this without creating a bunch of destination components? Or can I at least use a Script Component to act as the destination?
You should be able to set an expression using a variable. Define your path up to the directory and then set the variable equal to that column.
You'll need an Execute SQL Task that returns the locations as a result set, and a loop container that runs once for every row in that result set.
I don't have access at the moment to post screenshots, but this link should help outline the steps.
So when your package runs the expression will look like:
"C:\\Documents\\MyPath\\location" + @[User::LocationColumn] + ".txt"
It should end up feeding your directory with files according to location.
Set User::LocationColumn equal to the Location column of the current row in your result set, and write your source query to group by Location, so that all records for a given Location are written to a single file.
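A sketch of the two queries this pattern needs (view and column names follow the sample above; the exact wiring is an assumption on my part):

-- Execute SQL Task: the list of locations that drives the loop
SELECT DISTINCT Loc FROM dbo.MyView;

-- Source query inside the loop: only the current location's rows,
-- with the ? parameter mapped to the loop variable (e.g. User::LocationColumn)
SELECT Loc, Data
FROM dbo.MyView
WHERE Loc = ?;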
I spent some time trying to complete this task using the method @Phoenix suggested, but stumbled upon this video along the way.
I ended up going with the method shown in the video. I was hoping I wouldn't have to separate it into multiple select statements, one for each location plus an extra one to grab the distinct locations, but I thought the SSIS implementation in the video was much cleaner than the alternative.
Change the connection manager's connection string with an expression that references a variable. As the variable changes, the destination file changes as well.
The connection string expression is:
"C:\\Documents\\ABC\\Files\\" + @[User::data] + ".txt"

compare two excel files in a ssis "Foreach Loop Container"

Introduction: I have multiple Excel files which loop through a Foreach Loop Container in an SSIS package.
The first Excel file, Excel1.xlsx, contains the old data (for example, a column named EffectiveDate populated with dates from 2001-01-01 to 2013-04-01).
The second Excel file, Excel2.xlsx, contains the new entries with EffectiveDate from 2013-05-01 onward, and also contains some old data from Excel1.xlsx.
These two files loop through the Foreach Loop Container.
Problem: Once the first Excel file, Excel1.xlsx, is loaded, I want to compare it with the second Excel file, Excel2.xlsx, and update the EffectiveDate of the old data in Excel2.xlsx with the EffectiveDate of the matching rows in Excel1.xlsx,
and all other rows (the new entries) of Excel2.xlsx with GETDATE().
Is it possible to get it done in a single Data Flow Task?
And also, how do I compare two Excel files in a single container?
You can have two Excel sources within one Data Flow Task. You could use a Merge Join to compare the values and feed that to an Excel destination.
If you want to loop through 10 Excel files, comparing one to another, I would suggest that your Merge Join output be the second Excel source, and that you map your container variable to the first Excel source. That way, everything from Excel file 1 will be put into the output file, then for each subsequent file only the entries not already listed in the output file will be added.
If you get hung up on any of the steps individually I'm sure myself or others can help you push through the sticking points.
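To make the comparison concrete, the logic is roughly equivalent to this T-SQL (Excel1 and Excel2 stand in for the two sheets, KeyColumn for whatever identifies a matching row; purely illustrative, since in the package this is done with a Merge Join and a Derived Column):

SELECT e2.KeyColumn,
       -- matching rows keep the old EffectiveDate; new entries are stamped with the current date
       COALESCE(e1.EffectiveDate, GETDATE()) AS EffectiveDate
FROM Excel2 AS e2
LEFT JOIN Excel1 AS e1
    ON e1.KeyColumn = e2.KeyColumn;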
