Unable to sort the records in Data Flow Task in SSIS - sql-server

I am trying to load around 7 million records from a flat file into a database, and I need to sort these records for merging. My Sort task within the Data Flow Task (DFT) reads all 7 million rows as input but outputs only 90k rows. Is there any limit on the number of rows that can be sorted in SSIS? If so, what are the possible alternatives?

The issue was blank and null values in certain columns: they broke the Sort transformation, so only a few rows came out sorted. I added a Conditional Split transformation to filter out the null and blank values before the sort.
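For illustration, a Conditional Split condition along these lines (SortColumn is a placeholder for whichever column feeds the sort key) sends only usable rows on to the Sort:

!ISNULL(SortColumn) && TRIM(SortColumn) != ""

Rows failing the condition can be redirected to a separate output for review instead of silently breaking the sort.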

Related

Incremental updates to a Transformed Table

I am trying to set up an ELT pipeline into Snowflake, and it involves a transformation after loading.
This transformation currently creates or replaces a table using data queried from a source table in Snowflake, after performing some manipulations of JSON data.
My question is: is creating or replacing the table every time the transformation runs the proper way to do this, or is there a way to update the data in the transformed table incrementally?
Any advice would be greatly appreciated!
Thanks!
You can insert into the load (source) table and put a stream on it; the stream then tells you the rows, or ranges of rows, that need to be "reviewed", and you can upsert those into the output transform table.
That is, if you are doing something like daily aggregates, and this batch contains data for the last four days, you read the last four days of data from the source (rather than a full read), then aggregate and upsert via the MERGE command. With this model you save on the read/aggregate/write cycle.
We have also used high-water-mark tables to track the last-seen data and/or the lowest value in the current batch.
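A minimal Snowflake sketch of that stream-plus-merge pattern (every table, stream, and column name below is an assumption for illustration):

-- Capture changes on the load (source) table with a stream.
CREATE OR REPLACE STREAM load_stream ON TABLE load_table;

-- Each time the transformation runs, consume the stream and
-- upsert only the changed rows into the transformed table.
MERGE INTO transformed_table AS t
USING (
    SELECT id, payload:some_field::string AS some_field
    FROM load_stream
) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.some_field = s.some_field
WHEN NOT MATCHED THEN INSERT (id, some_field) VALUES (s.id, s.some_field);

Reading the stream inside a DML statement advances its offset, so the next run only sees rows that arrived since.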

Can SSIS support loading of files with varying column lengths in each row?

Currently I receive a daily file of around 750k rows and each row has a 3 character identifier at the start.
For each identifier, the number of columns can change but are specific to the identifier (e.g. SRH will always have 6 columns, AAA will always have 10 and so on).
I would like to be able to automate loading this file into a SQL table through SSIS.
This solution is currently built in MS Access using VBA, looping through recordsets with a CASE statement and then writing each record to the relevant table.
I have been reading up on BULK INSERT, BCP (with a format file), and the Conditional Split in SSIS; however, I always seem to get stuck at the first hurdle of even loading the file in, as SSIS errors out due to the variable column layouts.
The data file is pipe delimited and looks similar to the below.
AAA|20180910|POOL|OPER|X|C
SRH|TRANS|TAB|BARKING|FORM|C|1.026
BHP|1
*BPI|10|16|18|Z
BHP|2
*BPI|18|21|24|A
(I have added the * to show that these are child records of the parent record; in this case, BHP can have multiple BPI records underneath it.)
I would like to be able to load the TXT file into a staging table, and then I can write the T-SQL to loop through the records and parse them into their relevant tables (AAA - tblAAA, SRH - tblSRH...)
I think you should read each row as a single column of type DT_WSTR with length = 4000, then implement the same logic you wrote in VBA within a Script Component (VB.NET / C#). There are similar posts that can give you some insights:
SSIS ragged file not recognized CRLF
SSIS reading LF as terminator when it's set as CRLF
How to load mixed record type fixed width file? And also file contain two header
SSIS Flat File - CSV formatting not working for multi-line fields
How to skip a bad row in SSIS flat file source
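For the staging-table route the asker describes, a minimal T-SQL sketch of the per-identifier split (dbo.Staging and its RawLine column are assumed names; the identifier is the first three characters of each line):

-- Route each raw line to its identifier-specific table.
INSERT INTO dbo.tblAAA (RawLine)
SELECT RawLine FROM dbo.Staging WHERE LEFT(RawLine, 3) = 'AAA';

INSERT INTO dbo.tblSRH (RawLine)
SELECT RawLine FROM dbo.Staging WHERE LEFT(RawLine, 3) = 'SRH';

-- Column-level parsing can then rely on each identifier's fixed column
-- count, e.g. via SUBSTRING/CHARINDEX or STRING_SPLIT (SQL Server 2016+).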

Speed up Excel array formula to find unique distinct values

I have a workbook in which a variable number of rows of data (one per employee) are entered each week on one sheet (DATA ENTRY), and then stored on another sheet (LOG) with the help of a macro that is executed every time the file is saved.
To be able to then retrieve and review employee data for a specific week, I need a column of helper cells in which all the unique distinct dates (weeks) are listed.
I currently do this with the following array formula:
{=IFERROR(INDEX($B$2:$B$1600, MATCH(0,COUNTIF($K$1:K1, $B$2:$B$1600), 0)),"")}
This all works brilliantly, except that I found that this one specific formula slows my file down tremendously. When the file is saved (which triggers data to be copied over to the LOG sheet), it can take up to 10 seconds to process. When this array formula is disabled, it is pretty much instantaneous.
Limiting it to run over 1,600 rows helped significantly (it took much longer before, when I had it set to 20,000), but it is still not enough, and I can't really have this check fewer than 1,600 rows.
Any creative solutions to either make this formula run faster, or to get to the same result (a list of unique distinct dates from a large list of dates) without using an array formula?
Thanks!
You could use Power Query (Get & Transform Data) to populate your list of unique dates.
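A minimal Power Query sketch of that approach (the table name LogTable and column name WeekDate are assumptions):

let
    // Pull the log table from the current workbook.
    Source = Excel.CurrentWorkbook(){[Name = "LogTable"]}[Content],
    // Keep only the date column, then deduplicate and sort.
    DatesOnly = Table.SelectColumns(Source, {"WeekDate"}),
    DistinctDates = Table.Distinct(DatesOnly),
    SortedDates = Table.Sort(DistinctDates, {{"WeekDate", Order.Ascending}})
in
    SortedDates

Loading the result to a sheet and refreshing it on demand avoids recalculating an expensive array formula on every save.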

Copy a very large number of rows from one sheet to another, excluding blank rows in Excel 2010

I'm currently working on an Excel workbook, using the following formula to copy all rows from one sheet (Creation_Series_R) to another, excluding empty rows.
{=IFERROR(INDEX(Creation_Series_R!C:C;SMALL(IF(Creation_Series_R!$C$3:$C$20402<>"";ROW(Creation_Series_R!$C$3:$C$20402));ROW()-ROW(Creation_Series_R!$C$3)+1));"")}
And the formula works very well. Except that when I did my proof of concept I only had a few rows; with the final data, I need to work on 20,400 rows. Add to that the fact that I have 17 columns and 3 similar sheets with similar formulas, and my workbook takes an hour to recompute every time I input just one value.
This workbook is designed as a way for a client to enter data, which is then reorganized so it can be imported directly into our software. I have already limited the amount of data the user can enter per workbook (to their very big disappointment), so I can't really reduce it to fewer than 20,400 rows (it's only 100 funds' worth of financial data).
Is there a way, maybe even using a macro, that I could do this more efficiently?
The big block of array formulas is killing your performance (time-wise).
If your data is in columns A through Q, then I would use column R as a "helper" column. In R2 insert:
=COUNTA(A2:Q2)
and copy down. The macro would:
AutoFilter column R
Hide all rows showing 0 in column R
Copy the visible rows and paste elsewhere as a block
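A minimal VBA sketch of those three steps (the destination sheet name "Output" and the exact ranges are assumptions; it presumes the COUNTA helper sits in column R with a header in row 1):

Sub CopyNonBlankRows()
    Dim src As Worksheet
    Set src = Worksheets("Creation_Series_R")
    Application.ScreenUpdating = False
    With src.Range("A1:R20402")
        ' Keep only rows whose helper count in column R is above zero.
        .AutoFilter Field:=18, Criteria1:=">0"
        ' Copy the visible data columns (A:Q) as one block to the target sheet.
        .Resize(, 17).SpecialCells(xlCellTypeVisible).Copy Worksheets("Output").Range("A1")
        .Parent.AutoFilterMode = False
    End With
    Application.ScreenUpdating = True
End Sub

One pass like this replaces thousands of array formulas with a single filtered copy.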

Using SSIS to insert records without inserting preexisting records

I have a 290 million row source data set, and I get a daily download of 12 million records which contains data from the previous days' downloads. I am having trouble inserting the daily records into the source table while excluding the records I already have. Some of the new records may not be from the previous day; they could be from several days back, so a date restriction won't work. Please help.
I just had this exact same issue. Basically, in the data flow of your SSIS package you need to add a Lookup. Have it match the data you're inserting against the existing data based on the PK. Then you can separate the data from there: choose "Redirect rows to no match output". This makes the green (no match) arrow contain all the data that is not already present.
Use a Lookup component on a key field and, with the no-match output, do an insert (with the match output you could also do an update, though 290 million rows IS going to take A WHILE)...
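If you would rather land the daily file in a staging table and let the database do the work, the set-based equivalent of the no-match insert looks like this (dbo.DailyStage, dbo.Target, and Id are assumed names):

-- Insert only the staged rows whose key is not already in the target.
INSERT INTO dbo.Target (Id, Col1, Col2)
SELECT s.Id, s.Col1, s.Col2
FROM dbo.DailyStage AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.Target AS t WHERE t.Id = s.Id);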
