I've been assigned to take over 5,000 CSV files and merge them to create separate files containing transposed data: each filename becomes a column in a new file (with one source data column extracted from each file as the values), and the rows are dates.
I was after some input/suggestions on how to accomplish this.
Example details as follows:
File1.csv -> File5000.csv
Each file contains the following
Date, Quota, Price, % Value, BaseCost,...etc..,Units
'date1','value1-1','value1-2',....,'value1-8'
'date2','value2-1','value2-2',....,'value2-8'
....etc....
'date20000','value20000-1','value20000-2',....,'value20000-8'
The resulting/merged csv file(s) would look like this:
Filename: Quota.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-1','file2-value1-1','file3-value1-1',
etc.,'file5000-value1-1'
'date20000','file1-value20000-1','file2-value20000-1','file3-value20000-1',
etc.,'file5000-value20000-1'
Filename: Price.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-2','file2-value1-2','file3-value1-2',
etc.,'file5000-value1-2'
'date20000','file1-value20000-2','file2-value20000-2','file3-value20000-2',
etc.,'file5000-value20000-2'
...up to Filename: Units.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-8','file2-value1-8','file3-value1-8',
etc.,'file5000-value1-8'
'date20000','file1-value20000-8','file2-value20000-8','file3-value20000-8',
etc.,'file5000-value20000-8'
I've been able to use an array construct to reformat the data, but due to the sheer number of files and entries it uses far too much RAM; the array gets too big, and the approach is not scalable.
I was thinking of simply loading each of the 5,000 files one at a time and extracting each line one at a time per file, then outputting the results to each of the new files 1-8 row by row. However, this may take an extremely long time to convert the data, even on an SSD, with over 80 million lines of data across 5k+ files.
The idea was that it would load File1.csv, extract the first line, and store the date and first-column value in a simple array. Then it would load File2.csv, extract its first line, check whether the date matches, and if so store that file's first-column value in the same array. Repeat for all 5k files, then write the array out as a row of the first new file (Column1.csv). Then repeat across all the files for each subsequent date, again extracting only the first data column of each file and appending it to Column1.csv. Then repeat the whole process for column 2 data, and so on up to column 8... which would take forever :(
Any ideas/suggestions on an approach via a scripting language?
Note: the machine this will likely run on has only 8GB of RAM, running *nix.
I'm receiving around 100 Excel files on a daily basis. Among these 100 files there are 4 types, whose names start with ALC, PLC, GLC, or SLC followed by some random number, and each Excel file's sheet name is the same as its filename.
Inside each type and each file, cell A3 contains 'Request by' followed by a user name, e.g. Request by 'Ajeet', and we want to pick only the files requested by 'Ajeet'. The first few rows are not formatted; the actual data starts at:
ALC data starts at cell A33
PLC data starts at cell A36
GLC data starts at cell A32
SLC data starts at cell A38
A few files have no data; in that case "NoData" is written in the cell where the data would normally start for that file type.
All file types contain the same number of columns.
So how can we handle all of these situations in SSIS and load the data into a single SQL table, without using a Script Task? I have attached a snapshot of one of the files for your reference.
This will help: how-to-read-data-from-an-excel-file-starting-from-the-nth-row-with-sql-server-integration-services
Copying the solutions here in case the link becomes unavailable (a sketch of the first two is included after the list):
Solution 1 - Using the OpenRowset Function
Solution 2 - Query Excel Sheet
Solution 3 - Google It
Google it; the information above is from the first search result.
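As a rough sketch of how the first two solutions could apply here, assuming a hypothetical file C:\Incoming\ALC_12345.xlsx with a sheet of the same name and data in columns A through K (adjust the start row per file type: A33, A36, A32 or A38):

-- Solution 1 sketch: OPENROWSET against the workbook; this assumes the ACE OLE DB
-- provider is installed and ad hoc distributed queries are enabled.
-- HDR=NO because row 33 holds data, not headers.
SELECT *
FROM OPENROWSET(
    'Microsoft.ACE.OLEDB.12.0',
    'Excel 12.0;Database=C:\Incoming\ALC_12345.xlsx;HDR=NO',
    'SELECT * FROM [ALC_12345$A33:K]');

-- Solution 2 sketch: the same range query entered as the "SQL command" of an
-- SSIS Excel Source instead of selecting the whole sheet:
-- SELECT * FROM [ALC_12345$A33:K]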
I have two Excel files named 'First' and 'Second' in the same location.
They have the same schema.
I used a Foreach Loop Container and put a Data Flow Task inside it.
The data flow diagram looks like this:
Here, I selected the first Excel file as the source....
My Foreach Loop Container Editor:
After running the SSIS package successfully, the output came out like this:
It took data only from the 'First' Excel file, and three times over. I must have done something wrong in there, but I can't figure out what.
Check your Foreach Loop Editor:
Collection>Folder
Collection>Files
The Files property should not contain a particular file name; for multiple Excel files, use a wildcard such as *.xlsx.
Edit:
Use a Script Task to debug: map the Foreach value to a variable and display it through the Script Task.
Edit the Script Task with the code below.
MessageBox.Show(Dts.Variables["Variable"].Value.ToString());
Also, please check that your source Excel connection is configured correctly with the values coming from the Foreach loop.
I have a requirement to export table data into flat files (.csv format), with each file containing only 5,000 records. If the table has 10,000 records then it has to create 2 files with 5,000 records each. The number of records in the table will increase on a daily basis, so basically I am looking for a dynamic solution which will export any number of records into however many files are needed, with only 5,000 records in each.
A simple visualization:
Assume the table has 10,230 records. What I need is:
File_1.csv - 1 to 5000 records
File_2.csv - 5001 to 10000 records
File_3.csv - 10001 to 10230 records
I have tried the BCP command for the logic mentioned above. Can this be done using a Data Flow Task?
No, that is not something SSIS is going to support well natively.
A Script Task, or Script Component acting as a destination, could accomplish this but you'd be re-inventing a significant portion of the wheel with all the file handling required.
The first step would be to add a row number to all the rows coming from the source in a repeatable fashion. That could be as simple as SELECT *, ROW_NUMBER() OVER (ORDER BY MyTablesKey) AS RN FROM dbo.MyTable
Now that you have a monotonically increasing value associated with each row, you could use the referenced answer to pull the data in a given range if you take the ForEach approach.
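A minimal sketch of what that range pull could look like for the ForEach approach, reusing the hypothetical MyTablesKey column and with boundaries that would normally come from package variables:

SELECT q.*
FROM (
    SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.MyTablesKey) AS RN
    FROM dbo.MyTable AS t
) AS q
WHERE q.RN BETWEEN 5001 AND 10000;  -- second 5000-row file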
If you can put a reasonable upper bound on how many buckets/files of data you'd ever have, then you could use one of the analytic functions to assign rows to fixed-size groups. All of the data is then fed into the data flow, and a Conditional Split with that upper bound's worth of outputs routes the rows to the flat file destinations.
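For the conditional-split variant, one way to tag each row with a bucket number (again assuming the hypothetical dbo.MyTable and 5,000-row buckets) might be:

-- Integer division yields bucket 0, 1, 2, ... in 5000-row chunks; the Conditional
-- Split then routes each BucketNumber value to its own Flat File Destination.
SELECT t.*,
       (ROW_NUMBER() OVER (ORDER BY t.MyTablesKey) - 1) / 5000 AS BucketNumber
FROM dbo.MyTable AS t;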
An alternative approach would be to export the file as-is and then use something like PowerShell to split it into smaller units. Unix is nice here, as it has split as a native utility for just this sort of thing.
Well, it can be done with standard SSIS components and SQL Server 2012+. The idea is the following: use SELECT ... ORDER BY ... OFFSET <row offset> ROWS FETCH NEXT <row count> ROWS ONLY as the bucket source, and use it together with a For Loop Container and a Flat File Destination driven by expressions.
More details:
Create the package with an Iterator int variable initialized to 0, and a Flat File Destination whose flat file connection manager has its ConnectionString defined as an expression like "\\Filename_" + (DT_WSTR,10)[User::Iterator] + ".csv". Also define a Bucket_Size int variable or parameter.
Create a For Loop Container. Leave its properties empty for now; the next steps take place inside this For Loop.
On the Loop container (or at package level, up to you) create an SQL_rowcount string variable with the expression "SELECT COUNT(*) FROM (SELECT ... ORDER BY ... OFFSET " + (DT_WSTR,20)([User::Iterator]*[User::Bucket_Size]) + " ROWS) AS remaining". This command gives you the number of rows remaining from the current bucket's offset onwards.
Create an Execute SQL Task with its command taken from the SQL_rowcount variable, storing the single-row result into a Bucket_Rowcount variable.
Create a string variable SQL_bucket with the expression "SELECT ... FROM ... ORDER BY ... OFFSET " + (DT_WSTR,20)([User::Iterator]*[User::Bucket_Size]) + " ROWS FETCH NEXT " + (DT_WSTR,20)[User::Bucket_Size] + " ROWS ONLY" (see the example query after these steps).
Create a simple Data Flow Task: an OLE DB Source with its command from the SQL_bucket variable, and the Flat File Destination from step 1.
Now a little trick: we have to define the loop conditions. We do it based on the current bucket row count; the last bucket has no more than Bucket_Size rows. The continuation condition (checked before entering each iteration) is that the previous iteration still had more than Bucket_Size rows remaining (i.e. at least one row is left for the next iteration).
So, define the following properties for the For Loop Container:
InitExpression - @Bucket_Rowcount = @Bucket_Size + 1
EvalExpression - @Bucket_Rowcount > @Bucket_Size
AssignExpression - @Iterator = @Iterator + 1
This is it.
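For illustration, with a hypothetical source table and Bucket_Size = 5000, the query produced by the SQL_bucket expression on the third iteration (Iterator = 2) would look roughly like:

SELECT col1, col2, col3  -- placeholder column list
FROM dbo.MyTable
ORDER BY col1
OFFSET 10000 ROWS FETCH NEXT 5000 ROWS ONLY;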
You can optimize this if the source table is not modified during the export: first (before the For Loop) fetch the number of rows, figure out the number of buckets, and run exactly that number of iterations. That way you avoid repeating the SELECT COUNT(*) statement inside the loop.
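A sketch of that up-front calculation, again with an assumed 5,000-row bucket size and a hypothetical table name:

-- Total rows divided by the bucket size, rounded up, gives the number of iterations needed.
SELECT CEILING(COUNT(*) / 5000.0) AS BucketCount
FROM dbo.MyTable;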
I have been given a task to load a simple flat file into another one using an SSIS package. The source flat file contains a zip code field. My task is to extract the data and load into another flat file only the rows with a correct zip code (a 5-digit zip code), and redirect the invalid rows to a separate file.
Since I am new to SSIS, any help or ideas is much appreciated.
You can add a Derived Column transformation which determines the length of the field. Then you can add a Conditional Split based on that column: a length of exactly 5 goes down the good path, everything else goes down the reject path.