Providing Excel sheets and a connection string as inputs while packaging a job in Talend

I have designed a job to copy Excel data to SQL tables (18 of them). Now I want to package it.
Since I am new to this I don't know how to proceed.
I want to package it in a way that the user can provide the Excel input sheets and a connection string, and the executable will copy data from these n input sheets to the hard-coded tables.
The user might also want to copy data into any number of tables.
My workflow is:
Stored Procedure -> excel input -> tMap -> tMSSql
    | (on subjob ok)
excel input -> tMap -> tMSSql
    | (on subjob ok)
......
excel input -> tMap -> tMSSql
I want to know where and how to define these as input parameters, or whatever other way this works in Talend.

You'll want to use context variables that can then be defined at run time.
In this case you'll want one variable for the location and file name for the Excel sheet and potentially you'll also want to contextualise your DB connection parameters too.
If you have varying numbers of sheets per workbook, you'll need to reconfigure your job to loop through the sheets in the workbook and output to the appropriate table, rather than hard-coding the number of sheets you work through. This can also be done with context variables, but those should be controlled by your loop rather than defined at run time.
If you are using Talend Open Studio then you'll want to pass your runtime contexts as a parameter when starting the job in either command line or shell/batch script. With the Enterprise editions you can specify context variables in the Talend Administration Console.
To override any preset context variable values at run time, you can pass them as parameters to your job when running it from a batch or shell script by adding --context_param [param-name]=[param-value], for example C:/Talend/Jobs/job.bat --context_param inputDir="C:/Talend/inputDir/"
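For instance, a minimal PowerShell invocation passing both the Excel file and some database connection details might look like the sketch below; the context variable names (excelFile, dbHost, dbName) are hypothetical and must match whatever you define in your job's context group.
# Hypothetical context variable names - align these with the job's context group
& 'C:\Talend\Jobs\job.bat' `
    --context_param excelFile='C:\Talend\inputDir\sheets.xlsx' `
    --context_param dbHost='localhost' `
    --context_param dbName='TargetDb'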

Related

What is the most efficient way to remove a space from a string when importing a csv file to SQL Server using SSIS?

I will be importing records from thousands of CSV files using SSIS. These CSV files will contain a Postal Code column, which has the format A5A 5A5, where "A" is any letter and "5" is any number from 0 to 9.
There is a space between the "A5A" and "5A5" that I want to remove, so that all Postal Codes appear as "A5A5A5".
I am reviewing the documentation and see several options, and I'm trying to narrow down the best one, i.e. the one that requires the fewest steps. So far I am looking at the Derived Column transformation, but that would involve adding another column to my SQL table.
Is there a way I can trim the space without having to add an extra column?
As @Larnu answered via the comments, a Derived Column is likely the most appropriate component to use here.
The expression you're looking for is a REPLACE. Syntax ought to be
REPLACE([PostalCode], " ", "")
You have 10 columns from your CSV. The Derived Column can either replace an existing column or add a new column to the row buffer. I would advocate adding a new column, PostalCodeStripped or something like that. At some point, something weird is going to happen with the data and you'll get an A5A 5A5 that didn't get the space stripped. Having both the original and the parsed value available in debugging can help sort out problems (oh, this has a non-breaking space or a tab instead of, or in addition to, a space).
But, just because a column is in the buffer does not mean you need to create a column for that in the destination table. Just unmap the PostalCode from the row buffer and map PostalCodeStripped to the PostalCode column in the database. You'll see what I'm talking about in the destination component. By default, they'll map based on name matching but you're welcome to wire them up however you see fit.
ELT is an alternate option: bulk load the data into a staging table, then do a simple SELECT into the destination to do the transformation. I might be tempted to not use SSIS at all. BCP or Import-DbaCsv (from the dbatools PowerShell module) would both be quick alternates. If you know PowerShell and want to process the files in a pipe, you can pipe the files into Import-DbaCsv. The PowerShell script can also execute Invoke-DbaQuery to run update or insert queries to do the transformation.
SSIS can also just do the bulk load and then run the T-SQL to do the transformations. I don't like the overhead of maintaining and upgrading SSIS packages; I'd take T-SQL jobs over SSIS jobs any day. (We have about half a year of FTE work to upgrade our SSIS packages to SQL 2019, whereas the T-SQL jobs just keep working when moved to a new version.)
Or go the ETL route and do the transformation in the SSIS data flow. A Derived Column transformation between a flat file source and a OLE DB destination should do the trick.
To handle multiple files, you can use the Foreach Loop Container. There's an enumerator for files using a wildcard path. (The initial T-SQL task just truncates the table for testing.)
You'll need to parameterize the flat file connection so that the source picks up each file in turn.
For PowerShell, it might be something like the script below (no transformation yet).
Get-ChildItem 'C:\TestFolder\*.csv' |
    Import-DbaCsv -SqlInstance 'localhost\DEV' -Database 'Test' -Schema 'dbo' -Table 'Test' -AutoCreateTable -Verbose
If you run this in the ISE, be aware of a bug where the connection might not be released after calling Import-DbaCsv, which will cause it to hang. This is not an issue from the command line as far as I can tell. (If this happens to you, you might have to kill the ISE process - closing it is not enough.)
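To fold the transformation in, a rough sketch (assuming the same dbatools module and table, plus a PostalCode column) could run the REPLACE as a set-based follow-up query once the rows are loaded:
# Load all CSVs, then strip the space in one update
# (table and column names are assumptions)
Get-ChildItem 'C:\TestFolder\*.csv' |
    Import-DbaCsv -SqlInstance 'localhost\DEV' -Database 'Test' -Schema 'dbo' -Table 'Test' -AutoCreateTable
Invoke-DbaQuery -SqlInstance 'localhost\DEV' -Database 'Test' `
    -Query "UPDATE dbo.Test SET PostalCode = REPLACE(PostalCode, ' ', '') WHERE PostalCode LIKE '% %';"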

With SSIS, how do you export SQL results to multiple CSV files?

In my SSIS package, I have an Execute SQL Task that is supposed to return up to one hundred million (100,000,000) rows.
I would like to export these results to multiple CSV files, where each file has a maximum of 500,000 rows. So if the SQL task generates 100,000,000 results, I would like to produce 200 csv files with 500,000 records in each.
What are the best SSIS tasks that can automatically partition the results into many exported CSV files?
I am currently developing a script task but find that it's not very performant. I am a bit new to SSIS so I am not familiar with all the different tasks available, and I'm wondering if maybe there's another one that can do it much more efficiently.
Any recommendations?
Static approach
First add a dataflow task.
In the dataflow task add the following:
A source: an ADO NET Source in this example, containing the query that retrieves the data.
A conditional split: every condition you add will result in a blue output arrow, and you need to connect every arrow to a destination.
An Excel destination or flat file destination, depending on whether you want Excel files or CSV files. For CSV files you'll need to set up a flat file connection.
In the conditional split you can add multiple conditions to split out your data, plus a default output.
Dynamic approach
Use an Execute SQL Task to retrieve the variables that drive the loop (BatchSize, Start, End).
Add a For Loop or Foreach Loop container.
Add a data flow task inside the loop and pass in the parameters from the loop.
(You can pass parameters/expressions into the data flow using its Expressions property.)
Fetch the data with a source in a dataflow task based on the parameters from the for loop.
Write to a destination (Excel/CSV) with a dynamic name based from the parameters of the loop.
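If SSIS proves cumbersome, the chunking can also be sketched in plain PowerShell: stream the rows with a DataReader and cut a new file every 500,000 rows. The connection string, source query, and output path below are assumptions, and the CSV formatting is naive (escape values if they can contain commas or quotes).
# Assumed connection string, source query, and output folder
$connStr   = 'Server=localhost\DEV;Database=Test;Integrated Security=True'
$batchSize = 500000
$conn = New-Object System.Data.SqlClient.SqlConnection $connStr
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = 'SELECT * FROM dbo.BigTable'
$reader = $cmd.ExecuteReader()
$header = (0..($reader.FieldCount - 1) | ForEach-Object { $reader.GetName($_) }) -join ','
$rowCount = 0; $fileIndex = 0; $writer = $null
while ($reader.Read()) {
    if ($rowCount % $batchSize -eq 0) {
        # close the previous chunk and start a new one with its own header row
        if ($writer) { $writer.Close() }
        $fileIndex++
        $writer = New-Object System.IO.StreamWriter "C:\Export\chunk_$fileIndex.csv"
        $writer.WriteLine($header)
    }
    $values = 0..($reader.FieldCount - 1) | ForEach-Object { $reader.GetValue($_) }
    $writer.WriteLine($values -join ',')
    $rowCount++
}
if ($writer) { $writer.Close() }
$reader.Close(); $conn.Close()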

Excel to SQL (SSIS) - Importing more than 1 file, every file has more than 1 sheet and the data from Excel starts from the 3rd row

How would you build this the best way?
I know how to do each one separately, but together I got into a pickle.
Please help me, as I haven't found videos or sites covering this.
Just to clarify -
The tables (in Excel) have the same design (each in a different sheet).
Some Excel files have 4 sheets, some have only 3.
Many thanks,
Eyal
Assuming that all of the Excel files to be imported are located in the same folder, you will first create a For-Each loop in your control flow. Here you will create a user variable that will be assigned the full path and file name of the Excel file being read (you'll need to define the .xls or .xlsx extension in the loop in order to limit it to reading only Excel files). The following link shows how to set up the first part.
How to read data from multiple Excel files with SQL Server Integration Services
Within this loop you will then create another For-Each loop that will loop through all of the worksheets in the current Excel file being read. Apply the following link to perform the task of reading the rows and columns from each worksheet into the database table.
Use SSIS to import all of the worksheets from an Excel file
The outer loop will pick up the Excel file and the inner loop will read each worksheet, regardless of the number. The key is that the format of each worksheet must be the same. Also, using the Excel data flow task, you can define from which line of each worksheet to begin reading. The process will continue until all of the Excel files have been read.
For good tracking and auditing purposes, it is a good idea to include counters in the automated process to track the number of files and worksheets read. I also like to first import all of the records into staging tables, where any issues can be handled and cleaning performed efficiently using SQL before populating the results to the final production tables.
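If scripting outside SSIS is an option (as the PowerShell route in an earlier answer suggests), a rough sketch with the ImportExcel and dbatools modules could do the same two-level loop; the folder, instance, database, and staging table names here are assumptions.
# Outer loop: each workbook; inner loop: each worksheet, however many there are
Get-ChildItem 'C:\ExcelDrop\*.xlsx' | ForEach-Object {
    $file = $_.FullName
    foreach ($sheet in (Get-ExcelSheetInfo -Path $file)) {
        # -StartRow 3 skips the first two rows, since the data starts on row 3
        Import-Excel -Path $file -WorksheetName $sheet.Name -StartRow 3 |
            Write-DbaDbTableData -SqlInstance 'localhost\DEV' -Database 'Test' -Table 'dbo.Staging' -AutoCreateTable
    }
}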
Hope this all helps.

How to get information from Excel files using SSIS

I am new to SSIS and am trying to understand how to do the following:
I have a folder (TestFolder) that has multiple folders within it (SubFolder1, SubFolder2, etc). In each subfolder there are multiple Excel files that have various names but will end in a date (Formatted as YYYYMM). In each Excel workbook there is a tab named: AccessRates and this is the data I want to store in the table in SQL Server.
Okay, so the question: how do I set up my SSIS control flow to handle such a task? I have built a Data Flow Task that handles the data conversion, error handling, and ultimate placement in the server table, but I cannot figure out the control flow. I believe I need a Foreach Loop container, but I can't figure out how to set it up, along with the variables.
Any help or direction would be greatly appreciated!
JP
Solution guidelines
You should follow these steps:
Use a foreach loop and enumerate on files.
Set the top folder and select traverse subfolders.
Set the file mask to something like [the start of all files]*.xlsx
Retrieve the fully qualified file name and map it to a variable.
Inside the foreach, drop a data flow task.
Make an Excel connection to any one of the files.
Go to the properties of the connection (F4).
Set an expression mapping the ConnectionString property to the variable from step 4.
Set Delay Validation to true.
Do your data flow.
This should be it.
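For comparison, the same traversal can be sketched outside SSIS with the ImportExcel and dbatools PowerShell modules; the folder, instance, and table names are assumptions, while the AccessRates tab name comes from the question.
# -Recurse walks the subfolders; only the AccessRates tab is read from each workbook
Get-ChildItem 'C:\TestFolder' -Recurse -Filter '*.xlsx' | ForEach-Object {
    Import-Excel -Path $_.FullName -WorksheetName 'AccessRates' |
        Write-DbaDbTableData -SqlInstance 'localhost\DEV' -Database 'Test' -Table 'dbo.AccessRates' -AutoCreateTable
}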
Step-by-step tutorials
There are many articles that describe the whole process step-by-step, you can refer to them if you need more details:
How to read data from multiple Excel files with SQL Server Integration Services
Loop Through Excel Files in SSIS
Loop through Excel Files and Tables by Using a Foreach Loop Container
Have a look at this link; it shows you how to set up an environment variable to store the sheet name you want to get the data from, and then how to use that to get the data from an Excel source.
Hope this helps!

Dynamic Columns in Flat File Destination

I am working on a generic SSIS package that receives a flat file, adds new columns to it, and generates a new flat file.
The problem I have is that the number of new columns varies based on a stored procedure XML parameter. I tried to use the "Execute Process Task" to call BCP, but the XML parameter is too long for the command line.
I searched the web and found that you cannot dynamically change an SSIS package during runtime, and that I would have to use a script task to generate the output. I started going down that path and found that you still have to let the script component know how many columns it will be receiving, and that is exactly what I do not know at design time.
I found a third party SSIS extension from CozyRoc, but I want to do it without any extensions.
Has anyone done something like this?
Thanks!
If the number of columns is unknown at run time then you will have to do something dynamically, and that means using a script task and/or a script component.
The workflow could be:
Parse the XML to get the number of columns
Save the number of columns in a package variable
Add columns to the flat file based on the variable
This is all possible using script tasks, although if there is no data flow involved, it might be easier to do the whole thing in an external Perl script or C# program and just call that from your package.
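A rough sketch of that external-script route in PowerShell; the spec file path and its XML layout (a Columns root holding Column elements with Name attributes) are entirely assumptions.
# Parse the column spec, then append each new column (empty for now) to every row
[xml]$spec = Get-Content 'C:\Config\NewColumns.xml'           # assumed spec file
$newCols = $spec.Columns.Column | ForEach-Object { $_.Name }  # assumed XML layout

$rows = Import-Csv 'C:\Data\incoming.csv'
foreach ($row in $rows) {
    foreach ($col in $newCols) {
        $row | Add-Member -NotePropertyName $col -NotePropertyValue ''
    }
}
$rows | Export-Csv 'C:\Data\outgoing.csv' -NoTypeInformation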
