What is the most efficient way to remove a space from a string when importing a csv file to SQL Server using SSIS? - sql-server

I will be importing records from thousands of CSV files using SSIS. These CSV files will contain a Postal Code column, which has the format A5A 5A5, where "A" is any letter and "5" is any number from 0 to 9.
There is a space between the "A5A" and "5A5" that I want to remove, so that all Postal Codes appear as "A5A5A5".
I am reviewing the documentation and see several options, and I'm trying to narrow down the best one, i.e. the one that requires the least number of steps. So far I am looking at the Derived Column transformation, but that would involve adding another column to my SQL Table.
Is there a way I can trim the space without having to add an extra column?

As #Larnu answers via comments, a Derived Column is likely the most appropriate component to use here.
The expression you're looking for is a REPLACE. Syntax ought to be
REPLACE([PostalCode], " ", "")
You have 10 columns from your CSV. The Derived Column can either replace and existing or add a new column to row buffer. I would advocate adding a new column. PostalCodeStripped or something like that. At some point, something weird is going to happen with the data and you'll get an A5A 5A5 that didn't get the space stripped. Having both the original and the parsed value available in debugging can help sort out problems (Oh, this has a non-breaking space or a tab instead of a space, or in addition to)
But, just because a column is in the buffer does not mean you need to create a column for that in the destination table. Just unmap the PostalCode from the row buffer and map PostalCodeStripped to the PostalCode column in the database. You'll see what I'm talking about in the destination component. By default, they'll map based on name matching but you're welcome to wire them up however you see fit.

ETL is an alternate option. Bulk load the data into a staging table. Then do a simple select into the destination to do the transformation. I might be tempted to not use SSIS. BCP or Import-DbaCsv (DBATools powershell module) would both be a quick alternates. If you know PowerShell and want to process the files in a pipe, you can pipe the files into Import-DbaCsv. The PowerShell script can also execute Invoke-DbaQuery to run update or insert queries to do the transformation.
SSIS can also just do the bulk load and then run the T-SQL to do the transformations. I don't like the overhead of maintaining and upgrading SSIS packages. I'd take T-SQL jobs over SSIS jobs any day. (We have about 1/2 year for a FTE to upgrade our SSIS packages to SQL 2019. The T-SQL jobs just keep working when moved to a new version.)
Or go the ETL route and do the transformation in the SSIS data flow. A Derived Column transformation between a flat file source and a OLE DB destination should do the trick.
To handle multiple files, you can use the Foreach Loop Container. There's an enumerator for files using a wildcard path. (The initial T-SQL task just truncates the table for testing.)
You'll need to parameterize the thing to get the file source to be each file.
For PowerShell it might be something like (no transformation yet) the script below.
Get-ChildItem 'C:\TestFolder\*.csv' |
import-dbacsv -SqlInstance 'localhost\DEV' -Database 'Test' -Schema 'dbo' -Table 'Test' -AutoCreateTable -verbose
If you run this in the ISE, be aware of a bug where the connection might not be released after calling import-dbacsv that will cause it to hang. This is not an issue in the command line from what I can tell. (If this happens to you, you might have to kill the ISE process - closing it is not enough.)

Related

Does adding simple script tasks to SSIS packages drastically reduce performance?

I am creating an SSIS package to import CSV file data into a SQL Server table.
Some of the rows in the CSV files will have missing values.
For example, if a row has the format: value1,value2,value3 and value2 is missing,
then it will render as: value1,,value3 in the csv file.
When the above happens (value2 is missing) in my SSIS package, I want NULL to go into the receiving SQL Server column that would hold value2.
I understand that I can add a "Script" task to my SSIS package to apply this rule. However, I'm concerned that this will drastically reduce the performance of my SSIS package. I'm not an expert on the inner workings of SSIS/SQL Server, but I'm concerned that this script will cause my script to lose "BULK INSERT" capabilities (and other efficiencies) since the script will have to inspect every row and apply the changes as needed.
Can anyone confirm if adding such a script will cause major performance impacts? Or does the SSIS/SQL-Server engine run the script on every row and then bulk-insert? Is there another way I can apply this rule without taking a performance hit?
Firstly, you can use script task when required. Script task will be executed only once for each execution of the whole package not for every row. For every row there is another component called script component. When the other regular SSIS tasks are not enough to achieve what you want you can surely use script component. I don't believe it is a performance killer unless you implement it badly.
Secondly, this particular requirement you can simply use Flat File Source task to import your csv file. It will put the value NULL when there is no value. I'm considering this is a valid csv value and each row has correct number of comma for every fields (total field - 1 actually) even if value is empty or null for some fields.

SSIS redirect empty rows as flat file source read errors

I'm struggling to find a built-in way to redirect empty rows as flat file source read errors in SSIS (without resorting to a custom script task).
as an example, you could have a source file with an empty row in the middle of it:
DATE,CURRENCY_NAME
2017-13-04,"US Dollar"
2017-11-04,"Pound Sterling"
2017-11-04,"Aus Dollar"
and your column types defined as:
DATE: database time [DT_DBTIME]
CURRENCY_NAME: string [DT_STR]
with all that, package still runs and takes the empty row all the way to destination where it, naturally fails. I was to be able to catch it early and identify as a source read failure. Is it possible w/o a script task? A simple derived column perhaps but I would prefer if this could be configured at the Connection Manager / Flat File Source level.
The only way to not rely on a script task is to define your source flat file with only one varchar(max) column, chose a delimiter that is never used within and write all the content into a SQL Server staging table. You can then clean those empty lines and parse the rest to a relational output using SQL.
This approach is not very clean and a takes lot more effort than using a script task to dump empty lines or ones not matching a pattern. It isn't that hard to create a transformation with the script component
This being said, my advise is to document a clear interface description and distribute it to all clients using your interface. Handle all files that throw an error while reading the flat file and send a mail with the file to the responsible client with information that it doesn't follow the interface rules and needs to be fixed.
Just imagine the flat file is manually generated, even worse using something like excel, you will struggle with wrong file encoding, missing columns, non ascii characters, wrong date format etc.
You will be working on handling all exceptions caused by quality issues.
Just add a Conditional Split component, and use the following expression to split rows
[DATE] == ""
And connect the default output connector to the destination
References
Conditional Split Transformation

Special character issue in slowly changing dimension?

I am using Slowly Changing Dimension task from SSIS 2008 for delta load. Flat file is the input to slowly changing dimension task. I have observed that '--' character from file is converted into ' â€' after delta load.
Input is the flat file and destination is the database table. Flat file contains few strings having '--' character but somehow after inserting this data to table this character is getting converted to 'â€'.
What can be the issue?
Kindly help me to resolve this issue.
Regards,
Sameer K.
In essence you need to scrub these characters from the data. This can be done in several places, but it's a well accepted design pattern to populate from the source file to a staging table where you can scrub the offending characters before bringing it into your slowly changing dimension. It's also possible to scrub the file prior to import, but it's typically easier to work with the data once it's in a database rather than in a flat file. You could also include a derived column task within SSIS to extract these characters one in the SSIS Pipeline, but you would need to manage this column by column which can become difficult to maintain.

Export large amounts of binary data from one SQL database and import it into another database of the same schema

I have one database with an image table that contains just over 37,000 records. Each record contains an image in the form of binary data. I need to get all of those 37,000 records into another database containing the same table and schema that has about 12,500 records. I need to insert these images into the database with an IF NOT EXISTS approach to make sure that there are no duplicates when I am done.
I tried exporting the data into excel and format it into a script. (I have doe this before with other tables.) The thing is, excel does not support binary data.
I also tried the "generate scripts" wizard in SSMS which did not work because the .sql file was well over 18GB and my PC could not handle it.
Is there some other SQL tool to be able to do this? I have Googled for hours but to no avail. Thanks for your help!
I have used SQL Workbench/J for this.
You can either use WbExport and WbImport through text files (the binary data will be written as separate files and the text file contains the filename).
Or you can use WbCopy to copy the data directly without intermediate files.
To achieve your "if not exists" approache you could use the update/insert mode, although that would change existing row.
I don't think there is a "insert only if it does not exist mode", but you should be able to achieve this by defining a unique index and ignore errors (although that wouldn't be really fast, but should be OK for that small number of rows).
If the "exists" check is more complicated, you could copy the data into a staging table in the target database, and then use SQL to merge that into the real table.
Why don't you try the 'Export data' feature? This should work.
Right click on the source database, select 'Tasks' and then 'Export data'. Then follow the instructions. You can also save the settings and execute the task on a regular basis.
Also, the bcp.exe utility could work to read data from one database and insert into another.
However, I would recommend using the first method.
Update: In order to avoid duplicates you have to be able to compare images. Unfortunately, you cannot compare images directly. But you could cast them to varbinary(max) for comparison.
So here's my advice:
1. Copy the table to the new database under the name tmp_images
2. use the merge command to insert new images only.
INSERT INTO DB1.dbo.table_name
SELECT * FROM DB2.dbo.table_name
WHERE column_name NOT IN
(
SELECT column_name FROM DB1.dbo.table_name
)

Export tables from SQL Server to be imported to Oracle 10g

I'm trying to export some tables from SQL Server 2005 and then create those tables and populate them in Oracle.
I have about 10 tables, varying from 4 columns up to 25. I'm not using any constraints/keys so this should be reasonably straight forward.
Firstly I generated scripts to get the table structure, then modified them to conform to Oracle syntax standards (ie changed the nvarchar to varchar2)
Next I exported the data using SQL Servers export wizard which created a csv flat file. However my main issue is that I can't find a way to force SQL Server to double quote column names. One of my columns contains commas, so unless I can find a method for SQL server to quote column names then I will have trouble when it comes to importing this.
Also, am I going the difficult route, or is there an easier way to do this?
Thanks
EDIT: By quoting I'm refering to quoting the column values in the csv. For example I have a column which contains addresses like
101 High Street, Sometown, Some
county, PO5TC053
Without changing it to the following, it would cause issues when loading the CSV
"101 High Street, Sometown, Some
county, PO5TC053"
After looking at some options with SQLDeveloper, or to manually try to export/import, I found a utility on SQL Server management studio that gets the desired results, and is easy to use, do the following
Goto the source schema on SQL Server
Right click > Export data
Select source as current schema
Select destination as "Oracle OLE provider"
Select properties, then add the service name into the first box, then username and password, be sure to click "remember password"
Enter query to get desired results to be migrated
Enter table name, then click the "Edit" button
Alter mappings, change nvarchars to varchar2, and INTEGER to NUMBER
Run
Repeat process for remaining tables, save as jobs if you need to do this again in the future
Use the SQLDeveloper migration tools
I think quoting column names in oracle is something you should not use. It causes all sort of problems.
As Robert has said, I'd strongly advise agains quoting column names. The result is that you'd have to quote them not only when importing the data, but also whenever you want to reference that column in a SQL statement - and yes, that probably means in your program code as well. Building SQL statements becomes a total hassle!
From what you're writing, I'm not sure if you are referring to the column names or the data in these columns. (Can SQLServer really have a comma in the column name? I'd be really surprised if there was a good reason for that!) Quoting the column content should be done for any string-like columns (although I found that other characters usually work better as the need to "escape" quotes becomes another issue). If you're exporting in CSV that should be an option .. but then I'm not familiar with the export wizard.
Another idea for moving the data (depending on the scale of your project) would be to use an ETL/EAI tool. I've been playing around a bit with the Pentaho suite and their Kettle component. It offered a good range of options to move data from one place to another. It may be a bit oversized for a simple transfer, but if it's a big "migration" with the corresponding volume, it may be a good option.

Resources