ETL Matching Code Page SSIS Data Flow - sql-server

I have found plenty online, but nothing specific to my problem. I have a CSV rendered in code page 65001 (UTF-8). However, in the Advanced section of the Flat File Connection Manager, the column is data type string [DT_STR].
My database table I am loading to can be in any format; I don't care. My question is what is the best way to handle this?
1) Change the Advanced properties of the flat file connection columns?
2) Change the data types of the SQL table to NVARCHAR?
3) Change the OLE DB properties to AlwaysUseDefaultCodePage = TRUE?
4) Use Data Conversion task to convert the column data types?

If your source's code page doesn't change, my suggestion is to use a simple Data Conversion; try to avoid manipulating the source and destination whenever possible. Always go for ETL solutions first.

I usually start by setting up the connection for the flat file, then convert the data with a Data Conversion component (choosing input/output data types based on the flat file and destination data types), and finally set up the connection for the database destination. The data flow ends up looking like: Flat File Source -> Data Conversion -> Database Destination.

Related

Column "" cannot convert between unicode and non-unicode string data types

I am trying to import data from a flat file into an Azure SQL database table, with a Merge to combine it with another source as well. But when I map the fields from the flat file to the Azure SQL database, I keep getting this error:
Column "Location" cannot convert between unicode and non-unicode string data types
Based on some forum suggestions I tried changing the data type of the field to Unicode string [DT_WSTR], and I also tried string [DT_STR].
In the destination Azure SQL database, the column in question is the Location field.
Can anyone please suggest what I am missing here? Any help is greatly appreciated.
Changing the column data types from the component's advanced editor will not solve the problem: if the imported values contain Unicode characters, you cannot convert them to non-Unicode strings, and you will keep receiving the exception shown above. Before suggesting solutions, I highly recommend reading this article to learn more about data type conversion in SSIS:
SSIS Data types: Change from the Advanced Editor vs. Data Conversion Transformations
Getting back to your issue, there are several solutions you could try:
Changing the destination column data type, if possible (a T-SQL sketch follows this list)
Using the Data Conversion transformation component, implement error-handling logic where the values throwing exceptions are redirected to a staging table or manipulated before being re-imported into the destination table. You can refer to the following article: An overview of Error Handling in SSIS packages
From the flat file connection manager, go to the "Advanced" tab and change the column data type to DT_STR.
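For the first option, if you keep the flat file column as Unicode (DT_WSTR) and the destination Location column is currently varchar, a minimal T-SQL sketch (table name hypothetical) would be to widen the destination to match:
ALTER TABLE dbo.MyDestinationTable -- hypothetical table name
    ALTER COLUMN Location nvarchar(100) NULL; -- match the existing column's length and NULLability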

How do I fix the Code Page in SSIS Lookup Transformation to be 65001?

I have a SQL Server 2019 instance with the database and tables ALL set to Latin1_General_100_CI_AS_SC_UTF8; the relevant table has code and desc columns, both varchar.
In the SSIS project, in a single Data Flow component:
I have a UTF-8 CSV file read with a flat file connection; the text column code to match is DT_STR, 65001.
I have a Lookup set to "Full Cache" that loads the Latin1_General_100_CI_AS_SC_UTF8 table, but SSIS thinks the varchar columns are DT_STR, 1252.
Finally, code from the CSV is matched against the lookup and desc is sent to the destination table, which is on the same Latin1_General_100_CI_AS_SC_UTF8 collation. The destination component is set to AlwaysUseDefaultCodePage True and DefaultCodePage 65001.
I now get an error saying the column has more than one code page and cannot run the package.
If not for the mislabeled 1252, this package should run. I believe it's something to do with ExternalMetadataXml, which is read-only and says all my lookup varchar columns are CodePage="1252".
If I manually edit the package .dtsx in Notepad++ and replace all instances of 1252 with 65001, the package runs and seems to do what I expected, as long as I never touch the Lookup component again. That seems like a pretty messed-up solution though; I'm hoping someone has a cleaner way to fix this. Thanks.
With the disclaimer that I'm a "dumb American" who doesn't deal with non-English data but did work with a friend recently on using bulk import with UTF-8 data, here's what I see.
I have a pipe-separated value file that looks like this:
level|name
7|"Ovasino Poste de Santé"
Notepad++ indicates I have saved it as UTF-8.
I created two flat file connection managers in SSIS: Codepage65001STR and Codepage65001WSTR. They both use a Code Page of 65001 (UTF-8).
In the advanced tab for the STR variant, I left the data type as DT_STR.
In the advanced tab for the WSTR variant, I changed the data type to DT_WSTR.
I also created a table and loaded it with the same data:
DROP TABLE IF EXISTS dbo.dba_286478;
CREATE TABLE dbo.dba_286478
(
    level int NOT NULL
,   name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

INSERT INTO dbo.dba_286478
(
    level
,   name
)
VALUES
(
    7 -- level - int
,   'Ovasino Poste de Santé' -- name - varchar(75)
);
I then created a data flow task with a Flat File Source using the different Flat File Connection Managers and added data viewers between them and an empty derived column (so I had an anchor point for the data viewer).
I did the same thing with an OLE DB Source pointing at my table as well as a custom query of
SELECT
    T.level
,   CAST(T.name AS varchar(75)) AS name
FROM
    dbo.dba_286478 AS T;
as well as a version explicitly defining the collation, which makes no difference in SSIS:
, CAST(T.name COLLATE Latin1_General_100_CI_AS_SC_UTF8 AS varchar(75)) AS name
The results all show the same: the final word is an accented Santé. If the UTF-8 handling hadn't happened, it'd show as SantÃ©.
At this point, it doesn't matter whether we use DT_STR or DT_WSTR in our flat file source column definition; the component understands both UTF-8 and UTF-16.
Looking at the properties/metadata of each: the Codepage65001STR source looks as expected, with a code page of 65001 and data type DT_STR. The Codepage65001WSTR source looks good too: Unicode, DT_WSTR.
The OLE components, however, are a different animal. The component returns metadata of DT_WSTR (full Unicode/UTF-16) regardless of whether we do an explicit cast to varchar, optionally specify the collation, or let the natural metadata flow through.
Either way, it doesn't detect the code page/collation stuff and just says Nope, you're Unicode
So, when we get to using a Lookup component with an OLE DB connection manager, we can expect, and receive, the same inability to distinguish between UTF-8 string/varchar and UTF-16 nvarchar.
The error indicates, and it's true, that DT_STR can't match DT_WSTR:
Cannot map the input column, 'name', to the lookup column, 'name', because the data types do not match.
So what do I do?
You must have type alignment for the Lookup component to work, which means the source data needs to be of type DT_WSTR. You can either bring the data in from the flat file as Unicode or leave it as a string with code page 65001. If you go the latter route, you need to make a converted copy of that column (Derived Column or Data Conversion both work) and use that copy in the Lookup component.
If you're pulling text out of the Lookup component, that text is now in your pipeline as Unicode, so you probably want to convert it back to a string type with a code page. Again, Derived Column or Data Conversion will do it.
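As a sketch, using the name column and 75-character width from the sample file above, the two conversions expressed as Derived Column casts would be:
(DT_WSTR,75)name
(DT_STR,75,65001)name
The first turns the code page 65001 string into Unicode ahead of the Lookup; the second brings a looked-up Unicode value back to a 65001 string afterwards.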
SSIS OLE components don't understand UTF8
We saw with the source and lookup component that SSIS is going to treat the UTF-8 strings as UTF-16 but I assumed it would handle storing to the table just fine. Not so much.
My server collation is Latin1_General_100_CI_AI_SC_UTF8 and while I switched accent sensitivities between server and table definition of dbo.dba_286478, it doesn't matter in this case as it's UTF-8 all the way down.
For my Flat File Source, I use the STR-based connection manager, which has the metadata shown above: code page 65001 for data type DT_STR is what we want.
I added an OLE DB Destination and pointed it at my table which again has the "name" column defined as UTF-8
name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
Check this error!
Validation error. Data Flow Task OLE DB Destination [138]: The column "name" cannot be processed because more than one code page (65001 and 1252) are specified for it.
We only have code page 65001 at play in this data flow and yet, something in SSIS space is inferring/defaulting to a 1252 code page during validation.
Making it work
The componentry in a data flow task was built with OLE DB connections in mind. That's why the Lookup component supported only OLE DB connections in 2005, 2008, and maybe 2008R2? A long time ago now, I know, but the Cache Connection Manager (aka anything else) option was added in later iterations because of the need to use something besides OLE DB connection managers, especially given the push then to deprecate the OLE driver.
An ADO.NET connection manager does slightly better than OLE DB in this case, and that's likely what you're going to have to use to work with UTF-8 data in an SSIS package. It implicitly converts to UTF-16 when it presents data to the table, and then SQL Server snaps it back into UTF-8 space (best I can tell).
For reference, bringing UTF-8 data into the pipeline with an ADO.NET Source will still be flagged as DT_WSTR/UTF-16/Unicode.
But you can land DT_STR code page 65001 in an ADO.NET Destination without the code page mismatch error I'm seeing for the OLE DB Destination.
The data from the database is going to appear as DT_WSTR regardless of how you bring it into the pipeline. That means you can define both an OLE DB and an ADO.NET connection manager and use the Lookup component as-is.
Or you can add a precursor data flow step to populate the Cache Connection Manager and only have an ADO.NET connection manager. Were you to go that route, convert the DT_WSTR data to DT_STR with codepage 65001 and store that data into the cache.
DFT - Populate Cache -> DFT - Load data
DFT - Populate Cache
ADO.NET Source -> Data Conversion -> Cache Connection Manager
DFT - Load Data
Flat File Source -> Lookup Component -> ADO.NET Destination
Cross answered from https://dba.stackexchange.com/questions/286478/how-do-i-fix-the-code-page-in-ssis-lookup-transformation-to-be-65001/286520#286520
It sounds like you haven't changed the Code page on your Flat File Connection Manager. Open the connection manager; there is a drop-down menu for Code Page, where you should select 65001 for UTF-8.
You'll then likely need to change your Data Flow task as well: the components upstream of any Derived Column transformations you use to convert code pages will likely still be treating the data as 1252, and you'll get an error, as SSIS doesn't allow implicit conversions.

Set datatypes for SSIS Connection Manager object other than manually / one-by-one?

I have a large number of TSV files that need to be imported periodically (via an SSIS package) into existing MSSQL DB tables. I'm getting many data type issues from the OLE DB Destination tasks, e.g.:
[Flat File Source [2]] Error: Data conversion failed. The data conversion for column "PRC_ID" returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
The type suggestions the connection managers produce for each table's Flat File Source are not accurate enough to prevent errors when running the import package (and the DB types are the correct ones, so I don't want to just make everything, wrongly, a string for the sake of loading the TSVs).
Is there a way to load the type data for the columns from some single file, rather than one by one in the connection manager window for the Flat File Source tasks? Doing it by hand would be hugely inconvenient, as each table may have many fields.
I have the creation statements that were used to create each of the tables the TSVs correspond to; could those be used in any way? Can a Flat File Source inherit data types for its columns from its OLE DB Destination? Any other ways to avoid having to set each type by hand?
There is no difference between changing the column data types in the Flat File Source and keeping all data types as strings and mapping them to the OLE DB Destination's (different) data types: both methods perform an implicit data conversion, because flat files are text files and store all data as text (their columns have no metadata).
Change data types from Advanced Editor vs Data Conversion Transformation
If you are looking to set the data types automatically, I don't think there is a way to do that other than the solution you mentioned in the comments, or creating the packages programmatically (and I don't find it useful to do it that way).
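That said, if the target tables already exist, a query against the standard INFORMATION_SCHEMA views (target table name hypothetical here) can at least produce a cheat sheet of the expected types to work from while filling in the Advanced page of each connection manager:

SELECT
    C.COLUMN_NAME
,   C.DATA_TYPE
,   C.CHARACTER_MAXIMUM_LENGTH
FROM
    INFORMATION_SCHEMA.COLUMNS AS C
WHERE
    C.TABLE_SCHEMA = N'dbo'
    AND C.TABLE_NAME = N'MyTargetTable' -- hypothetical table name
ORDER BY
    C.ORDINAL_POSITION;

It doesn't automate the connection manager setup, but it is faster than reading the CREATE statements column by column.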

Specifying flat file data types vs using data conversion

This may be a stupid question but I must ask since I see it a lot... I have inherited quite a few packages in which developers use the Data Conversion transformation when dumping flat files into their respective SQL Server tables. This is pretty straightforward; however, I always wonder why the developer wouldn't just specify the correct data types within the flat file connection and then do a straight load into the table.
For example:
Typically I will see flat file connections with columns that are DT_STR and then converted into the correct type within the package, i.e. DT_STR of length 50 to DT_I4. However, if the staging table and the flat file are based on the same schema, why wouldn't you just specify the correct types (DT_I4) in the flat file connection? Is there any added benefit (performance, error handling) to using the Data Conversion task that I am not aware of?
This is a good question with no single right answer. Here is the strategy that I use:
If the data source is unreliable
e.g., sometimes int or date values come through as strings, like when you have the literal word 'null' instead of the value being blank. I would let the data source be treated as strings and deal with converting the data downstream.
This could mean just staging the data in a table, using the database to do the conversions, and loading from there. This pattern avoids the source component throwing errors, which are always tricky to troubleshoot. It also avoids having to add error handling to Data Conversion components.
Instead, if the database throws a conversion error, you can easily look at the data in your staging table to examine the problem. Lastly, SQL is much more forgiving with date conversions than SSIS.
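As a sketch of that staging pattern, with hypothetical staging/target table and column names throughout:

-- Everything lands in the staging table as strings;
-- TRY_CONVERT returns NULL instead of throwing when a value won't parse.
INSERT INTO dbo.Target
(
    SomeInt
,   SomeDate
)
SELECT
    TRY_CONVERT(int, S.SomeInt)
,   TRY_CONVERT(date, S.SomeDate)
FROM
    dbo.Staging AS S;

-- The problem rows (e.g. the literal word 'null') are then easy to find:
SELECT *
FROM
    dbo.Staging AS S
WHERE
    TRY_CONVERT(int, S.SomeInt) IS NULL
    AND S.SomeInt IS NOT NULL;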
If the data source is reliable
If the dates and numbers are always dates and numbers, I would define the datatypes in the connection manager. This makes it clear what you are expecting from the file and makes the package easier to maintain with fewer components.
Additionally, if you go to the advanced properties of the flat file source, integers and dates can be set to fast parse, which will speed up read time: https://msdn.microsoft.com/en-us/library/8893ea9d-634c-4309-b52c-6337222dcb39?f=255&MSPPError=-2147217396
When I use data conversion
I rarely use the Data Conversion component, but one case where I find it useful is converting to/from Unicode. This can be necessary when reading from an ADO.NET source, which always treats the input as Unicode, for example.
You could change the output data type in the flat file connection manager on the Advanced page, or right-click the source in the Data Flow and use the Advanced Editor to change the data type before loading.
I think one benefit is that the conversion transformation outputs an extra column, usually named Copy of ..., and in some cases you might use both columns. Also, sometimes when you load data from an Excel source, where everything comes in as Unicode, you need to use Data Conversion to do the transformation.
Also, just FYI, you could use a Derived Column transformation to convert the data type as well; a small example follows.
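For instance (column name and length hypothetical), a Derived Column expression like this converts an Excel-sourced DT_WSTR column to a code page 1252 string:
(DT_STR,50,1252)MyUnicodeColumn
and (DT_WSTR,50)MyColumn goes the other direction.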
UPDATE [needs further confirmation]:
In the flat file source connection manager, the maximum length of the string type is 255, while in the Data Conversion it can be set over 255.

SSIS (ASCII needed): "Code page is 1252 and is required to be 20127"

I have a requirement to export a database to a tab-delimited file in the ASCII format. I am using derived columns to convert any Unicode strings to non-Unicode strings. For example, a former Unicode text stream is now cast like this:
(DT_TEXT,20127)incomingMessage
But SSIS is still looking for ANSI. I am still seeing an error at the Flat File Destination:
The code page on input column <column_name> is 1252 and is required to be 20127.
This happens for any column in the table, not just Unicode ones.
This is what I have been doing to ensure ASCII is used:
In the Flat File Connection Manager, used Code page "20127 (US-ASCII)"
Used a Derived Column to cast data types
In the OLE DB source, set the default code page to 20127
Any thoughts?
How about using the Data Conversion component? Connect your source to the Data Conversion and change the metadata on the fly to suit your needs. You should be able to delete the Derived Column task if you handle the Unicode issues in the Data Conversion instead. Then you can send the records on to the Flat File Destination without issues.
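As a sketch, whichever component you use, the fix is to set every string column headed for the 20127 file to that code page, e.g. with an expression like this (column name and length hypothetical):
(DT_STR,50,20127)incomingColumn
In a Data Conversion component, the equivalent is choosing string [DT_STR] and setting the Code Page to 20127 for each affected column.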
