Treat empty string as NULL in Snowflake

I am handling a migration project where our database is being changed from Oracle 12c to Snowflake.
Currently there are many IICS (Informatica) integrations that load data into Oracle from different source systems, and when an empty string is extracted from a source system it is treated and loaded as NULL in Oracle.
During the testing phase of the database migration, we observed that an empty string is loaded as an empty string in Snowflake, and this is causing a lot of data comparison issues during reconciliation between Oracle and Snowflake, as well as other issues downstream.
Is there a way we can handle this scenario, i.e. force Snowflake or IICS to treat an empty string as NULL? There are integration-level functions to check the length of each field and perform validation, but we are talking about hundreds of such integrations. I am looking for a global setting which can be applied to all the integrations, or a solution with minimal code changes. Any thoughts, suggestions or ideas are greatly appreciated.

I'm not sure exactly how you're loading your data into Snowflake, but I think your best option here is to have your load process convert empty strings to NULL. That's the only option I can really think of where you don't have to specify anything per column.
For example, in COPY INTO statements (and by extension pipes), you can use the EMPTY_FIELD_AS_NULL option for your file formats. See the format options in the Snowflake documentation.
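As a minimal sketch (the stage, file format, and table names are hypothetical), the option can be set on a named file format that the COPY INTO statement then references:

    -- Hypothetical names; adjust the stage, format, and table to your environment.
    CREATE OR REPLACE FILE FORMAT my_csv_format
      TYPE = 'CSV'
      FIELD_DELIMITER = ','
      SKIP_HEADER = 1
      EMPTY_FIELD_AS_NULL = TRUE;  -- unenclosed empty fields load as SQL NULL

    COPY INTO target_table
      FROM @my_stage/extracts/
      FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');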

You have two related options in the COPY INTO command:
NULL_IF = ('', 'null', 'NULL') - the default is ('\\N'). It applies both when loading data into Snowflake and when unloading data from Snowflake (in case you later feed it back to Oracle). What my sample here does is replace any empty string and any 'null' or 'NULL' value (from a file when loading, or from a table when unloading) with an actual SQL NULL.
EMPTY_FIELD_AS_NULL = TRUE - the default is TRUE. When loading data into Snowflake with this option set, a field like ",," (assuming comma as the field separator) is inserted into the loading table as SQL NULL. When unloading, use it together with FIELD_OPTIONALLY_ENCLOSED_BY to distinguish between empty strings and NULLs.
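A hedged sketch of both options used together on a load (the stage and table names are hypothetical):

    COPY INTO target_table
    FROM @my_stage/extracts/
    FILE_FORMAT = (
        TYPE = 'CSV'
        FIELD_DELIMITER = ','
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
        NULL_IF = ('', 'null', 'NULL')  -- empty strings and the literals 'null'/'NULL' become SQL NULL
        EMPTY_FIELD_AS_NULL = TRUE      -- unenclosed empty fields (",,") also become SQL NULL
    );

On an unload, keeping FIELD_OPTIONALLY_ENCLOSED_BY set is what lets a quoted empty string stay distinguishable from an unquoted NULL in the output files.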

Related

Column "" cannot convert between unicode and non-unicode string data types

I am trying to import data from a flat file into an Azure SQL database table, and I have a Merge to combine it with another source too. But when I map the fields from the flat file to the Azure SQL database, I keep getting an error like
Column "Location" cannot convert between unicode and non-unicode string data types
Based on some forum posts I tried changing the data type of the field to Unicode string [DT_WSTR], and I also tried string [DT_STR].
In the destination Azure SQL database, the column in question is the Location field.
Can anyone please suggest what I am missing here? Any help is greatly appreciated.
Changing the column data types from the component's Advanced Editor will not solve the problem. If the imported values contain Unicode characters, you cannot convert them to non-Unicode strings, and you will keep receiving the exception above. Before going through some solutions, I highly recommend reading this article to learn more about data type conversion in SSIS:
SSIS Data types: Change from the Advanced Editor vs. Data Conversion Transformations
Getting back to your issue, there are several solutions you could try:
Changing the destination column data type, if possible (a T-SQL sketch follows after this list).
Using the Data Conversion transformation component, implement error-handling logic where the values throwing exceptions are redirected to a staging table or manipulated before being re-imported into the destination table. You can refer to the following article: An overview of Error Handling in SSIS packages.
From the flat file connection manager, go to the Advanced tab and change the column data type to DT_STR.
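For the first option, a hedged T-SQL sketch of widening the destination column to a Unicode type (the table name and length are assumptions; Location is the column from the question):

    ALTER TABLE dbo.DestinationTable           -- hypothetical table name
    ALTER COLUMN Location NVARCHAR(255) NULL;  -- NVARCHAR can hold the Unicode values coming from the flat file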

Make Kingswaysoft truncate input data that is too long

I have an SSIS project that I'm using to automate pulling CRM data into a SQL Server Database using Kingswaysoft. These SSIS packages are autogenerated, so my solution to this issue needs to be compatible with that.
The description field on Contact in CRM is an nvarchar(2000), but this CRM org still has old data, and some of those old contact records have a description longer than 2000 characters. When I try to pull those using Kingsway, I get this error:
Error: 0xC002F304 at Stage Data for contact, Export contact Data [2]: An error occurred with the following error message: "The input value for 'description' field (or one of its related fields) does not fit into the output buffer, please consider increasing the output column's Length property or changing its data type to one that can accommodate more data such as ntext (DT_NTEXT). This change can be done using the component's Advanced Editor window.".
This makes sense, since I'm pulling a column longer than specified in the metadata, but the problem is that I want to ignore this error, truncate the column, and continue the data load. Obviously I could set the column to DT_NTEXT and not worry about it, but since these packages are autogenerated I have no way of knowing beforehand which columns have old data and which don't, so I won't know which should be DT_NTEXT.
So is there a way to make Kingswaysoft truncate input data which is longer than what's specified in the metadata?
Thank you for choosing KingswaySoft as your integration solution. For this situation, unfortunately there is no way to make that work without making those changes in the component’s Advanced Editor.
If the source component simply ignores the error and truncates the value, you will lose some of your data and thus affect data integrity during the integration. Therefore, you may need to change the data type to DT_NTEXT or increase the length of this field in order to handle this situation properly. Alternatively, you can try to change the field length on your CRM side so that the SSIS package can be generated correctly.

ETL Matching Code Page SSIS Data Flow

I have found plenty online, but nothing specific to my problem. I have a CSV rendered in code page 65001 (UTF-8). However, in the Advanced section of the Flat File Connection Manager, the column's data type is string [DT_STR].
My database table I am loading to can be in any format; I don't care. My question is what is the best way to handle this?
1) Change the Advanced properties of the flat file connection columns?
2) Change the data types of the SQL table to NVARCHAR?
3) Change the OLE DB properties to AlwaysUseDefaultCodePage = TRUE?
4) Use Data Conversion task to convert the column data types?
If your source's code page doesn't change, my suggestion is to use a simple Data Conversion; try to avoid manipulating the source and destination whenever possible. Always go for ETL solutions first.
I usually start by setting up the connection string for the flat file, then convert the data using a Data Conversion component (input/output data types) based on the flat file data types and the destination data types, and finally set up the connection string for the DB destination.

Specifying flat file data types vs using data conversion

This may be a stupid question but I must ask since I see it a lot... I have inherited quite a few packages in which developers use the Data Conversion transformation shape when dumping flat files into their respective SQL Server tables. This is pretty straightforward, but I always wonder: why wouldn't the developer just specify the correct data types within the flat file connection and then do a straight load into the table?
For example:
Typically I will see flat file connections with columns that are DT_STR and then converted into the correct type within the package, e.g. DT_STR of length 50 to DT_I4. However, if the staging table and the flat file are based on the same schema, why wouldn't you just specify the correct types (DT_I4) in the flat file connection? Is there any added benefit (performance, error handling) to using the Data Conversion task that I am not aware of?
This is a good question with no single right answer. Here is the strategy that I use:
If the data source is unreliable
i.e. sometimes int or date values are strings, such as when you have the literal word 'null' instead of the value being blank. In that case I let the data source be treated as strings and deal with converting the data downstream.
This could mean just staging the data in a table and using the database to do the conversions before loading onward. This pattern avoids the source component throwing errors, which are always tricky to troubleshoot, and it avoids having to add error handling to Data Conversion components.
Instead, if the database throws a conversion error, you can easily look at the data in your staging table to examine the problem. Lastly, SQL is much more forgiving with date conversions than SSIS.
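A hedged sketch of that staging pattern in T-SQL (the staging and target tables are made up; TRY_CONVERT needs SQL Server 2012 or later). Values that will not convert simply become NULL and stay inspectable in the staging table instead of failing the source component:

    -- The flat file lands in Orders_Staging with every column typed as a string.
    INSERT INTO dbo.Orders (OrderId, OrderDate, Amount)
    SELECT
        TRY_CONVERT(int,  s.OrderId),
        TRY_CONVERT(date, NULLIF(s.OrderDate, 'null')),  -- the literal word 'null' becomes a real NULL
        TRY_CONVERT(decimal(18, 2), s.Amount)
    FROM dbo.Orders_Staging AS s;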
If the data source is reliable
If the dates and numbers are always dates and numbers, I would define the datatypes in the connection manager. This makes it clear what you are expecting from the file and makes the package easier to maintain with fewer components.
Additionally, if you go to the advanced properties of the flat file source, integers and dates can be set to fast parse, which will speed up the read time: https://msdn.microsoft.com/en-us/library/8893ea9d-634c-4309-b52c-6337222dcb39?f=255&MSPPError=-2147217396
When I use data conversion
I rarely use the Data Conversion component, but one case where I find it useful is converting to/from Unicode. This can be necessary when reading from an ADO.NET source, which always treats the input as Unicode, for example.
You can change the output data type in the flat file connection manager on the Advanced page, or right-click the source in the data flow and open the Advanced Editor to change the data type before loading.
I think one benefit is that the Data Conversion transformation lets you output an extra column (usually named "Copy of ..."), and in some cases you might use both columns. Also, when you load data from an Excel source everything comes in as Unicode, so you need a Data Conversion to do the data transformation, etc.
Also, just FYI, you can use a Derived Column transformation to convert the data type as well.
UPDATE (needs to be further confirmed):
In the flat file source connection manager, the maximum length of the string type is 255, while in a Data Conversion it can be set above 255.

ColdFusion 8 + MSSQL 2005 and CLOB datatype on resultset

The environment I am working with is CF8 and SQL Server 2005, and the CLOB datatype is disabled in the CF Administrator. My concern is: will there be a performance ramification from enabling the CLOB datatype in the CF Administrator?
The reason I want/need to enable it is that SQL is building the AJAX XML response. When the response is large, the result is either truncated or returned as multiple rows (depending on how the SQL developer created the stored proc). Enabling CLOB allows the entire result to be returned. The other option I have is to have SQL always return the XML result in multiple rows and have CF join the strings for each result row.
Anyone with some experience with this idea or have any thoughts?
Thanks!
I really think that returning CLOB data is likely to be less expensive than concatenating multiple rows of data into an XML string and then parsing it (ick!). What you are trying to do is what CLOB is designed for, and JDBC handles it pretty well. The performance hit is probably negligible. After all, you have to return the same amount of character data either way, whether in multiple rows or a single field, and having to "break it up" on the SQL side and then "reassemble" it on the CF side seems like reinventing the wheel.
I would add that questions like this sometimes mystify me. A modest amount of testing would seem to be able to answer this question to your own satisfaction - no?
I would just have the stored proc return the data set, or multiple data sets, and build the XML the way you need it via CF.
I've never needed to use CLOB. I almost always stick to the varchar datatype, and it seems to do the job just fine.
There is also the option of having the stored proc trigger MSSQL to generate an actual XML file (not just a string) and simply return the file name. Then you can use CFFILE action="read" to grab the XML string and parse it accordingly. This assumes your web server and DB have a common file storage area.
