NULL Value Handling for CSV Files Via External Tables in Snowflake

I am trying to get the NULL_IF parameter of a file format working when applied to an external table.
I have a source CSV file containing NULL values in some columns. NULLs in the source file appear as "\N" (all non-numeric values in the file are quoted). Here is an example from the raw CSV, showing the header and a row where the ModifiedOn value is NULL in the source system:
"AirportId" , "IATACode" , "CreatedOn" , "ModifiedOn"
1 , "ACU" , "2015-08-25 16:58:45" , "\N"
I have a file format defined including the parameter NULL_IF = "\\N"
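For reference, a hedged sketch of what such a file format might look like; apart from NULL_IF, the other options are assumptions based on the sample rows above:
CREATE OR REPLACE FILE FORMAT CSV_1
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  COMPRESSION = GZIP
  NULL_IF = ('\\N');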
The following select statement successfully interprets the correct rows as holding NULL values.
SELECT $8
FROM @MyS3Bucket
(
file_format => 'CSV_1',
pattern => '.*MyFileType.*.csv.gz'
)
However if I use the same file format with an external table like this:
CREATE OR REPLACE EXTERNAL TABLE MyTable
(
MyColumn varchar as (value:c8::varchar)
)
WITH LOCATION = @MyS3Bucket
FILE_FORMAT = (FORMAT_NAME = 'CSV_1')
PATTERN = '.*MyFileType_.*.csv.gz';
Each row holds \N as a value rather than NULL.
I assume this is caused by external tables providing a single variant output that can then be further split rather than directly presenting individual columns in the csv file.
One solution is to code the NULL handling into the external table definition like this:
CREATE OR REPLACE EXTERNAL TABLE MyTable
(
MyColumn varchar as (NULLIF(value:c8::varchar, '\\N'))
)
WITH LOCATION = @MyS3Bucket
FILE_FORMAT = (FORMAT_NAME = 'CSV_1')
PATTERN = '.*MyFileType_.*.csv.gz';
However, this leaves me at risk of having to rewrite a lot of external table code if the file format changes, whereas the file format could/should centralise that NULL definition. It would also mean the NULL conversion has to be handled column by column rather than file by file, increasing code complexity.
Is there a way that I can have the NULL values appear through an external table without handling them explicitly through column definitions?
Ideally this would be applied through a file format object but changes to the format of the raw file are not impossible.

I am able to reproduce the issue, and it seems like a bug. If you have access to Snowflake support, it would be better to submit a support case regarding this issue, so you can easily follow the process.

Related

How to load Parquet/AVRO into multiple columns in Snowflake with schema auto detection?

When trying to load a Parquet/AVRO file into a Snowflake table I get the error:
PARQUET file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.
But I don't want to load these files into a new one-column table; I need the COPY command to match the columns of the existing table.
What can I do to get schema auto detection?
Good news: that error message is outdated, as Snowflake now supports schema detection and COPY INTO multiple columns.
To reproduce the error:
create or replace table hits3 (
WatchID BIGINT,
JavaEnable SMALLINT,
Title TEXT
);
copy into hits3
from @temp.public.my_ext_stage/files/
file_format = (type = parquet);
-- PARQUET file format can produce one and only one column of type variant or object or array.
-- Use CSV file format if you want to load more than one column.
To fix the error and have Snowflake match the columns from the table and the Parquet/AVRO files, just add the option MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE (or MATCH_BY_COLUMN_NAME=CASE_SENSITIVE):
copy into hits3
from @temp.public.my_ext_stage/files/
file_format = (type = parquet)
match_by_column_name = case_insensitive;
Docs:
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
https://docs.snowflake.com/en/user-guide/data-load-overview.html?#detection-of-column-definitions-in-staged-semi-structured-data-files
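Separately, not from the original answer: if you also need to create the target table itself from the Parquet files, a hedged sketch using Snowflake's schema detection could look like the following; the file format name my_parquet_format and the table name hits3_auto are assumptions.
create or replace file format my_parquet_format type = parquet;
create or replace table hits3_auto using template (
    select array_agg(object_construct(*))
    from table(
        infer_schema(
            location => '@temp.public.my_ext_stage/files/',
            file_format => 'my_parquet_format'
        )
    )
);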

How to pass optional column in TABLE VALUE TYPE in SQL from ADF

I have the following table value type in SQL which is used in Azure Data Factory to import data from a flat file in a bulk copy activity via a stored procedure. File 1 has all three columns in it, so this works fine. File 2 only has Column1 and Column2, but NOT Column3. I figured since the column was defined as NULL it would be OK, but ADF complains that it's attempting to pass in 2 columns when the table type expects 3. Is there a way to reuse this type for both files and make Column3 optional?
CREATE TYPE [dbo].[TestType] AS TABLE(
Column1 varchar(50) NULL,
Column2 varchar(50) NULL,
Column3 varchar(50) NULL
)
Operation on target LandSource failed:
ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A
database operation failed with the following error: 'Trying to pass a
table-valued parameter with 2 column(s) where the corresponding
user-defined table type requires 3 column(s)
It would be nice if the copy activity behavior was consistent regardless of whether a stored procedure with a table type is used or the activity's native bulk insert. When not using the table type and using the default bulk insert, missing columns in the source file end up being NULL in the target table without error (assuming the column is NULLABLE).
It will cause a mapping error in ADF.
In the Copy activity, every column needs to be mapped.
If the source file only has two columns, it will cause a mapping error.
So I suggest you create two different Copy activities and a two-column table type, as sketched below.
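A minimal sketch of that second, two-column table type, assuming a hypothetical name TestTypeTwoCol:
CREATE TYPE [dbo].[TestTypeTwoCol] AS TABLE(
Column1 varchar(50) NULL,
Column2 varchar(50) NULL
)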
You can pass an optional column; I've made a successful test, but the steps are a bit complex. In my case, File 1 has all three columns, File 2 only has Column1 and Column2, but NOT Column3. The pipeline uses a Get Metadata activity, a Set Variable activity, a ForEach activity and an If Condition activity.
Please follow my steps:
You need to define a variable FileName for the ForEach to use.
In the Get Metadata1 activity, I specified the file path.
In the ForEach1 activity, use @activity('Get Metadata1').output.childItems to iterate over the file list. It needs to be Sequential.
Inside the ForEach1 activity, use Set Variable1 to set the FileName variable.
In Get Metadata2, use @item().name to specify the file.
In Get Metadata2, use Column count to get the column count from the file.
In If Condition1, use @greater(activity('Get Metadata2').output.columnCount, 2) to determine whether the file has more than two columns.
In the True activity, use the variable FileName to specify the file.
In the False activity, use Additional columns to add a column.
When I run debug, the result shows both files being processed successfully.
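For completeness, a hedged sketch of the kind of stored-procedure sink both branches could call, reusing the three-column dbo.TestType from the question; the procedure and target table names are assumptions:
CREATE PROCEDURE dbo.usp_LoadTestType
    @rows dbo.TestType READONLY
AS
BEGIN
    -- Column3 arrives as NULL (or whatever the Additional columns step supplies)
    -- when the source file only has two columns.
    INSERT INTO dbo.TargetTable (Column1, Column2, Column3)
    SELECT Column1, Column2, Column3
    FROM @rows;
END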

Insert fixed values using BCP

I'm trying to import a TXT file using bcp.
My TXT file is like this:
abc|cba
xyz|zyx
My Table is like this:
Field_1 -> Identity field
Field_2 -> Varchar(3)
Field_3 -> Varchar(3)
Field_4 -> Varchar(1), in this case I must set it to the default value 'P'
Field_5 -> Varchar(1), in this case I must set it to the default value 'C'
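In T-SQL terms, a hedged sketch of that table; the table name is an assumption and the fixed values are modeled as DEFAULT constraints:
CREATE TABLE dbo.MyTable
(
Field_1 int IDENTITY(1,1) NOT NULL,
Field_2 varchar(3) NULL,
Field_3 varchar(3) NULL,
Field_4 varchar(1) DEFAULT 'P',
Field_5 varchar(1) DEFAULT 'C'
)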
My table with values must be:
1,abc,cba,P,C
2,xyz,zyx,P,C
Note -> My TXT file is huge (around 200 GB); I can't import it into another table and then pass all the values to this table (just saying).
Version -> SQL Server 2014 (SP2)
You cannot generate data via BCP; you must depend on SQL Server to do that, as Jeroen commented. To add to his comment, the identity value is not a default: you should continue to use the identity property of the column.
For both (identity and default), you must use the -f option of BCP. This is the option to include a format file, which directs the bcp utility to see and handle the data as stated in the format file.
Using a format file, you can specify which columns in the file are mapped to which columns in the destination table. To exclude a column, just set its destination value to "0".
The format files and the bcp utility are much larger topics in and of themselves, but to answer your question: yes, it is possible, and using a format file with modified destination values (set to "0") is the way to do it.
Doing this, you can process the data once. Using PowerShell to append data is possible, but unnecessary and less efficient. To do this in one action with bcp, you need to use a format file, for example as sketched below.
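For illustration, a hedged sketch of what that could look like here, assuming SQL Server 2014 (format file version 12.0), a pipe-delimited file with Windows line endings, and hypothetical names MyDb.dbo.MyTable and skip_cols.fmt. The file's two fields are mapped to server columns 2 and 3 (Field_2, Field_3; the host length of 3 matches the sample data), while the identity column and the two columns with defaults are simply not referenced, so SQL Server should fill them from the identity property and the DEFAULT constraints (assuming those are defined on the table and -k is not used):
12.0
2
1   SQLCHAR   0   3   "|"      2   Field_2   ""
2   SQLCHAR   0   3   "\r\n"   3   Field_3   ""
bcp MyDb.dbo.MyTable in file.txt -f skip_cols.fmt -T -S MyServer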

Staged Internal file csv.gz giving error that file does not match size of corresponding table?

I am trying to copy a csv.gz file into a table I created so I can start analyzing location data for a map. I was running into an error saying there are too many characters and that I should add an ON_ERROR option. However, I am not sure if that will help load the data; can you take a look?
Data source: https://data.world/cityofchicago/array-of-things-locations
SELECT * FROM @staged/array-of-things-locations-1.csv.gz
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location_2 variant, location variant);
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz;
CREATE OR REPLACE FILE FORMAT t_csv
TYPE = "CSV"
COMPRESSION = "GZIP"
FILE_EXTENSION = 'csv.gz';
CREATE OR REPLACE STAGE staged
FILE_FORMAT = 't_csv';
COPY INTO ARRAYLOC FROM @~/staged file_format = (format_name = 't_csv');
Error message:
Number of columns in file (8) does not match that of the corresponding table (9), use file format option error_on_column_count_mismatch=false to ignore this error File '@~/staged/array-of-things-locations-1.csv.gz', line 2, character 1 Row 1 starts at line 1, column "ARRAYLOC"["LOCATION_2":8] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
Solved:
The real issue was that I needed to clean the data I was staging more thoroughly. This was my error. This is what I ended up changing: the column types, switching the quoting in the file from " to ', and separating one column because of a comma in the middle of the data.
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude float, longitude varchar, location varchar);
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz;
CREATE or Replace FILE FORMAT r_csv
TYPE = "CSV"
COMPRESSION = "GZIP"
FILE_EXTENSION= 'csv.gz'
SKIP_HEADER = 1
ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE
EMPTY_FIELD_AS_NULL = TRUE;
create or replace stage staged
file_format='r_csv';
copy into ARRAYLOC from @~/staged
file_format = (format_name = 'r_csv');
SELECT * FROM ARRAYLOC LIMIT 10;
Your error doesn't say that you have too many characters but that your file has 8 columns and your table has 9 columns, so it doesn't know how to align the columns from the file to the columns in the table.
You can list out the columns specifically using a subquery in your COPY INTO statement.
Notes:
Columns from the file are referenced by position, so $1 is the first column in the file, $2 is the second, and so on.
You can put the columns from the file in any order that you need to match your table.
You'll need to find the column that doesn't have data coming in from the file and either fill it with null or some default value. In my example, I assume it is the last column and in it I will put the current timestamp.
It helps to list out the columns of the table after the table name, but this is not required.
Example:
COPY INTO ARRAYLOC (COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8,COLUMN9)
FROM (
SELECT $1
,$2
,$3
,$4
,$5
,$6
,$7
,$8
,CURRENT_TIMESTAMP()
FROM @staged/array-of-things-locations-1.csv.gz
);
I would advise against changing the ERROR_ON_COLUMN_COUNT_MISMATCH parameter, since doing so could result in data ending up in the wrong column of the table. I would also advise against changing the ON_ERROR parameter, as I believe it is best to be alerted of such errors rather than suppressing them.
Yes, setting that option should help. From the documentation:
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE | FALSE
Use: Data loading only
Definition: Boolean that specifies whether to generate a parsing error
if the number of delimited columns (i.e. fields) in an input file does
not match the number of columns in the corresponding table.
If set to FALSE, an error is not generated and the load continues. If
the file is successfully loaded:
If the input file contains records with more fields than columns in
the table, the matching fields are loaded in order of occurrence in
the file and the remaining fields are not loaded.
If the input file contains records with fewer fields than columns in
the table, the non-matching columns in the table are loaded with NULL
values.
This option assumes all the records within the input file are the same
length (i.e. a file containing records of varying length return an
error regardless of the value specified for this parameter).
So assuming you are okay with getting NULL values for the missing column in your input data, you can use ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE to load the file successfully.
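For example, a hedged sketch of setting that option inline on the COPY statement rather than in the named file format (the other format options shown are assumptions mirroring the question's formats):
copy into ARRAYLOC
from @staged/array-of-things-locations-1.csv.gz
file_format = (type = csv
               compression = gzip
               skip_header = 1
               error_on_column_count_mismatch = false);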
When you view that table directly on data.world, it shows columns named both location and location_2 with identical data. It looks like that display is erroneous, because when you download the CSV, it has only a single location column.
I suspect if you change your CREATE OR REPLACE statement with the following statement that omits the creation of location_2, you'll get to where you want to go:
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location variant);

Workaround to BULK INSERT NULL values [SQL Server]

I have not used SQL Server much (I usually use PostgreSQL) and I find it hard to believe / accept that one simply cannot insert NULL values from a text file using BULK INSERT if the file has a value that indicates null or missing data (NULL, NA, na, null, -, ., etc.).
I know BULK INSERT can keep NULLs if the field is empty (link; this is not a nice solution for my case because I have > 50 files, all of them relatively big at > 25 GB, so I do not want to edit them). But I cannot find a way to tell SQL Server / BULK INSERT that a certain value should be interpreted as NULL.
This is, I would say, pretty standard when importing data from text files in most tools (e.g. COPY table_name FROM 'file_path' WITH (DELIMITER '\t', NULL 'NULL') in PostgreSQL, or readr::read_delim(file = "file", delim = "\t", na = "NULL") in R with the readr package, just to name a couple of examples).
Even more annoying is the fact that the file I want to import was actually exported from SQL Server. It seems that by default, instead of leaving NULL as empty fields in the text files, it writes the value NULL (which makes the file bigger, but anyway). So it seems very odd that the "import" feature (BULK INSERT or the bcp utility) of one tool (SQL Server) cannot properly import the files exported by default by the very same tool.
I've been googling around (link1, link2, link3, link4) and cannot find a workaround for this (other than editing my files to replace NULL with empty fields, or importing everything as varchar and later working in the database to change types and so on). So I would really appreciate any ideas.
For the sake of a reproducible example, here is a sample table where I want to import this sample data stored in a text file:
Sample table:
CREATE TABLE test
(
[item] [varchar](255) NULL,
[price] [int] NULL
)
Sample data stored in file.txt:
item1, 34
item2, NULL
item3, 55
Importing the data ...
BULK INSERT test
FROM 'file.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
But this fails because on the second line it finds NULL for an integer field. This field, however, allows NULL values. So I want it to understand that this is just a NULL value and not a character value.
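For what it's worth, one possible workaround, not from the original thread and only a hedged sketch: read the file with OPENROWSET(BULK ...) and a format file that types both fields as character data, then convert the 'NULL' sentinel during the INSERT ... SELECT. The paths and the format file name are hypothetical, and the format file is assumed to name its two fields item and price.
INSERT INTO test ([item], [price])
SELECT src.[item],
       NULLIF(LTRIM(RTRIM(src.[price])), 'NULL')  -- the literal text NULL becomes a real NULL
FROM OPENROWSET(
         BULK 'C:\data\file.txt',
         FORMATFILE = 'C:\data\file.fmt'
     ) AS src;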
