Data load in Snowflake: NULL result in a non-nullable column

I am getting the error message "NULL result in a non-nullable column" when loading my Parquet files into Snowflake.
I have NOT NULL columns in Snowflake, for example NAME2 and NAME3, but the values for them in the Parquet files are empty strings.
So my question is: how can I resolve this constraint violation without changing my table definition or removing the NOT NULL constraint?
COPY INTO "DB_STAGE"."SCH_ABC_INIT"."T_TAB" FROM (
SELECT
$1:OPSYS::VARCHAR,
$1:MANDT::VARCHAR,
$1:LIFNR::VARCHAR,
$1:LAND1::VARCHAR,
$1:NAME1::VARCHAR,
$1:NAME2::VARCHAR,
$1:NAME3::VARCHAR,
$1:NAME4::VARCHAR,
..
..
$1:OPTYPE::VARCHAR
FROM @DB_STAGE.SCH_ABC_INIT.initial_load_stage_ABC)
file_format = (type = 'parquet', NULL_IF=('NULL','',' ','NULL','NULL','//N'))
pattern = '.*/ABC-TAB-prod/.*snappy.parquet';

I believe that this line
file_format = (type = 'parquet', NULL_IF=('NULL','',' ','NULL','NULL','//N'))
is explicitly asking to take the empty strings and make them into NULL values, which obviously won't work going into fields that are NOT NULL in your table. You should probably try something like this:
file_format = (type = 'parquet', NULL_IF=('NULL','//N'))
Your other option is to remove the NOT NULL in your table and allow the conversion to NULL.
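For reference, here is a minimal sketch of the asker's COPY INTO with the narrowed NULL_IF applied (column list abbreviated), so that empty strings load as empty strings instead of being converted to NULL:
COPY INTO "DB_STAGE"."SCH_ABC_INIT"."T_TAB" FROM (
SELECT
$1:OPSYS::VARCHAR,
$1:MANDT::VARCHAR,
-- ... remaining columns as in the original statement ...
$1:OPTYPE::VARCHAR
FROM @DB_STAGE.SCH_ABC_INIT.initial_load_stage_ABC)
file_format = (type = 'parquet', NULL_IF = ('NULL','//N'))
pattern = '.*/ABC-TAB-prod/.*snappy.parquet';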

Looking at these options, the documentation has some guidance on NULLs, particularly for Parquet:
NULL_IF = ( 'string1' [ , 'string2' , ... ] )
Use: Data loading only
Definition: String used to convert to and from SQL NULL. Snowflake replaces these strings in the data load source with SQL NULL. To specify more than one string, enclose the list of strings in parentheses and use commas to separate each value.
This file format option is applied to the following actions only when loading Parquet data into separate columns using the MATCH_BY_COLUMN_NAME copy option.
Note that Snowflake converts all instances of the value to NULL, regardless of the data type. For example, if 2 is specified as a value, all instances of 2 as either a string or number are converted.
For example:
NULL_IF = ('\N', 'NULL', 'NUL', '')
Note that this option can include empty strings.
Default: \N (i.e. NULL, which assumes the ESCAPE_UNENCLOSED_FIELD value is \)
from the docs: https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
There's also more content I found under communities: https://community.snowflake.com/s/question/0D50Z00009Vw7ktSAB/how-can-i-get-schema-file-when-copy-into-file-returns-empty-row
https://community.snowflake.com/s/question/0D50Z00008UE4MKSA1/while-loading-a-parquet-file-to-snowflake-all-the-optional-field-in-parquet-schema-are-coming-as-null-any-idea-why-it-is-happening-all-other-fields-which-are-mandatory-in-parquet-schema-as-coming-as-expected
Let me know if these help

Try setting EMPTY_FIELD_AS_NULL = FALSE in the file_format.

Related

Snowflake JSON with foreign language to tabular format dynamically

I read through the Snowflake documentation and the web and found only one solution to my problem, by https://stackoverflow.com/users/12756381/greg-pavlik, which can be found here: Snowflake JSON to tabular.
This doesn't work on data with Russian attribute names and attribute values. What modifications can be made for this to fit my case?
Here is an example:
create or replace table target_json_table(
v variant
);
INSERT INTO target_json_table SELECT parse_json('{
"at": {
"cf": "NV"
},
"pd": {
"мо": "мо",
"ä": "ä",
"retailerName": "retailer",
"productName":"product"
}
}');
call create_view_over_json('target_json_table', 'V', 'MY_VIEW');
ERROR: Encountered an error while creating the view. SQL compilation error: syntax error line 7 at position 7 unexpected 'ä:'. syntax error line 8 at position 7 unexpected 'мо'.
There was a bug in the original SQL used as a basis for the creation of the stored procedure. I have corrected that. You can get an update on the Github page. The changed section is here:
sql =
`
SELECT DISTINCT '"' || array_to_string(split(f.path, '.'), '"."') || '"' AS path_name, -- This generates paths with levels enclosed by double quotes (ex: "path"."to"."element"). It also strips any bracket-enclosed array element references (like "[0]")
DECODE (substr(typeof(f.value),1,1),'A','ARRAY','B','BOOLEAN','I','FLOAT','D','FLOAT','STRING') AS attribute_type, -- This generates column datatypes of ARRAY, BOOLEAN, FLOAT, and STRING only
'"' || array_to_string(split(f.path, '.'), '.') || '"' AS alias_name -- This generates column aliases based on the path
FROM
#~TABLE_NAME~#,
LATERAL FLATTEN(#~COL_NAME~#, RECURSIVE=>true) f
WHERE TYPEOF(f.value) != 'OBJECT'
AND NOT contains(f.path, '[') -- This prevents traversal down into arrays
limit ${ROW_SAMPLE_SIZE}
`;
Previously this SQL simply replaced non-ASCII characters with underscores. The updated SQL wraps key names in double quotes so that non-ASCII key names are preserved.
Be sure that's what you want it to do. Also, the keys are nested. I decided that the best way to handle that is to create column names in the view with dot notation, for example one column name is pd.ä. That will require wrapping the column name with double quotes, such as:
select * from MY_VIEW where "pd.ä" = 'ä';
Final note: The name of your stored procedure is create_view_over_json, however, in the Github project the name is create_view_over_variant. When you update, be sure to call the right procedure.
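For instance, after updating from the GitHub project, the earlier call would become (assuming the same table, column, and view names as above):
call create_view_over_variant('target_json_table', 'V', 'MY_VIEW');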

Snowflake Parquet Copy Into NULL_IF

I have a staged Parquet file in an S3 location. I am attempting to parse the Parquet file into a relational table; the field I'm having an issue with is a TIMESTAMP_NTZ field.
In the file, there is a field called "due_date", and while most of the time it is populated with data, on occasion there is an empty string like below:
"due_date":""
The error that I'm receiving is 'Failed to cast variant value "" to TIMESTAMP_NTZ.'
Using the NULL_IF parameter in the copy into is not yielding any results and is set to:
file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false NULL_IF = ('\\N','NULL','NUL','','""'))
I have seen other users replace the NULLs in the SELECT portion of the COPY INTO statement, but this would be a hard option to implement due to the fields being dynamic.
Could anyone shed any light on this, other than the knowledge that empty strings shouldn't form part of parquet?
Full query below:
USE SCHEMA MY_SCHEMA;
COPY INTO MY_SCHEMA.MY_TABLE (LOAD_DATE, ACCOUNTID, APPID, CREATED_AT, CREATED_ON, DATE, DUE_DATE, NUMEVENTS, NUMMINUTES, REMOTEIP, SERVER, TIMESTAMP, TRACKNAME, TRACKTYPEID, TRANSACTION_DATE, TYPE, USERAGENT, VISITORID)
FROM (
SELECT CURRENT_TIMESTAMP(), $1:accountId, $1:appId, $1:created_at, $1:created_on, $1:date, $1:due_date, $1:numEvents, $1:numMinutes, $1:remoteIp, $1:server, $1:timestamp, $1:trackName, $1:trackTypeId, $1:transaction_date, $1:type, $1:userAgent, $1:visitorId
FROM @my_stage
)
PATTERN = '.*part.*'
file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false NULL_IF = ('\\N','NULL','NUL','','""'));
You can use TRY_TO_TIMESTAMP. Since TRY_TO_TIMESTAMP does not accept variant, you need to cast it to string first:
TRY_TO_TIMESTAMP($1:due_date::string)
instead of just
$1:due_date
If the due_date is empty, the result will be NULL in the timestamp field in the target table after insert.
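Applied to the asker's statement, only the due_date expression in the SELECT needs to change; a sketch (the other columns are elided and stay as in the original):
SELECT
CURRENT_TIMESTAMP(),
-- ... other $1:<field> expressions unchanged ...
TRY_TO_TIMESTAMP($1:due_date::string)  -- an empty string becomes NULL instead of failing the cast
-- ... remaining columns ...
FROM @my_stage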

Why does ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE not work in Snowflake?

I'm creating a table in Snowflake using a COPY INTO clause; the values are coming from another table.
Now I want to raise an error if the data has more columns than the new table.
For example, the new table has 13 columns, and I want to raise an exception if the current record I'm trying to insert has 14 columns.
For that I used the parameter ERROR_ON_COLUMN_COUNT_MISMATCH and set it to TRUE within the file_format = () clause.
The data I want to insert is a pipe-delimited file, and sometimes when parsing it we get more columns than we expect (e.g. 14 instead of 13).
file_format = (format_name = MY_FORMAT, ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE)
It's supposed to work... but it's not working.
Has anybody seen anything like this before?

COPY INTO query on Snowflake returns TABLE does not exist error

I am trying to load data from azure blob storage.
The data has already been staged.
But, the issue is when I try to run
copy into random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
Below is the error I encounter:
raise error_class(
snowflake.connector.errors.ProgrammingError: 001757 (42601): SQL compilation error:
Table 'random_table_name' does not exist
Basically, it says the table does not exist (which it does not), but my syntax is the same as on the website.
In my case the table name is case-sensitive. Snowflake seems to convert everything to upper case. I changed the database/schema/table names to all upper-case and it started working.
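To illustrate with the asker's names (a sketch): unquoted identifiers are folded to upper case, so a table created with a quoted lower-case name will not be found by an unquoted reference.
CREATE TABLE "random_table_name" (field1 VARCHAR);  -- quoted: the name stays lower-case
COPY INTO random_table_name FROM @stage_name_i_created file_format = (type='csv');    -- resolves to RANDOM_TABLE_NAME and fails
COPY INTO "random_table_name" FROM @stage_name_i_created file_format = (type='csv');  -- quoted reference (or an all upper-case name) works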
First run the below query to fetch the column headers:
select $1 FROM @stage_name_i_created/filename.csv limit 1
Assuming below are the header lines from your csv file
id;first_name;last_name;email;age;location
Create a CSV file format:
create or replace file format semicolon
type = 'CSV'
field_delimiter = ';'
skip_header=1;
Then you should define the data types and field names as below:
create or replace table <yourtable> as
select $1::varchar as id
,$2::varchar as first_name
,$3::varchar as last_name
,$4::varchar as email
,$5::int as age
,$6::varchar as location
FROM @stage_name_i_created/yourfile.csv
(file_format => semicolon );
The table must exist prior to running a COPY INTO command. In your post, you say that the table does not exist...so that is your issue.
If your table exists, try forcing the full table path like this:
copy into <database>.<schema>.<random_table_name>
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
or by setting the context in steps like this:
use database <database_name>;
use schema <schema_name>;
copy into <database_name>.<schema_name>.random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv';
rbachkaniwala, what do you mean by 'How do I create a table? (according to Snowflake syntax it is not possible to create empty tables)'?
You can just do below to create a table
CREATE TABLE random_table_name (FIELD1 VARCHAR, FIELD2 VARCHAR)
The table does need to exist. You should check the documentation for COPY INTO.
Other areas to consider are:
Do you have the right context set for the database & schema?
Does the user / role have access to the table or object?
It basically seems like you don't have the table defined yet. You should:
ensure the table is created
ensure all columns in the CSV exist as columns in the table
ensure the order of the columns is the same as in the CSV
I'd check data types too.
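Putting that advice together, a minimal sketch (the column names are hypothetical and should mirror the CSV, in the same order):
CREATE TABLE random_table_name (col1 VARCHAR, col2 VARCHAR, col3 VARCHAR);
COPY INTO random_table_name
FROM @stage_name_i_created
file_format = (type='csv');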
"COPY INTO" is not a query command, it is the actual data transfer execution from source to destination, which both must exist as others commented here but If you want just to query without loading the files then run the following SQL:
//Display the list of files in the stage to verify the stage
LIST @stage_name_i_created;
//Create a file format
CREATE OR REPLACE FILE FORMAT RANDOM_FILE_CSV
type = csv
COMPRESSION = 'GZIP' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n' SKIP_HEADER = 0 FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = 'NONE' DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
//Now select the data in the files
Select $1 as first_col, $2 as second_col //add as many columns as necessary ...etc
from @stage_name_i_created
(FILE_FORMAT => RANDOM_FILE_CSV)
More information can be found in the documentation link here
https://docs.snowflake.com/en/user-guide/querying-stage.html

presto syntax for csv external table with array in one of the fields

I'm having trouble creating a table in Athena that points at files with the following format:
string, string, string, array.
When I wrote the files, I delimited the array items with '|'.
I delimited each line with '\n' and each column with ','.
So, for example, a row in my CSV would look like this:
Garfield, 15, orange, fish|milk|lasagna
In Hive (according to the documentation I read), when creating a table with a row delimited format, you can state a 'collection items' delimiter along with the other delimiters; it sets the delimiter between elements in array columns.
I could not find an equivalent for Presto in the documentation.
Is anyone aware if it's possible, and if so, what is the format, or where can I find it?
I tried "guessing" many forms, including 'collection items'; none seemed to work.
CREATE EXTERNAL TABLE `cats`(
`name` string,
`age` string,
`color` string,
`foods` array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
COLLECTION ITEMS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'some-location'
Would really appreciate any insight, Thanks! :)
According to AWS Athena docs on using SerDe, your guess was 100% correct.
In general, Athena uses the LazySimpleSerDe if you do not specify a ROW FORMAT, or if you specify ROW FORMAT DELIMITED
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
ESCAPED BY '\\'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
Now, when I simply tried your DDL statement, I would get
line 1:8: no viable alternative at input 'create external'
However, by deleting LINES TERMINATED BY '\n', I was able to create the table schema in the meta catalog:
CREATE EXTERNAL TABLE `cats`(
`name` string,
`age` string,
`color` string,
`foods` array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'some-location'
A sample file with lines as shown in your question was parsed correctly, and I was able to UNNEST the foods column:
SELECT *
FROM "cats"
CROSS JOIN UNNEST(foods) as t(food)
which resulted in one row per element of the foods array (e.g. separate rows for fish, milk, and lasagna from the sample line above).
Moreover, it was also enough to simply swap the LINES TERMINATED BY '\n' and COLLECTION ITEMS TERMINATED BY '|' lines for the query to work (although I don't have an explanation for why):
CREATE EXTERNAL TABLE `cats`(
`name` string,
`age` string,
`color` string,
`foods` array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'some-location'
(Note: this answer is applicable to Presto in general, but not to Athena)
Currently you cannot set the collection delimiter in Presto.
Please create a feature request at https://github.com/prestosql/presto/issues/
Note, we plan to provide generic support for table properties to address cases like this holistically -- https://github.com/prestosql/presto/issues/954. You can track the issue and associated pull request for updates.
I use the Presto engine to create a Hive table and set the collection delimiter in Presto, for example:
CREATE TABLE IF NOT EXISTS test (
id bigint COMMENT 'ID',
type varchar COMMENT 'TYPE',
content varchar COMMENT 'CONTENT',
create_time timestamp(3) COMMENT 'CREATE TIME',
pt varchar
)
COMMENT 'create time 2021/11/04 11:27:53'
WITH (
format = 'TEXTFILE',
partitioned_by = ARRAY['pt'],
textfile_field_separator = U&'\0001'
)
