External table in Snowflake with CSV format - snowflake-cloud-data-platform

I am trying to load data from a CSV file (comma separated) from Azure Blob into a Snowflake external table.
Some cell contents themselves contain commas, which is causing the data to get jumbled across columns.
There is no possibility that the CSV file can be changed.
What should the approach be then?
CREATE OR REPLACE EXTERNAL TABLE ANALYTICSLAYER.AN_MEDALLIA_P.EXT_DIM_APP_TRAINING_SURVEY
(
"Col1" varchar AS (value:c1::varchar),
"Col2" varchar AS (value:c2::varchar),
"Col3" varchar AS (value:c3::varchar),
"Col4" varchar AS (value:c4::varchar),
"Col5" varchar AS (value:c5::varchar),
"Col6_DATE" varchar AS (value:c6::varchar),
"Col7" varchar AS (value:c7::varchar)
)
WITH
LOCATION=@ST_TESTAZURE_P
AUTO_REFRESH = true
FILE_FORMAT = 'FF_TEST_CSV_P'
PATTERN='.*DIM_TEST_FILE_DATA.csv';
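If the cells that contain commas are enclosed in double quotes in the file (an assumption, since the file itself cannot be changed), the usual approach is to define the CSV file format with FIELD_OPTIONALLY_ENCLOSED_BY so that quoted commas are not treated as delimiters. A minimal sketch, reusing the FF_TEST_CSV_P name from the question:
-- Sketch: assumes fields containing commas are wrapped in double quotes,
-- and that the file has a single header row (SKIP_HEADER is an assumption).
CREATE OR REPLACE FILE FORMAT FF_TEST_CSV_P
TYPE = CSV
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
SKIP_HEADER = 1;
If the embedded commas are not quoted in the file, this will not help, since the columns cannot then be separated reliably by a delimiter alone.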

Related

How to load Parquet/AVRO into multiple columns in Snowflake with schema auto detection?

When trying to load a Parquet/AVRO file into a Snowflake table I get the error:
PARQUET file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.
But I don't want to load these files into a new one column table — I need the COPY command to match the columns of the existing table.
What can I do to get schema auto detection?
Good news: that error message is outdated, as Snowflake now supports schema detection and COPY INTO multiple columns.
To reproduce the error:
create or replace table hits3 (
WatchID BIGINT,
JavaEnable SMALLINT,
Title TEXT
);
copy into hits3
from @temp.public.my_ext_stage/files/
file_format = (type = parquet);
-- PARQUET file format can produce one and only one column of type variant or object or array.
-- Use CSV file format if you want to load more than one column.
To fix the error and have Snowflake match the table columns to the Parquet/AVRO files, just add the option MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE (or MATCH_BY_COLUMN_NAME=CASE_SENSITIVE):
copy into hits3
from @temp.public.my_ext_stage/files/
file_format = (type = parquet)
match_by_column_name = case_insensitive;
Docs:
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
https://docs.snowflake.com/en/user-guide/data-load-overview.html?#detection-of-column-definitions-in-staged-semi-structured-data-files
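For the schema-detection part, the second doc link describes INFER_SCHEMA. A minimal sketch of creating the table from the detected Parquet column definitions (the named file format my_parquet_format is an assumption, not from the question):
-- Hypothetical named file format; any PARQUET format works here.
CREATE OR REPLACE FILE FORMAT my_parquet_format TYPE = PARQUET;
-- Build the table from the column definitions detected in the staged files.
CREATE OR REPLACE TABLE hits3
USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
        INFER_SCHEMA(
            LOCATION => '@temp.public.my_ext_stage/files/',
            FILE_FORMAT => 'my_parquet_format'
        )
    )
);
After that, the COPY INTO with MATCH_BY_COLUMN_NAME shown above loads the files into the generated columns.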

Snowflake Copy Into failing when insert Null in timestamp column

Trying to load file data into Snowflake using COPY INTO. The table has a timestamp column. The file contains only nulls, represented as empty strings "", in that column.
On running COPY INTO with the file format's timestamp option set to AUTO, the statement fails, stating "Can't parse '' as timestamp".
Is there any way to handle this?
Use the NULL_IF option:
NULL_IF = ( 'string1' [ , 'string2' ... ] )
String used to convert to and from SQL NULL. Snowflake replaces these strings in the data load source with SQL NULL. To specify more than one string, enclose the list of strings in parentheses and use commas to separate each value.
NULL_IF = ('\\N', '')
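A minimal sketch of how that looks in the COPY statement (the table, stage, and file names are placeholders, not from the question):
-- Empty strings in the file are converted to SQL NULL, so the timestamp
-- column no longer fails with "Can't parse '' as timestamp".
COPY INTO my_table
FROM @my_stage/my_file.csv
FILE_FORMAT = (TYPE = CSV NULL_IF = ('\\N', ''));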

Migrating from SQL Server to Hive Table using flat file

I am migrating my data from SQL Server to Hive using the following steps, but there is a data issue with the resulting table. I tried various options, including checking datatypes and using the CSV serde, but I was not able to get the data aligned properly in the respective columns. I followed these steps:
Export the SQL Server data to a flat file with fields separated by commas.
Create an external table in Hive as given below and load the data.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable (
r_date timestamp
, v_nbr varchar(12)
, d_account int
, d_amount decimal(19,4)
, a_account varchar(14)
)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable/mytable.txt' OVERWRITE INTO TABLE myschema.mytable;
There is an issue with the data in every combination I could try.
I also tried OpenCSVSerde, but the result was worse than with the simple text file. I also tried changing the delimiter to a semicolon, but no luck.
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ( "separatorChar" = ",") stored as textfile
location 'gs://mybucket/myschema.db/mytable/';
Can you please suggest a robust approach so that I don't have to deal with data issues?
Note: Currently I don't have the option of connecting my SQL Server table with Sqoop.
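One thing worth checking, assuming the SQL Server export wraps fields that contain commas in double quotes: OpenCSVSerde only keeps such fields intact when its quoteChar matches the quoting used in the file. A sketch of a complete definition along those lines (the _csv table name is just to distinguish it from the original; this is not a guaranteed fix, since if the export neither quotes nor escapes embedded delimiters, no delimiter choice will align the columns):
-- Sketch: assumes the exported file quotes fields with " and escapes with \.
-- OpenCSVSerde reads every column as string; cast to the target types when querying.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable_csv (
r_date string
, v_nbr string
, d_account string
, d_amount string
, a_account string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
STORED AS TEXTFILE
LOCATION 'gs://mybucket/myschema.db/mytable/';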

Staged Internal file csv.gz giving error that file does not match size of corresponding table?

I am trying to copy a csv.gz file into a table I created to start analyzing location data for a map. I was running into an error that says there are too many characters, and that I should add an on_error option. However, I am not sure if that will help load the data. Can you take a look?
Data source: https://data.world/cityofchicago/array-of-things-locations
SELECT * FROM staged/array-of-things-locations-1.csv.gz
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location_2 variant, location variant);
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz;
CREATE OR REPLACE FILE FORMAT t_csv
TYPE = "CSV"
COMPRESSION = "GZIP"
FILE_EXTENSION = 'csv.gz';
CREATE OR REPLACE STAGE staged
FILE_FORMAT='t_csv';
COPY INTO ARRAYLOC FROM @~/staged file_format = (format_name = 't_csv');
Error message:
Number of columns in file (8) does not match that of the corresponding table (9), use file format option error_on_column_count_mismatch=false to ignore this error File '@~/staged/array-of-things-locations-1.csv.gz', line 2, character 1 Row 1 starts at line 1, column "ARRAYLOC"["LOCATION_2":8] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
Solved:
The real issue was that I needed to better clean the data I was staging; this was my error. What I ended up changing: the column types, changing the quoting in the file from " to ', and separating one column that had a comma in the middle of the data.
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude float, longitude varchar, location varchar);
COPY INTO ARRAYLOC
FROM @staged/array-of-things-locations-1.csv.gz;
CREATE or Replace FILE FORMAT r_csv
TYPE = "CSV"
COMPRESSION = "GZIP"
FILE_EXTENSION= 'csv.gz'
SKIP_HEADER = 1
ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE
EMPTY_FIELD_AS_NULL = TRUE;
create or replace stage staged
file_format='r_csv';
copy into ARRAYLOC from @~/staged
file_format = (format_name = 'r_csv');
SELECT * FROM ARRAYLOC LIMIT 10;
Your error doesn't say that you have too many characters; it says that your file has 8 columns while your table has 9 columns, so Snowflake doesn't know how to align the columns from the file to the columns in the table.
You can list out the columns specifically using a subquery in your COPY INTO statement.
Notes:
Columns from the file are referenced by position, so $1 is the first column in the file, $2 is the second, etc.
You can put the columns from the file in any order that you need to match your table.
You'll need to find the column that doesn't have data coming in from the file and either fill it with null or some default value. In my example, I assume it is the last column and in it I will put the current timestamp.
It helps to list out the columns of the table behind the table name, but this is not required.
Example:
COPY INTO ARRAYLOC (COLUMN1,COLUMN2,COLUMN3,COLUMN4,COLUMN5,COLUMN6,COLUMN7,COLUMN8,COLUMN9)
FROM (
SELECT $1
,$2
,$3
,$4
,$5
,$6
,$7
,$8
,CURRENT_TIMESTAMP()
FROM @staged/array-of-things-locations-1.csv.gz
);
I would advise against changing the ERROR_ON_COLUMN_COUNT_MISMATCH parameter, as doing so could result in data ending up in the wrong column of the table. I would also advise against changing the ON_ERROR parameter, as I believe it is best to be alerted of such errors rather than suppressing them.
Yes, setting that option should help. From the documentation:
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE | FALSE
Use: Data loading only
Definition: Boolean that specifies whether to generate a parsing error if the number of delimited columns (i.e. fields) in an input file does not match the number of columns in the corresponding table.
If set to FALSE, an error is not generated and the load continues. If the file is successfully loaded:
If the input file contains records with more fields than columns in the table, the matching fields are loaded in order of occurrence in the file and the remaining fields are not loaded.
If the input file contains records with fewer fields than columns in the table, the non-matching columns in the table are loaded with NULL values.
This option assumes all the records within the input file are the same length (i.e. a file containing records of varying length returns an error regardless of the value specified for this parameter).
So assuming you are okay with getting NULL values for the missing column in your input data, you can use ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE to load the file successfully.
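For completeness, a sketch of what that would look like against the user stage path from the question, with the format options spelled out inline (SKIP_HEADER is taken from the poster's later r_csv format; note the caution in the other answer that this option can mask misaligned data):
-- Missing trailing column(s) in the file are loaded as NULL.
COPY INTO ARRAYLOC
FROM @~/staged
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP SKIP_HEADER = 1
               ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE);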
When viewing that table directly on data.world, there are columns named both location and location_2 with identical data. It looks like that display is erroneous, because when downloading the CSV, it has only a single location column.
I suspect that if you change your CREATE OR REPLACE statement to the following one, which omits the creation of location_2, you'll get to where you want to go:
CREATE OR REPLACE TABLE ARRAYLOC(name varchar, location_type varchar, category varchar, notes varchar, status1 varchar, latitude number, longitude number, location variant);

hive "\n" value in records

I am processing a large 120 GB file using Hive. The data is first loaded from a SQL Server table to AWS S3 as a CSV file (tab separated), and then a Hive external table is created on top of this file. I have encountered a problem while querying data from the Hive external table. I noticed that the CSV contains \n in many column fields (which were actually NULL in SQL Server). Now, when I create the Hive table, a \n that appears in any record moves Hive to a new record and generates NULL for the rest of the columns in that record. I tried lines terminated by "001" but had no success; I get an error that Hive supports only "lines terminated by \n". My question is: if Hive supports only \n as the line separator, how would you handle columns that contain \n values?
Any suggestions?
This is how I am creating my external table:
DROP TABLE IF EXISTS IMPT_OMNITURE__Browser;
CREATE EXTERNAL TABLE IMPT_OMNITURE__Browser (
ID int, Region string, Description string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://abm-dw/data-import/omniture/Browser/';
You could alter the table with the command below, or set the same property in the CREATE statement's serde properties:
ALTER TABLE table SET SERDEPROPERTIES ('serialization.null.format' = "");
With the format set to an empty string, the empty fields in the file are read back as NULL.
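A minimal sketch of the create-statement variant, spelling out the default LazySimpleSerDe so the serde properties can be attached (table name and location are copied from the question):
DROP TABLE IF EXISTS IMPT_OMNITURE__Browser;
CREATE EXTERNAL TABLE IMPT_OMNITURE__Browser (
ID int, Region string, Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = '\t',
-- empty strings in the file are read back as NULL
'serialization.null.format' = ''
)
STORED AS TEXTFILE
LOCATION 's3://abm-dw/data-import/omniture/Browser/';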
