Data not loaded correctly into Snowflake table from S3 bucket

I am having trouble loading data from an Amazon S3 bucket into a Snowflake table. This is my command:
copy into myTableName
from 's3://dev-allocation-storage/data_feed/'
credentials=(aws_key_id='***********' aws_secret_key='**********')
PATTERN='.*/.*/.*/.*'
file_format = (type = csv field_delimiter = '|' skip_header = 1 error_on_column_count_mismatch=false );
I have 3 CSV files in my bucket and all of them are being loaded into the table. However, my target table has 8 columns, and all of the data ends up in the first column, looking like a JSON object.

Check that each row is not enclosed in double quotes as a whole, e.g. "f1|f2|...|f8". That will be treated as one single column value, unlike "f1"|"f2"|...|"f8", where each field is quoted individually.
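A quick way to confirm which case you are in is to query the staged files directly before running the COPY. Here is a minimal sketch, assuming a throwaway stage and named file format created just for the check (check_stage and pipe_csv are hypothetical names; the URL and credentials are the ones from the question):
create or replace stage check_stage
url = 's3://dev-allocation-storage/data_feed/'
credentials = (aws_key_id='***********' aws_secret_key='**********');
create or replace file format pipe_csv
type = csv field_delimiter = '|' skip_header = 1;
-- If $1 returns the whole row (pipes and all), the rows are quoted as a unit and the
-- source files need fixing; if $1, $2, $3 come back as separate values, the delimiter
-- settings are fine and the problem is elsewhere.
select $1, $2, $3
from @check_stage (file_format => 'pipe_csv')
limit 5;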

Related

Snowflake Parquet Copy Into NULL_IF

I have a staged Parquet file in an S3 location. I am attempting to parse the Parquet file into a relational table; the field I'm having an issue with is a TIMESTAMP_NTZ field.
In the file, there is a field called "due_date", and while most of the time it is populated with data, on occasion there is an empty string like below:
"due_date":""
The error that I'm receiving is 'Failed to cast variant value "" to TIMESTAMP_NTZ'.
Using the NULL_IF parameter in the COPY INTO is not yielding any results; it is set to:
file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false NULL_IF = ('\\N','NULL','NUL','','""'))
I have seen other users replace the NULLs in the SELECT portion of the COPY INTO statement, but that would be a hard option to implement here because the fields are dynamic.
Could anyone shed any light on this, other than the knowledge that empty strings shouldn't form part of Parquet?
Full query below:
USE SCHEMA MY_SCHEMA;
COPY INTO MY_SCHEMA.MY_TABLE (LOAD_DATE, ACCOUNTID, APPID, CREATED_AT, CREATED_ON, DATE, DUE_DATE, NUMEVENTS, NUMMINUTES, REMOTEIP, SERVER, TIMESTAMP, TRACKNAME, TRACKTYPEID, TRANSACTION_DATE, TYPE, USERAGENT, VISITORID)
FROM (
SELECT CURRENT_TIMESTAMP(), $1:accountId, $1:appId, $1:created_at, $1:created_on, $1:date, $1:due_date, $1:numEvents, $1:numMinutes, $1:remoteIp, $1:server, $1:timestamp, $1:trackName, $1:trackTypeId, $1:transaction_date, $1:type, $1:userAgent, $1:visitorId
FROM @my_stage
)
PATTERN = '.*part.*'
FILE_FORMAT = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false NULL_IF = ('\\N','NULL','NUL','','""'));
You can use TRY_TO_TIMESTAMP. Since TRY_TO_TIMESTAMP does not accept a VARIANT, you need to cast it to string first:
TRY_TO_TIMESTAMP($1:due_date::string)
instead of just
$1:due_date
If due_date is empty, the result will be NULL in the timestamp column of the target table after the insert.
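Applied to the COPY INTO from the question, only the due_date expression needs to change. A minimal sketch with the column list cut down to three columns for brevity (the real statement would keep all of them):
copy into MY_SCHEMA.MY_TABLE (LOAD_DATE, ACCOUNTID, DUE_DATE)
from (
select current_timestamp(),
$1:accountId,
try_to_timestamp($1:due_date::string)  -- an empty string no longer fails the cast; it simply becomes NULL
from @my_stage
)
pattern = '.*part.*'
file_format = (TYPE='PARQUET' COMPRESSION = SNAPPY BINARY_AS_TEXT = true TRIM_SPACE = false);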

SQL Compilation error while loading CSV file from S3 to Snowflake

We are facing the below issue while loading a CSV file from S3 into Snowflake.
SQL Compilation error: Insert column value list does not match column list expecting 7 but got 6
We have tried removing a column from the table and running it again, but this time it shows 'expecting 6 but got 5'.
Below are the commands that we used for stage creation and for the copy command.
create or replace stage mystage
url='s3://test/test'
STORAGE_INTEGRATION = test_int
file_format = (type = csv FIELD_OPTIONALLY_ENCLOSED_BY='"' COMPRESSION=GZIP);
copy into mytable
from @mystage
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' COMPRESSION=GZIP error_on_column_count_mismatch=false TRIM_SPACE=TRUE NULL_IF=(''))
FORCE = TRUE
ON_ERROR = Continue
PURGE=TRUE;
You cannot use MATCH_BY_COLUMN_NAME for CSV files; this is why you get this error.
This copy option is supported for the following data formats:
JSON
Avro
ORC
Parquet
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
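For CSV, columns are matched by position instead, so the same load would look roughly like the sketch below (stage and table names as in the question); the column order in the files must line up with the column order in the table:
copy into mytable
from @mystage
file_format = (type = csv field_optionally_enclosed_by = '"' compression = gzip error_on_column_count_mismatch = false trim_space = true null_if = (''))
force = true
on_error = continue
purge = true;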

How to get the header names of a staged csv file on snowflake?

Is it possible to get the header of a staged CSV file on Snowflake into an array?
I need to loop over all fields to insert data into our data vault model, and it is really necessary to get these column names into an array.
Actually it was solved by using the following query over a staged file in a JavaScript stored procedure:
var get_length_and_columns_array = "select array_size(split($1,',')) as NO_OF_COL, "+
"split($1,',') as COLUMNS_ARRAY from "+FILE_FULL_PATH+" "+
"(file_format=>"+ONE_COLUMN_FORMAT_FILE+") limit 1";
The file format referenced by ONE_COLUMN_FORMAT_FILE puts all the fields into a single column in order to make this query work:
CREATE FILE FORMAT ONE_COLUMN_FILE_FORMAT
TYPE = 'CSV' COMPRESSION = 'AUTO' FIELD_DELIMITER = '|' RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');
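With that file format in place, the query the procedure builds resolves to something like the following for a concrete file (the stage and file names here are hypothetical):
select array_size(split($1, ',')) as NO_OF_COL,
split($1, ',') as COLUMNS_ARRAY
from @my_stage/myfile.csv (file_format => 'ONE_COLUMN_FILE_FORMAT')
limit 1;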
Yes, you can query the following metadata of your staged files:
METADATA$FILENAME: Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER: Row number for each record in the container staged data file.
So the metadata alone is not enough. But there is the SKIP_HEADER parameter that can be used in your COPY INTO command, so my suggestion for a workaround is:
Copy your data into a table with SKIP_HEADER = 0, thereby loading your header into the table as regular column values
Query the first row, which contains the column names
Use this as input for further processing
More info about the parameter can be found in the COPY INTO command documentation: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
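A sketch of that workaround, using hypothetical table, stage, and column names, and the METADATA$FILE_ROW_NUMBER column mentioned above to find the header row:
create or replace temporary table raw_with_header (row_no number, c1 string, c2 string, c3 string);
copy into raw_with_header
from (
select metadata$file_row_number, $1, $2, $3
from @my_stage/myfile.csv
)
file_format = (type = csv skip_header = 0);  -- keep the header row as ordinary data
-- The column names are simply the values in the first row of the file:
select c1, c2, c3 from raw_with_header where row_no = 1;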
Dynamically generating the column list from a CSV file is not currently available in Snowflake, or in most platforms AFAIK.
CSV is not the ideal format for this kind of schema-on-read operation.
If you are able to work with your input files, I would suggest converting the CSV to JSON. If you use JSON instead, you can then use Snowflake to process the file.
Here is some context:
Load CSV files with dynamic column headers Not Supported
Example of loading JSON data
Example Converting CSV to JSON with Pandas
import pandas as pd

filepath = '/home/username/data/sales.csv'
jsonfilepath = filepath.replace('.csv', '.json')

# Read the CSV and write it back out as a JSON array of records with ISO-formatted dates
df = pd.read_csv(filepath)
# df.to_json(jsonfilepath, orient="table", date_format="iso", index=False)
df.to_json(jsonfilepath, orient="records", date_format="iso")

print("Input File: {}\r\nOutput File: {}".format(filepath, jsonfilepath))
Example Converting CSV to JSON with csvkit
csvjson -i 4 '/home/username/data/sales.csv' > '/home/username/data/sales.csvkit.json'
Querying Semi-Structured Data in Snowflake
Loading JSON Data into Snowflake
/* Create a target relational table for the JSON data. The table is temporary, meaning it persists only for */
/* the duration of the user session and is not visible to other users. */
create or replace temporary table home_sales (
city string,
zip string,
state string,
type string default 'Residential',
sale_date timestamp_ntz,
price string
);
/* Create a named file format with the file delimiter set as none and the record delimiter set as the new */
/* line character. */
/* */
/* When loading semi-structured data (e.g. JSON), you should set CSV as the file format type (default value). */
/* You could use the JSON file format, but any error in the transformation would stop the COPY operation, */
/* even if you set the ON_ERROR option to continue or skip the file. */
create or replace file format sf_tut_csv_format
field_delimiter = none
record_delimiter = '\\n';
/* Create a temporary internal stage that references the file format object. */
/* Similar to temporary tables, temporary stages are automatically dropped at the end of the session. */
create or replace temporary stage sf_tut_stage
file_format = sf_tut_csv_format;
/* Stage the data file. */
/* */
/* Note that the example PUT statement references the macOS or Linux location of the data file. */
/* If you are using Windows, execute the following statement instead: */
-- PUT file://%TEMP%/sales.json @sf_tut_stage;
put file:///tmp/sales.json @sf_tut_stage;
/* Load the JSON data into the relational table. */
/* */
/* A SELECT query in the COPY statement identifies a numbered set of columns in the data files you are */
/* loading from. Note that all JSON data is stored in a single column ($1). */
copy into home_sales(city, state, zip, sale_date, price)
from (select substr(parse_json($1):location.state_city,4), substr(parse_json($1):location.state_city,1,2),
parse_json($1):location.zip, to_timestamp_ntz(parse_json($1):sale_date), parse_json($1):price
from @sf_tut_stage/sales.json.gz t)
on_error = 'continue';
/* Query the relational table */
select * from home_sales;

Data Load into Snowflake table - Geometry data

I have a requirement to load a CSV file which contains geometry data into a Snowflake table.
I am using the data load option which is available in the Snowflake web GUI.
The sample geometry data is as below.
LINESTRING (-118.808186210713 38.2287933407744, -118.808182249848 38.2288155788245, -118.807079844554 38.2293234553217, -118.806532314702 38.229961732287, -118.80625724007 38.2306350645631, -118.805071970015 38.231849721603, -118.804097093763 38.2325380450286, -118.803504299857 38.2328501734747, -118.802726055048 38.2332839062976, -118.802126140311 38.2334442483131, -118.801758172942 38.233542312624)
Since commas are present in the geometry data, the data load option is treating them as separate columns and throwing an error.
I tried updating the CSV file with the TO_GEOGRAPHY function as below, but still no luck.
TO_GEOGRAPHY(LINESTRING (-118.808186210713 38.2287933407744, -118.808182249848 38.2288155788245, -118.807079844554 38.2293234553217, -118.806532314702 38.229961732287, -118.80625724007 38.2306350645631, -118.805071970015 38.231849721603, -118.804097093763 38.2325380450286, -118.803504299857 38.2328501734747, -118.802726055048 38.2332839062976, -118.802126140311 38.2334442483131, -118.801758172942 38.233542312624))
So any pointers on this would be appreciated; the full content of the CSV file is below.
ID," GEOGRAPHIC_ROUTE",Name
12421,"LINESTRING (-118.808186210713 38.2287933407744, -118.808182249848 38.2288155788245, -118.807079844554 38.2293234553217, -118.806532314702 38.229961732287, -118.80625724007 38.2306350645631, -118.805071970015 38.231849721603, -118.804097093763 38.2325380450286, -118.803504299857 38.2328501734747, -118.802726055048 38.2332839062976, -118.802126140311 38.2334442483131, -118.801758172942 38.233542312624)",Winston
As I see it, the fields are enclosed in double quotes to prevent misinterpretation of the comma characters in the geographic data (which is good!).
Could you set FIELD_OPTIONALLY_ENCLOSED_BY to '"' (double quote) for your file format and try to re-import the file?
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#format-type-options-formattypeoptions
I am able to ingest the sample data using the following COPY command:
copy into TEST_TABLE from @my_stage
FILE_FORMAT = (type = csv, FIELD_OPTIONALLY_ENCLOSED_BY='"', skip_header =1 );
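For reference, Snowflake converts WKT strings to GEOGRAPHY on load, so a target table along these lines should accept the sample row directly; the table and column names below are hypothetical:
create or replace table routes (
id number,
geographic_route geography,  -- WKT such as 'LINESTRING (...)' is parsed during the load
name string
);
copy into routes from @my_stage
file_format = (type = csv, field_optionally_enclosed_by = '"', skip_header = 1);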

COPY INTO query on Snowflake returns TABLE does not exist error

I am trying to load data from Azure Blob Storage.
The data has already been staged.
But the issue is when I try to run:
copy into random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
Below is the error I encounter:
raise error_class(
snowflake.connector.errors.ProgrammingError: 001757 (42601): SQL compilation error:
Table 'random_table_name' does not exist
Basically, it says the table does not exist, which it does not, but the syntax on the website is the same as mine.
In my case the table name was case-sensitive. Snowflake seems to convert everything to upper case. I changed the database/schema/table names to all upper case and it started working.
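To illustrate the point (a sketch using the names from the question): unquoted identifiers resolve to upper case, while quoted identifiers keep their exact case.
-- These two statements resolve to the same upper-case table name:
copy into random_table_name from @stage_name_i_created file_format = (type='csv');
copy into RANDOM_TABLE_NAME from @stage_name_i_created file_format = (type='csv');
-- A table created as "random_table_name" (with quotes) keeps its lower-case name
-- and must be referenced with the same quoting:
copy into "random_table_name" from @stage_name_i_created file_format = (type='csv');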
First, run the query below to fetch the column headers:
select $1 FROM @stage_name_i_created/filename.csv limit 1
Assuming the below is the header line from your CSV file:
id;first_name;last_name;email;age;location
Create a CSV file format:
create or replace file format semicolon
type = 'CSV'
field_delimiter = ';'
skip_header=1;
Then you should define the data types and field names as below:
create or replace table <yourtable> as
select $1::varchar as id
,$2::varchar as first_name
,$3::varchar as last_name
,$4::varchar as email
,$5::int as age
,$6::varchar as location
FROM @stage_name_i_created/yourfile.csv
(file_format => semicolon );
The table must exist prior to running a COPY INTO command. In your post, you say that the table does not exist...so that is your issue.
If your table exists, try fully qualifying the table path like this:
copy into <database>.<schema>.<random_table_name>
from #stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
or by setting the context in steps like this:
use database <database_name>;
use schema <schema_name>;
copy into <database_name>.<schema_name>.random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv';
rbachkaniwala, what do you mean by 'How do I create a table? (according to Snowflake syntax it is not possible to create empty tables)'?
You can just do below to create a table
CREATE TABLE random_table_name (FIELD1 VARCHAR, FIELD2 VARCHAR)
The table does need to exist. You should check the documentation for COPY INTO.
Other areas to consider are:
do you have the right context set for the database and schema?
does the user / role have access to the table or object?
It basically seems like you don't have the table defined yet. You should:
ensure the table is created
ensure all columns in the CSV exist as columns in the table
ensure the order of the columns is the same as in the CSV
I'd check the data types too; a sketch tying these points together follows below.
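A minimal sketch of those steps, with hypothetical column names and the stage name from the question (note that PATTERN takes a regular expression, not a glob):
create table random_table_name (
id varchar,
first_name varchar,
last_name varchar  -- one column per CSV column, in the same order as the file
);
copy into random_table_name
from @stage_name_i_created
file_format = (type = 'csv')
pattern = '.*[.]csv';  -- regex form of "all .csv files"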
"COPY INTO" is not a query command, it is the actual data transfer execution from source to destination, which both must exist as others commented here but If you want just to query without loading the files then run the following SQL:
//Display list of files in the stage to verify stage
LIST @stage_name_i_created;
//Create a file format
CREATE OR REPLACE FILE FORMAT RANDOM_FILE_CSV
type = csv
COMPRESSION = 'GZIP' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n' SKIP_HEADER = 0 FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = 'NONE' DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
//Now select the data in the files
Select $1 as first_col, $2 as second_col //add as many columns as necessary
from @stage_name_i_created
(FILE_FORMAT => RANDOM_FILE_CSV)
More information can be found in the documentation link here
https://docs.snowflake.com/en/user-guide/querying-stage.html
