Snowflake: How to extract the file name from an empty file - snowflake-cloud-data-platform

I have a requirement to capture the input file name and store it in a Snowflake table. I am using Snowpipe and a stage to query a file that is in S3. My query works fine when the input file has data; however, if the input file is empty, the COPY command does not return the file name.
How can I get the file name when the input file is empty (zero bytes)? Thanks.
Snowpipe syntax:
CREATE OR REPLACE PIPE DEV.schema.load_pipe auto_ingest = true
AS
COPY INTO schema.TMP_table FROM (
SELECT
$1::variant AS MESSAGE,
SPLIT_PART(METADATA$FILENAME,'/',4) AS FILE_NAME,
current_timestamp::timestamp_ntz as LOAD_TS
FROM @DEV.STAGE.DEV_STG/input/
)
pattern = 'File_DLYERR_.*'
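For reference, here is a minimal Python sketch of what SPLIT_PART(METADATA$FILENAME,'/',4) picks out of a staged path. The split_part helper and the sample path are hypothetical illustrations, not Snowflake code; the helper only mimics SPLIT_PART's 1-based part numbering and assumes a non-empty delimiter.

```python
def split_part(text: str, delimiter: str, part_number: int) -> str:
    """Rough Python stand-in for Snowflake's SPLIT_PART (non-empty delimiter assumed)."""
    parts = text.split(delimiter)
    # Part numbers are 1-based; 0 is treated like 1, negatives count from the end.
    if part_number == 0:
        part_number = 1
    index = part_number - 1 if part_number > 0 else len(parts) + part_number
    # An out-of-range part yields an empty string rather than an error.
    return parts[index] if 0 <= index < len(parts) else ""


# Hypothetical staged path with the file name in the fourth segment.
sample_path = "input/2023/01/File_DLYERR_001.json"
file_name = split_part(sample_path, "/", 4)
print(file_name)
```

Note that METADATA$FILENAME is a per-row column, so a file that produces no rows presumably leaves nothing for this expression to be evaluated against.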

Related

How to read the headers of a CSV file in a Snowflake stage

I am learning Snowflake. I was trying to read the headers of a CSV file stored in an AWS bucket. I used the metadata fields, which required me to enter $1, $2, and so on as column names to obtain the headers (for creating the table used by COPY INTO).
Is there a better alternative to this?
Statement:
select
Top 100 metadata$filename,
metadata$file_row_number,
t.$1,
t.$2,
t.$3,
t.$4,
t.$5,
t.$6
from
@aws_stage t
where
metadata$filename = 'OrderDetails.csv'

Unable to load a CSV file into Snowflake with the COPY INTO command

End of record reached while expected to parse column '"VEGETABLE_DETAILS_PLANT_HEIGHT"["HIGH_END_OF_RANGE":5]'
File 'veg_plant_height.csv', line 8, character 14
Row 3, column "VEGETABLE_DETAILS_PLANT_HEIGHT"["HIGH_END_OF_RANGE":5]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
This is my table:
create or replace table VEGETABLE_DETAILS_PLANT_HEIGHT (
PLANT_NAME text(7),
VEG_HEIGHT_CODE text(1),
UNIT_OF_MEASURE text(2),
LOW_END_OF_RANGE number(2),
HIGH_END_OF_RANGE number(2)
);
And the COPY INTO command I used:
copy into vegetable_details_plant_height
from @like_a_window_into_an_s3_bucket
files = ( 'veg_plant_height.csv')
file_format = ( format_name=VEG_CHALLENGE_CC );
And the CSV file: https://uni-lab-files.s3.us-west-2.amazonaws.com/veg_plant_height.csv
The error "End of record reached while expected to parse column" means Snowflake found fewer columns than expected when processing the current row.
Please review your CSV file and make sure each row has the correct number of columns. The error points to line 8.
The table has 5 columns, but the source file contains values for only four, so the COPY command returns this error. To resolve the issue, you can modify the COPY command as shown below:
copy into vegetable_details_plant_height(PLANT_NAME, UNIT_OF_MEASURE, LOW_END_OF_RANGE, HIGH_END_OF_RANGE)
from (select $1, $2, $3, $4 from @like_a_window_into_an_s3_bucket)
files = ( 'veg_plant_height.csv') file_format = ( format_name=VEG_CHALLENGE_CC );
As you can see in the CSV file, the data in one column is enclosed in "" and the names inside it are separated by commas, so you need to use the FIELD_OPTIONALLY_ENCLOSED_BY = '"' option.
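Before changing the COPY command, it can help to check locally which rows really have fewer fields than the table expects. The following is a minimal sketch using Python's csv module, which treats double quotes as optional field enclosures much like FIELD_OPTIONALLY_ENCLOSED_BY = '"'. The find_short_rows helper and the sample rows are hypothetical; they are not the contents of veg_plant_height.csv.

```python
import csv
import io


def find_short_rows(csv_text: str, expected_columns: int) -> list[tuple[int, int]]:
    """Return (line_number, field_count) for every row with fewer fields than expected."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=",", quotechar='"')
    short_rows = []
    for row in reader:
        if len(row) < expected_columns:
            # reader.line_num is the number of source lines read so far.
            short_rows.append((reader.line_num, len(row)))
    return short_rows


# Hypothetical sample: the quoted name contains a comma, the last row lacks a field.
sample = "\n".join([
    "PLANT_NAME,UOM,LOW,HIGH",
    '"Big Beef, Hybrid",in,48,72',
    "Corn,in,60",
])
short = find_short_rows(sample, expected_columns=4)
print(short)
```

A naive split on commas would count five fields in the quoted row, which is why the enclosure option matters when the data itself contains commas.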

Copy the same file into table using COPY command & snowpipe

I couldn't load the same file into a table in Snowflake using the COPY command or Snowpipe.
I always get the following result:
Copy executed with 0 files processed.
I have re-created the table and truncated it, but copy_history doesn't show any data:
select * from table(information_schema.copy_history(table_name=>'mytable', start_time=> dateadd(hours, -10, current_timestamp())));
I have used FORCE = TRUE in the COPY command, and it still didn't load the same file into the table. I have explicitly specified the file path in the COPY command:
FROM
@STAGE_DEV/myfile/05/28/16/myfile_1.csv
) file_format = (
format_name = STANDARD_CSV_FORMAT Skip_header = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"' NULL_IF = 'NULL'
)
on_error = continue
Force = True;
Has anyone faced a similar issue, and what is the process for loading the same file again using the COPY command or Snowpipe? I don't have the option to change the file name or put the files in a different S3 bucket.
ls @stage shows the files.
I have reloaded the files to the S3 bucket and it's working. Thank you all for the responses.

How to get the header names of a staged csv file on snowflake?

Is it possible to get the header of a staged CSV file on Snowflake into an array?
I need to loop over all fields to insert data into our data vault model, so I really need these column names in an array.
Actually it was solved by using the following query over a staged file in a JavaScript stored procedure:
var get_length_and_columns_array = "select array_size(split($1,',')) as NO_OF_COL, "+
"split($1,',') as COLUMNS_ARRAY from "+FILE_FULL_PATH+" "+
"(file_format=>"+ONE_COLUMN_FORMAT_FILE+") limit 1";
The ONE_COLUMN_FORMAT_FILE file format puts all fields into a single column so that this query works:
CREATE FILE FORMAT ONE_COLUMN_FILE_FORMAT
TYPE = 'CSV' COMPRESSION = 'AUTO' FIELD_DELIMITER = '|' RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');
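The idea behind that query (read the first line as one value, split it on the delimiter, and keep both the count and the array) can be sketched outside Snowflake as well. The following Python version is only an illustration under assumptions: the header_columns helper and the sample data are hypothetical, and the file content is assumed to be available as a string.

```python
import csv
import io


def header_columns(csv_text: str, delimiter: str = ",") -> tuple[int, list[str]]:
    """Return (number_of_columns, column_names) taken from the first line of the text."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    # next() with a default avoids StopIteration on an empty file.
    header = next(reader, [])
    return len(header), header


# Hypothetical sample file content.
sample = "ORDER_ID,CUSTOMER,AMOUNT\n1,Alice,10.5\n"
no_of_col, columns_array = header_columns(sample)
print(no_of_col, columns_array)
```

One difference from split($1, ','): the csv module respects quoted header names that contain the delimiter, whereas a plain split does not.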
Yes, you can query the following metadata of your staged files:
METADATA$FILENAME: Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER: Row number for each record in the container staged data file.
So on its own this is not enough information. But the SKIP_HEADER parameter can be used in your COPY INTO command. My suggestion for a workaround is:
Copy your data into a table with SKIP_HEADER = 0, so that the header row is also loaded into the table as regular column values
Query the first row, which contains the column names
Use this as input for further processing
More information about the parameter is in the COPY INTO documentation: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
Dynamically generating the column list from a CSV file is not currently available in Snowflake or most platforms, afaik.
CSV is not the ideal format for this kind of schema-on-read operation.
If you are able to work with your input files, I would suggest converting the CSV to JSON. If you use JSON instead, you can then use Snowflake to process the file.
Here is some context:
Load CSV files with dynamic column headers Not Supported
Example of loading json data
Example Converting CSV to JSON with Pandas
import pandas as pd
filepath = '/home/username/data/sales.csv'
jsonfilepath = filepath.replace('.csv','.json')
df = pd.read_csv(filepath)
# df.to_json(jsonfilepath, orient="table", date_format="iso", index=False)
df.to_json(jsonfilepath, orient="records", date_format="iso")
print("Input File: {}\r\nOutput File: {}".format(filepath, jsonfilepath))
Example Converting CSV to JSON with csvkit
csvjson -i 4 '/home/username/data/sales.csv' > '/home/username/data/sales.csvkit.json'
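If neither pandas nor csvkit is available, the same conversion can be sketched with the standard library alone. This is a minimal sketch under assumptions: the csv_to_ndjson helper and the sample data are hypothetical, the output is one JSON object per line (newline-delimited JSON), and all values stay strings because no type inference is attempted.

```python
import csv
import io
import json


def csv_to_ndjson(csv_text: str) -> str:
    """Convert CSV text with a header row into newline-delimited JSON objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each CSV row becomes one JSON object keyed by the header names.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data.
sample = "city,price\nLexington,175000\nBelmont,392000\n"
ndjson = csv_to_ndjson(sample)
print(ndjson)
```

One object per line fits a load where the record delimiter is the newline character and each record lands in a single column.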
Querying Semi-Structured Data in Snowflake
Loading JSON Data into Snowflake
/* Create a target relational table for the JSON data. The table is temporary, meaning it persists only for */
/* the duration of the user session and is not visible to other users. */
create or replace temporary table home_sales (
city string,
zip string,
state string,
type string default 'Residential',
sale_date timestamp_ntz,
price string
);
/* Create a named file format with the file delimiter set as none and the record delimiter set as the new */
/* line character. */
/* */
/* When loading semi-structured data (e.g. JSON), you should set CSV as the file format type (default value). */
/* You could use the JSON file format, but any error in the transformation would stop the COPY operation, */
/* even if you set the ON_ERROR option to continue or skip the file. */
create or replace file format sf_tut_csv_format
field_delimiter = none
record_delimiter = '\\n';
/* Create a temporary internal stage that references the file format object. */
/* Similar to temporary tables, temporary stages are automatically dropped at the end of the session. */
create or replace temporary stage sf_tut_stage
file_format = sf_tut_csv_format;
/* Stage the data file. */
/* */
/* Note that the example PUT statement references the macOS or Linux location of the data file. */
/* If you are using Windows, execute the following statement instead: */
-- PUT file://%TEMP%/sales.json @sf_tut_stage;
put file:///tmp/sales.json @sf_tut_stage;
/* Load the JSON data into the relational table. */
/* */
/* A SELECT query in the COPY statement identifies a numbered set of columns in the data files you are */
/* loading from. Note that all JSON data is stored in a single column ($1). */
copy into home_sales(city, state, zip, sale_date, price)
from (select substr(parse_json($1):location.state_city,4), substr(parse_json($1):location.state_city,1,2),
parse_json($1):location.zip, to_timestamp_ntz(parse_json($1):sale_date), parse_json($1):price
from @sf_tut_stage/sales.json.gz t)
on_error = 'continue';
/* Query the relational table */
select * from home_sales;

Snowflake ON_ERROR=CONTINUE abort the COPY command for file

The Snowflake documentation for the COPY INTO command states (for copy options):
ON_ERROR = CONTINUE | SKIP_FILE | SKIP_FILE_num | SKIP_FILE_num% | ABORT_STATEMENT
Continue loading the file. The COPY statement returns an error message
for a maximum of one error encountered per data file. Note that the
difference between the ROWS_PARSED and ROWS_LOADED column values
represents the number of rows that include detected errors. However,
each of these rows could include multiple errors. To view all errors
in the data files, use the VALIDATION_MODE parameter or query the
VALIDATE function.
But for me, it doesn't seem to be obeyed: the default value, i.e. SKIP_FILE, appears to be applied, because files are skipped on any error in the file.
create or replace file format jsonThing type = 'json' DATE_FORMAT='yyyy-mm-dd'
TIMESTAMP_FORMAT='YYYY-MM-DD"T"HH24:MI:SSZ' TRIM_SPACE=TRUE NULL_IF=('\\N', 'NULL','');
create or replace stage snowflake_json_stage
storage_integration = snowflake_json_storage_integration
url = 'azure://snowflakejson.blob.core.windows.net/cdrs'
file_format = jsonThing
COPY_OPTIONS = (ON_ERROR=CONTINUE PURGE=TRUE MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE)
COMMENT='The snowflake json stage';
CREATE or REPLACE PIPE SNOWFLAKE_JSON_PIPE
AUTO_INGEST = TRUE
integration = snowflake_json_notification_integration
as
COPY INTO purge.public.cdrs
from @SNOWFLAKE_JSON_STAGE
ON_ERROR=CONTINUE
MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE;
Does the ON_ERROR=CONTINUE option work with a PIPE?
NOTE: The file is an NDJSON file.
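To make the two behaviours in question concrete, here is a toy Python model of the row-level semantics described in the quoted documentation: with CONTINUE, bad rows are dropped and the rest are loaded; with SKIP_FILE, one bad row discards the whole file. This is only an illustration under assumptions, not Snowflake's implementation, and the load_ndjson helper and sample lines are hypothetical. The tutorial comment earlier on this page suggests the JSON file format may not follow the CONTINUE semantics.

```python
import json


def load_ndjson(lines: list[str], on_error: str = "CONTINUE") -> dict:
    """Toy model: parse NDJSON lines and report parsed versus loaded row counts."""
    rows_parsed = 0
    loaded = []
    for line in lines:
        rows_parsed += 1
        try:
            loaded.append(json.loads(line))
        except json.JSONDecodeError:
            if on_error == "SKIP_FILE":
                # One bad row rejects the whole file: nothing is loaded.
                return {"rows_parsed": rows_parsed, "rows_loaded": 0}
            # CONTINUE: drop only the bad row and keep going.
    return {"rows_parsed": rows_parsed, "rows_loaded": len(loaded)}


# Hypothetical NDJSON file with a malformed second record.
sample_lines = ['{"id": 1}', '{"id": 2', '{"id": 3}']
continue_result = load_ndjson(sample_lines, on_error="CONTINUE")
skip_result = load_ndjson(sample_lines, on_error="SKIP_FILE")
print(continue_result, skip_result)
```

In this model the difference between rows_parsed and rows_loaded under CONTINUE is the number of rejected rows, which is the relationship the documentation excerpt describes.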
