I am trying to create a Snowpipe-based data ingestion on Snowflake (on Azure) to ingest data staged on AWS. When I try to create the pipe, I get the error below:
Pipe Notifications bind failure "Integration cannot be null for Azure"
Can anyone please let me know why I am getting this error? These are the steps I am following:
---CREATE AN EXTERNAL STAGE
CREATE OR REPLACE STAGE SNOWPIPE_SATGE URL = 'S3://MY-BUCKET/FILES';
---LIST THE FILES IN THE BUCKET
LIST @SNOWPIPE_SATGE;
--- CREATE THE TABLE WHERE THE DATA FROM PIPE WILL BE LOADED
CREATE OR REPLACE TABLE SCHOOLS_TABLE(GEOID INTEGER, SCHOOL_ID VARCHAR, SCHOOL_NAME VARCHAR, LAT NUMBER(10,5), LONG NUMBER(10,5));
---CREATE THE PIPE
CREATE OR REPLACE PIPE SNOWPIPE_FOR_SCHOOL_TABLE
AUTO_INGEST=TRUE
AS COPY INTO SCHOOLS_TABLE FROM @SNOWPIPE_SATGE
FILE_FORMAT=(TYPE=CSV FIELD_DELIMITER=',' SKIP_HEADER=1);
This is where the error occurs (at pipe creation).
When a pipe is re-created, there is a chance of missing some notifications. Is there any way to replay these missed notifications? Refreshing the pipe is dangerous (so not an option), as the load history is lost when the pipe is re-created (and hence could result in ingesting the same files twice and creating duplicate records).
Snowflake has documented a process on how to re-create pipes with automated data loading (link). Unfortunately, any new notifications coming in between step 1 (pause the pipe) and step 3 (re-create the pipe) can be missed. Even by automating the process with a procedure, we can shrink the window, but not eliminate it. I have confirmed this with multiple tests. Even without pausing the previous pipe, there's still a slim chance for this to happen.
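For reference, the documented sequence boils down to something like this (only a sketch; the pipe, table, and stage names here are placeholders, not from an actual setup):
-- 1) Pause the existing pipe
ALTER PIPE my_db.my_schema.my_pipe SET PIPE_EXECUTION_PAUSED = true;
-- 2) Verify the pipe has no pending files (see the pipe status call further below)
-- 3) Re-create the pipe; any notification arriving between 1) and 3) can be missed
CREATE OR REPLACE PIPE my_db.my_schema.my_pipe
  AUTO_INGEST = TRUE
  AS COPY INTO my_db.my_schema.my_table FROM @my_db.my_schema.my_stage;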
However, Snowflake is aware of the notifications, as the notification queue is separate from the pipes (and shared for the entire account). But the notifications received at the "wrong" time are just never processed (which I guess makes sense if there's no active pipe to process them at the time).
I think we can see those notifications in the numOutstandingMessagesOnChannel property of the pipe status, but I can't find much more information about this, nor how to get those notifications processed. I think they might just become lost when the pipe is replaced. 😞
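That property is part of the JSON returned by the pipe status function, which can be queried like this (same placeholder pipe name as above):
SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.my_pipe');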
Note: This is related to another question I asked about preserving the load history when re-creating pipes in Snowflake (link).
Assuming there's no way to replay outstanding notifications, I've instead created a procedure to detect files that have failed to load automatically. A benefit of this approach is that it can also detect any file that has failed to load for any reason (not only missed notifications).
The procedure can be called like this:
CALL verify_pipe_load(
'my_db.my_schema.my_pipe', -- Pipe name
'my_db.my_schema.my_stage', -- Stage name
'my_db.my_schema.my_table', -- Table name
'/YYYY/MM/DD/HH/', -- File prefix
'YYYY-MM-DD', -- Start time for the loads
'ERROR' -- Mode
);
Here's how it works, at a high level:
First, it finds all the files in the stage that match the specified prefix (using the LIST command), excluding files modified too recently to have finished ingesting (a small delay to account for latency).
Then, out of those files, it finds the ones that have no record in COPY_HISTORY.
Finally, it handles those missing file loads in one of three ways, depending on the mode:
The 'ERROR' mode will abort the procedure by throwing an exception. This is useful to automate the continuous monitoring of pipes and ensure no files are missed. Just hook it up to your automation tool of choice! We use DBT + DBT Cloud.
The 'INGEST' mode will automatically re-queue the files for ingestion by Snowpipe, using the REFRESH command for those specific files only (see the sketch after this list).
The 'RETURN' mode will simply return the list of files in the response.
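For reference, the re-queue performed by the 'INGEST' mode boils down to a statement of this shape (the pipe name and file prefix here are placeholders):
ALTER PIPE my_db.my_schema.my_pipe REFRESH PREFIX = '/YYYY/MM/DD/HH/some_file.csv';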
Here is the code for the procedure:
-- Returns a list of files missing from the destination table (separated by new lines).
-- Returns NULL if there are no missing files.
CREATE OR REPLACE PROCEDURE verify_pipe_load(
-- The FQN of the pipe (used to auto ingest):
PIPE_FQN STRING,
-- Stage to get the files from (same as the pipe definition):
STAGE_NAME STRING,
-- Destination table FQN (same as the pipe definition):
TABLE_FQN STRING,
-- File prefix (to filter files):
-- This should be based on a timestamp (ex: /YYYY/MM/DD/HH/)
-- in order to restrict files to a specific time interval
PREFIX STRING,
-- The time to get the loaded files from (should match the prefix):
START_TIME STRING,
-- What to do with the missing files (if any):
-- 'RETURN': Return the list of missing files.
-- 'INGEST': Automatically ingest the missing files (and return the list).
-- 'ERROR': Make the procedure fail by throwing an exception.
MODE STRING
)
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
MODE = MODE.toUpperCase();
if (!['RETURN', 'INGEST', 'ERROR'].includes(MODE)) {
throw `Exception: Invalid mode '${MODE}'. Must be one of 'RETURN', 'INGEST' or 'ERROR'`;
}
let tableDB = TABLE_FQN.split('.')[0];
let [pipeDB, pipeSchema, pipeName] = PIPE_FQN.split('.')
.map(name => name.startsWith('"') && name.endsWith('"')
? name.slice(1, -1)
: name.toUpperCase()
);
let listQueryId = snowflake.execute({sqlText: `
LIST @${STAGE_NAME}${PREFIX};
`}).getQueryId();
let missingFiles = snowflake.execute({sqlText: `
WITH staged_files AS (
SELECT
"name" AS name,
TO_TIMESTAMP_NTZ(
"last_modified",
'DY, DD MON YYYY HH24:MI:SS GMT'
) AS last_modified,
-- Add a minute per GB, to account for larger file size = longer ingest time
ROUND("size" / 1024 / 1024 / 1024) AS ingest_delay,
-- Estimate the time by which the ingest should be done (default 5 minutes)
DATEADD(minute, 5 + ingest_delay, last_modified) AS ingest_done_ts
FROM TABLE(RESULT_SCAN('${listQueryId}'))
-- Ignore files that may not be done being ingested yet
WHERE ingest_done_ts < CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP())::TIMESTAMP_NTZ
), loaded_files AS (
SELECT stage_location || file_name AS name
FROM TABLE(
${tableDB}.information_schema.copy_history(
table_name => '${TABLE_FQN}',
start_time => '${START_TIME}'::TIMESTAMP_LTZ
)
)
WHERE pipe_catalog_name = '${pipeDB}'
AND pipe_schema_name = '${pipeSchema}'
AND pipe_name = '${pipeName}'
), stage AS (
SELECT DISTINCT stage_location
FROM TABLE(
${tableDB}.information_schema.copy_history(
table_name => '${TABLE_FQN}',
start_time => '${START_TIME}'::TIMESTAMP_LTZ
)
)
WHERE pipe_catalog_name = '${pipeDB}'
AND pipe_schema_name = '${pipeSchema}'
AND pipe_name = '${pipeName}'
), missing_files AS (
SELECT REPLACE(name, stage_location) AS prefix
FROM staged_files
CROSS JOIN stage
WHERE name NOT IN (
SELECT name FROM loaded_files
)
)
SELECT LISTAGG(prefix, '\n') AS "missing_files"
FROM missing_files;
`});
if (!missingFiles.next()) return null;
missingFiles = missingFiles.getColumnValue('missing_files');
if (missingFiles.length == 0) return null;
if (MODE == 'ERROR') {
throw `Exception: Found missing files:\n'${missingFiles}'`;
}
if (MODE == 'INGEST') {
missingFiles
.split('\n')
.forEach(file => snowflake.execute({sqlText: `
ALTER PIPE ${PIPE_FQN} REFRESH prefix='${file}';
`}));
}
return missingFiles;
$$
;
I am trying to set up a Snowpipe. I have created my warehouse, database and table, and I am trying to stage the files with SnowSQL.
USE WAREHOUSE IoT;
USE DATABASE SNOWPIPE_TEST;
CREATE OR REPLACE STAGE my_stage;
CREATE OR REPLACE FILE_FORMAT r_json;
CREATE OR REPLACE PIPE snowpipe_pipe
AUTO_INGEST = TRUE,
COMMENT = 'add items IoT',
VALIDATION_MODE = RETURN_ALL_ERRORS
AS (COPY INTO snowpipe_test.public.mytable
from @snowpipe_db.public.my_stage
FILE_FORMAT = (type = 'JSON');
CREATE PIPE mypipe AS COPY INTO mytable FROM @my_stage;
I think something is locked but I am not sure.
I tried to save the config file as config1 and made a copy. It hung; then I removed the copy and tried to connect, and there was no error, it just hung.
Am I missing something?
To specify the auto-ingest parameter, it's AUTO_INGEST rather than AUTO-INGEST, but note that this option is not available for an internal stage. So when you try to run this command using an internal stage, it should error with a message pointing this out.
https://docs.snowflake.net/manuals/sql-reference/sql/create-pipe.html#optional-parameters
Also, you don't need the bracket between the "AS" and the "COPY" on line 5.
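Putting those fixes together, the pipe definition would look roughly like this (only a sketch; it assumes my_stage is an external stage with notifications configured, drops the trailing comma after AUTO_INGEST, and drops VALIDATION_MODE, which as far as I know is not accepted in a pipe's COPY statement):
CREATE OR REPLACE PIPE snowpipe_pipe
  AUTO_INGEST = TRUE
  COMMENT = 'add items IoT'
AS
COPY INTO snowpipe_test.public.mytable
FROM @snowpipe_db.public.my_stage
FILE_FORMAT = (type = 'JSON');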
I started to go through the first tutorial for how to load data into Snowflake from a local file.
This is what I have set up so far:
CREATE WAREHOUSE mywh;
CREATE DATABASE Mydb;
Use Database mydb;
CREATE ROLE ANALYST;
grant usage on database mydb to role sysadmin;
grant usage on database mydb to role analyst;
grant usage, create file format, create stage, create table on schema mydb.public to role analyst;
grant operate, usage on warehouse mywh to role analyst;
//tutorial 1 loading data
CREATE FILE FORMAT mycsvformat
TYPE = "CSV"
FIELD_DELIMITER= ','
SKIP_HEADER = 1;
CREATE FILE FORMAT myjsonformat
TYPE="JSON"
STRIP_OUTER_ARRAY = true;
//create stage
CREATE OR REPLACE STAGE my_stage
FILE_FORMAT = mycsvformat;
//Use snowsql for this and make sure that the role, db, and warehouse are selected: put file:///data/data.csv @my_stage;
// put file on stage
PUT file://contacts.csv @my
List @~;
list @%mytable;
Then in my active Snowsql when I run:
Put file:///Users/<user>/Documents/data/data.csv @my_table;
I have confirmed I am in the correct role, ACCOUNTADMIN, but I get:
002003 (02000): SQL compilation error:
Stage 'MYDB.PUBLIC.MY_TABLE' does not exist or not authorized.
So then I try to create the table in Snowsql and am successful:
create or replace table my_table(id varchar, link varchar, stuff string);
I still run into this error after I run:
Put file:///Users/<>/Documents/data/data.csv @my_table;
002003 (02000): SQL compilation error:
Stage 'MYDB.PUBLIC.MY_TABLE' does not exist or not authorized.
What is the difference between putting a file to my_table and to my_stage in this scenario? Thanks for your help!
EDIT:
CREATE OR REPLACE TABLE myjsontable(json variant);
COPY INTO myjsontable
FROM @my_stage/random.json.gz
FILE_FORMAT = (TYPE= 'JSON')
ON_ERROR = 'skip_file';
CREATE OR REPLACE TABLE save_copy_errors AS SELECT * FROM TABLE(VALIDATE(myjsontable, JOB_ID=>'enterid'));
SELECT * FROM SAVE_COPY_ERRORS;
//error for random: Error parsing JSON: invalid character outside of a string: '\\'
//no error for generated
SELECT * FROM Myjsontable;
REMOVE @My_stage pattern = '.*.csv.gz';
REMOVE @My_stage pattern = '.*.json.gz';
//yay you are done!
The PUT command copies the file from your local drive to the stage. You should do the PUT to the stage, not to the table.
put file:///Users/<>/Documents/data/data.csv @my_stage;
The COPY command then loads it from the stage into the table.
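For example, something along these lines, reusing the table and file format from the question (adjust names as needed):
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (FORMAT_NAME = 'mycsvformat');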
But in the documentation it's mentioned that one gets created by default for every table:
Each table has a Snowflake stage allocated to it by default for storing files. This stage is a convenient option if your files need to be accessible to multiple users and only need to be copied into a single table.
Table stages have the following characteristics and limitations:
Table stages have the same name as the table; e.g. a table named mytable has a stage referenced as @%mytable.
So in this case, without creating a stage, it should load into the default Snowflake stage allocated to the table.
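For completeness, using that default table stage would look roughly like this (paths and names as in the question); the COPY can omit the FROM clause because it defaults to the table's own stage:
PUT file:///Users/<user>/Documents/data/data.csv @%my_table;
COPY INTO my_table FILE_FORMAT = (FORMAT_NAME = 'mycsvformat');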
I create a table with this statement:
CREATE TABLE event(
date Date,
src UInt8,
channel UInt8,
deviceTypeId UInt8,
projectId UInt64,
shows UInt32,
clicks UInt32,
spent Float64
) ENGINE = MergeTree(date, (date, src, channel, projectId), 8192);
Raw data looks like:
{ "date":"2016-03-07T10:00:00+0300","src":2,"channel":18,"deviceTypeId ":101, "projectId":2363610,"shows":1232,"clicks":7,"spent":34.72,"location":"Unknown", ...}
...
Files with data loaded with the following command:
cat *.data|sed 's/T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]+0300//'| clickhouse-client --query="INSERT INTO event FORMAT JSONEachRow"
clickhouse-client throws an exception:
Code: 117. DB::Exception: Unknown field found while parsing JSONEachRow format: location: (at row 1)
Is it possible to skip fields from JSON object that not presented in table description?
The latest ClickHouse release (v1.1.54023) supports the input_format_skip_unknown_fields user option, which enables skipping of unknown fields for the JSONEachRow and TSKV formats.
Try
clickhouse-client -n --query="SET input_format_skip_unknown_fields=1; INSERT INTO event FORMAT JSONEachRow;"
See more details in the documentation.
Currently, it is not possible to skip unknown fields.
You may create a temporary table with the additional field, INSERT data into it, and then do an INSERT SELECT into the final table. The temporary table may use the Log engine, and INSERTs into that "staging" table will work faster than into the final MergeTree table.
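A sketch of that workaround, using the table from the question and assuming (purely for illustration) that location is the only extra field in the JSON:
CREATE TABLE event_staging
(
    date Date,
    src UInt8,
    channel UInt8,
    deviceTypeId UInt8,
    projectId UInt64,
    shows UInt32,
    clicks UInt32,
    spent Float64,
    location String
) ENGINE = Log;
-- Load the raw JSON into event_staging (it now declares every field),
-- then copy only the wanted columns into the final MergeTree table:
INSERT INTO event
SELECT date, src, channel, deviceTypeId, projectId, shows, clicks, spent
FROM event_staging;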
It is relatively easy to add the possibility of skipping unknown fields to the code (something like a 'format_skip_unknown_fields' setting).
I'm busy writing a script to restore a database backup and I've run into something strange.
I have a table.sql file which only contains CREATE TABLE statements, like
create table ugroups
(
ug_code char(10) not null ,
ug_desc char(60) not null
);
I have a second data.csv file which only contains delimited data, such as
xyz | dummy data
abc | more nothing
fun | is what this is
Then I have a third index.sql file which only creates indexes, such as
create unique index i_ugroups on ugroups
(ug_code);
I use the commands from the terminal like so
/opt/postgresql/bin/psql -d dbname -c "\i /tmp/table.sql" # loads table.sql
I have a batch script that loads in the data which works perfectly. Then I use the command
/opt/postgresql/bin/psql -d dbname -c "\i /tmp/index.sql" # loads index.sql
When I try to create the unique indexes it is giving me the error
ERROR: could not create unique index "i_ugroups"
DETAIL: Key (ug_code)=(transfers ) is duplicated.
What's strange is that when I execute the table.sql file and the index.sql file together and load the data last, I get no errors and it all works.
Is there something I am missing? Why would it not let me create the unique indexes after the data has been loaded?
There are two rows in your ug_code column with the value "transfers ", and that's why it can't create the unique index.
Why it would succeed if you create the index first, I don't know. But I would suspect that the second time it tries to insert "transfers " into the database, that insert simply fails and the rest of the data gets inserted successfully.
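To see which keys are duplicated before creating the index, a query along these lines should show them (table and column names taken from the question):
SELECT ug_code, count(*) AS cnt
FROM ugroups
GROUP BY ug_code
HAVING count(*) > 1;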