Error parsing JSON exception for XML field in COPY command in Snowflake - snowflake-cloud-data-platform

Hi, I have declared a table like this:
create or replace table app_event (
ID varchar(36) not null primary key,
VERSION number,
ACT_TYPE varchar(255),
EVE_TYPE varchar(255),
CLI_ID varchar(36),
DETAILS variant,
OBJ_TYPE varchar(255),
DATE_TIME timestamp,
AAPP_EVENT_TO_UTC_DT timestamp,
GRO_ID varchar(36),
OBJECT_NAME varchar(255),
OBJ_ID varchar(255),
USER_NAME varchar(255),
USER_ID varchar(255),
EVENT_ID varchar(255),
FINDINGS varchar(255),
SUMMARY variant
);
The DETAILS column will contain an XML document so that I can run XML functions and get elements of that XML.
My sample row looks like this:
dfjkghdfkjghdf8gd7f7997,0,TEST_CASE,CHECK,74356476476DFD,<?xml version="1.0" encoding="UTF-8"?><testPayload><testId>3495864795uiyiu</testId><testCode>COMPLETED</testCode><testState>ONGOING</testState><noOfNewTest>1</noOfNewTest><noOfReviewRequiredTest>0</noOfReviewRequiredTest><noOfExcludedTest>0</noOfExcludedTest><noOfAutoResolvedTest>1</noOfAutoResolvedTest><testerTypes>WATCHLIST</testerTypes></testPayload>,CASE,41:31.3,NULL,948794853948dgjd,(null),dfjkghdfkjghdf8gd7f7997,test user,dfjkghdfkjghdf8gd7f7997,NULL,(null),(null)
When I declare DETAILS as varchar I am able to load the file, but when I declare it as variant I get the below error for that column only:
Error parsing JSON:
dfjkghdfkjghdf8gd7f7997COMPLETED</status
File 'SNOWFLAKE/Sudarshan.csv', line 1, character 89 Row 1, column
"AUDIT_EVENT"["DETAILS":6]
Can you please help with this?
I cannot use varchar, as I also need to query elements of the XML in my queries.
This is how I load into the table; I use the default CSV format and the file is available in S3.
COPY INTO demo_db.public.app_event
FROM @my_s3_stage/
FILES = ('app_Even.csv')
file_format=(type='CSV');
Based on the answer below, this is how I am loading:
copy into demo_db.public.app_event from (
select
$1,$2,$3,$4,$5,
parse_xml($6),$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,parse_xml($17)
from @~/Audit_Even.csv d
)
file_format = (
type = CSV
)
But when I execute it, it says zero rows processed, and my stage is not referenced here.

If you are using a COPY INTO statement, then you need to put in a subquery to convert the data before loading it into the table. Use parse_xml within your COPY statement's subquery, something like this:
copy into app_event from (
select
$1,
parse_xml($2) -- <---- "$2" is the column number in the CSV that contains the xml
from @~/test.csv.gz d -- <---- This is my own internal user stage. You'll need to change this to your external stage or whatever
)
file_format = (
type = CSV
)
It is hard to provide you with a good SQL statement without a full example of your existing code (your COPY / INSERT statement). In my example above, I'm copying a file from my own user stage (@~/test.csv.gz) with the default CSV file format options. You are likely using an external stage, but it should be easy to adapt this to your own example.
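For reference, a sketch adapted to the stage and file from the question (stage name, file name, and column positions are assumptions taken from the post above), followed by one way the loaded XML could be queried with XMLGET:
copy into demo_db.public.app_event from (
select
$1,$2,$3,$4,$5,
parse_xml($6), -- DETAILS
$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,
parse_xml($17) -- SUMMARY
from @my_s3_stage/app_Even.csv
)
file_format = (type = 'CSV');
-- Once loaded, elements of the XML can be read from the VARIANT column:
select xmlget(details, 'testCode'):"$"::varchar as test_code
from demo_db.public.app_event;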

Related

SQL Server deserialize XML into table

I have generated many XML fragments with the SQL Server command FOR XML RAW.
In that way I have filled a table with the following schema:
ACTION as CHAR(1)
TABLE_NAME as NVARCHAR(25)
PAYLOAD as NVARCHAR(MAX)
ACTION  TABLE_NAME  PAYLOAD
I       tbl_1       xmlrow_1
I       tbl_1       xmlrow_2
U       tbl_1       xmlrow_3
D       tbl_1       xmlrow_4
D       tbl_2       xmlrow_5
ACTION is a char (I = insert, U = update, D = delete)
TABLE_NAME is the table on which I have to act (to insert, update, or delete the data)
PAYLOAD is XML serialized by SQL Server using the command FOR XML RAW on the original table
PAYLOAD example:
<row COL1="val_col_1" COL2="val_col_2" .. COLN="val_col_n"/>
I am looking for a way (I am writing a stored procedure, so I am looking for T-SQL) to deserialize it into the "configured" TABLE_NAME, ideally as an UPSERT.
If this is not feasible (as I suspect), I will build the SQL script for insert, update, or delete dynamically, but I still need to deserialize the XML in PAYLOAD and don't know how to do it.
I mean, if there is no better way, how can I do something like this?
UPDATE [dbo].[tbl_1]
SET [COL1] = CURRENT_ROW.COL1
,[COL2] = CURRENT_ROW.COL2
,[COL3] = CURRENT_ROW.COL3
FROM ( xmlrow_3 --DESERIALIZE ) AS CURRENT_ROW
--EDITED: added a working example in db<>fiddle
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=ac5925a0a2f93791cc7e7c34179137ae
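For the deserialization step itself, a minimal T-SQL sketch (column names and types here are assumptions, not from the thread): cast PAYLOAD to xml and shred the row attributes with nodes()/.value(), which can then drive the UPDATE/MERGE:
DECLARE @payload xml = N'<row COL1="val_col_1" COL2="val_col_2" COL3="val_col_3"/>';
SELECT
r.value('@COL1', 'nvarchar(100)') AS COL1,
r.value('@COL2', 'nvarchar(100)') AS COL2,
r.value('@COL3', 'nvarchar(100)') AS COL3
FROM @payload.nodes('/row') AS x(r);
-- The same pattern can drive the UPDATE from the question, e.g.:
-- UPDATE t SET t.COL2 = r.value('@COL2', 'nvarchar(100)'), t.COL3 = r.value('@COL3', 'nvarchar(100)')
-- FROM [dbo].[tbl_1] AS t
-- CROSS APPLY @payload.nodes('/row') AS x(r)
-- WHERE t.COL1 = r.value('@COL1', 'nvarchar(100)');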

another case of "Table options do not contain an option key 'connector'"

I have read the linked question Table options do not contain an option key 'connector'.
It said we should set the format.
But my scenario is datagen -> hive.
Here's my complete example (it's wrong now):
drop table if exists datagen;
CREATE TABLE datagen (
f_sequence INT,
f_random INT,
f_random_str STRING,
ts AS localtimestamp,
WATERMARK FOR ts AS ts
) WITH (
'connector' = 'datagen',
-- optional options --
'rows-per-second'='5',
'fields.f_sequence.kind'='sequence',
'fields.f_sequence.start'='1',
'fields.f_sequence.end'='50', -- this limits the total number of rows that will be generated
'fields.f_random.min'='1',
'fields.f_random.max'='50',
'fields.f_random_str.length'='10'
);
SET table.sql-dialect=hive;
drop table if exists hive_table;
CREATE TABLE hive_table (
f_sequence INT,
f_random INT,
f_random_str STRING
) PARTITIONED BY (dt STRING, hr STRING, mi STRING) STORED AS parquet TBLPROPERTIES (
'partition.time-extractor.timestamp-pattern'='$dt $hr:$mi:00',
'sink.partition-commit.trigger'='partition-time',
'sink.partition-commit.delay'='1 min',
'sink.partition-commit.policy.kind'='metastore,success-file'
);
Flink SQL> insert into hive_table select f_sequence,f_random,f_random_str ,DATE_FORMAT(ts, 'yyyy-MM-dd'), DATE_FORMAT(ts, 'HH') ,DATE_FORMAT(ts, 'mm') from datagen;
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.ValidationException: Table options do not contain an option key 'connector' for discovering a connector.
Is the solution from the above link suitable for this case?
Need your help, thanks!
Please use SET table.sql-dialect=default; before calling insert into hive_table .... The insert into hive_table ... statement reads from the datagen connector, which the Hive dialect does not support.
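In other words, the corrected statement order would look roughly like this (a sketch; the table definitions are assumed unchanged from the question):
SET table.sql-dialect=hive;
-- CREATE TABLE hive_table ... (as above)
SET table.sql-dialect=default;
insert into hive_table
select f_sequence, f_random, f_random_str,
DATE_FORMAT(ts, 'yyyy-MM-dd'), DATE_FORMAT(ts, 'HH'), DATE_FORMAT(ts, 'mm')
from datagen;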

SQL Data Warehouse External Table with String fields

I am unable to find a way to create an external table in Azure SQL Data Warehouse (Synapse SQL pool) with PolyBase where some fields contain embedded commas.
For a CSV file with 4 columns as below:
myresourcename,
myresourcelocation,
"""resourceVersion"": ""windows"",""deployedBy"": ""john"",""project_name"": ""test_project""",
"{ ""ResourceType"": ""Network"", ""programName"": ""v1""}"
I tried the following CREATE EXTERNAL FILE FORMAT and CREATE EXTERNAL TABLE statements.
CREATE EXTERNAL FILE FORMAT my_format
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS(
FIELD_TERMINATOR=',',
STRING_DELIMITER='"',
First_Row = 2
)
);
CREATE EXTERNAL TABLE my_external_table
(
resourceName VARCHAR,
resourceLocation VARCHAR,
resourceTags VARCHAR,
resourceDetails VARCHAR
)
WITH (
LOCATION = 'my/location/',
DATA_SOURCE = my_source,
FILE_FORMAT = my_format
)
But querying this table gives the following error:
Failed to execute query. Error: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Too many columns in the line.
Any help will be appreciated.
Currently this is not supported in PolyBase; you need to modify the input data accordingly to get it working.

COPY INTO query on Snowflake returns TABLE does not exist error

I am trying to load data from Azure Blob Storage.
The data has already been staged.
But the issue is when I try to run:
copy into random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
Below is the error I encounter:
raise error_class(
snowflake.connector.errors.ProgrammingError: 001757 (42601): SQL compilation error:
Table 'random_table_name' does not exist
Basically, it says the table does not exist, which it doesn't, but the syntax on the website is the same as mine.
In my case the table name was case-sensitive. Snowflake converts unquoted identifiers to upper case. I changed the database/schema/table names to all upper case and it started working.
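For example, a sketch (names are placeholders, not from the thread) of how a quoted lower-case name behaves:
create table "random_table_name" (field1 varchar); -- quoted, so the name stays lower case
copy into random_table_name from @stage_name_i_created file_format=(type='csv'); -- fails: unquoted name resolves to RANDOM_TABLE_NAME
copy into "random_table_name" from @stage_name_i_created file_format=(type='csv'); -- works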
First run the below query to fetch the column headers
select $1 FROM @stage_name_i_created/filename.csv limit 1
Assuming the below is the header line from your CSV file:
id;first_name;last_name;email;age;location
Create a CSV file format:
create or replace file format semicolon
type = 'CSV'
field_delimiter = ';'
skip_header=1;
Then you should define the data type and field name for each column as below:
create or replace table <yourtable> as
select $1::varchar as id
,$2::varchar as first_name
,$3::varchar as last_name
,$4::varchar as email
,$5::int as age
,$6::varchar as location
FROM @stage_name_i_created/yourfile.csv
(file_format => 'semicolon');
The table must exist prior to running a COPY INTO command. In your post, you say that the table does not exist...so that is your issue.
If your table exists, try forcing the full table path like this:
copy into <database>.<schema>.<random_table_name>
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
or by setting the context in steps like this:
use database <database_name>;
use schema <schema_name>;
copy into database.schema.random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv';
rbachkaniwala, what do you mean by 'How do I create a table? (according to Snowflake syntax it is not possible to create empty tables)'?
You can just do the below to create a table:
CREATE TABLE random_table_name (FIELD1 VARCHAR, FIELD2 VARCHAR)
The table does need to exist. You should check the documentation for COPY INTO.
Other areas to consider are
do you have the right context set for the database & schema
does the user / role have access to the table or object.
It basically seems like you don't have the table defined yet. You should
ensure the table is created
ensure all columns in the CSV exist as columns in the table
ensure the order of the columns is the same as in the CSV
I'd check data types too.
"COPY INTO" is not a query command, it is the actual data transfer execution from source to destination, which both must exist as others commented here but If you want just to query without loading the files then run the following SQL:
//Display list of files in the stage to verify stage
LIST @stage_name_i_created;
//Create a file format
CREATE OR REPLACE FILE FORMAT RANDOM_FILE_CSV
type = csv
COMPRESSION = 'GZIP' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n' SKIP_HEADER = 0 FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = 'NONE' DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
//Now select the data in the files
Select $1 as first_col, $2 as second_col //add as many columns as necessary ...etc
from @stage_name_i_created
(FILE_FORMAT => RANDOM_FILE_CSV)
More information can be found in the documentation link here
https://docs.snowflake.com/en/user-guide/querying-stage.html

Presto: How to read from s3 an entire bucket that is partitioned in sub-folders?

I need to read an entire dataset from S3 using Presto; it sits in "bucket-a". But inside the bucket, the data was saved in sub-folders by year. So I have a bucket that looks like this:
Bucket-a>2017>data
Bucket-a>2018>more data
Bucket-a>2019>more data
All of the above data belongs to the same table but was saved this way in S3. Notice that in bucket-a itself there is no data, only inside each folder.
What I have to do is read all the data from the bucket as a single table, adding the year as a column or partition.
I tried it this way, but it didn't work:
CREATE TABLE hive.default.mytable (
col1 int,
col2 varchar,
year int
)
WITH (
format = 'json',
partitioned_by = ARRAY['year'],
external_location = 's3://bucket-a/' --also tried 's3://bucket-a/year/'
)
and also
CREATE TABLE hive.default.mytable (
col1 int,
col2 varchar,
year int
)
WITH (
format = 'json',
bucketed_by = ARRAY['year'],
bucket_count = 3,
external_location = 's3://bucket-a/' --also tried 's3://bucket-a/year/'
)
None of the above worked.
I have seen people writing partitioned data to S3 using Presto, but what I'm trying to do is the opposite: read data from S3 that is already split into folders as a single table.
Thanks.
If your folders were following the Hive partition folder naming convention (year=2019/), you could declare the table as partitioned and just use the system.sync_partition_metadata procedure in Presto.
Now, your folders do not follow the convention, so you need to register each one individually as a partition using the system.register_partition procedure (available starting in Presto 330, about to be released). (The alternative to register_partition is to run an appropriate ADD PARTITION in the Hive CLI.)
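As a sketch (the catalog, schema, and table names come from the CREATE TABLE in the question; the exact call is an assumption based on the procedure's documented signature), each year folder would be registered roughly like this, provided the table is declared with partitioned_by = ARRAY['year']:
CALL hive.system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2017'], 's3://bucket-a/2017/');
CALL hive.system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2018'], 's3://bucket-a/2018/');
CALL hive.system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2019'], 's3://bucket-a/2019/');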
