String delimiter present in string not permitted in Polybase? - sql-server

I'm creating an external table over a CSV stored in Azure Data Lake Storage and populating the table using PolyBase in SQL Server.
However, I ran into the error below, and I suspect it is because one particular column contains double quotes within the string values, while the string delimiter has been specified as " in PolyBase (STRING_DELIMITER = '"').
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter
I have done quite extensive research on this and found that the issue has been around for years, yet I have not seen any solutions offered.
Any help will be appreciated.
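For reference, the external file format behind the failing load presumably looks something like the sketch below (the format name is a placeholder):
CREATE EXTERNAL FILE FORMAT ff_quotedCsv  -- hypothetical name
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',  -- a literal " inside a quoted field triggers the error above
        FIRST_ROW = 2,
        ENCODING = 'UTF8'
    )
);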

I think the easiest way to fix this, since you are in charge of the .csv creation, is to use a field delimiter which is not a comma and leave off the string delimiter. Use a separator which you know will not appear in the file. I've used a pipe in my example, and I clean up the string once it has been imported into the database.
A simple example:
IF EXISTS ( SELECT * FROM sys.external_tables WHERE name = 'delimiterWorking' )
DROP EXTERNAL TABLE delimiterWorking
GO
IF EXISTS ( SELECT * FROM sys.tables WHERE name = 'cleanedData' )
DROP TABLE cleanedData
GO
IF EXISTS ( SELECT * FROM sys.external_file_formats WHERE name = 'ff_delimiterWorking' )
DROP EXTERNAL FILE FORMAT ff_delimiterWorking
GO
CREATE EXTERNAL FILE FORMAT ff_delimiterWorking
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '|',
--STRING_DELIMITER = '"',
FIRST_ROW = 2,
ENCODING = 'UTF8'
)
);
GO
CREATE EXTERNAL TABLE delimiterWorking (
id INT NOT NULL,
body VARCHAR(8000) NULL
)
WITH (
LOCATION = 'yourLake/someFolder/delimiterTest6.txt',
DATA_SOURCE = ds_azureDataLakeStore,
FILE_FORMAT = ff_delimiterWorking,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
GO
SELECT *
FROM delimiterWorking
GO
-- Fix up the data
CREATE TABLE cleanedData
WITH (
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
id,
body AS originalCol,
SUBSTRING ( body, 2, LEN(body) - 2 ) cleanBody
FROM delimiterWorking
GO
SELECT *
FROM cleanedData

The string delimiter issue can be avoided if you convert the data lake flat file to Parquet format.
Input:
"ID"
"NAME"
"COMMENTS"
"1"
"DAVE"
"Hi "I am Dave" from"
"2"
"AARO"
"AARO"
Steps:
1. Convert the flat file to Parquet format [using Azure Data Factory]
2. Create an external file format [assuming the master key and scoped credentials are already in place]
CREATE EXTERNAL FILE FORMAT PARQUET_CONV
WITH (FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
3. Create the external table with FILE_FORMAT = PARQUET_CONV, as sketched below.
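A minimal sketch of step 3 (the table name, column list, LOCATION path and data source are placeholders; reuse whatever already exists in your environment):
CREATE EXTERNAL TABLE dbo.CommentsParquet (
    ID INT,
    NAME VARCHAR(100),
    COMMENTS VARCHAR(8000)
)
WITH (
    LOCATION = 'yourLake/someFolder/parquetOutput/',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = PARQUET_CONV
);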
I believe this is the best option, as Microsoft doesn't currently have a solution for handling a string delimiter occurring within the data for external tables.

Related

The merge into command on Snowflake loads only null values when loading data from S3

I'm having issues when running the MERGE INTO command on Snowflake. The data is located in a bucket on S3. The file format is .snappy.parquet.
The command runs well and identifies the files in S3, but it loads only NULL values into the table. The total row count is correct.
I confirmed that @myExternalStageToS3 points to the right location by running a query which returned the expected values:
SELECT
$1:DAY,
$1:CHANNEL_CATEGORY,
$1:SOURCE,
$1:PLATFORM,
$1:LOB
FROM @myExternalStageToS3
(FILE_FORMAT => 'sf_parquet_format')
As it is a new table with no records, the condition uses INSERT.
MERGE INTO myTable as target USING
(
SELECT
$1:DAY,
$1:CHANNEL_CATEGORY,
$1:SOURCE,
$1:PLATFORM,
$1:LOB
FROM @myExternalStageToS3
(FILE_FORMAT => 'sf_parquet_format')
) as src
ON target.CHANNEL_CATEGORY = src.$1:CHANNEL_CATEGORY
AND target.SOURCE = src.$1:SOURCE
WHEN MATCHED THEN
UPDATE SET
DAY= src.$1:DAY
,CHANNEL_CATEGORY= src.$1:CHANNEL_CATEGORY
,SOURCE= src.$1:SOURCE
,PLATFORM= src.$1:PLATFORM
,LOB= src.$1:LOB
WHEN NOT MATCHED THEN
INSERT
(
DAY,
CHANNEL_CATEGORY,
SOURCE,
PLATFORM,
LOB
) VALUES
(
src.$1:DAY,
src.$1:CHANNEL_CATEGORY,
src.$1:SOURCE,
src.$1:PLATFORM,
src.$1:LOB
);
The sf_parquet_format file format was created with these details:
create or replace file format sf_parquet_format
type = 'parquet'
compression = auto;
Do you have any idea what I am missing?
The query inside the USING part should be altered (data type casts and aliases added):
MERGE INTO myTable as target USING (
SELECT
$1:DAY::TEXT AS DAY,
$1:CHANNEL_CATEGORY::TEXT AS CHANNEL_CATEGORY,
$1:SOURCE::TEXT AS SOURCE,
$1:PLATFORM::TEXT AS PLATFORM,
$1:LOB::TEXT AS LOB
FROM @myExternalStageToS3
(FILE_FORMAT => 'sf_parquet_format')
) as src
ON target.CHANNEL_CATEGORY = src.CHANNEL_CATEGORY
AND target.SOURCE = src.SOURCE
WHEN MATCHED THEN
UPDATE SET
DAY= src.DAY
,PLATFORM= src.PLATFORM
,LOB= src.LOB
WHEN NOT MATCHED THEN
INSERT (
DAY,
CHANNEL_CATEGORY,
SOURCE,
PLATFORM,
LOB
) VALUES (
src.DAY,
src.CHANNEL_CATEGORY,
src.SOURCE,
src.PLATFORM,
src.LOB
);
The UPDATE part does not require CHANNEL_CATEGORY = src.CHANNEL_CATEGORY or SOURCE = src.SOURCE, as that condition is already met by the ON clause.

Cannot insert Array in Snowflake

I have a CSV file with the following data:
eno | phonelist | shots
"1" | "['1112223333','6195551234']" | "[[11,12]]"
The DDL statement I have used to create table in snowflake is as follows:
CREATE TABLE ArrayTable (eno INTEGER, phonelist array,shots array);
I need to insert the data from the CSV into the Snowflake table and the method I have used is:
create or replace stage ArrayTable_stage file_format = (TYPE=CSV)
put file://ArrayTable @ArrayTable_stage auto_compress=true
copy into ArrayTable from @ArrayTable_stage/ArrayTable.gz
file_format = (TYPE=CSV FIELD_DELIMITER='|' FIELD_OPTIONALLY_ENCLOSED_BY='\"\')
But when I try to run the code, I get the error:
Copy to table failed: 100069 (22P02): Error parsing JSON:
('1112223333','6195551234')
How to resolve this?
Based on the row you have, FIELD_OPTIONALLY_ENCLOSED_BY='\"\' should just be '\"'.
select parse_json('[\'1112223333\',\'6195551234\']');
works (the backslashes are there to get around the SQL parser),
but your output has parens ( and ), which is different.
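To make that concrete, a quick illustrative check (TRY_PARSE_JSON returns NULL instead of raising an error for invalid JSON):
select try_parse_json('(\'1112223333\',\'6195551234\')');  -- NULL: parens do not form valid JSON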
SELECT column2, TRY_PARSE_JSON(column2) as j
FROM @ArrayTable_stage/ArrayTable.gz
file_format = (TYPE=CSV FIELD_DELIMITER='|' FIELD_OPTIONALLY_ENCLOSED_BY='\"')
WHERE j is null;
will show which values are failing to parse.
Failing that, you might want to use TO_ARRAY to parse column2 and thus insert into your table the selected/transformed data that is failing to auto-transform; a rough sketch follows.
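A rough sketch of that transform-on-copy, reusing the stage, file and table names above (and assuming TRY_PARSE_JSON and TO_ARRAY are accepted in your copy transformation):
copy into ArrayTable
from (
    select
        $1::integer,                   -- eno
        to_array(try_parse_json($2)),  -- phonelist; NULL if it still cannot be parsed
        to_array(try_parse_json($3))   -- shots
    from @ArrayTable_stage/ArrayTable.gz
)
file_format = (TYPE=CSV FIELD_DELIMITER='|' FIELD_OPTIONALLY_ENCLOSED_BY='"');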

Converting a table in one form to another using Snowflake

I am trying to load a CSV file into Snowflake. The sample format of the input CSV table in the S3 location is as follows (with 2 columns: ID, Location_count):
Input csv table
I need to transform it into the below format (with 3 columns: ID, Location, Count):
Output csv table
However, when I try to load the input file using the following queries (after creating the database, external stage and file format), it returns LOAD_FAILED:
create or replace table table_name
(
id integer,
Location_count variant
);
select parse_json(Location_count) as c;
list @stage_name;
copy into table_name from @stage_name file_format = 'fileformatname' on_error = 'continue';
You will probably need to PARSE_JSON that 2nd column as part of a copy transformation. For example:
create file format myformat
type = csv field_delimiter = ','
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
create or replace stage csv_stage file_format = (format_name = myformat);
copy into @csv_stage from
( select '1',
'{"SHS-TRN":654738,"PRN-UTN":78956,"NCT-JHN":96767}') ;
create or replace table blah (id integer, something variant);
copy into blah from (select $1, parse_json($2) from @csv_stage);
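The answer above stops at loading the VARIANT; to get to the three-column shape (ID, Location, Count) described in the question, a LATERAL FLATTEN over the loaded column would look roughly like this (table and column names follow the example above):
select
    b.id,
    f.key            as location,
    f.value::integer as count
from blah b,
     lateral flatten(input => b.something) f;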

SQL Server: How to remove a key from a Json object

I have a query like (simplified):
SELECT
JSON_QUERY(r.SerializedData, '$.Values') AS [Values]
FROM
<TABLE> r
WHERE ...
The result is like this:
{ "2019":120, "20191":120, "201902":121, "201903":134, "201904":513 }
How can I remove the entries with a key length of less than 6?
Result:
{ "201902":121, "201903":134, "201904":513 }
One possible solution is to parse the JSON and generate it again using string manipulation, keeping only the keys with the desired length:
Table:
CREATE TABLE Data (SerializedData nvarchar(max))
INSERT INTO Data (SerializedData)
VALUES (N'{"Values": { "2019":120, "20191":120, "201902":121, "201903":134, "201904":513 }}')
Statement (for SQL Server 2017+):
UPDATE Data
SET SerializedData = JSON_MODIFY(
SerializedData,
'$.Values',
JSON_QUERY(
(
SELECT CONCAT('{', STRING_AGG(CONCAT('"', [key] ,'":', [value]), ','), '}')
FROM OPENJSON(SerializedData, '$.Values') j
WHERE LEN([key]) >= 6
)
)
)
SELECT JSON_QUERY(d.SerializedData, '$.Values') AS [Values]
FROM Data d
Result:
Values
{"201902":121,"201903":134,"201904":513}
Notes:
It's important to note that JSON_MODIFY() in lax mode deletes the specified key if the new value is NULL and the path points to a JSON object. But in this specific case (a JSON object with variable key names), I prefer the above solution.
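For illustration, deleting one known key that way looks like this (sample values made up):
DECLARE @json nvarchar(max) = N'{ "2019":120, "201902":121, "201903":134 }';
-- In lax mode, setting the key to NULL removes it from the JSON object
SELECT JSON_MODIFY(@json, '$."2019"', NULL) AS Trimmed;
-- Trimmed (roughly): { "201902":121, "201903":134 }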

snowflake - How to use a file format to decode a csv column?

I've got some data in a string column that is in a strange csv format. I can write a file format that correctly interprets it. How do I use my file format against data that has already been imported?
create table test_table
(
my_csv_column string
)
How do I split/flatten this column with:
create or replace file format my_csv_file_format
type = 'CSV'
RECORD_DELIMITER = '0x0A'
field_delimiter = ' '
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
VALIDATE_UTF8 = FALSE
Please assume that I cannot use split, as I want to use the rich functionality of the file format (optional escape characters, date recognition etc.).
What I'm trying to achieve is something like the below (but I cannot find out how to do it):
copy into destination_Table
from
(select
s.$1
,s.$2
,s.$3
,s.$4
from test_table s
file_format = (column_name ='my_csv_column' , format_name = 'my_csv_file_format'))
