(Posting this Q&A sequence in SO as I'm sure some newer users may be running into similar "speed bumps". -G)
How to handle special characters in Snowpipe
I am creating a Snowpipe based on a CSV file. My CSV file contains special characters in a few columns. Please let me know how to write a SELECT statement in the Snowpipe that will take care of any special characters.
To handle special characters you need to escape them.
There are two ways to escape special characters, but unfortunately each of them requires you to modify the file:
1) You can escape a special character by duplicating it (so to escape a ' you make it '').
2) When defining your file format you can add the ESCAPE parameter to define an explicit escape character. For example, you could use ESCAPE = '\\' and then add a single \ character before each of the special characters you want to escape, as sketched below.
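For example, option 2 could be defined like this; a minimal sketch, where the format name is an assumption and not from the original post:
-- Hypothetical file format that treats backslash as the escape character
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  ESCAPE = '\\';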
The Snowpipe definition embeds a COPY statement, which can contain a SELECT statement (a transformation), and we can use string functions there to remove the special characters (see the sketch after the syntax reference below).
/* Data load with transformation */
COPY INTO [<namespace>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
FROM ( SELECT [<alias>.]$<file_col_num>[.<element>] [ , [<alias>.]$<file_col_num>[.<element>] ... ]
FROM { internalStage | externalStage } )
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
The above syntax is taken from the following URL: https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html
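As an illustration, here is a minimal sketch of a pipe that applies such a transformation; the pipe, stage, table, file format, and column names are assumptions, not taken from the original post:
-- Hypothetical pipe: REGEXP_REPLACE strips characters outside a simple whitelist during the load
CREATE OR REPLACE PIPE my_db.my_schema.my_pipe AS
  COPY INTO my_db.my_schema.my_table (id, clean_col)
  FROM (
    SELECT t.$1,
           REGEXP_REPLACE(t.$2, '[^A-Za-z0-9 ,.-]', '')
    FROM @my_db.my_schema.my_stage t
  )
  FILE_FORMAT = (FORMAT_NAME = 'my_db.my_schema.my_csv_format');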
NOTE: I'd be interested to see if anyone else has utilized an alternative solution with success... -G
Related
In the Snowflake Documentation for "Querying Data in Staged Files" why is the syntax for the "Pattern" & "Format" parameters "=>" instead of "=" whereas for the COPY INTO syntax the "Pattern" & "Format" parameters have "="?
The documentation doesn't mention anything about this difference so I'm confused.
">=" means Greater than or Equal to
"<=" means Less than or Equal to
So, what the hell does "=>" mean?
Link to the documentation for "Querying Data in Staged Files": https://docs.snowflake.com/en/user-guide/querying-stage.html
Link to the documentation for "COPY INTO ": https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
Link to the documentation for Snowflake Query Operators: https://docs.snowflake.com/en/sql-reference/operators.html
In general, when you define a function or stored procedure, it has a specific signature. This signature has to be matched when the routine is called.
Example:
CREATE OR REPLACE FUNCTION test(a INT, b TEXT)
RETURNS TEXT
AS
$$
CONCAT(a, ' ', b)
$$;
SHOW FUNCTIONS LIKE 'TEST';
-- TEST(NUMBER, VARCHAR) RETURN VARCHAR
When calling the test function, the argument order has to match its signature ("positional notation"):
SELECT test(1, 'b');
-- 1 b
Unfortunately it is not possible to use named parameters for user-defined objects to explicitly state the parameters ("named notation"); none of the following work:
SELECT test(a => 1, b => 'b');
SELECT test(b => 'b', a => 1);
SELECT test(b => 'b');
Some built-in constructs, however, do allow named parameters with => (for instance FLATTEN or the staged file clause; a staged-file example follows the FLATTEN ones below).
Using FLATTEN, as it is easier to produce a self-contained example:
FLATTEN( INPUT => <expr> [ , PATH => <constant_expr> ]
[ , OUTER => TRUE | FALSE ]
[ , RECURSIVE => TRUE | FALSE ]
[ , MODE => 'OBJECT' | 'ARRAY' | 'BOTH' ] )
All 3 invocations are correct:
-- no explicit parameters names
SELECT * FROM TABLE(FLATTEN(parse_json('{"a":1, "b":[77,88]}'), 'b')) f;
-- parameters names order: input, path
SELECT * FROM TABLE(FLATTEN(input => parse_json('{"a":1, "b":[77,88]}'), path => 'b')) f;
-- parameters names order: path, input
SELECT * FROM TABLE(FLATTEN(path => 'b', input => parse_json('{"a":1, "b":[77,88]}'))) f;
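The staged file clause from the question uses the same named-argument style; a minimal sketch, assuming a stage named my_stage and a file format named my_csv_format:
-- Named parameters (=>) when querying files directly in a stage
SELECT t.$1, t.$2
FROM @my_stage (FILE_FORMAT => 'my_csv_format', PATTERN => '.*[.]csv') t;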
I'm creating an external table using a CSV stored in Azure Data Lake Storage and populating the table using Polybase in SQL Server.
However, I ran into this problem and figured it may be due to the fact that in one particular column there are double quotes present within the string, and the string delimiter has been specified as " in Polybase (STRING_DELIMITER = '"').
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter
I have done quite extensive research on this and found that this issue has been around for years, but I have yet to see any solutions given.
Any help will be appreciated.
I think the easiest way to fix this, because you are in charge of the .csv creation, is to use a delimiter which is not a comma and to leave off the string delimiter. Use a separator which you know will not appear in the file. I've used a pipe in my example, and I clean up the string once it is imported into the database.
A simple example:
IF EXISTS ( SELECT * FROM sys.external_tables WHERE name = 'delimiterWorking' )
DROP EXTERNAL TABLE delimiterWorking
GO
IF EXISTS ( SELECT * FROM sys.tables WHERE name = 'cleanedData' )
DROP TABLE cleanedData
GO
IF EXISTS ( SELECT * FROM sys.external_file_formats WHERE name = 'ff_delimiterWorking' )
DROP EXTERNAL FILE FORMAT ff_delimiterWorking
GO
CREATE EXTERNAL FILE FORMAT ff_delimiterWorking
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '|',
--STRING_DELIMITER = '"',
FIRST_ROW = 2,
ENCODING = 'UTF8'
)
);
GO
CREATE EXTERNAL TABLE delimiterWorking (
id INT NOT NULL,
body VARCHAR(8000) NULL
)
WITH (
LOCATION = 'yourLake/someFolder/delimiterTest6.txt',
DATA_SOURCE = ds_azureDataLakeStore,
FILE_FORMAT = ff_delimiterWorking,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
GO
SELECT *
FROM delimiterWorking
GO
-- Fix up the data
CREATE TABLE cleanedData
WITH (
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
id,
body AS originalCol,
SUBSTRING ( body, 2, LEN(body) - 2 ) cleanBody
FROM delimiterWorking
GO
SELECT *
FROM cleanedData
The string delimiter issue can be avoided if you convert the data lake flat file to Parquet format.
Input:
"ID"
"NAME"
"COMMENTS"
"1"
"DAVE"
"Hi "I am Dave" from"
"2"
"AARO"
"AARO"
Steps:
1. Convert the flat file to Parquet format (using Azure Data Factory).
2. Create an external file format in the data lake (assuming the master key and scoped credentials are available):
CREATE EXTERNAL FILE FORMAT PARQUET_CONV
WITH (FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
3. Create the external table with FILE_FORMAT = PARQUET_CONV, as sketched below.
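A minimal sketch of step 3, reusing the table, location, and data source names from the earlier example (adjust them to your environment):
-- Hypothetical external table over the converted Parquet file
CREATE EXTERNAL TABLE delimiterParquet (
    id INT NOT NULL,
    body VARCHAR(8000) NULL
)
WITH (
    LOCATION = 'yourLake/someFolder/delimiterTest6.parquet',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = PARQUET_CONV
);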
I believe this is the best option, as Microsoft currently doesn't have a solution for handling a string delimiter occurring within the data for external tables.
I have a unique situation while loading data from a csv file into Snowflake.
I have multiple columns that need some re-work:
1) Columns enclosed in " that contain commas - this case is handled properly.
2) Columns that are enclosed in " but also contain " within the data, i.e. ("\"DataValue\"").
My file format is defined as follows:
ALTER FILE FORMAT DB.SCHEMA.FF_CSV_TEST
SET COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
ESCAPE = NONE
ESCAPE_UNENCLOSED_FIELD = 'NONE'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
My columns enclosed in " that contain commas are being handled fine. However, the remaining columns that resemble ("\"DataValue\"") are returning errors:
Found character 'V' instead of field delimiter ','
Are there any ways to handle this?
I have attempted using a select against the stage itself:
select t.$1, t.$2, t.$3, t.$4, t.$5, TRIM(t.$6,'"')
from @STAGE_TEST/file.csv.gz t
LIMIT 1000;
with t.$5 being the column enclosed with " and containing commas
and t.$6 being the ( "\"DataValue\"")
Are there any other options besides developing Python (or other) code that strips this out before processing into Snowflake?
Add the \ to your escape parameter. It looks like your quote values are properly escaped, so that should take care of those quotes.
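A hedged sketch of that change applied to the file format from the question:
-- Make backslash the escape character, so \" inside an enclosed field is read as a literal quote
ALTER FILE FORMAT DB.SCHEMA.FF_CSV_TEST
SET ESCAPE = '\\';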
(Submitting on behalf of a Snowflake User)
For example - ""NiceOne"" LLC","Robert","GoodRX",,"Maxift","Brian","P,N and B","Jane"
I have been able to create a file format that satisfies each of these conditions, but not one that satisfies all three.
I've used the following recommendation:
Your first column is malformed, missing the initial ", it should be:
"""NiceOne"" LLC"
After fixing that, you should be able to load your data with almost
default settings,
COPY INTO my_table FROM @my_stage/my_file.csv
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');
...but the above format returns:
"""NiceOne"" LLC","Robert","GoodRX","","Maxift","Brian","P,N and B","Jane"
I don't want quotes around empty fields. I'm looking for
"""NiceOne"" LLC","Robert","GoodRX",,"Maxift","Brian","P,N and B","Jane"
Any recommendations?
If you use the following, you will not get quotes around NULL fields, but you will get quotes on '' (empty text). You can always concatenate the fields and format the resulting line manually if this doesn't suit you (a sketch of that approach follows the example below).
COPY INTO @my_stage/my_file.CSV
FROM (
SELECT
'"NiceOne" LLC' A, 'Robert' B, 'GoodRX' C, NULL D,
'Maxift' E, 'Brian' F, 'P,N and B' G, 'Jane' H
)
FILE_FORMAT = (
TYPE = CSV
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ()
COMPRESSION = NONE
)
OVERWRITE = TRUE
SINGLE = TRUE
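If that behavior still doesn't match the required output, here is a hedged sketch of the manual-concatenation approach mentioned above; the table and column names are assumptions, not from the original post:
-- Each row is assembled into a single string, so empty/NULL values get no surrounding quotes at all
COPY INTO @my_stage/my_file.csv
FROM (
  SELECT '"' || REPLACE(company, '"', '""') || '",'
      || '"' || contact || '",'
      || '"' || vendor || '",'
      || IFF(middle_col IS NULL OR middle_col = '', '', '"' || middle_col || '"') || ','
      || '"' || partner || '"'
  FROM my_table
)
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = NONE COMPRESSION = NONE)
OVERWRITE = TRUE
SINGLE = TRUE;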
I am using the mysql2sqlite.sh script from GitHub to change my MySQL database to SQLite. But the problem I am getting is that in my table the data 'E-001' gets changed to 'E?001'.
I have no idea how to modify the script to get the required result. Please help me.
The script is:
#!/bin/sh
# Converts a mysqldump file into a Sqlite 3 compatible file. It also extracts the MySQL `KEY xxxxx` from the
# CREATE block and create them in separate commands _after_ all the INSERTs.
# Awk is chosen because it's fast and portable. You can use gawk, original awk or even the lightning fast mawk.
# The mysqldump file is traversed only once.
# Usage: $ ./mysql2sqlite mysqldump-opts db-name | sqlite3 database.sqlite
# Example: $ ./mysql2sqlite --no-data -u root -pMySecretPassWord myDbase | sqlite3 database.sqlite
# Thanks to @artemyk and @gkuenning for their nice tweaks.
mysqldump --compatible=ansi --skip-extended-insert --compact "$@" | \
awk '
BEGIN {
FS=",$"
print "PRAGMA synchronous = OFF;"
print "PRAGMA journal_mode = MEMORY;"
print "BEGIN TRANSACTION;"
}
# CREATE TRIGGER statements have funny commenting. Remember we are in trigger.
/^\/\*.*CREATE.*TRIGGER/ {
gsub( /^.*TRIGGER/, "CREATE TRIGGER" )
print
inTrigger = 1
next
}
# The end of CREATE TRIGGER has a stray comment terminator
/END \*\/;;/ { gsub( /\*\//, "" ); print; inTrigger = 0; next }
# The rest of triggers just get passed through
inTrigger != 0 { print; next }
# Skip other comments
/^\/\*/ { next }
# Print all `INSERT` lines. The single quotes are protected by another single quote.
/INSERT/ {
gsub( /\\\047/, "\047\047" )
gsub(/\\n/, "\n")
gsub(/\\r/, "\r")
gsub(/\\"/, "\"")
gsub(/\\\\/, "\\")
gsub(/\\\032/, "\032")
print
next
}
# Print the `CREATE` line as is and capture the table name.
/^CREATE/ {
print
if ( match( $0, /\"[^\"]+/ ) ) tableName = substr( $0, RSTART+1, RLENGTH-1 )
}
# Replace `FULLTEXT KEY` or any other `XXXXX KEY` except PRIMARY by `KEY`
/^ [^"]+KEY/ && !/^ PRIMARY KEY/ { gsub( /.+KEY/, " KEY" ) }
# Get rid of field lengths in KEY lines
/ KEY/ { gsub(/\([0-9]+\)/, "") }
# Print all fields definition lines except the `KEY` lines.
/^ / && !/^( KEY|\);)/ {
gsub( /AUTO_INCREMENT|auto_increment/, "" )
gsub( /(CHARACTER SET|character set) [^ ]+ /, "" )
gsub( /DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP|default current_timestamp on update current_timestamp/, "" )
gsub( /(COLLATE|collate) [^ ]+ /, "" )
gsub(/(ENUM|enum)[^)]+\)/, "text ")
gsub(/(SET|set)\([^)]+\)/, "text ")
gsub(/UNSIGNED|unsigned/, "")
if (prev) print prev ","
prev = $1
}
# `KEY` lines are extracted from the `CREATE` block and stored in array for later print
# in a separate `CREATE KEY` command. The index name is prefixed by the table name to
# avoid a sqlite error for duplicate index name.
/^( KEY|\);)/ {
if (prev) print prev
prev=""
if ($0 == ");"){
print
} else {
if ( match( $0, /\"[^"]+/ ) ) indexName = substr( $0, RSTART+1, RLENGTH-1 )
if ( match( $0, /\([^()]+/ ) ) indexKey = substr( $0, RSTART+1, RLENGTH-1 )
key[tableName]=key[tableName] "CREATE INDEX \"" tableName "_" indexName "\" ON \"" tableName "\" (" indexKey ");\n"
}
}
# Print all `KEY` creation lines.
END {
for (table in key) printf key[table]
print "END TRANSACTION;"
}
'
exit 0
I can't give a guaranteed solution, but here's a simple technique I've been using successfully to handle similar issues (See "Notes", below). I've been wrestling with this script the last few days, and figure this is worth sharing in case there are others who need to tweak it but are stymied by the awk learning curve.
The basic idea is to have the script output to a text file, edit the file, then import into sqlite (More detailed instructions below).
You might have to experiment a bit, but at least you won't have to learn awk (though I've been trying and it's pretty fun...).
HOW TO
Run the script, exporting to a file (instead of passing directly
to sqlite3):
./mysql2sqlite -u root -pMySecretPassWord myDbase > sqliteimport.sql
Use your preferred text editing technique to clean up whatever mess
you've run into. For example, search/replace in Sublime Text. (See the last note, below, for a tip.)
Import the cleaned up script into sqlite:
sqlite3 database.sqlite < sqliteimport.sql
NOTES:
I suspect what you're dealing with is an encoding problem -- that '-' represents a character that isn't recognized by, or means something different to, either your shell, the script (awk), or your sqlite database. Depending on your situation, you may not be able to finesse the problem (see the next note).
Be forewarned that this is most likely only going to work if the offending characters are embedded in text data (not just as text, but actual text content stored in a text field). If they're in a machine name (foreign key field, entity id, e.g.), binary data stored as text, or text data stored in a binary field (blob, eg), be careful. You could try it, but don't get your hopes up, and even if it seems to work be sure to test the heck out of it.
If in fact that '-' represents some unusual character, you probably won't be able to just type a hyphen into the 'search' field of your search/replace tool. Copy it from the source data (eg., open the file, highlight and copy to clipboard) then paste into the tool.
Hope this helps!
To convert MySQL to SQLite3 you can use Navicat Premium.