Snowpipe fails to read AVRO compressed by DEFLATE exported from BigQuery - snowflake-cloud-data-platform

I am trying to import data exported from BigQuery as AVRO and compressed with DEFLATE. DEFLATE is the only compression codec, besides NONE, that both BigQuery and Snowflake support for AVRO.
I am exporting one of the publicly available datasets, bigquery-public-data:covid19_open_data.covid19_open_data, which has 13,343,598 rows. I am using the following command to export:
bq extract --destination_format=AVRO --compression=DEFLATE bigquery-public-data:covid19_open_data.covid19_open_data gs://staging/covid19_open_data/avro_deflate/covid19_open_data_2_*.avro
The command creates 17 files in GCS. When I query the data in the files with the command:
SELECT count(*) FROM @shared.data_warehouse_ext_stage/covid19_open_data/avro_deflate;
I only get a count of 6,845,021 rows. To troubleshoot the error in the pipe I issue the command:
SELECT * from table(information_schema.copy_history(table_name=>'covid19_open_data', start_time=> dateadd(hours, -1, current_timestamp())));
The error reported by the pipeline is as follows:
Invalid data encountered during decompression for file: 'covid19_open_data_3_000000000006.avro',compression type used: 'DEFLATE', cause: 'data error'
The SQL for the File Format command is:
CREATE OR REPLACE FILE FORMAT monitoring_blocking.dv_avro_deflate_format TYPE = AVRO COMPRESSION = DEFLATE;
I know the problem is only related to the compression being DEFLATE. There are only two compression codecs for AVRO that are common to both BigQuery and Snowflake: NONE and DEFLATE. I also created two other pipes, one with an AVRO file format with compression NONE and the second with CSV and GZIP. They both load data into the table. The two AVRO pipelines mirror each other except for the file format. Here is a snippet of the SQL for the pipe:
CREATE OR REPLACE PIPE covid19_open_data_avro
AUTO_INGEST = TRUE
INTEGRATION = 'GCS_PUBSUB_DATA_WAREHOUSE_NOTIFICATION_INT' AS
COPY INTO covid19_open_data(
location_key
,date
,place_id
,wikidata_id
...
)
FROM
(SELECT
$1:location_key
,$1:date AS date
,$1:place_id AS place_id
,$1:wikidata_id AS wikidata_id
...
FROM @shared.staging/covid19_open_data/avro_deflate)
FILE_FORMAT = monitoring_blocking.dv_avro_deflate_format;

The problem lies within Snowflake. When we changed the compression in the FILE FORMAT definition from DEFLATE to AUTO, it worked. We changed:
CREATE OR REPLACE FILE FORMAT my_schema.avro_compressed_format
TYPE = AVRO
COMPRESSION = DEFLATE;
to
CREATE OR REPLACE FILE FORMAT my_schema.avro_compressed_format
TYPE = AVRO
COMPRESSION = AUTO;
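With the AUTO file format in place, here is a rough way to sanity-check the fix and re-queue the files that previously failed. The stage path, file format, and pipe names follow the question and answer above; note that ALTER PIPE ... REFRESH only re-queues staged files from the last 7 days that are not already in the load history, so treat this as a sketch rather than a guaranteed recovery path.
-- Count rows directly against the staged AVRO files using the new file format.
SELECT COUNT(*)
FROM @shared.data_warehouse_ext_stage/covid19_open_data/avro_deflate
  (FILE_FORMAT => 'my_schema.avro_compressed_format');
-- Ask Snowpipe to load staged files that are not yet in the load history.
ALTER PIPE covid19_open_data_avro REFRESH;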

Related

Snowflake Data Load Wizard - File Format - how to handle null string in File Format

I am using the Snowflake Data Load Wizard to upload a CSV file to a Snowflake table. The Snowflake table structure identifies a few columns as 'NOT NULL' (non-nullable). The problem is that the wizard treats empty strings as NULL, and the Data Load Wizard issues the following error:
Unable to copy files into table. NULL result in a non-nullable column
File '#<...../load_data.csv', line 2, character 1 Row, 1 Column
"<TABLE_NAME>" ["PRIMARY_CONTACT_ROLE":19)]
I'm sharing my File Format parameters from the wizard:
I then updated the DDL of the table by removing the "NOT NULL" declaration from the PRIMARY_CONTACT_ROLE column, re-created the table, and this time the data load of 20K records was successful.
How do we fix the file format in the wizard so that Snowflake does not treat empty strings as NULLs?
The option you have to set is EMPTY_FIELD_AS_NULL = FALSE. Unfortunately, modifying this option is not possible in the wizard. You have to create your file format, or alter your existing file format, manually in a worksheet as follows:
CREATE FILE FORMAT my_csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
EMPTY_FIELD_AS_NULL = FALSE;
This causes empty fields to be loaded as empty strings rather than NULL values.
The relevant documentation can be found at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#type-csv.
Let me know if you need a full example of how to upload a CSV file with the SnowSQL CLI.
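For reference, a minimal sketch of such a SnowSQL session is shown below, reusing my_csv_format from above; the local file path and table name are placeholders, not from the question.
-- Upload the local file to the table's internal stage.
PUT file:///tmp/my_data.csv @%my_table AUTO_COMPRESS = TRUE;
-- Load it with the file format that sets EMPTY_FIELD_AS_NULL = FALSE.
COPY INTO my_table
FROM @%my_table
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
ON_ERROR = 'ABORT_STATEMENT';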
Just to add, there are additional ways you can load your CSV into Snowflake without having to specify a file format.
You can use pre-built third-party modeling tools to upload your raw file, adjust the default column types to your preference and push your modeled data back into snowflake.
My team and I are working on such a tool - Datameer, feel free to check it out here
https://www.datameer.com/upload-csv-to-snowflake/

Extract data from Word documents with SSIS to ETL into SQL

I could really use some help in how to extract data from Word documents using SSIS and inserting the extracted data in SQL. There are 10,000 - 13,000 Word files to process. The files most likely aren't consistent over the years. Any help is greatly appreciated!
Below is the example data from the Word documents that I'm interested in capturing. Note that Date and Job No are in the Header section.
Customer : Test Customer
Customer Ref. : 123456
Contact : Test Contact
Part No. : 123456789ABCDEFG
Manufacturer : Some Mfg.
Package : 123-456
Date Codes : 1234
Lot Number : 123456
Country of Origin : Country
Total Incoming Qty : 1 pc
XRF Test Result : PASS
HCT Result : PASS
Solder Test Result : PASS
My approach would be this:
Create a script in Python that extracts your data from the Word files and saves it in XML or JSON format
Create an SSIS package to load the data from each XML/JSON file into SQL Server (a plain T-SQL sketch of this load step is shown below)
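As a rough illustration of the second step, assuming the Python step wrote one flat JSON object per Word document, the load into SQL Server can also be done with plain T-SQL using OPENROWSET and OPENJSON (SQL Server 2016+), either standalone or from an Execute SQL Task. The file path, table name, and JSON key names below are hypothetical.
-- Destination table (columns trimmed to a few of the fields shown above).
CREATE TABLE dbo.IncomingInspection (
    CustomerName varchar(200),
    CustomerRef  varchar(50),
    PartNo       varchar(50),
    XrfResult    varchar(10)
);
-- Read one extracted JSON file and shred it into columns.
INSERT INTO dbo.IncomingInspection (CustomerName, CustomerRef, PartNo, XrfResult)
SELECT j.CustomerName, j.CustomerRef, j.PartNo, j.XrfResult
FROM OPENROWSET(BULK 'C:\extracted\doc_0001.json', SINGLE_CLOB) AS src
CROSS APPLY OPENJSON(src.BulkColumn)
WITH (
    CustomerName varchar(200) '$.customer',
    CustomerRef  varchar(50)  '$.customer_ref',
    PartNo       varchar(50)  '$.part_no',
    XrfResult    varchar(10)  '$.xrf_test_result'
) AS j;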
1. Using a script component as a source
To import data from Microsoft Word into SQL Server, you can use a script component as a data source where you can implement a C# script to parse document files using Office Interoperability libraries or any third-party assembly.
Example of reading tables from a Word file
2. Extracting XML from DOCX file
A DOCX file is composed of several embedded files; the text is mainly stored in an XML file. You can use a Script Task or an Execute Process Task to extract the DOCX file content, then use an XML source to read the data.
How can I extract the data from a corrupted .docx file?
How to extract just plain text from .doc & .docx files?
3. Converting the Word document into a text file
The third approach is to convert the Word document into a text file and use a flat-file connection manager to read the data.
convert a word doc to text doc using C#
Converting a Microsoft Word document to a text file in C#

Snowflake "Max LOB size (16777216) exceeded" error when loading data from Parquet

I am looking to load data from S3 into Snowflake. My files are in Parquet format, and were created by a spark job. There are 199 parquet files in my folder in S3, each with about 5500 records. Each parquet file is snappy compressed and is about 485 kb.
I have successfully created a storage integration and staged my data. However, when I read my data, I get the following message:
Max LOB size (16777216) exceeded, actual size of parsed column is 19970365
I believe I have followed the General File Sizing Recommendations but I have not been able to figure out a solution to this issue, or even a clear description of this error message.
Here are the basics of my SQL query:
CREATE OR REPLACE TEMPORARY STAGE my_test_stage
FILE_FORMAT = (TYPE = PARQUET)
STORAGE_INTEGRATION = MY_STORAGE_INTEGRATION
URL = 's3://my-bucket/my-folder';
SELECT $1 FROM @my_test_stage(PATTERN => '.*\\.parquet')
I seem to be able to read each parquet file individually by changing the URL parameter in the CREATE STAGE query to the full path of the parquet file. I really don't want to have to iterate through each file to load them.
The VARIANT data type imposes a 16 MB (compressed) size limit on individual rows.
The result set is actually displayed as a virtual column, so the 16 MB limit also applies there.
Docs Reference:
https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#semi-structured-data-size-limitations
There may be an issue with one or more records in your file. Try running the COPY command with the ON_ERROR option to check whether all the records have a similar problem or only a few.
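A hedged sketch combining both suggestions is below; my_target_table and the field names are placeholders. Selecting individual Parquet fields instead of the whole $1 object means the full row never has to be materialized as a single VARIANT value (which only helps if no single field is itself over 16 MB), and ON_ERROR = CONTINUE lets the load keep going so you can see how many rows are affected.
COPY INTO my_target_table (id, payload)
FROM (
  SELECT $1:id::string, $1:payload::string
  FROM @my_test_stage
)
ON_ERROR = CONTINUE;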

Importing a tsv file into dynamoDB

I have a dataset that I got from the Registry of Open Data on AWS.
This is the link to my data set. I want to import this data set into a DynamoDB table, but I don't know how to do it.
I tried to use Data Pipeline, from an S3 bucket to DynamoDB, but it didn't work:
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
Caused by: com.google.gson.stream.MalformedJsonException: Expected ':' at line 1 column 20
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1505)
at com.google.gson.stream.JsonReader.doPeek(JsonReader.java:519)
at com.google.gson.stream.JsonReader.peek(JsonReader.java:414)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:157)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:187)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:145)
at com.google.gson.Gson.fromJson(Gson.java:803)
... 15 more
Exception in thread "main" java.io.
I have this error and I don't know how to fix it.
Then I downloaded the file locally, but I can't import it into my table in DynamoDB.
There is no code for the moment; all I have done is configuration.
I'm expecting to have the data set in my table, but unfortunately I couldn't reach that goal.
I finally converted the TSV file to a CSV file and then to a JSON file using a Python script.
With a JSON file it's much easier.

load data into impala partitioned table

I have data in HDFS in the following directory structure:
/exported/2014/07/01/00/SEARCHES/part-m-00000.bz2
part-m-00001.bz2
/exported/2014/07/01/02/SEARCHES/part-m-00000.bz2
part-m-00001.bz2
part-m-00003.bz2
.
.
.
.
/exported/2014/08/01/09/SEARCHES/part-m-00005.bz2
There are multiple part files in each subdirectory.
I want to load this dataset into an Impala table, so I use the following query to create the table:
CREATE EXTERNAL TABLE search(time_stamp TIMESTAMP, ..... url STRING, domain STRING) PARTITIONED BY (year INT, month INT, day INT, hour INT) row format delimited fields terminated by '\t';
Then
ALTER TABLE search ADD PARTITION (year=2014, month=08, day=01) LOCATION '/data/jobs/exported/2014/08/01/*/SEARCHES/';
But it failed to load with the following error:
ERROR: AnalysisException: Failed to load metadata for table: magneticbi.search_mmx
CAUSED BY: TableLoadingException: Failed to load metadata for table: search_mmx
CAUSED BY: RuntimeException: Compressed text files are not supported: part-m-00000.bz2
I'm not sure what the correct way to do this is.
Can anyone help with this?
Thanks
Here's a link to a table from Cloudera that describes your options. To summarize:
Impala supports the following compression codecs:
Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy compression is very fast, but GZIP provides greater space savings. Not supported for text files.
GZIP. Recommended when achieving the highest level of compression (and therefore greatest disk-space savings) is desired. Not supported for text files.
Deflate. Not supported for text files.
BZIP2. Not supported for text files.
LZO, for Text files only. Impala can query LZO-compressed Text tables, but currently cannot create them or insert data into them; perform these operations in Hive.
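Given that list, one common workaround is sketched below: since Hive can read bzip2-compressed text directly, declare the .bz2 files as an external text table in Hive, add the partitions much as in the question, and rewrite the data into a Parquet table that Impala can then query (after running INVALIDATE METADATA in Impala). Table names and the trimmed column list are illustrative, not a drop-in solution.
-- Run in Hive: external text table over the existing .bz2 files.
CREATE EXTERNAL TABLE search_raw (time_stamp TIMESTAMP, url STRING, domain STRING)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
ALTER TABLE search_raw ADD PARTITION (year=2014, month=7, day=1, hour=0)
LOCATION '/exported/2014/07/01/00/SEARCHES';
-- ...one ADD PARTITION per hour directory (easy to script)...
-- Rewrite into Parquet so Impala can query it.
CREATE TABLE search_parquet (time_stamp TIMESTAMP, url STRING, domain STRING)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE search_parquet PARTITION (year, month, day, hour)
SELECT time_stamp, url, domain, year, month, day, hour FROM search_raw;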
