Athena data mismatch when attempting to query multiple partitioned files

I have multiple parquet files partitioned by day in S3. Those parquet files are all onboarded with 'string' types for each column, which I checked with the following code:
import awswrangler as wr
results = wr.s3.list_objects('s3://datalake/buildings/upload_year=*/upload_month=*/upload_day=*/*.parquet.gzip')
for result in results:
    df = wr.s3.read_parquet(result)
    column_not_string = sum([1 if str(x) != 'string' else 0 for x in df.dtypes])
    print(column_not_string)
column_not_string is the sum of fields that aren't string type in each file. Each iteration of a file returns 0, as expected.
These parquet files are then crawled by an AWS Glue crawler. After I run the crawler, I inspect the newly created table: all data types are 'string'. I inspect the partitions, and the fields in each partition's properties are all 'string'.
The problem:
When I go to Athena to query the newly created table:
select *
from "default".buildings;
I get the following error:
SQL Error [100071] [HY000]: [Simba]AthenaJDBC An error has
been thrown from the AWS Athena client. HIVE_BAD_DATA: Field
buildingcode's type INT64 in parquet is incompatible with type string
defined in table schema
I don't understand what is going on here. If all of the parquet file metadata is string type, the files are crawled as strings, and the table is created with string columns, how can there be a data mismatch with an int type?

While I was poking around in the table partition properties in Glue (AWS Glue > Tables > [your table] > View Partitions > View properties) I noticed that the 'objectCount' property was 1 on some partitions and > 1 on others. I noticed the partitions that had > 1 were also the partitions Athena was erroring on if I filtered my SQL down to that particular partition.
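For reference, the kind of partition-filtered query I used looked roughly like this (the partition column names match my S3 layout; the values below are only illustrative):
select *
from "default".buildings
where upload_year = '2021'
  and upload_month = '01'
  and upload_day = '05';
Partitions with a single object returned rows; the ones with extra objects failed with the same HIVE_BAD_DATA error.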
I inspected the S3 location representative of that partition and noticed I had multiple objects where I was only expecting 1. The extra object was a parquet file that had data types that weren't string. Lesson learned.

Related

Snowflake Data Load Wizard - File Format - how to handle null string in File Format

I am using the Snowflake Data Load Wizard to upload a CSV file to a Snowflake table. The table structure identifies a few columns as 'NOT NULL' (non-nullable). The problem is that the wizard treats empty strings as NULL, and the Data Load Wizard issues the following error:
Unable to copy files into table. NULL result in a non-nullable column
File '#<...../load_data.csv', line 2, character 1 Row, 1 Column
"<TABLE_NAME>" ["PRIMARY_CONTACT_ROLE":19)]
I'm sharing my File Format parameters from the wizard:
I then updated the DDL of the table by removing the "NOT NULL" declaration of the PRIMARY_CONTACT_ROLE column, re-created the table, and this time the data load of 20K records succeeded.
How do we fix the file format in the wizard so that Snowflake does not treat empty strings as NULLs?
The option you need to set is EMPTY_FIELD_AS_NULL = FALSE. Unfortunately, modifying this option is not possible in the wizard. You have to create your file format, or alter your existing file format, manually in a worksheet as follows:
CREATE FILE FORMAT my_csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
EMPTY_FIELD_AS_NULL = FALSE;
This will cause empty fields to be treated as empty strings rather than as NULL values.
The relevant documentation can be found at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#type-csv.
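For completeness, here is a minimal sketch of using that file format in a COPY INTO statement (the table and stage names are illustrative):
COPY INTO my_table
  FROM @my_stage/load_data.csv
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
With EMPTY_FIELD_AS_NULL = FALSE in the format, the empty fields load into the NOT NULL column as empty strings.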
Let me know if you need a full example of how to upload a CSV file with the SnowSQL CLI.
Just to add, there are additional ways you can load your CSV into Snowflake without having to specify a file format.
You can use pre-built third-party modeling tools to upload your raw file, adjust the default column types to your preference, and push your modeled data back into Snowflake.
My team and I are working on such a tool - Datameer; feel free to check it out here:
https://www.datameer.com/upload-csv-to-snowflake/

COPY INTO with partitioned ADLS

I have a container with partitioned parquet files that I want to use with the copy into command. My directories look like the below.
ABC_PARTITIONED_ID=1 (directory)
1-snappy.parquet
2-snappy.parquet
3-snappy.parquet
4-snappy.parquet
ABC_PARTITIONED_ID=2 (directory)
1-snappy.parquet
2-snappy.parquet
3-snappy.parquet
ABC_PARTITIONED_ID=3 (directory)
1-snappy.parquet
2-snappy.parquet
....
Each partitioned directory can contain multiple parquet files. I do not have a hive partition column that matches the pattern of the directories (ID1, ID2 etc).
How do I properly use the pattern parameter in the copy into command to write to a SF table from my ADLS? I am using this https://www.snowflake.com/blog/how-to-load-terabytes-into-snowflake-speeds-feeds-and-techniques/ as an example.
I do not think you need to do anything with the pattern parameter.
You said you do not have a hive partition column that matches the pattern of the directories. If you do not have a column that uses these partitions, then they are probably not beneficial for querying the data; maybe they were generated to help with maintenance. If that is the case, ignore the partitions and read all files with the COPY command.
If you think having such a column would help, then the blog post you mentioned already shows how you can parse the filenames to generate the column value. Add the partition column to your table (you may even define it as the clustering key), and run the COPY command to read all files in all partitions/directories, parsing the value of the column from the file name.
For parsing the partition value, I would use this approach, which seems easier:
copy into TARGET_TABLE from (
  select
    REGEXP_SUBSTR(
      METADATA$FILENAME,
      '.*\/ABC_PARTITIONED_ID=(.*)\/.*',
      1, 1, 'e', 1
    ) partitioned_column_value,
    $1:column_name,
    ...
  from @your_stage/data_folder/);
If the directory/partition name doesn't matter to you, then you can use some of the newer functions in Public Preview that support the Parquet format to create the table and ingest the data. As for how to construct the pattern, PATTERN = '.*\\.parquet' would read all subfolders.
//create file format, only required once
create file format my_parquet_format
  type = parquet;
//EXAMPLE CREATE AND COPY INTO FOR TABLE ABC
//create an empty table named ABC using this file format and location
create or replace table ABC
  using template (
    select array_agg(object_construct(*))
    from table(
      infer_schema(
        location=>'@mystage/ABC_PARTITIONED_ROOT',
        file_format=>'my_parquet_format'
      )
    ));
//copy the parquet files under ABC_PARTITIONED_ROOT into table ABC
copy into ABC from @mystage/ABC_PARTITIONED_ROOT pattern = '.*\\.parquet' file_format = my_parquet_format match_by_column_name = case_insensitive;
This should be possible by creating a storage integration, granting Snowflake access to the storage location in Azure, and then creating an external stage.
Alternatively, you can generate a shared access signature (SAS) token to grant Snowflake (limited) access to objects in your storage account. You can then create an external (Azure) stage that references the container using the SAS token.
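A minimal sketch of the storage integration route mentioned above; the integration, stage, account, container names, and tenant ID are illustrative placeholders:
CREATE STORAGE INTEGRATION azure_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<tenant_id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://myaccount.blob.core.windows.net/mycontainer/');
CREATE STAGE my_azure_stage
  STORAGE_INTEGRATION = azure_int
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/ABC_PARTITIONED_ROOT/'
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');
Running DESC STORAGE INTEGRATION azure_int shows the consent URL and app name needed to grant access on the Azure side, after which the COPY examples above can read from @my_azure_stage.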
Snowflake metadata provides
METADATA$FILENAME - Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER - Row number for each record
We could do something like this:
select $1:normal_column_1, ..., METADATA$FILENAME
FROM
  @stage_name/path/to/data/ (pattern => '.*.parquet')
limit 5;
For example, it would give something like:
METADATA$FILENAME
----------
path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet
We then need to derive the partition column from it. We could do a regexp_replace and get the partition value as a column like this:
select
  regexp_replace(METADATA$FILENAME, '.*\/year=(.*)\/.*', '\\1') as year,
  $1:normal_column_1
FROM
  @stage_name/path/to/data/ (pattern => '.*.parquet')
limit 5;
In the above regexp, we give the partition key.
The third parameter, \\1, is the regex group match number; in our case it is the first group match, which holds the partition value.
A more detailed answer and other approaches to solving this issue are available in this Stack Overflow answer.

Snowflake "Max LOB size (16777216) exceeded" error when loading data from Parquet

I am looking to load data from S3 into Snowflake. My files are in Parquet format, and were created by a spark job. There are 199 parquet files in my folder in S3, each with about 5500 records. Each parquet file is snappy compressed and is about 485 kb.
I have successfully created a storage integration and staged my data. However, when I read my data, I get the following message:
Max LOB size (16777216) exceeded, actual size of parsed column is 19970365
I believe I have followed the General File Sizing Recommendations, but I have not been able to figure out a solution to this issue, or even find a clear description of this error message.
Here are the basics of my SQL query:
CREATE OR REPLACE TEMPORARY STAGE my_test_stage
FILE_FORMAT = (TYPE = PARQUET)
STORAGE_INTEGRATION = MY_STORAGE_INTEGRATION
URL = 's3://my-bucket/my-folder';
SELECT $1 FROM @my_test_stage (PATTERN => '.*\\.parquet')
I seem to be able to read each parquet file individually by changing the URL parameter in the CREATE STAGE query to the full path of the parquet file. I really don't want to have to iterate through each file to load them.
The VARIANT data type imposes a 16 MB (compressed) size limit on individual rows.
The result set is displayed as a virtual (VARIANT) column, so the 16 MB limit also applies.
Docs Reference:
https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#semi-structured-data-size-limitations
There may be an issue with one or more records in your files. Try running the COPY command with the "ON_ERROR" option to debug whether all the records have a similar problem or only a few.
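A hedged sketch of that debugging step, loading into a hypothetical single-VARIANT-column table and reusing the stage from the question:
CREATE OR REPLACE TABLE my_raw_table (v VARIANT);
COPY INTO my_raw_table
  FROM @my_test_stage
  PATTERN = '.*\\.parquet'
  FILE_FORMAT = (TYPE = PARQUET)
  ON_ERROR = 'CONTINUE';
The per-file results returned by COPY (rows parsed, rows loaded, errors seen, first error) then show whether every file hits the 16 MB limit or only a few rows do.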

Presto query Array of Rows

So I have a Hive external table whose schema looks like this:
{
.
.
`x` string,
`y` ARRAY<struct<age:string,cId:string,dmt:string>>,
`z` string
}
So basically I need to query a column (column "y") which is an array of nested JSON.
I can see the data of column "y" from Hive, but the data in that column seems to be invisible to Presto, even though Presto knows the schema of this field, like this:
array(row(age varchar,cid varchar,dmt varchar))
As you can see, Presto already knows this field is an array of rows.
Notes:
1. The table is a Hive external table.
2. I get the schema of field "y" by using the ODBC driver, but the data is all empty; however, I can see something like this in Hive:
[{"age":"12","cId":"bx21hdg","dmt":"120"}]
3. Presto queries the Hive metastore for the schema.
4. The table was stored in Parquet format.
So how can I see my data in field "y" please?
Please try the below. This should work in Presto.
"If the array element is a row data type, the result is a table with one column for each row field in the element data type. The result table column data types match the corresponding array element row field data types"
select
y,age,cid,dmt
from
table
cross join UNNEST(y) AS nested_data(age,cid,dmt)
Reference: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0055064.html

load data into impala partitioned table

I have data in HDFS in the following dir structure:
/exported/2014/07/01/00/SEARCHES/part-m-00000.bz2
part-m-00001.bz2
/exported/2014/07/01/02/SEARCHES/part-m-00000.bz2
part-m-00001.bz2
part-m-00003.bz2
.
.
.
.
/exported/2014/08/01/09/SEARCHES/part-m-00005.bz2
There are multiple part files in each subdirectory.
I want to load this dataset into an Impala table, so I use the following query to create the table:
CREATE EXTERNAL TABLE search(time_stamp TIMESTAMP, ..... url STRING, domain STRING) PARTITIONED BY (year INT, month INT, day INT, hour INT) row format delimited fields terminated by '\t';
Then
ALTER TABLE search ADD PARTITION (year=2014, month=08, day=01) LOCATION '/data/jobs/exported/2014/08/01/*/SEARCHES/';
But it failed to load with the following error:
ERROR: AnalysisException: Failed to load metadata for table: magneticbi.search_mmx
CAUSED BY: TableLoadingException: Failed to load metadata for table: search_mmx
CAUSED BY: RuntimeException: Compressed text files are not supported: part-m-00000.bz2
Not sure what the correct way to do this is.
Can anyone help with this?
Thanks
Here's a link to a table from Cloudera that describes your options. To summarize:
Impala supports the following compression codecs:
Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy compression is very fast, but GZIP provides greater space savings. Not supported for text files.
GZIP. Recommended when achieving the highest level of compression (and therefore greatest disk-space savings) is desired. Not supported for text files.
Deflate. Not supported for text files.
BZIP2. Not supported for text files.
LZO, for Text files only. Impala can query LZO-compressed Text tables, but currently cannot create them or insert data into them; perform these operations in Hive.
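One common workaround, sketched below under assumed table and column names (only a subset of the columns from the question is shown), is to expose the bzip2 text files to Hive, which can read them, and rewrite the data into a format Impala supports, such as Parquet:
-- In Hive: external table over one hour's worth of the raw .bz2 text files
CREATE EXTERNAL TABLE search_raw (time_stamp TIMESTAMP, url STRING, domain STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/jobs/exported/2014/08/01/00/SEARCHES/';

-- Parquet copy of the data, partitioned the same way as the target Impala table
CREATE TABLE search_parquet (time_stamp TIMESTAMP, url STRING, domain STRING)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET;

INSERT INTO TABLE search_parquet PARTITION (year=2014, month=8, day=1, hour=0)
SELECT time_stamp, url, domain FROM search_raw;

-- Then, in Impala, pick up the new table:
-- INVALIDATE METADATA search_parquet;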
