Snowflake is not detecting the partitioned-by column in Parquet - snowflake-cloud-data-platform

I have a question about Snowflake's new INFER_SCHEMA table function. INFER_SCHEMA performs admirably on a parquet file and returns the correct data types. However, when the parquet files are partitioned and stored in S3, INFER_SCHEMA does not behave the way it does with PySpark DataFrames.
In DataFrames, the partition folder name and value are read as the last column; is there a way to achieve the same result in the Snowflake Infer schema?
@GregPavlik - The input is in structured parquet format. When the parquet files are stored in S3 without a partition, the schema is derived perfectly.
Example : {
"AGMT_GID": 1714844883,
"AGMT_TRANS_GID": 640481290,
"DT_RECEIVED": "20 302",
"LATEST_TRANSACTION_CODE": "I"
}
Snowflake's INFER_SCHEMA provides me with all 4 column names as well as their data types.
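For reference, the call I am using looks roughly like this (the stage and file format names are placeholders; the file format is of type PARQUET):
SELECT *
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@my_stage/path/to/data/',
        FILE_FORMAT => 'my_parquet_format'
    )
);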
However, if the parquet files are stored partitioned as shown in the image above, then under the LATEST_TRANSACTION_CODE=I/ folder I would have a file like:
Example : {
"AGMT_GID": 1714844883,
"AGMT_TRANS_GID": 640481290,
"DT_RECEIVED": "20 302"
}
In this case, Snowflake's INFER_SCHEMA only returns three columns; however, reading the same file into a PySpark DataFrame prints all four columns.
I'm wondering if there is a workaround in Snowflake to read a partitioned parquet file.

Faced this issue with Snowflake while handling partitioned parquet files.
This problem is not limited to INFER_SCHEMA; the following flows do not detect the partitioned-by column as a column in Snowflake:
COPY INTO TABLE from parquet
MERGE INTO TABLE from parquet
SELECT from parquet
INFER_SCHEMA from parquet
Snowflake treats parquet files as plain files and ignores the metadata encoded in the folder names, whereas Apache Spark infers the partition columns from them intelligently.
The following approaches are ways to work around this until the Snowflake team addresses it.
Approach 1
Handle this using Snowflake metadata features.
As of now, Snowflake metadata provides only:
METADATA$FILENAME - Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER - Row number for each record
We could do something like this:
select $1:normal_column_1, ..., METADATA$FILENAME
FROM
    @stage_name/path/to/data/ (pattern => '.*.parquet')
limit 5;
This gives a column with the full path to the partitioned file, but we still need to derive the partition column from it. For example, it would return something like:
METADATA$FILENAME
----------
path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet
We could use regexp_replace to extract the partition value as a column, like this:
select
    regexp_replace(METADATA$FILENAME, '.*\/year=(.*)\/.*', '\\1') as year,
    $1:normal_column_1
FROM
    @stage_name/path/to/data/ (pattern => '.*.parquet')
limit 5;
In the above regexp, the pattern contains the partition key (year).
The third parameter, \\1, is the regex group match number; in our case the first group match holds the partition value.
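To illustrate, running the pattern against the sample path shown above returns only the partition value:
select regexp_replace(
    'path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet',
    '.*\/year=(.*)\/.*',
    '\\1'
) as year;   -- returns 2021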
Approach 2
If we have control over the flow that writes the source parquet file.
Add a duplicate column that has the same content as the partition-by column.
This must happen before writing to parquet, so that the parquet file itself contains the column's content.
from pyspark.sql.functions import col
df.withColumn("partition_column", col("col1")).write.partitionBy("partition_column").parquet(path)
With this approach, once the duplicate column is written, it will start appearing in every usage of the parquet file (COPY, MERGE, SELECT, INFER).
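For example, because only the duplicated partition_column moves into the folder names, the original col1 stays inside each file, so a plain stage query can read it directly (same hypothetical stage path as in Approach 1):
select $1:col1, $1:normal_column_1
from @stage_name/path/to/data/ (pattern => '.*.parquet')
limit 5;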
Approach 3
If we do not have control over the flow that writes the source parquet file.
This approach is more domain and data-model specific.
In many use cases, we need to reverse engineer how the partitioned-by column is related to the data.
Can it be generated from other columns? For example, if the data is partitioned by year, where year is derived from the created_by column, then this derived value can simply be regenerated.
Can it be generated by joining with another Snowflake table? For example, the parquet may carry an id that can be joined with another table to derive the column dynamically, as in the sketch below.
Approach 3 is more problem/domain specific, and it has to be handled in every use case of the parquet (COPY, MERGE, SELECT, etc.).
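For instance, the join variant of Approach 3 might look like this. The table transaction_reference and its columns are purely illustrative, and the stage is assumed to default to a parquet file format as in the earlier snippets:
select
    p.$1:AGMT_GID::number      as agmt_gid,
    p.$1:DT_RECEIVED::string   as dt_received,
    t.latest_transaction_code
from @stage_name/path/to/data/ (pattern => '.*.parquet') p
join transaction_reference t
    on t.agmt_gid = p.$1:AGMT_GID::number;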

Related

How can I replace an entire column in a csv file with the appropriate foreign key?

I'm working on a stock trading project, and I have stock data saved as csv files in this format:
symbol,date,open,high,low,close,vol
AAA,20220627,24.38,24.38,24.365,24.365,500
I'm currently working on the database design, and I'm using SQL Server & SSMS. The issue is that, with the schema I've created, I don't have a table that shares exactly the same columns as this csv file, so it's not as straightforward as just importing or bulk inserting the data directly into a table.
In my schema I came up with a Stock table
id
symbol
company_name
stock_exchange
And a Stock Data table
id
stock_id
date
open
high
low
close
vol
The csv data ultimately needs to go into my Stock Data table, however I need to figure out a way to convert the stock symbol to the correct id that each stock is being assigned by my Stock table. Is this an issue with my design or is there a simple way to handle this that I cannot seem to find? I had considered simply reading the csv data into a temporary table and then correctly inserting the data into the Stock Data table, but I wasn't sure how to easily accomplish that since I'll be inserting thousands of rows.
You can view my full diagram here - https://lucid.app/lucidchart/28591ceb-6574-4e22-a5ce-284cada1d832/edit?invitationId=inv_d5eb35d3-9bd0-4ba0-aa95-e7d70ca50562#
SSIS can do the lookup before doing the insert; this would be a typical ETL. I would likely use ELT instead and avoid a complicated SSIS package. There is no extract because you already have the csv file. Bulk load or use a simple SSIS package to load the file, then join the loaded data to the lookup table to do the insert into the StockData table. (You can do the join and insert in the SSIS package as a T-SQL execution, or create a job to do the bulk insert and T-SQL.)
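A rough T-SQL sketch of that ELT flow, assuming a staging table dbo.StockDataStaging exists with the same columns as the csv (all object names and the file path are hypothetical):
-- 1. Load the raw csv into the staging table
BULK INSERT dbo.StockDataStaging
FROM 'C:\data\stock_prices.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- 2. Resolve each symbol to its surrogate key while inserting into StockData
INSERT INTO dbo.StockData (stock_id, [date], [open], [high], [low], [close], vol)
SELECT s.id, st.[date], st.[open], st.[high], st.[low], st.[close], st.vol
FROM dbo.StockDataStaging AS st
JOIN dbo.Stock AS s
    ON s.symbol = st.symbol;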

Datafactory - dynamically copy subsection of columns from one database table to another

I have a database on SQL Server on premises and need to regularly copy the data from 80 different tables to an Azure SQL Database. For each table the columns I need to select from and map are different - example, TableA - I need columns 1,2 and 5. For TableB I need just column 1. The tables are named the same in the source and target, but the column names are different.
I could create multiple Copy data pipelines and select the source and target data sets and map to the target table structures, but that seems like a lot of work for what is ultimately the same process repeated.
I've so far created a meta table, which lists all the tables and the column mapping information. This table holds the following data:
SourceSchema, SourceTableName, SourceColumnName, TargetSchema, TargetTableName, TargetColumnName.
For each table, data is held in this table to map the source tables to the target tables.
I have then created a lookup which selects each table from the mapping table. It then does a for each loop and does another lookup to get the source and target column data for the table in the foreach iteration.
From this information, I'm able to map the Source table and the Sink table in a Copy Data activity created within the foreach loop, but I'm not sure how I can dynamically map the columns, or dynamically select only the columns I require from each source table.
I have the "activity('LookupColumns').output" from the column lookup, but would be grateful if someone could suggest how I can use this to then map the source columns to the target columns for the copy activity. Thanks.
In your case, you can use an expression in the mapping setting of the Copy activity.
You need to provide an expression, and its value should look like this:
{"type":"TabularTranslator","mappings":[{"source":{"name":"Id"},"sink":{"name":"CustomerID"}},{"source":{"name":"Name"},"sink":{"name":"LastName"}},{"source":{"name":"LastModifiedDate"},"sink":{"name":"ModifiedDate"}}]}
So add a column named Translator to your meta table, with its value shaped like the JSON above, and then use this expression to do the mapping: @item().Translator
Reference: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping
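If the Translator values need to be generated rather than typed by hand, one option is to build the JSON from the existing mapping rows. A rough T-SQL sketch, assuming SQL Server 2017+ for STRING_AGG and a meta table named dbo.TableColumnMapping with the columns listed in the question:
SELECT
    SourceSchema,
    SourceTableName,
    '{"type":"TabularTranslator","mappings":[' +
    STRING_AGG(
        CAST('{"source":{"name":"' + SourceColumnName + '"},"sink":{"name":"' + TargetColumnName + '"}}' AS nvarchar(max)),
        ','
    ) + ']}' AS Translator
FROM dbo.TableColumnMapping
GROUP BY SourceSchema, SourceTableName;
The result can be stored in the Translator column and picked up by the lookup that feeds the ForEach loop.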

How to transform the incremental json format data from a table into another table with tabular format

In my Airflow ETL I am loading parquet files from AWS S3 into a Snowflake table (raw_data) as a single column of type VARIANT.
Now I want to transform the JSON-format values from the raw_data table into columnar format in another table.
Every day when the ETL runs it loads a new parquet file from S3 into the raw_data table. How can I run the transformation from raw_data so that it only picks up the incremental data? Since raw_data has only one column of type VARIANT, I am not able to figure out any logic that considers only the new rows.
Thanks in advance
It's a very common use-case.
The best-case scenario would be if you have a reliable timestamp column in your JSON. In that case, you don't have to modify your existing tables.
If you don't have such field, you can add a timestamp column to your raw_data table.
alter table raw_data add column added_at timestamp default current_timestamp::timestamp;
This timestamp column should be addressed in your destination table as well.
alter table target_data add column last_sync_at timestamp;
The real incrementality should live in your data transformation.
with transformed as (
    select
        json_column:top_level_json_key.lower_level_key::data_type as table_column_name,
        ...,
        -- do this if you have a reliable incremental timestamp key
        json_column:record_generation_key::timestamp as last_sync_at
        -- or do this if you don't have such a key
        -- added_at as last_sync_at
    from raw_data
    where
        last_sync_at > (select max(last_sync_at) from target_data)
)
select *
from transformed
Then, you only have to wrap the whole result in a merge/insert.
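A hedged sketch of that wrapping merge, with hypothetical key/column names (id, some_value) standing in for the real ones:
merge into target_data t
using (
    select
        json_column:id::number as id,
        json_column:some_key::string as some_value,
        added_at as last_sync_at
    from raw_data
    where added_at > (select coalesce(max(last_sync_at), '1900-01-01'::timestamp) from target_data)
) s
on t.id = s.id
when matched then update set
    t.some_value = s.some_value,
    t.last_sync_at = s.last_sync_at
when not matched then insert (id, some_value, last_sync_at)
    values (s.id, s.some_value, s.last_sync_at);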
This is of course just a direction to think about. The real implementation completely depends on your use-case.
Cheers.
This is the purpose of Snowflake Streams. You create a stream over your raw_data table and it will tell you the incremental changes when new data is loaded. When you run DML leveraging that stream, the stream will reset until new data is loaded to the raw_data table.
https://docs.snowflake.com/en/user-guide/streams.html
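A minimal sketch of that pattern, reusing the raw_data/target_data names from the previous answer (the stream name and the some_key/some_value columns are hypothetical):
-- Track new rows arriving in raw_data
create or replace stream raw_data_stream on table raw_data;

-- Consuming the stream in DML advances its offset, so the next run
-- only sees rows loaded after this statement ran
insert into target_data (some_value, last_sync_at)
select
    json_column:some_key::string,
    current_timestamp()
from raw_data_stream
where metadata$action = 'INSERT';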

How to cope with case-sensitive column names in big data file formats and external tables?

Background
I'm using Azure data factory v2 to load data from on-prem databases (for example SQL Server) to Azure data lake gen2. Since I'm going to load thousands of tables, I've created a dynamic ADF pipeline that loads the data as-is in the source based on parameters for schema, table name, modified date (for identifying increments) and so on. This obviously means I can't specify any type of schema or mapping manually in ADF. This is fine since I want the data lake to hold a persistent copy of the source data in the same structure. The data is loaded into ORC files.
Based on these ORC files I want to create external tables in Snowflake with virtual columns. I have already created normal tables in Snowflake with the same column names and data types as in the source tables, which I'm going to use in a later stage. I want to use the information schema for these tables to dynamically create the DDL statement for the external tables.
The issue
Since column names are always UPPER case in Snowflake, and it's case-sensitive in many ways, Snowflake is unable to parse the ORC file with the dynamically generated DDL statement as the definition of the virtual columns no longer corresponds to the source column name casing. For example it will generate one virtual column as -> ID NUMBER AS(value:ID::NUMBER)
This will return NULL as the column is named "Id" with a lower case D in the source database, and therefore also in the ORC file in the data lake.
This feels like a major drawback with Snowflake. Is there any reasonable way around this issue? The only options I can think of are to:
1. Load the information schema from the source database to Snowflake separately and use that data to build a correct virtual column definition with correct cased column names.
2. Load the records in their entirety into some variant column in Snowflake, converted to UPPER or LOWER.
Both options add a lot of complexity or even mess up the data. Is there any straightforward way to only return the column names from an ORC file? Ultimately I would need to be able to use something like Snowflake's DESCRIBE TABLE on the file in the data lake.
Unless you set the parameter QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE you can declare your column in the casing you want:
CREATE TABLE "MyTable" ("Id" NUMBER);
If your dynamic SQL carefully uses "Id" and not just Id you will be fine.
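Applied to the dynamically generated external table DDL, that means quoting the identifier both in the column name and in the value path, roughly like this (the schema, stage and file format details are placeholders):
CREATE OR REPLACE EXTERNAL TABLE my_schema.my_external_table (
    "Id" NUMBER AS (value:"Id"::NUMBER)
)
LOCATION = @my_stage/my_table/
FILE_FORMAT = (TYPE = ORC)
AUTO_REFRESH = FALSE;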
Found an even better way to achieve this, so I'm answering my own question.
With the query below we can get the path/column names directly from the ORC file(s) in the stage, with a hint of the data type from the source. This filters out columns that contain only NULL values. I will most likely create some kind of data type ranking table for the final data type determination of the virtual columns we're aiming to define dynamically for the external tables.
SELECT f.path as "ColumnName"
, TYPEOF(f.value) as "DataType"
, COUNT(1) as NbrOfRecords
FROM (
SELECT $1 as "value" FROM @<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
),
lateral flatten(value, recursive=>true) f
WHERE TYPEOF(f.value) != 'NULL_VALUE'
GROUP BY f.path, TYPEOF(f.value)
ORDER BY 1

single or multiple tables in database?

My system collects a lot of data from different resources (each resource has a text ID) and sends it to clients bundled together in predefined groups. There are some hundreds of different resources, and each might write a record anywhere from every second up to every few hours. There are fewer than a hundred "view groups".
The data collector is single-threaded.
What is the best way to organize the data?
a. Make a different table for each source, where the name of the table is based on the source ID?
b. Make a single table and add the source ID as a text field (key if possible)?
c. Make a table for each predefined display group, with the source ID as a text field?
Each record has a value (float) and a date (date). The query will be something like select * from ... where date < d1 and date > d2; in the single-table case, it will also include "and sourceId in (...)".
The database type is not decided yet; it might be lightsql, postgres, mysql, mssql ...
