Snowflake - Column default values not effective with COPY command?

I'm using Matillion to load data into Snowflake, both on Azure. When I create tables I specify default values for the columns, since I don't like having NULLs in the warehouse.
From what I've read, the Database Query orchestration component in Matillion for Snowflake will put the retrieved data set into an Azure blob and use the Snowflake COPY command to move the data from the blob to the target table.
The result is that NULL values are still there in the target table.
Can someone confirm that the COPY command does some kind of bulk data copy and that the default values are effective only with INSERT statements?
If so, I'll just trap the NULL values at the source.
Thanks.
JFS.

It isn't mentioned in a straightforward way, but the COPY INTO … TABLE statement documentation does specify that the default is used only for skipped column names (and not in other scenarios):
( col_name [ , col_name ... ] )
[ … ]
Any columns excluded from this column list are populated by their default value
Additionally, the NULL behaviour is mentioned for another scenario where data may be missing, with no mention of defaults being applied:
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE | FALSE
[ … ]
If the input file contains records with fewer fields than columns in the table, the non-matching columns in the table are loaded with NULL values.
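For illustration, here is a minimal sketch of the behaviour described above (table, stage and file names are made up): the default only kicks in when the column is left out of the COPY column list, so NULLs arriving in the data itself still have to be handled separately, e.g. with a post-load UPDATE or by trapping them at the source.

CREATE OR REPLACE TABLE my_table (
    id          NUMBER,
    name        STRING,
    load_status STRING DEFAULT 'NEW'
);

-- Default applies: load_status is excluded from the column list.
COPY INTO my_table (id, name)
FROM (SELECT $1, $2 FROM @my_stage/my_file.csv)
FILE_FORMAT = (TYPE = CSV);

-- The default does NOT apply to NULLs present in the loaded data itself,
-- so they still need to be cleaned up afterwards (or at the source):
UPDATE my_table SET load_status = 'NEW' WHERE load_status IS NULL;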

Related

Azure Synapse pipeline: how to add a GUID to raw data

I am new to Azure Synapse and am technically a Data Scientist who's doing a Data Engineering task. Please help!
I have some xlsx files containing raw data that I need to import into an SQL database table. The issue is that the raw data does not have a uniqueidentifier column and I need to add one before inserting the data into my SQL database.
I have been able to successfully add all the rows to the table by adding a new column on the Copy Data command and setting it to be @guid(). However, this sets the guid of every row to the same value (not unique for each row).
If I do not add this mapping, the pipeline throws an error stating that it cannot import a NULL Id into the column Id, which makes sense as this column does not accept NULL values.
Is there a way to have Azure Synapse Analytics read in a raw xlsx file and then import it into my DB with a unique identifier for each row? If so, how can I accomplish this?
Many many thanks for any support.
Giving dynamic content to a column in this way would generate the same value for the entire column.
Instead, you can generate a new guid for each row using a for each activity.
You can retrieve the data from your source Excel file using a Lookup activity (my source only has a name column). Give the output array of the Lookup activity to the ForEach activity:
@activity('Lookup1').output.value
Inside the ForEach, since you already have a linked service, create a Script activity. In this Script activity, you can create a query with dynamic content to insert values into the destination table. The following is the query I built using dynamic content:
insert into demo values ('@{guid()}','@{item().name}')
This allows you to iterate through the source rows and insert each row individually while generating a new GUID each time.
You can follow the above procedure to build a query that inserts each row with a unique identifier value. In my test, I used Copy Data to insert the first 2 rows (same as yours) and inserted the next 2 rows using the above procedure.
NOTE: I have used an Azure SQL database for the demo, but that does not affect the procedure.

Snowflake is not detecting the partitioned-by column in Parquet

I have a question about Snowflake's new capability, the INFER_SCHEMA table function. INFER_SCHEMA performs admirably on a Parquet file and returns the correct data types. However, when the Parquet files are partitioned and stored in S3, INFER_SCHEMA does not behave the way PySpark DataFrames do.
In DataFrames, the partition folder name and value are read as the last column; is there a way to achieve the same result with Snowflake's INFER_SCHEMA?
@GregPavlik - The input is in structured parquet format. When the parquet files are stored in S3 without a partition, the schema is perfectly derived.
Example : {
"AGMT_GID": 1714844883,
"AGMT_TRANS_GID": 640481290,
"DT_RECEIVED": "20 302",
"LATEST_TRANSACTION_CODE": "I"
}
Snowflake's INFER_SCHEMA provides me with 4 column names as well as their data types.
However, if the Parquet file is stored partitioned, for example under the folder LATEST_TRANSACTION_CODE=I/, I would have the file as
Example : {
"AGMT_GID": 1714844883,
"AGMT_TRANS_GID": 640481290,
"DT_RECEIVED": "20 302"
}
In this case, Snowflake's INFER_SCHEMA only provides three columns; however, reading the same file into a PySpark DataFrame prints all four columns.
I'm wondering if there is a workaround in Snowflake to read a partitioned parquet file.
I faced this issue with Snowflake while handling partitioned Parquet files.
The problem is not limited to INFER_SCHEMA; none of the following flows detect the partitioned-by column as a column in Snowflake:
COPY INTO TABLE from parquet
MERGE INTO TABLE from parquet
SELECT from parquet
INFER_SCHEMA from parquet
Snowflake treats Parquet files as plain files and ignores the metadata encoded in the folder names, whereas Apache Spark infers partitioned columns from them intelligently.
The following approaches are ways to work around this until the Snowflake team addresses it.
Approach 1
Handle this using Snowflake metadata features.
As of now, Snowflake metadata provides only
METADATA$FILENAME - Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER - Row number for each record
We could do something like this:
select $1:normal_column_1, ..., METADATA$FILENAME
FROM
'@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
This gives a column with the full path to the partitioned file, but we still need to derive the partition column from it. For example, it would give something like:
METADATA$FILENAME
----------
path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet
We could do a regexp_replace and get the partition value as a column like this:
select
    regexp_replace(METADATA$FILENAME, '.*\/year=(.*)\/.*', '\\1') as year,
    $1:normal_column_1
FROM
'@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
In the above regexp, we give the partition key.
The third parameter, \\1, is the regex group match number; in our case the first group match holds the partition value.
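As a quick sanity check with the sample path above (a standalone call, not tied to any stage):

select regexp_replace(
    'path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet',
    '.*\/year=(.*)\/.*',
    '\\1') as year;
-- should return: 2021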
Approach 2
If we have control over the flow that writes the source parquet file.
Add a duplicate column that has the same content as the partition-by column.
This should happen before writing to Parquet, so the Parquet file will contain this column's content:
from pyspark.sql.functions import col  # needed for col()
df.withColumn("partition_column", col("col1")).write.partitionBy("partition_column").parquet(path)
With this approach, once done, the new column will appear in every usage of the Parquet file (COPY, MERGE, SELECT, INFER).
Approach 3
If we do not have control over the flow that writes the source parquet file.
This approach is more domain specific and data model specific.
In many use cases, we need to reverse-engineer how the partitioned-by column relates to the data.
Can it be generated from other columns? Say the data is partitioned by year, where year is derived from the created_by column; then this derived data can simply be regenerated.
Can it be generated by joining with another Snowflake table? Say the Parquet has an id that can be joined with another table to derive our column dynamically.
Approach 3 is more problem/domain specific, and we need to handle it in all use cases of the Parquet (COPY, MERGE, SELECT, etc.).
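A rough sketch of both variants (the stage path, table and column names below are made up for illustration):

-- Variant 1: re-derive the partition value from another column in the file
select
    $1:created_date::date        as created_date,
    year($1:created_date::date)  as year            -- regenerated partition column
from @stage_name/path/to/data/ (pattern => '.*.parquet');

-- Variant 2: recover the partition value by joining with another table
select
    p.$1:id::number              as id,
    d.year                       as year            -- looked up dynamically
from @stage_name/path/to/data/ (pattern => '.*.parquet') p
join dim_table d
  on d.id = p.$1:id::number;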

Data Factory - dynamically copy a subset of columns from one database table to another

I have a database on SQL Server on premises and need to regularly copy the data from 80 different tables to an Azure SQL Database. For each table the columns I need to select from and map are different - example, TableA - I need columns 1,2 and 5. For TableB I need just column 1. The tables are named the same in the source and target, but the column names are different.
I could create multiple Copy data pipelines and select the source and target data sets and map to the target table structures, but that seems like a lot of work for what is ultimately the same process repeated.
I've so far created a meta table, which lists all the tables and the column mapping information. This table holds the following data:
SourceSchema, SourceTableName, SourceColumnName, TargetSchema, TargetTableName, TargetColumnName.
For each table, data is held in this table to map the source tables to the target tables.
I have then created a lookup which selects each table from the mapping table. It then does a for each loop and does another lookup to get the source and target column data for the table in the foreach iteration.
From this information, I'm able to map the Source table and the Sink table in a Copy Data activity created within the foreach loop, but I'm not sure how I can dynamically map the columns, or dynamically select only the columns I require from each source table.
I have the "activity('LookupColumns').output" from the column lookup, but would be grateful if someone could suggest how I can use this to then map the source columns to the target columns for the copy activity. Thanks.
In your case, you can use the expression in the mapping setting.
You need to provide an expression, and its data should look like this:
{
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "Id" }, "sink": { "name": "CustomerID" } },
        { "source": { "name": "Name" }, "sink": { "name": "LastName" } },
        { "source": { "name": "LastModifiedDate" }, "sink": { "name": "ModifiedDate" } }
    ]
}
So you need to add a column named Translator to your meta table, and its value should be JSON like the above. Then use this expression to do the mapping: @item().Translator
Reference: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping
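As an illustration only (assuming the meta table described in the question, here given the hypothetical name dbo.ColumnMapping, and SQL Server 2017+ for STRING_AGG), the Translator value could be generated per table with a query along these lines and stored in that new column:

SELECT  SourceSchema,
        SourceTableName,
        '{"type":"TabularTranslator","mappings":['
            + STRING_AGG(
                  '{"source":{"name":"' + SourceColumnName + '"},"sink":{"name":"' + TargetColumnName + '"}}',
                  ',')
            + ']}' AS Translator
FROM    dbo.ColumnMapping          -- hypothetical name for the meta/mapping table
GROUP BY SourceSchema, SourceTableName;

The lookup over the meta table then exposes Translator on each item, which is what @item().Translator injects into the copy activity's mapping.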

How to cope with case-sensitive column names in big data file formats and external tables?

Background
I'm using Azure data factory v2 to load data from on-prem databases (for example SQL Server) to Azure data lake gen2. Since I'm going to load thousands of tables, I've created a dynamic ADF pipeline that loads the data as-is in the source based on parameters for schema, table name, modified date (for identifying increments) and so on. This obviously means I can't specify any type of schema or mapping manually in ADF. This is fine since I want the data lake to hold a persistent copy of the source data in the same structure. The data is loaded into ORC files.
Based on these ORC files I want to create external tables in Snowflake with virtual columns. I have already created normal tables in Snowflake with the same column names and data types as in the source tables, which I'm going to use in a later stage. I want to use the information schema for these tables to dynamically create the DDL statement for the external tables.
The issue
Since column names are always UPPER case in Snowflake, and it's case-sensitive in many ways, Snowflake is unable to parse the ORC file with the dynamically generated DDL statement as the definition of the virtual columns no longer corresponds to the source column name casing. For example it will generate one virtual column as -> ID NUMBER AS(value:ID::NUMBER)
This will return NULL as the column is named "Id" with a lower case D in the source database, and therefore also in the ORC file in the data lake.
This feels like a major drawback with Snowflake. Is there any reasonable way around this issue? The only options I can think of are to:
1. Load the information schema from the source database to Snowflake separately and use that data to build a correct virtual column definition with correct cased column names.
2. Load the records in their entirety into some variant column in Snowflake, converted to UPPER or LOWER.
Both options add a lot of complexity or even mess up the data. Is there any straightforward way to return only the column names from an ORC file? Ultimately I would need to be able to use something like Snowflake's DESCRIBE TABLE on the file in the data lake.
Unless you have set the parameter QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE, you can declare your column in the casing you want:
CREATE TABLE "MyTable" ("Id" NUMBER);
If your dynamic SQL carefully uses "Id" and not just Id you will be fine.
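The same care applies to the virtual column definitions of the external tables: quoting the path element preserves the casing found in the ORC file. A minimal sketch (external table, stage and location are made up):

CREATE EXTERNAL TABLE "MyTable" (
    "Id" NUMBER AS (value:"Id"::NUMBER)   -- quoted path matches the "Id" casing in the ORC file
)
WITH LOCATION = @my_stage/my_table/
FILE_FORMAT = (TYPE = ORC);

-- Alternatively, GET_IGNORE_CASE extracts the element regardless of casing:
-- "Id" NUMBER AS (GET_IGNORE_CASE(value, 'ID')::NUMBER)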
Found an even better way to achieve this, so I'm answering my own question.
With the below query we can get the path/column names directly from the ORC file(s) in the stage, with a hint of the data type from the source. This filters out columns that contain only NULL values. I will most likely create some kind of data type ranking table for the final data type determination for the virtual columns we're aiming to define dynamically for the external tables.
SELECT f.path as "ColumnName"
, TYPEOF(f.value) as "DataType"
, COUNT(1) as NbrOfRecords
FROM (
SELECT $1 as "value" FROM @<db>.<schema>.<stg>/<directory>/ (FILE_FORMAT => '<fileformat>')
),
lateral flatten(input => "value", recursive => true) f
WHERE TYPEOF(f.value) != 'NULL_VALUE'
GROUP BY f.path, TYPEOF(f.value)
ORDER BY 1

How to use the pre-copy script from the copy activity to remove records in the sink based on the change tracking table from the source?

I am trying to use change tracking to copy data incrementally from a SQL Server to an Azure SQL Database. I followed the tutorial on Microsoft Azure documentation but I ran into some problems when implementing this for a large number of tables.
In the source part of the copy activity I can use a query that gives me a change table of all the records that are updated, inserted or deleted since the last change tracking version. This table will look something like
PersonID  Age   Name   SYS_CHANGE_OPERATION
--------------------------------------------
1         12    John   U
2         15    James  U
3         NULL  NULL   D
4         25    Jane   I
with PersonID being the primary key for this table.
The problem is that the copy activity can only append the data to the Azure SQL Database so when a record gets updated it gives an error because of a duplicate primary key. I can deal with this problem by letting the copy activity use a stored procedure that merges the data into the table on the Azure SQL Database, but the problem is that I have a large number of tables.
I would like the pre-copy script to delete the deleted and updated records on the Azure SQL Database, but I can't figure out how to do this. Do I need to create separate stored procedures and corresponding table types for each table that I want to copy or is there a way for the pre-copy script to delete records based on the change tracking table?
You have to use a Lookup activity before the Copy activity. With that Lookup activity you can query the database so that you get the deleted and updated PersonIDs, preferably all in one field, separated by commas (so it's easier to use in the pre-copy script). More information here: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
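A rough sketch of such a lookup query (the table name and stored last-sync version are placeholders, change tracking must be enabled on the table, and STRING_AGG requires SQL Server 2017+):

DECLARE @last_sync_version BIGINT = 42;   -- placeholder: the version saved from the previous run

SELECT STRING_AGG(CAST(CT.PersonID AS VARCHAR(20)), ',') AS PersonIDs
FROM   CHANGETABLE(CHANGES dbo.Person, @last_sync_version) AS CT
WHERE  CT.SYS_CHANGE_OPERATION IN ('U', 'D');

The single PersonIDs field returned here is what the pre-copy script below consumes.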
Then you can do the following in your pre-copy script:
delete from TableName where PersonID in (@{activity('MyLookUp').output.firstRow.PersonIDs})
This way you will be deleting all the deleted or updated rows before inserting the new ones.
Hope this helped!
In the meantime, Azure Data Factory provides the metadata-driven copy task. After going through the dialogue-driven setup, a metadata table is created which has one row for each dataset to be synchronized. I solved this UPSERT problem by adding a stored procedure as well as a table type for each dataset to be synchronized. Then I added the relevant information to the metadata table for each row like this:
{
    "preCopyScript": null,
    "tableOption": "autoCreate",
    "storedProcedure": "schemaname.UPSERT_SHOP_SP",
    "tableType": "schemaname.TABLE_TYPE_SHOP",
    "tableTypeParameterName": "shops"
}
After that you need to adapt the sink properties of the copy task like this (stored procedure, table type, table type parameter name):
@json(item().CopySinkSettings).storedProcedure
@json(item().CopySinkSettings).tableType
@json(item().CopySinkSettings).tableTypeParameterName
If the destination table does not exist, you need to run the whole task once before adding the above variables, because auto-create of tables works only as long as no stored procedure is given in the sink properties.
