How to create a partitioned Athena table with SageMaker Feature Store - amazon-sagemaker

I'm using SageMaker Feature Store and trying to create an offline feature store. During the process, SageMaker creates an Athena table. However, I notice that this table is not partitioned, and when I run a query, it takes forever.
How can I use SageMaker Feature Store to create a partitioned Athena table?
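For context, here is a minimal sketch of the setup described above, using the SageMaker Python SDK. The feature group name, S3 URIs, IAM role, and column names are placeholders, not values from the question.

```python
# Sketch of the setup described above: creating a feature group with an
# offline store, which is what provisions the Glue/Athena table.
# Names, S3 URIs, and the IAM role below are placeholders.
import pandas as pd
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

session = Session()
df = pd.DataFrame({
    "record_id": pd.Series([1, 2], dtype="int64"),
    "feature_1": pd.Series([0.5, 0.7], dtype="float64"),
    "event_time": pd.Series([1700000000.0, 1700000000.0], dtype="float64"),
})

fg = FeatureGroup(name="my-feature-group", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)
fg.create(
    s3_uri="s3://my-bucket/offline-store",   # offline store location
    record_identifier_name="record_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::<account>:role/<sagemaker-role>",
    enable_online_store=False,
)

# Once the feature group is active and data has been ingested, the offline
# store can be queried through the generated Athena table:
query = fg.athena_query()
query.run(
    query_string=f'SELECT * FROM "{query.table_name}" LIMIT 10',
    output_location="s3://my-bucket/athena-results/",
)
query.wait()
```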

Related

SQL Server table daily sync of records from table A to table B

I want to create a daily process where I reload all rows from table A into table B. Over time, table A rows will change due to changes in the source system and also because of aging/deletion of records in the origin table. Table A gets truncated/reloaded daily in step 1. Table B is the master table that just gets new/updated rows.
From a historical point of view, I want to keep track of ALL the rows in table B and be able to do a point-in-time comparison for analytics purposes.
So I need to do two things: daily, insert rows from table A into table B if they don't exist, and also create a new record in table B if the record already exists but ANY of the columns have changed. At one point I attempted to use temporal tables, but I had too many false positives on 'real' changes; basically, certain columns were throwing things off because a date/time column was updated (the only real change in the row).
I'm using an Azure SQL Server Managed Instance database (Microsoft SQL Azure (RTM) - 12.0.2000.8).
At my disposal I have SSMS, SQL Server and also Azure Data Factory.
Any suggestions on the best way to do this or tools to help with this?
There are two approaches, out of which you can implement either one:
Temporal tables
Change Data Capture (CDC)
CDC is the more commonly used approach: you create an Azure Data Factory pipeline that loads delta data, based on change data capture (CDC) information in the source Azure SQL Managed Instance database, into Azure Blob storage.
To implement CDC, you can follow this Microsoft tutorial: Incrementally load data from Azure SQL Managed Instance to Azure Storage using change data capture (CDC)
Note: You also need to create a storage account, which is required but not covered in the above tutorial.
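For reference, a minimal sketch of enabling CDC on the source table, assuming a hypothetical dbo.TableA in the managed instance and pyodbc connectivity; all connection details are placeholders.

```python
# Sketch: enable CDC so ADF can pick up deltas from the source table.
# The table dbo.TableA and the connection string are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<managed-instance>.database.windows.net;"
    "DATABASE=SourceDb;UID=<user>;PWD=<password>",
    autocommit=True,
)
cur = conn.cursor()

# Enable CDC at the database level (run once per database).
cur.execute("EXEC sys.sp_cdc_enable_db")

# Enable CDC for the table that feeds table B.
cur.execute("""
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'TableA',
         @role_name     = NULL,
         @supports_net_changes = 1
""")
conn.close()
```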

Databricks Delta tables vs SQL Server Delta table

Is there a difference between a SQL delta table and a Databricks Delta table? It looks like in SQL we use the name on a conceptual basis: the table that stores the differences from a base table is the delta. Is it the same for Databricks?
No. Databricks Delta is a storage layer that provides ACID transactions and other improvements for storing large amounts of data for use with Apache Spark. It is used to store complete datasets, which can be updated if necessary. Delta is an open source project, with some enhancements available on the Databricks platform.
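To make the distinction concrete, here is a minimal PySpark sketch of a Delta table holding a complete dataset and being updated in place, assuming a Spark session with Delta Lake available (for example, a Databricks cluster); the path and data are illustrative only.

```python
# Sketch: Delta stores full datasets and supports in-place updates,
# unlike a "delta table" that only holds row differences.
# Assumes Delta Lake is available in the Spark session (e.g. Databricks).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a complete dataset as a Delta table.
df = spark.createDataFrame([(1, "open"), (2, "closed")], ["id", "status"])
df.write.format("delta").mode("overwrite").save("/tmp/orders_delta")

# Update rows in place -- an ACID operation on the same table.
spark.sql("UPDATE delta.`/tmp/orders_delta` SET status = 'open' WHERE id = 2")
```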

Add only unique rows in Snowflake Cloud Database

I want to automate the ingestion of data from a source into a Snowflake cloud database. There is no way to extract only unique rows from the source, so the entire dataset will be extracted during every ingestion run. However, when adding to Snowflake I only want to add the unique rows. How can this be achieved most optimally?
Further Information: Source is a DataStax Cassandra Graph.
Assuming there is a key that you can use to determine which records need to be loaded, the ideal scenario would be to load the data into a stage table in Snowflake and then run a MERGE statement using the new data, applying the changes to your target table.
https://docs.snowflake.com/en/sql-reference/sql/merge.html
If there is no key, you might want to consider running an INSERT OVERWRITE statement and just replacing the table with the new incoming data.
https://docs.snowflake.com/en/sql-reference/sql/insert.html#insert-using-overwrite
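A minimal sketch of the stage-and-MERGE approach using snowflake-connector-python; the stage, table, and column names (my_stage, STG_EVENTS, EVENTS, ID, PAYLOAD) are hypothetical placeholders.

```python
# Sketch of the stage-and-MERGE approach. Table, column, and stage names
# are placeholders for illustration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# 1. Load the full extract into a stage table.
cur.execute("COPY INTO STG_EVENTS FROM @my_stage/events/ FILE_FORMAT = (TYPE = CSV)")

# 2. Merge into the target, inserting only rows whose key is not present.
cur.execute("""
    MERGE INTO EVENTS AS tgt
    USING STG_EVENTS AS src
      ON tgt.ID = src.ID
    WHEN NOT MATCHED THEN
      INSERT (ID, PAYLOAD) VALUES (src.ID, src.PAYLOAD)
""")
conn.close()
```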
You will have to stage it to a table in Snowflake for ingestion and then move it to the destination table using SELECT DISTINCT.

Monitor data change in redshift

I am trying to find a tool or methodology to record when an update is made against a specific table and column in AWS Redshift.
In PostgreSQL there is a way of doing this with triggers, but Redshift does not support triggers.
Can we monitor UPDATE statements and store the timestamp, the old value, the new one, and the affected table?
There is no in-built capability in Amazon Redshift to do change detection.
Amazon Redshift is intended as a Data Warehouse, which typically means that bulk information is loaded from external sources. It should be relatively rare for data to be updated within Amazon Redshift because it is not intended to be used as an OLTP database.
Thus, it would be better to put change detection in the source database or in the ETL pipeline, rather than Redshift.
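If the updates are applied by an ETL job, one way to capture old and new values is to write them to an audit table in the same step, before the UPDATE runs. A rough sketch, assuming hypothetical users, users_staging, and users_audit tables and the redshift_connector driver:

```python
# Sketch: do the change detection in the ETL step that applies updates,
# since Redshift has no triggers. Table names (users, users_staging,
# users_audit) and the "email" column are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="<cluster>.redshift.amazonaws.com",
    database="dev", user="<user>", password="<password>",
)
cur = conn.cursor()

# Record old vs. new values for rows about to change
# (NULL handling omitted for brevity).
cur.execute("""
    INSERT INTO users_audit (table_name, column_name, old_value, new_value, changed_at)
    SELECT 'users', 'email', t.email, s.email, GETDATE()
    FROM users t
    JOIN users_staging s ON s.user_id = t.user_id
    WHERE t.email <> s.email
""")

# Then apply the update itself.
cur.execute("""
    UPDATE users
    SET email = s.email
    FROM users_staging s
    WHERE users.user_id = s.user_id
""")
conn.commit()
conn.close()
```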

What steps are performed by a 'Rescan'?

To automatically warehouse documents from Cloudant to dashDB, there is a schema discovery process (SDP) that automates the data migration for you. When using the SDP, there is a 'Rescan' option.
I have used 'Rescan' a number of times, but am unclear on the steps it actually performs. What steps are performed by a 'Rescan'? E.g.
Drop tables in the dashDB target schema? Which tables?
Scan Cloudant source database?
Recreate the target schema?
...
...
The steps are pretty much as you suggested. Rescan will:
Inspect the previously discovered JSON schema and remove all tables from the dashDB instance created for that load (leaving any user-defined tables untouched)
Re-discover the JSON schema using the current settings (including sample size, type of discovery algorithm, etc.)
Create the new tables in the same dashDB target
Populate the newly created tables with data from Cloudant
Subscribe to the _changes feed from Cloudant to continuously synchronize document changes with dashDB (see the sketch after this answer)
All steps (except for the first) are identical for the initial load as well as the rescan function.
The main motivation for a rescan is to support schema evolution. Whenever the document structure in a Cloudant source database changes, a user can make a conscious decision to drop and re-create the dashDB tables using this rescan function. SDP won't automate that process to avoid potential conflicts with applications depending on the existing dashDB tables.
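For illustration, the continuous synchronization step corresponds to reading Cloudant's _changes feed over HTTP. The minimal sketch below uses the public _changes API with placeholder account, database, and credentials; it is not the SDP's actual implementation.

```python
# Sketch: what "subscribe to the _changes feed" looks like against the
# Cloudant HTTP API. Account, database, and credentials are placeholders;
# the SDP performs this step internally on your behalf.
import json
import requests

url = "https://<account>.cloudant.com/<database>/_changes"
params = {"feed": "continuous", "include_docs": "true", "since": "now"}

with requests.get(url, params=params, auth=("<user>", "<password>"), stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue  # heartbeat newlines keep the connection alive
        change = json.loads(line)
        # Each change carries the document id, revision, and (optionally) the doc body.
        print(change["id"], change["changes"][0]["rev"])
```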
