External table refresh in Snowflake - snowflake-cloud-data-platform

I have an external table that is based on an S3 storage integration.
The files in the S3 bucket are refreshed only once a day, at 12:30 UTC.
Option 1:
I enable auto-refresh on the table.
create or replace external table EXT_TABLE
(
  "ID" VARCHAR as (nullif(value:c1, '')::VARCHAR),
  ...........
)
with location = @STAGE_NAME
auto_refresh = true
file_format = user_data_format
pattern = '.*su.*[.]csv';
Option 2:
Create a task that runs at 13 UTC and does
alter external table EXT_TABLE refresh;
Is there any difference in cost between the two options, refreshing with a task (option 2) versus refreshing automatically (option 1), given that the files in S3 are overwritten only once a day?
Which is the better approach, and what is the cost difference between option 1 and option 2?

If you are sure that the S3 bucket will be refreshed only once a day at 12:30 UTC, I would go with the task.
First of all, there is a small charge for receiving notifications:
https://docs.snowflake.com/en/user-guide/tables-external-intro.html#billing-for-refreshing-external-table-metadata
I am not sure about this, but you may receive multiple notifications if you upload multiple files to your S3 bucket. (I hope someone can verify this.)
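For reference, a minimal sketch of what the task-based option could look like, assuming the external table is created with auto_refresh = false and that a warehouse named MY_WH exists; the task name REFRESH_EXT_TABLE is illustrative:
create or replace task REFRESH_EXT_TABLE
  warehouse = MY_WH
  schedule = 'USING CRON 0 13 * * * UTC'  -- daily at 13:00 UTC, after the 12:30 UTC upload
as
  alter external table EXT_TABLE refresh;

-- Tasks are created suspended; resume the task to start the schedule.
alter task REFRESH_EXT_TABLE resume;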

Related

Flink TableConfig setIdleStateRetention does not seem to be working

I have a Kafka stream and a Hive table that I want to use as a lookup table to enrich data from Kafka. The Hive table points to Parquet files in S3. The Hive table is updated once a day with an INSERT OVERWRITE statement, which means the older files in that S3 path are replaced by newer files once a day.
Every time the Hive table is updated, the newer data from the Hive table is joined with the historical data from Kafka, and this results in older Kafka data getting republished. I see this is the expected behaviour from this link.
I tried to set an idle state retention of 2 days as shown below, but it looks like Flink is not honoring the 2-day idle state retention and seems to be keeping all the Kafka records in table state. I was expecting only the last 2 days of data to be republished when the Hive table is updated. My job has been running for one month and instead I see records as old as one month still being sent in the output. I think this will make the state grow forever and might result in an out-of-memory exception at some point.
One possible reason for this, I think, is that Flink keeps the state of the Kafka data keyed by the sales_customer_id field, because that is the field used to join with the Hive table, and as soon as another sale comes in for that customer id, the state expiry is extended for another 2 days? I am not sure whether this is the reason, but I wanted to check with a Flink expert on what the possible problem could be here.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
TableConfig tableConfig = tableEnv.getConfig();
Configuration configuration = tableConfig.getConfiguration();

// Keep idle state for at most 2 days and allow SQL hints in table options.
tableConfig.setIdleStateRetention(Duration.ofHours(24 * 2));
configuration.setString("table.dynamic-table-options.enabled", "true");

DataStream<Sale> salesDataStream = ....;
Table salesTable = tableEnv.fromDataStream(salesDataStream);

// Hive table read as a streaming source, partitions ordered by create time.
Table customerTable = tableEnv.sqlQuery("select * from my_schema.customers" +
    " /*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.partition-order'='create-time') */");

Table resultTable = salesTable.leftOuterJoin(customerTable, $("sales_customer_id").isEqual($("customer_id")));
DataStream<Sale> salesWithCustomerInfoDataStream =
    tableEnv.toRetractStream(resultTable, Row.class).map(new RowToSaleFunction());

How to copy from a big table to another table in Snowflake?

I have a roughly 7 TB table in Snowflake, and I want to copy half of that table to a new table, for example with a country filter. What technique would you recommend: insert into ... select * from TABLE where COUNTRY = 'A', or use Snowpipe to send Parquet files to S3 and then COPY INTO the Snowflake target table?
I tried the first option. Five hours later the process was at 35%. I read a post where someone had to scale the cluster to an XL instance; he read another post saying Snowpipe is the good option. My cluster is only an XS :(
By the way, I have a cluster key, and the mission is to segment the data by country per company policy.
The original table contains events from the devices that have the app installed, about 30 events per session minute, for example an Uber or Lyft app.
An MV will definitely be more performant than a standard view, but there is an extra cost associated with that, as Snowflake has to keep the MV in sync with the table. It sounds like the table will be changing rapidly, so this cost will be continuous.
Another option is to create a stream on the source table and use a task to merge the stream data into the target table. Tasks require a running warehouse, but I've found that an XS warehouse is very capable, so at a minimum you're talking 24 credits per day. Tasks also have a minimum 1-minute interval, so if you need bleeding-edge freshness, that might rule this option out.
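A rough sketch of the stream-plus-task approach, assuming illustrative object names (SRC_EVENTS, TGT_EVENTS, SRC_EVENTS_STREAM, MY_WH) and a simple ID/PAYLOAD schema; adapt the merge keys and columns to your table:
create or replace stream SRC_EVENTS_STREAM on table SRC_EVENTS;

create or replace task MERGE_SRC_EVENTS
  warehouse = MY_WH
  schedule = '5 MINUTE'
when
  system$stream_has_data('SRC_EVENTS_STREAM')
as
  merge into TGT_EVENTS t
  using SRC_EVENTS_STREAM s
    on t.ID = s.ID
  when matched then update set t.PAYLOAD = s.PAYLOAD
  when not matched then insert (ID, PAYLOAD) values (s.ID, s.PAYLOAD);

alter task MERGE_SRC_EVENTS resume;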

Can Snowflake Auto Purge records older than X number of days?

For an existing table in Snowflake, is there a way we can set a TTL for each record?
In other words, can I ensure that records updated/created more than 90 days ago are automatically purged periodically?
You can use a Snowflake TASK to run deletes on a routine schedule. And if you are dealing with a very large table, I recommend that you cluster it on the DATE of whatever field you are using to delete from. This will increase the performance of the delete statement. Unfortunately, there is no way to set this on a table and have it remove records automatically for you.
Opt 1. If the table is used for analytics, you can build a view on top of it to retrieve only the last 90 days of data (doing this, you keep the history).
Opt 2. You can run a SQL statement on a schedule which deletes the records that are more than 90 days old, as sketched below.
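A minimal sketch of both options, assuming the table has a LAST_UPDATED timestamp column; all object names (MY_TABLE, MY_WH, and so on) are placeholders:
-- Opt 1: a view that exposes only the last 90 days (the full history stays in the table).
create or replace view MY_TABLE_LAST_90_DAYS as
  select * from MY_TABLE
  where LAST_UPDATED >= dateadd(day, -90, current_timestamp());

-- Opt 2: a scheduled task that purges records older than 90 days.
create or replace task PURGE_OLD_RECORDS
  warehouse = MY_WH
  schedule = 'USING CRON 0 2 * * * UTC'   -- daily at 02:00 UTC
as
  delete from MY_TABLE
  where LAST_UPDATED < dateadd(day, -90, current_timestamp());

alter task PURGE_OLD_RECORDS resume;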

How to reset tables in Salesforce

I have a development version of Salesforce that I'd like to delete all the records from. My analyst is telling me that doing so manually will take a week, or we can refresh it, which will take anywhere from minutes to a week.
It seems to me that there must be a better way of doing this. In SQL, I'd make a list of tables to truncate and dynamically truncate each one. Is there some way to do this with Salesforce?
It won't take you a week to refresh a sandbox; it usually takes only about 30 minutes. You can also write an execute anonymous script that will delete data from your tables:
// Query the records, then delete them (subject to Apex governor limits on
// rows per query and per DML operation, so large objects may need batching).
List<YourObject> objectData = [SELECT Id FROM YourObject];
delete objectData;

Auto-updating Access database (can't be linked)

I've got a CSV file that refreshes every 60 seconds with live data from the internet. I want to automatically update my Access database (on a roughly 60-second interval) with the new rows that get downloaded; however, I can't simply link the DB to the CSV.
The CSV comes with exactly 365 days of data, so when another day ticks over, a day of data drops off. If I were to link to the CSV, my DB would only ever have those 365 days of data, whereas I want to append the newly added data to the existing database.
Any help with this would be appreciated.
Thanks.
As per the comments, the first step is to link your CSV to the database, not as your main table but as a secondary table that will be used to update your main table.
Once you do that, you have two problems to solve:
1. Identify the new records. I assume there is a way to do so by timestamp or ID, so all you have to do is hold on to the last ID or timestamp imported (that will require an additional mini-table to hold the value persistently). A sketch of the queries follows this list.
2. Make it happen every 60 seconds. To get that update on a regular interval you have two options:
A form's 'OnTimer' event is the easy way but requires very specific conditions. You have to make sure the form that triggers the event is only open once. This is possible even in a multi-user environment with some smart tracking.
If having an Access form open to do the updating is not workable, then you have to work with Windows scheduled tasks. You can set up an Access macro to run as a Windows scheduled task.
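A minimal sketch of the two queries for step 1, assuming the linked CSV table is called csvLinked, the main table is tblMain, the timestamp column is RecordTime, and a one-row table tblLastImport stores the high-water mark; all names and columns are illustrative. The first query appends only rows newer than the last imported timestamp, the second advances the high-water mark for the next run:
INSERT INTO tblMain (RecordTime, Value1, Value2)
SELECT c.RecordTime, c.Value1, c.Value2
FROM csvLinked AS c
WHERE c.RecordTime > DMax("LastImported", "tblLastImport");

UPDATE tblLastImport
SET LastImported = DMax("RecordTime", "csvLinked");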