Does table retention in Snowflake have any relation to the Time Travel setting? - snowflake-cloud-data-platform

Problem: Does it make sense to have the Time Travel setting as 1 day and the retention setting as 30 days at the table level (permanent table) in Snowflake?

There is no real setting called "time travel"; DATA_RETENTION_TIME_IN_DAYS is the setting that actually controls the duration of your Time Travel. If you are asking whether an account-level setting of 1 day and a table-level setting of 30 days makes sense, then the answer is yes. You can set this parameter at multiple levels (account, database, schema, table) to customize your data retention period per object, as shown below.
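For example (a minimal sketch; the object names are hypothetical), the parameter can be set at any of these levels, with the most specific level winning:

    -- Account-wide default of 1 day
    ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 1;

    -- Table-level override of 30 days for one permanent table
    ALTER TABLE MY_DB.MY_SCHEMA.MY_TABLE SET DATA_RETENTION_TIME_IN_DAYS = 30;

    -- The same parameter can also be set per database or schema
    ALTER DATABASE MY_DB SET DATA_RETENTION_TIME_IN_DAYS = 7;
    ALTER SCHEMA MY_DB.MY_SCHEMA SET DATA_RETENTION_TIME_IN_DAYS = 14;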

As mentioned, there is no separate setting for Time Travel, and no tasks are required to enable it. It is automatically enabled with the standard, 1-day retention period.
To increase this retention period, the DATA_RETENTION_TIME_IN_DAYS parameter is used. It can be set up to 90 days, at the account, database, schema, or table level.
https://docs.snowflake.com/en/user-guide/data-time-travel.html
https://www.youtube.com/watch?v=F1pevMhm7lg
On top of that, there is the Fail-safe period, which provides a (non-configurable) 7-day period during which historical data is recoverable by Snowflake. This period starts immediately after the Time Travel retention period ends.
https://docs.snowflake.com/en/user-guide/data-failsafe.html
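To verify what a given table actually uses, and to confirm Time Travel works within that window, something like the following can be run (a small sketch; MY_DB.MY_SCHEMA.MY_TABLE is a placeholder name):

    -- Show the effective retention period for one table
    SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE MY_DB.MY_SCHEMA.MY_TABLE;

    -- Query the table as it existed one hour ago (Time Travel)
    SELECT * FROM MY_DB.MY_SCHEMA.MY_TABLE AT(OFFSET => -3600);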

Related

Latency in Snowflake Account Usage Views

I am trying to understand the "latency" issue with Account Usage views.
Does the latency, let's say the 45 minutes mentioned for Query History, mean it might take 45 minutes for a query to pull results out of the Account Usage view, or does it mean it might take that long for data to become available in the Account Usage view?
When I query Account Usage in a trial account, the query doesn't take much time, and the view shows the latest SQL details in Query History, so I am not able to understand what the latency denotes.
Another question: if latency means the amount of time the SQL will take to pull results, I assume it will keep the warehouse in a running state, increasing the cost.
Data latency
Due to the process of extracting the data from Snowflake’s internal metadata store, the account usage views have some natural latency:
For most of the views, the latency is 2 hours (120 minutes).
For the remaining views, the latency varies between 45 minutes and 3 hours.
For details, see the list of views for each schema (in this topic). Also, note that these are all maximum time lengths; the actual latency for a given view when the view is queried may be less.
"Does the latency, let's say for Query History mentioned to be 45 min, mean it might take 45 min for a query to pull result out of Account Usage view or does it mean it might take time for data to be available in Account Usage view?"
The term "latency" refers to the time until the data becomes available in the Account Usage view.
It does not mean that the query SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.<some_view> takes 45 minutes to execute, so it also will not keep your warehouse running for that long.
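If you need fresher data than the Account Usage views provide, one common workaround is the INFORMATION_SCHEMA.QUERY_HISTORY table function, which has no latency but only covers recent history. A small sketch of the two side by side:

    -- Account Usage view: long history, but up to 45 minutes of ingestion latency
    SELECT query_id, query_text, start_time
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    ORDER BY start_time DESC
    LIMIT 10;

    -- INFORMATION_SCHEMA table function: no latency, but only the last 7 days
    SELECT query_id, query_text, start_time
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    ORDER BY start_time DESC;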

Is it possible to store historical configuration settings for each row of data without cramming all the configuration settings into each row of data?

For background: I was recently hired as a database engineer for a water treatment company. We deploy water treatment machines to sites across the country, and the machines treat water and send continuous data back to us regarding the state of incoming water (flow rate, temperature, concentration of X in incoming water, etc.), and regarding the treatments the machine applied to that water at that point in time. Over time, sites (and their various components) change a lot: a machine might break down and need to be replaced, a different concentration of chemical may be used to fill the machine's tanks, its flow meters and other sensors might be recalibrated or set to scale differently, its chemical pumps might be replaced, and on and on. These affect the interpretation of the data: for example, if 5 mL of chlorine was added to the incoming water at 01/01/2021 12:00:05, that means two completely different things if the chlorine was 5% concentrated or 40% concentrated.
Water treatment datapoints are identified by a composite key consisting of the ID of the site and a timestamp. It would be easy if the only data that mattered was current data, as I could just store the configuration settings at the Site level and pull them up for datapoints as needed. But we need to be able to correctly interpret older data. So, I thought about storing configurations in another table, tracking all the settings for each site over each time period, but it's not possible to create a foreign key between the continuous timestamps of the datapoints and the start/end dates of the configurations - the closest thing would be some kind of range check, like "Datapoint.TimeStamp BETWEEN Configuration.Start AND Configuration.End". So the only other option I see is to store every configuration setting for every datapoint alongside each datapoint, but that seems like a terrible solution given how many configuration settings there are and how many datapoints are generated, especially since most of the settings don't even change often.
So, is there a way to store historical configurations for each row of data in a way that is at all normalized, or is the only possible solution to cram all the settings into each datapoint?
If I understood your request:
1 - a water datapoint is identified by a composite key consisting of the ID of the site and a timestamp:
SiteID
TimeStampID
2 - a water datapoint can have multiple configurations, for example when a breakdown happens:
ConfigurationID
StartDate
EndDate
Let's consider a DataPoint having the following information for a specific day:
DataPoint  SiteID  TimeStampID
1001       101     01-02-2021 09:00:01
1001       101     01-02-2021 10:20:31
1001       101     01-02-2021 17:45:00
On that day, a breakdown started at 11:01:20 and ended at 11:34:22.
ConfigurationID  DataPoint  StartDate            EndDate
155              1001       01-02-2021 11:01:20  01-02-2021 11:34:22
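Putting that together (a sketch using the column names above; the DataPoint and Configuration table names are illustrative), the configuration in effect for each datapoint can be found with a range join:

    -- Which configuration (if any) was in effect when each datapoint was recorded?
    SELECT d.DataPoint, d.SiteID, d.TimeStampID, c.ConfigurationID
    FROM DataPoint d
    LEFT JOIN Configuration c
      ON c.DataPoint = d.DataPoint
     AND d.TimeStampID BETWEEN c.StartDate AND c.EndDate;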
The original answer that I accepted seems to have been deleted. For anyone coming here in the future, the solution that I intend to go with is as follows:
I'm going to create a configuration table to hold settings in the following format:
_SiteID_  _Start_                _End_                  <various settings fields>
318       "2021-01-01 12:22:03"  "2021-02-10 09:08:26"  ...
Where the primary key is (SiteID, Start, End). SiteID is a foreign key to the integer ID of the Site table, Start is the date at which the configuration starts being valid, and End (default: NULL) is the date at which the configuration is no longer valid.

In order to keep things good and simple for users (and myself), and to prevent any accidental updates to old configuration settings when instead there should have been a new configuration row inserted, I'm going to disallow UPDATE and DELETE operations on the configuration table for all users except root, and instead create a stored procedure for "updating" the configuration of a given Site. The stored procedure will take whatever new parameters the user specified, copy in any parameters that the user DIDN'T specify from the most recent configuration for that Site (i.e., the row with the same SiteID and the NULL End date), overwrite the most recent configuration row's NULL End date to be the Start date for the new row, and finally create the new row with the specified Start date.
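A minimal sketch of that table and the point-in-time lookup, assuming hypothetical names (ChlorineConcentrationPct stands in for the various settings fields, DoseMl for a datapoint measurement, and Site/DataPoint tables are assumed to exist). Note that because End is nullable, most engines cannot include it in the primary key, so the sketch keys on (SiteID, StartDate) instead:

    CREATE TABLE SiteConfiguration (
        SiteID                   INT      NOT NULL,
        StartDate                DATETIME NOT NULL,
        EndDate                  DATETIME NULL,  -- NULL means "currently active"
        ChlorineConcentrationPct DECIMAL(5,2),   -- stand-in for the settings fields
        PRIMARY KEY (SiteID, StartDate),
        FOREIGN KEY (SiteID) REFERENCES Site (SiteID)
    );

    -- Point-in-time lookup: the configuration in effect for each datapoint
    SELECT d.SiteID, d.TimeStamp, d.DoseMl, c.ChlorineConcentrationPct
    FROM DataPoint d
    JOIN SiteConfiguration c
      ON c.SiteID = d.SiteID
     AND d.TimeStamp >= c.StartDate
     AND (c.EndDate IS NULL OR d.TimeStamp < c.EndDate);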
NOTE: the Start date and End date are both stored for each configuration because configurations might not necessarily be continuous, i.e. it is not the case that "as soon as a configuration expired, there is another configuration that starts at the exact time that that configuration expired", as deployments of water treatment equipment sometimes have large gaps in between them if a client doesn't need our services for some period of time. Without storing the End dates for configurations too, we would have to assume that each configuration lasts until the next configuration begins, or until now, if there is no later configuration stored. So End date is stored so that we don't ever think "Site A was configured to have X Y Z settings from January 2020 to June 2021" when there hasn't even been a machine at Site A since May of 2020. Storing the End date explicitly alongside the Start date also avoids the ickiness of needing to rely on the values in other rows of configuration data to know how to interpret a given row of configuration data.
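The "update" procedure described above could look roughly like this (MySQL syntax assumed, simplified to the single illustrative setting; a real version would repeat the COALESCE pattern for every settings column):

    DELIMITER //
    CREATE PROCEDURE UpdateSiteConfiguration(
        IN p_SiteID INT,
        IN p_Start  DATETIME,
        IN p_ChlorinePct DECIMAL(5,2)  -- NULL means "keep the current value"
    )
    BEGIN
        DECLARE v_CurrentPct DECIMAL(5,2);
        START TRANSACTION;
        -- Read the currently active row so unspecified parameters carry over
        SELECT ChlorineConcentrationPct INTO v_CurrentPct
        FROM SiteConfiguration
        WHERE SiteID = p_SiteID AND EndDate IS NULL
        FOR UPDATE;
        -- Close out the active configuration at the new Start date
        UPDATE SiteConfiguration
           SET EndDate = p_Start
         WHERE SiteID = p_SiteID AND EndDate IS NULL;
        -- Insert the new row, falling back to old values where none were given
        INSERT INTO SiteConfiguration
            (SiteID, StartDate, EndDate, ChlorineConcentrationPct)
        VALUES
            (p_SiteID, p_Start, NULL, COALESCE(p_ChlorinePct, v_CurrentPct));
        COMMIT;
    END //
    DELIMITER ;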
Thank you to whoever it was who originally gave me the inspiration for this answer, I have no idea why your answer was deleted.

Can time-travel and fail-safe apply for an object at the same time?

Can Time Travel and Fail-safe apply to an object at the same time? I understand the Fail-safe period of 7 days starts only after the Time Travel period ends for an applicable object. However, as per the Snowflake University MindTickle assessment, Fail-safe is available for tables that have Time Travel. Please explain.
Check all true statements about Fail-safe:
…
[ ] "Fail-safe is not available for tables that have Time Travel"
The (curveball) choice in context refers to Time Travel and Fail-safe as features available to any permanent table, not as its currently used areas of storage.
It reads better when phrased as:
"Fail-safe feature is not available for tables that have Time Travel features"
Since the question's context is about the Fail-safe feature, which only applies to permanent tables, the statement is false.
For a permanent table, both Time Travel and Fail-safe are applicable.
There is no direct connection between Time Travel and Fail-safe, though both are part of the continuous data protection strategy. A few key differences (see the example after this list):
Time Travel can be configured: the default is 1 day, and it can range from 0 to 90 days (at most 1 day on Standard Edition, and up to 90 days on Enterprise Edition and above).
Fail-safe is not configurable; it is a fixed 7-day period.
To recover data from Fail-safe, you need to contact Snowflake Support.
Fail-safe is not applicable to transient and temporary tables.
Fail-safe lies outside the Time Travel boundary and is only needed if there is a disaster; all other cases can be managed with Time Travel.
Both Time Travel and Fail-safe incur storage costs.
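A quick illustration of the difference in Snowflake DDL (the table names are hypothetical):

    -- Permanent table: configurable Time Travel plus the 7-day Fail-safe
    CREATE TABLE t_perm (id INT) DATA_RETENTION_TIME_IN_DAYS = 30;

    -- Transient table: at most 1 day of Time Travel, and no Fail-safe at all
    CREATE TRANSIENT TABLE t_trans (id INT);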

Which data store is best for my scenario

I'm working on an application that involves a very high volume of UPDATE/SELECT queries against the database.
I have a base table (A) which will have about 500 records for an entity for a day. And for every user in the system, a variation of this entity is created based on some of the preferences of the user and they are stored in another table (B). This is done by a cron job that runs at midnight everyday.
So if there are 10,000 users and 500 records in table A, there will be 5M records in table B for that day. I always keep data for one day in these tables and at midnight I archive historical data to HBase. This setup is working fine and I'm having no performance issues so far.
There has been some change in the business requirements lately, and now some attributes in base table A (for 15-20 records) will change every 20 seconds; based on that, I have to recalculate some values for all of the corresponding variation records in table B for all users. Even though only 20 master records change, I need to recalculate and update 200,000 user records, which takes more than 20 seconds, and by then the next update occurs, eventually resulting in all SELECT queries getting queued up. I'm getting about 3 GET requests per 5 seconds from online users, which results in 6-9 SELECT queries. To respond to an API request, I always use the fields in table B.
I can buy more processing power and solve this situation but I'm interested in having a properly scaled system which can handle even a million users.
Can anybody here suggest a better alternative? Does NoSQL + a relational database help me here? Are there any platforms/datastores that will let me update data frequently without locking, and at the same time give me the flexibility of running SELECT queries on various fields of an entity?
Cheers
Jugs
I recommend looking at an in-memory DBMS that fully implements MVCC, to eliminate blocking issues. If your application currently uses SQL, then there's no reason to move away from it to NoSQL. The performance requirements you describe can certainly be met by an in-memory, SQL-capable DBMS.
From what you are saying, I understand that you are updating 200K records every 20 seconds, so within about 10 minutes you will have updated almost all of your data. In that case, why are you writing that state to the database at all if it is updated so frequently? I don't know anything about your requirements, but why don't you just calculate it on demand using the data from table A?
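Purely as an illustration of that idea (every name and the formula here are hypothetical), the per-user values could be derived at query time from table A and the stored preferences instead of being materialized:

    -- Hypothetical on-demand view replacing the materialized table B
    CREATE VIEW user_entity_values AS
    SELECT a.entity_id,
           p.user_id,
           a.base_value * p.preference_factor AS adjusted_value  -- stand-in formula
    FROM table_a a
    CROSS JOIN user_preferences p;  -- every user gets a variation of every entity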

When does a time-dependent workflow rule execute?

I have set a time-dependent workflow rule with the condition below:
6 days after a particular date (the 1st follow-up date, in my case), the workflow rule should update a picklist field (current status, in my case).
My question is: at what time on the 6th day will it execute?
Do we have control over this time?
Regards,
Ankit
We only control the day of execution, because:
1. Salesforce evaluates time-based workflow on the organization's time zone, not the users'. Users in different time zones may see differences in behavior.
2. Time-dependent actions aren't executed independently. They're grouped into a single batch that starts executing within one hour after the first action enters the batch.
3. Time triggers don't support minutes or seconds.
4. Salesforce limits the number of time triggers an organization can execute per hour. If an organization exceeds the limits for its Edition, Salesforce defers the execution of the additional time triggers to the next hour.
