How much data is actually in Time Travel? - snowflake-cloud-data-platform

is it possible to Query Time Travel depth available to a table?
Not the configured retention days, but what is actually stored in Time Travel for a table by a date/time. Ex. a Table is configured for 90 day TT but only has 10 days in TT.
I was thinking about the case of when querying TT and if the data isn't there then we get an error, but is there a way to CHECK, something like a SELECT TT-DATE in a Table?

The view TABLE_STORAGE_METRICS has a column (TIME_TRAVEL_BYTES) to show how much storage is used for time travel:
https://docs.snowflake.com/en/sql-reference/info-schema/table_storage_metrics.html
On the other hand, no historical data shows how much Time Travel data is stored for a table by a date/time.
I was thinking about the case of when querying TT and if the data isn't there then we get an error, but is there a way to CHECK, something like a SELECT TT-DATE in a Table?
There is a recent behaviour change about time travel:
https://community.snowflake.com/s/article/Time-Travel-Queries-Beyond-Data-Retention-Period-Will-Fail-Pending
This ensures that any query using time travel data will not be able to fetch data beyond the data retention period. Previously, you were able to query TT data beyond the retention period if the data is not changed. So if you check the configured retention days (and table creation date), you should not get any error when you query TT data.
https://community.snowflake.com/s/article/Time-Travel-Queries-Beyond-Data-Retention-Period-Will-Fail-Pending

Related

Converting Large Data Table To Use Partitions

I have a single MSSQL 2017 Standard table, let's call it myTable, with data going back to 2015, containing 206.4 million rows. Once INSERTed by the application, these rows are never modified or deleted. The table is actively collecting data, 24/7.
My goal is to reduce the data in this table to only the most recent full 6 months plus current month, into monthly-based partitions for easy monthly pruning. myTable.dateCreated would determine which partition the data ultimately resides.
(Unrelated, but mentioning in case it ends up being relevant: I have an existing application that replicates all data that gets stored in myTable out to a data warehouse for long term storage every 15 minutes; the main application is able to query myTable for recent data and the data warehouse for older data as needed.)
Because I want to prune the oldest one month worth of data out of myTable each time a new month starts, partitioning myTable by month makes the most sense - I can simply SWITCH the oldest partition to a staging table, then truncate that staging table without causing downtime or performance on the main table.
I've come up with the following plan, and my questions are simple: Is this the best way to approach this task, and will it keep downtime/performance degradation to a minimum?
Create a new table, myTable_pending, with the same exact table structure as myTable, EXCEPT that it will have a total of 7 monthly partitions (6 months retention plus current month) configured;
In one complete step: rename myTable to myTable_transfer, and rename myTable_pending to myTable. This should have the net effect of allowing incoming data to continue being stored, but now it will be in a partition for the month of 2023-01;
Step 3 is where I need advice... which of the following might be best to get the remaining 6mos + current data back into the now-partitioned myTable, or are there additional options I should consider?
OPTION 1: Run a Bulk Insert of just the most recent 6 months of data from myTable_transfer back into myTable, causing the data to end up in the correct partitions in the process (with the understanding that this may still take some time, but not as long as a bunch of INSERTs that would end up chewing on the transaction log);
OPTION 2: Run a DELETE against myTable_transfer, getting rid of all data except the most recent full 6 months + current, and then set up and apply partitions on THIS table, that would then cause SQL Server to reorganize the data into those partitions, but without affecting access or performance on myTable, after which I could just SWITCH the partitions from myTable_transfer into myTable for immediate access; (related issue: since myTable is still collecting current data, and myTable_transfer will contain data from the current month as well, can the current month partitions be merged?)
OPTION 3: Any other way to do this, so that myTable ends up with 6 months worth of data, properly partitioned, without significant downtime?
We ended up revising our solution, since the original table was replicated to a data warehouse anyway, we simply renamed the table and created a new one with partitioning to start collecting new data from the rename point. This provided the least amount of downtime, the fastest schema changes, and gave us the partitioning we needed to maintain the table efficiently going forward.

Detecting and Publishing Changes to Data in SQL Server in Real-time

I have an ERP System (Navision) where product data and stock numbers are frequently updated. Every time an attribute of a product is updated I want this change to be pushed to another SQL Server using Service Broker. I was considering using triggers for the detection, but I am unsure if that is the best way, and whether this is scalable. I expect updates to happen approx. once per second, but this number might double or triple.
Any feedback would be appreciated.
Add a column for Last Modified Date for each record and update this column using the trigger each time a record is being updated. Then Run a scheduled job at a specific time each day (Off-business hours preferred) So that all records that are updated after the last scheduled run is processed.
So The following items need to be done
Add a new column LastModifiedDate in the table with DATETIME data type.
Create a Trigger to update the ModifiedDate each time the record is updated
Create a new table to store the schedule run date and time
Create a scheduled job on Database that will run at a specified time every day.
This job will pick all the records that have the value greater than the date in the Table Create on Step#4.
So Since only 1 column is being updated in the trigger, it won't affect the performance of the table. Also since we are running the update job only once a day, It will also reduce the Database Traffic.

Selecting data across day boundaries from Star schema with separate date and time dimensions

What is the correct way to model data in a star schema such that a BI tool (such as PowerBI) can select a date range crossing multiple days?
I've currently got fact tables that have separate date and time dimensions. My time resolution is to the second, date resolution is to the day.
It's currently very easy to do aggregation providing the data of interest is in the same day, or even multiple complete days, but it becomes a lot more complicated when you're asking for, say, a 12 hour rolling window that crosses the midnight boundary.
Yes, I can write a SQL statement to first pull out all rows for the entirety of the days in question, and then by storing the actual date time as a field in the fact table I can further filter down to the actual time range I'm interested in, but that's not trivial (or possible in some cases) to do in BI reporting tools.
However this must be a frequent scenario in data warehouses... So how should it be done?
An example would be give me the count of ordered items from the fact_orders table between 2017/Jan/02 1600 and 2017/Jan/03 0400.
Orders are stored individually in the fact_orders table.
In my actual scenario I'm using Azure SQL database, but it's more of a general design question.
Thank you.
My first option would be (as you mention in the question) to include a calculated column (Date + Time) in the SQL query and then filter the time part inside the BI tool.
If that doesn't work, you can create a view in the database to achieve the same effect. The easiest is to take the full joined fact + dimensions SQL query that you'd like to use in the BI tool and add the date-time column in the view.
Be sure to still filter on the Date field itself to allow index use! So for your sliding window, your parameters would be something like
WHERE Date between 2017/Jan/02 AND 2017/Jan/03
AND DateTime between 2017/Jan/02 1600 and 2017/Jan/03 0400
If that doesn't perform well enough due to data volumes, you might want to set up and maintain a separate table or materialized view (depending on your DB and ETL options) that does a Cartesian join of the time dimension with a small range of the Date dimension (only the last week or whatever period you are interested in partial day reports), then join the fact table to that.
The DateTimeWindow table/view would be indexed on the DateTime column and have only two extra columns: DateKey and TimeKey. Inner join that to the fact table using both keys and you should get exactly the window you want when the BI tool supplies a datetime range.
That is not easily modeled. A solution would be to build a additional dimension with date + time. Of course this could means you have to severely limit the granularity of the time dimension.
10 year hour granularity: 365 * 10 * 24 = 87600 rows
10 year minute granularity: 365 * 10 * 24 * 60 = 5256000 rows
You could use just this dimension, or (better) add it and do not show it to all users. It would means an additional key in the fact table: if the FT is not gigantic, no big deal.

What's the best way to save large monthly data backups in SQL?

I work on a program that stores information about network connections across my University and I have been asked to create a report that shows the status changes of these connections over time. I was thinking about adding another table that has the current connection information and the date the data was added so when the report is run, it just grabs the data at that date, but I'm worried that the report might get slow after a couple of months as it would be adding about 50,000 rows every month. Is there a better way to do this? We use a Microsoft SQL Server.
It depends on the reason you are holding historical data for facts.
If the reason is:
For reporting needs then you could hold it in the same table by
adding two date columns FromDate and ToDate which will remove the
need to join the active and historical data tables later on.
Just for reference then it makes sense to have it in a different
table as it may decrease the performance of your indexes on your
active table.
I'll highlight the Slowly Changing Dimension (SCD) type 2 approach that tracks data history by maintaining multiple versions of records and uses either the EndDate or a flag to identify the active record. This method allows tracking any number of historical records as each time a new record is inserted, the older ones are populated with an EndDate.
Step 1: For re-loaded facts UPDATE IsActive = 0 for the record to be history preserved and populate EndDate as the current date.
merge ActiveTable as T
using DataToBeLoaded as D
on T.ID = D.ID
and
T.isactive = 1 -- Current active entry
when matched then
update set T.IsActive = 0,
T.EndDate = GETDATE();
Step 2: Insert the latest data into the ActiveTable with IsActive = 1 and FromDate as the current date.
Disclaimer: The following approach using SCD 2 could make your data warehouse huge. However, I don't believe it would affect performance much for your scenario.

How can I get a list of modified records from a SQL Server database?

I am currently in the process of revamping my company's management system to run a little more lean in terms of network traffic. Right now I'm trying to figure out an effective way to query only the records that have been modified (by any user) since the last time I asked.
When the application starts it loads the job information and caches it locally like the following: SELECT * FROM jobs.
I am writing out the date/time a record was modified ala UPDATE jobs SET Widgets=#Widgets, LastModified=GetDate() WHERE JobID=#JobID.
When any user requests the list of jobs I query all records that have been modified since the last time I requested the list like the following: SELECT * FROM jobs WHERE LastModified>=#LastRequested and store the date/time of the request to pass in as #LastRequest when the user asks again. In theory this will return only the records that have been modified since the last request.
The issue I'm running into is when the user's date/time is not quite in sync with the server's date/time and also of server load when querying an un-indexed date/time column. Is there a more effective system then querying date/time information?
I don't know that I would rely on Date-Time since it is external to SQL Server.
If you have an Identity column, I would use that column in a table UserId, LastQueryDateTime, LastIdRetrieved
Every time you query the base table, insert new row for user (or update if exists) the max id into this table. Also, the query should read the row from this table to get the LastIdRetrieved and use that in the where clause.
All this could be eliminated if all of your code chooses to insert GetDate() from SQL Server instead of from the client machines, but that task is pretty labor intensive.
The easiest solution seems to settle on one time as leading.
One way would be to settle on the server time. After updating the row, store the value returned by select LastModified where JobID = #JobID on the client side. That way, the client can effectively query using only the server time as reference.
Use an update sequence number (USN) much like Active Directory and DNS use to keep track of the objects that have changed since their last replication. Pick a number to start with, and each time a record in the Jobs table is inserted or modified, write the most recent USN. Keep track of the USN when the last Select query was executed, and you then always know what records were altered since the last query. For example...
Set LastQryUSN = 100
Update Jobs Set USN=101, ...
Update Jobs Set USN=102, ...
Insert Jobs (USN, ...) Values (103, ...)
Select * From Jobs Where USN > LastQryUSN
Set LastQryUSN = 103
Update Jobs Set USN=104
Insert Jobs (USN, ...) Values (105, ...)
Select * From Jobs Where USN > LastQryUSN
Set LastQryUSN = 105
... and so on
When you get the Jobs, get the server time too:
DECLARE #now DATETIME = GETUTCDATE();
SELECT #now AS [ServerTime], * FROM Jobs WHERE Modified >= #LastModified;
First time you pass in a minimum date as #LastModified. On each subsequent call, you pass in the ServerTime returned last call. This way the client time is taken out of the equation.
The answer to the server load is, I hope, obvious: add an index on Modified column.
And one more adice: never use local time, not even on server. Always use UTC times, and store UTC time in Modified. As it is right now, your program is completely screwed two times a year, when the daylight savings time changes are set in or when they are removed.
Current SQL Server has change tracking you can use for exactly that. Just enable change tracking on the tables you want to track.

Resources