Is it possible to specify when a snapshot "update" is being run? I have an Airflow DAG that I want to re-run, and it correctly creates the snapshot with historical data.
The problem is that the DBT_VALID_TO and DBT_VALID_FROM columns are all set to today.
This depends on the strategy for your snapshot. If you use the timestamp strategy, dbt uses the updated_at timestamp as the valid_from date for the most recent records. If you use check_cols, then dbt has no way of knowing when the changes were made, so it uses the current timestamp.
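For reference, a minimal snapshot using the timestamp strategy looks something like this (the snapshot name, source table, unique key, and updated_at column are placeholders for illustration):

{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- With strategy='timestamp', dbt writes the row's updated_at value (not the
-- run time) into dbt_valid_from / dbt_valid_to when it detects a change.
select * from my_db.erp.orders  -- placeholder source table

{% endsnapshot %}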
Related
If an ETL process attempts to detect data changes on system-versioned tables in SQL Server by selecting rows whose rowversion column falls within a rowversion "delta window", e.g.:
where row_version >= @previous_etl_cycle_rowversion
and row_version < @current_etl_cycle_rowversion
... and the values for @previous_etl_cycle_rowversion and @current_etl_cycle_rowversion are selected from a logging table, whose newest rowversion gets appended at the start of each ETL cycle via:
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select @@DBTS
... is it possible that the rowversion of a record falling within a given "delta window" (bounded by the two @@DBTS values) could be missed/skipped due to rowversion's behavior vis-à-vis transactional consistency? i.e., is it possible that rowversion would be reflected on a basis of "eventual" consistency?
I'm thinking of a case where, say, 1000 records are updated within a single transaction and somehow @@DBTS is "ahead" of a record's committed rowversion, yet that specific version of the record is not yet readable...
(For the sake of scoping the question, please exclude any cases of deleted records or immediately consecutive updates on a given record within such a large batch transaction.)
If you make sure to avoid row-versioning isolation levels for the queries that read the change windows, you shouldn't miss many rows. With READ COMMITTED SNAPSHOT or SNAPSHOT ISOLATION, an updated but uncommitted row would not appear in your query.
But you can also miss rows that get updated after you query @@DBTS. That's usually not a big deal, as they'll be in the next window. But if you have a row that is constantly updated, you may miss it for a long time.
But why use rowversion? If these are temporal tables, you can query the history table directly. And Change Tracking is better and easier than using rowversion, as it tracks deletes and, optionally, column changes. The feature was literally built to replace the need to do this manually, which:
usually involved a lot of work and frequently involved using a combination of triggers, timestamp columns, new tables to store tracking information, and custom cleanup processes.
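For completeness, a minimal sketch of the Change Tracking route (database and table names are placeholders; the version you pass to CHANGETABLE is whatever you stored at the end of the previous ETL cycle):

-- Enable change tracking at the database and table level (placeholder names)
alter database MyDb
set change_tracking = on (change_retention = 7 days, auto_cleanup = on);

alter table dbo.SourceTable enable change_tracking
with (track_columns_updated = on);

-- Each ETL cycle: read everything changed since the last synced version
declare @last_sync_version bigint = 0;   -- stored from the previous run

select ct.SYS_CHANGE_OPERATION,          -- I / U / D
       ct.SYS_CHANGE_VERSION,
       ct.Id                             -- primary key column(s) of SourceTable
from changetable(changes dbo.SourceTable, @last_sync_version) as ct;

-- Store this value for the next cycle's @last_sync_version
select change_tracking_current_version();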
Under SNAPSHOT isolation, it turns out the proper function for bounding the delta windows is MIN_ACTIVE_ROWVERSION() rather than @@DBTS: it keeps the windows contiguous while not skipping rowversion values attached to long-running transactions.
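A minimal sketch of that pattern, reusing the logging table from the question (the source table and its row_version column are placeholders):

-- At the start of each ETL cycle, log the lowest rowversion that could still be
-- written by an open transaction (when nothing is in flight this is @@DBTS + 1),
-- instead of logging @@DBTS itself.
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select min_active_rowversion();

-- Delta window = [previous marker, current marker)
declare @previous_etl_cycle_rowversion binary(8),
        @current_etl_cycle_rowversion  binary(8);

select top (1) @current_etl_cycle_rowversion = cycle_start_row_version
from etl_cycle_logged_rowversion_marker
order by cycle_start_row_version desc;

select top (1) @previous_etl_cycle_rowversion = cycle_start_row_version
from etl_cycle_logged_rowversion_marker
where cycle_start_row_version < @current_etl_cycle_rowversion
order by cycle_start_row_version desc;

select *
from dbo.SourceTable                               -- placeholder table
where row_version >= @previous_etl_cycle_rowversion
  and row_version <  @current_etl_cycle_rowversion;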
...is changing over time. I know Time Travel does not work on INFORMATION_SCHEMA, so I wanted to know if there is an alternative approach.
Yes, the alternative approach is to schedule a task with an INSERT query based on your query against INFORMATION_SCHEMA, which inserts the data into some table along with a timestamp.
https://docs.snowflake.net/manuals/user-guide/tasks-intro.html
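A rough sketch of such a task, assuming placeholder database, warehouse, and target-table names and an example query against INFORMATION_SCHEMA.TABLES; adapt the SELECT to whatever metadata you want to track:

-- History table to accumulate the periodic snapshots of the metadata
create table if not exists metadata_history (
    captured_at   timestamp_ltz,
    table_name    varchar,
    row_count     number
);

-- Task that runs once a day and appends the current state with a timestamp
create or replace task capture_information_schema
  warehouse = my_wh                      -- placeholder warehouse
  schedule  = 'USING CRON 0 6 * * * UTC' -- once a day at 06:00 UTC
as
  insert into metadata_history
  select current_timestamp(), table_name, row_count
  from my_db.information_schema.tables
  where table_schema = 'PUBLIC';

-- Tasks are created suspended; resume to start the schedule
alter task capture_information_schema resume;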
I have an ERP System (Navision) where product data and stock numbers are frequently updated. Every time an attribute of a product is updated I want this change to be pushed to another SQL Server using Service Broker. I was considering using triggers for the detection, but I am unsure if that is the best way, and whether this is scalable. I expect updates to happen approx. once per second, but this number might double or triple.
Any feedback would be appreciated.
Add a Last Modified Date column to each record and update it via a trigger each time a record is updated. Then run a scheduled job at a specific time each day (off-business hours preferred) so that all records updated after the last scheduled run are processed.
So the following items need to be done (a sketch follows after the list):
1) Add a new column, LastModifiedDate, to the table with the DATETIME data type.
2) Create a trigger to update LastModifiedDate each time a record is updated.
3) Create a new table to store the schedule run date and time.
4) Create a scheduled job on the database that runs at a specified time every day. This job will pick all the records whose LastModifiedDate is greater than the date stored in the table created in step 3.
Since only one column is being updated by the trigger, it won't noticeably affect the performance of the table. And since the job runs only once a day, it also reduces database traffic.
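A hedged sketch of those pieces, assuming a table named dbo.Product with primary key ProductId (placeholder names, not the actual Navision schema):

-- 1) New column to track the last modification
alter table dbo.Product add LastModifiedDate datetime null;
go

-- 2) Trigger that stamps the column on every update
create trigger trg_Product_LastModified
on dbo.Product
after update
as
begin
    set nocount on;
    update p
    set LastModifiedDate = getdate()
    from dbo.Product p
    join inserted i on i.ProductId = p.ProductId;
end
go

-- 3) Table that remembers when the scheduled job last ran
create table dbo.EtlRunLog (LastRunDate datetime not null);

-- 4) Query the scheduled job would run once a day
declare @lastRun datetime = (select max(LastRunDate) from dbo.EtlRunLog);

select *
from dbo.Product
where LastModifiedDate > @lastRun;

insert into dbo.EtlRunLog (LastRunDate) values (getdate());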
I am building a relational DB structure where one of the tables (e.g. an events table) represents time ranges with a start date.
This table has the following fields that can be modified:
event.start_date: DateTime
event.duration: TimeDelta
The end_date can be computed as:
end_date = event.start_date + event.duration
I have code that retrieves elements from this table and makes heavy use of the end_date property.
Is there a way to store it somewhere in the table so that it is read-only, and so that if event.start_date or event.duration is modified, it is updated?
The idea is to have a consistent and non-redundant DB, but fast access to 'resultant' values (such as end_date).
The icing on the cake would be to also have an event.end_date field such that:
if event.duration is updated, then event.end_date is updated;
if event.end_date is updated, then event.duration is updated automatically.
Answering my own question:
It seems that PostgreSQL trigger procedures (at the DB-engine level) or Django signals (at the application level) do the job.
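As a hedged sketch of the trigger-procedure route (the event table below mirrors the question's columns; the rule that duration wins if both duration and end_date change in the same UPDATE is my own assumption):

create table event (
    id          serial primary key,
    start_date  timestamp not null,
    duration    interval  not null,
    end_date    timestamp            -- maintained by the trigger below
);

create or replace function event_sync_end_date() returns trigger as $$
begin
    -- If end_date was edited directly (and duration was not), derive duration;
    -- otherwise (new row, or start_date/duration changed) derive end_date.
    if tg_op = 'UPDATE'
       and new.end_date is distinct from old.end_date
       and new.duration = old.duration then
        new.duration := new.end_date - new.start_date;
    else
        new.end_date := new.start_date + new.duration;
    end if;
    return new;
end;
$$ language plpgsql;

create trigger event_sync_end_date_trg
before insert or update on event
for each row execute function event_sync_end_date();  -- EXECUTE PROCEDURE on PostgreSQL < 11

On PostgreSQL 12+, a column declared GENERATED ALWAYS AS (start_date + duration) STORED covers the read-only end_date case, but not the end_date-to-duration direction.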
Following is the requirement for my table, say "Orders":
1) On day 1, I send the full data using the bcp command as a Unicode text file.
2) From the next day onwards, I need to send only the delta data for the transactions that happened that day.
What is the best way to implement the delta? I would like to avoid changing the current table design, and not all tables have timestamp fields.
Look into SQL Server change tracking. It does what you want.
You could also snapshot the PK values and a hash of each row at midnight. The next night you snapshot again and create the diff using a full join.
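A rough sketch of that snapshot-and-diff approach, assuming an Orders table with primary key OrderId and a few placeholder data columns:

-- Table to hold the nightly snapshots (SHA2_256 hashes are 32 bytes)
create table dbo.OrdersSnapshot (
    SnapshotDate date          not null,
    OrderId      int           not null,
    RowHash      varbinary(32) not null
);

-- Nightly snapshot: primary key plus a hash of the remaining columns
-- (concat_ws needs SQL Server 2017+; on older versions concatenate manually)
insert into dbo.OrdersSnapshot (SnapshotDate, OrderId, RowHash)
select cast(getdate() as date),
       OrderId,
       hashbytes('SHA2_256', concat_ws('|', CustomerId, OrderDate, Amount))
from dbo.Orders;

-- Delta between yesterday's and today's snapshots via a full join:
-- rows only in today's set are inserts, only in yesterday's are deletes,
-- and rows present in both with different hashes are updates.
select coalesce(t.OrderId, y.OrderId) as OrderId,
       case
           when y.OrderId is null then 'INSERT'
           when t.OrderId is null then 'DELETE'
           else 'UPDATE'
       end as ChangeType
from      (select * from dbo.OrdersSnapshot
           where SnapshotDate = cast(getdate() as date)) t
full join (select * from dbo.OrdersSnapshot
           where SnapshotDate = dateadd(day, -1, cast(getdate() as date))) y
       on y.OrderId = t.OrderId
where y.OrderId is null
   or t.OrderId is null
   or y.RowHash <> t.RowHash;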
You've already excluded the best way. Now you are limited to manually performing a diff based on the previous day's snapshot.