How many versions are created in a Delta table in a data lake on Azure?

I have a clarification question. From what I have read, Delta tables create version 0 - the original data - and version 1 - the updated data - of a row in a table.
So basically, do we have just two versions of the data in Delta tables, or is this configurable? What happens when we update the same row multiple times - does the Delta table simply keep the latest version of the updates?
Thanks in advance.

Delta will create a new version for each operation - insert/update/delete - and also for additional operations, like changing properties of the table, OPTIMIZE, VACUUM, etc., although some operations will not create new files (updating table properties), or will even delete unused files (VACUUM).
Please take into account that data files in Delta aren't mutable: when you update or delete data, Delta identifies which files contain the data to update or delete, and creates new files with the modified data. That's why it's important to run VACUUM periodically, so you can get rid of the old files (although it will limit your ability to time travel to the retention period - one week by default).
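For illustration, here is a minimal sketch in Spark SQL of how you can see and use those versions (the table name my_table is just a placeholder):

-- List the versions created by each operation on the table
DESCRIBE HISTORY my_table;

-- Time travel: read the table as it was at an earlier version
SELECT * FROM my_table VERSION AS OF 0;

-- Remove data files no longer referenced by versions inside the retention
-- period (7 days by default); this limits how far back you can time travel
VACUUM my_table;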

Related

Temporal Tables Manually Update Data

Using SQL Server 2019, can I push data (snapshot data) from the current (temporal) table to the history table only when I want to, rather than it happening automatically after every row commit? I understand that temporal tables are designed to record all data changes to a row - great for auditing. But what if I don't want to save all changes? What if I only want to 'baseline' data on a set of tables every week (or when the user wants to), and I don't care what changes are made during the week? I know you can disable and enable the temporal tables, but that is more of a high-level control, and the architecture is multi-tenanted, and different tenants will snapshot at different times.
Or perhaps temporal tables are the wrong tool for me? My use case is as follows: a user creates a mathematical model, altering many parameters; they do this many times over many days, persisting to the database with every change. When they get it right, they press 'Baseline' and everything is stored. They then continue with the next changes towards the next baseline. At any point they can compare the difference between any two baselines. I only retain the data at the date of 'Baseline'. This would require that I move the data to the temporal history table manually... or let it go automatically and purge everything in between two baselines, which seems a waste of DB resources.
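For reference, the disable/enable control mentioned above looks roughly like this in T-SQL (the table and history table names are hypothetical):

-- Stop automatic history capture for the table
ALTER TABLE dbo.ModelParameters SET (SYSTEM_VERSIONING = OFF);

-- ... make changes that should not be versioned ...

-- Re-enable versioning against the same history table
ALTER TABLE dbo.ModelParameters
    SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ModelParametersHistory));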

UPDATE millions of rows, or DELETE/INSERT?

Sorry for the longish description... but here we go...
We have a fact table somewhat flattened with a few properties that you might have put in a dimension in a more "classic" data warehouse.
I expect to have billions of rows in that table.
We want to enrich these properties with some cleansing/grouping that would not change often, but would still do from time to time.
We are thinking of keeping this initial fact table as the "master" that we never update or delete from, and making an "extended fact" table copy of it where we just add the new derived properties.
The process of generating these extended property values requires mapping to some sort of lookup table, from which we get several possibilities for each row, and then selecting the best one (one per initial row).
This is likely to be processor intensive.
QUESTION (at last!):
Imagine my lookup table is modified and I want to re-assess the extended properties for only a subset of my initial fact table.
I would end up with a few million rows I want to MODIFY in the target extended fact table.
What would be the best way to achieve this update? (updating a couple of million rows within a couple of billion rows table)
Should I write an UPDATE statement with a join?
Would it be better to DELETE these million rows and INSERT the new ones?
Any other way, like creating a new extended fact table with only the appropriate INSERTs?
Thanks
Eric
PS: I come from a SQL Server background where DELETE can be slow
PPS: I still love SQL Server too! :-)
Write performance for Snowflake vs. a traditional RDBMS behaves quite differently. All your tables persist in S3, and S3 does not let you rewrite only select bytes of an existing object; the entire file object must be uploaded and replaced. So, while in, say, SQL Server data and indexes are modified in place, creating new pages as necessary, an UPDATE/DELETE in Snowflake is a full sequential scan of the table file, creating an immutable copy of the original with the applicable rows filtered out (delete) or modified (update), which then replaces the file just scanned.
So, whether you update 1 row or 1M rows, at a minimum the entirety of the micro-partitions that the modified data lives in will have to be rewritten.
I would take a look at the MERGE command, which allows you to insert, update, and delete all in one command (effectively applying the differential from table A into table B). Among other things, it should keep your Time Travel costs down vs. constantly wiping and rewriting tables. Another consideration is that since Snowflake is column-oriented, a column update in theory should only require operations on the S3 files for that column, whereas an insert/delete would replace all S3 files for all columns, which would lower performance.
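As a rough sketch of that approach (the table and column names below are made up for illustration), a MERGE that applies only the re-assessed rows could look like:

-- Apply only the re-derived rows to the extended fact table
MERGE INTO extended_fact t
USING recomputed_properties s
    ON t.fact_id = s.fact_id
WHEN MATCHED THEN
    UPDATE SET t.derived_group   = s.derived_group,
               t.derived_quality = s.derived_quality
WHEN NOT MATCHED THEN
    INSERT (fact_id, derived_group, derived_quality)
    VALUES (s.fact_id, s.derived_group, s.derived_quality);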

Find out the recently selected rows from an Oracle table, and can I update a LAST_ACCESSED column whenever the table is accessed?

I have a database table which has more than 1 million records uniquely identified by a GUID column. I want to find out which of these records or rows were selected or retrieved in the last 5 years. The select query can happen from multiple places. Sometimes the row will be returned as a single row; sometimes it will be part of a set of rows. There is a select query that does the fetching over a JDBC connection from Java code. A SQL procedure also fetches data from the table.
My intention is to clean up a database table. I want to delete all rows which were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any inbuilt metadata which can give me this information?
My alternative solution was to add a column LAST_ACCESSED and update this column whenever I select a row from this table. But this is a costly operation for me based on the time taken for the whole process. At least 1,000 - 10,000 records will be selected from the table in a single operation. Is there a more efficient way to do this than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or a long waiting period for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. It does not track select operations per individual row, nor at the individual block level, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. you create a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
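A minimal sketch of that workflow, assuming the licensing above and a hypothetical date-partitioned table ORDERS:

-- Enable heat map tracking at the instance level
ALTER SYSTEM SET HEAT_MAP = ON;

-- Check when each segment (table or partition) was last read or written
SELECT * FROM dba_heat_map_segment WHERE object_name = 'ORDERS';

-- Purge a partition that is no longer accessed, instead of a conventional DELETE
ALTER TABLE orders DROP PARTITION orders_2019_q1;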
I couldn't use any inbuilt solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and similarly the trigger also had a performance hit.
Finally I resolved the issue by maintaining a separate table where entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting entries present in it.
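A rough sketch of that cleanup pattern (the table and column names are illustrative):

-- Rows seen in queries get recorded here by the application/procedures
CREATE TABLE recently_used_rows (
    row_guid  RAW(16) PRIMARY KEY,
    last_seen DATE NOT NULL
);

-- Purge only rows that never appear in the tracking table
DELETE FROM big_table b
WHERE  b.created_date < ADD_MONTHS(SYSDATE, -60)
AND    NOT EXISTS (
         SELECT 1
         FROM   recently_used_rows r
         WHERE  r.row_guid = b.row_guid
       );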

How to find out the rows affected in SQL Profiler or trace?

I'm using tracing to log all delete or update queries run through the system. The problem is, if I run a query like DELETE FROM [dbo].[Artist] WHERE ArtistId>280, I know how many rows were deleted but I'm unable to find out which rows were deleted (the data they had).
I'm thinking of doing this as a logging system so it would be useful to see which rows were affected and what data they had if at all possible. I don't really want to use triggers for this job but I will if I have to (and if it's feasible).
If you need the original data and are planning on storing all the deleted data in a separate table, why not just logically delete the original data rather than physically deleting it? i.e.
UPDATE dbo.Artist SET Artist_deleted = 1 WHERE ArtistId>280
Then you only need to add one column to your current table rather than creating new tables and scripts to support these. You could then partition the current table based on the deleted flag if you are worried about disk space/performance etc.
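A minimal sketch of that soft-delete setup (the flag default and the filtered read are assumptions added for illustration):

-- Add the flag once; existing rows default to "not deleted"
ALTER TABLE dbo.Artist ADD Artist_deleted BIT NOT NULL DEFAULT 0;

-- "Delete" becomes an update that keeps the original data
UPDATE dbo.Artist SET Artist_deleted = 1 WHERE ArtistId > 280;

-- Normal reads simply filter the flag out
SELECT * FROM dbo.Artist WHERE Artist_deleted = 0;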

Will archiving lots of old data lock my Database?

I need to move the data that is a month old from a logging table to a logging-archive table, and remove data older than a year from the latter.
There is a lot of data (600k inserts in 2 months).
I was considering simply calling (batching) a stored proc every day/week.
I first thought about doing 2 stored procs:
Deleting from the archive what is older than 365 days
Moving the data older than 30 days from logging to archive (I suppose there's a way to do that with 1 SQL query)
Removing from logging what is older than 30 days.
However, this solution seems quite inefficient and will probably lock the DB for a few minutes, which I do not want.
So, do I have any alternative and what are they?
None of this should lock the tables that you actually use. You are currently writing only to the logging table, and only to new records.
You are selecting only OLD records from the logging table, and writing to a table that you don't write to except for the archive process.
The steps you are taking sound fine. I would go one step further: instead of deleting based on date, just do an INNER JOIN to your archive table on your id field - then you only delete the specific records you have archived.
As a side note, 600k records is not very big at all. We have production DBs with tables over 2 billion rows, and I know some other folks here have DBs with millions of inserts a minute into transactional tables.
Edit:
I forgot to include this originally: another benefit of your planned method is that each step is isolated. If you need to stop for any reason, none of your steps is destructive or depends on the next step executing immediately. You could potentially archive a lot of records, then run the deletes the next day or overnight without creating any issues.
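A rough sketch of those steps, assuming SQL Server syntax and hypothetical table/column names (logging.log_id, logging.logged_at, logging.message), with the join-based delete suggested above:

-- 1) Purge archive rows older than a year
DELETE FROM logging_archive
WHERE  logged_at < DATEADD(DAY, -365, GETDATE());

-- 2) Copy month-old rows from logging into the archive
INSERT INTO logging_archive (log_id, logged_at, message)
SELECT log_id, logged_at, message
FROM   logging
WHERE  logged_at < DATEADD(DAY, -30, GETDATE());

-- 3) Delete from logging only what was actually archived
DELETE l
FROM   logging l
INNER JOIN logging_archive a ON a.log_id = l.log_id;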
What if you archived to a secondary database?
I.e.:
Primary database has the logging table.
Secondary database has the archive table.
That way, if you're worried about locking your archive table while you do a batch on it, it won't take your primary database down.
But in any case, I'm not sure you have to worry about locking - I guess it just depends on how you implement it.
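For example (database and table names are hypothetical, SQL Server style):

-- Archive straight into a table in a separate database
INSERT INTO ArchiveDB.dbo.logging_archive (log_id, logged_at, message)
SELECT log_id, logged_at, message
FROM   LoggingDB.dbo.logging
WHERE  logged_at < DATEADD(DAY, -30, GETDATE());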
