snowflake time-travel micro-partition can rewriteable when update? - snowflake-cloud-data-platform

As we all know, in snowflake one mini-partition maybe rewrite/replace when user update the row in the mini-partition. How snowflake handle the mini-partition in time-travel table space when user update?
I think snowflake need to mark the row which has updated in the old time-travel mini-partition, and it will create a new time-travel mini-partition for the rows which has updated, considering it implement time-travel and stream.
I want to know that Because: if time-travel micro-partition can rewriteable, maybe is not safety for my customer's requirement. If it is not rewriteable, maybe the query for time-travel is not efficient for my customer's requirement too.
I am searching for a long time on net. But no use. Please help or try to give some ideas how to achieve this.

I’m not sure why you think that how Snowflake works “under the covers” would affect safety or performance in any way that would be of interest to your customer - but, given that, Snowflake is insert-only at the micro partition level. Basically, if a user “updates” a record then the new values are inserted and the old version is marked as no longer current

Related

How can I store temporary, self-destroying data into a t-sql table?

Suppose I have a table which contains relevant information. However, the data is only relevant for, let's say, 30 minutes.
After that it's just database junk, so I need to get rid of it asap.
If I wanted, I could clean this table periodically, setting an expiration date time for each record individually and deleting expired records through a job or something. This is my #1 option, and it's what will be done unless someone convince me otherwise.
But I think this solution may be problematic. What if someone stops the job from running and no one notices? I'm looking for something like a built-in way to insert temporary data into a table. Or a table that has "volatile" data itself, in a way that it automagically removes data after x amount of time after its insertion.
And last but not least, if there's no built-in way to do that, could I be able to implement this functionality in SQL server 2008 (or 2012, we will be migrating soon) myself? If so, could someone give me directions as to what to look for to implement something like it?
(Sorry if the formatting ends up bad, first time using a smartphone to post on SO)
As another answer indicated, TRUNCATE TABLE is a fast way to remove the contents of a table, but it's aggressive; it will completely empty the table. Also, there are restrictions on its use; among others, it can't be used on tables which "are referenced by a FOREIGN KEY constraint".
Any more targeted removal of rows will require a DELETE statement with a WHERE clause. Having an index on relevant criteria fields (such as the insertion date) will improve performance of the deletion and might be a good idea (depending on its effect on INSERT and UPDATE statements).
You will need something to "trigger" the DELETE statement (or TRUNCATE statement). As you've suggested, a SQL Server Agent job is an obvious choice, but you are worried about the job being disabled or removed. Any solution will be vulnerable to someone removing your work, but there are more obscure ways to trigger an activity than a job. You could embed the deletion into the insertion process-- either in whatever stored procedure or application code you have, or as an actual table trigger. Both of those methods increase the time required for an INSERT and, because they are not handled out of band by the SQL Server Agent, will require your users to wait slightly longer. If you have the right indexes and the table is reasonably-sized, that might be an acceptable trade-off.
There isn't any other capability that I'm aware of for SQL Server to just start deleting data. There isn't automatic data retention policy enforcement.
See #Yuriy comment, that's relevant.
If you really need to implement it DB side....
Truncate table is fast way to get rid of records.
If all you need is ONE table and you just need to fill it with data, use it and dispose it asap you can consider truncating a (permanent) "CACHE_TEMP" table.
The scenario can become more complicated you are running concurrent threads/jobs and each is handling it's own data.
If that data is just existing for a single "job"/context you can consider using #TEMP tables. They are a bit volatile and maybe can be what you are looking for.
Also you maybe can use table variables, they are a bit more volatile than temporary tables but it depends on things you don't posted, so I cannot say what's really better.

storing anonymous information in a database

Our application will store some information from a user that we do not want to be traced back to any other records in the database. For example (albeit a stupid one) - a user must pay to tell us anonymously what their favorite color is. We want to store each color record as a new row in the database and keep track of the transaction information.
If we stored the colors and transactions in separate tables, the rows could be correlated to one another if the server were hacked, by using the sequential ID of the rows (because a color will always have a transaction) or by the creation time of the row. So to solve this we won't have a sequential ID column for the colors table, or an update/modification time for the colors table.
Now, the only way to associate a color with a transaction is to look at the files that are used to actually store the database information. While this may be difficult and tedious, I imagine it is still possible because the colors table information would probably be stored sequentially in the files.
How can I store database information in an un-ordered matter, so that this could never happen? I suppose a more general question is how do I store information anonymously and securely? (But that is way too broad)
Obviously, an answer is don't let your database get hacked, but not a good one.
You can pre-generate millions of rows and randomly populate them.
If you need to analyze data, you will need to understand it, and if you can attacker can also. No matter what clever solution you will come up with correletion will still be possible. Relational DB transaction logs, wil show what and when and where was inserted updated deleted. So you cannot provide 100% decoupling of data, if you want to use the same db. You could encrypt data with some HSM, which would render stolen data useless for attacker. Or you can store data on some other machine with random delay or some batch processing, (wait and insert 20 records instead of one)... but it can be tricky and it can fail.
Consider leveraging a non-realtional database, e.g. NoSQL.

How to auto remove an expired record from a database?

We are building a large stock and forex trading platform using a relational database. At any point during the day there will be thousands, if not millions, of records in our Orders table. Some orders, if not fulfilled immediately, expire and must be removed from this table, otherwise, the table grows very quickly. Each order has an expiration time. Once an order expires it must be deleted. Attempting to do this manually using a scheduled job that scans and deletes records is very slow and hinders the performance of the system. We need to force the record to basically delete itself.
Is there way to configure any RDBMS database to automatically remove a record based on a date/time field if the time occurs in the past?
Since you most likely will have to implement complex order handling, e.g. limit orders, stop-limit orders etc. you need a robust mechanism for monitoring and executing orders in real time. This process is not only limited to expired orders. This is a core mechanism in a trading platform and you will have to design a robust solution that fulfill your needs.
To answer your question: Delete expired orders as part of your normal order handling.
Why must the row be deleted?
I think you are putting the cart before the horse here. If a row is expired, it can be made "invisible" to other parts of the system in many ways, including views which only show orders meeting certain criteria. Having extra deleted rows around should not hamper performance if your database is appropriately indexed.
What level of auditing and tracking is necessary? Is no analysis ever done on expired orders?
Do fulfilled orders become some other kind of document/entity?
There are techniques in many databases which allow you to partition tables. Using the partition function, it is possible to regularly purge partitions (of like rows) much more easily.
You have not specified what DB you are using but lets assume you use MSSQL you could create a agent job that runs periodicly, but you are saying that that might not be a solution for you.
So what t about having an Insert Trigger that when new record is inserted you delete all the record that are expired? This will keep number of record all relatively small.

database archiving vs timeperiod based tables/fields

I am working on an employee objectives web application.
Lead/Manager sets objectives for team members after discussing with them. This is an yearly/half-yearly/quarterly depending on appraisal cycle the organization follows.
Now question is is better approach to add time period based fields or archive previous quarter's/year's data. When a user want to see previous objectives (not so frequent activity), the archive that belongs to that date may be restored in some temp table and shown to employee.
Points to start with
archiving: reduces db size, results in simpler db queries, adds an overhead when someone tried to see old data.
time-period based field/tables: one or more extra joins in queries, previous data is treated similar to current data so no overhead in retrieving old data.
PS: it is not space cost, my point is if we can achieve some optimization in terms of performance, as this is a web app and at peak times all the employees in an organization will be looking/updating it. so removing time period makes my queries a lot simpler.
Thanks
Assuming you're talking about data that changes over time, as opposed to logging-type data, then my preferred approach is to keep only the "latest" version of the data in your primary table(s), and to automatically copy the previous version of the data into a archive table. This archive table would mirror the primary, with the addition of versioned fields, such as timestamps. This archiving can be done with a trigger.
The main benefit that I see with this approach is that it doesn't compromise your database design. In particular, you don't have to worry about using composite keys that incorporate the version fields (in fact using time-based fields as keys may not even be permitted by your database).
If you need to go and look at the old data, you can run a select against the archive table and add version constraints to the query.
I would start off adding your time period fields and waiting until size becomes an issue. The kind of data you are describing does not sound like it is going to consume a lot of storage space.
Should it grow uncontrollably you can always look at the archive approach later - but the coding is going to take much longer than simply storing the relevant period with your data.
It seems to me that if you have the requirement that a user can look arbitrarily far back in the past, then you really must keep the data accessible.
This just won't be sustainable:
the archive that belongs to that date may be restored in some temp table and shown to employee.
My recommendation would be to periodically (read when absolutely necessary) move 'very old' data to another table for this purpose. Disk space is extremely cheap at this point, so keeping that data around is not nearly as expensive as implementing the system that can go back to an arbitrary time and restore an archive.

Am i right to sacrifice database design fundamentals in this case for the sake of speed?

I work in a company that uses single table Access database for its outbound cms, which I moved to a SQL server based system. There's a data list table (not normalized) and a calls table. This has about one update per second currently. All call outcomes along with date, time, and agent id are stored in the calls table. Agents have a predefined set of records that they will call each day (this comprises records from various data lists sorted to give an even spread throughout their set). Note a data list record is called once per day.
In order to ensure speed, live updates to this system are stored in a duplicate of the calls table fields in the data list table. These are then copied to the calls table in a batch process at the end of the day.
The reason for this is not obviously the speed at which a new record could be added to the calls table live, but when the user app is closed/opened and loads the user's data set again I need to check which records have not been called today - I would need to run a stored proc on the server that picked the last most call from the calls table and check if its calldate didn't match today's date. I believe a more expensive query than checking if a field in the data list table is NULL.
With this setup I only run the expensive query at the end of each day.
There are many pitfalls in this design, the main limitation is my inexperience. This is my first SQL server system. It's pretty critical, and I had to ensure it would work and I could easily dump data back to access db during a live failure. It has worked for 11 months now (and no live failure, less downtime than the old system).
I have created pretty well normalized databases for other things (with far fewer users), but I'm hesitant to implement this for the calling database.
Specifically, I would like to know your thoughts on whether the duplication of the calls fields in the data list table is necessary in my current setup or whether I should be able to use the calls table. Please try and answer this from my perspective. I know you DBAs may be cringing!
Redesigning an already working Database may become the major flaw here. Rather try to optimize what you have got running currently instead if starting from scratch. Think of indices, referential integrity, key assigning methods, proper usage of joins and the like.
In fact, have a look here:
Database development mistakes made by application developers
This outlines some very useful pointers.
The thing the "Normalisation Nazis" out there forget is that database design typically has two stages, the "Logical Design" and the "Physical Design". The logical design is for normalisation, and the physical design is for "now lets get the thing working", considering among other things the benefits of normalisation vs. the benefits of breaking nomalisation.
The classic example is an Order table and an Order-Detail table and the Order header table has "total price" where that value was derived from the Order-Detail and related tables. Having total price on Order in this case still make sense, but it breaks normalisation.
A normalised database is meant to give your database high maintainability and flexibility. But optimising for performance is one of the considerations that physical design considers. Look at reporting databases for example. And don't get me started about storing time-series data.
Ask yourself, has my maintainability or flexibility been significantly hindered by this decision? Does it cause me lots of code changes or data redesign when I change something? If not, and you're happy that your design is working as required, then I wouldn't worry.
I think whether to normalize it depends on how much you can do, and what may be needed.
For example, as Ian mentioned, it has been working for so long, is there some features they want to add that will impact the database schema?
If not, then just leave it as it is, but, if you need to add new features that change the database, you may want to see about normalizing it at that point.
You wouldn't need to call a stored procedure, you should be able to use a select statement to get the max(id) by the user id, or the max(id) in the table, depending on what you need to do.
Before deciding to normalize, or to make any major architectural changes, first look at why you are doing it. If you are doing it just because you think it needs to be done, then stop, and see if there is anything else you can do, perhaps add unit tests, so you can get some times for how long operations take. Numbers are good before making major changes, to see if there is any real benefit.
I would ask you to be a little more clear about the specific dilemma you face. If your system has worked so well for 11 months, what makes you think it needs any change?
I'm not sure you are aware of the fact that "Database design fundamentals" might relate to "logical database design fundamentals" as well as "physical database design fundamentals", nor whether you are aware of the difference.
Logical database design fundamentals should not (and actually cannot) be "sacrificed" for speed precisely because speed is only determined by physical design choices, the prime desision factor in which is precisely speed and performance.

Resources