Snowflake: what to do if a malicious attack happens?

If someone malicious gets access to ACCOUNTADMIN, sets DATA_RETENTION_TIME_IN_DAYS=0 on all objects in a database, and then starts destroying those objects, does Snowflake provide a means to rebuild the database as it was before the attack? Time Travel would no longer be available.
From my understanding, Fail-safe kicks in only after Time Travel, so if Time Travel was set to 90 days, the best we can hope for is gaining back data as it was 90 days ago.
What can be done to prevent this kind of scenario?

This is what Fail-safe is for. As soon as the malicious attack sets everything to zero data retention, all of your data becomes available in Fail-safe, and you have 7 days before that data is removed. Note that Fail-safe data cannot be restored by the account itself; you have to contact Snowflake Support to recover it. You would lose your 90 days of Time Travel, though.
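One preventive control worth looking at is Snowflake's account-level MIN_DATA_RETENTION_TIME_IN_DAYS parameter, which puts a floor under object-level retention settings so a role below ACCOUNTADMIN cannot silently zero out Time Travel. It will not stop a compromised ACCOUNTADMIN itself, which is why limiting and MFA-protecting that role remains the first line of defense. A minimal sketch using snowflake-connector-python; the account and user values are placeholders:

    # Hedged sketch: raising the Time Travel retention floor with
    # snowflake-connector-python. Account/user values are placeholders.
    # MIN_DATA_RETENTION_TIME_IN_DAYS is an account-level minimum that
    # overrides lower object-level settings; it does NOT protect against
    # a compromised ACCOUNTADMIN itself.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_org-my_account",      # placeholder
        user="security_admin",            # placeholder
        authenticator="externalbrowser",  # e.g. SSO login, no stored password
    )
    cur = conn.cursor()

    # Enforce a retention floor so DATA_RETENTION_TIME_IN_DAYS=0 on objects is ignored.
    cur.execute("ALTER ACCOUNT SET MIN_DATA_RETENTION_TIME_IN_DAYS = 14")

    # Verify the effective retention parameters.
    cur.execute("SHOW PARAMETERS LIKE '%RETENTION%' IN ACCOUNT")
    for row in cur.fetchall():
        print(row)

    conn.close()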

Related

Alternative for Polling Record State

We currently have a payment tracking system which uses MS SQL Server Enterprise. When a client requests a service, they have to pay within 24 hours; otherwise we send them an SMS reminder. Our current implementation simply records the date and time of the purchase and constantly polls the records to find "expired" purchases.
This generates so much load on the database that we have to implement some form of replication in order to offload these operations to another server.
I was thinking: is there a way to combine CLR triggers with some kind of scheduler that would fire only once, that is, 24 hours after the purchase is created?
Please keep in mind that we have tens of thousands of transactions per hour.
I am not sure how you are thinking that SQLCLR will solve this problem. I don't think this needs to be handled in the DB at all.
Since the request time doesn't change, why not load all requests into a memory-based store that you can poll constantly? You would store the 24-hours-from-request time so that you only need to compare those times to now. If the customer pays within the 24-hour period, you remove the entry from the cache. Otherwise, the polling process will eventually find it, process it, and remove it from the cache.
OR, similarly, you can use a scheduler and load a future event to be the SMS message, based on the 24-hour-from-request time, upon each request. Similar to scheduling an action using "AT". Again, if someone pays prior to that time, just remove the scheduled task/event/reminder.
You would store just the 24-hour-after-time and the RequestID. If the time is reached, the service would refer back to the DB using that RequestID to get the current info.
You just need to make sure to de-list items from the cache / scheduler if payment is made prior to the 24-hour-after time.
And if the system crashes / restarts, you just load all entries that are a) unpaid, and b) have not yet reached their 24-hour-after time.
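A minimal in-process sketch of that scheduler idea; send_sms and the request IDs are hypothetical placeholders, and a production system would want a durable scheduler, but the shape is the same:

    # Sketch of the in-memory reminder scheduler described above. On restart,
    # re-load unpaid, not-yet-due requests from the database as described.
    import heapq
    from datetime import datetime, timedelta

    reminders = []     # min-heap of (due_time, request_id)
    cancelled = set()  # request IDs paid before their deadline

    def schedule_reminder(request_id, purchased_at):
        heapq.heappush(reminders, (purchased_at + timedelta(hours=24), request_id))

    def cancel_reminder(request_id):
        cancelled.add(request_id)  # lazy deletion; skipped when popped

    def send_sms(request_id):
        print(f"SMS reminder for request {request_id}")  # placeholder

    def poll_once(now=None):
        now = now or datetime.utcnow()
        while reminders and reminders[0][0] <= now:
            _, request_id = heapq.heappop(reminders)
            if request_id in cancelled:
                cancelled.discard(request_id)
                continue
            # Re-check current state in the DB by request_id before sending.
            send_sms(request_id)

    # Example: a purchase made 25 hours ago triggers a reminder on the next poll.
    schedule_reminder(42, datetime.utcnow() - timedelta(hours=25))
    poll_once()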

How often should I have my server sync to the database?

I am developing a web-app right now, where clients will frequently (every few seconds), send read/write requests on certain data. As of right now, I have my server immediately write to the database when a user changes something, and immediately read from the database when they want to view something. This is working fine for me, but I am guessing that it would be quite slow if there were thousands of users online.
Would it be more efficient to save write requests in an object on the server side, then do a bulk update at a certain time interval? This would help in situations where the same data is edited multiple times, since it would then require only one DB insert. It would also mean that I would read from the object for any data that hasn't yet been synced, which could increase efficiency by avoiding DB reads. At the same time, though, I feel like this would be a liability for two reasons:
1. A server crash would erase all data that hasn't yet been synced.
2. A bulk insert could create sudden spikes of lag due to mass database calls.
How should I approach this? Is my current approach ok, or should I queue inserts for a later time?
If a user makes a change to data and takes an action that they expect will save the data, you should do everything you can to ensure the data is actually saved. Example: let's say you delay the write for a while. The user is in a hurry, makes a change, then closes the browser. If you don't save right when they take an action that they expect saves the data, the data is lost.
Web stacks generally scale horizontally. Don't start to optimize this kind of thing unless there's evidence that you really have to.
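If you do later measure a real bottleneck, here is a hedged sketch contrasting the two strategies from the question; db_write stands in for the real persistence call. Anything still sitting in the buffer when the process dies is lost, which is exactly the crash risk the question raises:

    # Sketch only: write-through vs. write-behind. `db_write` is a placeholder.
    import threading

    def db_write(key, value):
        print(f"DB write: {key}={value}")  # placeholder for a real INSERT/UPDATE

    # Strategy 1: write-through (what the answer recommends for user saves).
    def save_immediately(key, value):
        db_write(key, value)

    # Strategy 2: write-behind buffer flushed on an interval.
    buffer = {}
    buffer_lock = threading.Lock()

    def save_buffered(key, value):
        with buffer_lock:
            buffer[key] = value  # later edits to the same key coalesce

    def flush():
        with buffer_lock:
            pending = dict(buffer)
            buffer.clear()
        for key, value in pending.items():
            db_write(key, value)  # one bulk pass instead of N round trips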

Caching database table on a high-performance application

I have a high-performance application I'm considering making distributed (using RabbitMQ as the MQ). The application uses a database (currently SQL Server, but I can still switch to something else) and caches most of it in RAM to increase performance.
This causes a problem: when one of the applications writes to the database, the other applications' cached copies become out of date.
I figured this is something that happens a lot in the high-availability community, but I couldn't find anything useful. I guess I'm not searching for the right thing.
Is there an out-of-the-box solution?
PS: I'm sorry if this belongs on Server Fault. Since this is a development issue, I figured it belongs here.
EDIT:
The application reads from and writes to the database. Since I'm changing the application to be distributed, more than one application now reads and writes to the database. The caching is done in each of the distributed applications, which are not aware of DB changes made by the other applications.
I mean: how can one application know that the DB was updated if it wasn't the one to update it?
So you have one database and many applications on various servers. Each application has its own cache and all the applications are reading and writing to the database.
Look at a distributed cache instead of caching locally. Check out memcached or AppFabric. I've had success using AppFabric to cache things in a Microsoft stack. You can simply add new nodes to AppFabric and it will automatically distribute the objects for high availability.
If you move to a shared cache, then you can put expiration times on objects in the cache. Try to resist the temptation to proactively evict items when things change. It becomes a very difficult problem.
I would recommend isolating your critical items and only cache them once. As an example, when working on an auction site, we cached very aggressively. We only cached an auction listing's price once. That way when someone else bid on it, we only had to do one eviction. We didn't have to go through the entire cache and ask "Where does the price appear? Change it!"
For 95% of your data, the reads will expire on their own and writes won't affect them immediately. 5% of your data needs to be evicted when a new write comes in. This is what I called your "critical items". Things that always need to be up to date.
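As a rough sketch of that policy, assuming a memcached node on localhost:11211 and the pymemcache client (key names, the TTL, and the load/write callbacks are illustrative):

    # Sketch of the two-tier policy: most keys simply expire (the 95%), and
    # only the "critical" price key is evicted explicitly when a bid arrives.
    # Assumes memcached on localhost:11211 and the pymemcache package.
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def get_listing(listing_id, load_from_db):
        key = f"listing:{listing_id}"
        cached = cache.get(key)
        if cached is not None:
            return cached
        value = load_from_db(listing_id)   # placeholder DB call
        cache.set(key, value, expire=300)  # let it age out on its own
        return value

    def place_bid(listing_id, amount, write_to_db):
        write_to_db(listing_id, amount)    # placeholder DB call
        # The critical item: price is cached under exactly one key, so a
        # single delete keeps every reader consistent.
        cache.delete(f"price:{listing_id}")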
Hope that gives you ideas!

What is the recommended way to build functionality similar to Stackoverflow's "Inbox"?

I have an ASP.NET MVC website where people manage a list of projects. Based on some algorithm, I can tell if a project is out of date. When a user logs in, I want to show the number of stale projects (similar to when I see a number of updates in the inbox).
The algorithm to calculate stale projects is kind of slow, so if every time a user logs in I have to:
Run a query for all projects where they are the owner
Run the IsStale() algorithm
Display the count where IsStale = true
my guess is that will be really slow. Also, on every project write, I would have to recalculate the above to see if it changed.
Another idea I had was to create a table and run a job every minute to calculate stale projects and store the latest count in this metrics table, then just query that when users log in. The issue there is that I still have to keep that table in sync, and since it only recalcs once every minute, when people update projects the value won't change until up to a minute later.
Any ideas for a fast, scalable way to support this inbox concept and alert users to the number of items to review?
The first step is always proper requirements analysis. Let's assume I'm a project manager. I log in to the system and it displays my only project as on time. A developer comes to my office and tells me there is a delay in his activity. I select the developer's activity and change its duration. The system still displays my project as on time, so I happily leave work.
How do you think I would feel if I receive a phone call at 3:00 AM from the client asking me for an explanation of why the project is no longer on time? Obviously, quite surprised, because the system didn't warn me in any way. Why did that happen? Because I had to wait 30 seconds (why not only 1 second?) for the next run of a scheduled job to update the project status.
That just can't be a solution. A warning must be sent immediately to the user, even if it takes 30 seconds to run the IsStale() process. Show the user a loading... image or anything else, but make sure the user has accurate data.
Now, regarding the implementation, there is no way around the previous issue: you will have to run that process whenever something that affects a due date changes. However, what you can do is avoid running that process unnecessarily. For example, you mentioned that you could run it whenever the user logs in. What if two or more users log in, view the same project, and don't change anything? It would be unnecessary to run the process twice.
What's more, if you make sure the process runs when the user updates the project, you won't need to run it at any other time. In conclusion, this scheme has the following advantages and disadvantages compared to the "polling" solution:
Advantages
No scheduled job
No unneeded process runs (this is arguable because you could set a dirty flag on the project and only run it if it is true)
No unneeded queries of the dirty value
The user will always be informed of the current and real state of the project (which is by far, the most important item to address in any solution provided)
Disadvantages
If a user updates a project and then updates it again within a matter of seconds, the process would run twice (in the polling scheme the process might not even run once in that period, depending on the frequency at which it has been scheduled)
The user who updates the project will have to wait for the process to finish
Moving on to how you would implement the notification system in a way similar to Stack Overflow's, that's quite a different question. I guess you have a many-to-many relationship between users and projects. The simplest solution would be adding a single attribute to the relationship between those entities (the middle table):
Cardinalities: A user has many projects. A project has many users
That way, when you run the process, you update each user's Has_pending_notifications field with the new result. For example, if a user updates a project and it is no longer on time, you should set Has_pending_notifications to true for all of that project's users so that they're aware of the situation. Similarly, set it to false when the project is back on time (I understand you just want to make sure the notifications are displayed when the project is no longer on time).
Taking StackOverflow's example, when a user reads a notification you should set the flag to false. Make sure you don't use timestamps to guess if a user has read a notification: logging in doesn't mean reading notifications.
Finally, if the notification itself is complex enough, you can move it away from the relationship between users and projects and go for something like this:
Cardinalities: A user has many projects. A project has many users. A user has many notifications. A notification has one user. A project has many notifications. A notification has one project.
I hope something I've said makes sense, or gives you some other, better idea :)
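As a small sketch of the middle-table idea, with hypothetical table and column names (sqlite3 is used only to keep the example self-contained):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE users    (user_id    INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_project (
        user_id    INTEGER REFERENCES users(user_id),
        project_id INTEGER REFERENCES projects(project_id),
        has_pending_notifications INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (user_id, project_id)
    );
    """)

    def project_went_stale(project_id):
        # Flag every member of the project so their next page load shows a badge.
        conn.execute("UPDATE user_project SET has_pending_notifications = 1 "
                     "WHERE project_id = ?", (project_id,))

    def notification_read(user_id, project_id):
        # Clear the flag when the user actually reads it (not merely on login).
        conn.execute("UPDATE user_project SET has_pending_notifications = 0 "
                     "WHERE user_id = ? AND project_id = ?", (user_id, project_id))

    def pending_count(user_id):
        return conn.execute("SELECT COUNT(*) FROM user_project "
                            "WHERE user_id = ? AND has_pending_notifications = 1",
                            (user_id,)).fetchone()[0]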
You can do as follows:
To each user record, add a datetime field saying the last time the slow computation was done. Call it LastDate.
To each project, add a boolean saying whether it has to be listed. Call it Selected.
When you run the slow procedure, you update the Selected fields.
Now, when the user logs in: if LastDate is close enough to now, you use the results of the last slow computation and just take all projects with Selected = true. Otherwise, you run the slow computation again.
The above procedure is optimal because it re-runs the slow computation only if actually needed, while running a procedure at fixed intervals of time risks wasting work, since the user may never use the result of a computation.
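A compact sketch of that check, with hypothetical field and function names:

    # Re-run the slow computation only when its cached result is too old.
    from datetime import datetime, timedelta

    FRESHNESS = timedelta(minutes=10)  # illustrative threshold

    def stale_projects_for(user, run_slow_computation, load_selected):
        if user.last_date is None or datetime.utcnow() - user.last_date > FRESHNESS:
            run_slow_computation(user)          # updates each project's Selected flag
            user.last_date = datetime.utcnow()  # remember when we last paid the cost
        return load_selected(user)              # cheap: just read Selected = true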
Make a field "stale".
Run a SQL statement that sets stale=1 on all records where stale=0 AND (the algorithm returns true).
Then run a SQL statement that selects all records where stale=1.
The reason this will run fast is that SQL engines, like PHP, generally won't evaluate the second half of the AND when the first half is false (although short-circuiting is not guaranteed by the SQL standard), making it a very fast pass through the whole list: each record is tested for staleness only IF NOT already stale. If it's already stale, the algorithm won't be executed for it, saving you time. If it's not, the algorithm will run to see if it has become stale, and then stale will be set to 1.
The second query then just returns all the stale records where stale=1.
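Written out against a hypothetical projects table, with is_stale() standing in for the expensive check (pushed into SQL however your engine allows, e.g. as a UDF):

    # Sketch only: table, column, and UDF names are hypothetical.
    def mark_and_fetch_stale(cursor):
        cursor.execute("UPDATE projects SET stale = 1 "
                       "WHERE stale = 0 AND is_stale(project_id) = 1")
        cursor.execute("SELECT * FROM projects WHERE stale = 1")
        return cursor.fetchall()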
You can do this:
In the database change the timestamp every time a project is accessed by the user.
When the user logs in, pull all their projects. Check the timestamp and compare it with today's date; if it's older than n days, add it to the stale list. I don't believe that comparing dates will result in any slow logic.
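That comparison really is cheap; a hedged sketch, assuming each project carries a last_accessed timestamp:

    from datetime import datetime, timedelta

    def stale_list(projects, n_days=30):  # n_days threshold is illustrative
        cutoff = datetime.utcnow() - timedelta(days=n_days)
        return [p for p in projects if p.last_accessed < cutoff]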
I think the fundamental questions need to be resolved before you think about databases and code. The primary one is: "Why is IsStale() slow?"
From comments elsewhere it is clear that the fact that this is slow is non-negotiable. Is this computation out of your hands? Are the results resistant to caching? What level of change triggers the re-computation?
Having written scheduling systems in the past, there are two types of changes: those that can happen within the slack and those that cause cascading schedule changes. Likewise, there are two types of rebuilds: total and local. Total rebuilds are obvious; local rebuilds try to minimize "damage" to other scheduled resources.
Here is the crux of the matter: if you have total rebuild on every update, you could be looking at 30 minute lags from the time of the change to the time that the schedule is stable. (I'm basing this on my experience with an ERP system's rebuild time with a very complex workload).
If the reality of your system is that such tasks take 30 minutes, having a design goal of instant gratification for your users is contrary to the ground truth of the matter. However, you may be able to detect schedule inconsistency far faster than the rebuild. In that case you could show the user "schedule has been overrun, recomputing new end times" or something similar... but I suspect that if you have a lot of schedule changes being entered by different users at the same time the system would degrade into one continuous display of that notice. However, you at least gain the advantage that you could batch changes happening over a period of time for the next rebuild.
It is for this reason that most of the scheduling problems I have seen don't actually do real time re-computations. In the context of the ERP situation there is a schedule master who is responsible for the scheduling of the shop floor and any changes get funneled through them. The "master" schedule was regenerated prior to each shift (shifts were 12 hours, so twice a day) and during the shift delays were worked in via "local" modifications that did not shuffle the master schedule until the next 12 hour block.
In a much simpler situation (software design) the schedule was updated once a day in response to the day's progress reporting. Bad news was delivered during the next morning's scrum, along with the updated schedule.
Making a long story short, I'm thinking that perhaps this is an "unask the question" moment, where the assumption needs to be challenged. If the re-computation is large enough that continuous updates are impractical, then aligning expectations with reality is in order. Either the algorithm needs work (optimizing for local changes), the hardware farm needs expansion or the timing of expectations of "truth" needs to be recalibrated.
A more refined answer would frankly require more details than "just assume an expensive process" because the proper points of attack on that process are impossible to know.

Database design, real time or batch processing

I am facing two options for how to update the database, and I do not know which one is better for my situation. There are three tables in the database, which are used to read/store some of the user's information, such as the URL history or some inputs.
With real-time processing, the database is accessible by users all the time, so changes made to the database can be seen immediately by that user.
Batch processing hides the update from the user: the database is updated by parsing the log files, and such a process runs every X hours, so users can only see their changes after X hours.
Apart from the advantages/disadvantages of synchronous vs. asynchronous updates that the user can see, what are the other benefits of choosing real-time or batch processing for updating the database?
Thanks
It all depends on the amount of traffic you expect. If you want to scale your application, asynchronous processing is always recommended. But that does not mean that your users have to wait for X hours. You can have the process run every 5 minutes or even every minute.
This way you will reduce concurrency issues and at the same time users will be able to see their updated history with a little bit of delay.
See best practices for scalability in the book Scalability Rules
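A minimal sketch of that suggestion: keep the batch shape but shrink the interval; process_new_log_entries is a placeholder for the existing log-parsing step:

    import time

    INTERVAL_SECONDS = 60  # "every minute" instead of every X hours

    def process_new_log_entries():
        print("parsing new log entries...")  # placeholder

    def run_worker():
        while True:
            started = time.monotonic()
            process_new_log_entries()
            # Sleep out the remainder of the interval so runs don't drift.
            time.sleep(max(0.0, INTERVAL_SECONDS - (time.monotonic() - started)))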
I would suggest you use EDA (Event-Driven Architecture), which uses middleware to "glue" all of this together.
http://searchsoa.techtarget.com/definition/event-driven-architecture
One piece of advice: keep away from batch processes.
Today, everything tends to be more and more real-time. Imagine if you received my answer in X hours... would you be satisfied? :)
If you give us more info, we could help you more.
I see that your input comes from a log file. Can this be changed?
You could also implement the observer pattern.
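A minimal observer-pattern sketch for that suggestion (names are hypothetical): the writer publishes an event when it records an entry, and interested components react immediately instead of waiting for a batch run:

    class HistoryUpdates:
        def __init__(self):
            self._observers = []

        def subscribe(self, callback):
            self._observers.append(callback)

        def publish(self, user_id, entry):
            for callback in self._observers:
                callback(user_id, entry)

    updates = HistoryUpdates()
    updates.subscribe(lambda user_id, entry: print(f"refresh view for {user_id}: {entry}"))
    updates.publish(42, "visited /checkout")  # fires immediately, no batch delay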
