How to use CHECKPOINT efficently in a application that uses FILESTREAM - sql-server

I use FILESTREAM to store BLOBS in my client server application.
In past i had from time to time to clear all BLOBS by executing a command like:
UPDATE BLOBTABLE set BLOBFIELD = NULL
This clears the blobs, i did this to make the DB backup smaller.
But to make the blobs "disappear from disk" i then need to run
CHECKPOINT
Note: this was done as DBA activity, not as part of the software.
Now I realize that in my application I never call CHECKPOINT.
May be i should every time i delete a blob, should i?
Just to experss my self better i make an example of my real case:
My app allows to store files (like pdf documents).
those pdf are saved as blobs in a filestream field.
As the user deletes them (from the ui) I run a DELETE commmand.
I do not call CEHCKPOINT after it, so the garbage collection does not
run.
By considering this i realize that i do not have the full thing under control.
So my question is simply: do i need to run CHECKPOINT every time i delete one of those files? is there any drawback in doing this?
Thanks!

A database performs checkpoints in different moments, one of those is when backup is performed.
Since the checkpoint triggers the garbage collection it is not needed (exceptions could be huge or complex scenarios) to call CHECKPOINT in an application because the risk is to reduce the performance.
Better decide to use CHECKPOINT as a maintenance activity if needed, but keeping in mind that a database backup (or even stopping the sql service) has a checkpoint as consequence.

Related

How does real-time collaborative applications saves the data?

I have previously done some very basic real-time applications using the help of sockets and have been reading more about it just for curiosity. One very interesting article I read was about Operational Transformation and I learned several new things. After reading it, I kept thinking of when or how this data is really saved to the database if I were to keep it. I have two assumptions/theories about what might be going on, but I'm not sure if they are correct and/or the best solutions to solve this issue. They are as follow:
(For this example lets assume it's a real-time collaborative whiteboard:)
For every edit that happens (ex. drawing a line), the socket will send a message to everyone collaborating. But at the same time, I will store the data in my database. The problem I see with this solution is the amount of time I would need to access the database. For every line a user draws, I would be required to access the database to store it.
Use polling. For this theory, I think of saving every data in temporal storage at the server, and then after 'x' amount of time, it will get all the data from the temporal storage and save them in the database. The issue for this theory is the possibility of a failure in the temporal storage (ex. electrical failure). If the temporal storage loses its data before it is saved in the database, then I would never be able to recover them again.
How do similar real-time collaborative applications like Google Doc, Slides, etc stores the data in their databases? Are they following one of the theories I mentioned or do they have a completely different way to store the data?
They prolly rely on logs of changes + latest document version + periodic snapshot (if they allow time traveling the document history).
It is similar to how most database's transaction system work. After validation the change is legit, the database writes the change in very fast data-structure on disk aka. the log that will only append the changed values. This log is replicated in-memory with a dedicated data-structure to speed up reads.
When a read comes in, the database will check the in-memory data-structure and merge the change with what is stored in the cache or on the disk.
Periodically, the changes that are present in memory and in the log, are merged with the data-structure on-disk.
So to summarize, in your case:
When an Operational Transformation comes to the server, two things happens:
It is stored in the database as is, to avoid any loss (equivalent of the log)
It updates an in-memory datastructure to be able to replay the change quickly in case an user request the latest version (equivalent of the memory datastructure)
When an user request the latest document, the server check the in-memory datastructre and replay the changes against the last stored consolidated document that might be lagging behind because of the following point
Periodically, the log is applied to the "last stored consolidated document" to reduce the amount of OT that must be replayed to produce the latest document.
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. etherpad.

Replicate a database using snapshots and transaction logs

For learning purposes, I want to write my own database, that is able to replicate itself. I have made some progress, but now I am facing a problem that I can not solve. Supposed I have a database (let's call this source) that I would like to replicate to another database (let's call this target).
The basic principle is easy: In the source you don't store actual tables, but instead a log of transactions. It's easy to send over the transaction log to the target, where the database then rebuilds itself. If you want to update the target, you simply request the part of the transaction log that has changed ever since. Basically this is what almost every database does.
While this works, it has one major drawback: If a table already exists for a long time, the transaction log is very long, and hence replicating the table requires lots of timeā€¦
To avoid this you can store the current state as well. This means you have an up-to-date snapshot that you can copy fast. Additionally, the target has to subscribe to the transaction log of the source. Once it contains additional entries, the target applies them to its copied table. This works well, too, and it's way better in terms of performance and transferred volume.
But now I am facing a problem: Supposed the snapshot is large, then it may happen that changes are made to it while it is being delivered. That means that the copied snapshot contains some old and some new data. Now, how do I get the target database in a consistent state? Even if I know from where to start the transaction log, I either have to apply a change that was already applied to some of the records, or I have to leave it out, but then a change is not applied at all to some other records.
Of course I could use the isolation level sequential, but then performance drops. Of course I could do what e.g. CouchDB does and remember the current table revision in every record, and keep a copy of every record for every revision. But then the required space grows enormously.
So, what shall I do?
Everything that I was able to find on the web always either relies on the idea of replaying the entire transaction log, or by using a process as in CouchDB which takes up huge amounts of space.
Any ideas?
Your snapshot needs to be consistent and you need to know at what time (in regards to the tx log) it is consistent. You then apply any transactions that have been committed since this point.
Obtaining a consistent snapshot can be done with exclusive locking, which may delay other transactions from committing, or using row versions (MVCC).
Good luck with your project.

On the app engine development server why does the "Applying all pending transactions and saving the datastore" step take several seconds?

It seems like something that should take place in a flash, at least locally. Is there some time emulation of the production server that causes it?
It depends how much data is saved in your local datastore and how fast your disk is. I've noticed if you have a lot of deleted data in your datastore, the file still ends up being big.
If you want to clear the database, it's better to delete the whole file than delete all the individual entities.

Postgres reverting back

I did some changes to a postgres table and I want to revert it back to a previous state. There is no back up of the database. Is there a way to do it? As in, does postgres take auto snap shots and store it somewhere or the original data is lost forever?
By default PostgreSQL won't store all of your old data -- that would surprise many people of course. However, it has a built-in Point-in-time Recovery mechanism which does pretty much exactly what you want. You have to keep an archive of "write-ahead log files" which represent the changes made to the database, and you can take periodic base backups. When you want to recover you can recovery to a specific time or even a specific transaction ID.
If you don't have a backup, and did an operation on data, I don't think there is much you can do : the database now has your new data, and the old version of it has be replaced/deleted.
The goal of a database engine is to persist the data you store in it -- not the data you removed from it.
For the next time : if you need to try something, use a transaction, and don't commit it until you are sure what you did is OK -- juste beware not to wait for too long before commitint or rollbacking it, because it might lock some stuff, preventing other people from doing queries too.

performance of web app with high number of inserts

What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.
By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.
Writing directly to the filesystem is going to be bandwidth-limited more-so than writing to a DB which then writes, expressly because the DB can - theoretically - write in more efficient sizes, contiguously, and at "convenient" times.
I've done this on a recent app. Inserts are generally pretty cheap (esp if you put them into an unindexed hopper table). I think that you have a couple of options.
As above, write data to a hopper table, if what ever application framework supports batched inserts, then use these, it will speed it up. Then every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example if you are storing if the HTTP type of the request (get/post/etc), this can only ever be a couple of types, and better to store as an Int, and get improved I/O + query performance. Your master tables can also be indexed as you would normally do.
If this isn't good enough, then you can stream the requests to files on the local file system, and then have an out of band (i.e seperate process from the webserver) suck these files up and BCP them into the database. This will be at the expense of more moving parts, and potentially, a greater delay between receiving requests and them finding their way into the database
Hope this helps, Ace
When working with an RDBMS the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistant storage (disk drives) to complete each transaction which is VERY expensive and time consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.
If you are doing inserts sending them in bulk within a single transaction will lead to more effecient write behavior on disk reducing the number of flush operations.
My recommendation is to queue the messages and periodically .. say every 15 seconds or so start a transaction ... send all queued inserts ... commit the transaction.
If your database supports sending multiple log entries in a single request/command doing so can have a noticable effect on performance when there is some network latency between the application and RDBMS by reducing the number of round trips.
Some systems support bulk operations (BCP) providing a very effecient method for bulk loading data which can be faster than the use of "insert" queries.
Sparing use of indexes and selection of sequential primary keys help.
Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.
Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than the your filesystem.
The only problem with using the filesystem to back writes is how you extend the log.
A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such example case where the person logged to a file in reverse order, being the most recent entries came out first, which required loading the entire file into memory, writing 1 line out to the new file, and then writing the original file contents after it.
This log eventually exceeded phps memory limit, and as such, bottlenecked the entire project.
If you do it properly however, the filesystem reads/writes will go directly into the system cache, and will only be flushed to disk every 10 or more seconds, ( depending on FS/OS settings ) which has a negligible performance hit compared to writing to arbitrary memory addresses.
Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause you to have deadlock conditions, and on files, you need to make sure that you're not going to have 2 concurrent writes cancel each other out.
The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have batch job that processes this data. The advantages of the database approach is that you can query/report on the data and all the data is logically in a relational database and may be easier to work with. Depending on how the data is logged to text file, you could open up more possibilities for corruption.
My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.
ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.
There's an easier way to answer this. Profile the performance of the two solutions.
Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.
That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...
Hello from left field, but no one asked (and you didn't specify) how important is it that you never, ever lose data?
If speed is the problem, leave it all in memory, and dump to the database in batches.
Do you log more than what would be available in the webserver logs? It can be quite a lot, see Apache 2.0 log information for example.
If not, then you can use the good old technique of buffering then batch writing. You can buffer at different places: in memory on your server, then batch insert them in db or batch write them in a file every X requests, and/or every X seconds.
If you use MySQL there are several different options/techniques to load efficiently a lot of data: LOAD DATA INFILE, INSERT DELAYED and so on.
Lots of details on insertion speeds.
Some other tips include:
splitting data into different tables per period of time (ie: per day or per week)
using multiple db connections
using multiple db servers
have good hardware (SSD/multicore)
Depending on the scale and resources available, it is possible to go different ways. So if you give more details, i can give more specific advices.
If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.

Resources