Dealing with stale data in in-memory caches - database

Suppose the strategy for using an in-memory cache (such as redis/memcache) in front of a database is:
Reading: client will first try to read from the cache. On cache miss, read from the database and put the data in the cache.
Writing: Update the database first, followed by deleting the cache entry.
Suppose the following sequence happens:
client A reads from the cache and gets a miss.
client A reads from the database.
client B updates the same entry in the database.
client B deletes the (non-existent) cache entry.
client A puts the (stale) entry in the cache.
client C will then read the stale data in the cache.
Is there any strategy for avoiding such a scenario?
I know we could put an expiry time on each cache entry, but still there is a possibility of reading stale data, which could be undesirable in certain situations.

You could version cache data and keep each version immutable. Every time the data in the database changes you increment an integer version column. The cache key must include the version number. Clients can then first look into the database to look up the current version number and then talk to the cache.
Keeping caches consistent is very hard because they operate non-transactionally. There is no general way to prevent the kind of problem you are talking about. Ideally, you'd like to atomically make writes visible in the DB and in the cache, but that is only possible in special cases, such as the versioning scheme I propose.
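A minimal sketch of the versioned-key idea, assuming plain dictionaries stand in for the database and for redis/memcached. The names `db`, `cache`, `write` and `read` are hypothetical and only illustrate the scheme. Because the version is part of the cache key, a late writer can only ever fill in an old, immutable version, which readers of the current version never look up.

```python
# Stand-ins for the real stores (assumptions for illustration only).
db = {}     # key -> (version, value); the version plays the role of the integer column
cache = {}  # (key, version) -> value; entries are immutable once written


def write(key, value):
    """Update the row and bump its version (one transaction in a real DB)."""
    version = db[key][0] + 1 if key in db else 1
    db[key] = (version, value)


def read(key):
    """Look up the current version first, then talk to the cache."""
    current_version = db[key][0]
    cache_key = (key, current_version)
    if cache_key in cache:
        return cache[cache_key]
    # Miss: read version and value together, and cache under the version
    # that was actually read, so the entry can never be mislabeled.
    version, value = db[key]
    cache[(key, version)] = value
    return value
```

The cost is one cheap version lookup against the database on every read, and old versions linger in the cache until they expire or are evicted.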

Or, if needed, you can keep versions plus archived data (using the Redis mechanism that periodically archives data).


How do real-time collaborative applications save their data?

I have previously built some very basic real-time applications with the help of sockets and have been reading more about the topic out of curiosity. One very interesting article I read was about Operational Transformation, and I learned several new things. After reading it, I kept wondering when and how this data is actually saved to the database if I wanted to keep it. I have two assumptions/theories about what might be going on, but I'm not sure if they are correct and/or the best solutions to this issue. They are as follows:
(For this example, let's assume it's a real-time collaborative whiteboard:)
For every edit that happens (ex. drawing a line), the socket will send a message to everyone collaborating. But at the same time, I will store the data in my database. The problem I see with this solution is the amount of time I would need to access the database. For every line a user draws, I would be required to access the database to store it.
Use polling. For this theory, I think of saving all the data in temporary storage on the server; then, after 'x' amount of time, the server takes all the data from the temporary storage and saves it in the database. The issue with this theory is the possibility of a failure in the temporary storage (e.g. an electrical failure). If the temporary storage loses its data before it is saved in the database, I would never be able to recover it.
How do similar real-time collaborative applications like Google Docs, Slides, etc. store the data in their databases? Are they following one of the theories I mentioned, or do they have a completely different way to store the data?
They probably rely on a log of changes + the latest document version + periodic snapshots (if they allow time traveling through the document history).
It is similar to how most databases' transaction systems work. After validating that the change is legitimate, the database writes the change to a very fast on-disk data structure, the log, to which changed values are only ever appended. This log is mirrored in memory with a dedicated data structure to speed up reads.
When a read comes in, the database will check the in-memory data structure and merge the change with what is stored in the cache or on the disk.
Periodically, the changes that are present in memory and in the log are merged into the on-disk data structure.
So to summarize, in your case:
When an Operational Transformation comes to the server, two things happen:
It is stored in the database as is, to avoid any loss (the equivalent of the log).
It updates an in-memory data structure so that the change can be replayed quickly in case a user requests the latest version (the equivalent of the in-memory data structure).
When a user requests the latest document, the server checks the in-memory data structure and replays the changes against the last stored consolidated document, which might be lagging behind because of the following point.
Periodically, the log is applied to the "last stored consolidated document" to reduce the number of operations that must be replayed to produce the latest document.
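The steps in this summary could look roughly like the sketch below. It assumes a plain-text document and operations that have already been transformed by the server into `("insert", position, text)` or `("delete", position, length)` tuples. The `DocumentStore` class and its method names are hypothetical, and the in-memory list stands in for both the durable log and its in-memory mirror.

```python
def apply_op(doc, op):
    """Apply one already-transformed operation to a plain-text document."""
    kind, pos, arg = op
    if kind == "insert":
        return doc[:pos] + arg + doc[pos:]
    if kind == "delete":
        return doc[:pos] + doc[pos + arg:]
    raise ValueError(f"unknown operation: {kind}")


class DocumentStore:
    def __init__(self):
        self.snapshot = ""  # last stored consolidated document
        self.log = []       # operations received since the snapshot

    def append(self, op):
        # A real system would persist the operation before acknowledging it.
        self.log.append(op)

    def latest(self):
        # Replay the pending operations on top of the consolidated document.
        doc = self.snapshot
        for op in self.log:
            doc = apply_op(doc, op)
        return doc

    def compact(self):
        # Periodic job: fold the log into the snapshot so replays stay short.
        self.snapshot = self.latest()
        self.log = []
```

Calling `compact()` on a timer trades a little background work for faster reads, since `latest()` only replays what arrived since the last compaction.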
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. etherpad.

Will @Cacheable speed up an application using a shared database?

I am writing a server-side application, say app1, using Spring Boot. app1 accesses a database db1 on a dedicated DB server. To speed up DB access, I have marked some of my JPARepository methods as @Cacheable(<some_cache_key>), with an expiration time of, say, 1 hour.
The problem is: db1 is shared among several applications, each may update entries inside it.
Question: will I get a performance gain in app1 by using caches inside my application (@Cacheable)? (Note: the cache is inside my application, not in front of the database, i.e. I am not masking the entire DB with a cache manager like Redis.)
Here are my thoughts:
If another application app2 modifies a DB entry, how would the cache inside app1 know that entry is updated? Then app1's cache has gone stale, hasn't it? (until it refreshes after the fixed 1-hour refresh cycle)
If #1 is correct, does that mean the correct way to set up a cache is to mask the entire DB with some sort of cache manager? Is Redis meant for this kind of usage?
So, many questions there.
Will I get a performance gain in app1 by using caches inside my application (@Cacheable)?
You should always benchmark it, but in theory it will be faster to access the cache than the database.
If another application app2 modifies a DB entry, how would the cache inside app1 know that entry is updated? Then app1's cache has gone stale, hasn't it? (until it refreshes after the fixed 1-hour refresh cycle)
It won't be updated unless you are using a clustered cache. Ehcache using a Terracotta cluster is such a cache. But yes, if you stick with a basic application cache, it will get stale.
If #1 is correct, does that mean the correct way to set up a cache is to mask the entire DB with some sort of cache manager? Is Redis meant for this kind of usage?
Now it gets subtle. I'm not a Redis expert, but as far as I know, Redis is frequently used as a cache even though it is actually a NoSQL database. It also won't sit in front of your database (again, as far as I know); it will sit beside it. So you will first query Redis to see if your data is there, and only then your database. If your database is much slower to access and you have a really good cache hit rate, it will improve your performance. But please do a benchmark.
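To make the "query the cache first, then the database" flow concrete, here is a rough sketch of a cache-aside wrapper with a fixed expiration time and hit/miss counters. The `CacheAside` class, the `load_from_db` callback and the injectable `clock` are all assumptions made for illustration; this is not how Spring's @Cacheable or Redis is implemented.

```python
import time


class CacheAside:
    def __init__(self, load_from_db, ttl_seconds, clock=time.monotonic):
        self.load_from_db = load_from_db  # hypothetical function: key -> value
        self.ttl = ttl_seconds
        self.clock = clock                # injectable so tests can control time
        self.entries = {}                 # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]               # may be stale if another app wrote to the DB
        self.misses += 1
        value = self.load_from_db(key)    # fall back to the database
        self.entries[key] = (now + self.ttl, value)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate()` under a realistic load is one way to do the benchmark suggested here: a low hit rate means the cache adds a lookup without saving many database calls.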
Real caches (like Ehcache) are a bit more efficient. They add the concept of near caching. So your app will keep cache entries in memory but also on the cache server. If the entry is updated, near cache will be updated. So you get application cache performance but also coherence between servers.

Caching database table on a high-performance application

I have a high-performance application I'm considering making distributed (using RabbitMQ as the MQ). The application uses a database (currently SQL Server, but I can still switch to something else) and caches most of it in RAM to increase performance.
This causes a problem because when one of the applications writes to the database, the others' cached database becomes out-of-date.
I figured it is something that happens a lot in the High-Availability community, however I couldn't find anything useful. I guess I'm not searching for the right thing.
Is there an out-of-the-box solution?
PS: I'm sorry if this belongs on Server Fault. Since this is a development issue, I figured it belongs here.
The application reads and writes to the database. Since I'm changing the application to be distributed, more than one application now reads and writes to the database. The caching is done in each of the distributed applications, which are not aware of DB changes made by another application.
I mean: how can an application know that the DB was updated if it wasn't the one to update it?
So you have one database and many applications on various servers. Each application has its own cache and all the applications are reading and writing to the database.
Look at a distributed cache instead of caching locally. Check out memcached or AppFabric. I've had success using AppFabric to cache things in a Microsoft stack. You can simply add new nodes to AppFabric and it will automatically distribute the objects for high availability.
If you move to a shared cache, then you can put expiration times on objects in the cache. Try to resist the temptation to proactively evict items when things change. It becomes a very difficult problem.
I would recommend isolating your critical items and caching each of them only once. As an example, when working on an auction site, we cached very aggressively, but we cached an auction listing's price only once. That way, when someone else bid on it, we only had to do one eviction. We didn't have to go through the entire cache and ask "Where does the price appear? Change it!"
For 95% of your data, the reads will expire on their own and writes won't affect them immediately. 5% of your data needs to be evicted when a new write comes in. This is what I called your "critical items". Things that always need to be up to date.
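As a rough illustration of caching a critical item only once, the sketch below keeps the price under a single key and has every page look it up by that key at render time instead of embedding a copy. All names (`cache`, `price_key`, `render_listing`, `place_bid`) and the callbacks passed in are hypothetical, and a dictionary stands in for the shared cache.

```python
cache = {}  # stand-in for a shared cache such as memcached or AppFabric


def price_key(auction_id):
    return f"auction:{auction_id}:price"


def get_price(auction_id, load_price):
    """Return the cached price, loading it from the database on a miss."""
    key = price_key(auction_id)
    if key not in cache:
        cache[key] = load_price(auction_id)
    return cache[key]


def render_listing(auction_id, title, load_price):
    # The page references the price by key; it never stores its own copy.
    return f"{title}: ${get_price(auction_id, load_price):.2f}"


def place_bid(auction_id, amount, save_price):
    save_price(auction_id, amount)          # write to the database first
    cache.pop(price_key(auction_id), None)  # the single eviction
```

Less critical data, such as the listing title here, can simply be cached with an expiration time and left to age out.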
Hope that gives you ideas!

How to reduce requests to my DB?

I have a DB that stores many posts, like a blog. The problem is that there are many users, and these users create many posts at the same time. So when a user requests the home page, I request these posts from the DB. In short, I have to fetch again the posts I have already shown in order to show the new ones. How can I avoid this performance problem?
Before going down a caching path, make sure you have done the following:
Review the logic (are you undertaking unnecessary steps, can you populate some memory variables with slow changing data and so reduce DB calls, etc)
Ensure DB operations are as distinct as possible (minimum rows and columns returned)
Data is normalised to at least 3rd normal form and then selectively denormalised with the appropriate data handling routines for the denormalised data.
After normalisation, tune the DB instance (server performance, disk IO, memory, etc)
Tune the SQL statements
Then ...
Consider caching. Even though it is not possible to cache all data, if you can get a significant percentage into cache for a reasonable period of time (and those values vary according to site) you remove load from the DB server and so other queries can be served faster.
Do you do any type of pagination? If not, database pagination would be the best bet: start with the first 10 posts, and after that only return the full list if the user requests it from a link or some other input.
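One possible way to paginate at the database level is sketched below with Python's built-in sqlite3 module. The `posts` table, its `id` and `title` columns, and the `fetch_page` helper are assumptions for illustration. It pages by "posts older than the last id seen", which avoids re-reading rows that were already shown.

```python
import sqlite3


def fetch_page(conn, page_size=10, before_id=None):
    """Return up to page_size posts, newest first, older than before_id."""
    if before_id is None:
        cursor = conn.execute(
            "SELECT id, title FROM posts ORDER BY id DESC LIMIT ?",
            (page_size,),
        )
    else:
        cursor = conn.execute(
            "SELECT id, title FROM posts WHERE id < ? ORDER BY id DESC LIMIT ?",
            (before_id, page_size),
        )
    return cursor.fetchall()
```

The home page would call `fetch_page(conn)`, and a "more" link would pass the id of the last post shown as `before_id`.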
The standard solution is to use something like memcached to offload common reads to a caching layer. So you might decide to only refresh the home page once every 5 minutes rather than hitting the database repeatedly with the same exact query.
If there is data that is requested very often, you should cache it. Try using an in-memory cache such as memcached to store things that are likely to be re-requested within a short time. You will need free RAM for this: try using free memory on your frontend machine(s), since serving HTTP requests and applying templates is usually less RAM-intensive. BTW, you can cache not only raw DB records, but also ready-made pieces of pages with formatting and all.
If your load cannot be reasonably handled by one machine, try sharding your database. Put data of some of your users (posts, comments, etc) on one machine, data of other users to another machine, etc. This will make some joins impossible on database level, because data are on different machines, but joins that you do often will be parallelized.
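A minimal sketch of routing each user's data to one machine, assuming a fixed list of shards and a stable hash of the user id. The shard names and the `shard_for` function are hypothetical. Note that with simple modulo routing, changing the number of shards moves most users to a different machine, so a real deployment would need a migration plan or a scheme such as consistent hashing.

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2"]  # hypothetical database hosts


def shard_for(user_id, shards=SHARDS):
    """Map a user id to the shard holding that user's posts and comments."""
    # Use a stable hash; Python's built-in hash() for strings is randomized
    # per process, so it would route the same user differently after a restart.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]
```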
Also, take a look at document-oriented 'NoSQL' data stores like MongoDB. For example, it allows you to store a post and all its comments in a single record and fetch them in one operation, without any joins. But regular joins are next to impossible. Probably a mix of SQL and NoSQL storage is the most efficient (and the hardest to handle).

Performance of a web app with a high number of inserts

What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.
By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.
Writing directly to the filesystem is going to be more bandwidth-limited than writing to a DB which then writes to disk, precisely because the DB can, theoretically, write in more efficient sizes, contiguously, and at "convenient" times.
I've done this on a recent app. Inserts are generally pretty cheap (esp if you put them into an unindexed hopper table). I think that you have a couple of options.
As above, write data to a hopper table. If whatever application framework you use supports batched inserts, use them; it will speed things up. Then, every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example, if you are storing the HTTP method of the request (GET/POST/etc.), it can only ever be one of a few values, so it is better to store it as an Int and get improved I/O and query performance. Your master tables can also be indexed as you would normally do.
If this isn't good enough, you can stream the requests to files on the local file system, and then have an out-of-band process (i.e. a process separate from the webserver) suck these files up and BCP them into the database. This comes at the expense of more moving parts and, potentially, a greater delay between receiving requests and them finding their way into the database.
Hope this helps, Ace
When working with an RDBMS, the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistent storage (disk drives) to complete each transaction, which is VERY expensive and time consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.
If you are doing inserts, sending them in bulk within a single transaction will lead to more efficient write behavior on disk by reducing the number of flush operations.
My recommendation is to queue the messages and periodically, say every 15 seconds or so, start a transaction, send all queued inserts, and commit the transaction.
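The queue-then-commit recommendation could be sketched as follows, using Python's built-in sqlite3 module as a stand-in for the real RDBMS. The `events` table, the `BatchingLogger` class and the `max_batch` threshold are assumptions for illustration. The periodic trigger (for example a timer calling `flush()` every 15 seconds) is left out to keep the example short.

```python
import sqlite3


class BatchingLogger:
    def __init__(self, conn, max_batch=500):
        self.conn = conn
        self.max_batch = max_batch
        self.pending = []  # queued rows, held in memory until the next flush

    def log(self, user_id, action):
        self.pending.append((user_id, action))
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One transaction for the whole batch: the connection's context
        # manager commits on success and rolls back on an exception.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO events (user_id, action) VALUES (?, ?)",
                self.pending,
            )
        self.pending = []
```

Anything still in `pending` is lost if the process dies before the next flush, so the batch size and interval bound how much data is at risk.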
If your database supports sending multiple log entries in a single request/command, doing so can have a noticeable effect on performance when there is some network latency between the application and the RDBMS, by reducing the number of round trips.
Some systems support bulk operations (BCP), providing a very efficient method for bulk loading data which can be faster than the use of "insert" queries.
Sparing use of indexes and selection of sequential primary keys help.
Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.
Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than your filesystem.
The only problem with using the filesystem to back writes is how you extend the log.
A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such case where the person logged to a file in reverse order, so that the most recent entries came first, which required loading the entire file into memory, writing one line out to a new file, and then writing the original file contents after it.
This log eventually exceeded PHP's memory limit and, as a result, bottlenecked the entire project.
If you do it properly however, the filesystem reads/writes will go directly into the system cache, and will only be flushed to disk every 10 or more seconds, ( depending on FS/OS settings ) which has a negligible performance hit compared to writing to arbitrary memory addresses.
Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause you to have deadlock conditions, and on files, you need to make sure that you're not going to have 2 concurrent writes cancel each other out.
The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have a batch job that processes this data. The advantage of the database approach is that you can query/report on the data, and all the data is logically in a relational database and may be easier to work with. Depending on how the data is logged to a text file, you could open up more possibilities for corruption.
My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.
ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.
There's an easier way to answer this. Profile the performance of the two solutions.
Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.
That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...
Hello from left field, but no one asked (and you didn't specify) how important is it that you never, ever lose data?
If speed is the problem, leave it all in memory, and dump to the database in batches.
Do you log more than what would be available in the webserver logs? It can be quite a lot, see Apache 2.0 log information for example.
If not, then you can use the good old technique of buffering then batch writing. You can buffer at different places: in memory on your server, then batch insert them in db or batch write them in a file every X requests, and/or every X seconds.
If you use MySQL, there are several different options/techniques to efficiently load a lot of data: LOAD DATA INFILE, INSERT DELAYED and so on.
Lots of details on insertion speeds.
Some other tips include:
splitting data into different tables per period of time (ie: per day or per week)
using multiple db connections
using multiple db servers
using good hardware (SSD/multicore)
Depending on the scale and resources available, it is possible to go different ways. So if you give more details, I can give more specific advice.
If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.
