I am writing a server-side application, say app1, using Spring Boot. This app1 accesses a database db1 on a dedicated DB server. To speed up DB access, I have marked some of my JpaRepository methods as @Cacheable(<some_cache_key>), with an expiration time of, say, 1 hour.
The problem is: db1 is shared among several applications, each may update entries inside it.
Question: will I gain performance in my app1 by using caches inside my application (@Cacheable)? (Note: the cache is inside my application, not in front of the database, i.e. I am not masking the entire DB with a cache manager like Redis.)
Here are my thoughts:
If another application app2 modifies a DB entry, how would the cache inside app1 know that the entry was updated? Doesn't my app1's cache then go stale (until the fixed 1-hour expiration kicks in)?
If #1 is correct, does it mean the correct way of setting up the cache is to mask the entire DB with some sort of cache manager? Is Redis meant for that kind of usage?
So, many questions there.
Will I gain performance in my app1 by using caches inside my application (@Cacheable)?
You should always benchmark it, but theoretically, accessing the cache will be faster than accessing the database.
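For reference, here is a minimal sketch of what such an in-application cache can look like, assuming Spring Boot with @EnableCaching and a Caffeine-backed cache manager; BookRepository, Book, and the "books" cache name are illustrative, not taken from the question:

```java
// A minimal sketch, assuming @EnableCaching is set on a configuration class
// and a TTL is configured on the cache manager, e.g. in application.properties:
//   spring.cache.caffeine.spec=expireAfterWrite=1h
import org.springframework.cache.annotation.Cacheable;
import org.springframework.data.jpa.repository.JpaRepository;

// Book is an assumed JPA entity; the repository name is hypothetical.
public interface BookRepository extends JpaRepository<Book, Long> {

    // The first call hits db1; later calls within the hour are served from
    // app1's in-process cache, regardless of what app2 writes to db1.
    @Cacheable("books")
    Book findByIsbn(String isbn);
}
```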
If another application app2 modifies a DB entry, how would the cache inside app1 know that the entry was updated? Doesn't my app1's cache then go stale (until the fixed 1-hour expiration kicks in)?
It won't be updated unless you are using a clustered cache; Ehcache backed by a Terracotta cluster is one example. But yes, if you stick with a basic in-application cache, it will go stale.
If #1 is correct, does it mean the correct way of setting up the cache is to mask the entire DB with some sort of cache manager? Is Redis meant for that kind of usage?
Now it gets subtle. I'm not a Redis expert, but as far as I know, Redis is frequently used as a cache even though it is actually a NoSQL database. And it won't sit in front of the database (again, as far as I know); it will sit alongside it. So you first query Redis to see if your data is there, and then fall back to your database. If your database is much slower to access and your cache hit rate is really good, it will improve your performance. But please do a benchmark.
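A rough sketch of that look-aside read path, using the Jedis client against an assumed local Redis; the key scheme, TTL, and Database interface are all made up for illustration:

```java
import redis.clients.jedis.Jedis;

public class LookAsideReader {

    // Hypothetical stand-in for whatever actually queries the database.
    interface Database {
        String loadUser(String id);
    }

    private final Jedis jedis = new Jedis("localhost", 6379); // assumed address

    public String findUser(String id, Database db) {
        String key = "user:" + id;
        String cached = jedis.get(key);
        if (cached != null) {
            return cached;                  // cache hit: no DB round trip
        }
        String fromDb = db.loadUser(id);    // cache miss: query the database
        jedis.setex(key, 3600, fromDb);     // store it with a 1-hour TTL
        return fromDb;
    }
}
```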
Real caches (like Ehcache) are a bit more efficient. They add the concept of near caching: your app keeps cache entries in local memory but also on the cache server, and when an entry is updated, the near cache is updated as well. So you get in-application cache performance plus coherence between servers.
Related
In a multi-node JanusGraph cluster, a data modification done from one instance does not sync with the others until it reaches the given expiry time (cache.db-cache-time).
As per the documentation[1], enabling the database-level cache is not recommended in a distributed setup, because cached data is not shared among instances.
Any suggestions for a solution/workaround where I can see the data changes from other JG instances immediately and avoid stale reads?
[1] https://docs.janusgraph.org/operations/cache/#cache-expiration-time
If you want immediate access to the most up-to-date version of the data then, by definition, you cannot cache any of it.
The contents of the cache will be accessed as long as they have not expired or been evicted. Unfortunately there is no way around it if consistency is your top priority. Cheers!
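If freshness matters more than read speed, one option is simply to keep the database-level cache off. A sketch of opening a graph that way, with placeholder storage settings:

```java
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class FreshReads {
    public static void main(String[] args) {
        // With cache.db-cache disabled, every read goes to the storage
        // backend and immediately sees writes from other JG instances.
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")        // placeholder backend
                .set("storage.hostname", "127.0.0.1") // placeholder host
                .set("cache.db-cache", false)
                .open();
        graph.close();
    }
}
```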
I am currently using the cache for my current project, but I'm not sure it is the right thing to do.
I need to retrieve a lot of data from a web API (nodes that can be pictures, nodes, folders, galleries...). Those nodes will change very often, so I need fast access (loading up to 300-400 elements at once). Currently I store them in the cache (with the md5 of node_id as the key, so they are easy to retrieve and update).
It has been working great so far, but if I clear the cache it takes up to 1 minute to rebuild it all.
Should I use a database to store those nodes? Would it be quicker, slower, or about the same?
Your question is very broad and thus hard to answer. Saving 300-400 elements under a cache key sounds problematic to me. You can run into trouble where serializing the data when storing it and deserializing it when retrieving it becomes a bottleneck, and whenever your cache service is down your app will be practically unusable.
If you already run into problems when clearing/updating the cache, you might want to look for an alternative. This might be a database or Elasticsearch. Advanced cache features like tagged caching could help by preventing you from having to clear the whole cache when part of the information updates. You might also want to use something like the chain provider to store things in multiple caches, which prevents the aforementioned problem of an unreachable cache "breaking" your app. You could also look into a pattern common with CQRS called a read model.
There are a lot of variables that come into play. If you want to know which one will yield the best results, i.e. which one is quicker, you should do frequent performance tests with realistic data using Symfony's debug toolbar and profiler, or a third-party service like Blackfire.io or Tideways. You might also want to do capacity tests with a tool like JMeter to ensure those results still hold when there are multiple simultaneous users.
Suppose the strategy for using an in-memory cache (such as redis/memcache) in front of a database is:
Reading: client will first try to read from the cache. On cache miss, read from the database and put the data in the cache.
Writing: Update the database first, followed by deleting the cache entry.
Suppose the following sequence happens:
1. Client A reads from the cache and gets a miss.
2. Client A reads from the database.
3. Client B updates the same entry in the database.
4. Client B deletes the (non-existent) cache entry.
5. Client A puts the (stale) entry in the cache.
6. Client C then reads the stale data from the cache.
Is there any strategy for avoiding such a scenario?
I know we could put an expiry time on each cache entry, but there is still a window for reading stale data, which could be undesirable in certain situations.
You could version the cache data and keep each version immutable. Every time the data in the database changes, you increment an integer version column. The cache key must include the version number. Clients then first look up the current version number in the database and only then talk to the cache.
Keeping caches consistent is very hard because they operate non-transactionally. There is no general way to prevent the kind of problem you are talking about. Ideally, you'd like to atomically make writes visible in both the DB and the cache, but that is only possible in special cases, like the scheme I propose.
Or, if needed, you can combine versioning with Redis's persistence mechanism (http://redis.io/topics/persistence) by periodically archiving the data.
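A sketch of that versioning scheme using the Jedis client; the table layout behind currentVersion/load, the key format, and the TTL are assumptions for illustration:

```java
import redis.clients.jedis.Jedis;

public class VersionedCache {

    // Hypothetical DB access: version is an integer column incremented
    // by every writer in the same transaction as the data change.
    interface Database {
        long currentVersion(String id);
        String load(String id);
    }

    private final Jedis jedis = new Jedis("localhost", 6379);

    public String read(String id, Database db) {
        long version = db.currentVersion(id);         // cheap indexed lookup
        String key = "entry:" + id + ":v" + version;  // version baked into key
        String cached = jedis.get(key);
        if (cached != null) {
            return cached;
        }
        String value = db.load(id);
        // Even if this put races with a writer, the writer has already bumped
        // the version, so readers look under a newer key; the stale entry is
        // never read as current and simply expires.
        jedis.setex(key, 3600, value);
        return value;
    }
}
```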
I have a high-performance application that I'm considering making distributed (using RabbitMQ as the MQ). The application uses a database (currently SQL Server, but I can still switch to something else) and caches most of it in RAM to increase performance.
This causes a problem: when one of the applications writes to the database, the other applications' cached copies become out of date.
I figured this is something that happens a lot in the high-availability community, but I couldn't find anything useful. I guess I'm not searching for the right thing.
Is there an out-of-the-box solution?
PS: I'm sorry if this belongs on Server Fault; since this is a development issue, I figured it belongs here.
EDIT:
The application reads and writes to the database. Since I'm making the application distributed, more than one application now reads and writes to the database. The caching is done in each of the distributed applications, which are not aware of DB changes made by another application.
I mean: how can one know that the DB was updated if one wasn't the one to update it?
So you have one database and many applications on various servers. Each application has its own cache and all the applications are reading and writing to the database.
Look at a distributed cache instead of caching locally. Check out memcached or AppFabric. I've had success using AppFabric to cache things in a Microsoft stack. You can simply add new nodes to AppFabric and it will automatically distribute the objects for high availability.
If you move to a shared cache, then you can put expiration times on objects in the cache. Try to resist the temptation to proactively evict items when things change. It becomes a very difficult problem.
I would recommend isolating your critical items and only cache them once. As an example, when working on an auction site, we cached very aggressively. We only cached an auction listing's price once. That way when someone else bid on it, we only had to do one eviction. We didn't have to go through the entire cache and ask "Where does the price appear? Change it!"
For 95% of your data, the reads will expire on their own and writes won't affect them immediately. 5% of your data needs to be evicted when a new write comes in. This is what I called your "critical items". Things that always need to be up to date.
Hope that gives you ideas!
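As a concrete illustration of that split, here is a sketch using the spymemcached client; the key names, TTLs, and the auction example are mine, not a real implementation from that site:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class AuctionCache {

    private final MemcachedClient cache;

    public AuctionCache() throws Exception {
        cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
    }

    // The 95%: entries that just age out on their own TTL.
    public void cacheListing(String auctionId, Object listing) {
        cache.set("listing:" + auctionId, 600, listing);
    }

    // The critical item: the price is cached in exactly one place.
    public void cachePrice(String auctionId, Object price) {
        cache.set("price:" + auctionId, 600, price);
    }

    // One targeted eviction per bid; no hunting through the whole cache
    // asking "where does the price appear?".
    public void onBidPlaced(String auctionId) {
        cache.delete("price:" + auctionId);
    }
}
```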
I am curious as to how caching works in Google App Engine or any cloud-based application. Since there is no guarantee that requests are sent to the same server, does that mean that if data is cached on the 1st request by Server A, then a 2nd request, processed by Server B, will not be able to access that cache?
If that's the case (the cache is only local to a server), isn't it unlikely (depending on the number of users) that a request will hit the cache? E.g. Google probably has thousands of servers.
With App Engine you cache using memcache. This means that a cache server holds the data in memory (rather than each application server). The application servers (for a given application) all talk to the same cache server (conceptually; there could be sharding or replication going on under the hood).
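For instance, with the App Engine Java memcache API, the shared cache is reached the same way from any instance; the key scheme, TTL, and loadProfile helper below are arbitrary placeholders:

```java
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class SharedCacheExample {

    public String lookup(String userId) {
        // Every instance of the app gets a handle to the same shared cache,
        // so a value cached via Server A is visible to Server B.
        MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

        String key = "profile:" + userId;            // illustrative key scheme
        String cached = (String) cache.get(key);
        if (cached == null) {
            cached = loadProfile(userId);            // placeholder loader
            cache.put(key, cached, Expiration.byDeltaSeconds(600));
        }
        return cached;
    }

    private String loadProfile(String userId) {
        return "profile-data-for-" + userId;         // stand-in for a real query
    }
}
```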
In-memory caching on the application server itself would potentially not be very effective, because there is more than one such server (although for a given application only a few instances are active; it is not spread out over all of Google's servers), and also because Google is free to shut them down at any time (a real problem for Java apps that take some time to boot up again, which is why you can now pay to keep idle instances alive).
In addition to these performance/effectiveness issues, in-memory caching on the application server can lead to consistency problems (every refresh showing different data when the caches are not in sync).
It depends on the type of caching you want to achieve.
Caching on the application server itself can be interesting if you have a complex in-memory object structure that takes time to rebuild from data loaded from the database. In that specific case, you may want to cache the result of the computation: if the structure is large, loading it from a local cache will be faster than from a shared memcache.
If having a consistent value between in-memory data and the database is paramount, you can compare a checksum/timestamp against a value stored in the datastore every time you use the cached value. Keeping the checksum/timestamp on a small entity, or in the global cache, will speed up that check.
One big issue when using the global memcache is ensuring proper synchronization when "refilling" it, i.e. when a value is not yet present or has been flushed. If multiple servers perform the check at the exact same time, you may end up with several distinct servers doing the refill at once. If the operation is idempotent, this is not a problem; if not, it is a potential and very hard to trace bug.
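One way to serialize the refill is to use memcached's add(), which only succeeds for the first caller, as a short-lived lock. A sketch with the spymemcached client; the lock-key format, TTLs, and loadFromDatastore are placeholders:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class SingleRefill {

    private final MemcachedClient cache;

    public SingleRefill() throws Exception {
        cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
    }

    public Object get(String key) throws Exception {
        Object value = cache.get(key);
        if (value != null) {
            return value;
        }
        // add() fails if the key already exists, so exactly one server wins
        // the right to rebuild the entry; the others just read the DB once.
        boolean refiller = cache.add("lock:" + key, 30, "1").get();
        Object fresh = loadFromDatastore(key);   // placeholder loader
        if (refiller) {
            cache.set(key, 600, fresh);
            cache.delete("lock:" + key);
        }
        return fresh;
    }

    private Object loadFromDatastore(String key) {
        return new Object();                     // stand-in for the real query
    }
}
```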