Database cache (redis, memcache) usage: query vs. items

I'm wondering what the preferred way is to cache elements from a database with an in-memory cache like redis or memcache. The context is that I have a table of items which are accessed by an API very frequently (millions of times per second) as real-time stats. In general, the API just looks for items in a given range of time with a certain secondary id. The same data is likely to be hit many times. It seems like you could do it in a few ways:
1. Cache the entire query.
   The entire data string resulting from the real query to the database gets stored in the cache, with a minimal form of the query as the key. The advantage is that for frequently used queries, a single access returns the entire set of results. But any slightly different query needs to be run and cached again.
2. Cache the items in the query.
   Each item returned from the real query gets stored individually in the cache, with a searchable id as the key. The advantage is that for slightly different queries, you don't need to run a full query against the DB again, only fetch the elements that are not currently cached.
3. Mirror the entire database.
   Each item is put into the cache as soon as it gets created or updated in the DB. The cache is always assumed to be up to date, so all queries can run on the cache directly.
It seems like these approaches might be better or worse in certain circumstances, but are there pitfalls that make some of them completely undesirable, or are some clearly better in this use case?
Thanks for any advice!

#3, i.e. mirroring the database, is not a good option. Keep in mind that most in-memory systems like Redis don't have a query language; retrieval is based on keys. So it is not a good idea to replicate the data, especially if it is relational.
You should use a combination of #1 and #2. Redis is key-based, so you will have to design the keys according to your query criteria. I would suggest building a library that works on the concept of an etag. In Redis, save the etag together with the query response. The library passes the etag to the backend logic, which re-runs the query only if the etag doesn't match. If the etag matches, the backend does not re-run the query, and the library takes the cached response from Redis and sends it back to the client.
See https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag for the concept.
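A minimal sketch of the etag idea described above, in Python. A plain dict stands in for Redis, and the etag is assumed to be derived from some data-version marker that is cheap to obtain (for example a version counter or the latest update timestamp); `EtagCache`, `etag_for` and the key format are hypothetical names for illustration, not part of any library.

```python
import hashlib
import json


class EtagCache:
    """Stores (etag, response) per query key; a dict stands in for Redis."""

    def __init__(self):
        self._store = {}

    def fetch(self, key, current_etag, run_query):
        """Return (response, served_from_cache)."""
        cached = self._store.get(key)
        if cached is not None and cached[0] == current_etag:
            return cached[1], True           # etag matches: reuse cached response
        response = run_query()               # etag missing or stale: re-run query
        self._store[key] = (current_etag, response)
        return response, False


def etag_for(version):
    """Hypothetical etag: a hash of a cheap data-version marker."""
    return hashlib.sha256(json.dumps(version).encode()).hexdigest()


calls = []


def run_query():
    calls.append(1)                          # count how often the "DB" is hit
    return ["item-1", "item-2"]


cache = EtagCache()
first = cache.fetch("items:42:2020-01", etag_for(1), run_query)
second = cache.fetch("items:42:2020-01", etag_for(1), run_query)
third = cache.fetch("items:42:2020-01", etag_for(2), run_query)
```

The scheme only pays off if computing the current etag is much cheaper than running the query itself.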

Related

Redis database cache in Laravel 5.6

I want to cache my query results. I read about Cache::remember in Laravel, but it takes a time parameter and I don't want to set an expiry time for my Redis cache.
I need something that caches my queries and refreshes the cached results whenever the underlying data is updated.
What's your recommendation?
Storing the full collection of Eloquent models in Redis can be slower than expected.
In my case, I had to build a nested selection with a lot of where, count, join, group by, order by, etc.
It consumed a lot of resources on every request, so I tried to cache the result. That was not the best solution, because it was (4 times) slower than I wanted (200+ ms response).
The solution is to run only SELECT id FROM ... for the "huge" query and store the IDs in Redis. After that, every request runs SELECT * FROM <table> WHERE id IN (...); (re-order the data in the SQL query if necessary).
This way, the required data can be fetched quickly from both Redis and SQL. The average response time is less than 50 ms.
I hope this helps.
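The ID-caching approach above can be sketched as follows. This is an illustration in Python with sqlite3 rather than Laravel; the `posts` table, the `score` filter and the `id_cache` dict (standing in for Redis) are assumptions made up for the example. The re-ordering is done in application code here instead of in the SQL query.

```python
import sqlite3

id_cache = {}  # stands in for Redis: cache key -> ordered list of row IDs


def expensive_id_query(conn):
    # Placeholder for the "huge" query; only the IDs are selected.
    return [row[0] for row in conn.execute(
        "SELECT id FROM posts WHERE score > 10 ORDER BY score DESC")]


def fetch_posts(conn, cache_key):
    ids = id_cache.get(cache_key)
    if ids is None:                          # cache miss: run the heavy query once
        ids = expensive_id_query(conn)
        id_cache[cache_key] = ids
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, title, score FROM posts WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Restore the cached ordering and skip rows deleted since caching.
    return [by_id[i] for i in ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, score INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(1, "a", 5), (2, "b", 30), (3, "c", 20)])
result = fetch_posts(conn, "top-posts")
```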
There is a very good library for this, but they warn that it's only compatible with Laravel 5.8. If you can update, this is the way to go. If updating Laravel is not an option, you can at least read the code and follow the same approach.
https://github.com/GeneaLabs/laravel-model-caching
This library does exactly what you need. You can have your models and/or custom queries cached and you can invalidate this cache whenever a model gets updated, created or deleted.

AWS ElastiCache vs RDS ReadReplica

My app currently connects to an RDS Multi-AZ database. I also have a Single-AZ Read Replica used to serve my analytics portal.
Recently there has been an increasing load on my master database, and I am thinking about how to resolve this without having to scale up my database again. The two options I have in mind are:
1. Move all the read queries from my app to the read replica, and just scale up the read replica if necessary.
2. Implement ElastiCache Memcached.
To me these two options seem to achieve the same outcome, which is to reduce load on my master database, but I suspect I may have misunderstood some fundamentals, because Google doesn't seem to return any results comparing them.
In terms of load, they have the same goal, but they differ in other areas:
Up-to-dateness of data:
A read replica will continuously sync from the master. So your results will probably lag 0 - 3s (depending on the load) behind the master.
A cache takes the query result at a specific point in time and stores it for a certain amount of time. The longer your queries are being cached, the more lag you'll have; but your master database will experience less load. It's a trade-off you'll need to choose wisely depending on your application.
Performance / query features:
A cache can only return results for queries it has already seen. So if you run the same queries over and over again, it's a good match. Note that queries must not contain changing parts like NOW(), but must be equal in terms of the actual data to be fetched.
If you have many different, frequently changing, or dynamic (NOW(),...) queries, a read replica will be a better match.
ElastiCache should be much faster, since it's returning values directly from RAM. However, this also limits the number of results you can store.
So you'll first need to evaluate how outdated your data can be and how cacheable your queries are. If you're using ElastiCache, you might be able to cache more than queries — like caching whole sections of a website instead of the underlying queries only, which should improve the overall load of your application.
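To illustrate the point about changing parts such as NOW(), here is a small sketch in Python. It assumes the cache key is derived from the SQL text plus its bound parameters; `cache_key`, `bucket` and the `events` table are hypothetical names. Passing the raw clock value makes every key unique, while rounding the timestamp to a bucket lets repeated requests share one cache entry.

```python
import hashlib


def cache_key(sql, params):
    """Derive a cache key from the SQL text plus its bound parameters."""
    raw = sql + "|" + repr(tuple(params))
    return hashlib.sha256(raw.encode()).hexdigest()


def bucket(ts, seconds=60):
    """Round a Unix timestamp down to the start of its bucket."""
    return ts - (ts % seconds)


SQL = "SELECT * FROM events WHERE created_at >= ?"

# Passing the raw clock value yields a different key on every call.
raw_a = cache_key(SQL, [1_700_000_005])
raw_b = cache_key(SQL, [1_700_000_017])

# Rounding to a 60-second bucket lets both requests share one cache entry.
bucketed_a = cache_key(SQL, [bucket(1_700_000_005)])
bucketed_b = cache_key(SQL, [bucket(1_700_000_017)])
```

The bucket size is the same staleness trade-off discussed above: a larger bucket means more cache hits but older results.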
PS: Have you tuned your indexes? If your main problems are writes, that won't help. But if you are fighting reads, indexes are the #1 thing to check, and they do make a huge difference.

Handling large number of ids in Solr

I need to perform an "online" search in Solr, i.e. a user needs to find the list of users who are online and match particular criteria.
How I am handling this: we store the IDs of online users in a table, and I send all online user IDs in the Solr request like
&fq=-id:(id1 id2 id3 ............id5000)
The problem with this approach is that when the list of IDs becomes large, Solr takes too much time to resolve it, and we need to transfer a large request over the network.
One solution could be a join in Solr, but the online data changes regularly and I can't re-index the data that often (say every 5-10 min; the interval should be at least an hour).
Another solution I can think of is firing this query internally from Solr, based on a certain parameter in the URL. I don't know much about Solr internals, so I don't know how to proceed.
With Solr4's soft commits, committing has become cheap enough that it might be feasible to actually store the "online" flag directly in the user record, and just have &fq=online:true on your query. That removes the overhead of sending 5000 IDs over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set commitWithin on the update. It's worth a shot, anyway.
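As a rough sketch of what such an update could look like, the Python below only builds the request (URL and JSON body) and does not send it. It assumes Solr's JSON atomic-update syntax (`{"set": ...}`) and the `commitWithin` request parameter; the base URL, the core name `users`, and the `id`/`online` field names are assumptions for illustration, and atomic updates additionally depend on how the schema and update log are configured.

```python
import json
from urllib.parse import urlencode


def online_update_request(base_url, user_id, online, commit_within_ms=5000):
    """Build (url, body) for a Solr JSON atomic update that sets `online`."""
    url = f"{base_url}/update?" + urlencode({"commitWithin": commit_within_ms})
    # Atomic update: only the `online` field of the document is changed.
    body = json.dumps([{"id": user_id, "online": {"set": online}}])
    return url, body


url, body = online_update_request("http://localhost:8983/solr/users", "u42", True)
```

The body would be POSTed with a `Content-Type: application/json` header whenever a user logs in or out.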
We worked around this issue by implementing sharding of the data.
Basically, without going heavily into code detail:
1. Write your own indexing code
   - use consistent hashing to decide which ID goes to which Solr server
   - index each user's data to the relevant shard (it can be several machines)
   - make sure you have redundancy
2. Query the Solr shards
   - do sharded queries in Solr using the shards parameter
   - start an EmbeddedSolr and use it to do a sharded query
   - Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard
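The consistent-hashing step can be sketched like this. It is a generic hash ring in Python, not the code from the answer; the server names, the number of virtual points per server (`replicas`) and the use of SHA-256 are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual points for each node."""

    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring3 = HashRing(["solr-a", "solr-b", "solr-c"])
ring4 = HashRing(["solr-a", "solr-b", "solr-c", "solr-d"])
user_ids = [f"user-{n}" for n in range(1000)]
moved = sum(ring3.node_for(u) != ring4.node_for(u) for u in user_ids)
```

Adding a fourth server only moves the IDs that now belong to the new server; with plain modulo hashing most IDs would change shards.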
Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited to searches on indexes that are constantly changing, and if you mainly search by IDs then a search engine is not needed.
For our project we basically implemented all the index building, load balancing and query engine ourselves and use Solr mostly as storage. But we started using Solr when sharding was flaky and not performant; I am not sure what the state of it is today.
Last note: if I were building this system today from scratch, without all the work we did over the past 4 years, I would advise using a cache (say memcached or redis) to store all the users that are currently online, and at request time I would simply iterate over all of them and filter according to the criteria. The filtering by criteria can be cached independently and updated incrementally. Also, iterating over 5000 records is not necessarily very time-consuming if the matching logic is very simple.
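A minimal sketch of that last suggestion, in Python. A dict stands in for memcached/redis, and the profile fields (`city`, `age`) and function names are made up for the example.

```python
online_users = {}  # stands in for memcached/redis: user_id -> profile dict


def mark_online(user_id, profile):
    online_users[user_id] = profile


def mark_offline(user_id):
    online_users.pop(user_id, None)


def search_online(predicate):
    """Iterate over every online user and keep those matching the criteria."""
    return sorted(uid for uid, profile in online_users.items() if predicate(profile))


mark_online("u1", {"city": "Paris", "age": 31})
mark_online("u2", {"city": "Oslo", "age": 25})
mark_online("u3", {"city": "Paris", "age": 19})
mark_offline("u2")
matches = search_online(lambda p: p["city"] == "Paris" and p["age"] >= 21)
```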
Any robust solution will involve bringing your data close to Solr (in batch) and using it internally, NOT sending a very large request at search time, which is supposed to be a low-latency operation.
You should develop your own filter. The filter will cache the online users data once in a while (say, every minute). If the data changes VERY frequently, consider implementing a PostFilter.
You can find a good example of filter implementation here:
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
"One solution can be use of join in Solr but online data change regularly and I can't index data every time (say 5-10 min, it should be at-least an hour)."
I think you could very well use Solr joins, but after a little bit of improvisation.
The solution I propose is as follows:
You can have 2 indexes (Solr cores):
1. Primary index (the one you have now)
2. Secondary index with only two fields, "ID" and "IS_ONLINE"
You could now update the Secondary Index frequently (in the order of seconds) and keep it in sync with the table you have, for storing online users.
NOTE: This secondary Index even if updated frequently, would not degrade any performance provided we do the necessary tweaks like usage of appropriate queries during delta-import, etc.
You could now perform a Solr join on the ID field on these two Indexes to achieve what you want. Here is the link on how to perform Solr Joins between Indexes/ Solr Cores.
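As a hedged illustration, a cross-core join filter could be built like this in Python. It assumes Solr's `{!join from=... to=... fromIndex=...}` query parser syntax; the secondary core name `online`, the primary index key field `id` and the example criteria `city:Paris` are assumptions, while `ID` and `IS_ONLINE` are the fields proposed above. The snippet only builds the query string and does not contact Solr.

```python
from urllib.parse import parse_qs, urlencode


def online_search_params(user_query, secondary_core="online"):
    """Build query parameters that restrict results to users flagged online."""
    # Join from the secondary core's ID field to the primary index's id field.
    join = f"{{!join from=ID to=id fromIndex={secondary_core}}}IS_ONLINE:true"
    return urlencode({"q": user_query, "fq": join})


params = online_search_params("city:Paris")
decoded = parse_qs(params)
```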

Choosing a DB for a caching system

I am working on a financial database that I need to develop caching for. I have a MySQL database with a lot of raw, realtime data. This data is then provided over an HTTP API using Flask (Python).
Before the raw data is returned, it is manipulated by my Python code. This manipulation can involve a lot of data, so a caching system is in order.
The cached data never changes. For example, if someone queries for data for a time range of 2000-01-01 till now, the data will get manipulated, returned and stored in the cache as the manipulated data from 2000-01-01 till now. If the same manipulated data is queried again later, the cache will return the values from 2000-01-01 till the last time it was queried, eliminating the need for manipulation for that entire period. Then the new data from that point till now will be manipulated and added to the cache too.
The data size shouldn't be enormous (under 5GB I would say at max).
I need to be able to retrieve from the cache using date ranges.
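The incremental behaviour described above could look roughly like the following Python sketch. It assumes the manipulation can be applied point by point (so old results stay valid when new data arrives) and uses integer timestamps with half-open ranges; `IncrementalRangeCache`, `load_raw` and the sample data are hypothetical, and an in-process dict stands in for whichever store is chosen.

```python
class IncrementalRangeCache:
    """Caches processed points per range start and only processes the new tail."""

    def __init__(self, load_raw, transform):
        self._load_raw = load_raw      # (start, end) -> [(ts, value)], start <= ts < end
        self._transform = transform    # raw value -> manipulated value
        self._cache = {}               # start -> (cached_until, [(ts, manipulated)])

    def get(self, start, now):
        cached_until, points = self._cache.get(start, (start, []))
        if now > cached_until:
            # Only the period since the last query is loaded and manipulated.
            fresh = [(ts, self._transform(v))
                     for ts, v in self._load_raw(cached_until, now)]
            points = points + fresh
            self._cache[start] = (now, points)
        return [p for p in points if p[0] < now]


RAW = [(t, t * 10) for t in range(10)]
loads = []


def load_raw(start, end):
    loads.append((start, end))         # record which raw ranges were requested
    return [(t, v) for t, v in RAW if start <= t < end]


cache = IncrementalRangeCache(load_raw, lambda v: v + 1)
first = cache.get(0, 5)
second = cache.get(0, 8)
```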
Which DB should I be looking at? MongoDB? Redis? CouchDB?
Thanks!
Using a Big Data solution for such a small data set seems like a waste and might still not yield the required latency.
It seems like what you need is not one of the Big Data solutions like MongoDB or CouchDB but a distributed cache (or in-memory data grid).
One of the leading solutions (of which I'm one of the contributors), and one that seems like a perfect match for your needs, is XAP Elastic Caching.
For more details see: http://www.gigaspaces.com/datagrid
And you can find a post describing exactly this case on how you can use DataGrid to scale MySQL: "Scaling MySQL" - http://www.gigaspaces.com/mysql

How to reduce requests to my DB?

I have a DB that stores many posts, like a blog. The problem is that there are many users, and these users create many posts at the same time. So, when a user requests the home page, I request these posts from the DB. In short, I have to fetch the posts I have already shown in order to show the new ones. How can I avoid this performance problem?
Before going down a caching path, ensure that:
1. The logic has been reviewed (are you undertaking unnecessary steps, can you populate some in-memory variables with slow-changing data and so reduce DB calls, etc.).
2. DB operations are as distinct as possible (minimum rows and columns returned).
3. Data is normalised to at least 3rd normal form and then selectively denormalised, with the appropriate data handling routines for the denormalised data.
4. After normalisation, the DB instance is tuned (server performance, disk IO, memory, etc.).
5. The SQL statements are tuned.
Then ...
Consider caching. Even though it is not possible to cache all data, if you can get a significant percentage into cache for a reasonable period of time (and those values vary according to site), you remove load from the DB server, so other queries can be served faster.
Do you do any type of pagination? If not, database pagination would be the best bet: start with the first 10 posts, and after that only return the full list if the user requests it through a link or some other input.
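One possible way to paginate in the database is sketched below in Python with sqlite3. It uses keyset (cursor) pagination on the primary key rather than OFFSET, which is a choice made for this example and not something the answer prescribes; the `posts` table and `fetch_page` are hypothetical.

```python
import sqlite3


def fetch_page(conn, page_size=10, before_id=None):
    """Return up to page_size posts, newest first, older than before_id."""
    if before_id is None:
        cur = conn.execute(
            "SELECT id, title FROM posts ORDER BY id DESC LIMIT ?", (page_size,))
    else:
        cur = conn.execute(
            "SELECT id, title FROM posts WHERE id < ? ORDER BY id DESC LIMIT ?",
            (before_id, page_size))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(n, f"post {n}") for n in range(1, 26)])
page1 = fetch_page(conn)                              # first 10 posts for the home page
page2 = fetch_page(conn, before_id=page1[-1][0])      # next page, only on request
```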
The standard solution is to use something like memcached to offload common reads to a caching layer. So you might decide to only refresh the home page once every 5 minutes rather than hitting the database repeatedly with the same exact query.
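A small time-based cache along these lines, sketched in Python. An in-process dict stands in for memcached, the 300-second TTL matches the "every 5 minutes" example, and the clock is injectable only so that the behaviour can be shown without waiting; `TTLCache` and `load_home_page` are hypothetical names.

```python
import time


class TTLCache:
    """Caches computed values for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}   # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # still fresh: no DB access
        value = compute()                        # expired or missing: hit the DB
        self._entries[key] = (now + self._ttl, value)
        return value


fake_now = [0.0]
db_hits = []


def load_home_page():
    db_hits.append(1)
    return f"home page v{len(db_hits)}"


cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])
a = cache.get_or_compute("home", load_home_page)
fake_now[0] = 299.0
b = cache.get_or_compute("home", load_home_page)
fake_now[0] = 301.0
c = cache.get_or_compute("home", load_home_page)
```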
If there is data that is requested very often, you should cache it. Try using an in-memory cache such as memcached to store things that are likely to be re-requested within a short time. You need free RAM for this: try using free memory on your frontend machine(s), since serving HTTP requests and applying templates is usually less RAM-intensive. BTW, you can cache not only raw DB records, but also ready-made pieces of pages with formatting and all.
If your load cannot be reasonably handled by one machine, try sharding your database. Put data of some of your users (posts, comments, etc) on one machine, data of other users to another machine, etc. This will make some joins impossible on database level, because data are on different machines, but joins that you do often will be parallelized.
Also, take a look at document-oriented 'NoSQL' data stores like MongoDB (http://www.mongodb.org/). For example, it allows you to store a post and all its comments in a single record and fetch them in one operation, without any joins. But regular joins are next to impossible. Probably a mix of SQL and NoSQL storage is the most efficient (and the hardest to handle).
