In the scenario where you've stored ~1TB of key/value, and need to provide a production API to 1000s of users.
Further, the store will only be used for read operations only (after 1-time initial write)
You should use rocksdb and also hold the data in your ram, as much as possible, and serve responses from there (the more the better)
Related
The scenario is about 1 billion records. Each record has 1kb data size and is store in SSD.
Which kv store can provide best random read performance? It need to reduce disk access to only 1 time per query and all of the data index will be stored in memory.
Redis is fast but it's too expensive to store 1 TB data in memory.
LevelDB reads disk several times per query.
The closest one I found is fatcache but it's not persistent. It's an SSD-backed memcached.
Any suggestions?
RocksDB might be the choice for you, which is optimized for fast storage like memory and flash-disk, and its highly customizable. If your application is read-only after initial bulk-load, then you can config RocksDB to compact everything in one single big file. In that way, reads are guaranteed to have at most single I/O. However, if your application handles both reads and writes, then in order to have at most one I/O per read, you will need to sacrifice the write performance as you need to config rocksdb to compact very often, and that hurts write performance.
Tuning guide for RocksDB can also be found here.
You may want to try RocksDB, it's a facebook library which optimized for SSD storage. You can also try Ardb, it's a redis protocol compatible NoSQL DB build on RockDB/LevelDB/LMDB.
Have you looked at aerospike ? I haven't use it, but they claim to have good performances on SSD.
LMDB is faster than RocksDB and uses 1/3rd as much memory. Also LMDb requires no tuning; RocksDB requires careful tuning of over 40 parameters to get performance that approaches LMDB's.
http://www.lmdb.tech/bench/inmem/scaling.html
Also LMDB is fully transactional and 100% crash-proof, RocksDB is neither.
I am looking at using a spell checker for my GAE app and we have an algorithm already for spell checking, but I'm trying to figure out how to best store and load dictionary files for best performance.
I am considering the following strategies:
Place the dictionary data in a text file(s) in local app engine storage and load/read them using standard IO methods (open(),read(),etc)
Place the dictionary data in GCS and load/read using GCS IO methods
Place the dictionary data in an ndb.model() and load/cache information
One cache I don't quite understand is the context cache -- is this cache that is attached to a given instance? I.e. if I have a resident instance that is spun up, can I go ahead and load the dictionary data into the instance's RAM and thus accessing data should be extremely fast (microsecond vs millisecond seek/get times)? The dictionary data will probably be a sharded list of some sort that we'll optimize for performance. Are there other data storage methods/structures I'm not considering here that may be more appropriate? Thanks.
Cache (or its full name memcache) isn't exactly RAM but similar. When used with NDB it acts like a buffer. When you do writes it writes to the Memcache first then to the DB. Though this may sound slower its not, as writes to the DB take a while before they are accessible. When it reads it checks memcache, if it exists then it uses that info otherwise it pulls from the DB, stores it in Memcache then gives you the data. Just like RAM though its volatile, thus you cannot guaranty information is always acceptable, its limited (depending on what type of instance you have) and can be flush with no warning or reason. You can read more here:
https://developers.google.com/appengine/docs/python/memcache/
https://developers.google.com/appengine/articles/scaling/memcache
Ultimately Memcahe will be the fastest and most accessible as it it shared amongst all your instances, so if one instance pulls some data from the datastore then all of them can access it quickly. Even if its not in memcache it is still the fastest of all the options, as the others ones will fill up your memory and may cause errors and performance issues.
I am writing a small appengine application and I want to start using Datastore.
My app has some users and each user is a complicated JAVA class.
Users might swap some "point" objects between them, so I need the data to be available and fast.
My question is quite generic:
How should I handle the data caching?
Storing the data entirely in the Datastore and fetching it in every call sounds slow in runtime.
On the other hand, holding the data in a static JAVA class sounds tricky, because every now and then the server resets and data is erased.
If I had a main loop, like in a regular console application I would've probably saved the data twice-three times a day on predefined hours of the day.
How should I manage my code in such a way that it would save the status in the datastore every now and then and this way will not loose any data.
if the number of write operations are less than read operations,
it is good to try "Write-through caching" with memcache module.
https://developers.google.com/appengine/docs/java/memcache/
basically the write-through caching works like this:
for every write operations, program update the datastore, then backup the data to memcache
for every read operations, program will try to read from memcache. if it can found required data, then it returns, otherwise it will fetch the data from datastore and backup the results to memcache.
As #lucemia noted, you should use memcache.
If you use objectify, you can set-up caching declaratively, without writing and cache handling code.
I see in many cases memcached is used. Can you give examples when to avoid memcached other than large files? How large files are appropriate for memcached?
Thanks for your answer
If you know when to bust the caches to prevent out-of-date things from being cached, there's not really a reason to avoid memcache for anything small unless it's so trivial to compute that it'd be approximately as long to hit memcache as it would to just compute it.
I have seen Memcached is used to store session data.As my point of view its not recommended storing sessions in Memcached.If a session disappears, often the user is logged out,If a portion of a cache disappears or either due to a hardware crash it should not cause your users noticable pain.According to the memcached site, “memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.” So while developing your application, remember that you must have a fall-back mechanism to retrieve the data once it is not found in the Memcached server. Here are some tips,
Avoid storing frequently updated data in Memcached.
Memcached does not ship with in-built security mechanisms. So It is your responsibility to handle security issues.
Try to maintain predefined fixed number of connections in the connection pool because each set/get operation is a new atomic connection to the Memcached server.
Avoid storing raw data coming straight from the database rather than storing processed data
I have a very small amount of data (~200 bytes) that I retrieve from the database very often. The write rate is insignificant.
I would like to get away from all the unnecessary database calls to fetch this almost static data. Is memcached a good use for this? Something else?
If it's of any relevance I'm running this on GAE using python. The data in question could be easily (de)serialized as json.
Memcache is well-suited for this - reading from the datastore is much more expensive than reading from memcache. This is especially true for small amounts of data for which the cost to retrieve is dominated by latency to the datastore.
If your app receives enough requests that instances typically stay alive for a little while, then you could go one step further and use App Caching to largely avoid memcache too. (Basically, cache the value in a global variable, and also app-cache the time the value was last updated. Provide an accessor for the value which retrieves the latest from memcache/db if it hasn't been updated in X minutes). Memcache is pretty cheap though, so this extra work might only make sense if you access this variable rather frequently.
If it changes less often than once per day, you could just hardcode it in webapp code, and reupload the file each time it changes.