When to avoid memcache? - google-app-engine

When to avoid memcache? - google-app-engine

I see in many cases memcached is used. Can you give examples when to avoid memcached other than large files? How large files are appropriate for memcached?
Thanks for your answer

If you know when to bust the caches to prevent out-of-date things from being cached, there's not really a reason to avoid memcache for anything small unless it's so trivial to compute that it'd be approximately as long to hit memcache as it would to just compute it.

I have seen Memcached is used to store session data.As my point of view its not recommended storing sessions in Memcached.If a session disappears, often the user is logged out,If a portion of a cache disappears or either due to a hardware crash it should not cause your users noticable pain.According to the memcached site, “memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.” So while developing your application, remember that you must have a fall-back mechanism to retrieve the data once it is not found in the Memcached server. Here are some tips,
Avoid storing frequently updated data in Memcached.
Memcached does not ship with in-built security mechanisms. So It is your responsibility to handle security issues.
Try to maintain predefined fixed number of connections in the connection pool because each set/get operation is a new atomic connection to the Memcached server.
Avoid storing raw data coming straight from the database rather than storing processed data

Related

Real-time chats using Redis - how not to lose messages?

I want to implement a real-time chat. My main db is PostgreSQL, with the backend written in NodeJS. Clients will be mobile devices.
As far as I understand, to achieve real-time performance for messaging, I need to use Redis.
Therefore, my plan was to use Redis for the X most recent messages between 2 or more people(group chat) , for example 1000, and everything is synced and backed in my main Db which is PostgreSQL.
However, since Redis is essentially just RAM, the chat history can be too "vulnerable", owing to the volatile nature of storing data in RAM.
So if my redis server has some unexpected and temporary failure, the recent messages in conversations would be lost.
What are the best practices nowaydays to implement something like this?
Do I simply need to persist Redis data to disk? but in that case, wouldn't that hurt performance, since it will increase the write time for each message sent ?
Or perhaps I should just prepare a recovery method, that fetches the recent history from PostgreSQL in case my redis chat history list is empty?
P.S - while we're at it, could you please suggest a good approach for maintaining the status of users (online/offline) ? Is it done with Redis as well?

What are the best practices nowaydays to implement something like
this? Do I simply need to persist Redis data to disk? but in that
case, wouldn't that hurt performance, since it will increase the write
time for each message sent?
Yes, enabling persistence will impact performance of redis.
The best bet will be run a quick benchmark with the expected IOPS and type of operations from your application to identify impacts on IOPS with persistence enabled.
RBD vs AOF:
With RDB persistence enabled, the parent process does not perform disk I/O to store changes to data to RDB. Based on the values of save points, redis forks a child process to perform RDB.
However, based on the configuration of save points, you may loose data written after last save point - in case of the event of server restart or crash if data was not saved from last save point
If your use case can not tolerate to the data loss for this period, you need to look at the AOF persistence method. AOF will keep track of all write operations, that can be used to construct data upon server restart event.
AOF with fsync policy set to every second can be slower, however, it can be as good as RDB if fsync is disabled.
Read the trade-offs of using RDB or AOF: https://redis.io/topics/persistence#redis-persistence
P.S - while we're at it, could you please suggest a good approach for
maintaining the status of users (online/offline) ? Is it done with
Redis as well?
Yes

Caching database table on a high-performance application

I have a high-performance application I'm considering making distributed (using rabbitMQ as the MQ). The application uses a database (currently SQLServer, but I can still switch to something else) and caches most of it in the RAM to increase performance.
This causes a problem because when one of the applications writes to the database, the others' cached database becomes out-of-date.
I figured it is something that happens a lot in the High-Availability community, however I couldn't find anything useful. I guess I'm not searching for the right thing.
Is there an out-of-the-box solution?
PS: I'm sorry if this belongs to serverfault - Since this a development issue I figured it belongs here
EDIT:
The application reads and writes to the database. Since I'm changing the application to be distributed - Now more than one application reads and writes to the database. The caching is done in each of the distributed applications, which are not aware to DB changes from another application.
I mean - How can one know if the DB was updated, if he wasn't the one to update it?

So you have one database and many applications on various servers. Each application has its own cache and all the applications are reading and writing to the database.
Look at a distributed cache instead of caching locally. Check out memcached or AppFabric. I've had success using AppFabric to cache things in a Microsoft stack. You can simply add new nodes to AppFabric and it will automatically distribute the objects for high availability.
If you move to a shared cache, then you can put expiration times on objects in the cache. Try to resist the temptation to proactively evict items when things change. It becomes a very difficult problem.
I would recommend isolating your critical items and only cache them once. As an example, when working on an auction site, we cached very aggressively. We only cached an auction listing's price once. That way when someone else bid on it, we only had to do one eviction. We didn't have to go through the entire cache and ask "Where does the price appear? Change it!"
For 95% of your data, the reads will expire on their own and writes won't affect them immediately. 5% of your data needs to be evicted when a new write comes in. This is what I called your "critical items". Things that always need to be up to date.
Hope that gives you ideas!

Caching in Google App Engine/Cloud Based Hosting

I am curious as to how caching works in Google App Engine or any cloud based application. Since there is no guarantee that requests are sent to same sever, does that mean that if data is cached on 1st request on Server A, then on 2nd requests which is processed by Server B, it will not be able to access the cache?
If thats the case (cache only local to server), won't it be unlikely (depending on number of users) that a request uses the cache? eg. Google probably has thousands of servers

With App Engine you cache using memcached. This means that a cache server will hold the data in memory (rather than each application server). The application servers (for a given application) all talk the same cache server (conceptually, there could be sharding or replication going on under the hoods).
In-memory caching on the application server itself will potentially not be very effective, because there is more than one of those (although for your given application there are only a few instances active, it is not spread out over all of Google's servers), and also because Google is free to shut them down all the time (which is a real problem for Java apps that take some time to boot up again, so now you can pay to keep idle instances alive).
In addition to these performance/effectiveness issues, in-memory caching on the application server could lead to consistency problems (every refresh shows different data when the caches are not in sync).

Depends on the type of caching you want to achieve.
Caching on the application server itself can be interesting if you have complex in-memory object structure that takes time to rebuild from data loaded from the database. In that specific case, you may want to cache the result of the computation. It will be faster to use a local cache than a shared memcache to load if the structure is large.
If having consistent value between in-memory and the database is paramount, you can do some checksum/timestamp check with a stored value on the datastore, every time you use the cached value. Storing checksum/timestamp on a small object or in a global cache will fasten the process.
One big issue using global memcache is ensuring proper synchronization on "refilling" it, when a value is not yet present or has been flushed. If you have multiple servers doing the check at the exact same time and refilling value in cache, you may end-up having several distinct servers doing the refill at the same time. If the operation is idem-potent, this is not a problem; if not, a potential and very hard to trace bug.

Charts or Stats comparing Database vs. HTTP vs. Direct File Access Performance?

I am wondering what the stats are for different ways of storing (and therefore retrieving) content. Are there any charts out there, or do you guys have any quick tests to show, the requests per second, etc., of:
Direct (local) database access, vs.
HTTP Access to cached data, vs.
HTTP Access to uncached data (remote database), vs.
Direct File access
I am wondering to judge how necessary it is to locally cache data if I'm using remote services.
Thanks!

.. what the stats are ...
Although some people may have published their findings, this will not map directly to your experience - you may find the opposite of they discovered.
Sometimes it may be faster to retrieve files from a database than a file - it depends on the size of the file, the filesystem or DBMS it resides on, the other data which affects the access path (e.g. indexes, number of I/O operations to dereference the start of the file...) the underlying hardware, the amount of caching available, the presence of the data or information relating to its location in the cache and the interaction between each of these factors.
And that's before you start considering the additional variables introduced when you start talking about HTTP, which also implies remote network access.
While ultimately any file would need to be read from the filesystem at some point, this suggests that direct file access would be the fastest method (but only on the local machine) however if you consider centralized caching and concurrency this is not necessarily the case.
I am wondering to judge how necessary it is to locally cache data if I'm using remote services.
Rather hard to say. How remote? what are your bandwidth costs? Latency? What level of service do you hope to provide? Does the remote system provide caching information already? How do you deal with cache invalidations?
If we knew everything about your application, the data source, your customers and networks connecting them and your budget for implementing the service then we might hazard a guess. And, yes, caching on the MITM server probably is a good idea but only if you know that you're not breaking anything by using caching.
C.

When to use a certain type of persistence in Google App Engine?

First of all I'll explain the question. By persistence, I mean storing data beyond the execution of a single request. It might not be the best question title, so feel free to edit it.
The way I see it, there are three types of persistence in GAE, each one "closer" to the request itself:
The datastore
This is where all data is most likely to be based. It may go into the higher layers of persistence temporarily, but in the end, this is where the data really is. Unfortunately, querying the datastore repeatedly is slow and uses a lot of resources.
Use when...
storing data that should be stored for an indefinite amount of time.
Avoid using when...
getting data that is queried often but rarely updated.
memcache
This is a highly complex caching engine that stores the data in memory and makes sure all users read from/write to the same cache. It's a much faster way to get/set data on a key→value basis than using the datastore. Unfortunately, data can only stay in the memory for so long, and there is no guarantee that it will stay for as long as you tell it to; the data may disappear at any time if memory is needed elsewhere.
Use when...
you need to get data more often than you need to update it. Even when data needs to be updated often, it can have its uses (if a few missed updates are considered okay), by setting up a task queue to persist data from the memcache to the datastore.
Avoid using when...
data needs to be updated often and has to be up-to-date when fetched.
Global variables
This isn't an official method of persisting data, but it works. However, it's the least reliable method, and since it has no data synchronization across servers, persisted data may show up differently for different users (but from what I've found, the server rarely changes for the same user.) Theoretically, this should be the method that has the least overhead in getting/setting values, however, and could have its uses.
Use when...
hell freezes over? I don't know... I haven't enough knowledge about what goes on behind the scenes to actually rely on this method. Discuss!
Avoid using when...
you rely on the data being the same across servers.
Cookies
If the data is user-specific, it can be efficient to store it as a cookie in the user's browser. There are some pitfalls to watch out for though:
Security – the user can meddle with cookies, and malicious people could potentially do the same. To make sure that the contents are unreadable and unchangeable to all, the cookie can be encrypted using the PyCrypto library which is available on GAE.
Performance – since cookies are sent with every request (even images), it can add to the bandwidth being used, and slow down requests. One solution is to use another domain for static content, so the browser won't send the cookie for that content.
When should the different types of persistence be used? How can they be combined to reduce/even out the amount of resources being spent?

Datastore
Use the datastore to hold any long living information. The datastore should be used like you would use a normal database to hold data that will be used in your site/application.
MemCache
Use this to access data a lot quicker than trying to access the datastore. MemCache can return data really quickly and can be used for any data that needs to span multiple calls from users. It is normally data that was originally in the datastore and then moved to the memcache.
def get_data():
data = memcache.get("key")
if data is not None:
return data
else:
data = self.query_for_data() #get data from the datastore
memcache.add("key", data, 60)
return data
The memcache will flush itself when the item is out of date. You set this in the last param of the add shown above.
Global Variables
I wouldn't use these at all since they can't span instances. In GAE a request creates a new instance, well in python it does. If you want to use Global variables I would store the data needed in the memcache.

Your post is a good summary of the 3 major options. You mostly have answered the question already. However, if you are currently building an app and stressing over whether or not you should memcache something, try this:
Write your app using the datastore for everything that needs to outlive more than one request.
Once your app (or some usable subset) is working, run some functional tests or simulations to see where the slow spots (or high quota usage) are.
Find the most slow or inefficient request path, and figure out how to make that faster (either by using memcache, or altering your datastructures so you can do gets instead of queries, or possibly storing something in a global instance variable*)
goto 2 until you're satisfied.
*Things that might be good for a "global" variable would be something that is relatively expensive to create/fetch, that a substantial portion of your requests will use, and that does not need to be consistent across requests/users.

I use global variable to speed up json conversion. Before I convert my data structure to json, I hash it and check if the json if already available. For my app this gives quite a speedup as the pure python implementation is quite slow.

Global variables
To complement AutomatedTester's answer, and also reply his further question about how to share information between GETs without memcache or datastore, below a quick illustration of how to use global variables:
if 'i' not in globals():
i = 0
def main():
global i
i += 1
print 'Status: 200'
print 'Content-type: text/plain\n'
print i
if __name__ == '__main__':
main()
Calling this script multiple times will give you 1, 2, 3... Of course as mentioned earlier by Blixt you should not count on this trick too much ('i' can sometimes switch back to zero) but it can be useful to store user-specific information in a dictionary, session data for instance.