using memcached for very small data, good idea? - google-app-engine

I have a very small amount of data (~200 bytes) that I retrieve from the database very often. The write rate is insignificant.
I would like to get away from all the unnecessary database calls to fetch this almost static data. Is memcached a good use for this? Something else?
If it's of any relevance I'm running this on GAE using python. The data in question could be easily (de)serialized as json.

Memcache is well-suited for this - reading from the datastore is much more expensive than reading from memcache. This is especially true for small amounts of data for which the cost to retrieve is dominated by latency to the datastore.
If your app receives enough requests that instances typically stay alive for a little while, then you could go one step further and use App Caching to largely avoid memcache too. (Basically, cache the value in a global variable, and also app-cache the time the value was last updated. Provide an accessor for the value which retrieves the latest from memcache/db if it hasn't been updated in X minutes). Memcache is pretty cheap though, so this extra work might only make sense if you access this variable rather frequently.

If it changes less often than once per day, you could just hardcode it in webapp code, and reupload the file each time it changes.

Related

What are the low hanging fruit for optimizing google app engine with respect to quota usage?

Everyone learns to use Memcache pretty quick. Another one I've learned recently is setting indexed=False for Model properties that I am not going to query against. What are some others? What are the big ones?
Don't use offset in queries. Use cursors instead.
Explanations: offset loads all data up to offset+limit and charges you for it, but only returns limit entities.
Minimize instance use, by tweaking idle instances and pending latency appropriately for your app.
A couple helped us (not all may be low-hanging at first). First, we denormalized our datastore to reduce joins. I'm using SQL terms because I came from a SQL background. By spreading commonly queried elements around, we reduced the number of reads we had to make considerably, even after factoring in Memcache. Potentially increases writes but for most apps, the number of reads far outweighs the number of writes.
Next, we started using task queues, backends, and the channel API more often. I don't remember specific examples but I do remember we were able to reduce our front-end usage down below the free quota mark by moving some processing around to queues and backends and by sending data down via channel rather than having the client poll.
Also, we use objectify for our data access which we configure to automatically use memcache wherever appropriate.

How to handle caching and data-store sync in GAE

I am writing a small appengine application and I want to start using Datastore.
My app has some users and each user is a complicated JAVA class.
Users might swap some "point" objects between them, so I need the data to be available and fast.
My question is quite generic:
How should I handle the data caching?
Storing the data entirely in the Datastore and fetching it in every call sounds slow in runtime.
On the other hand, holding the data in a static JAVA class sounds tricky, because every now and then the server resets and data is erased.
If I had a main loop, like in a regular console application I would've probably saved the data twice-three times a day on predefined hours of the day.
How should I manage my code in such a way that it would save the status in the datastore every now and then and this way will not loose any data.
if the number of write operations are less than read operations,
it is good to try "Write-through caching" with memcache module.
https://developers.google.com/appengine/docs/java/memcache/
basically the write-through caching works like this:
for every write operations, program update the datastore, then backup the data to memcache
for every read operations, program will try to read from memcache. if it can found required data, then it returns, otherwise it will fetch the data from datastore and backup the results to memcache.
As #lucemia noted, you should use memcache.
If you use objectify, you can set-up caching declaratively, without writing and cache handling code.

How often does memcache on Google AppEngine lose data?

Memcache in general and on AppEngine in specific is unreliable in the sense that my data may be deleted from the cache for whatever reason at any point in time. However, in some cases there might be cases where a small risk may be worth the added performance using memcache could give, such as updating some data in memcache that gets saved periodically to some other, more reliable storage. Are there any numbers from Google that could give me an indication of the actual probability that a memcache entry would be lost from the cache before its expiration time, given that I keep within my quotas?
Are there any reasons other than hardware failure and administrative operations such as machines at the data centers being upgraded/moved/replaced that would cause entries to be removed from memcache prematurely?
Memcache, like any cache, should be used as... a cache. If you can't find something in the cache, there must be a strategy to find it in permanent storage.
In addition to the reasons you mention, Memcache and other caching approaches have limits to the amount of items they will hold (discarding usually the least recently used ones when the cache is full), and often also set other cache invalidation policies (e.g. flush everything unused for one hour).
If you don't configure and operate the cache yourself, you have NO guarantee of when and how items might be removed from the cache intentionally / by design.
Any concrete answer you get to this question is 100% subject to change.
That said, I've used memcache under light loads to accumulate data for 15 minutes or so before writing it all to the Datastore. This was for totally non-critical analytic data though. Do not depend on it.
It's not that data can be lost, but that if it is lost, it can be easily regained.
For example, using it to store data from the datastore is ideal, in that if a piece of data is not in the cache, it can be easily fetched.
If you're storing data such as a hit counter in the cache, it can't be regained if the cache is cleared, so you'll lose data.
If you're concerned about load for a common job, how about setting a job to update the counter later, using the task queue?
I have implemented a shared-memcache based stats counter that collects hourly to DB and can identify cache loss (log it). So far I see constantly <10% cache-losses total each day after at most 1h (average 30 mins) cache time with about 60 active counters. Counter losses appear to be random single counters. I suspect, that counters that are incremented only once (occurs quite often in my case) could have higher probability of being dropped.
My App uses <1MB total memcache in the shared memcache system. Unfortunately using dedicated memcache with 1GB minimum and substantial costs per year is out of the question. Stats counter used.
I have created a stackdriver counter that records memcache losses for a counter that is saved every full hour. The graph shows successful saves in red and memcache fails in blue. The counter saves every full hour and has a few counts in the hour.

When to use a certain type of persistence in Google App Engine?

First of all I'll explain the question. By persistence, I mean storing data beyond the execution of a single request. It might not be the best question title, so feel free to edit it.
The way I see it, there are three types of persistence in GAE, each one "closer" to the request itself:
The datastore
This is where all data is most likely to be based. It may go into the higher layers of persistence temporarily, but in the end, this is where the data really is. Unfortunately, querying the datastore repeatedly is slow and uses a lot of resources.
Use when...
storing data that should be stored for an indefinite amount of time.
Avoid using when...
getting data that is queried often but rarely updated.
memcache
This is a highly complex caching engine that stores the data in memory and makes sure all users read from/write to the same cache. It's a much faster way to get/set data on a key→value basis than using the datastore. Unfortunately, data can only stay in the memory for so long, and there is no guarantee that it will stay for as long as you tell it to; the data may disappear at any time if memory is needed elsewhere.
Use when...
you need to get data more often than you need to update it. Even when data needs to be updated often, it can have its uses (if a few missed updates are considered okay), by setting up a task queue to persist data from the memcache to the datastore.
Avoid using when...
data needs to be updated often and has to be up-to-date when fetched.
Global variables
This isn't an official method of persisting data, but it works. However, it's the least reliable method, and since it has no data synchronization across servers, persisted data may show up differently for different users (but from what I've found, the server rarely changes for the same user.) Theoretically, this should be the method that has the least overhead in getting/setting values, however, and could have its uses.
Use when...
hell freezes over? I don't know... I haven't enough knowledge about what goes on behind the scenes to actually rely on this method. Discuss!
Avoid using when...
you rely on the data being the same across servers.
Cookies
If the data is user-specific, it can be efficient to store it as a cookie in the user's browser. There are some pitfalls to watch out for though:
Security – the user can meddle with cookies, and malicious people could potentially do the same. To make sure that the contents are unreadable and unchangeable to all, the cookie can be encrypted using the PyCrypto library which is available on GAE.
Performance – since cookies are sent with every request (even images), it can add to the bandwidth being used, and slow down requests. One solution is to use another domain for static content, so the browser won't send the cookie for that content.
When should the different types of persistence be used? How can they be combined to reduce/even out the amount of resources being spent?
Datastore
Use the datastore to hold any long living information. The datastore should be used like you would use a normal database to hold data that will be used in your site/application.
MemCache
Use this to access data a lot quicker than trying to access the datastore. MemCache can return data really quickly and can be used for any data that needs to span multiple calls from users. It is normally data that was originally in the datastore and then moved to the memcache.
def get_data():
data = memcache.get("key")
if data is not None:
return data
else:
data = self.query_for_data() #get data from the datastore
memcache.add("key", data, 60)
return data
The memcache will flush itself when the item is out of date. You set this in the last param of the add shown above.
Global Variables
I wouldn't use these at all since they can't span instances. In GAE a request creates a new instance, well in python it does. If you want to use Global variables I would store the data needed in the memcache.
Your post is a good summary of the 3 major options. You mostly have answered the question already. However, if you are currently building an app and stressing over whether or not you should memcache something, try this:
Write your app using the datastore for everything that needs to outlive more than one request.
Once your app (or some usable subset) is working, run some functional tests or simulations to see where the slow spots (or high quota usage) are.
Find the most slow or inefficient request path, and figure out how to make that faster (either by using memcache, or altering your datastructures so you can do gets instead of queries, or possibly storing something in a global instance variable*)
goto 2 until you're satisfied.
*Things that might be good for a "global" variable would be something that is relatively expensive to create/fetch, that a substantial portion of your requests will use, and that does not need to be consistent across requests/users.
I use global variable to speed up json conversion. Before I convert my data structure to json, I hash it and check if the json if already available. For my app this gives quite a speedup as the pure python implementation is quite slow.
Global variables
To complement AutomatedTester's answer, and also reply his further question about how to share information between GETs without memcache or datastore, below a quick illustration of how to use global variables:
if 'i' not in globals():
i = 0
def main():
global i
i += 1
print 'Status: 200'
print 'Content-type: text/plain\n'
print i
if __name__ == '__main__':
main()
Calling this script multiple times will give you 1, 2, 3... Of course as mentioned earlier by Blixt you should not count on this trick too much ('i' can sometimes switch back to zero) but it can be useful to store user-specific information in a dictionary, session data for instance.

Database/data storage for high volume of simple transactions

I have written a PHP application which requires storage of millions of integers between 0 and 10,000,000 inclusive. Each number is incremented by one very frequently (on average 100 values are updated every second) and read very frequently (20,000 reads per second). The numbers are reset to 0 either nightly, weekly, monthly or annually.
I've got a fairly good handle on MySQL but it feels like overkill, and not very efficient in the process.
Has anyone had to deal with this before, and/or could shed some light on a suitable data storage system?
Depending on how much memory you have, and how long you have to keep them around - you could just use the APC cache, or memcache. memcache has a nice increment operation, so it's crazy efficient doing it.
BTW, you said you need 20K reads. And you are doing it in PHP? on one server or on a cluster? Sounds like you need some architectural guidance..... :)
If it's a web-app, you are going to be using a cluster. And if it's not a web app, I'd suggest you think about re-doing it in something which isn't PHP. Java, C++ and c# come to mind.

Resources