Under Compute > Memcache we have some statistics:
HitRatio, Items In Cache, Oldest Item Age, Total Cache Size etc.
We can also see stats for 20 commonly used keys, ordered by either Operation Count or Memcache compute units.
My question is: is it possible to figure out how many times per second a key has been read (or read + written) from memcache using just the memcache stats?
For example, if I have 1 million hits, the oldest item is 1 day old, and my memcache key accounts for 5% of the traffic,
could I calculate (1 million hits * 5% = 50,000 hits) / 24 hours ≈ 0.57 hits per second?
Really, I have no idea what the statistics in the memcache viewer actually mean; for example, they don't even reset when memcache is flushed.
Cheers.
I am pretty sure counting this way won't give you what you want. As explained in the Python memcache statistics paragraph, the age of an item resets when it is read, so the oldest item being 1 day old just means it has been in memcache for a day since it was last read.
To figure out how many times per second a key has been read, you might want to use a sharded counter, or some other form of logging, and then retrieve the logged data with the Logs API to interpret it. It can't really be done directly from the memcache statistics (it might be an interesting feature to request on Google's public issue tracker, though).
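If it helps, here is a rough sketch (Python) of counting reads yourself by keeping a side counter next to each key; the tracked_get helper and the 'reads:' key prefix are just illustrative names, not part of any App Engine API:

from google.appengine.api import memcache

def tracked_get(key):
    # Fetch the value as usual.
    value = memcache.get(key)
    # Bump a side counter for this key; initial_value=0 creates it on first use.
    # The counter itself lives in memcache, so it can be evicted and should be
    # treated as approximate (or periodically flushed to the datastore or logs).
    memcache.incr('reads:' + key, initial_value=0)
    return value

Dividing that counter by the time since you started tracking gives an approximate reads-per-second figure for the key.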
What is the most efficient way, in terms of cost and scalability, to pull stats on large volumes of data?
Let's take a concrete example: there are 1000 companies, each with 10000+ customers.
These companies are all in retail, or, to make it more generic, they are companies in any industry that want to know certain things from their customers.
Ten of these companies want to know how well their call centres are doing and send out an email asking customers to rate them 1 - 5; customers click on a link and rate them 1 - 5.
Twenty of these companies, which could include some of the previous ten, want to know something else and ask for a rating of 1 - 5.
Now if I want to give each of these companies feedback on their average rating or where they stack up against the other companies who sent the same questionnaire or had overlapping questions, what would be the best strategy to calculate these stats?
Option 1: Have a special entity just for stats; every time a customer rates the company for something, increment the stats counters (e.g. increment the counter for number of votes, the vote total, the male / female counters if you're tracking votes by gender, etc.). A rough sketch of this appears at the end of this question.
The problem with this approach is that you'll be doing y extra writes (where y is the number of stats reports you want to track) for every data entry, and you're also limited to only those stats you decided to track up front. Also, you'll be limited to roughly 1 write/s per entity group, as Peter mentioned in his response here: Using Objectify to concurrently write data on GAE
If x is the number of entries and y the number of stats reports you want to pull, you'll be doing x * y writes and y reads to report on the stats.
Option 2: Do something like ofy.query(MyEntity.class).filter("field", v).count();
The pitfall is that you're looking up all those entities: does GAE charge for x read operations if you're doing a count that touches x entities?
Also, if you're potentially running through 20000 entries, won't you hit some sort of limit in terms of time-outs, max reads per query, etc.?
Depending on how often I pull stats, this means x reads every time I pull stats, assuming I don't hit those limits.
Option 3: Put an extra property in each feedback entry for every piece of stats you're trying to build. Then have a scheduler run every hour / day / week / ..., use cursors to run through the entries, mark each entry's stats column as counted, and add its value to a stats entity. If the number of feedback entries is x and you want to pull y reports on this data, that means (assuming you do the calculations in memory and not immediately in a stats entity) x writes to mark the x feedback entries as counted, plus y writes every hour / day / week to store the updated stats values.
This means that for x feedback entries, I'll be doing at least 2 * x writes, and only y reads to read the stats.
All of the above seems yucky; is there a better way to do it?
If not, which of the above is the better approach, one that won't break when the volumes are massive and that won't dramatically increase costs over what they already are in terms of reads / writes / storage?
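For reference, here is a rough sketch of what I mean by Option 1 (written in Python for brevity, even though I'm using Objectify; the CompanyStats model and field names are made up):

from google.appengine.ext import db

class CompanyStats(db.Model):
    # Hypothetical per-company, per-question stats entity.
    vote_count = db.IntegerProperty(default=0)
    vote_total = db.IntegerProperty(default=0)

def record_rating(stats_key, rating):
    # Transactional read-modify-write; contention limits this to roughly
    # one update per second per entity group.
    def txn():
        stats = db.get(stats_key)
        stats.vote_count += 1
        stats.vote_total += rating
        stats.put()
    db.run_in_transaction(txn)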
GAE is not a good option for doing analytics, because of its concurrent write limitations and the lack of a good query language.
If you are serious about analytics, you should export data from GAE to BigQuery and do the analytics there. Check out Mache and this blog post.
Assume an app that collects real-time temperature data for various cities around the world every 10 minutes.
Using the following GAE datastore model,
from google.appengine.ext import db

class City(db.Model):
    name = db.StringProperty()

class DailyTempData(db.Model):
    date = db.DateProperty()
    temp_readings = db.ListProperty(float, indexed=False)  # appended every 10 minutes
and a cron.yaml as so,
cron:
- description: read temperature
  url: /cron/read_temps
  schedule: every 10 minutes
I am already hitting GAE's daily free quota for datastore writes, and I'm looking for ways to get around this problem.
I'm thinking of reducing my datastore writes by persisting the temperature data only at the end of each day, which will effectively reduce the daily write volume (for each city) from 144 times to 1.
One way to do this is to use memcache as a temporary scratchpad, but due to the possibility of random data evictions, I could well lose all my data for the day. (Aside question: from experience, how often does unplanned eviction really happen?)
Questions are as follows:
Is there such a memory/storage facility (persistent and guaranteed across cron jobs) that would allow me to reduce datastore writes as described?
If not, what could be some alternative solutions?
The only other requirement is that the temperature readings must be accessible (for serving to the client side) at any given time of day.
The only guaranteed storage is the datastore.
As for memcache evictions: it depends on what is going on in your app and in Google App Engine land; evictions could happen within a minute or two, or only after hours. In my App Engine instances I usually see the oldest items at about 2 hours old. But it all depends, and you just can't rely on it.
The task queue payload limit is about 10K.
You could just write a blob (containing all the cities measured in the 10-minute interval) and then reprocess it, unpick it, and write out the per-city details at the end of the day.
When you say clients must be able to access temperature readings, do you mean just the current reading, or all the readings for the day?
You could also change your model so that one big object is stored for each execution of the cron, rather than one per city.
For example, say the object is called Measures... a Measures item will contain a list of all your measurements for the corresponding time. Store them as non-indexed properties and you should have no problems... and also just 144 writes a day.
For the reading part... Use memcache to store the Measures items, as a good usage pattern.
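A rough sketch of what that could look like (Python; the Measures kind and its fields are hypothetical names, not from the question):

from google.appengine.api import memcache
from google.appengine.ext import db

class Measures(db.Model):
    # One entity per cron run (144 per day), holding every city's reading.
    timestamp = db.DateTimeProperty(auto_now_add=True)
    cities = db.StringListProperty(indexed=False)          # parallel lists:
    temperatures = db.ListProperty(float, indexed=False)   # cities[i] -> temperatures[i]

def store_run(cities, temperatures):
    Measures(cities=cities, temperatures=temperatures).put()   # one write per run
    # Keep the freshest readings in memcache so clients read from there.
    memcache.set('latest_measures',
                 {'cities': cities, 'temperatures': temperatures})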
With the App Engine pricing changes, we've been paying attention to our datastore puts. According to the pricing comparison chart, we're making 2.18 million puts a day. This seems a lot higher than expected. We receive about 0.6 queries per second, which means that each request is making about 60 puts!!
Using the sample code for db profiling (http://code.google.com/appengine/articles/hooks.html),
we measured this for a day and the most we counted was ~14,000, which seems more reasonable. Does anyone have experience with something similar on their site?
The discrepancy you're seeing is because every index write is counted separately. When you do a datastore put, you're charged for the number of rows that have to be modified, so if you modified a single indexed field, you'd expect to be charged for:
One write for the entity itself
Two writes for the ascending index for the modified property
Two writes for the descending index for the modified property
For a total of 5 writes. As you can see, setting properties to indexed=False can have a big impact on your quota usage here.
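As an illustration (a hypothetical Python model, not from the question), marking properties you never filter or sort on as unindexed keeps a put down to the entity write alone:

from google.appengine.ext import db

class LogEntry(db.Model):
    # Indexed (the default): you can filter/sort on this field, but every put
    # also pays for the ascending and descending index rows.
    created = db.DateTimeProperty(auto_now_add=True)
    # Unindexed: no index rows are written for these on each put.
    message = db.TextProperty()                    # TextProperty is never indexed
    payload = db.StringProperty(indexed=False)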
I read today about sharded counters in Google App Engine. The article says that you should expect to max out at about 5 updates per second per entity in the datastore. But it seems to me that this solution doesn't 'scale' unless you have some way of knowing how many updates you are doing per second. For example, you can allocate 10 shards, but will then start choking at 50 updates per second.
So how do you know how fast the updates are coming, and how do you feed that number back into the number of shards?
My guess is that along with the counter you could keep some record of recent activity, and if you detect a spike you can increase the number of shards. Is that generally how it's done? And if so, why isn't it done in the sample code? (That last question may be unanswerable.) Is it more common practice to monitor website activity and update shard counts as traffic rises, as opposed to doing it automatically in the code?
Update: What are the practical consequences of having too few shards and choking? Does it simply mean that the website becomes unresponsive, or is it possible to lose counter updates because of timeouts?
As an aside, this question talks about implementing counters without sharding, but one of the answers implies that even memcache needs to be sharded if traffic is high. So this issue of shard allocation and tuning seems to be important.
It is clearly simpler to manually monitor your website's popularity and increase the number of shards as needed. I would guess that most sites take this approach. Doing it programmatically would not only be difficult, but it sounds like it would add an unacceptable amount of overhead to keep a record of all recent activity and try to analyze it to dynamically adjust the number of shards you're using.
I would prefer the simpler approach of just erring a little on the high side with the number of shards you choose.
You are correct about the practical consequences of having too few shards. Updating a datastore entity more frequently than it can handle will initially cause some requests to take a long time (while the writes retry). If enough of them pile up, they will start to fail as requests time out. This will certainly lead to missed counter updates. On the upside, your page will be so slow that users should start leaving, which should relieve the pressure on the datastore :).
To address the last part of your question: Your memcache values will not require sharding. A single memcache server can handle tens of thousands of QPS of fetches and updates, so no plausibly large app is going to need to shard its memcache keys.
Why not add to the number of shards when Exceptions begin to occur?
Based on this GAE Example:
import java.util.ConcurrentModificationException;
import com.google.appengine.api.datastore.DatastoreFailureException;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.DatastoreTimeoutException;
import com.google.appengine.api.datastore.Transaction;

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
try {
    Transaction tx = ds.beginTransaction();
    // increment shard
    tx.commit();
} catch (DatastoreFailureException e) {
    // Datastore is struggling to handle the current load: increase / double the shard count
    addShards(getShardCount());
} catch (DatastoreTimeoutException to) {
    // Datastore is struggling to handle the current load: increase / double the shard count
    addShards(getShardCount());
} catch (ConcurrentModificationException cm) {
    // Datastore is struggling to handle the current load: increase / double the shard count
    addShards(getShardCount());
}
In many cases it can be useful to know the number of rows in a table (a kind) in the datastore on Google App Engine. There is no clear and fast solution, at least I have not found one. Have you?
You can efficiently get a count of all entities of a particular kind (i.e., number of rows in a table) using the Datastore Statistics. Simple example:
from google.appengine.ext.db import stats

kind_stats = stats.KindStat.all().filter("kind_name =", "NameOfYourModel").get()
count = kind_stats.count
You can find a more detailed example of how to get the latest stats here (GAE may keep multiple copies of the stats - one for 5min ago, one for 30min ago, etc.).
Note that these statistics aren't constantly updated so they lag a little behind the actual counts. If you really need the actual count, then you could track counts in your own custom stats table and update it every time you create/delete an entity (though this will be quite a bit more expensive to do).
Update 03-08-2015: Using the Datastore Statistics can lead to stale results. If that's not an option, two other options are keeping your own counter or sharding counters. (You can read more about those here.) Only look at these two if you need real-time results.
There's no concept of "Select count(*)" in App Engine. You'll need to do one of the following:
Do a "keys-only" query (an index traversal) of the entities you want at query time and count them one by one. This has the cost of slow reads; see the sketch after this list.
Update counts at write time. This has the benefit of extremely fast reads at a greater cost per write/update, and you have to know what you want to count ahead of time.
Update all counts asynchronously using Task Queues, cron jobs or the new Mapper API. This has the tradeoff of being semi-fresh.
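A minimal sketch of the first approach (Python, assuming an old-style db model; MyEntity is just an illustrative name):

from google.appengine.ext import db

class MyEntity(db.Model):
    pass

# Keys-only query: cheaper than fetching full entities, but it still walks the
# whole index, so it gets slow as the kind grows.
# count() defaults to a limit of 1000, so pass an explicitly higher limit.
total = MyEntity.all(keys_only=True).count(limit=100000)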
You can count the number of rows in Google App Engine using com.google.appengine.api.datastore.Query as follows:
DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
Query qry = new Query("EmpEntity");
int count = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());