Datastore in Firestore mode - a distributed counter that can scale its shards up based on traffic - google-app-engine

In Datastore in Firestore mode, the recommended way to store a counter with a high write rate (such as profile views on a website) is to use sharded/distributed counters.
The problem I have is that with distributed counters you need to pick how many shards you want to have. This is addressed here as well. For example, some profiles may get a lot more views per second than others (one profile may belong to a famous person while another belongs to a regular person), and therefore need more shards.
Is there a way to write a distributed counter that can scale its shards up if the page is getting a lot of views per second?
I was thinking of detecting a datastore contention error and then adding more shards if that happens.
I noticed there is a new extension for Cloud Firestore that seems to do what I am asking for. However, I am not using Cloud Firestore, I am using Datastore in Firestore mode - similar under the hood but still different.

The original Datastore distributed counters example:
import random

from google.appengine.ext import ndb

NUM_SHARDS = 20


class SimpleCounterShard(ndb.Model):
    """Shards for the counter"""
    count = ndb.IntegerProperty(default=0)


def get_count():
    """Retrieve the value for a given sharded counter.

    Returns:
        Integer; the cumulative count of all sharded counters.
    """
    total = 0
    for counter in SimpleCounterShard.query():
        total += counter.count
    return total


@ndb.transactional
def increment():
    """Increment the value for a given sharded counter."""
    shard_string_index = str(random.randint(0, NUM_SHARDS - 1))
    counter = SimpleCounterShard.get_by_id(shard_string_index)
    if counter is None:
        counter = SimpleCounterShard(id=shard_string_index)
    counter.count += 1
    counter.put()
This uses a fixed number of shards, but the Firestore example uses a separate entity to keep track of the number of shards. So you can update the code above with something like:
from google.appengine.api import datastore_errors


class RootCounter(ndb.Model):
    count = ndb.IntegerProperty(default=0)
    num_shards = ndb.IntegerProperty(default=0)

    def get_count(self):
        if self.num_shards > 0:
            return sum(shard.count
                       for shard in SimpleCounterShard.query(ancestor=self.key))
        return self.count

    def increment(self):
        try:
            self._increment()
        except datastore_errors.TransactionFailedError:
            # Contention detected: add another shard and retry.
            self.num_shards += 1
            self.put()
            self.increment()

    @ndb.transactional(retries=1)
    def _increment(self):
        if self.num_shards > 0:
            shard_id = str(random.randint(0, self.num_shards - 1))
            shard = SimpleCounterShard.get_by_id(shard_id, parent=self.key)
            if shard is None:
                shard = SimpleCounterShard(id=shard_id, parent=self.key)
            shard.count += 1
            shard.put()
        else:
            self.count += 1
            self.put()
The important difference now that Firestore in Datastore mode has been released is that it is strongly consistent, and you are likely not relying on entity groups for consistency. Thus a query will give an exact answer, and the sharded counters can fit nicely in the hierarchy under the root counter.
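For context, a minimal usage sketch of the classes above (the key name 'profile-123' is just an assumed example; each profile would get its own root counter):
# One RootCounter per profile, created lazily on first use.
root = RootCounter.get_or_insert('profile-123')
root.increment()
total = root.get_count()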

Related

ndb Models are not saved in memcache when using MapReduce

I've created two MapReduce Pipelines for uploading CSV files to create Categories and Products in bulk. Each product gets tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation, I would think they'd be automatically cached in Memcache when retrieved from the Datastore.
I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.
However, it doesn't seem like the Product upload is using Memcache to get the Categories. When I check the Memcache viewer in the portal, it says something along the lines of the hit count being around 180 and the miss count around 60. If I was uploading 3000 products and retrieving the category each time, shouldn't I have around 3000 hits + misses from fetching the category (ie, Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (algorithm handles both entity creation and updates).
Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:
import csv


def product_bulk_import_map(data):
    """Product Bulk Import map function."""
    result = {"status": "CREATED"}
    product_data = data
    try:
        # parse input parameter tuple
        byteoffset, line_data = data

        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data

        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)

        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id=p_id,
                category=category.key,
                product_type=p_type,
                description=p_description
            )
        product.put()
        result["entity"] = product.to_dict()
    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)

    # return results
    yield (str(product_data), result)
MapReduce intentionally disables memcache for NDB.
See mapreduce/util.py ln 373, _set_ndb_cache_policy() (as of 2015-05-01):
def _set_ndb_cache_policy():
    """Tell NDB to never cache anything in memcache or in-process.

    This ensures that entities fetched from Datastore input_readers via NDB
    will not bloat up the request memory size and Datastore Puts will avoid
    doing calls to memcache. Without this you get soft memory limit exits,
    which hurts overall throughput.
    """
    ndb_ctx = ndb.get_context()
    ndb_ctx.set_cache_policy(lambda key: False)
    ndb_ctx.set_memcache_policy(lambda key: False)
You can force get_by_id() and put() to use memcache, eg:
product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)
Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However I don't know enough to say whether this has other undesired effects:
ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)
As for the docstring about "soft memory limit exits", I don't understand why that would occur if only memcache was enabled (ie. no in-context cache).
It actually seems like you want memcache to be enabled for puts, otherwise your app ends up reading stale data from NDB's memcache after your mapper has modified the data underneath.
As Slawek Rewaj already mentioned, this is caused by the in-context cache. When retrieving an entity, NDB tries the in-context cache first, then memcache, and finally it retrieves the entity from the Datastore if it wasn't found in either the in-context cache or memcache. The in-context cache is just a Python dictionary and its lifetime and visibility are limited to the current request, but MapReduce does multiple calls to product_bulk_import_map() within a single request.
You can find more information about the in-context cache here: https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
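If stale in-process copies across those repeated calls are the concern, one hedged option (a sketch, not something the MapReduce library prescribes) is to clear the in-context cache at the top of the map function:
from google.appengine.ext import ndb

# At the start of product_bulk_import_map(): drop entities cached in-process
# by earlier calls in this request, so subsequent gets fall through to
# memcache / the Datastore.
ndb.get_context().clear_cache()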

How many Datastore reads consume each Fetch, Count and Query operations?

I'm reading on the Google App Engine groups about many users (Fig1, Fig2, Fig3) who can't figure out where the high number of Datastore reads in their billing reports comes from.
As you might know, Datastore reads are capped at 50K operations/day; above this free quota you have to pay.
50K operations sounds like a lot, but unfortunately it seems that each operation (Query, Entity fetch, Count, ...) hides several Datastore reads.
Is it possible to know, via an API or some other approach, how many Datastore reads are hidden behind the common RPC.get and RPC.runquery calls?
Appstats seems useless in this case because it gives just the RPC details and not the hidden reads cost.
Having a simple Model like this:
class Example(db.Model):
    foo = db.StringProperty()
    bars = db.ListProperty(str)
and 1,000 entities in the datastore, I'm interested in the cost of these kinds of operations:
items_count = Example.all(keys_only = True).filter('bars=','spam').count()
items_count = Example.all().count(10000)
items = Example.all().fetch(10000)
items = Example.all().filter('bars=','spam').filter('bars=','fu').fetch(10000)
items = Example.all().fetch(10000, offset=500)
items = Example.all().filter('foo>=', filtr).filter('foo<', filtr+ u'\ufffd')
See http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost .
A query costs you 1 read plus 1 read for each entity returned. "Returned" includes entities skipped by offset or count.
So that is 1001 reads for each of these:
Example.all(keys_only = True).filter('bars=','spam').count()
Example.all().count(1000)
Example.all().fetch(1000)
Example.all().fetch(1000, offset=500)
For these, the number of reads charged is 1 plus the number of entities that match the filters:
Example.all().filter('bars=','spam').filter('bars=','fu').fetch()
Example.all().filter('foo>=', filtr).filter('foo<', filtr+ u'\ufffd').fetch()
Instead of using count you should consider storing the count in the datastore, sharded if you need to update the count more than once a second. http://code.google.com/appengine/articles/sharding_counters.html
Whenever possible you should use cursors instead of an offset.
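For example, a minimal cursor-paging sketch with the db API (the model is taken from the question; the page size of 100 is assumed), which avoids being charged reads for entities skipped by an offset:
query = Example.all()
page = query.fetch(100)       # 1 read + 1 read per entity returned
cursor = query.cursor()       # bookmark for the next page

next_page = Example.all().with_cursor(cursor).fetch(100)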
Just to make sure, I'm almost certain about this one:
Example.all().count(10000)
This uses small datastore operations (no need to fetch the entities, only keys), so it would count as 1 read + 10,000 (max) small operations.

Google App Engine GQL query question

In theory, I want to have 100,000 entities in my User model.
I'd like to implement an ID property which increments with each new entity being put.
The goal is so that I can do queries like "get users with IDs from 200 to 300". Kind of like separating the 100,000 entities into 1,000 readable pages of 100 entities each.
I heard that the ID property from App Engine does not guarantee that it goes up incrementally.
So how do I implement my own incrementing ID?
One of my ideas is to use memcache. Add a counter in memcache that increases each time a new entity is inserted.
class User(db.Model):
    nickname = db.StringProperty()
    my_id = db.IntegerProperty()

# Steps to add a new user entity:

# Step 1: check memcache
counter = memcache.get("global_counter")

# Step 2: increment counter
counter = counter + 1

# Step 3: add user entity
User(nickname="tommy", my_id=counter).put()

# Step 4: replace incremented counter
memcache.replace(key="global_counter", value=counter)

# todo: create a cron job which periodically takes the memcached global_counter and
# stores it into the datastore as a backup (in case memcache gets flushed)
what do you guys think?
additional question: if 10 users register at the same time, will it mess up the memcache counter?
You don't need to implement your own auto-incrementing counter to achieve the pagination you're looking for - look at "Paging without a property"
In short, the key is guaranteed to be returned in a deterministic order, and you can use inequality operators (>=, <=) to return results starting from a particular key.
To get your first hundred users:
users = User.all().order("__key__").fetch(101)
The first 100 results are what you iterate over; the 101st result, if it is returned, is used as a bookmark in the next query - just add a .filter('__key__ >=', bookmark) to the above query, and you'll get the next page of 100 results (plus the next bookmark).
The memcache counter can go away.
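A minimal sketch of that bookmark pattern as a helper function (the function name and page size are assumptions, not part of the original answer):
PAGE_SIZE = 100

def get_user_page(bookmark=None):
    query = User.all().order('__key__')
    if bookmark is not None:
        query.filter('__key__ >=', bookmark)
    results = query.fetch(PAGE_SIZE + 1)
    next_bookmark = None
    if len(results) > PAGE_SIZE:
        next_bookmark = results[PAGE_SIZE].key()
    return results[:PAGE_SIZE], next_bookmark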
Counters are difficult to do in a distributed system. It is often easiest to re-think the application so that it can do without.
How about using a timestamp instead? You can sort by (and page on) that as well; it will generate the same ordering as a counter.
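A hedged sketch of that idea (the created property and the paging snippet are assumptions, not part of the original answer):
class User(db.Model):
    nickname = db.StringProperty()
    created = db.DateTimeProperty(auto_now_add=True)  # set once, at insert time

# Page through users in insertion order, 100 at a time. Note that entities
# sharing an identical timestamp could be skipped by the '>' filter.
page = User.all().order('created').fetch(100)
last_seen = page[-1].created
next_page = User.all().order('created').filter('created >', last_seen).fetch(100)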

How does one get a count of rows in a Datastore model in Google App Engine?

I need to get a count of records for a particular model on App Engine. How does one do it?
I bulk uploaded more than 4000 records but modelname.count() only shows me 1000.
You should use Datastore Statistics:
Query query = new Query("__Stat_Kind__");
query.addFilter("kind_name", FilterOperator.EQUAL, kind);
Entity entityStat = datastore.prepare(query).asSingleEntity();
Long totalEntities = (Long) entityStat.getProperty("count");
Please note that the above does not work on the development Datastore but it works in production (when published).
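For a Python app (the question's modelname.count() suggests Python), a hedged equivalent using the db stats models would look something like this (the kind name 'modelname' is assumed):
from google.appengine.ext.db import stats

# __Stat_Kind__ statistics for one kind; the same caveat applies: these are
# only populated in production, not on the development Datastore.
kind_stat = stats.KindStat.all().filter('kind_name =', 'modelname').get()
if kind_stat is not None:
    total_entities = kind_stat.count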
I see that this is an old post, but I'm adding an answer in benefit of others searching for the same thing.
As of release 1.3.6, there is no longer a cap of 1,000 on count queries. Thus you can do the following to get a count beyond 1,000:
count = modelname.all(keys_only=True).count()
This will count all of your entities, which could be rather slow if you have a large number of entities. As a result, you should consider calling count() with some limit specified:
count = modelname.all(keys_only=True).count(some_upper_bound_suitable_for_you)
This is a very old thread, but just in case it helps other people looking at it, there are 3 ways to accomplish this:
Accessing the Datastore statistics
Keeping a counter in the datastore
Sharding counters
Each one of these methods is explained in this link.
count = modelname.all(keys_only=True).count(some_upper_limit)
Just to add on to the earlier post by dar, this 'some_upper_limit' has to be specified. If not, the default count will still be a maximum of 1000.
In GAE a count will always make you page through the results when you have more than 1000 objects. The easiest way to deal with this problem is to add a counter property to your model or to a different counters table and update it every time you create a new object.
I still hit the 1000 limit with count so adapted dar's code (mine's a bit quick and dirty):
class GetCount(webapp.RequestHandler):
    def get(self):
        query = modelname.all(keys_only=True)
        i = 0
        while True:
            result = query.fetch(1000)
            i = i + len(result)
            if len(result) < 1000:
                break
            cursor = query.cursor()
            query.with_cursor(cursor)
        self.response.out.write('<p>Count: ' + str(i) + '</p>')
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Query query = new Query("__Stat_Kind__");
Query.Filter eqf = new Query.FilterPredicate("kind_name",
        Query.FilterOperator.EQUAL,
        "SomeEntity");
query.setFilter(eqf);
Entity entityStat = ds.prepare(query).asSingleEntity();
Long totalEntities = (Long) entityStat.getProperty("count");
Another solution is to use a keys-only query and get the size of the iterator. The computing time of this solution rises linearly with the number of entries:
Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
KeyFactory keyFactory = datastore.newKeyFactory().setKind("MyKind");
Query<Key> query = Query.newKeyQueryBuilder().setKind("MyKind").build();
int count = Iterators.size(datastore.run(query));

What's the best way to count results in GQL?

I figure one way to do a count is like this:
foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz'")
my_count = foo.count()
What I don't like is my count will be limited to 1000 max and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT Function...
You have to flip your thinking when working with a scalable datastore like GAE to do your calculations up front. In this case that means you need to keep counters for each baz and increment them whenever you add a new bar, instead of counting at the time of display.
class CategoryCounter(db.Model):
    category = db.StringProperty()
    count = db.IntegerProperty(default=0)
then when creating a Bar object, increment the counter
def createNewBar(category_name):
    bar = Bar(..., baz=category_name)
    counter = CategoryCounter.all().filter('category =', category_name).get()
    if not counter:
        counter = CategoryCounter(category=category_name)
    counter.count += 1
    bar.put()
    counter.put()
db.run_in_transaction(createNewBar,'asdf')
now you have an easy way to get the count for any specific category
CategoryCounter.all().filter('category =', category_name).get().count
+1 to Jehiah's response.
The official and blessed method for getting object counters on GAE is to build a sharded counter. Despite the heavy-sounding name, this is pretty straightforward.
Count functions in all databases are slow (eg, O(n)) - the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.
This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...
According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:
from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)
Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.
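A hedged sketch of that one-off initialization, reusing the CategoryCounter model from the earlier answer (the function name and limit value are assumptions; a keys-only count keeps this to small operations):
def init_category_counter(category_name):
    # One small op per existing Bar key, as noted above.
    n = Bar.all(keys_only=True).filter('baz =', category_name).count(limit=1000000)
    counter = CategoryCounter.all().filter('category =', category_name).get()
    if not counter:
        counter = CategoryCounter(category=category_name)
    counter.count = n
    counter.put()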
I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = gql_query.fetch(LIMIT, offset)
        if count < LIMIT:
            return result
        result += count
        offset += LIMIT
orip's solution works with a little tweaking:
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = len(gql_query.fetch(LIMIT, offset))
        result += count
        offset += LIMIT
        if count < LIMIT:
            return result
We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:
Datastore Statistics
As pointed out by @Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the % of records is NOT changing drastically during any given day.
To query the statistics for a given Kind, you can use the following GQL structure:
select * from __Stat_Kind__ where kind_name = 'Person'
There are a number of properties returned by this which are helpful:
count -- the number of Entities of this Kind
bytes -- total size of all Entities stored of this Kind
timestamp -- an as of date/time for when the stats were last computed
Example Code
To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:
/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
    var query = new GqlQuery
    {
        QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
        AllowLiterals = true
    };
    var result = database.RunQuery(query);
    return (long) (result?.Entities?[0]?["count"] ?? 0L);
}
The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.
class MyObj(db.Model):
    num = db.IntegerProperty()
When you create a new object, you must manually retrieve the highest num:
max = MyObj.all().order('-num').get()
if max : max = max.num+1
else : max = 0
newObj = MyObj(num = max)
newObj.put()
This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.
Then, when you want to fetch past the 1000th object limit, you simply do:
MyObj.all().filter('num > ' , 2345).fetch(67)
I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504 . It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...
