What's the best way to count results in GQL? - google-app-engine

I figure one way to do a count is like this:
foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz')
my_count = foo.count()
What I don't like is my count will be limited to 1000 max and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT Function...

You have to flip your thinking when working with a scalable datastore like GAE to do your calculations up front. In this case that means you need to keep counters for each baz and increment them whenever you add a new bar, instead of counting at the time of display.
class CategoryCounter(db.Model):
category = db.StringProperty()
count = db.IntegerProperty(default=0)
then when creating a Bar object, increment the counter
def createNewBar(category_name):
bar = Bar(...,baz=category_name)
counter = CategoryCounter.filter('category =',category_name).get()
if not counter:
counter = CategoryCounter(category=category_name)
else:
counter.count += 1
bar.put()
counter.put()
db.run_in_transaction(createNewBar,'asdf')
now you have an easy way to get the count for any specific category
CategoryCounter.filter('category =',category_name).get().count

+1 to Jehiah's response.
Official and blessed method on getting object counters on GAE is to build sharded counter. Despite heavily sounding name, this is pretty straightforward.

Count functions in all databases are slow (eg, O(n)) - the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.
This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...

According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:
from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)
Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.

I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?
LIMIT=1000
def count(query):
result = offset = 0
gql_query = db.GqlQuery(query)
while True:
count = gql_query.fetch(LIMIT, offset)
if count < LIMIT:
return result
result += count
offset += LIMIT

orip's solution works with a little tweaking:
LIMIT=1000
def count(query):
result = offset = 0
gql_query = db.GqlQuery(query)
while True:
count = len(gql_query.fetch(LIMIT, offset))
result += count
offset += LIMIT
if count < LIMIT:
return result

We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:
Datastore Statistics

As pointed out by #Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the % of records are NOT changing drastically during any given day.
To query the statistics for a given Kind, you can use the following GQL structure:
select * from __Stat_Kind__ where kind_name = 'Person'
There are a number of properties returned by this which are helpful:
count -- the number of Entities of this Kind
bytes -- total size of all Entities stored of this Kind
timestamp -- an as of date/time for when the stats were last computed
Example Code
To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:
/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
var query = new GqlQuery
{
QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
AllowLiterals = true
};
var result = database.RunQuery(query);
return (long) (result?.Entities?[0]?["count"] ?? 0L);
}

The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.
def MyObj(db.Model):
num = db.IntegerProperty()
When you create a new object, you must manually retrieve the highest key:
max = MyObj.all().order('-num').get()
if max : max = max.num+1
else : max = 0
newObj = MyObj(num = max)
newObj.put()
This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.
Then, when you want to fetch past the 1000th object limit, you simply do:
MyObj.all().filter('num > ' , 2345).fetch(67)
I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504 . It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...

Related

Count of no of record returned, without considering Limit cakePHP 3

I want the no of records available in the database for the current query but without considering the LIMIT.
$this->Orders->find('all')
->where(['order_quantity']=>5)
->LIMIT(5);
Let's consider, I have 50 no of records for this above query. So just want the no of records available for the current query. I can't use 'count()' because of the limit it will always return total no of records available is less than or equal to 5. Is there any solution in cakePHP.
This page in the CakePHP 3 book, explains EXACTLY the answer to your question including how and why it works:
Returning the Total Count of Records
Using a single query object, it is possible to obtain the total number
of rows found for a set of conditions:
$total = $articles->find()->where(['is_active' => true])->count();
The count() method will ignore the limit, offset and page clauses,
thus the following will return the same result:
$total = $articles->find()->where(['is_active' => true])->limit(10)->count();
This is useful when you need to know the total result set size in
advance, without having to construct another Query object. Likewise,
all result formatting and map-reduce routines are ignored when using
the count() method.
Notice the bit about "... will ignore the limit, offset, and page clauses"
So try something like this:
$data = $articles->find()->where(['is_active' => true])->limit(10);
$count = $data->count();
I don't think you are familiar with the use of limit in find. So, I suggest you to study the docs.
The limit in your query means that the query will only display first 5 data even if the query actually has 50 data.
So, in order to get the actual data, you just need to remove the limit and make some changes in your code as follows:
$this->Orders->find('count')->where(['order_quantity' => 5]);

Search entries in Go GAE datastore using partial string as a filter

I have a set of entries in the datastore and I would like to search/retrieve them as user types query. If I have full string it's easy:
q := datastore.NewQuery("Products").Filter("Name =", name).Limit(20)
but I have no idea how to do it with partial string, please help.
q := datastore.NewQuery("Products").Filter("Name >", name).Limit(20)
There is no like operation on app engine but instead you can use '<' and '>'
example:
'moguz' > 'moguzalp'
EDIT: GAH! I just realized that your question is Go-specific. My code below is for Python. Apologies. I'm also familiar with the Go runtime, and I can work on translating to Python to Go later on. However, if the principles described are enough to get you moving in the right direction, let me know and I wont' bother.
Such an operation is not directly supported on the AppEngine datastore, so you'll have to roll your own functionality to meet this need. Here's a quick, off-the-top-of-my-head possible solution:
class StringIndex(db.Model):
matches = db.StringListProperty()
#classmathod
def GetMatchesFor(cls, query):
found_index = cls.get_by_key_name(query[:3])
if found_index is not None:
if query in found_index.matches:
# Since we only query on the first the characters,
# we have to roll through the result set to find all
# of the strings that matach query. We keep the
# list sorted, so this is not hard.
all_matches = []
looking_at = found_index.matches.index(query)
matches_len = len(foundIndex.matches)
while start_at < matches_len and found_index.matches[looking_at].startswith(query):
all_matches.append(found_index.matches[looking_at])
looking_at += 1
return all_matches
return None
#classmethod
def AddMatch(cls, match) {
# We index off of the first 3 characters only
index_key = match[:3]
index = cls.get_or_insert(index_key, list(match))
if match not in index.matches:
# The index entity was not newly created, so
# we will have to add the match and save the entity.
index.matches.append(match).sort()
index.put()
To use this model, you would need to call the AddMatch method every time that you add an entity that would potentially be searched on. In your example, you have a Product model and users will be searching on it's Name. In your Product class, you might have a method AddNewProduct that creates a new entity and puts it into the datastore. You would add to that method StringIndex.AddMatch(new_product_name).
Then, in your request handler that gets called from your AJAXy search box, you would use StringIndex.GetMatchesFor(name) to see all of the stored products that begin with the string in name, and you return those values as JSON or whatever.
What's happening inside the code is that the first three characters of the name are used for the key_name of an entity that contains a list of strings, all of the stored names that begin with those three characters. Using three (as opposed to some other number) is absolutely arbitrary. The correct number for your system is dependent on the amount of data that you are indexing. There is a limit to the number of strings that can be stored in a StringListProperty, but you also want to balance the number of StringIndex entities that are in your datastore. A little bit of math with give you a reasonable number of characters to work with.
If the number of keywords is limited you could consider adding an indexed list property of partial search strings.
Note that you are limited to 5000 indexes per entity, and 1MB for the total entity size.
But you could also wait for Cloud SQL and Full Text Search API to be avaiable for the Go runtime.

Autoincrement ID in App Engine datastore

I'm using App Engine datastore, and would like to make sure that row IDs behave similarly to "auto-increment" fields in mySQL DB.
Tried several generation strategies, but can't seem to take control over what happens:
the IDs are not consecutive, there seem to be several "streams" growing in parallel.
the ids get "recycled" after old rows are deleted
Is such a thing at all possible ?
I really would like to refrain from keeping (indexed) timestamps for each row.
It sounds like you can't rely on IDs being sequential without a fair amount of extra work. However, there is an easy way to achieve what you are trying to do:
We'd like to delete old items (older than two month worth, for
example)
Here is a model that automatically keeps track of both its creation and its modification times. Simply using the auto_now_add and auto_now parameters makes this trivial.
from google.appengine.ext import db
class Document(db.Model):
user = db.UserProperty(required=True)
title = db.StringProperty(default="Untitled")
content = db.TextProperty(default=DEFAULT_DOC)
created = db.DateTimeProperty(auto_now_add=True)
modified = db.DateTimeProperty(auto_now=True)
Then you can use cron jobs or the task queue to schedule your maintenance task of deleting old stuff. Find the oldest stuff is as easy as sorting by created date or modified date:
db.Query(Document).order("modified")
# or
db.Query(Document).order("created")
What I know, is that auto-generated ID's as Long Integers are available in Google App Engine, but there's no guarantee that the value's are increasing and there's also no guarantee that the numbers are real one-increments.
So, if you nee timestamping and increments, add a DateTime field with milliseconds, but then you don't know that the numbers are unique.
So, the best this to do (what we are using) is: (sorry for that, but this is indeed IMHO the best option)
use a autogenerated ID as Long (we use Objectify in Java)
use a timestamp on each entity and use a index to query the entities (use a descending index) to get the top X
I think this is probably a fairly good solution, however be aware that I have not tested it in any way, shape or form. The syntax may even be incorrect!
The principle is to use memcache to generate a monotonic sequence, using the datastore to provide a fall-back if memcache fails.
class IndexEndPoint(db.Model):
index = db.IntegerProperty (indexed = False, default = 0)
def find_next_index (cls):
""" finds the next free index for an entity type """
name = 'seqindex-%s' % ( cls.kind() )
def _from_ds ():
"""A very naive way to find the next free key.
We just take the last known end point and loop untill its free.
"""
tmp_index = IndexEndPoint.get_or_insert (name).index
index = None
while index is None:
key = db.key.from_path (cls.kind(), tmp_index))
free = db.get(key) is None
if free:
index = tmp_index
tmp_index += 1
return index
index = None
while index is None:
index = memcache.incr (index_name)
if index is None: # Our index might have been evicted
index = _from_ds ()
if memcache.add (index_name, index): # if false someone beat us to it
index = None
# ToDo:
# Use a named task to update IndexEndPoint so if the memcache index gets evicted
# we don't have too many items to cycle over to find the end point again.
return index
def make_new (cls):
""" Makes a new entity with an incrementing ID """
result = None
while result is None:
index = find_next_index (cls)
def txn ():
"""Makes a new entity if index is free.
This should only fail if we had a memcache miss
(does not have to be on this instance).
"""
key = db.key.from_path (cls.kind(), index)
if db.get (key) is not None:
return
result = cls (key)
result.put()
return result
result = db.run_in_transaction (txn)

GAE Query counter (+1000)

How to implement this query counter into existing class? The main purpose is i have datastore model of members with more than 3000 records. I just want to count its total record, and found this on app engine cookbook:
def query_counter (q, cursor=None, limit=500):
if cursor:
q.with_cursor (cursor)
count = q.count (limit=limit)
if count == limit:
return count + query_counter (q, q.cursor (), limit=limit)
return count
My existing model is:
class Members(search.SearchableModel):
group = db.ListProperty(db.Key,default=[])
email = db.EmailProperty()
name = db.TextProperty()
gender = db.StringProperty()
Further i want to count members that join certain group with list reference. It might content more than 1000 records also.
Anyone have experience with query.cursor for this purpose?
To find all members you would use it like this:
num_members = query_counter(Members.all())
However, you may find that this runs slowly, because it's making a lot of datastore calls.
A faster way would be to have a separate model class (eg MembersCount), and maintain the count in there (ie add 1 when you create a member, subtract 1 when you delete a member).
If you are creating and deleting members frequently you may need to create a sharded counter in order to get good performance - see here for details:
http://code.google.com/appengine/articles/sharding_counters.html
To count members in a certain group, you could do something like this:
group = ...
num_members = query_counter(Members.all().filter('group =', group.key()))
If you expect to have large numbers of members in a group, you could also do that more efficiently by using a counter model which is sharded by the group.

How does one get a count of rows in a Datastore model in Google App Engine?

I need to get a count of records for a particular model on App Engine. How does one do it?
I bulk uploaded more than 4000 records but modelname.count() only shows me 1000.
You should use Datastore Statistics:
Query query = new Query("__Stat_Kind__");
query.addFilter("kind_name", FilterOperator.EQUAL, kind);
Entity entityStat = datastore.prepare(query).asSingleEntity();
Long totalEntities = (Long) entityStat.getProperty("count");
Please note that the above does not work on the development Datastore but it works in production (when published).
I see that this is an old post, but I'm adding an answer in benefit of others searching for the same thing.
As of release 1.3.6, there is no longer a cap of 1,000 on count queries. Thus you can do the following to get a count beyond 1,000:
count = modelname.all(keys_only=True).count()
This will count all of your entities, which could be rather slow if you have a large number of entities. As a result, you should consider calling count() with some limit specified:
count = modelname.all(keys_only=True).count(some_upper_bound_suitable_for_you)
This is a very old thread, but just in case it helps other people looking at it, there are 3 ways to accomplish this:
Accessing the Datastore statistics
Keeping a counter in the datastore
Sharding counters
Each one of these methods is explained in this link.
count = modelname.all(keys_only=True).count(some_upper_limit)
Just to add on to the earlier post by dar, this 'some_upper_limit' has to be specified. If not, the default count will still be a maximum of 1000.
In GAE a count will always make you page through the results when you have more than 1000 objects. The easiest way to deal with this problem is to add a counter property to your model or to a different counters table and update it every time you create a new object.
I still hit the 1000 limit with count so adapted dar's code (mine's a bit quick and dirty):
class GetCount(webapp.RequestHandler):
def get(self):
query = modelname.all(keys_only=True)
i = 0
while True:
result = query.fetch(1000)
i = i + len(result)
if len(result) < 1000:
break
cursor = query.cursor()
query.with_cursor(cursor)
self.response.out.write('<p>Count: '+str(i)+'</p>')
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Query query = new Query("__Stat_Kind__");
Query.Filter eqf = new Query.FilterPredicate("kind_name",
Query.FilterOperator.EQUAL,
"SomeEntity");
query.setFilter(eqf);
Entity entityStat = ds.prepare(query).asSingleEntity();
Long totalEntities = (Long) entityStat.getProperty("count");
Another solution is using a key only query and get the size of the iterator. The computing time with this solution will rise linearly with the amount of entrys:
Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
KeyFactorykeyFactory = datastore.newKeyFactory().setKind("MyKind");
Query query = Query.newKeyQueryBuilder().setKind("MyKind").build();
int count = Iterators.size(datastore.run(query));

Resources