How do I work this query counter into an existing class? The main purpose: I have a datastore model of members with more than 3,000 records, and I just want to count the total. I found this in the App Engine cookbook:
def query_counter(q, cursor=None, limit=500):
    if cursor:
        q.with_cursor(cursor)
    count = q.count(limit=limit)
    if count == limit:
        return count + query_counter(q, q.cursor(), limit=limit)
    return count
My existing model is:
class Members(search.SearchableModel):
    group = db.ListProperty(db.Key, default=[])
    email = db.EmailProperty()
    name = db.TextProperty()
    gender = db.StringProperty()
Further, I want to count the members that joined a certain group via the list reference; that could also contain more than 1,000 records.
Does anyone have experience using query cursors for this purpose?
To find all members you would use it like this:
num_members = query_counter(Members.all())
However, you may find that this runs slowly, because it's making a lot of datastore calls.
A faster way would be to have a separate model class (e.g. MembersCount) and maintain the count there (i.e. add 1 when you create a member, subtract 1 when you delete one).
If you are creating and deleting members frequently you may need to create a sharded counter in order to get good performance - see here for details:
http://code.google.com/appengine/articles/sharding_counters.html
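For reference, here is a minimal sketch of that sharded pattern, condensed from the article above; the model name MemberCountShard, the shard count, and the helper names are all my own assumptions:

import random
from google.appengine.ext import db

NUM_SHARDS = 20  # assumption: tune this to your write rate

class MemberCountShard(db.Model):  # hypothetical counter model
    count = db.IntegerProperty(default=0)

def increment_member_count():
    # Pick a random shard and increment it in its own transaction,
    # so concurrent writers rarely contend on the same entity.
    shard_name = 'member-count-%d' % random.randint(0, NUM_SHARDS - 1)
    def txn():
        shard = MemberCountShard.get_by_key_name(shard_name)
        if shard is None:
            shard = MemberCountShard(key_name=shard_name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_member_count():
    # Reading is just summing the shards, far cheaper than counting Members.
    return sum(shard.count for shard in MemberCountShard.all())

You would call increment_member_count() whenever you put() a new member, with a matching decrement helper for deletes.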
To count members in a certain group, you could do something like this:
group = ...
num_members = query_counter(Members.all().filter('group =', group.key()))
If you expect to have large numbers of members in a group, you could also do that more efficiently by using a counter model which is sharded by the group.
Related
I'm wondering if I should have a kind used only for counting entities.
For example, there is a model like the following:
class Message(db.Model):
    title = db.StringProperty()
    message = db.StringProperty()
    created_on = db.DateTimeProperty()
    created_by = db.ReferenceProperty(User)
    category = db.StringProperty()
And there are 100,000,000 entities made from this model.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
    category = db.StringProperty()
    look_message = db.ReferenceProperty(Message)
Does this small model make it faster to count?
And does it use less memory?
By the way, I'm thinking of counting them like this:
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100,000,000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce operation to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat.all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think you have to create another entity like this, one that just counts the number of messages per category.
Change your Category model to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the Message class, change the category property to reference the Category class:
category = db.ReferenceProperty(Category)
When you create a new Message object you have to update the counter: increment it when you create a message and decrement it when you delete one.
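A minimal sketch of that bookkeeping, assuming the Message and Category models above (the helper name is mine; note the two entities live in different entity groups, so only the counter update runs inside the transaction):

from google.appengine.ext import db

def create_message(category_key, title, body):  # hypothetical helper
    def txn():
        # Increment the per-category counter atomically.
        category = db.get(category_key)
        category.totalOfMessages += 1
        category.put()
    db.run_in_transaction(txn)
    # The Message lives in a different entity group, so it is put outside
    # the transaction; a task queue task would make this more robust.
    Message(title=title, message=body, category=category_key).put()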
The best way to work with counters on GAE is to use sharded counters.
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method like a Sharded Counter, MapReduce or Materialized View/Fork Join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
I'm using App Engine datastore, and would like to make sure that row IDs behave similarly to "auto-increment" fields in mySQL DB.
I've tried several generation strategies, but I can't seem to take control over what happens:
the IDs are not consecutive; there seem to be several "streams" growing in parallel
the IDs get "recycled" after old rows are deleted
Is such a thing possible at all?
I really would like to refrain from keeping (indexed) timestamps for each row.
It sounds like you can't rely on IDs being sequential without a fair amount of extra work. However, there is an easy way to achieve what you are trying to do:
We'd like to delete old items (older than two months' worth, for example)
Here is a model that automatically keeps track of both its creation and its modification times. Simply using the auto_now_add and auto_now parameters makes this trivial.
from google.appengine.ext import db

class Document(db.Model):
    user = db.UserProperty(required=True)
    title = db.StringProperty(default="Untitled")
    content = db.TextProperty(default=DEFAULT_DOC)  # DEFAULT_DOC is defined elsewhere in the app
    created = db.DateTimeProperty(auto_now_add=True)
    modified = db.DateTimeProperty(auto_now=True)
Then you can use cron jobs or the task queue to schedule your maintenance task of deleting old documents. Finding the oldest ones is as easy as sorting by created or modified date:
db.Query(Document).order("modified")
# or
db.Query(Document).order("created")
What I know is that auto-generated IDs are available as long integers in Google App Engine, but there is no guarantee that the values are increasing, and no guarantee that they are real one-step increments.
So if you need timestamping and increments, add a DateTime field with millisecond precision, but then you don't know that the values are unique.
So the best thing to do (and what we are using) is the following (sorry, but this is indeed IMHO the best option):
use an auto-generated ID as a Long (we use Objectify in Java)
use a timestamp on each entity and a descending index on it to query the top X entities
I think this is probably a fairly good solution, however be aware that I have not tested it in any way, shape or form. The syntax may even be incorrect!
The principle is to use memcache to generate a monotonic sequence, using the datastore to provide a fall-back if memcache fails.
from google.appengine.api import memcache
from google.appengine.ext import db

class IndexEndPoint(db.Model):
    index = db.IntegerProperty(indexed=False, default=0)

def find_next_index(cls):
    """Finds the next free index for an entity type."""
    name = 'seqindex-%s' % cls.kind()

    def _from_ds():
        """A very naive way to find the next free key.

        We just take the last known end point and loop until it's free.
        """
        # Datastore ids must be positive, so start at 1 at the earliest.
        tmp_index = max(IndexEndPoint.get_or_insert(name).index, 1)
        index = None
        while index is None:
            key = db.Key.from_path(cls.kind(), tmp_index)
            if db.get(key) is None:
                index = tmp_index
            tmp_index += 1
        return index

    index = None
    while index is None:
        index = memcache.incr(name)
        if index is None:  # our counter might have been evicted
            index = _from_ds()
            if not memcache.add(name, index):  # someone beat us to it
                index = None  # retry via incr
    # TODO: use a named task to update IndexEndPoint so that if the memcache
    # counter gets evicted we don't have too many items to cycle over to find
    # the end point again.
    return index

def make_new(cls):
    """Makes a new entity with an incrementing ID."""
    result = None
    while result is None:
        index = find_next_index(cls)

        def txn():
            """Makes a new entity if the index is free.

            This should only fail if we had a memcache miss
            (not necessarily on this instance).
            """
            key = db.Key.from_path(cls.kind(), index)
            if db.get(key) is not None:
                return None
            entity = cls(key=key)
            entity.put()
            return entity

        result = db.run_in_transaction(txn)
    return result
Here is a simplified version of my datastore structure:
class News(db.Model):
    title = db.StringProperty()

class NewsRating(db.Model):
    user = db.IntegerProperty()
    rating = db.IntegerProperty()
    news = db.ReferenceProperty(News)
Now I need to display all news sorted by their total rating (sum of different users ratings). How can I do that in the following code:
news = News.all()
# filter by additional params
# news.filter("city =", "1")
news.order("-added")  # ?
for one_news in news:
    self.response.out.write(one_news.title + '<br>')
Queries only have access to the entity you're querying against; if you have a property from another entity (or some aggregate calculation based on fields from other entities) that you want to use to order results, you're going to need to store it in the entity you're querying against.
In the case of ratings, that might mean a periodic task that sums up ratings and distributes them to articles.
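A sketch of that periodic task, assuming News gains a total_rating = db.IntegerProperty(default=0) property (the property and function names are mine):

from google.appengine.ext import db

def update_news_rating(news):
    # Recompute the denormalized rating sum for one News entity;
    # run this from a cron job or a task queue task.
    total = 0
    for rating in NewsRating.all().filter('news =', news):
        total += rating.rating
    news.total_rating = total
    news.put()

With that in place, News.all().order('-total_rating') returns articles sorted by total rating.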
To do that you would need to run a query fetching every single NewsRating that references your News entity and sum all the ratings (the datastore does not provide JOINs). This would be a huge task, both time- and cost-wise. I'd recommend taking a look at the just-overheard-it example as a reference point.
I need to get a count of records for a particular model on App Engine. How does one do it?
I bulk-uploaded more than 4,000 records, but modelname.count() only shows me 1000.
You should use Datastore Statistics:
Query query = new Query("__Stat_Kind__");
query.addFilter("kind_name", FilterOperator.EQUAL, kind);
Entity entityStat = datastore.prepare(query).asSingleEntity();
Long totalEntities = (Long) entityStat.getProperty("count");
Please note that the above does not work on the development Datastore but it works in production (when published).
I see that this is an old post, but I'm adding an answer in benefit of others searching for the same thing.
As of release 1.3.6, there is no longer a cap of 1,000 on count queries. Thus you can do the following to get a count beyond 1,000:
count = modelname.all(keys_only=True).count()
This will count all of your entities, which could be rather slow if you have a large number of entities. As a result, you should consider calling count() with some limit specified:
count = modelname.all(keys_only=True).count(some_upper_bound_suitable_for_you)
This is a very old thread, but just in case it helps other people looking at it, there are 3 ways to accomplish this:
Accessing the Datastore statistics
Keeping a counter in the datastore
Sharding counters
Each one of these methods is explained in this link.
count = modelname.all(keys_only=True).count(some_upper_limit)
Just to add on to dar's earlier post: this some_upper_limit has to be specified; if not, the count will still default to a maximum of 1000.
In GAE a count will always make you page through the results when you have more than 1000 objects. The easiest way to deal with this problem is to add a counter property to your model or to a different counters table and update it every time you create a new object.
I still hit the 1000 limit with count, so I adapted dar's code (mine's a bit quick and dirty):
class GetCount(webapp.RequestHandler):
    def get(self):
        query = modelname.all(keys_only=True)
        i = 0
        while True:
            result = query.fetch(1000)
            i = i + len(result)
            if len(result) < 1000:
                break
            cursor = query.cursor()
            query.with_cursor(cursor)
        self.response.out.write('<p>Count: ' + str(i) + '</p>')
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Query query = new Query("__Stat_Kind__");
Query.Filter eqf = new Query.FilterPredicate("kind_name",
Query.FilterOperator.EQUAL,
"SomeEntity");
query.setFilter(eqf);
Entity entityStat = ds.prepare(query).asSingleEntity();
Long totalEntities = (Long) entityStat.getProperty("count");
Another solution is to use a keys-only query and take the size of the iterator. The computation time with this solution rises linearly with the number of entries:
Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
KeyFactory keyFactory = datastore.newKeyFactory().setKind("MyKind");
Query<Key> query = Query.newKeyQueryBuilder().setKind("MyKind").build();
int count = Iterators.size(datastore.run(query));
I figure one way to do a count is like this:
foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz'")
my_count = foo.count()
What I don't like is my count will be limited to 1000 max and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT Function...
You have to flip your thinking when working with a scalable datastore like GAE to do your calculations up front. In this case that means you need to keep counters for each baz and increment them whenever you add a new bar, instead of counting at the time of display.
class CategoryCounter(db.Model):
    category = db.StringProperty()
    count = db.IntegerProperty(default=0)
then when creating a Bar object, increment the counter
def createNewBar(category_name):
    bar = Bar(..., baz=category_name)
    counter = CategoryCounter.all().filter('category =', category_name).get()
    if not counter:
        counter = CategoryCounter(category=category_name, count=1)
    else:
        counter.count += 1
    bar.put()
    counter.put()

db.run_in_transaction(createNewBar, 'asdf')
Now you have an easy way to get the count for any specific category:
CategoryCounter.all().filter('category =', category_name).get().count
+1 to Jehiah's response.
The official and blessed method for keeping object counters on GAE is to build a sharded counter. Despite the heavy-sounding name, this is pretty straightforward.
Count functions in all databases are slow (e.g., O(n)); the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.
This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...
According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:
from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)
Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.
I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = gql_query.fetch(LIMIT, offset)
        if count < LIMIT:
            return result
        result += count
        offset += LIMIT
orip's solution works with a little tweaking:
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = len(gql_query.fetch(LIMIT, offset))
        result += count
        offset += LIMIT
        if count < LIMIT:
            return result
We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:
Datastore Statistics
As pointed out by @Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the percentage of records is NOT changing drastically during any given day.
To query the statistics for a given Kind, you can use the following GQL structure:
select * from __Stat_Kind__ where kind_name = 'Person'
There are a number of properties returned by this which are helpful:
count -- the number of Entities of this Kind
bytes -- total size of all Entities stored of this Kind
timestamp -- an as of date/time for when the stats were last computed
Example Code
To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:
/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
var query = new GqlQuery
{
QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
AllowLiterals = true
};
var result = database.RunQuery(query);
return (long) (result?.Entities?[0]?["count"] ?? 0L);
}
The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.
class MyObj(db.Model):
    num = db.IntegerProperty()
When you create a new object, you must manually retrieve the highest num:
last = MyObj.all().order('-num').get()  # avoid shadowing the built-in max
num = last.num + 1 if last else 0
new_obj = MyObj(num=num)
new_obj.put()
This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.
Then, when you want to fetch past the 1000th-object limit, you simply do:
MyObj.all().filter('num >', 2345).fetch(67)
I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504 . It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...