In theory, I want to have 100,000 entities in my User model.
I'd like to implement an ID property which increments with each new entity being put.
The goal is to be able to do queries like "get users with IDs from 200 to 300" - kind of like separating the 100,000 entities into 1,000 readable pages of 100 entities each.
I heard that the ID property App Engine assigns automatically is not guaranteed to increase monotonically.
So how do I implement my own incrementing ID?
One of my ideas is to use memcache: keep a counter in memcache that is incremented each time a new entity is inserted.
from google.appengine.api import memcache
from google.appengine.ext import db

class User(db.Model):
    nickname = db.StringProperty()
    my_id = db.IntegerProperty()

# Steps to add a new user entity:
# Step 1: check memcache
counter = memcache.get("global_counter")
# Step 2: increment counter
counter = counter + 1
# Step 3: add user entity
User(nickname="tommy", my_id=counter).put()
# Step 4: replace incremented counter
memcache.replace(key="global_counter", value=counter)
# TODO: create a cron job which periodically takes the memcached global_counter and
# stores it in the datastore as a backup (in case memcache gets flushed)
what do you guys think?
additional question: if 10 users register at the same time, will it mess up the memcache counter?
You don't need to implement your own auto-incrementing counter to achieve the pagination you're looking for - look at "Paging without a property"
In short, keys are guaranteed to be returned in a deterministic order, and you can use inequality operators (>=, <=) on __key__ to return results starting from a particular key.
To get your first hundred users:
users = User.all().order("__key__").fetch(101)
The first 100 results are what you iterate over; the 101st result, if it is returned, is used as a bookmark for the next query - just add a .filter('__key__ >=', bookmark) to the above query and you'll get the next 100 results (plus the bookmark for the page after that).
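For example, a minimal sketch of that bookmark pattern (the fetch_page helper and the page size are assumptions for illustration):

PAGE_SIZE = 100

def fetch_page(bookmark=None):
    """Returns (users_for_this_page, bookmark_for_the_next_page)."""
    q = User.all().order('__key__')
    if bookmark is not None:
        q.filter('__key__ >=', bookmark)
    results = q.fetch(PAGE_SIZE + 1)
    next_bookmark = None
    if len(results) == PAGE_SIZE + 1:
        next_bookmark = results[-1].key()
        results = results[:PAGE_SIZE]
    return results, next_bookmark

# First page, then the one after it:
users, bookmark = fetch_page()
users, bookmark = fetch_page(bookmark)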
The memcache counter can go away.
Counters are difficult to do in a distributed system. It is often easiest to re-think the application so that it can do without.
How about using a timestamp instead? You can sort by (and page on) that as well; it will give you the same kind of ordering a counter would.
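A minimal sketch of that approach, assuming a created field is added to the User model from the question (names are illustrative):

class User(db.Model):
    nickname = db.StringProperty()
    my_id = db.IntegerProperty()
    # Stamped automatically when the entity is first put.
    created = db.DateTimeProperty(auto_now_add=True)

# Page through users in creation order, 100 at a time.
page = User.all().order('created').fetch(100)
last_seen = page[-1].created
# Note: entities sharing the exact same timestamp could be skipped this way.
next_page = User.all().order('created').filter('created >', last_seen).fetch(100)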
Related
I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
title = db.StringProperty()
message = db.StringProperty()
created_on = db.DateTimeProperty()
created_by = db.ReferenceProperty(User)
category = db.StringProperty()
And there are 100,000,000 entities made from this model.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
category = db.StringProperty()
look_message = db.ReferenceProperty(Message)
Does this small model make it faster to count?
And does it use less memory?
By the way, I'm thinking of counting them like this:
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100,000,000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce job to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat.all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think you have to create another entity, like this.
This entity will just keep the count of messages per category.
Just change your Category model to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the Message class, change the category property to reference the Category class instead:
category = db.ReferenceProperty(Category)
When you create a new Message object, you have to update this counter: increment it when you create a message and decrement it when you delete one.
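A rough, untested sketch of the increment side (keying each Category by its name and the add_message helper are assumptions; for high write rates a sharded counter, mentioned below, avoids contention on the single Category entity):

from google.appengine.ext import db

def add_message(title, body, category_name):
    # One Category entity per name, keyed by the name itself.
    category = Category.get_or_insert(category_name, category=category_name)

    def txn():
        cat = Category.get(category.key())
        cat.totalOfMessages += 1
        cat.put()

    db.run_in_transaction(txn)
    Message(title=title, message=body, category=category).put()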
The best way to work with counters on GAE is to use sharded counters.
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method, such as a sharded counter, MapReduce, or a materialized view / fork-join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
I have a set of entries in the datastore and I would like to search/retrieve them as the user types a query. If I have the full string it's easy:
q := datastore.NewQuery("Products").Filter("Name =", name).Limit(20)
but I have no idea how to do it with a partial string; please help.
q := datastore.NewQuery("Products").Filter("Name >", name).Limit(20)
There is no LIKE operation on App Engine, but you can use '<' and '>' instead.
example:
'moguzalp' > 'moguz' (any name starting with "moguz" sorts after "moguz" itself)
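The usual way to turn that into a prefix search is a half-open range on the property. The answer below is Python anyway, so here is the idea as a Python sketch (assuming a Product model on the Python side; a Go version is just two Filter calls with the same bounds). The u'\ufffd' upper bound is the conventional "highest character" trick:

def products_with_name_prefix(prefix, limit=20):
    # Matches every Name in the range [prefix, prefix + u'\ufffd').
    return (Product.all()
            .filter('Name >=', prefix)
            .filter('Name <', prefix + u'\ufffd')
            .fetch(limit))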
EDIT: GAH! I just realized that your question is Go-specific. My code below is for Python. Apologies. I'm also familiar with the Go runtime, and I can work on translating the Python to Go later on. However, if the principles described are enough to get you moving in the right direction, let me know and I won't bother.
Such an operation is not directly supported on the AppEngine datastore, so you'll have to roll your own functionality to meet this need. Here's a quick, off-the-top-of-my-head possible solution:
class StringIndex(db.Model):
    matches = db.StringListProperty()

    @classmethod
    def GetMatchesFor(cls, query):
        found_index = cls.get_by_key_name(query[:3])
        if found_index is not None:
            if query in found_index.matches:
                # Since we only key on the first three characters,
                # we have to roll through the result set to find all
                # of the strings that match the query. We keep the
                # list sorted, so this is not hard.
                all_matches = []
                looking_at = found_index.matches.index(query)
                matches_len = len(found_index.matches)
                while (looking_at < matches_len and
                       found_index.matches[looking_at].startswith(query)):
                    all_matches.append(found_index.matches[looking_at])
                    looking_at += 1
                return all_matches
        return None

    @classmethod
    def AddMatch(cls, match):
        # We index off of the first 3 characters only.
        index_key = match[:3]
        index = cls.get_or_insert(index_key, matches=[match])
        if match not in index.matches:
            # The index entity already existed without this match, so
            # we have to add the match and save the entity.
            index.matches.append(match)
            index.matches.sort()
            index.put()
To use this model, you would need to call the AddMatch method every time that you add an entity that could potentially be searched on. In your example, you have a Product model and users will be searching on its Name. In your Product class, you might have a method AddNewProduct that creates a new entity and puts it into the datastore; you would add a call to StringIndex.AddMatch(new_product_name) to that method.
Then, in your request handler that gets called from your AJAXy search box, you would use StringIndex.GetMatchesFor(name) to see all of the stored products that begin with the string in name, and you return those values as JSON or whatever.
What's happening inside the code is that the first three characters of the name are used for the key_name of an entity that contains a list of strings: all of the stored names that begin with those three characters. Using three (as opposed to some other number) is completely arbitrary. The correct number for your system depends on the amount of data that you are indexing. There is a limit to the number of strings that can be stored in a StringListProperty, but you also want to keep the number of StringIndex entities in your datastore in balance. A little bit of math will give you a reasonable number of characters to work with.
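Putting it together, usage might look roughly like this (the Product model, AddNewProduct, and the handler body are assumptions for illustration):

class Product(db.Model):
    Name = db.StringProperty()

def AddNewProduct(name):
    product = Product(Name=name)
    product.put()
    StringIndex.AddMatch(name)
    return product

# In the search request handler:
# matches = StringIndex.GetMatchesFor(typed_text) or []
# self.response.out.write(json.dumps(matches))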
If the number of keywords is limited, you could consider adding an indexed list property of partial search strings.
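A minimal sketch of that idea, building on a Product model like the one above (the property and helper names are assumptions):

class Product(db.Model):
    Name = db.StringProperty()
    # All prefixes of the lower-cased name, e.g. "fan", "fant", "fanta", ...
    name_prefixes = db.StringListProperty()

def name_prefix_list(name, min_len=3):
    name = name.lower()
    return [name[:i] for i in range(min_len, len(name) + 1)]

Product(Name='Fanta', name_prefixes=name_prefix_list('Fanta')).put()

# An exact-match filter on the list property now acts as a prefix search.
matches = Product.all().filter('name_prefixes =', 'fan').fetch(20)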
Note that you are limited to 5000 index entries per entity, and 1MB for the total entity size.
But you could also wait for Cloud SQL and the Full Text Search API to become available for the Go runtime.
I'm using App Engine datastore, and would like to make sure that row IDs behave similarly to "auto-increment" fields in mySQL DB.
Tried several generation strategies, but can't seem to take control over what happens:
the IDs are not consecutive; there seem to be several "streams" growing in parallel
the IDs get "recycled" after old rows are deleted
Is such a thing at all possible?
I really would like to refrain from keeping (indexed) timestamps for each row.
It sounds like you can't rely on IDs being sequential without a fair amount of extra work. However, there is an easy way to achieve what you are trying to do:
We'd like to delete old items (older than two months' worth, for
example)
Here is a model that automatically keeps track of both its creation and its modification times. Simply using the auto_now_add and auto_now parameters makes this trivial.
from google.appengine.ext import db
class Document(db.Model):
user = db.UserProperty(required=True)
title = db.StringProperty(default="Untitled")
content = db.TextProperty(default=DEFAULT_DOC)
created = db.DateTimeProperty(auto_now_add=True)
modified = db.DateTimeProperty(auto_now=True)
Then you can use cron jobs or the task queue to schedule your maintenance task of deleting old stuff. Finding the oldest stuff is as easy as sorting by created or modified date:
db.Query(Document).order("modified")
# or
db.Query(Document).order("created")
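And a minimal sketch of the cleanup job those cron/task-queue runs would execute (the two-month cutoff and batch size are assumptions):

import datetime
from google.appengine.ext import db

def delete_old_documents(batch_size=500):
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=60)
    old_keys = (db.Query(Document, keys_only=True)
                .filter('modified <', cutoff)
                .fetch(batch_size))
    db.delete(old_keys)
    return len(old_keys)  # Re-run (e.g. from the task queue) until this hits 0.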
What I know is that auto-generated IDs as long integers are available in Google App Engine, but there's no guarantee that the values are increasing, and there's also no guarantee that they are true one-by-one increments.
So, if you need timestamping and increments, add a DateTime field with millisecond precision, but then you don't know that the values are unique.
So, the best thing to do (and what we are using) is: (sorry for that, but this is indeed IMHO the best option)
use an auto-generated ID as a long (we use Objectify in Java)
use a timestamp on each entity and an index to query the entities (use a descending index) to get the top X
I think this is probably a fairly good solution, however be aware that I have not tested it in any way, shape or form. The syntax may even be incorrect!
The principle is to use memcache to generate a monotonic sequence, using the datastore to provide a fall-back if memcache fails.
from google.appengine.api import memcache
from google.appengine.ext import db


class IndexEndPoint(db.Model):
    index = db.IntegerProperty(indexed=False, default=0)


def find_next_index(cls):
    """Finds the next free index for an entity type."""
    index_name = 'seqindex-%s' % cls.kind()

    def _from_ds():
        """A very naive way to find the next free index.

        We just take the last known end point and loop until a free
        index is found.
        """
        tmp_index = IndexEndPoint.get_or_insert(index_name).index + 1
        index = None
        while index is None:
            key = db.Key.from_path(cls.kind(), tmp_index)
            if db.get(key) is None:
                index = tmp_index
            tmp_index += 1
        return index

    index = None
    while index is None:
        index = memcache.incr(index_name)
        if index is None:  # Our counter might have been evicted.
            index = _from_ds()
            if not memcache.add(index_name, index):  # Someone beat us to it.
                index = None
    # ToDo:
    # Use a named task to update IndexEndPoint so that if the memcache
    # counter gets evicted we don't have too many items to cycle over
    # to find the end point again.
    return index


def make_new(cls):
    """Makes a new entity with an incrementing ID."""

    def txn(index):
        """Makes a new entity if the index is still free.

        This should only fail if we had a memcache miss
        (not necessarily on this instance).
        """
        key = db.Key.from_path(cls.kind(), index)
        if db.get(key) is not None:
            return None
        entity = cls(key=key)  # needs SDK support for the key= constructor argument
        entity.put()
        return entity

    result = None
    while result is None:
        result = db.run_in_transaction(txn, find_next_index(cls))
    return result
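Usage would then be something along these lines (assuming the User model from the question; make_new could be extended to accept property values):

new_user = make_new(User)
print new_user.key().id()  # the next free integer ID for the User kind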
I'm reading on the Google App Engine groups about many users (Fig1, Fig2, Fig3) who can't figure out where the high number of Datastore reads in their billing reports comes from.
As you might know, Datastore reads are capped to 50K operations/day, above this budget you have to pay.
50K operations sounds like a lot, but unfortunately it seems that each operation (query, entity fetch, count, ...) hides several datastore reads.
Is it possible to know, via an API or some other approach, how many datastore reads are hidden behind the common RPC.get and RPC.runquery calls?
Appstats seems useless in this case because it gives only the RPC details and not the cost of the hidden reads.
Having a simple Model like this:
class Example(db.Model):
foo = db.StringProperty()
bars= db.ListProperty(str)
and 1,000 entities in the datastore, I'm interested in the cost of these kinds of operations:
items_count = Example.all(keys_only=True).filter('bars =', 'spam').count()
items_count = Example.all().count(10000)
items = Example.all().fetch(10000)
items = Example.all().filter('bars =', 'spam').filter('bars =', 'fu').fetch(10000)
items = Example.all().fetch(10000, offset=500)
items = Example.all().filter('foo >=', filtr).filter('foo <', filtr + u'\ufffd')
See http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost.
A query costs you 1 read plus 1 read for each entity returned. "Returned" includes entities skipped by offset or count.
So that is 1001 reads for each of these:
Example.all(keys_only=True).filter('bars =', 'spam').count()
Example.all().count(1000)
Example.all().fetch(1000)
Example.all().fetch(1000, offset=500)
For these, the number of reads charged is 1 plus the number of entities that match the filters:
Example.all().filter('bars =', 'spam').filter('bars =', 'fu').fetch()
Example.all().filter('foo >=', filtr).filter('foo <', filtr + u'\ufffd').fetch()
Instead of using count you should consider storing the count in the datastore, sharded if you need to update the count more than once a second. http://code.google.com/appengine/articles/sharding_counters.html
Whenever possible you should use cursors instead of an offset.
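A minimal cursor sketch using the Example model from the question:

q = Example.all().filter('bars =', 'spam')
batch = q.fetch(100)
cursor = q.cursor()  # opaque string marking where this query stopped

# Later, possibly in another request, resume where we left off:
q = Example.all().filter('bars =', 'spam')
q.with_cursor(cursor)
next_batch = q.fetch(100)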
Just to make sure - I'm almost certain about the following:
Example.all().count(10000)
This one uses small datastore operations (no need to fetch the entities, only keys), so this would count as 1 read + 10,000 (max) small operations.
I figure one way to do a count is like this:
foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz'")
my_count = foo.count()
What I don't like is that my count will be limited to 1000 max and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT function...
You have to flip your thinking when working with a scalable datastore like GAE to do your calculations up front. In this case that means you need to keep counters for each baz and increment them whenever you add a new bar, instead of counting at the time of display.
class CategoryCounter(db.Model):
category = db.StringProperty()
count = db.IntegerProperty(default=0)
then when creating a Bar object, increment the counter
def createNewBar(category_name):
    bar = Bar(..., baz=category_name)
    counter = CategoryCounter.all().filter('category =', category_name).get()
    if not counter:
        counter = CategoryCounter(category=category_name)
    counter.count += 1
    bar.put()
    counter.put()
db.run_in_transaction(createNewBar,'asdf')
now you have an easy way to get the count for any specific category
CategoryCounter.all().filter('category =', category_name).get().count
+1 to Jehiah's response.
The official and blessed method for keeping object counters on GAE is to build a sharded counter. Despite the heavy-sounding name, this is pretty straightforward:
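For reference, a minimal sketch in the spirit of the sharding-counters article (the shard count and names are assumptions; the article's full version also caches the total in memcache):

import random
from google.appengine.ext import db

NUM_SHARDS = 20

class SimpleCounterShard(db.Model):
    """One of NUM_SHARDS entities that together hold a single count."""
    count = db.IntegerProperty(required=True, default=0)

def increment():
    """Increments a randomly picked shard, so writes rarely contend."""
    def txn():
        index = random.randint(0, NUM_SHARDS - 1)
        shard_name = 'shard' + str(index)
        counter = SimpleCounterShard.get_by_key_name(shard_name)
        if counter is None:
            counter = SimpleCounterShard(key_name=shard_name)
        counter.count += 1
        counter.put()
    db.run_in_transaction(txn)

def get_count():
    """Sums all shards; fine as long as NUM_SHARDS stays small."""
    return sum(shard.count for shard in SimpleCounterShard.all())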
Count functions in all databases are slow (e.g., O(n)) - the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.
This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...
According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:
from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)
Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.
I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?
LIMIT=1000
def count(query):
result = offset = 0
gql_query = db.GqlQuery(query)
while True:
count = gql_query.fetch(LIMIT, offset)
if count < LIMIT:
return result
result += count
offset += LIMIT
orip's solution works with a little tweaking:
LIMIT=1000
def count(query):
result = offset = 0
gql_query = db.GqlQuery(query)
while True:
count = len(gql_query.fetch(LIMIT, offset))
result += count
offset += LIMIT
if count < LIMIT:
return result
We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:
Datastore Statistics
As pointed out by @Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the percentage of records is not changing drastically during any given day.
To query the statistics for a given Kind, you can use the following GQL structure:
select * from __Stat_Kind__ where kind_name = 'Person'
There are a number of properties returned by this which are helpful:
count -- the number of Entities of this Kind
bytes -- total size of all Entities stored of this Kind
timestamp -- an as of date/time for when the stats were last computed
Example Code
To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:
/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
var query = new GqlQuery
{
QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
AllowLiterals = true
};
var result = database.RunQuery(query);
return (long) (result?.Entities?[0]?["count"] ?? 0L);
}
The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.
class MyObj(db.Model):
    num = db.IntegerProperty()
When you create a new object, you must manually retrieve the highest num so far:
max = MyObj.all().order('-num').get()
if max:
    max = max.num + 1
else:
    max = 0
newObj = MyObj(num=max)
newObj.put()
This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.
Then, when you want to fetch past the 1000th object limit, you simply do:
MyObj.all().filter('num > ' , 2345).fetch(67)
I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504. It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...