Google App Engine ndb performance on repeated property

Do I pay a query performance penalty if I choose to query on a repeated property? For example:
from google.appengine.ext import ndb

class User(ndb.Model):
    user_name = ndb.StringProperty()
    login_providers = ndb.KeyProperty(repeated=True)

fbkey = ndb.Key("ProviderId", 1, "ProviderName", "FB")
for entry in User.query(User.login_providers == fbkey):
    pass  # Do something with entry.key
vs
class User(ndb.Model):
    user_name = ndb.StringProperty()

class UserProvider(ndb.Model):
    user_key = ndb.KeyProperty(kind=User)
    login_provider = ndb.KeyProperty()

for entry in UserProvider.query(
        UserProvider.user_key == auserkey,
        UserProvider.login_provider == fbkey):
    pass  # Do something with entry.user_key
Based on the GAE documentation, it seems that Datastore takes care of indexing and that the first, less verbose option would use the index. However, I failed to find any documentation that confirms this.
Edit
The sole purpose of UserProvider in the second example is to create a one-to-many relationship between a user and its login_provider. I wanted to understand whether it is worth the trouble of creating a second entity instead of querying on a repeated property. Also, assume that all I need is the key from the User.

No. But you'll raise your write costs, because each value in the repeated property needs its own index entry, and write costs are based on the number of index entries updated.
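To make that trade-off concrete, here is a toy in-memory sketch. It is not the real Datastore: `ToyIndex` and its methods are hypothetical names invented for illustration. It only shows that an equality filter on a repeated property is a single index lookup, while every value in the list adds one index row at write time.

```python
class ToyIndex:
    """In-memory stand-in for a single-property index (NOT the real Datastore)."""

    def __init__(self):
        self.rows = {}          # property value -> set of entity keys
        self.index_writes = 0   # number of index rows written so far

    def put(self, entity_key, values):
        # A repeated property contributes one index row per value.
        for value in values:
            self.rows.setdefault(value, set()).add(entity_key)
            self.index_writes += 1

    def query_eq(self, value):
        # An equality filter is one index lookup, regardless of how many
        # values each matching entity holds.
        return sorted(self.rows.get(value, set()))


index = ToyIndex()
index.put("user:alice", ["FB", "Google"])
index.put("user:bob", ["FB"])
index.put("user:carol", ["Twitter", "Google", "FB"])

fb_users = index.query_eq("FB")
print(fb_users)            # ['user:alice', 'user:bob', 'user:carol']
print(index.index_writes)  # 6 (one row per stored value)
```

The read side stays cheap; the cost that grows with the length of the list is on the write side.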

Related

How to flatten a 'friendship' model within User model in GAE?

I recently came across a number of articles recommending that data be flattened for NoSQL databases. Coming from traditional SQL databases, I realized I was replicating SQL database behaviour in GAE, so I started to refactor code where possible.
We have e.g. a social media site where users can become friends with each other.
class Friendship(ndb.Model):
    from_friend = ndb.KeyProperty(kind=User)
    to_friend = ndb.KeyProperty(kind=User)
Effectively the app creates a Friendship instance in each direction between both users.
friendshipA = Friendship(from_friend=userA, to_friend=userB)
friendshipB = Friendship(from_friend=userB, to_friend=userA)
How could I now move this into the actual user model to flatten it? I thought maybe I could use a StructuredProperty. I know it is limited to 5000 entries, but that should be enough for friends.
class User(UserMixin, ndb.Model):
    name = ndb.StringProperty()
    friends = ndb.StructuredProperty(User, repeated=True)
So I came up with this. However, it seems User can't point to itself, because I get: NameError: name 'User' is not defined
Any idea how I could flatten it so that a single User instance would contain all its friends, with all their properties?
You can't create a StructuredProperty that references itself. Also, using a StructuredProperty to store a copy of User has the additional problem that you need to perform a manual cascade update whenever a user modifies a property that is stored in the copies.
However, as KeyProperty accepts a string as kind, you can easily store the list of Users using KeyProperty, as suggested by @dragonx. You can further optimise reads by using ndb.get_multi to avoid multiple round-trip RPC calls when retrieving friends.
Here is some sample code:
class User(ndb.Model):
    name = ndb.StringProperty()
    friends = ndb.KeyProperty(kind="User", repeated=True)

userB = User(name="User B")
userB_key = userB.put()
userC = User(name="User C")
userC_key = userC.put()
userA = User(name="User A", friends=[userB_key, userC_key])
userA_key = userA.put()

# To retrieve all friends in one batched call
for user in ndb.get_multi(userA.friends):
    print("user: %s" % user.name)
Use a KeyProperty that stores the key for the User instance.
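Since the original Friendship model stored one record in each direction, the flattened version needs to keep both users' lists in sync. Here is a minimal plain-Python sketch of that bookkeeping. `ToyUser`, `add_friendship` and the `store` dict are hypothetical stand-ins, not ndb APIs; with real ndb both User entities would have to be written back with put().

```python
class ToyUser:
    """Plain-Python stand-in for the User model; friends holds keys, not copies."""

    def __init__(self, key, name):
        self.key = key
        self.name = name
        self.friends = []  # plays the role of KeyProperty(kind="User", repeated=True)


def add_friendship(a, b):
    # Friendship is mutual, so each user records the other's key exactly once.
    if b.key not in a.friends:
        a.friends.append(b.key)
    if a.key not in b.friends:
        b.friends.append(a.key)


store = {}  # key -> ToyUser, standing in for the datastore
for key, name in [("u1", "User A"), ("u2", "User B"), ("u3", "User C")]:
    store[key] = ToyUser(key, name)

add_friendship(store["u1"], store["u2"])
add_friendship(store["u1"], store["u3"])
add_friendship(store["u2"], store["u1"])  # repeated request changes nothing

# One batched lookup of all friends, analogous to ndb.get_multi(userA.friends)
friend_names = [store[k].name for k in store["u1"].friends]
print(friend_names)  # ['User B', 'User C']
```

Because only keys are stored, renaming a user never requires touching anyone else's friends list.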

App Engine Datastore - consistency and the 1 write per second limitation - how will it work in the following scenarios?

I'm trying to wrap my head around the eventual consistency and 1-write-per-second principles in the GAE datastore. I have a scenario and a few questions:
# Python-like pseudo-code
class User:
    user_id = StringProperty
    last_update_time = DateTimeProperty

class Comment:
    user_id = StringProperty
    comment = StringProperty
...
def AddCommentAndReturnAllComments(user_id):
    user = db.GqlQuery("SELECT * FROM User where user_id = :1", user_id)
    user.last_update_time = datetime.now()
    user.put()
    comment = Comment(parent=User(user_id))
    comment.put()
    comments = db.GqlQuery("SELECT * FROM Comment where user_id = :1", user_id)
    return comments
Questions:
1. Will I get an exception here because I make two writes to the same entity group within one second (user.put and comment.put)? Is there a simple way around it?
2. If I remove parent=User(user_id), the two entities will no longer belong to the same entity group. Does that mean the list of comments returned from the function might not contain the last added comment?
3. Am I doing something inherently wrong?
I know that I got the entity referencing part wrong. It doesn't matter for the question (or does it?)
1. This seems to be a soft limit. In practice I see up to 5 writes/s allowed.
2. Yes, and it also happens now, because you are not using an ancestor query.
3. Nothing, except as mentioned in point 2.

Cost of updating entities in datastore (and, possible to append properties)?

I have a two part question.
Let's say I have an entity with a blob property...
# create entity
class Entity(ndb.Model):
    blob = ndb.BlobProperty(indexed=False)

e = Entity()
e.blob = 'abcd'
e_key = e.put()

# update entity
e = e_key.get()
e.blob += 'efg'
e.put()
So questions are:
The first time I put() that entity, the cost is 2 Write Ops; how many Ops does it cost to update the entity, as in the above example?
When I added 'efg' to the property, the old property had to be read into memory first, does app engine provide a way to append the old value without reading it first?
There are no partial updates: every put() overwrites the whole entity, so the old value has to be read before you can append to it. The number of indexes also affects the cost. Have a look at https://developers.google.com/appengine/articles/life_of_write for a detailed breakdown of what happens during a write.
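As a rough sketch of the arithmetic for the first question, assume the legacy per-operation billing rules: a new entity put costs 2 writes plus 2 per indexed property value plus 1 per composite index value, and an existing entity put costs 1 write plus 4 per modified indexed property value plus 2 per modified composite index value. Those constants are an assumption on my part, not something stated in the answer above, and the function names are made up for illustration. Check them against the current pricing documentation before relying on them.

```python
def new_entity_write_ops(indexed_values, composite_values=0):
    # ASSUMPTION: legacy billing rule for putting a brand-new entity.
    return 2 + 2 * indexed_values + composite_values


def existing_entity_write_ops(modified_indexed_values, modified_composite_values=0):
    # ASSUMPTION: legacy billing rule for overwriting an existing entity.
    return 1 + 4 * modified_indexed_values + 2 * modified_composite_values


# The Entity in the question has a single unindexed BlobProperty,
# so no index rows are involved in either put().
first_put = new_entity_write_ops(indexed_values=0)
update_put = existing_entity_write_ops(modified_indexed_values=0)

# For comparison: the same update if the changed property were indexed.
indexed_update = existing_entity_write_ops(modified_indexed_values=1)

print(first_put, update_put, indexed_update)  # 2 1 5
```

Under these assumptions the first put matches the 2 Write Ops observed in the question, and the update would cost 1.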

Should the size of entities be as small as possible when I count them by "count()" method?

I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
    title = db.StringProperty()
    message = db.StringProperty()
    created_on = db.DateTimeProperty()
    created_by = db.ReferenceProperty(User)
    category = db.StringProperty()
And there are 100,000,000 entities of this kind.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
    category = db.StringProperty()
    look_message = db.ReferenceProperty(Message)
Does this small model make counting faster?
And does it use less memory?
By the way, I'm thinking of counting them like this:
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100,000,000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce operation to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats

kind_stats = stats.KindStat.all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think you should create another entity like this, which just counts the number of messages per category.
Just change your Category model to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the Message class, change the category property to reference the Category class:
category = db.ReferenceProperty(Category)
When you create a new Message object, you have to update the counter: increment it when you create a message and decrement it when you delete one.
The best way to work with counters on GAE is to use sharded counters.
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method like a Sharded Counter, MapReduce or Materialized View/Fork Join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
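For readers who haven't met sharded counters, here is a minimal in-memory sketch of the idea. It is plain Python, not datastore code: `ShardedCounter` is a hypothetical class, and each list slot stands in for what would be a separate shard entity, so that concurrent writers rarely contend on the same one.

```python
import random


class ShardedCounter:
    """Toy sharded counter; each list slot stands in for one shard entity."""

    def __init__(self, num_shards=20, rng=None):
        self.shards = [0] * num_shards
        self.rng = rng or random.Random()

    def increment(self, delta=1):
        # Pick a shard at random so writes are spread across many entities.
        i = self.rng.randrange(len(self.shards))
        self.shards[i] += delta

    def value(self):
        # Reading the total means summing every shard.
        return sum(self.shards)


counters = {"book": ShardedCounter(), "music": ShardedCounter()}

for _ in range(1000):
    counters["book"].increment()     # a Message in 'book' is created
counters["book"].increment(-1)       # one 'book' Message is deleted
counters["music"].increment()

book_count = counters["book"].value()
print(book_count)  # 999
```

The trade-off is that writes scale with the number of shards while a read has to touch all of them, which is why the total is usually cached.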

What's the best way to model polls?

The domain is like this:
class Poll(db.Model):
question = db.StringProperty()
...
class Choice(db.Model):
poll = db.ReferenceProperty(Poll)
choice = db.StringProperty()
class Vote(db.Model):
user = db.ReferenceProperty(User)
choice = db.ReferenceProperty(Choice)
(This is not actually a definitive model, it's just a pseudo-diagram)
The things I need to query are:
Total number of votes for each poll on screen
Total number of votes for each option for each poll on screen
If the current user voted, for each poll
I have come up with some other schemas using sharded counters and list properties, and none (given my own limitations) seems to work. Oh, and of course, it needs to be super fast :)
Could you help me model my data?
Thank you
Edit: Thanks to @Nick Johnson I can give a more accurate description of my problem. He suggested this schema:
class Poll(db.Model):
    question = db.StringProperty(indexed=False, required=True)
    choices = db.StringListProperty(indexed=False, required=True)
    votes = db.ListProperty(int, indexed=False, required=True)

class Vote(db.Model):
    # Vote is a child entity of Poll, so doesn't need an explicit reference to it
    # Vote's key name is the user_id, so users can only vote once
    user = db.ReferenceProperty(User, required=True)
    choice = db.IntegerProperty(required=True)
The problem with this is that I can't efficiently query whether the user has voted on a particular poll. Also, I want this schema to hold up to, let's say, 1 million votes per poll (maybe I'd never get there, but I would like to aim for it).
To solve this I was thinking of adding an EntityIndex like this:
class PollIndex(db.Model):
    # PollIndex is a child of Poll
    voters = db.ListProperty(db.Key)
    voters_choices = db.ListProperty(int)
    # other search parameters
Then when I have to query for a list of polls I can do it with only 2 queries:
# get keys from PollIndex where the user is not there
# get keys from PollIndex where the user is there
# grab all the polls
Another cool thing is that if the number of voters grows I can dynamically add more PollIndexes.
What do you think of this approach?
The answer somewhat depends on what you expect the maximum sustained rate of updates to the poll to be. I'll assume initially that it's going to be quite limited (<1 per second typical, with peaks up to 10 per second).
Your design is mostly okay, except for a couple of tweaks:
Don't store choices as a separate entity, just store them as a list on the poll
Keep a running total of votes on the Poll entity for fast retrieval
With those changes, your model looks something like this:
class Poll(db.Model):
    question = db.StringProperty(indexed=False, required=True)
    choices = db.StringListProperty(indexed=False, required=True)
    votes = db.ListProperty(int, indexed=False, required=True)

class Vote(db.Model):
    # Vote is a child entity of Poll, so doesn't need an explicit reference to it
    # Vote's key name is the user_id, so users can only vote once
    user = db.ReferenceProperty(User, required=True)
    choice = db.IntegerProperty(required=True)

# Here's how we record a vote
def record_vote(poll_key, user, choice_idx):
    # We assume 'user' is an instance of a datastore model, and has a property 'user' that is
    # a users.User object
    poll = Poll.get(poll_key)
    vote = Vote.get_by_key_name(user.user.user_id(), parent=poll)
    if vote:
        # User has already voted
        return
    # 'choice' is a required property, so it must be set when the Vote is created
    vote = Vote(key_name=user.user.user_id(), parent=poll, user=user, choice=choice_idx)
    poll.votes[choice_idx] += 1
    db.put([vote, poll])
If you need higher throughput, you should modify the Vote record to not be a child of Poll (and change its key name to incorporate both poll ID and user ID), and then either use write-behind counters with Memcache or a pull queue to aggregate the results into updates to the Poll totals.
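The write-behind idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `VoteBuffer` is a hypothetical in-memory stand-in for the Memcache counters or pull queue, `totals` stands in for the Poll.votes lists, and the `seen` set mimics a Vote key name built from poll ID plus user ID.

```python
from collections import Counter


class VoteBuffer:
    """In-memory stand-in for Memcache counters or a pull queue."""

    def __init__(self):
        self.pending = Counter()  # (poll_id, choice_idx) -> votes not yet flushed
        self.seen = set()         # (poll_id, user_id), like a unique Vote key name

    def record_vote(self, poll_id, user_id, choice_idx):
        # Reject a second vote by the same user on the same poll.
        if (poll_id, user_id) in self.seen:
            return False
        self.seen.add((poll_id, user_id))
        self.pending[(poll_id, choice_idx)] += 1
        return True

    def flush(self, totals):
        # Fold all buffered votes into the running totals in one pass,
        # instead of updating the poll once per vote.
        for (poll_id, choice_idx), n in self.pending.items():
            totals[poll_id][choice_idx] += n
        self.pending.clear()


totals = {"poll-1": [0, 0, 0]}  # stands in for Poll.votes
buf = VoteBuffer()
buf.record_vote("poll-1", "alice", 0)
buf.record_vote("poll-1", "bob", 2)
buf.record_vote("poll-1", "carol", 2)
duplicate_accepted = buf.record_vote("poll-1", "alice", 1)

buf.flush(totals)
print(totals["poll-1"])    # [1, 0, 2]
print(duplicate_accepted)  # False
```

The totals lag behind by however long votes sit in the buffer, which is the price of taking the Poll entity off the hot write path.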

Resources