What's the best way to model polls? - google-app-engine

The domain is like this:
class Poll(db.Model):
    question = db.StringProperty()
    ...
class Choice(db.Model):
    poll = db.ReferenceProperty(Poll)
    choice = db.StringProperty()
class Vote(db.Model):
    user = db.ReferenceProperty(User)
    choice = db.ReferenceProperty(Choice)
(This is not a definitive model; it's just a pseudo-diagram.)
The things I need to query are:
Total number of votes for each poll on screen
Total number of votes for each option for each poll on screen
Whether the current user has voted, for each poll
I have come up with some other schemas using sharded counters and list properties, but none (given my own limitations) seems to work. Oh, and of course, it needs to be super fast :)
Could you help me model my data?
Thank you
Edit: Thanks to @Nick Johnson I can give a more accurate description of my problem. He suggested this schema:
class Poll(db.Model):
    question = db.StringProperty(indexed=False, required=True)
    choices = db.StringListProperty(indexed=False, required=True)
    votes = db.ListProperty(int, indexed=False, required=True)
class Vote(db.Model):
    # Vote is a child entity of Poll, so doesn't need an explicit reference to it
    # Vote's key name is the user_id, so users can only vote once
    user = db.ReferenceProperty(User, required=True)
    choice = db.IntegerProperty(required=True)
The problem with this is that I can't efficiently query whether the user has voted on a particular poll. Also, I want this schema to withstand, let's say, 1 million votes per poll (maybe I'll never get there, but I would like to aim for it).
To solve this I was thinking of adding an EntityIndex like this:
class PollIndex(db.Model):
    # PollIndex is child of Poll
    voters = db.ListProperty(db.Key)
    voters_choices = db.ListProperty()
    # other search parameters
Then, when I have to query for a list of polls, I can do it with only 2 queries:
# get keys from PollIndex where the user is not there
# get keys from PollIndex where the user is there
# grab all the polls
Another cool thing is that if the number of voters grows, I can dynamically add more PollIndexes.
What do you think of this approach?

The answer somewhat depends on what you expect the maximum sustained rate of updates to the poll to be. I'll assume initially that it's going to be quite limited (<1 per second typical, with peaks up to 10 per second).
Your design is mostly okay, except for a couple of tweaks:
Don't store choices as a separate entity, just store them as a list on the poll
Keep a running total of votes on the Poll entity for fast retrieval
With those changes, your model looks something like this:
from google.appengine.ext import db
class Poll(db.Model):
    question = db.StringProperty(indexed=False, required=True)
    choices = db.StringListProperty(indexed=False, required=True)
    votes = db.ListProperty(int, indexed=False, required=True)
class Vote(db.Model):
    # Vote is a child entity of Poll, so doesn't need an explicit reference to it
    # Vote's key name is the user_id, so users can only vote once
    user = db.ReferenceProperty(User, required=True)
    choice = db.IntegerProperty(required=True)
# Here's how we record a vote.
# Call it via db.run_in_transaction(record_vote, poll_key, user, choice_idx)
# so that the duplicate check and both writes happen atomically.
def record_vote(poll_key, user, choice_idx):
    # We assume 'user' is an instance of a datastore model, and has a property 'user' that is
    # a users.User object
    poll = Poll.get(poll_key)
    vote = Vote.get_by_key_name(user.user.user_id(), parent=poll)
    if vote:
        # User has already voted
        return
    # 'choice' is a required property, so it must be set when the Vote is created
    vote = Vote(key_name=user.user.user_id(), parent=poll, user=user, choice=choice_idx)
    poll.votes[choice_idx] += 1
    db.put([vote, poll])
If you need higher throughput, you should modify the Vote record to not be a child of Poll (and change its key name to incorporate both poll ID and user ID), and then either use write-behind counters with Memcache or a pull queue to aggregate the results into updates to the Poll totals.
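As a rough illustration of that higher-throughput variant, the sketch below uses plain Python stand-ins rather than the datastore, Memcache, or task queue APIs. The helper names `vote_key_name` and `aggregate_votes`, and the `poll_id:user_id` key format, are assumptions made for this example and are not part of the answer above.

```python
from collections import Counter


def vote_key_name(poll_id, user_id):
    """Build a key name that is unique per (poll, user) pair.

    With Vote no longer a child of Poll, the key name alone has to
    guarantee that a user can vote only once per poll.
    """
    return "%s:%s" % (poll_id, user_id)


def aggregate_votes(totals, pending_choices):
    """Fold a batch of pending votes into the running totals.

    totals: list of per-choice vote counts (like Poll.votes)
    pending_choices: choice indexes drained from a queue or buffer
    Returns a new list and leaves the input untouched.
    """
    updated = list(totals)
    for choice_idx, count in Counter(pending_choices).items():
        updated[choice_idx] += count
    return updated


# One batched update replaces five separate writes to the Poll totals.
totals = aggregate_votes([10, 4, 0], [0, 2, 2, 1, 2])
```

The point of the batching is that the Poll entity is written once per batch instead of once per vote, which is what relieves the per-entity-group write limit.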

Related

How to model Player, Match, and EloRank as NDB entities

I'm trying to wrap my head around entity groups and hierarchical keys in ndb but I might be stuck in "normalized thinking". I want to compute and store different players' ranks based on how they did in different matches against each other over time. But all I can come up with is to store the "foreign keys" as strings like this:
class Player(ndb.Model):
    name = ndb.StringProperty()
class Match(ndb.Model):
    player1_key = ndb.KeyProperty(kind=Player) # pointing to Player entity
    player2_key = ndb.KeyProperty(kind=Player) # pointing to Player entity
    player1_score = ndb.IntegerProperty()
    player2_score = ndb.IntegerProperty()
    time = ndb.DateTimeProperty(auto_now_add=True)
class EloRank(ndb.Model):
    player_key = ndb.KeyProperty(kind=Player) # pointing to Player entity
    match_key = ndb.KeyProperty(kind=Match) # pointing to Match entity
    rank = ndb.IntegerProperty()
    time = ndb.DateTimeProperty(auto_now_add=True)
Sure, it would be easy to "denormalize" the data by copying it (i.e. Match has two sub keys, one for player 1 and one for player 2), but how can I, for instance, change a player's name without resorting to updates on each Match entity?
StructuredProperty doesn't seem to be the answer either, since they belong to the defining entity.
How would you rewrite this model to put the entities in the same group?
Update
Use KeyProperty instead of StringProperty as suggested by M12.
First of all, you may want to use the ndb.KeyProperty() to store player keys instead of a StringProperty().
If you store a reference (key) to the players participating in each match, you don't need to update every match when a player changes their name: when a match is requested, the application can use each player's key to fetch the current name and send it back.
Next, I'd probably store the player's rank within his instance, i.e. in the Player model:
class Player(ndb.Model):
    name = ndb.StringProperty()
    rank = ndb.IntegerProperty()
This approach requires you to write a fairly solid framework that makes sure that, after every match, all users' scores are modified appropriately. The "per-match score" could still be in the Match model, but the Player model would have the "aggregated" score of all matches played.
In order to do this, it would also be handy to add a list of matches played by each player into their model, so the Player model would now be:
class Player(ndb.Model):
    name = ndb.StringProperty()
    rank = ndb.IntegerProperty()
    matches = ndb.KeyProperty(repeated=True)
Player.matches would essentially be a list of keys of the matches played by the user, so that they are easier to fetch when looking at a player's details and match history.
Alternatively, Player.matches could be an ndb.JsonProperty() if you would like to store additional information about the matches played, as the one I initially suggested (ndb.KeyProperty(repeated=True)) is fairly limited in what it can store (it's only a list of keys).
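The answer leaves the rank update itself unspecified. A minimal sketch, assuming the standard Elo formula with a K-factor of 32; the constant and the function names `expected_score` and `update_ranks` are assumptions for illustration, and the datastore reads and writes are left out:

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ranks(rating_a, rating_b, score_a, k=32):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.5 for a draw and 0.0 if A lost.
    Results are rounded to integers because Player.rank is an
    IntegerProperty in the model above.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return int(round(new_a)), int(round(new_b))


# Two equally ranked players: the winner gains what the loser drops.
new_rank_1, new_rank_2 = update_ranks(1500, 1500, 1.0)
```

In the model above, both Player entities would be updated with these values after each Match is stored.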
Hope this helps a bit!

Google App Engine ndb performance on repeated property

Do I pay a query-performance penalty if I choose to query on a repeated property? For example:
class User(ndb.Model):
    user_name = ndb.StringProperty()
    login_providers = ndb.KeyProperty(repeated=True)
fbkey = ndb.Key("ProviderId", 1, "ProviderName", "FB")
for entry in User.query(User.login_providers == fbkey):
    # Do something with entry.key
    pass
vs
class User(ndb.Model):
    user_name = ndb.StringProperty()
class UserProvider(ndb.Model):
    user_key = ndb.KeyProperty(kind=User)
    login_provider = ndb.KeyProperty()
for entry in UserProvider.query(
        UserProvider.user_key == auserkey,
        UserProvider.login_provider == fbkey
):
    # Do something with entry.user_key
    pass
Based on the GAE documentation, it seems that Datastore takes care of indexing and that the first, less verbose option would use the index. However, I failed to find any documentation to confirm this.
Edit
The sole purpose of UserProvider in the second example is to create a one-to-many relationship between a user and its login providers. I wanted to understand whether it is worth the trouble of creating a second entity instead of querying on a repeated property. Also, assume that all I need is the key from the User.
No. But you'll raise your write costs, because each value in the repeated property needs to be indexed, and write costs are based on the number of index entries updated.
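To make that trade-off concrete, here is a toy model in plain Python (not the ndb API) of how an index over a repeated property behaves: one index row per value, so an equality filter is a single lookup, while each write touches as many rows as the entity has values. The function name `build_index` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_index(users):
    """Index users by each value of their repeated property.

    users: dict mapping a user key to its list of provider values.
    Returns (index, rows_written), where index maps a provider value
    to the set of user keys that contain it.
    """
    index = defaultdict(set)
    rows_written = 0
    for user_key, providers in users.items():
        for provider in providers:
            index[provider].add(user_key)
            rows_written += 1  # one index row per repeated value
    return index, rows_written


users = {"alice": ["FB", "Google"], "bob": ["FB"], "carol": ["Twitter"]}
index, rows_written = build_index(users)

# The equality query is a direct lookup, regardless of list length.
fb_users = sorted(index["FB"])
```

Three users produce four index rows here because one of them has two providers; that extra row is the write cost the answer refers to.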

App Engine Datastore - consistency and the 1 write per second limitation - how will it work in the following scenarios?

I'm trying to wrap my head around the eventual consistency and 1 write per second principles of the GAE datastore. I have a scenario and three questions:
# Python-like pseudo-code
class User:
    user_id = StringProperty
    last_update_time = DateTimeProperty
class Comment:
    user_id = StringProperty
    comment = StringProperty
    ...
def AddCommentAndReturnAllComments(user_id):
    user = db.GqlQuery("SELECT * FROM User where user_id = :1", user_id)
    user.last_update_time = datetime.now()
    user.put()
    comment = Comment(parent=User(user_id))
    comment.put()
    comments = db.GqlQuery("SELECT * FROM Comment where user_id = :1", user_id)
    return comments
Questions:
1. Will I get an exception here because I make two writes into the same entity group within one second (user.put and comment.put)? Is there a simple way around it?
2. If I remove the parent=User(user_id), the two entities will no longer belong to the same entity group. Does that mean the list of comments returned from the function might not contain the last added comment?
3. Am I doing something inherently wrong?
I know that I got the entity referencing part wrong. It doesn't matter for the question (or does it?)
1. This seems to be a soft limit. In practice I see up to 5 writes/s allowed.
2. Yes, and it also happens now, because you are not using an ancestor query.
3. Nothing, except as mentioned in point 2.

Should the size of entities be as small as possible when I count them by "count()" method?

I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
    title = db.StringProperty()
    message = db.StringProperty()
    created_on = db.DateTimeProperty()
    created_by = db.ReferenceProperty(User)
    category = db.StringProperty()
And there are 100000000 entities made of this model.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
    category = db.StringProperty()
    look_message = db.ReferenceProperty(Message)
Does this small model make it faster to count?
And does it use less memory?
By the way, I'm thinking of counting them like this:
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100000000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce operation to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats
# all() is a class method, so no KindStat instance is needed
kind_stats = stats.KindStat.all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think you have to create another entity like this.
This entity will just keep the number of messages per category.
Just change your Category model to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the Message class, make the category property reference the Category class:
    category = db.ReferenceProperty(Category)
When you create a new Message, you have to update the counter: increment it when you create a message and decrement it when you delete one.
The best way to work with counters on GAE is to use sharded counters.
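The idea behind a sharded counter can be sketched in plain Python. This is an in-memory stand-in, not the datastore API: in a real application each shard would be its own entity, so concurrent increments rarely contend on the same one. The class name `ShardedCounter` and the shard count are assumptions for this example.

```python
import random


class ShardedCounter(object):
    """Spread increments over several shards and sum them on read."""

    def __init__(self, num_shards=20):
        self.shards = [0] * num_shards

    def increment(self, delta=1):
        # Pick a shard at random so writes are spread out evenly.
        idx = random.randrange(len(self.shards))
        self.shards[idx] += delta

    def total(self):
        # Reading costs one pass over all shards.
        return sum(self.shards)


# Count messages in the 'book' category as they are created or deleted.
book_counter = ShardedCounter(num_shards=5)
for _ in range(1000):
    book_counter.increment()      # a message was created
book_counter.increment(-1)        # a message was deleted
```

More shards raise the sustainable write rate at the cost of a slightly more expensive read, since `total()` has to visit every shard.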
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method like a Sharded Counter, MapReduce or Materialized View/Fork Join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

How to avoid duplicates in GAE datastore?

Let's say here is the database structure:
class News(db.Model):
    title = db.StringProperty()
class NewsRating(db.Model):
    user = db.IntegerProperty()
    rating = db.IntegerProperty()
    news = db.ReferenceProperty(News)
Each user can leave only one rating for each News. The following code doesn't care about duplicates:
rating = NewsRating()
rating.user = 123456
rating.rating = 1
rating.news = News.get_by_key_name('news-unique-key')
rating.put()
How should I modify it so that only one record is allowed for each rating.user and rating.news combination? If such a rating already exists, it should be updated with the new value.
Use key names and (possibly) parent entities to keep track. For instance, supposing you have a UserInfo kind, you could do it like this:
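class NewsRating(db.Model):
    # No explicit user reference, since it's the parent entity
    rating = db.IntegerProperty(required=True)
    news = db.ReferenceProperty(News) # We could get this from the key name, but this is more convenient
# 'rating' is required, so it has to be passed in; 1 mirrors the value used in the question.
# id_or_name() is used because the News entity in the question is stored under a key name.
rating = NewsRating(parent=current_user, key_name=str(news.key().id_or_name()), news=news, rating=1)
rating.put()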
Attempting to add the same rating multiple times will simply overwrite the existing one, or you can use a datastore transaction to add it atomically.
Note that you should almost certainly keep a total of ratings against the News entity, rather than counting up ratings on each request, which will get less efficient as the number of ratings increases.
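Keeping that running total correct when a rating is overwritten means adjusting by the difference rather than simply adding. A minimal sketch in plain Python, with a dict standing in for the keyed NewsRating entities; the class name `NewsTotals` and its attributes are hypothetical and not part of the answer above:

```python
class NewsTotals(object):
    """In-memory stand-in for one News entity and its ratings."""

    def __init__(self):
        self.ratings_by_user = {}  # user id -> rating, one entry per user
        self.rating_sum = 0
        self.rating_count = 0

    def rate(self, user_id, rating):
        previous = self.ratings_by_user.get(user_id)
        if previous is None:
            # First rating from this user: the count goes up.
            self.rating_count += 1
        else:
            # Overwrite: remove the old value before adding the new one.
            self.rating_sum -= previous
        self.ratings_by_user[user_id] = rating
        self.rating_sum += rating


totals = NewsTotals()
totals.rate(123456, 1)
totals.rate(654321, 5)
totals.rate(123456, 3)  # same user again: replaces the rating of 1
```

In the datastore, the read of the previous rating and the update of the totals would need to happen in one transaction to stay consistent.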
