How to avoid duplicates in GAE datastore? - google-app-engine

Let's say here is the database structure:
class News(db.Model):
    title = db.StringProperty()

class NewsRating(db.Model):
    user = db.IntegerProperty()
    rating = db.IntegerProperty()
    news = db.ReferenceProperty(News)
Each user can leave only one rating for each News. The following code doesn't care about duplicates:
rating = NewsRating()
rating.user = 123456
rating.rating = 1
rating.news = News.get_by_key_name('news-unique-key')
rating.put()
How should I modify this so that it allows only one record for each rating.user and rating.news combination? If such a rating already exists, it should be updated with the new value.

Use key names and (possibly) parent entities to keep track. For instance, supposing you have a UserInfo kind, you could do it like this:
class NewsRating(db.Model):
    # No explicit user reference, since it's the parent entity
    rating = db.IntegerProperty(required=True)
    news = db.ReferenceProperty(News)  # We could get this from the key name, but this is more convenient

# id_or_name() works whether the News key uses a numeric ID or a key name;
# 'rating' is required, so it has to be passed to the constructor
rating = NewsRating(parent=current_user, key_name=str(news.key().id_or_name()), rating=1, news=news)
rating.put()
Attempting to add the same rating multiple times will simply overwrite the existing one, or you can use a datastore transaction to add it atomically.
Note that you should almost certainly keep a total of ratings against the News entity, rather than counting up ratings on each request, which will get less efficient as the number of ratings increases.
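To illustrate why a deterministic key prevents duplicates, here is a minimal plain-Python sketch. It does not use the App Engine API: the two dicts stand in for the datastore, and the names put_rating, ratings and totals are hypothetical. It also keeps the running total per news item suggested above.

```python
# Plain-Python stand-in for the datastore (hypothetical names, not GAE APIs).
# ratings: (user_id, news_key_name) -> rating value
# totals:  news_key_name -> sum of all ratings for that news item


def put_rating(ratings, totals, user_id, news_key_name, value):
    """Insert or overwrite the one rating a user has for a news item.

    Returns the updated running total for that news item.
    """
    key = (user_id, news_key_name)      # plays the role of parent + key name
    previous = ratings.get(key, 0)      # 0 if the user has not rated yet
    ratings[key] = value                # same key -> overwrite, never a duplicate
    totals[news_key_name] = totals.get(news_key_name, 0) + value - previous
    return totals[news_key_name]


ratings, totals = {}, {}
put_rating(ratings, totals, 123456, "news-unique-key", 1)  # creates the record
put_rating(ratings, totals, 123456, "news-unique-key", 5)  # replaces it
print(len(ratings), totals["news-unique-key"])
```

In the real datastore the read of the previous value and both writes would need to happen inside a transaction to stay consistent.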

Related

Datastore design: 1 large class vs. 2 classes vs. polymodel?

I am interested in understanding the pros / cons of several ways to design classes for Google App Engine's Datastore.
Consider the following classes:
Option 0
class Car(db.Model):
    title = db.StringProperty()
    year = db.StringProperty()
    imgurl = db.StringProperty()
    type = db.StringProperty()
    addeddate = db.DateTimeProperty()
    external_id = db.IntegerProperty()
    # possibly 5 or 6 more properties

class Part(db.Model):
    title = db.StringProperty()
    # other stuff
Part's parent is always set to the corresponding Car on creation.
These are used in several ways:
query + list (+ sort) parts: when listing parts, I need to display the Car's title and get its external_id and year. I don't need everything, but the whole Car entity is fetched when accessing part.parent (I am already prefetching parents).
query + list (+ sort) cars: only need the title, year and imgurl.
get car: page with all the car details, need all the properties.
Considering the ways I get and display my data, what is the best option (with pros/cons) between the above design and the following ones?
Option 1
class Car(db.Model):
    title = db.StringProperty()
    year = db.StringProperty()
    imgurl = db.StringProperty()

class CarEx(db.Model):
    type = db.StringProperty()
    addeddate = db.DateTimeProperty()
    external_id = db.IntegerProperty()
    # possibly 5 or 6 more properties
Pro: When fetching Parts, getting the parents (Car) is faster since there are fewer properties.
Con: When displaying a Car, we need to get the CarEx. Need to add one more entity when adding a Car. Need to delete CarEx when deleting a Car.
Option 2
class Car(db.PolyModel):
    title = db.StringProperty()
    year = db.StringProperty()
    imgurl = db.StringProperty()

class CarEx(Car):
    type = db.StringProperty()
    addeddate = db.DateTimeProperty()
    external_id = db.IntegerProperty()
    # possibly 5 or 6 more properties
When adding cars, we would only add CarEx entities.
Pro: When fetching Parts, getting the parents (Car) is faster since there are fewer properties. ??? I am actually not sure at all this is true. ???
Pro: When displaying a Car, we get the CarEx. No need to get another entity. Adding and deleting cars is as easy as having only 1 Car model with everything in it (Option 0).
Con: Extra writes when adding a CarEx. Other extra costs?
So overall, I need to be able to fetch parts (and their parents, without a huge cost), and I need to fetch a full Car on a separate page. I am not sure if my assumptions about PolyModel are correct, nor if there are any other hidden pros/cons, or even other options.
A few points. If you are starting out, you should really be using ndb.
The small number of properties you list is not going to make enough difference to justify splitting Car and CarEx, especially if you need CarEx all the time.
Your use of PolyModel doesn't make sense, given how PolyModel works. PolyModel would be more suited to something like this:
from google.appengine.ext import db
from google.appengine.ext.db import polymodel

class Vehicle(polymodel.PolyModel):
    title = db.StringProperty()
    year = db.StringProperty()
    addeddate = db.DateTimeProperty()
    external_id = db.IntegerProperty()
    # possibly 5 or 6 more properties

class Car(Vehicle):
    doors = db.IntegerProperty()

class Van(Vehicle):
    carrying_capacity = db.FloatProperty()  # (m3)

class Truck(Vehicle):
    tray_length = db.IntegerProperty()
Yes, the properties are contrived. But now I can query for all vehicles by any of the core Vehicle properties and get Trucks, Vans and Cars back. You can't do this with normal model inheritance: without PolyModel you would have to query the Car, Van and Truck kinds separately.
In your case you probably don't need this.
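As a rough illustration of the behaviour described above, here is a plain-Python sketch (not the App Engine API) of a query that matches on the class hierarchy. The helper names class_path and query are hypothetical; the idea is that each entity records its class path, so a query against the base class also returns subclass instances.

```python
# Hypothetical in-memory model of a polymorphic query; not the GAE API.


class Vehicle(object):
    def __init__(self, title, year):
        self.title = title
        self.year = year

    @classmethod
    def class_path(cls):
        """Class names from the root down to cls, e.g. ['Vehicle', 'Car']."""
        return [c.__name__ for c in reversed(cls.__mro__) if c is not object]


class Car(Vehicle):
    pass


class Truck(Vehicle):
    pass


def query(entities, model, **filters):
    """Return entities whose class path contains `model` and that match all filters."""
    name = model.__name__
    return [
        e for e in entities
        if name in type(e).class_path()
        and all(getattr(e, prop) == value for prop, value in filters.items())
    ]


fleet = [Car("Mini", "2010"), Truck("Hauler", "2010"), Car("Coupe", "2011")]
print([v.title for v in query(fleet, Vehicle, year="2010")])  # cars and trucks
print([v.title for v in query(fleet, Car)])                   # cars only
```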
What you do with Parts depends heavily on how many there are and how often you need them. If you are likely to have less than 1MB of Parts and you need all Parts whenever you need any, consider storing them in a single container entity, using a repeated StructuredProperty. Then you fetch all the parts with a single entity get. If you only need some parts at a time, store them as separate entities.
If you need more than 1MB of Parts but you always need all parts then use more than one container.
You really need to look at the frequency of use of particular views, if you need all information vs some of it, to determine the best strategy.
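The multi-container idea can be sketched in plain Python as a greedy packing step. This is only an approximation under stated assumptions: the JSON length is used as a stand-in for the stored entity size, and the names pack_parts and MAX_CONTAINER_BYTES are hypothetical.

```python
import json

# Assumed budget, kept a little under the 1MB mentioned above.
MAX_CONTAINER_BYTES = 1000 * 1000


def pack_parts(parts, max_bytes=MAX_CONTAINER_BYTES):
    """Greedily group parts into containers whose JSON size stays under max_bytes.

    A single part larger than max_bytes still gets a container of its own.
    """
    containers, current, current_size = [], [], 0
    for part in parts:
        size = len(json.dumps(part).encode("utf-8"))
        if current and current_size + size > max_bytes:
            containers.append(current)       # close the full container
            current, current_size = [], 0
        current.append(part)
        current_size += size
    if current:
        containers.append(current)
    return containers


parts = [{"title": "part-%d" % i} for i in range(5)]
print([len(c) for c in pack_parts(parts, max_bytes=50)])
```

Each resulting list would then be stored in one container entity, so reading all parts costs one get per container rather than one per part.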

How to model Player, Match, and EloRank as NDB entities

I'm trying to wrap my head around entity groups and hierarchical keys in ndb but I might be stuck in "normalized thinking". I want to compute and store different players' rank based on how they did in different matches against each other over time. But all I can come up with is to store the "foreign keys" as strings like this:
class Player(ndb.Model):
    name = ndb.StringProperty()

class Match(ndb.Model):
    player1_key = ndb.KeyProperty(kind=Player)  # pointing to Player entity
    player2_key = ndb.KeyProperty(kind=Player)  # pointing to Player entity
    player1_score = ndb.IntegerProperty()
    player2_score = ndb.IntegerProperty()
    time = ndb.DateTimeProperty(auto_now_add=True)

class EloRank(ndb.Model):
    player_key = ndb.KeyProperty(kind=Player)  # pointing to Player entity
    match_key = ndb.KeyProperty(kind=Match)  # pointing to Match entity
    rank = ndb.IntegerProperty()
    time = ndb.DateTimeProperty(auto_now_add=True)
Sure, it would be easy to "denormalize" the data by copying it (i.e. Match would have two sub keys, one for player 1 and one for player 2), but how can I, for instance, change the name of a player without resorting to updating each Match entity?
StructuredProperty doesn't seem to be the answer either, since they belong to the defining entity.
How would you rewrite this model to put the entities in the same group?
Update
Use KeyProperty instead of StringProperty as suggested by M12.
First of all, you may want to use the ndb.KeyProperty() to store player keys instead of a StringProperty().
If you store a reference (key) to each player participating in a match, you don't need to update every match when a player changes their name: when a match is requested, the application can use the player's key to fetch the current name and send it back to the user.
Next, I'd probably store the player's rank within his instance, i.e. in the Player model:
class Player(ndb.Model):
    name = ndb.StringProperty()
    rank = ndb.IntegerProperty()
This approach requires you to write a fairly solid framework that makes sure that, after every match, all players' scores are updated appropriately. The "per-match-score" could still be in the Match model, but the Player model would have the "aggregated" score of all matches played.
In order to do this, it would also be handy to add a list of matches played by each player into their model, so the Player model would now be:
class Player(ndb.Model):
    name = ndb.StringProperty()
    rank = ndb.IntegerProperty()
    matches = ndb.KeyProperty(repeated=True)
Player.matches would essentially be a list of keys of the matches played by the user, so that they are easier to fetch when looking at a player's details and match history.
Alternatively, Player.matches could be an ndb.JsonProperty() if you would like to store additional information about the matches played, as the one I initially suggested (ndb.KeyProperty(repeated=True)) is fairly limited in what it can store (it's only a list of keys).
Hope this helps a bit!
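The "update every player's rank after each match" step could look roughly like the following plain-Python sketch. It uses the standard Elo formula; the K-factor of 32, the dict layout and the name apply_match are assumptions for illustration, not something specified in this thread, and no ndb calls are involved.

```python
K_FACTOR = 32  # assumed value; the thread does not specify one


def expected_score(rank_a, rank_b):
    """Standard Elo expectation for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rank_b - rank_a) / 400.0))


def apply_match(players, match_id, p1, p2, p1_score, p2_score):
    """Update both players' rank and match history in place.

    players: dict mapping a player id to {'rank': int, 'matches': list}.
    """
    if p1_score > p2_score:
        actual = 1.0   # player 1 won
    elif p1_score < p2_score:
        actual = 0.0   # player 1 lost
    else:
        actual = 0.5   # draw
    delta = K_FACTOR * (actual - expected_score(players[p1]["rank"], players[p2]["rank"]))
    players[p1]["rank"] = int(round(players[p1]["rank"] + delta))
    players[p2]["rank"] = int(round(players[p2]["rank"] - delta))
    for pid in (p1, p2):
        players[pid]["matches"].append(match_id)  # mirrors Player.matches


players = {
    "alice": {"rank": 1500, "matches": []},
    "bob": {"rank": 1500, "matches": []},
}
apply_match(players, "match-1", "alice", "bob", 3, 1)
print(players["alice"]["rank"], players["bob"]["rank"])
```

With ndb, the two Player entities would have to be read and written together in one transaction so that both ranks change or neither does.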

Should the size of entities be as small as possible when I count them by "count()" method?

I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
    title = db.StringProperty()
    message = db.StringProperty()
    created_on = db.DateTimeProperty()
    created_by = db.ReferenceProperty(User)
    category = db.StringProperty()
And there are 100000000 entities made of this model.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
    category = db.StringProperty()
    look_message = db.ReferenceProperty(Message)
Does this small model make it faster to count?
And does it use less memory?
I'm thinking to count them like the following by the way
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100000000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce operation to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats

# KindStat holds per-kind statistics; all() is a class method
kind_stats = stats.KindStat.all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think that you have to create another entity like this.
This entity will just count the number of messages by category.
Just change your category to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the message class you change to reference the category class, just change the category property to:
category = db.ReferenceProperty(Category)
When you create a new Message, you have to update the counter: increment it when you create a message and decrement it when you delete one.
The best way to work with counters on GAE is to use sharded counters.
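For readers unfamiliar with the idea, here is a minimal in-memory sketch of a sharded counter. It is plain Python rather than datastore code: each list slot stands in for one shard entity, and the class name ShardedCounter and the shard count of 20 are assumptions. Writes are spread across shards to avoid contention on a single entity; reads sum all shards.

```python
import random


class ShardedCounter(object):
    """In-memory stand-in for a counter split across several shard entities."""

    def __init__(self, num_shards=20):  # shard count is an assumed value
        self.shards = [0] * num_shards

    def increment(self, delta=1):
        # Pick a random shard so concurrent writers rarely hit the same one.
        index = random.randrange(len(self.shards))
        self.shards[index] += delta

    def total(self):
        # Reading the count means summing every shard.
        return sum(self.shards)


# One counter per category, updated whenever a message is created or deleted.
counters = {"book": ShardedCounter()}
for _ in range(100):
    counters["book"].increment()      # message created
counters["book"].increment(-1)        # message deleted
print(counters["book"].total())
```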
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method like a Sharded Counter, MapReduce or Materialized View/Fork Join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

What's the best way to model polls?

The domain is like this:
class Poll(db.Model):
    question = db.StringProperty()
    ...

class Choice(db.Model):
    poll = db.ReferenceProperty(Poll)
    choice = db.StringProperty()

class Vote(db.Model):
    user = db.ReferenceProperty(User)
    choice = db.ReferenceProperty(Choice)
(This is not actually a definitive model, it's just a pseudo-diagram)
The things I need to query are:
Total number of votes for each poll on screen
Total number of votes for each option for each poll on screen
If the current user voted, for each poll
I have come up with some other schemas using sharded counters and list properties, but none (given my own limitations) seems to work. Oh, and of course, it needs to be super fast :)
Could you help me model my data?
Thank you
Edit: Thanks to @Nick Johnson I can give a more accurate description of my problem. He suggested this schema:
class Poll(db.Model):
    question = db.StringProperty(indexed=False, required=True)
    choices = db.StringListProperty(indexed=False, required=True)
    votes = db.ListProperty(int, indexed=False, required=True)

class Vote(db.Model):
    # Vote is a child entity of Poll, so doesn't need an explicit reference to it
    # Vote's key name is the user_id, so users can only vote once
    user = db.ReferenceProperty(User, required=True)
    choice = db.IntegerProperty(required=True)
The problem with this is that I can't efficiently query whether the user has voted on a particular poll. Also, I want this schema to hold up to, let's say, 1MM votes per poll (maybe I'll never get there, but I would like to aim for it).
To solve this I was thinking of adding an EntityIndex like this:
class PollIndex(db.Model):
    # PollIndex is child of Poll
    voters = db.ListProperty(db.Key)
    voters_choices = db.ListProperty()
    # other search parameters
Then when I have to query for a list of polls I can do it with only 2 queries:
# get keys from PollIndex where the user is not there
# get keys from PollIndex where the user is there
# grab all the polls
Another cool thing is that if the number of voters grows I can dynamically add more PollIndexes.
What do you think of this approach?
The answer somewhat depends on what you expect the maximum sustained rate of updates to the poll to be. I'll assume initially that it's going to be quite limited (<1 per second typical, with peaks up to 10 per second).
Your design is mostly okay, except for a couple of tweaks:
Don't store choices as a separate entity, just store them as a list on the poll
Keep a running total of votes on the Poll entity for fast retrieval
With those changes, your model looks something like this:
class Poll(db.Model):
    question = db.StringProperty(indexed=False, required=True)
    choices = db.StringListProperty(indexed=False, required=True)
    votes = db.ListProperty(int, indexed=False, required=True)

class Vote(db.Model):
    # Vote is a child entity of Poll, so doesn't need an explicit reference to it
    # Vote's key name is the user_id, so users can only vote once
    user = db.ReferenceProperty(User, required=True)
    choice = db.IntegerProperty(required=True)

# Here's how we record a vote
# Run this inside a transaction (e.g. db.run_in_transaction) so the check and the writes are atomic
def record_vote(poll_key, user, choice_idx):
    # We assume 'user' is an instance of a datastore model, and has a property 'user' that is
    # a users.User object
    poll = Poll.get(poll_key)
    vote = Vote.get_by_key_name(user.user.user_id(), parent=poll)
    if vote:
        # User has already voted
        return
    # 'choice' is required, so it must be set when the Vote is created
    vote = Vote(key_name=user.user.user_id(), parent=poll, user=user, choice=choice_idx)
    poll.votes[choice_idx] += 1
    db.put([vote, poll])
If you need higher throughput, you should modify the Vote record to not be a child of Poll (and change its key name to incorporate both poll ID and user ID), and then either use write-behind counters with Memcache or a pull queue to aggregate the results into updates to the Poll totals.
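Because the vote's key is derived from the poll and the user, "has this user voted?" becomes a key lookup rather than a query. The plain-Python sketch below illustrates this with dicts standing in for the datastore; the names vote_key, record_vote and voted_polls and the "poll:user" key format are hypothetical. In the datastore, the lookup for a page of polls would roughly correspond to one batch get by key.

```python
# Plain-Python illustration (not GAE code); dicts stand in for the datastore.


def vote_key(poll_id, user_id):
    """Deterministic key combining poll and user, so each pair is unique."""
    return "%s:%s" % (poll_id, user_id)


def record_vote(votes, tallies, poll_id, user_id, choice_idx):
    """Record a vote once per (poll, user); return False if already voted."""
    key = vote_key(poll_id, user_id)
    if key in votes:
        return False
    votes[key] = choice_idx
    tallies[poll_id][choice_idx] += 1   # running total kept with the poll
    return True


def voted_polls(votes, poll_ids, user_id):
    """For the polls on screen, report whether the user has voted on each."""
    return {pid: vote_key(pid, user_id) in votes for pid in poll_ids}


votes = {}
tallies = {"poll-1": [0, 0, 0], "poll-2": [0, 0]}
record_vote(votes, tallies, "poll-1", "user-a", 2)
record_vote(votes, tallies, "poll-1", "user-a", 0)   # ignored: already voted
print(tallies["poll-1"], voted_polls(votes, ["poll-1", "poll-2"], "user-a"))
```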

How to order by the field stored in the separate model?

Here is simplified version of my datastore structure:
class News(db.Model):
    title = db.StringProperty()

class NewsRating(db.Model):
    user = db.IntegerProperty()
    rating = db.IntegerProperty()
    news = db.ReferenceProperty(News)
Now I need to display all news sorted by their total rating (sum of different users ratings). How can I do that in the following code:
news = News.all()
# filter by additional params
# news.filter("city =", "1")
news.order("-added") # ?
for one_news in news:
    # title is a property, not a method
    self.response.out.write(one_news.title + '<br>')
Queries only have access to the entity you're querying against. If you want to order results by a property from another entity (or by some aggregate calculated from other entities), you need to store it on the entity you're querying against.
In the case of ratings, that might mean a periodic task that sums up ratings and distributes them to articles.
To do that at query time, you would need to run a query fetching every single NewsRating referencing each News entity and sum all the ratings (the datastore does not provide JOINs). This would be a huge task in both time and cost. I'd recommend taking a look at the just-overheard-it example as a reference point.
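The aggregation that a periodic task would perform can be sketched in plain Python as follows. This is an illustration only: ratings are modelled as (news_key, value) pairs, and the function names total_ratings and news_by_rating are hypothetical. In the datastore, the computed total would be written back to a property on News so that the query itself can order by it.

```python
def total_ratings(ratings):
    """Sum rating values per news key from (news_key, value) pairs."""
    totals = {}
    for news_key, value in ratings:
        totals[news_key] = totals.get(news_key, 0) + value
    return totals


def news_by_rating(news_keys, ratings):
    """Return news keys ordered by total rating, highest first."""
    totals = total_ratings(ratings)
    # Unrated news counts as 0; sorted() is stable, so ties keep input order.
    return sorted(news_keys, key=lambda key: totals.get(key, 0), reverse=True)


ratings = [("news-a", 1), ("news-b", 5), ("news-a", 2), ("news-c", -1)]
print(news_by_rating(["news-a", "news-b", "news-c", "news-d"], ratings))
```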
