Datastore design: 1 large class vs. 2 classes vs. polymodel? - google-app-engine

I am interested in understanding the pros / cons of several ways to design classes for Google App Engine's Datastore.
Consider the following classes:
Option 0
from google.appengine.ext import db

class Car(db.Model):
    title = db.StringProperty()
    year = db.StringProperty()
    imgurl = db.StringProperty()
    type = db.StringProperty()
    addeddate = db.DateTimeProperty()
    external_id = db.IntegerProperty()
    # possibly 5 or 6 more properties
class Part(db.Model):
    title = db.StringProperty()
    # other stuff
Part's parent is always set to the corresponding Car on creation.
These are used in several ways:
query + list (+ sort) parts: when listing parts, I need to display the Car's title and get its external_id and year. I don't need everything, but the whole Car entity is fetched when I access part.parent (I am already prefetching parents).
query + list (+ sort) cars: only need the title, year and imgurl.
get car: page with all the car details, need all the properties.
Considering the ways I get and display my data, what is the best option (with pros/cons) between the above design and the following ones?
Option 1
class Car(db.Model):
    title = db.StringProperty()
    year = db.StringProperty()
    imgurl = db.StringProperty()
class CarEx(db.Model):
    type = db.StringProperty()
    addeddate = db.DateTimeProperty()
    external_id = db.IntegerProperty()
    # possibly 5 or 6 more properties
Pro: When fetching Parts, getting the parents (Car) is faster since there are fewer properties.
Con: When displaying a Car, we need to get the CarEx. Need to add one more entity when adding a Car. Need to delete CarEx when deleting a Car.
Option 2
from google.appengine.ext.db import polymodel  # PolyModel lives here, not directly on db

class Car(polymodel.PolyModel):
    title = db.StringProperty()
    year = db.StringProperty()
    imgurl = db.StringProperty()
class CarEx(Car):
    type = db.StringProperty()
    addeddate = db.DateTimeProperty()
    external_id = db.IntegerProperty()
    # possibly 5 or 6 more properties
When adding cars, we would only add CarEx entities.
Pro: When fetching Parts, getting the parents (Car) is faster since there are fewer properties. (I am actually not sure at all that this is true.)
Pro: When displaying a Car, we get the CarEx. No need to get another entity. Adding and deleting cars is as easy as having only 1 Car model with everything in it (Option 0).
Con: Extra writes when adding a CarEx. Other extra costs?
So overall, I need to be able to fetch parts (and their parents, without a huge cost), and I need to fetch a full Car on a separate page. I am not sure if my assumptions about PolyModel are correct, nor if there are any other hidden pros/cons, or even other options.

A few points. If you are starting out, you really should be using ndb.
The small number of properties you list is not going to make enough difference to justify splitting into Car and CarEx, especially if you need CarEx all the time.
Your use of PolyModel doesn't make sense, given how PolyModel works. PolyModel would be more suited to something like this:
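# Assuming ndb, as recommended above
from google.appengine.ext.ndb import DateTimeProperty, FloatProperty, IntegerProperty, StringProperty
from google.appengine.ext.ndb.polymodel import PolyModel

class Vehicle(PolyModel):
    title = StringProperty()
    year = StringProperty()
    addeddate = DateTimeProperty()
    external_id = IntegerProperty()
    # possibly 5 or 6 more properties
class Car(Vehicle):
    doors = IntegerProperty()
class Van(Vehicle):
    carrying_capacity = FloatProperty()  # cubic metres
class Truck(Vehicle):
    tray_length = IntegerProperty()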
Yep, contrived properties. But now I can search for all vehicles by any of the core Vehicle properties and get Trucks, Vans and Cars. You can't do this with normal model inheritance. Without PolyModel you would have to search the Car, Van and Truck entity types separately.
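A rough plain-Python imitation of why the base-class query works, assuming (as I understand PolyModel to do) that each stored entity records its full class hierarchy. Nothing here touches the datastore; class_path and query are made-up names for illustration only.

```python
# Plain-Python imitation of a PolyModel-style query; not the datastore API.
class Vehicle(object):
    def __init__(self, title, year):
        self.title = title
        self.year = year

    @classmethod
    def class_path(cls):
        # Names from the root model down to this class, e.g. ['Vehicle', 'Car'].
        return [c.__name__ for c in reversed(cls.__mro__) if c is not object]


class Car(Vehicle):
    pass


class Truck(Vehicle):
    pass


def query(entities, kind, **filters):
    # Match any entity whose class path contains `kind`, then apply equality filters.
    return [e for e in entities
            if kind.__name__ in e.class_path()
            and all(getattr(e, k) == v for k, v in filters.items())]


stored = [Car('Mini', '1999'), Truck('Hauler', '1999'), Car('Beetle', '1972')]
vehicles_1999 = query(stored, Vehicle, year='1999')  # the Car and the Truck
cars_1999 = query(stored, Car, year='1999')          # only the Car
```

Querying on Vehicle returns every subclass, while querying on Car narrows to that one class, which is the behaviour described above.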
In your case you probably don't need this.
What you do with Parts depends heavily on how many there are and how often you need them. If you are likely to have less than 1MB of Parts and you need all Parts whenever you need Parts, then consider storing them in a single container entity, using a repeated StructuredProperty. Then when you need parts you fetch them all in a single entity. If you only need some parts at a time, store them as separate entities.
If you need more than 1MB of Parts but you always need all parts then use more than one container.
You really need to look at the frequency of use of particular views, if you need all information vs some of it, to determine the best strategy.
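The "more than one container" idea can be sketched in plain Python. This is only an illustration under assumptions: JSON length stands in for the stored entity size, the 900,000-byte budget is an arbitrary safety margin below 1MB, and pack_parts is a hypothetical helper, not an ndb function.

```python
import json

# Stay under the 1MB entity limit mentioned above; the margin is an assumption.
MAX_BYTES = 900 * 1000


def pack_parts(parts, max_bytes=MAX_BYTES):
    """Group part dicts into lists whose combined JSON size stays under max_bytes.

    A single part larger than max_bytes still gets a container of its own.
    """
    containers, current, current_size = [], [], 0
    for part in parts:
        size = len(json.dumps(part))
        if current and current_size + size > max_bytes:
            # The current container is full: close it and start a new one.
            containers.append(current)
            current, current_size = [], 0
        current.append(part)
        current_size += size
    if current:
        containers.append(current)
    return containers


parts = [{'title': 'part-%d' % i} for i in range(10)]
containers = pack_parts(parts, max_bytes=60)  # tiny budget just to force a split
```

Each inner list would become the repeated property of one container entity, so reading all parts costs one fetch per container rather than one per part.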

Related

Tracking item order for storage to and retrieval from a DB

I'm trying to figure out how I'm going to 'CRUD' the order of items I have in a group that I'm storing in a database. (Pseudo statement of: select * items from app where group_id = 1;)
My guess is that I just use a numeric field and increase/decrease the number as items are added to/removed from the group. I can then update each item's number in this field as items are moved around. However, I've seen this go really badly wrong in an old legacy app where items would get out of sync and you'd have a group where the order ended up something like this:
0,1,1,3,4,5
0,1,1,1,4,5
This wasn't handled very gracefully by the application either, and broke the application necessitating manual intervention to reorder the items in the DB.
Is there a way to avoid this pitfall?
EDIT: I would also maybe want the items available in multiple groups with multiple orders.
I think in that case I would need a many to many relationship for both the group to item relationship and the item to order relationship. /EDIT
I'll be doing this in the Django framework.
I'm not really sure what you are asking; because ordering is one thing, and grouping of related objects is something else entirely.
Databases don't store the order of things, but rather the relationships (grouping) of things. The order of things is a user interface detail and not something that a database should be used for.
In django, you can create a ManyToMany relationship. This essentially creates a "box" where you can add and remove items that are related to a particular model. Here is the example from the documentation:
from django.db import models

class Publication(models.Model):
    title = models.CharField(max_length=30)
    # On Python 3: def __str__(self):
    def __unicode__(self):
        return self.title
    class Meta:
        ordering = ('title',)
class Article(models.Model):
    headline = models.CharField(max_length=100)
    publications = models.ManyToManyField(Publication)
    # On Python 3: def __str__(self):
    def __unicode__(self):
        return self.headline
    class Meta:
        ordering = ('headline',)
Here an Article can belong to many Publications, and Publications have one or more Articles associated with them:
# Objects are created through the model's manager (Model.objects)
a = Article.objects.create(headline='Hello')
b = Article.objects.create(headline='World')
p = Publication.objects.create(title='My Publication')
p.article_set.add(a)
p.article_set.add(b)
p.save()
# You can also add an article to a publication from the article object:
c = Article.objects.create(headline='The Answer is 42')
c.publications.add(p)
To know how many articles belong to a publication:
Publication.objects.get(title='My Publication').article_set.count()

How to model Player, Match, and EloRank as NDB entities

I'm trying to wrap my head around entity groups and hierarchical keys in ndb but I might be stuck in "normalized thinking". I want to compute and store different players' ranks based on how they did in different matches against each other over time. But all I can come up with is to store the "foreign keys" as strings like this:
class Player(ndb.Model):
    name = ndb.StringProperty()
class Match(ndb.Model):
    player1_key = ndb.KeyProperty(kind=Player)  # pointing to Player entity
    player2_key = ndb.KeyProperty(kind=Player)  # pointing to Player entity
    player1_score = ndb.IntegerProperty()
    player2_score = ndb.IntegerProperty()
    time = ndb.DateTimeProperty(auto_now_add=True)
class EloRank(ndb.Model):
    player_key = ndb.KeyProperty(kind=Player)  # pointing to Player entity
    match_key = ndb.KeyProperty(kind=Match)  # pointing to Match entity
    rank = ndb.IntegerProperty()
    time = ndb.DateTimeProperty(auto_now_add=True)
Sure, it would be easy to "denormalize" the data by copying it (i.e. Match would have two sub keys, one for player 1 and one for player 2), but how can I, for instance, change the name of a player without having to update each Match entity?
StructuredProperty doesn't seem to be the answer either, since they belong to the defining entity.
How would you rewrite this model to put the entities in the same group?
Update
Use KeyProperty instead of StringProperty as suggested by M12.
First of all, you may want to use the ndb.KeyProperty() to store player keys instead of a StringProperty().
If you store a reference (key) to the players participating in each match, you don't need to update every match when a user changes name, as when a match is requested by a user, the application could use the player's key to fetch their name and send it back to the user.
Next, I'd probably store the player's rank within his instance, i.e. in the Player model:
class Player(ndb.Model):
    name = ndb.StringProperty()
    rank = ndb.IntegerProperty()
This approach requires you to write a fairly solid framework that makes sure that, after every match, all users' scores are modified appropriately. The "per-match score" could still be in the Match model, but the Player model would have the "aggregated" score of all matches played.
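As a sketch of the "modify scores after every match" step, here is the standard Elo update in plain Python. The K-factor of 32 and the function names are assumptions for illustration; the answer above does not specify them.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expected score for the player rated `rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update_ranks(rank1, rank2, score1, score2, k=32):
    """Return the new (rank1, rank2) after one match, rounded to integers."""
    if score1 > score2:
        result1 = 1.0   # player 1 won
    elif score1 < score2:
        result1 = 0.0   # player 1 lost
    else:
        result1 = 0.5   # draw
    delta = k * (result1 - expected_score(rank1, rank2))
    # Elo is zero-sum: what one player gains, the other loses.
    return int(round(rank1 + delta)), int(round(rank2 - delta))


new1, new2 = update_ranks(1500, 1500, score1=3, score2=1)
```

The two returned values would be written back to Player.rank for both players once the Match has been stored.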
In order to do this, it would also be handy to add a list of matches played by each player into their model, so the Player model would now be:
class Player(ndb.Model):
    name = ndb.StringProperty()
    rank = ndb.IntegerProperty()
    matches = ndb.KeyProperty(repeated=True)
Player.matches would essentially be a list of keys to the matches played by the user, so that they are easier to fetch when looking at a Player's details and match history.
Alternatively, Player.matches could be an ndb.JsonProperty() if you would like to store additional information about the matches played, as the one I initially suggested (ndb.KeyProperty(repeated=True)) is fairly limited in what it can store (it's only a list of keys).
Hope this helps a bit!

Should the size of entities be as small as possible when I count them by "count()" method?

I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
    title = db.StringProperty()
    message = db.StringProperty()
    created_on = db.DateTimeProperty()
    created_by = db.ReferenceProperty(User)
    category = db.StringProperty()
And there are 100000000 entities made of this model.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
    category = db.StringProperty()
    look_message = db.ReferenceProperty(Message)
Does this small model make it faster to count?
And does it use less memory?
I'm thinking to count them like the following by the way
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100000000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce operation to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat.all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think that you have to create another entity like this.
This entity will just count the number of messages by category.
Just change your category to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the message class you change to reference the category class, just change the category property to:
category = db.ReferenceProperty(Category)
Whenever you create a new Message object you have to update the counter: increment it when you create a message and decrement it when you delete one.
The best way to work with counters on GAE is to use sharded counters.
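The idea behind a sharded counter, sketched in memory rather than against the datastore: spread increments over several shards so that writers rarely contend on the same record, and sum the shards to read the total. The shard count of 20 and the class below are assumptions for illustration, not App Engine API.

```python
import random

NUM_SHARDS = 20  # assumed value; more shards allow more concurrent writes


class ShardedCounter(object):
    """In-memory stand-in: each shard would be its own datastore entity."""

    def __init__(self, name, num_shards=NUM_SHARDS):
        self.name = name
        self.shards = [0] * num_shards

    def increment(self, delta=1):
        # Pick a random shard so concurrent writers rarely touch the same one.
        index = random.randrange(len(self.shards))
        self.shards[index] += delta

    def total(self):
        # Reading the count means summing every shard.
        return sum(self.shards)


book_counter = ShardedCounter('book')
for _ in range(1000):
    book_counter.increment()   # one call per new 'book' message
book_counter.increment(-1)     # a message was deleted
```

With one counter per category, reading the number of 'book' messages only touches the shards, never the Message entities.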
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method like a Sharded Counter, MapReduce or Materialized View/Fork Join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

How to avoid duplicates in GAE datastore?

Let's say here is the database structure:
class News(db.Model):
    title = db.StringProperty()
class NewsRating(db.Model):
    user = db.IntegerProperty()
    rating = db.IntegerProperty()
    news = db.ReferenceProperty(News)
Each user can leave only one rating for each News. The following code doesn't care about duplicates:
rating = NewsRating()
rating.user = 123456
rating.rating = 1
rating.news = News.get_by_key_name('news-unique-key')
rating.put()
How should I modify this so that it allows only one record for each rating.user and rating.news combination? If such a rating already exists, it should be updated with the new value.
Use key names and (possibly) parent entities to keep track. For instance, supposing you have a UserInfo kind, you could do it like this:
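class NewsRating(db.Model):
    # No explicit user reference, since it's the parent entity
    rating = db.IntegerProperty(required=True)
    news = db.ReferenceProperty(News)  # We could get this from the key name, but this is more convenient
# current_user is the rater's UserInfo entity. News is fetched by key name in the
# question, so use id_or_name(); id() alone would be None for a named key.
rating = NewsRating(parent=current_user, key_name=str(news.key().id_or_name()), news=news, rating=1)
rating.put()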
Attempting to add the same rating multiple times will simply overwrite the existing one, or you can use a datastore transaction to add it atomically.
Note that you should almost certainly keep a total of ratings against the News entity, rather than counting up ratings on each request, which will get less efficient as the number of ratings increases.
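A plain-Python sketch of both points, using a dict in place of the datastore: the (user, news) pair acts like the parent plus key name, so writing again overwrites, and the News record carries a running total that is adjusted by the difference. NewsRecord, rate, rating_total and rating_count are hypothetical names, not part of the models above.

```python
class NewsRecord(object):
    """Stand-in for a News entity that carries running rating totals."""

    def __init__(self):
        self.rating_total = 0
        self.rating_count = 0


ratings = {}  # (user_id, news_key_name) -> rating; plays the role of key names


def rate(news, news_key_name, user_id, value):
    key = (user_id, news_key_name)
    previous = ratings.get(key)
    if previous is None:
        # First rating from this user for this news item.
        news.rating_count += 1
        previous = 0
    # Apply only the difference so a changed rating is not counted twice.
    news.rating_total += value - previous
    ratings[key] = value


news = NewsRecord()
rate(news, 'news-unique-key', 123456, 1)
rate(news, 'news-unique-key', 123456, 5)  # same user changes their rating
rate(news, 'news-unique-key', 42, 3)
```

On the real datastore the read, the total adjustment and the write would presumably need to happen inside a transaction, as the answer suggests.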

When Expando Class should be used in Google App Engine Apps?

What are the applications for Google App Engine Expando Class?
And what are the good practices related to it?
Two common uses of Expandos are partially-fixed schemas and deleting old properties.
I frequently use Expando when I have a kind that needs slightly different properties across entities; in other words, when I need a 'partially' dynamic schema. One use case is an application that takes orders where some products are liquid (think water), some are physical units (think DVDs), and some are 'other' (think flour). Some fields, like item code, price and quantity, are always needed. But what if the details of how the quantity was computed are also needed?
Typically a fixed-schema solution would be to add a property for all of the variables we might use: weight, dimension, before and after weights of our stock, and so on. That sucks. For every entity most of the other fields are not needed.
class Order(db.Model):
    # These fields are always needed.
    item_code = db.StringProperty()
    unit_of_measure = db.StringProperty()
    unit_price = db.FloatProperty()
    quantity = db.FloatProperty()
    # These fields are used depending on the unit of measure.
    weight = db.FloatProperty()
    volume = db.FloatProperty()
    stock_start_weight = db.FloatProperty()
    stock_end_weight = db.FloatProperty()
With Expando we can do much better. We can use unit_of_measure to tell us how the quantity was computed. The functions that compute the quantity set the dynamic fields, and the functions that read those fields know what to look for. And the entity does not carry a bunch of unneeded properties.
class Order(db.Expando):
    # Every instance has these fields.
    item_code = db.StringProperty()
    unit_of_measure = db.StringProperty()
    unit_price = db.FloatProperty()
    quantity = db.FloatProperty()

def compute_gallons(entity, kilograms, kg_per_gallon):
    # Set the fixed fields.
    entity.unit_of_measure = 'GAL'
    entity.quantity = kilograms / kg_per_gallon
    # Set the gallon specific fields:
    entity.weight = kilograms
    entity.density = kg_per_gallon
You could achieve a similar result by using a text or blob property and serializing a dict of 'other' value to it. Expando basically 'automates' that for you.
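The serialized-dict alternative could look roughly like this. It is a plain-Python stand-in, not a datastore model: extras_json plays the part of the text property, and set_extras / get_extra are hypothetical helpers.

```python
import json


class Order(object):
    """Stand-in for a fixed-schema entity with one serialized 'extras' field."""

    def __init__(self, item_code, unit_of_measure, unit_price, quantity):
        self.item_code = item_code
        self.unit_of_measure = unit_of_measure
        self.unit_price = unit_price
        self.quantity = quantity
        self.extras_json = '{}'  # would be a text property on a real model

    def set_extras(self, **values):
        # Load the current dict, merge the new values, and serialize it back.
        extras = json.loads(self.extras_json)
        extras.update(values)
        self.extras_json = json.dumps(extras, sort_keys=True)

    def get_extra(self, name, default=None):
        return json.loads(self.extras_json).get(name, default)


order = Order('H2O', 'GAL', 1.25, 0.0)
order.set_extras(weight=37.85, density=3.785)
order.quantity = order.get_extra('weight') / order.get_extra('density')
```

The trade-off of this approach is that values buried in the serialized text cannot be filtered on in a query the way individual properties can.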
