de-normalizing data model: django/sql -> app engine - database

I'm just starting to get my head around non-relational databases, so I'd like to ask some help with converting these traditional SQL/django models into Google App Engine model(s).
The example is for event listings, where each event has a category, belongs to a venue, and a venue has a number of photos attached to it.
In django, I would model the data like this:
class Event(models.Model)
title = models.CharField()
start = models.DatetimeField()
category = models.ForeignKey(Category)
venue = models.ForeignKey(Venue)
class Category(models.Model):
name= models.CharField()
class Venue (models.Model):
name = models.CharField()
address = models.CharField()
class Photo(models.Model):
venue = models.ForeignKey(Venue)
source = models.CharField()
How would I accomplish the equivalent with App Engine models?

There's nothing here that must be de-normalized to work with App Engine. You can change ForeignKey to ReferenceProperty, CharField to StringProperty and DatetimeField to DateTimeProperty and be done. It might be more efficient to store category as a string rather than a reference, but this depends on usage context.
Denormalization becomes important when you start designing queries. Unlike traditional SQL, you can't write ad-hoc queries that have access to every row of every table. Anything you want to query for must be satisfied by an index. If you're running queries today that depend on table scans and complex joins, you'll have to make sure that the query parameters are indexed at write-time instead of calculating them on the fly.
As an example, if you wanted to do a case-insensitive search by event title, you'd have to store a lower-case copy of the title on every entity at write time. Without guessing your query requirements, I can't really offer more specific advice.

It's possible to run Django on App Engine
You need a trio of apps from here:
http://www.allbuttonspressed.com/projects
Django-nonrel
djangoappengine
djangotoolbox
Additionally, this module makes it possible to do the joins across Foreign Key relationships which are not directly supported by datastore methods:
django-dbindexer
...it denormalises the fields you want to join against, but has some limitations - doesn't update the denormalised values automatically so is only really suitable for static values
Django signals provide a useful starting point for automatic denormalisation.

Related

GAE: NDB query on list

Google App Engine NDB data model like so:
Users
Username
FirstName
LastName
Posts
PostID
PosterUsername
SubscribedPosts
PostID
SubscriberUsername
For a specific user, I want to return all the Posts which the user is subscribed to and display them on the page.
Since the wonderful NDB doesn't support JOINs, we do two queries:
postIDList =
SubscribedPosts.query(SubscribedPosts.SubscriberUsername == 'johndoe').fetch()
This gives us a list of SubscribedPosts. So how do I take my postIDList list and use it as a filter criteria for a Posts query?
Something like:
results = Posts.query(Post.PostID IN postIDList.PostID)
In a normal relational database, this would be a simple query using table joins. How is this done in Google's ndb?
You are going to run into lots of bottlenecks if you try to design your datastore models the same way you would design tables in a relational database as you have in this example.
Your comment goes in one possible right direction, although there are other solutions. Going that route, I would drop the "subscribedPosts" model altogether use a repeated KeyProperty entity in the User model to store subscribed posts.
See this related post: One-To-Many Example in NDB
Seems you are looking to model a many-to-many relationship, not one-to-many. Read Modelling Entity Relationships (althought this is for the older db, not newer ndb, it still gives the idea).
One of the two entities should maintain a list of keys (repeated=True) of the related other entities. Which entity should hold the list? Preferably the list should be on the side that usually has fewer relationships so that the list of keys is smaller. Another consideration is which side likely has less contention for updates.
In your specific case, lets say on average users subscribe to 10 posts and lets say on average each post has 100 users subscribed to it. In this case, we would want to put the list of keys on Users side of the relation.
class Users(ndb.Model):
user_name = ndb.StringProperty()
first_name = ndb.StringProperty()
last_name = ndb.StringProperty()
posts = ndb.KeyProperty(kind='Posts', repeated=True)
class Posts(ndb.Model)
post_id = ndb.StringProperty()
poster_user_name = ndb.StringProperty()
Establish the relationship by adding to the list in the Users instance:
current_user.posts.append(current_post.key)
For a given Users instance, getting all subscribed Posts is easy since the list of keys of the subscribed Posts is already within the given Users:
ndb.get_multi(given_user.posts)
For a given Posts instance, get all subscribing Users by ...
query = Users.query(Users.posts == given_post.key)

How would I achieve this using Google App Engine Datastore?

I am a beginner to Datastore and I am wondering how I should use it to achieve what I want to do.
For example, my app needs to keep track of customers and all their purchases.
Coming from relational database, I can achieve this by creating [Customers] and [Purchases] table.
In Datastore, I can make [Customers] and [Purchases] kinds.
Where I am struggling is the structure of the [Purchases] kind.
If I make [Purchases] as the child of [Customers] kind, would there be one entity in [Customers] and one entity in [Purchases] that share the same key? Does this mean inside of this [Purchases] entity, I would have a property that just keeps increasing for each purchase they make?
Or would I have one [Purchases] entity for each purchase they make and in each of these entities I would have a property that points to a entity in [Customers] kind?
How does Datastore perform in these scenarios?
Sounds like you don't fully understand ancestors. Let's go with the non-ancestor version first, which is a legitimate way to go:
class Customer(ndb.Model):
# customer data fields
name = ndb.StringProperty()
class Purchase(ndb.Model):
customer = ndb.KeyProperty(kind=Customer)
# purchase data fields
price = ndb.IntegerProperty
This is the basic way to go. You'll have one entity in the datastore for each customer. You'll have one entity in the datastore for each purchase, with a keyproperty that points to the customer.
IF you have a purchase, and need to find the associated customer, it's right there.
purchase_entity.customer.get()
If you have a Customer, you can issue a query to find all the purchases that belong to the customer:
Purchase.query(customer=customer_entity.key).fetch()
In this case, whenever you write either a customer or purchase entity, the GAE datastore will write that entity any one of the datastore machines running in the cloud that's not busy. You can have really high write throughput this way. However, when you query for all the purchases of a given customer, you just read back the most current data in the indexes. If a new purchase was added, but the indexes not updated yet, then you may get stale data (eventual consistency). You're stuck with this behavior unless you use ancestors.
Now as for the ancestor version. The basic concept is essentially the same. You still have a customer entity, and separate entities for each purchase. The purchase is NOT part of the customer entity. However, when you create a purchase using a customer as an ancestor, it (roughly) means that the purchase is stored on the same machine in the datastore that the customer entity was stored on. In this case, your write performance is limited to the performance of that one machine, and is advertised as one write per second. As a benefit though, you can can query that machine using an ancestor query and get an up-to-date list of all the purchases of a given customer.
The syntax for using ancestors is a bit different. The customer part is the same. However, when you create purchases, you'd create it as:
purchase1 = Purchase(ancestor=customer_entity.key)
purchase2 = Purchase(ancestor=customer_entity.key)
This example creates two separate purchase entities. Each purchase will have a different key, and the customer has its own key as well. However, each purchase key will have the customer_entity's key embedded in it. So you can think of the purchase key being twice as long. However, you don't need to keep a separate KeyProperty() for the customer anymore, since you can find it in the purchases key.
class Purchase(ndb.Model):
# you don't need a KeyProperty for the customer anymore
# purchase data fields
price = ndb.IntegerProperty
purchase.key.parent().get()
And in order to query for all the purchases of a given customer:
Purchase.query(ancestor=customer_entity.key).fetch()
The actual of structure of the entities don't change much, mostly the syntax. But the ancestor queries are fully consistent.
The third option that you kinda describe is not recommended. I'm just including it for completeness. It's a bit confusing, and would go something like this:
class Purchase(ndb.Model):
# purchase data fields
price = ndb.IntegerProperty()
class Customer(ndb.Model):
purchases = ndb.StructuredProperty(Purchase, repeated=True)
This is a special case which uses ndb.StructuredProperty. In this case, you will only have a single Customer entity in the datastore. While there's a class for purchases, your purchases won't get stored as separate entities - they'll just be stored as data within the Customer entity.
There may be a couple of reasons to do this. You're only dealing with one entity, so your data fetch will be fully-consistent. You also have reduced write costs when you have to update a bunch of purchases, since you're only writing a single entity. And you can still query on the properties of the Purchase class. However, this was designed for only having a limited number or repeated objects, not hundreds or thousands. And each entity is limited to ta total size of 1MB, so you'll eventually hit that and you won't be able to add more purchases.
(from your personal tags I assume you are a java guy, using GAE+java)
First, don't use the ancestor relationships - this has a special purpose to define the transaction scope (aka Entity Groups). It comes with several limitations and should not be used for normal relationships between entities.
Second, do use an ORM instead of low-level API: my personal favourite is objectify. GAE also offers JDO or JPA.
In GAE relations between entities are simply created by storing a reference (a Key) to an entity inside another entity.
In your case there are two possibilities to create one-to-many relationship between Customer and it's Purchases.
public class Customer {
#Id
public Long customerId; // 'Long' identifiers are autogenerated
// first option: parent-to-children references
public List<Key<Purchase>> purchases; // one-to-many parent-to-child
}
public class Purchase {
#Id
public Long purchaseId;
// option two: child-to-parent reference
public Key<Customer> customer;
}
Whether you use option 1 or option 2 (or both) depends on how you plane to access the data. The difference is whether you use get or query. The difference between two is in cost and speed, get being always faster and cheaper.
Note: references in GAE Datastore are manual, there is no referential integrity: deleting one part of a relationship will produce no warning/error from Datastore. When you remove entities it's up to your code to fix references - use transactions to update two entities consistently (hint: no need to use Entity Groups - to update two entities in a transaction you can use XG transactions, enabled by default in objectify).
I think the best approach in this specific case would be to use a parent structure.
class Customer(ndb.Model):
pass
class Purchase(ndb.Model):
pass
customer = Customer()
customer_key = customer.put()
purchase = Purchase(parent=customer_key)
You could then get all purchases of a customer using
purchases = Purchase.query(ancestor=customer_key)
or get the customer who bough the purchase using
customer = purchase.key.parent().get()
It might be a good idea to keep track of the purchase count indeed when you use that value a lot.
You could do that using a _pre_put_hook or _post_put_hook
class Customer(ndb.Model):
count = ndb.IntegerProperty()
class Purchase(ndb.Model):
def _post_put_hook(self):
# TODO check whether this is a new entity.
customer = self.key.parent().get()
customer.count += 1
customer.put()
It would also be good practice to do this action in a transacion, so the count is reset when putting the purchase fails and the other way around.
#ndb.transactional
def save_purchase(purchase):
purchase.put()

Google App Engine: Datastore Query in which the WHERE clause points to a Reference Property

In my datastore I have an entity Book that has reference to Owner that has reference to ContactInfo which has a property zipcode on it. I want to query for all books within a certain zipcode. How can I do this? I understand I can't write a query where I can do:
q = db.Query(Book).filter('owner.contact_info.zipcode =', 12345)
This is exactly the sort of thing you cannot do with the App Engine datastore. It is not a relational database, and you cannot query it as one. One of the things this implies is that it does not support JOINs, and you cannot do queries across entity types.
Because of this, it is usually not a good idea to follow the full normalized form in creating your data models. Unless you have a very good reason for keeping them separate, ContactInfo should almost certainly be merged with Owner. You might also want to define a repeated ReferenceProperty on Owner that records books_owned: then you can do a simple query and some gets to get all the books:
owners = db.Query(Owner).filter('zipcode', 12345)
books = []
for owner in owners:
book_ids.extend(owner.books_owned)
books = db.get(book_ids)
Edit the field would look like this:
class Owner(db.Model):
...
books_owned = db.ListProperty(db.Key)
If you update the schema, nothing happens to the existing entities: you will need to go through them (perhaps using the remote API) and update them to add the new data. Note though that you can just set the properties directly, there's no database migration to be done.
If contact info is a separate model, you will first need to find all ContactInfo entities with zipcode == 12345, then find all of the Owner entities that reference those ContactInfo entities, then find all the Book entities that reference those Owner entities.
If you're still able to change your model definitions at all, it would probably be wise to denormalize at least ContactInfo in the Owner model, and possibly also the Owner inside each Book.

Google App Engine Entity Ownership

I am writing an app for GAE in Python which stores recipes for different users. I have an entity called User in the datastore and an entity called Recipe. I want to be able to set the owner of each Recipe to the User who created it. Also, I want each User entity to contain a list of all Recipes belonging to that User as well as being able to query the Recipe database to find all Recipes owned by a particular User.
What is the best way to go about creating this parent/child type relationship?
Thanks
There are two main ways. (I am going to assume your using python which defines examples)
Option 1. Make the User the ancestor of all of their recipes
recipe = Recipe(parent=user.key)
Option 2. Use key property
class Recipe(ndb.Model):
owner = ndb.KeyProperty()
recipe = Recipe(owner=user.key)
all recipes for user with option 1
recipes = Recipe.query(ancestor=user.key)
all recupes for user with option 2
recipes = Recipe.query().filter(Recipe.owner == user.key)
Which one you use really depends a lot on what you plan to do with the data after creation, transaction patterns etc.... You should elaborate on your use cases. Both will work.
Also you should read up on transactions entity groups and understand them to really determine if you want to use ancestors https://developers.google.com/appengine/docs/java/datastore/transactions?hl=en .
If you use db.Model, to model one-to-many relationship, you can use the RefernenceProperty constructor and specify a collection_name. For example, one book may have many reviews.
class Book(db.Model):
title = db.StringProperty()
author = db.StringProperty()
class BookReview(db.Model):
book = db.ReferenceProperty(Book, collection_name='reviews')
b = Book()
b.put()
br = BookReview()
br.book = b # sets br's 'book' property to b's key
for review in b.reviews:# use collection_name to retrieve all reviews for a book
....
see https://developers.google.com/appengine/docs/python/datastore/datamodeling#references
Alternatively, you can use ndb's KeyProperty as in Tim's answer.
Also see
db.ReferenceProperty() vs ndb.KeyProperty in App Engine

GQL + Join Table Query Replacement for Google App Engine Datastore

Given the following Many to Many Relationship designed in Google App Engine Datastore:
User
PK: UserID
Name
Company
PK: CompanyID
Name
CompanyReview
CK CompanyID
CK UserID
ReviewContent
For optimization query, what's the best way to query this relationship tables for showing the selected company's review by users.
Currently, I'm doing the following:
results = CompanyReview.all().filter('owned_by = ', company).filter('written_by = ', user).fetch(10)
where I'm able to retrieve the data of CompanyReview table. However, in this case, I would need to check against the UserID from this CompanyReview table against the User table in order to obtain the name of the users who have commented for the selected company.
Is there a better solution to grab the user name as well, all in one statement in this case or at least better optimized solution? Performance is emphasized.
It dependes on which side of the relationship will have more values. As described is this article of Google App Engine docs, you can model many-to-many relationships by using a list of keys in one side of the relationship. "This means you should place the list on side of the relationship which you expect to have fewer values".
If both sides of the relationship will have many values, then you will really need the CompanyReview model. But pay attention to what the article says:
However, you need to be very careful because traversing the
connections of a collection will require more calls to the datastore.
Use this kind of many-to-many relationship only when you really need
to, and do so with care to the performance of your application.
This is because it uses RefereceProperty in the relationship model:
class ContactCompany(db.Model):
contact = db.ReferenceProperty(Contact,
required=True,
collection_name='companies')
company = db.ReferenceProperty(Company,
required=True,
collection_name='contacts')
title = db.StringProperty()
So if in Contact entities we try to access the companies, it will make a new query. And if in ContactCompany entities we try to get attributes of contact as in contact_company.contact.name, a query for that single contact will be made also. Read the ReferencyProperty docs for more info.
Extra:
Since you are performance-savvy, I recommend using a decorator for memcaching function returns and using this excellent layered storage library for Google App Engine.

Categories

Resources