Complicated, costly many-to-many entity references - google-app-engine

I have four main kinds: Account, Company, Service and Address. I would like Address entities to be shared between Company and Service entities.
Account: this is the user account (email, password)
Company: A business which provides Services (ancestor: Account)
Service: A service rendered by a Company (ancestor: Company)
Address: An address (group of fields: street, city, country) of a Company OR a Service (ancestor: Account)
The Challenge:
Company and Service entities may have different addresses; after all, a company's address is not necessarily where its services are acquired. Services may have many addresses, since a company may set up different franchises/outlets where its services may be acquired.
I would like to model data in such a way that Addresses can be referenced by either Company or Service entities, or both. I have tried these two approaches:
Let's assume this is the Address model:
from google.appengine.ext import ndb

class Address(ndb.Model):
    street = ndb.StringProperty(required=True)
    city = ndb.StringProperty(required=True)
    country = ndb.StringProperty(required=True)
Approach 1: Store list of address keys inside Service or Company
class Service(ndb.Model):
    title = ndb.StringProperty(required=True)
    addresses = ndb.KeyProperty(repeated=True)
class Company(ndb.Model):
    name = ndb.StringProperty(required=True)
    addresses = ndb.KeyProperty(repeated=True)
Problem: For each page view of Service or Company, I would need to perform additional queries to fetch their respective addresses. This grows into a big, expensive problem as our entities grow in number.
Approach 2: Create an AddressMapping entity which forms a relationship between two entities:
class Service(ndb.Model):
    title = ndb.StringProperty(required=True)
    addresses = ndb.KeyProperty(repeated=True)
class AddressMapping(ndb.Model):
    entity = ndb.StringProperty(required=True)  # service or company
    address = ndb.KeyProperty(repeated=True)
Problem: If a service is disabled/deleted/modified, we need to delete/modify all accompanying AddressMapping entities, or else they will be orphaned. Additional queries still required when viewing pages. This also seems expensive.
These are the two approaches I've come up with; they both seem bad. Any ideas on how I may improve this?

If you store keys of addresses in your Company and Service models, you do not need "additional queries to fetch them" - you can simply get all address entities that you need. This is fast and cheap.
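To make the distinction concrete, here is a minimal plain-Python sketch in which a dictionary stands in for the datastore. The names `DATASTORE` and `get_multi` are hypothetical stand-ins for illustration, not the ndb API; in ndb the corresponding call would be `ndb.get_multi(company.addresses)`.

```python
# A dict keyed by (kind, id) tuples stands in for the datastore (assumption).
DATASTORE = {
    ("Address", 1): {"street": "1 Main St", "city": "Springfield", "country": "US"},
    ("Address", 2): {"street": "2 High St", "city": "Shelbyville", "country": "US"},
    ("Address", 3): {"street": "3 Low Rd", "city": "Ogdenville", "country": "US"},
}


def get_multi(keys):
    """Look every key up directly; no query or index scan is involved."""
    return [DATASTORE.get(key) for key in keys]


# The company entity already carries the keys of its addresses.
company = {"name": "Acme", "addresses": [("Address", 1), ("Address", 3)]}
addresses = get_multi(company["addresses"])
cities = [address["city"] for address in addresses]
```

Because the keys are already stored on the entity, the lookup cost depends only on how many addresses that one entity references, not on how many Address entities exist overall.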

This is a pretty standard problem with the datastore. The solution is denormalisation. By allowing duplicates you can break the problem down into a one-to-many relationship. So in your example: Allow Address duplicates and let each Address have a parent, either Company or Service. Or split your Address entity into two (ServiceAddress, CompanyAddress).
When you now modify a Service or Company, you can do a simple ancestor query for the addresses and you will only get the corresponding addresses.
This approach assumes that you will not update Address (or any other entity you have) more than once per second, since otherwise you will run into the limit of 1 write per second per entity group.
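The denormalised layout can be sketched in plain Python as follows. This is only an illustration under stated assumptions: keys are modelled as tuples of (kind, id) pairs, `child_key` and `ancestor_query` are hypothetical helpers, and Company and Service are treated as independent roots here so that each ancestor query returns only its own addresses.

```python
def child_key(parent, kind, ident):
    """Build a child key whose path starts with the parent's path."""
    return parent + ((kind, ident),)


def ancestor_query(store, ancestor, kind):
    """Return entities of `kind` whose key path starts with `ancestor`."""
    n = len(ancestor)
    return [
        entity
        for key, entity in sorted(store.items())
        if len(key) > n and key[:n] == ancestor and key[-1][0] == kind
    ]


company_key = (("Company", "acme"),)
service_key = (("Service", "repairs"),)

store = {}
# The same real-world address is deliberately stored twice, once per owner.
hq = {"street": "1 Main St", "city": "Springfield", "country": "US"}
store[child_key(company_key, "Address", 1)] = dict(hq)
store[child_key(service_key, "Address", 1)] = dict(hq)
store[child_key(service_key, "Address", 2)] = {
    "street": "9 Mall Rd", "city": "Shelbyville", "country": "US"}

company_addresses = ancestor_query(store, company_key, "Address")
service_addresses = ancestor_query(store, service_key, "Address")
```

The price of the duplication is that a change to the shared real-world address has to be written to every copy.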

Related

GAE: NDB query on list

Google App Engine NDB data model like so:
Users
Username
FirstName
LastName
Posts
PostID
PosterUsername
SubscribedPosts
PostID
SubscriberUsername
For a specific user, I want to return all the Posts which the user is subscribed to and display them on the page.
Since the wonderful NDB doesn't support JOINs, we do two queries:
postIDList = SubscribedPosts.query(SubscribedPosts.SubscriberUsername == 'johndoe').fetch()
This gives us a list of SubscribedPosts. So how do I take my postIDList list and use it as a filter criteria for a Posts query?
Something like:
results = Posts.query(Post.PostID IN postIDList.PostID)
In a normal relational database, this would be a simple query using table joins. How is this done in Google's ndb?
You are going to run into lots of bottlenecks if you try to design your datastore models the same way you would design tables in a relational database as you have in this example.
Your comment goes in one possible right direction, although there are other solutions. Going that route, I would drop the "subscribedPosts" model altogether and use a repeated KeyProperty in the User model to store subscribed posts.
See this related post: One-To-Many Example in NDB
Seems you are looking to model a many-to-many relationship, not one-to-many. Read Modelling Entity Relationships (although this is for the older db, not the newer ndb, it still gives the idea).
One of the two entities should maintain a list of keys (repeated=True) of the related other entities. Which entity should hold the list? Preferably the list should be on the side that usually has fewer relationships so that the list of keys is smaller. Another consideration is which side likely has less contention for updates.
In your specific case, let's say on average users subscribe to 10 posts and let's say on average each post has 100 users subscribed to it. In this case, we would want to put the list of keys on the Users side of the relation.
class Users(ndb.Model):
    user_name = ndb.StringProperty()
    first_name = ndb.StringProperty()
    last_name = ndb.StringProperty()
    posts = ndb.KeyProperty(kind='Posts', repeated=True)
class Posts(ndb.Model):
    post_id = ndb.StringProperty()
    poster_user_name = ndb.StringProperty()
Establish the relationship by adding to the list in the Users instance:
current_user.posts.append(current_post.key)
For a given Users instance, getting all subscribed Posts is easy since the list of keys of the subscribed Posts is already within the given Users:
ndb.get_multi(given_user.posts)
For a given Posts instance, get all subscribing Users by ...
query = Users.query(Users.posts == given_post.key)
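Both directions of the relationship can be illustrated without the datastore. The sketch below uses plain dicts as stand-ins for Users and Posts entities; `subscribed_posts` and `subscribers` are hypothetical helper names that mirror the `ndb.get_multi` call and the repeated-property filter above.

```python
# Plain dicts stand in for Posts and Users entities (assumption of the sketch).
posts = {
    "p1": {"title": "First"},
    "p2": {"title": "Second"},
    "p3": {"title": "Third"},
}
users = {
    "johndoe": {"posts": ["p1", "p3"]},
    "janedoe": {"posts": ["p3"]},
}


def subscribed_posts(user_name):
    """Forward direction: follow the keys stored on the user."""
    return [posts[key] for key in users[user_name]["posts"]]


def subscribers(post_key):
    """Reverse direction: find users whose key list contains the post."""
    return sorted(
        name for name, user in users.items() if post_key in user["posts"]
    )
```

The forward direction is a set of direct lookups, while the reverse direction has to search across users, which is why the list belongs on the side that is read more often and has fewer entries.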

Sharded ancestor entities in GAE

I'm working on a GAE-based project involving a large user base (possibly millions of users). We use Datastore for persistency. Users will be identified both by username and by e-mail address, so these two properties should be unique across all entities of the kind. Because Datastore doesn't support unique fields other than ID, we need transactions to ensure uniqueness of these fields when new users are registered. And in order to have transactions, User entities need to be enclosed in entity groups.
Having large entity groups is not recommended, as pointed out here. Therefore, given a possible large number of stored users, I'm thinking of putting them into multiple smaller entity groups. Each group would have a common parent with ID generated from the two unique fields (a piece of the MD5 sum for instance). Inserting a new user could look like this (in Python):
@ndb.transactional
def register_new_user(login, email, full_name):
    # validation code omitted
    user = User(login=login, email=email, full_name=full_name)
    group_id = a_simple_hash(login, email)
    group_key = ndb.Key('UserGroup', group_id)
    query = User.query(ancestor=group_key).filter(ndb.OR(User.login == login, User.email == email))
    if not query.get():
        user.put()
One problem I see with this solution is that it will be impossible to get a User by ID alone. We'd have to use complete entity keys.
Are there any other cons of such an approach? Has anyone tried something similar?
EDIT
As was pointed out to me in the comments, a hash like the one outlined above would not work properly because it would only prevent registering users having non-unique e-mails together with non-unique usernames matching those e-mails. It would work if the hash was computed based on a single field.
Nevertheless, I find the concept of such sharding interesting by itself and perhaps worthy of discussion.
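The flaw described in the edit can be seen in a few lines of plain Python. Here `group_id` is a hypothetical stand-in for `a_simple_hash`, assumed to take the first 8 hex digits of an MD5 digest over the given fields.

```python
import hashlib


def group_id(*fields):
    """Hypothetical shard id: first 8 hex digits of an MD5 over the fields."""
    joined = "\x00".join(fields).encode("utf-8")
    return hashlib.md5(joined).hexdigest()[:8]


# Same login, different e-mail: the combined hash places the two users in
# different groups, so an ancestor query inside one group never sees the other.
g1 = group_id("alice", "alice@example.com")
g2 = group_id("alice", "other@example.com")

# Hashing the login alone sends every "alice" to the same group.
h1 = group_id("alice")
h2 = group_id("alice")
```

With a single-field hash the uniqueness check for that field stays inside one entity group, but the second field would then need its own check.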
An e-mail address is owned by a user and is unique, so there is only a very small chance that somebody else will (try to) use the same e-mail address.
So my approach would be: get_or_insert a new login, which makes it easy to log in (by key), and then verify that the e-mail address is unique.
If it is not unique, you can discard it or ... do something else.
Entity groups have meaning for transactions. I am interested in your planned transactions, because I do not understand your entity group key hash. Which entities will be part of the entity group, and why?
If I understand your hash correctly, a user with the same login will be part of another entity group?
It looks like your entity group holds a single entity.
In my opinion you're overthinking this: what's the probability of having two users register with the same username at the same time?
Very slim. Eventual consistency is good enough for this case, as you don't need nanosecond precision...
unless you plan to have more users than Facebook, with people registering every second.
Registering with the same email is virtually impossible for different users, since the check has already been done by the email provider for you!
Only a user could try to open two accounts with the same email address. Eventual consistency is good enough for this query too.
Your user entities each belong to their own entity group.
Actually in most use cases, your User is the most obvious root entity : people use the datastore because they need scalability, and most of the time huge scale is needed for user oriented apps.

How would I achieve this using Google App Engine Datastore?

I am a beginner to Datastore and I am wondering how I should use it to achieve what I want to do.
For example, my app needs to keep track of customers and all their purchases.
Coming from relational database, I can achieve this by creating [Customers] and [Purchases] table.
In Datastore, I can make [Customers] and [Purchases] kinds.
Where I am struggling is the structure of the [Purchases] kind.
If I make [Purchases] as the child of [Customers] kind, would there be one entity in [Customers] and one entity in [Purchases] that share the same key? Does this mean inside of this [Purchases] entity, I would have a property that just keeps increasing for each purchase they make?
Or would I have one [Purchases] entity for each purchase they make and in each of these entities I would have a property that points to an entity in [Customers] kind?
How does Datastore perform in these scenarios?
Sounds like you don't fully understand ancestors. Let's go with the non-ancestor version first, which is a legitimate way to go:
class Customer(ndb.Model):
    # customer data fields
    name = ndb.StringProperty()
class Purchase(ndb.Model):
    customer = ndb.KeyProperty(kind=Customer)
    # purchase data fields
    price = ndb.IntegerProperty()
This is the basic way to go. You'll have one entity in the datastore for each customer. You'll have one entity in the datastore for each purchase, with a keyproperty that points to the customer.
If you have a purchase and need to find the associated customer, it's right there.
purchase_entity.customer.get()
If you have a Customer, you can issue a query to find all the purchases that belong to the customer:
Purchase.query(Purchase.customer == customer_entity.key).fetch()
In this case, whenever you write either a customer or purchase entity, the GAE datastore will write that entity to any one of the datastore machines running in the cloud that's not busy. You can have really high write throughput this way. However, when you query for all the purchases of a given customer, you just read back the most current data in the indexes. If a new purchase was added but the indexes have not been updated yet, then you may get stale data (eventual consistency). You're stuck with this behavior unless you use ancestors.
Now as for the ancestor version. The basic concept is essentially the same. You still have a customer entity, and separate entities for each purchase. The purchase is NOT part of the customer entity. However, when you create a purchase using a customer as an ancestor, it (roughly) means that the purchase is stored on the same machine in the datastore that the customer entity was stored on. In this case, your write performance is limited to the performance of that one machine, and is advertised as one write per second. As a benefit though, you can query that machine using an ancestor query and get an up-to-date list of all the purchases of a given customer.
The syntax for using ancestors is a bit different. The customer part is the same. However, when you create purchases, you'd create it as:
purchase1 = Purchase(parent=customer_entity.key)
purchase2 = Purchase(parent=customer_entity.key)
This example creates two separate purchase entities. Each purchase will have a different key, and the customer has its own key as well. However, each purchase key will have the customer_entity's key embedded in it. So you can think of the purchase key being twice as long. However, you don't need to keep a separate KeyProperty() for the customer anymore, since you can find it in the purchase's key.
class Purchase(ndb.Model):
    # you don't need a KeyProperty for the customer anymore
    # purchase data fields
    price = ndb.IntegerProperty()
# the customer is reachable through the purchase's key
purchase.key.parent().get()
And in order to query for all the purchases of a given customer:
Purchase.query(ancestor=customer_entity.key).fetch()
The actual structure of the entities doesn't change much; mostly the syntax does. But the ancestor queries are fully consistent.
The third option that you kinda describe is not recommended. I'm just including it for completeness. It's a bit confusing, and would go something like this:
class Purchase(ndb.Model):
    # purchase data fields
    price = ndb.IntegerProperty()
class Customer(ndb.Model):
    purchases = ndb.StructuredProperty(Purchase, repeated=True)
This is a special case which uses ndb.StructuredProperty. In this case, you will only have a single Customer entity in the datastore. While there's a class for purchases, your purchases won't get stored as separate entities - they'll just be stored as data within the Customer entity.
There may be a couple of reasons to do this. You're only dealing with one entity, so your data fetch will be fully consistent. You also have reduced write costs when you have to update a bunch of purchases, since you're only writing a single entity. And you can still query on the properties of the Purchase class. However, this was designed for only having a limited number of repeated objects, not hundreds or thousands. And each entity is limited to a total size of 1MB, so you'll eventually hit that and you won't be able to add more purchases.
(from your personal tags I assume you are a java guy, using GAE+java)
First, don't use the ancestor relationships - this has a special purpose to define the transaction scope (aka Entity Groups). It comes with several limitations and should not be used for normal relationships between entities.
Second, do use an ORM instead of low-level API: my personal favourite is objectify. GAE also offers JDO or JPA.
In GAE relations between entities are simply created by storing a reference (a Key) to an entity inside another entity.
In your case there are two possibilities to create a one-to-many relationship between Customer and its Purchases.
public class Customer {
    @Id
    public Long customerId; // 'Long' identifiers are autogenerated
    // first option: parent-to-children references
    public List<Key<Purchase>> purchases; // one-to-many parent-to-child
}
public class Purchase {
    @Id
    public Long purchaseId;
    // option two: child-to-parent reference
    public Key<Customer> customer;
}
Whether you use option 1 or option 2 (or both) depends on how you plan to access the data. The difference is whether you use get or query. The difference between the two is in cost and speed, get always being faster and cheaper.
Note: references in GAE Datastore are manual, there is no referential integrity: deleting one part of a relationship will produce no warning/error from Datastore. When you remove entities it's up to your code to fix references - use transactions to update two entities consistently (hint: no need to use Entity Groups - to update two entities in a transaction you can use XG transactions, enabled by default in objectify).
I think the best approach in this specific case would be to use a parent structure.
class Customer(ndb.Model):
    pass
class Purchase(ndb.Model):
    pass
customer = Customer()
customer_key = customer.put()
purchase = Purchase(parent=customer_key)
You could then get all purchases of a customer using
purchases = Purchase.query(ancestor=customer_key)
or get the customer who bought the purchase using
customer = purchase.key.parent().get()
It might be a good idea to keep track of the purchase count indeed when you use that value a lot.
You could do that using a _pre_put_hook or _post_put_hook
class Customer(ndb.Model):
    count = ndb.IntegerProperty(default=0)
class Purchase(ndb.Model):
    def _post_put_hook(self, future):
        # TODO check whether this is a new entity.
        customer = self.key.parent().get()
        customer.count += 1
        customer.put()
It would also be good practice to do this action in a transaction, so the count is reset when putting the purchase fails and the other way around.
@ndb.transactional
def save_purchase(purchase):
    purchase.put()

GQL + Join Table Query Replacement for Google App Engine Datastore

Given the following Many to Many Relationship designed in Google App Engine Datastore:
User
PK: UserID
Name
Company
PK: CompanyID
Name
CompanyReview
CK CompanyID
CK UserID
ReviewContent
To optimize the query, what is the best way to query these relationship tables to show the selected company's reviews by users?
Currently, I'm doing the following:
results = CompanyReview.all().filter('owned_by = ', company).filter('written_by = ', user).fetch(10)
where I'm able to retrieve the data of CompanyReview table. However, in this case, I would need to check against the UserID from this CompanyReview table against the User table in order to obtain the name of the users who have commented for the selected company.
Is there a better solution to grab the user name as well, all in one statement in this case or at least better optimized solution? Performance is emphasized.
It depends on which side of the relationship will have more values. As described in this article of the Google App Engine docs, you can model many-to-many relationships by using a list of keys in one side of the relationship. "This means you should place the list on side of the relationship which you expect to have fewer values".
If both sides of the relationship will have many values, then you will really need the CompanyReview model. But pay attention to what the article says:
However, you need to be very careful because traversing the
connections of a collection will require more calls to the datastore.
Use this kind of many-to-many relationship only when you really need
to, and do so with care to the performance of your application.
This is because it uses ReferenceProperty in the relationship model:
class ContactCompany(db.Model):
    contact = db.ReferenceProperty(Contact,
                                   required=True,
                                   collection_name='companies')
    company = db.ReferenceProperty(Company,
                                   required=True,
                                   collection_name='contacts')
    title = db.StringProperty()
So if in Contact entities we try to access the companies, it will make a new query. And if in ContactCompany entities we try to get attributes of contact, as in contact_company.contact.name, a query for that single contact will also be made. Read the ReferenceProperty docs for more info.
Extra:
Since you are performance-savvy, I recommend using a decorator for memcaching function returns and using this excellent layered storage library for Google App Engine.
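As a rough idea of what such a decorator does, here is a minimal sketch in plain Python. It is not the library mentioned above: an in-process dict (`_CACHE`) stands in for memcache, and `cached` and `reviewer_name` are hypothetical names used only for illustration.

```python
import functools

_CACHE = {}  # in-process dict standing in for memcache (assumption)


def cached(func):
    """Cache return values keyed by function name and positional arguments."""
    @functools.wraps(func)
    def wrapper(*args):
        key = (func.__name__, args)
        if key not in _CACHE:
            _CACHE[key] = func(*args)
        return _CACHE[key]
    return wrapper


calls = []  # records every simulated datastore hit


@cached
def reviewer_name(user_id):
    calls.append(user_id)
    return {"u1": "Ann", "u2": "Bob"}.get(user_id)


names = [reviewer_name("u1"), reviewer_name("u1"), reviewer_name("u2")]
```

The second lookup for "u1" is served from the cache, so only two simulated datastore hits are recorded for three calls. A real memcache-backed version would also need expiry and invalidation when a user is renamed.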

Django - Designing Model Relationships - Admin interface and Inline

I think my understanding of Django's FK and admin is a bit faulty, so I'd value any input on how to model the below case.
Firstly, we have generic Address objects. Then, we have Users, who each have a UserProfile. Through this, Users belong to departments, as well as having addresses.
Departments themselves can also have multiple addresses, as well as a head of department. So it might be something like (this is something I'm just hacking up now):
class Address(models.Model):
    street_address = models.CharField(max_length=20)
    # etc...
class Department(models.Model):
    name = models.CharField(max_length=20)
    head_of_department = models.OneToOneField(User)
    address = models.ForeignKey(Address)
class UserProfile(models.Model):
    user = models.ForeignKey(User, unique=True)
    address = models.ForeignKey(Address)
    department = models.OneToOneField(Department)
Anyhow, firstly, is that the right way of setting up the relationships?
And secondly, I'd like it to appear in the admin that you can edit a department, and on that page, it'd have an inline list of all the addresses to also edit. I've tried setting up an AddressInline class, and attaching it as an inline to Department.
class AddressInline(admin.TabularInline):
    model = Address
class DepartmentAdmin(admin.ModelAdmin):
    inlines = [AddressInline]
However, when I try to display that, I get:
Exception at /admin/people/department/1/
<class 'people.models.Address'> has no ForeignKey to <class 'people.models.Department'>
Cheers,
Victor
Since it seems you want a UserProfile or Department to have potentially many addresses, your ForeignKeys are backward. A single ForeignKey can only point to one model instance, whereas there is no limit on the number of ForeignKeys that can point to a single model instance. So your ForeignKey should be on Address (in which case your inline would work as-is).
The complicating factor is that you have a single Address model and you want to relate it to two other models; a single ForeignKey on Address can't point to both UserProfile and Department. One solution is to have two address models (DepartmentAddress, with a ForeignKey to Department, and UserAddress, with a ForeignKey to UserProfile). You could reduce duplication in your code by having these both inherit from an abstract base class containing all the data fields, but you still end up with two mostly-identical tables in your database.
The other option is to have a GenericForeignKey on Address, which can point to an instance of any model. Your inline would then need to become a GenericInlineModelAdmin. This violates pure database normalization and doesn't allow your database to do proper integrity checking. If you had potentially more models in the future that would also have addresses, I'd consider this; if it's likely to be limited to only the current two, I might go with the above option instead.
I read what you want from your models as:
A department has one to many addresses
A department has one and only one user (as head of department)
A user (through his profile) belongs to one to many departments
A user (through his profile) has one to many addresses
If that was your intent, meaning that there is no case where a user will NOT have an address or a department, and no case where a department will not have an address or a head of department, then I would say your models should read:
class Department(models.Model):
    name = models.CharField(max_length=20)
    head_of_department = models.OneToOneField(User)
    address = models.ForeignKey(Address)
class UserProfile(models.Model):
    user = models.ForeignKey(User, unique=True)
    address = models.ForeignKey(Address)
    department = models.OneToOneField(Department)
class Address(models.Model):
    street_address = models.CharField(max_length=20)
    ...
    class Meta:
        abstract = True
class UserAddress(Address):
    user_profile = models.ForeignKey(UserProfile)
class DepartmentAddress(Address):
    department = models.ForeignKey(Department)
Read more about abstract classes.
What your models are not contemplating are the possibilities that two Users will have the same address, and/or that two departments will have the same address. Since you are not specifying a unique constraint on address (that I can see), I assume that you are OK with a real-world address showing up more than once in your Address table.
If that is OK with you; fine.
The error message you are getting is stating a fact: there is no foreign key in Address to Department. You will have to reverse that relationship for the inline to work. As it stands, when editing an address you can edit any departments associated with it, but not the other way around. With the models I suggest above you should not see this error.
See the example from the docs. Notice how an Author has many Books and how the many side of the relationship is the one that can be inline.
