Trying to model some highly connected, but also hierarchical data in app engine.
Here is an example:
Person:
Phone Numbers:
Number: 555-555-5555, Ext: 123, Notes: Work
Number: 444-444-4444, Ext: 456, Notes: Mobile
One entity, contained data structures stored as JSON blobs:
One way to do this would be to store the phone_numbers collection as an unindexed blob of JSON text, and then add a search property so that a person could be queried by phone number:
p_entity = Person()
p_entity.phone_numbers = dbText(simplejson.dumps([{'Number':'555-555-5555', 'Ext':'123', 'Notes':'Work'},{'Number':'444-444-4444', Ext:'456', Notes:'Mobile'}]))
p_entity.phone_numbers_search_property = ['5555555555', '4444444444']
p_entity.put()
Multiple entities with parent-child relationships:
Another way would be to use child and parent entities:
person_entity = Person()
person_entity.put()
phone_entity1 = PhoneNumber(parent=person_entity)
phone_entity.Number = '5555555555'
phone_entity.Ext = '123'
phone_entity.Notes = 'Work'
phone_entity2 = PhoneNumber(parent=person_entity)
phone_entity.Number = '4444444444'
phone_entity.Ext = '456'
phone_entity.Notes = 'Mobile'
A use case:
This is highly connected data. A person object contains multiple phone numbers. But phone calls can also be made to and from those phone numbers. Records of phone calls will also need to refer to these phone numbers.
The purpose of parent-child entity relationships:
After reading over the documentation, I was under the impression that the purpose of parent-child entity relationships was for performing transactions.
However, could they also be appropriate in this case? Is it almost as efficient to pull a parent and all of it's children out of the datastore as to pull one entity out with its "children" instead stored as JSON text blobs?
Basic question
Is there a normal and accepted way to handle this kind of data in google app engine?
Take a look at the new NDB API (in particular the StructuredProperty: http://code.google.com/appengine/docs/python/ndb/properties.html#structured)
Also, from my experience and what I've read, when you update an existing entity, you do not pay for writes on properties that did not change, that is, in contrast to what Riley said, you would only pay for writing the object + 2 writes for any indexed properties that were modified + 1 write for each composite index you have that contains the model and properties you modified.
From all the articles I've read and my experience (I too had to come up with a solution for this and ended up with the JSON method) you want to pack as much as you can into a single entity to minimize trips to the datastore which cost the most in terms of $$ and time.
There's no special bonus for pulling children entities out of the datastore. If you get two entities, the cost is the same whether they're in the same entity group or not. The only purpose of entity groups in app engine is transactions.
Should your records of phone calls change when the phone number is changed? My initial thought is that the records should have a separate copy of the phone number data, not a reference to a phone number object. You can still query the call log by phone number. It makes more sense to store a reference to the contacts involved, so that if their name changes or something the call log can be updated.
Related
I am currently exploring MongoDB.
I built a notes web app and for now the DB has 2 collections: notes and users.
The user can create, read and update his notes.
I want to create a page called /my-notes that will display all the notes that belong to the connected user.
My question is:
Should the notes model has an ownerId field or the opposite - the user model will have a field of noteIds of type list.
Points I found relevant for the decision making:
noteIds approach:
There is no need to query the notes that hold the desired ownerId (say we have a lot of notes then we will need indexes and search accross the whole notes collection). We just need to find the user by user ID and then get all the notes by their IDs.
In this case there are 2 calls to DB.
The data is ordered by the order of insertion to the notesIds field in the document.
ownerId approach:
We do need to find the notes by their ownerId field across the notes collection which might be more computer "intensive".
We can paginate / sort the data as we want - more control over the data.
Are there any more points you can think of?
As I can conclude this is a question of whether you want less computer intensive DB calls vs more control over the data.
What are the "best practices"?
Thanks,
A similar use case is explained in the documentation. If there is no limit on number of notes a user can have, it might be better to store a userId reference field in notes document.
As you've figured out already, pagination would be easier in the second approach. Also when updating notes, you can simply updateOne({ _id: "note_id", userId: 1 }) instead of checking user's document if the note actually belong to the user.
Given below entity in google app engine datastore, is it better to define index on reportingIds or define a separate entity which has only personId and reportingIds fields? Based on the documentation I understood, defining index results in increase of count of operations against datastore quota.
Below are entities in GAE Go. My code needs to scan through Person entities frequently. It needs to limit its scan to Person entity that has at least 1 reporting person. 2 approaches I see. Define index on reportingIds and Query by specifying filters. Create/Update PersonWithReporters entity when ever a Person gets a new reporting person. In the second case, my code needs to iterate through all the entities in PersonWithReporters and need not construct any index/query. I can iterate using Key which is always guaranteed to have the latest data. Not sure which approach is beneficial considering datastore operation counts against quota limit.
type Person struct {
Id string //unique person id
//many other personal details, his personal settings etc
reportingIds []string //ids of the Person this guy manages
}
type PersonWithReporters struct {
Id string //Person managing reportees
reportingIds []string //ids of the Person this guy manages
}
A approach with a separate entity gives you two advantages.
As you have already mentioned, you don't need to index/query all Person entities.
Every time a Person gets a new reporting person, you will create a new entity, which may be significantly cheaper than updating a Person entity which has many other properties, some of which, presumably, are indexed.
Your approach with a separate entity is also not ideal. When you index a property with multiple values, under the hood the Datastore creates an index entry for each value. So, when you add reporting person number 3 to this entity, you have to update 3 index entries instead of 1.
You can optimize your data model even further by creating a Reporter entity with no properties! Every time a new reporting person is added, you create this Reporter entity with ID set to the ID of a reporting person, and make it a child entity of a Person entity representing a person to whom this reporter reports.
Now, when you need to iterate through all persons with someone reporting to them, you run a simple query on this Reporter entity - no filters. This query can be set to keys-only (there is nothing than a key in this entity anyway, but keys-only queries are treated differently - they are basically free).
For every entity returned by this query you retrieve its key, and this key contains an ID (which is an ID of a reporting person), and a parent key, which includes an ID of a person who this reporter reports to.
Unless AppEngine's datastore in Go is very different to how it works in Java or Python you cannot index an array natively - So option 1 is out of the question, and so is option 2.
I suggest option three, which is to define a
type PersonWithReporters {
Id string // concatenate(managing_Person_id, separator, reporter_Person_id) to avoid id collisions
reportingId string; // indexed
managingId string; // probably indexed as well
}
You would create multiple of these entities instead of a single entity with an array. Also you add an index on reportingId. Now you can create a filter query on this entity and should be able to retrieve the desired information.
I would worry more about performance and not too much about the quota limits, they are pretty high. Just implement it, see how it works and whether quota is your main concern here.
I am a beginner to Datastore and I am wondering how I should use it to achieve what I want to do.
For example, my app needs to keep track of customers and all their purchases.
Coming from relational database, I can achieve this by creating [Customers] and [Purchases] table.
In Datastore, I can make [Customers] and [Purchases] kinds.
Where I am struggling is the structure of the [Purchases] kind.
If I make [Purchases] as the child of [Customers] kind, would there be one entity in [Customers] and one entity in [Purchases] that share the same key? Does this mean inside of this [Purchases] entity, I would have a property that just keeps increasing for each purchase they make?
Or would I have one [Purchases] entity for each purchase they make and in each of these entities I would have a property that points to a entity in [Customers] kind?
How does Datastore perform in these scenarios?
Sounds like you don't fully understand ancestors. Let's go with the non-ancestor version first, which is a legitimate way to go:
class Customer(ndb.Model):
# customer data fields
name = ndb.StringProperty()
class Purchase(ndb.Model):
customer = ndb.KeyProperty(kind=Customer)
# purchase data fields
price = ndb.IntegerProperty
This is the basic way to go. You'll have one entity in the datastore for each customer. You'll have one entity in the datastore for each purchase, with a keyproperty that points to the customer.
IF you have a purchase, and need to find the associated customer, it's right there.
purchase_entity.customer.get()
If you have a Customer, you can issue a query to find all the purchases that belong to the customer:
Purchase.query(customer=customer_entity.key).fetch()
In this case, whenever you write either a customer or purchase entity, the GAE datastore will write that entity any one of the datastore machines running in the cloud that's not busy. You can have really high write throughput this way. However, when you query for all the purchases of a given customer, you just read back the most current data in the indexes. If a new purchase was added, but the indexes not updated yet, then you may get stale data (eventual consistency). You're stuck with this behavior unless you use ancestors.
Now as for the ancestor version. The basic concept is essentially the same. You still have a customer entity, and separate entities for each purchase. The purchase is NOT part of the customer entity. However, when you create a purchase using a customer as an ancestor, it (roughly) means that the purchase is stored on the same machine in the datastore that the customer entity was stored on. In this case, your write performance is limited to the performance of that one machine, and is advertised as one write per second. As a benefit though, you can can query that machine using an ancestor query and get an up-to-date list of all the purchases of a given customer.
The syntax for using ancestors is a bit different. The customer part is the same. However, when you create purchases, you'd create it as:
purchase1 = Purchase(ancestor=customer_entity.key)
purchase2 = Purchase(ancestor=customer_entity.key)
This example creates two separate purchase entities. Each purchase will have a different key, and the customer has its own key as well. However, each purchase key will have the customer_entity's key embedded in it. So you can think of the purchase key being twice as long. However, you don't need to keep a separate KeyProperty() for the customer anymore, since you can find it in the purchases key.
class Purchase(ndb.Model):
# you don't need a KeyProperty for the customer anymore
# purchase data fields
price = ndb.IntegerProperty
purchase.key.parent().get()
And in order to query for all the purchases of a given customer:
Purchase.query(ancestor=customer_entity.key).fetch()
The actual of structure of the entities don't change much, mostly the syntax. But the ancestor queries are fully consistent.
The third option that you kinda describe is not recommended. I'm just including it for completeness. It's a bit confusing, and would go something like this:
class Purchase(ndb.Model):
# purchase data fields
price = ndb.IntegerProperty()
class Customer(ndb.Model):
purchases = ndb.StructuredProperty(Purchase, repeated=True)
This is a special case which uses ndb.StructuredProperty. In this case, you will only have a single Customer entity in the datastore. While there's a class for purchases, your purchases won't get stored as separate entities - they'll just be stored as data within the Customer entity.
There may be a couple of reasons to do this. You're only dealing with one entity, so your data fetch will be fully-consistent. You also have reduced write costs when you have to update a bunch of purchases, since you're only writing a single entity. And you can still query on the properties of the Purchase class. However, this was designed for only having a limited number or repeated objects, not hundreds or thousands. And each entity is limited to ta total size of 1MB, so you'll eventually hit that and you won't be able to add more purchases.
(from your personal tags I assume you are a java guy, using GAE+java)
First, don't use the ancestor relationships - this has a special purpose to define the transaction scope (aka Entity Groups). It comes with several limitations and should not be used for normal relationships between entities.
Second, do use an ORM instead of low-level API: my personal favourite is objectify. GAE also offers JDO or JPA.
In GAE relations between entities are simply created by storing a reference (a Key) to an entity inside another entity.
In your case there are two possibilities to create one-to-many relationship between Customer and it's Purchases.
public class Customer {
#Id
public Long customerId; // 'Long' identifiers are autogenerated
// first option: parent-to-children references
public List<Key<Purchase>> purchases; // one-to-many parent-to-child
}
public class Purchase {
#Id
public Long purchaseId;
// option two: child-to-parent reference
public Key<Customer> customer;
}
Whether you use option 1 or option 2 (or both) depends on how you plane to access the data. The difference is whether you use get or query. The difference between two is in cost and speed, get being always faster and cheaper.
Note: references in GAE Datastore are manual, there is no referential integrity: deleting one part of a relationship will produce no warning/error from Datastore. When you remove entities it's up to your code to fix references - use transactions to update two entities consistently (hint: no need to use Entity Groups - to update two entities in a transaction you can use XG transactions, enabled by default in objectify).
I think the best approach in this specific case would be to use a parent structure.
class Customer(ndb.Model):
pass
class Purchase(ndb.Model):
pass
customer = Customer()
customer_key = customer.put()
purchase = Purchase(parent=customer_key)
You could then get all purchases of a customer using
purchases = Purchase.query(ancestor=customer_key)
or get the customer who bough the purchase using
customer = purchase.key.parent().get()
It might be a good idea to keep track of the purchase count indeed when you use that value a lot.
You could do that using a _pre_put_hook or _post_put_hook
class Customer(ndb.Model):
count = ndb.IntegerProperty()
class Purchase(ndb.Model):
def _post_put_hook(self):
# TODO check whether this is a new entity.
customer = self.key.parent().get()
customer.count += 1
customer.put()
It would also be good practice to do this action in a transacion, so the count is reset when putting the purchase fails and the other way around.
#ndb.transactional
def save_purchase(purchase):
purchase.put()
I'm quite new to ndb but I've already understood that I need to rewire a certain area in my brain to create models. I'm trying to create a simple model - just for the sake of understanding how to design an ndb database - with a one-to-one relationship: for instance, a user and his info. After searching around a lot - found documentation but it was hard to find different examples - and experimenting a bit (modeling and querying in a couple of different ways), this is the solution I found:
from google.appengine.ext import ndb
class Monster(ndb.Model):
name = ndb.StringProperty()
#classmethod
def get_by_name(cls, name):
return cls.query(cls.name == name).get()
def get_info(self):
return Info.query(Info.monster == self.key).get()
class Info(ndb.Model):
monster = ndb.KeyProperty(kind='Monster')
address = ndb.StringProperty()
a = Monster(name = "Dracula")
a.put()
b = Info(monster = a.key, address = "Transilvania")
b.put()
print Monster.get_by_name("Dracula").get_info().address
NDB doesn't accept joins, so the "join" we want has to be emulated using class methods and properties. With the above system I can easily reach a property in the second database (Info) through a unique property in the first (in this case "name" - suppose there are no two monsters with the same name).
However, if I want to print a list with 100 monster names and respective addresses, the second database (Info) will be hit 100 times.
Question: is there a better way to model this to increase performance?
If its truly a one to one relationship, why are creating 2 models. Given your example the Address entity cannot be shared with any Monster so why not put the Address details in the monster.
There are some reasons why you wouldn't.
Address could become large and therefore less efficient to retrieve 100's of properties when you only need a couple - though project queries may help there.
You change your mind and you want to see all monsters that live in Transylvania - in which case you would create the address entity and the Monster would have the key property that points to the Address. This obviously fails when you work out that some monsters can live in multiple places (Werewolfs - London, Transylvania, New York ;-) , in which case you either have a repeating KeyProperty in the monstor or an intermediate entity that points to the monster and the address. In your case I don't think that monsters on the whole have that many documented Addresses ;-)
Also if you are uniquely identifying monsters by name you should consider storing the name as part of the key. Doing a Monster.get_by_id("dracula") is quicker than a query by name.
As I wrote (poorly) in the comment. If 1. above holds and it is a true one to one relationship. I would then create Address as a child entity (Monster is the parent/ancestor in the key) when creating address. This allows you to,
allow other entities to point to the Address,
If you create a bunch of child entities, fetch them with a single
ancestor query). 3 If you have get monster and it's owned entities
again it's an ancestor query.
If you have a bunch of entities that
should only exist if Monster instance exists and they are not
children, then you have to do querys on all the entity types with
KeyProperty's matching the key, and if theses entities are not
PolyModels, then you have to perform a query for each entity
type (and know you need to perform the query on a given entity,
which involves a registry of some type, or hard coding things)
I suspect what you may be trying could be achieved by using elements described in the link below
Have a look at "Operations on Multiple Keys or Entities" "Expando Models" "Model Hooks"
https://developers.google.com/appengine/docs/python/ndb/entities
(This is probably more a comment than an answer)
Given the following Many to Many Relationship designed in Google App Engine Datastore:
User
PK: UserID
Name
Company
PK: CompanyID
Name
CompanyReview
CK CompanyID
CK UserID
ReviewContent
For optimization query, what's the best way to query this relationship tables for showing the selected company's review by users.
Currently, I'm doing the following:
results = CompanyReview.all().filter('owned_by = ', company).filter('written_by = ', user).fetch(10)
where I'm able to retrieve the data of CompanyReview table. However, in this case, I would need to check against the UserID from this CompanyReview table against the User table in order to obtain the name of the users who have commented for the selected company.
Is there a better solution to grab the user name as well, all in one statement in this case or at least better optimized solution? Performance is emphasized.
It dependes on which side of the relationship will have more values. As described is this article of Google App Engine docs, you can model many-to-many relationships by using a list of keys in one side of the relationship. "This means you should place the list on side of the relationship which you expect to have fewer values".
If both sides of the relationship will have many values, then you will really need the CompanyReview model. But pay attention to what the article says:
However, you need to be very careful because traversing the
connections of a collection will require more calls to the datastore.
Use this kind of many-to-many relationship only when you really need
to, and do so with care to the performance of your application.
This is because it uses RefereceProperty in the relationship model:
class ContactCompany(db.Model):
contact = db.ReferenceProperty(Contact,
required=True,
collection_name='companies')
company = db.ReferenceProperty(Company,
required=True,
collection_name='contacts')
title = db.StringProperty()
So if in Contact entities we try to access the companies, it will make a new query. And if in ContactCompany entities we try to get attributes of contact as in contact_company.contact.name, a query for that single contact will be made also. Read the ReferencyProperty docs for more info.
Extra:
Since you are performance-savvy, I recommend using a decorator for memcaching function returns and using this excellent layered storage library for Google App Engine.