I'm quite new to ndb, but I've already understood that I need to rewire a certain area of my brain to create models. I'm trying to create a simple model - just for the sake of understanding how to design an ndb database - with a one-to-one relationship: for instance, a user and his info. After searching around a lot (I found documentation, but it was hard to find varied examples) and experimenting a bit (modeling and querying in a couple of different ways), this is the solution I found:
from google.appengine.ext import ndb

class Monster(ndb.Model):
    name = ndb.StringProperty()

    @classmethod
    def get_by_name(cls, name):
        return cls.query(cls.name == name).get()

    def get_info(self):
        return Info.query(Info.monster == self.key).get()

class Info(ndb.Model):
    monster = ndb.KeyProperty(kind='Monster')
    address = ndb.StringProperty()

a = Monster(name="Dracula")
a.put()

b = Info(monster=a.key, address="Transilvania")
b.put()

print Monster.get_by_name("Dracula").get_info().address
NDB doesn't support joins, so the "join" we want has to be emulated with class methods and properties. With the above setup I can easily reach a property in the second kind (Info) through a unique property in the first (in this case name - assume no two monsters share the same name).
However, if I want to print a list of 100 monster names with their respective addresses, the second kind (Info) will be queried 100 times.
Question: is there a better way to model this to increase performance?
If it's truly a one-to-one relationship, why are you creating two models? Given your example, the Info entity cannot be shared with any other Monster, so why not put the address details in the Monster?
There are some reasons why you wouldn't:

1. Address could become large, and it's less efficient to retrieve hundreds of properties when you only need a couple - though projection queries may help there.
2. You change your mind and want to see all monsters that live in Transylvania - in which case you would create the Address entity and the Monster would have a key property that points to the Address. This obviously fails when you work out that some monsters can live in multiple places (Werewolves - London, Transylvania, New York ;-), in which case you either have a repeated KeyProperty in the Monster or an intermediate entity that points to the monster and the address. In your case I don't think that monsters on the whole have that many documented addresses ;-)
Also, if you are uniquely identifying monsters by name, you should consider storing the name as part of the key: doing a Monster.get_by_id("dracula") is quicker than a query by name.
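For example (a minimal sketch of that idea, assuming names are unique and never change):

from google.appengine.ext import ndb

class Monster(ndb.Model):
    # The unique name lives in the key id; keep a name property only if you
    # also need to query on it or display it with different casing.
    name = ndb.StringProperty()

Monster(id="dracula", name="Dracula").put()

# A direct get by key id: faster and cheaper than a query on the name property.
monster = Monster.get_by_id("dracula")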
As I wrote (poorly) in the comment: if 1. above holds and it is a true one-to-one relationship, I would then create Address as a child entity (Monster is the parent/ancestor in the key) when creating the address. This allows you to:

1. allow other entities to point to the Address,
2. if you create a bunch of child entities, fetch them all with a single ancestor query,
3. if you fetch a monster and its owned entities, again it's an ancestor query.
If, on the other hand, you have a bunch of entities that should only exist if a Monster instance exists and they are not children, then you have to run queries on all the entity types with KeyProperty values matching the key; and if these entities are not PolyModels, you have to perform a query for each entity type (and know that you need to perform the query on a given entity type, which involves a registry of some kind, or hard-coding things).
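A minimal sketch of that parent/child layout, reusing the Monster/Info models from the question (the key id and property values are just placeholders):

from google.appengine.ext import ndb

class Monster(ndb.Model):
    name = ndb.StringProperty()

class Info(ndb.Model):
    # No KeyProperty needed: the owning Monster is the parent in Info's key.
    address = ndb.StringProperty()

monster_key = ndb.Key(Monster, "dracula")  # the unique name stored in the key
Monster(key=monster_key, name="Dracula").put()

Info(parent=monster_key, address="Transylvania").put()

# Strongly consistent ancestor query for everything owned by this monster:
info = Info.query(ancestor=monster_key).get()

# And from an Info entity you can get back to its Monster without a query:
monster = info.key.parent().get()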
I suspect that what you are trying to do could be achieved using the elements described in the link below.
Have a look at "Operations on Multiple Keys or Entities", "Expando Models" and "Model Hooks":
https://developers.google.com/appengine/docs/python/ndb/entities
(This is probably more a comment than an answer)
Related
I am a beginner to Datastore and I am wondering how I should use it to achieve what I want to do.
For example, my app needs to keep track of customers and all their purchases.
Coming from a relational database background, I can achieve this by creating [Customers] and [Purchases] tables.
In Datastore, I can make [Customers] and [Purchases] kinds.
Where I am struggling is the structure of the [Purchases] kind.
If I make [Purchases] a child of the [Customers] kind, would there be one entity in [Customers] and one entity in [Purchases] that share the same key? Does that mean that inside this [Purchases] entity I would have a property that just keeps growing with each purchase they make?
Or would I have one [Purchases] entity for each purchase they make, and in each of these entities a property that points to an entity in the [Customers] kind?
How does Datastore perform in these scenarios?
Sounds like you don't fully understand ancestors. Let's go with the non-ancestor version first, which is a legitimate way to go:
class Customer(ndb.Model):
    # customer data fields
    name = ndb.StringProperty()

class Purchase(ndb.Model):
    customer = ndb.KeyProperty(kind=Customer)
    # purchase data fields
    price = ndb.IntegerProperty()
This is the basic way to go. You'll have one entity in the datastore for each customer. You'll have one entity in the datastore for each purchase, with a keyproperty that points to the customer.
If you have a purchase and need to find the associated customer, it's right there:
purchase_entity.customer.get()
If you have a Customer, you can issue a query to find all the purchases that belong to the customer:
Purchase.query(Purchase.customer == customer_entity.key).fetch()
In this case, whenever you write either a customer or a purchase entity, the GAE datastore will write that entity to any one of the datastore machines running in the cloud that's not busy. You can get really high write throughput this way. However, when you query for all the purchases of a given customer, you just read back the most current data in the indexes. If a new purchase was added but the indexes haven't been updated yet, you may get stale data (eventual consistency). You're stuck with this behavior unless you use ancestors.
Now as for the ancestor version. The basic concept is essentially the same. You still have a customer entity, and separate entities for each purchase. The purchase is NOT part of the customer entity. However, when you create a purchase using a customer as an ancestor, it (roughly) means that the purchase is stored on the same machine in the datastore that the customer entity was stored on. In this case, your write performance is limited to the performance of that one machine, and is advertised as one write per second. As a benefit though, you can query that machine using an ancestor query and get an up-to-date list of all the purchases of a given customer.
The syntax for using ancestors is a bit different. The customer part is the same. However, when you create purchases, you'd create them as:

purchase1 = Purchase(parent=customer_entity.key)
purchase2 = Purchase(parent=customer_entity.key)
This example creates two separate purchase entities. Each purchase will have a different key, and the customer has its own key as well. However, each purchase key will have the customer_entity's key embedded in it, so you can think of the purchase key as being twice as long. You don't need to keep a separate KeyProperty() for the customer anymore, since you can find the customer in the purchase's key.
class Purchase(ndb.Model):
    # you don't need a KeyProperty for the customer anymore
    # purchase data fields
    price = ndb.IntegerProperty()
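# To get the customer that owns a purchase, read it off the purchase's key: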
purchase.key.parent().get()
And in order to query for all the purchases of a given customer:
Purchase.query(ancestor=customer_entity.key).fetch()
The actual structure of the entities doesn't change much; it's mostly the syntax that differs. But the ancestor queries are fully consistent.
The third option that you kinda describe is not recommended. I'm just including it for completeness. It's a bit confusing, and would go something like this:
class Purchase(ndb.Model):
    # purchase data fields
    price = ndb.IntegerProperty()

class Customer(ndb.Model):
    purchases = ndb.StructuredProperty(Purchase, repeated=True)
This is a special case which uses ndb.StructuredProperty. In this case, you will only have a single Customer entity in the datastore. While there's a class for purchases, your purchases won't get stored as separate entities - they'll just be stored as data within the Customer entity.
There may be a couple of reasons to do this. You're only dealing with one entity, so your data fetch will be fully consistent. You also have reduced write costs when you have to update a bunch of purchases, since you're only writing a single entity. And you can still query on the properties of the Purchase class. However, this was designed for a limited number of repeated objects, not hundreds or thousands. And each entity is limited to a total size of 1MB, so you'll eventually hit that limit and won't be able to add more purchases.
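For example, filtering on a nested Purchase field looks like this (a small sketch using the models above, with a made-up price value):

# Find customers that have at least one purchase with a price of 100.
customers = Customer.query(Customer.purchases.price == 100).fetch()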
(from your personal tags I assume you are a java guy, using GAE+java)
First, don't use the ancestor relationships - this has a special purpose to define the transaction scope (aka Entity Groups). It comes with several limitations and should not be used for normal relationships between entities.
Second, do use an ORM instead of the low-level API: my personal favourite is Objectify. GAE also offers JDO and JPA.
In GAE relations between entities are simply created by storing a reference (a Key) to an entity inside another entity.
In your case there are two possibilities to create a one-to-many relationship between a Customer and its Purchases.
public class Customer {
    @Id
    public Long customerId; // 'Long' identifiers are autogenerated

    // first option: parent-to-children references
    public List<Key<Purchase>> purchases; // one-to-many parent-to-child
}

public class Purchase {
    @Id
    public Long purchaseId;

    // option two: child-to-parent reference
    public Key<Customer> customer;
}
Whether you use option 1 or option 2 (or both) depends on how you plan to access the data. The difference is whether you use a get or a query; the two differ in cost and speed, a get always being faster and cheaper.
Note: references in GAE Datastore are manual, there is no referential integrity: deleting one part of a relationship will produce no warning/error from Datastore. When you remove entities it's up to your code to fix references - use transactions to update two entities consistently (hint: no need to use Entity Groups - to update two entities in a transaction you can use XG transactions, enabled by default in objectify).
I think the best approach in this specific case would be to use a parent structure.
class Customer(ndb.Model):
    pass

class Purchase(ndb.Model):
    pass

customer = Customer()
customer_key = customer.put()

purchase = Purchase(parent=customer_key)
You could then get all purchases of a customer using
purchases = Purchase.query(ancestor=customer_key).fetch()
or get the customer who bought the purchase using
customer = purchase.key.parent().get()
If you use that value a lot, it might indeed be a good idea to keep track of the purchase count.
You could do that using a _pre_put_hook or _post_put_hook
class Customer(ndb.Model):
    count = ndb.IntegerProperty(default=0)

class Purchase(ndb.Model):
    def _post_put_hook(self, future):
        # TODO check whether this is a new entity.
        customer = self.key.parent().get()
        customer.count += 1
        customer.put()
It would also be good practice to do this in a transaction, so that the count is rolled back if putting the purchase fails, and vice versa.
@ndb.transactional
def save_purchase(purchase):
    purchase.put()
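If you would rather not rely on the hook, a sketch of doing both writes explicitly inside the transaction might look like this (it assumes the purchase is created with the customer as its parent, so both entities share one entity group; treat it as an illustration, not a drop-in implementation):

@ndb.transactional
def save_purchase(customer_key):
    # Customer and its child Purchase are in the same entity group,
    # so both writes fit in a single (non-XG) transaction.
    customer = customer_key.get()
    customer.count = (customer.count or 0) + 1
    purchase = Purchase(parent=customer_key)
    ndb.put_multi([customer, purchase])
    return purchase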
I'm trying to wrap my head around how I can represent a many-to-many relationship inside of AppEngine's Datastore in the Go Programming Language. I'm more used to traditional relational databases.
I have two types of entities in my system. Let's call them A and B. Every A entity is related to some number of B entities. Similarly, every B entity is related to some other number of A entities. I'd like to be able to efficiently query for all B entities given an A entity, and for all A entities given a B entity.
In the Python SDK, there seems to be a way to mark fields in an entity as ReferenceProperty fields, which reference some other entity. However, I can't find anything similar in Go's AppEngine SDK. Go seems to just use basic structs to represent entities.
What's the best practice for dealing with this?
A python ReferenceProperty essentially stores a key to another entity. It's similar to using a Key field in Go.
There's at least two ways to solve your problem. A cheap way to store a limited number of references, and an expensive way for larger data sets.
fmt.Println.MKO provided the answer for the cheap way, except the query is simpler than what he suggests; it should actually be:
SELECT * FROM B where AIds = 'A1'
This method is limited to the number of indexed entries per entity, as well as the entity size. So the list of AIds or BIds will limit the number of entities to 20000 or less.
If you have an insane amount of data, you would want a mapping entity to represent the M2M relationship between a given A & B entity. It would simply contain a key to an A and a key to a B. You would then query for map entities, and then fetch the corresponding A or B entities you need. This would be much more expensive, but breaks past the entity size limit.
Based on how you wish to query, you could do the following:
in your struct A add a field:
BIds []int64
in your struct B add a field:
AIds []int64
Now, any time you add a relation between A and B, you just need to add the corresponding ids to those two fields.
When you need to query for all B entities related to a given A1, you do your query like this:
SELECT * FROM B where AIds = 'A1'
For all A entities which are related to a given B1, it's similar:
SELECT * FROM A where BIds = 'B1'
Update: altered queries based on a suggestion from dragonx.
I have two models:
Car(ndb.Model) and Branch(ndb.Model) each with a key method.
@classmethod
def car_key(cls, company_name, car_registration_id):
    if not (company_name.isalnum() and car_registration_id.isalnum()):
        raise ValueError("Company & car_registration_id must be alphanumeric")
    key_name = company_name + "-" + car_registration_id
    return ndb.Key("Car", key_name)
Branch Key:
@classmethod
def branch_key(cls, company_name, branch_name):
    if not (company_name.isalnum() and branch_name.isalnum()):
        raise ValueError("Company & Branch names must be alphanumeric")
    key_name = company_name + "-" + branch_name
    return ndb.Key("Branch", key_name)
However I'm thinking this is a bit ugly and not really how you're supposed to use keys.
(the car registration is unique to a car but sometimes one company may sell a car to another company and also cars move between branches).
Since a company may have many cars or many branches, I suppose I don't want large entity groups, because you can only write to an entity group once per second.
How should I define my keys?
e.g. I'm considering car_key = ndb.Key("Car", car_reg_id, "Company", company_name)
since it's unlikely for a car to have many companies, so the entity group won't be too big.
However I'm not sure what to do about the branch key since many companies may have the same branch name, and many branches may have the same company.
You've rightly identified that ancestor relationships in GAE should not be based on the logical structure of your data.
They need to be based on the transactional behavior of your application. Ancestors make your life difficult. For example, once you use a compound key, you won't be able to fetch that entity by key unless you happen to know all the elements of the key. If you knew the Car id, you wouldn't be able to fetch it without also knowing the other component.
Consider what queries you would need to have strong consistency for. If you do happen to need strong consistency when querying all the cars in a given branch, then you should consider using that as an ancestor.
Consider what operations need to be done in a transaction, that's another good reason for using an entity group.
Keep in mind also, you might not need any entity group at all (probably the answer for your situation).
Or, on the flip side, you might need an entity group that doesn't exactly fit any logical conceptual model: the ancestor might be an entity that exists purely because you need an ancestor for a certain transaction.
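For instance, a sketch of the "no entity group at all" option (model and property names here are just placeholders; the registration id is assumed to be globally unique):

from google.appengine.ext import ndb

class Branch(ndb.Model):
    # Key id: something unique, e.g. "<company>-<branchname>".
    name = ndb.StringProperty()

class Car(ndb.Model):
    # Key id: the registration id, unique to the car.
    company = ndb.StringProperty()
    branch = ndb.KeyProperty(kind=Branch)

Car(id="ABC123", company="acme",
    branch=ndb.Key(Branch, "acme-downtown")).put()

# Fetch by key, no query and no compound key needed:
car = ndb.Key(Car, "ABC123").get()

# Selling the car or moving it between branches is just a property update;
# the key never has to change.
car.company = "globex"
car.branch = ndb.Key(Branch, "globex-uptown")
car.put()

# Listing the cars of a branch is an ordinary (eventually consistent) query:
cars = Car.query(Car.branch == ndb.Key(Branch, "globex-uptown")).fetch()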
I want to run over this plan I have for achieving strong consistency with my GAE structure. Currently, here's what I have (it's really simple, I promise):
You have a Class (Class meaning Classroom not a programming "class") model and an Assignment model and also a User model. Now, a Class has an integer list property called memberIds, which is an indexed list of User ids. A class also has a string list of Assignment ids.
Anytime a new Assignment is created, its respective Class entity is also updated and adds the new Assignment id to its list.
What I want to do is get new Assignments for a user. What I do is query for all Classes whose memberIds list contains currentUserId. Each Class I get back has a list of assignment ids, and I use those ids to get their respective Assignments by key. After months with this data model, I just realized that I might not get strong consistency with this (for the Class query part).
If user A posts an assignment (which consequently updates ClassA), user B who checks in for new assignments a fraction of a second later might not yet see the updated changes to ClassA (right?).
This is undesired. One solution would be to use ancestor queries, but that is not possible in my case, and entity groups are limited to 1 write per second, which I'm not sure is enough for my case.
So here's what I figured: anytime a new assignment is posted, we do this:
1. Get the respective Class entity.
2. Add the assignment id to the Class.
3. Get the ids of all of this Class's members.
4. Fetch all users who are members, by keys.
5. A User entity has a list of the Classes that the user is a member of (a LocalStructuredProperty, sort of like a dictionary: {"classId": "242", "hasNewAssignment": "Yes"}). We flag that class as hasNewAssignment = "Yes".

Now when a user wants to get new assignments, instead of querying for groups that have a new assignment and which I am a member of, I:

1. check the User object's list of Classes to see which classes have new assignments,
2. retrieve those classes by key (strongly consistent so far, right?),
3. check the assignments list of those classes and retrieve all the assignments by key.

So throughout this process, I've never queried. All results should be strongly consistent, right? Is this a good solution? Am I overcomplicating things? Are my read/write costs skyrocketing with this? What do you think?
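Here's a rough sketch of what I mean (model and property names are simplified placeholders):

from google.appengine.ext import ndb

class ClassMembership(ndb.Model):
    class_id = ndb.StringProperty()
    has_new_assignment = ndb.BooleanProperty(default=False)

class User(ndb.Model):
    memberships = ndb.LocalStructuredProperty(ClassMembership, repeated=True)

class Class(ndb.Model):
    assignment_ids = ndb.StringProperty(repeated=True)

class Assignment(ndb.Model):
    title = ndb.StringProperty()

def get_new_assignments(user_key):
    # Every step is a get by key, so the results are strongly consistent.
    user = user_key.get()
    class_keys = [ndb.Key(Class, m.class_id)
                  for m in user.memberships if m.has_new_assignment]
    classes = [c for c in ndb.get_multi(class_keys) if c]
    assignment_keys = [ndb.Key(Assignment, aid)
                       for c in classes for aid in c.assignment_ids]
    return ndb.get_multi(assignment_keys)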
Queries are not strongly consistent. Gets are strongly consistent.
I think you did the right thing:
Your access is strongly consistent.
Your reads will be cheaper: one get costs half as much as a query that returns one entity.
Your writes will be more expensive: you also need to update all the relevant User entities.
So the cost depends on your usage pattern: how many assignment reads do you have vs new assignment creation.
I think using ancestor queries is a better solution.
Set the ancestor of the Assignment entities as the Class to which the assignment is allotted
Set the ancestor of a Student entity as the Class to which the student belongs.
This way all the assignments and students in a particular class belong to the same entity group, so strong consistency is guaranteed for any query that only has to deal with a single class.
N.B. I am assuming that not too many people will be posting assignments into a class at the same time (but any number of people can post assignments into different classes, as those belong to different entity groups).
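A minimal sketch of that layout (model names follow the question; the properties are placeholders):

from google.appengine.ext import ndb

class Class(ndb.Model):
    name = ndb.StringProperty()

class Assignment(ndb.Model):
    title = ndb.StringProperty()

class Student(ndb.Model):
    name = ndb.StringProperty()

class_key = ndb.Key(Class, "geography-101")
Class(key=class_key, name="Geography 101").put()

# Assignments and students are created with the class as their ancestor.
Assignment(parent=class_key, title="Map reading").put()
Student(parent=class_key, name="Alice").put()

# Ancestor queries against a single class are strongly consistent.
assignments = Assignment.query(ancestor=class_key).fetch()
students = Student.query(ancestor=class_key).fetch()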
We are working on a mapping application that uses Google Maps API to display points on a map. All points are currently fetched from a MySQL database (holding some 5M + records). Currently all entities are stored in separate tables with attributes representing individual properties.
This presents the following problems:

1. Every time there's a new property we have to make changes in the database, the application code and the front-end. This is all fine, but some properties have to be added for all entities, and that's when it becomes a nightmare to go through 50+ different tables and add new properties.
2. There's no way to find all entities which share a given property, e.g. no way to find all schools, colleges or universities that have a geography dept (without querying schools, unis and colleges separately).
3. Removing a property is equally painful.
4. There are no standards for defining properties in individual tables. The same property can exist with a different name or data type in another table.
5. There's no way to link or group points based on their properties (somewhat related to point 2).
We are thinking of redesigning the whole database, but without a DBA's help and professional DB design experience we are really struggling.
Another problem we're facing with the new design is that there are a lot of shared attributes/properties between entities.
For example:
An entity called "university" has 100+ attributes. Other entities (e.g. hospitals, banks, etc.) share quite a few attributes with universities, for example ATM machines, parking, cafeteria, and so on.
We don't really want to put the properties in a separate table [and then link them back to entities with foreign keys], as that would require us to add/remove them manually. Also, generalizing properties will result in groups containing 50+ attributes, and not all records (i.e. entities) require those properties.
So, keeping that in mind, here's what we are thinking for the new design:

1. Have a separate table for each entity containing some basic info, e.g. id, name, etc.
2. Have two tables, attribute type and attribute, to store property information.
3. Link each entity (or a table if you like) to attributes using a many-to-many relation.
4. Store addresses in a separate table called addresses and link entities to it via foreign keys.
We think this will allow us to be more flexible when adding, removing or querying on attributes.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all "attributes" for a given university we might have a query with 20+ joins to fetch all the related attributes in a single row.
We desperately need to know some opinions or possible flaws in this design approach.
Thanks for your time.
In trying to generalize your question without more specific examples, it's hard to truly critique your approach. If you'd like some more in depth analysis, try whipping up an ER diagram.
If your data model is changing so much that you're constantly adding/removing properties and many of these properties overlap, you might be better off using EAV.
Otherwise, if you want to maintain a relational approach but are finding a lot of overlap with properties, you can analyze the entities and look for abstractions that link to them.
Ex) My Db has Puppies, Kittens, and Walruses all with a hasFur and furColor attribute. Remove those attributes from the 3 tables and create a FurryAnimal table that links to each of those 3.
Of course, the simplest answer is to not touch the data model. Instead, create Views on the underlying tables that you can use to address (5), (4) and (2)
(1) cannot be an issue: there is one place where your objects are defined, and everything else is generated/derived from that. Just refactor your code until this is the case.
(2) is solved by having a metamodel, where you describe which properties are where. This is probably needed for (1) too.
You might want to totally avoid the problem by programming this in Smalltalk with Seaside on a Gemstone object oriented database. Then you can just have objects with collections and don't need so many joins.