What would be the purpose of putting all datastore entities in a single group?

What would be the purpose of putting all datastore entities in a single group? - google-app-engine

I have started working on an existing project which uses Google Datastore where for some of the entity kinds every entity is assigned the same ancestor. Example:
class BaseModel(ndb.Model):
#classmethod
def create(cls, **kwargs):
return cls(parent=cls.make_key(), **kwargs)
#classmethod
def make_key(cls):
return ndb.Key('Group', cls.key_name())
class Vehicle(BaseModel):
#classmethod
def key_name(cls):
return 'vehicle_group'
So the keys end up looking like this:
Key(Group, 'vehicle_group', Vehicle, 5068993417183232)
There is no such kind as 'Group' nor entity 'vehicle_group' but that's OK in these docs: "note that unlike in a file system, the parent entity need not actually exist".
I understand from reading that this might have a performance benefit in that all the entities of a kind are colocated in the distributed datastore.
But putting all these entities in a single group would in my mind create problems as this project scales, and the once per second write limit would apply to the entire kind. There doesn't appear to be any transactional reason for the group.
No one on the project knows why it was originally done like this. My questions are:
Does anyone know where this "xxx_group" single entity scheme comes
from?
And is it as bunk as it appears to be?

Grouping many entities inside a single entity group offers at least 2 advantages I can think of:
ability to perform (ancestor) queries inside transactions - non-ancestor (or cross-group) queries are not allowed inside transactions
ability to access many entities inside the same transaction - cross-group transactions are limited to max 25 entity groups
The 1 write/second/group limit might not be a scalability issue at all for some applications (think write once read a lot kind of apps, for example, or apps for which 1 write per sec is more than enough).
As for the mechanics, the (unique) parent "entity" key for the group is the ndb.Key('Group', "xxx_group") key (which has the "xxx_group" key ID). The corresponding "entity" or its model doesn't need to exist (unless the entity itself needs to be created, bu that doesn't appear to be the case). The parent key is used simply to establish the group's "namespace" in the datastore, if you want.
You can see a somehow similar use in the examples from the Entity Keys documentation, check out the Message use (except Message is just a "parent" entity in the ancestor path, but not the root entity):
class Revision(ndb.Model):
message_text = ndb.StringProperty()
ndb.Key('Account', 'sandy#foo.com', 'Message', 123, 'Revision', '1')
ndb.Key('Account', 'sandy#foo.com', 'Message', 123, 'Revision', '2')
ndb.Key('Account', 'larry#foo.com', 'Message', 456, 'Revision', '1')
ndb.Key('Account', 'larry#foo.com', 'Message', 789, 'Revision', '2')
...
Notice that Message is not a model class. This is because we are
using Message purely as a way to group Revisions, not to store data.

This was probably done to achieve strongly consistent queries within the group. As you've pointed out this design has... drawbacks.
If this is solely reference data (i.e. Read many write once) that may mitigate some of the negatives, but also mostly invalidates the positives (i.e. Eventual consistency is not a problem if data doesn't update often).

Related

Is there any side effect of not having a physical entity for it to act as parent key

If I go through google app engine tutorial, I can see their example seem to encourage us to have parent for entities.
Hence, I have the following workable code, for user creation (with email as unique)
def parent_key():
return ndb.Key('parent', 'parent')
class User(ndb.Model):
email = ndb.StringProperty(required = True)
timestamp = ndb.DateTimeProperty(required = True)
class RegisterHandler2(webapp2.RequestHandler):
def get(self):
email = self.request.get('email')
user_timestamp = int(time.time())
user = User.get_or_insert(email, parent=parent_key(), email=email, timestamp=datetime.datetime.fromtimestamp(user_timestamp))
Note, parent entity physically doesn't exist.
Although the above code runs totally fine, I was wondering any possible problem can occur, if parent entity physically doesn't exist?
One of my concern of not having parent, is eventually consistency. After write operation, I want my read operation able to fetch the latest written value. I'm using User.get_or_insert to write (and read), and User.get_by_id to read only.
I want after I execute User.get_or_insert, and next request User.get_by_id will return latest value. I was wondering, to achieve strong consistency, is parent key an important thingy?

There are no problems as long as you don't actually need this parent entity.
You should not make a decision to use parent entities lightly. In fact, using entity groups (parent-child entities) limit the number of entities you can update per second and makes it necessary to know the parent key to retrieve a child entity.
You may run into serious problems. For example, if entity "User" is a child of some parent entity, and then all other entities are children of entities "User", that turns all of your data into one big entity group. Assuming your app is fairly active, you will see datastore operations failures because of this performance limitation.
Note also that a key of an entity gets longer if you have to include a key of a parent entity into it. If you create a chain of entities (e.g. parent -> user -> album -> photo), a key for each "photo" entity will include a key for album, a key for user and a key for parent entity. It becomes a nightmare to manage and requires much more storage space.

Using a parent key that doesn't correspond to an entity that actually has properties (which is what I think you're referring to as a 'physical entity') is a standard technique.
You can even decide later to add properties to that key.
I've been using this technique for years.

How would I achieve this using Google App Engine Datastore?

I am a beginner to Datastore and I am wondering how I should use it to achieve what I want to do.
For example, my app needs to keep track of customers and all their purchases.
Coming from relational database, I can achieve this by creating [Customers] and [Purchases] table.
In Datastore, I can make [Customers] and [Purchases] kinds.
Where I am struggling is the structure of the [Purchases] kind.
If I make [Purchases] as the child of [Customers] kind, would there be one entity in [Customers] and one entity in [Purchases] that share the same key? Does this mean inside of this [Purchases] entity, I would have a property that just keeps increasing for each purchase they make?
Or would I have one [Purchases] entity for each purchase they make and in each of these entities I would have a property that points to a entity in [Customers] kind?
How does Datastore perform in these scenarios?

Sounds like you don't fully understand ancestors. Let's go with the non-ancestor version first, which is a legitimate way to go:
class Customer(ndb.Model):
# customer data fields
name = ndb.StringProperty()
class Purchase(ndb.Model):
customer = ndb.KeyProperty(kind=Customer)
# purchase data fields
price = ndb.IntegerProperty
This is the basic way to go. You'll have one entity in the datastore for each customer. You'll have one entity in the datastore for each purchase, with a keyproperty that points to the customer.
IF you have a purchase, and need to find the associated customer, it's right there.
purchase_entity.customer.get()
If you have a Customer, you can issue a query to find all the purchases that belong to the customer:
Purchase.query(customer=customer_entity.key).fetch()
In this case, whenever you write either a customer or purchase entity, the GAE datastore will write that entity any one of the datastore machines running in the cloud that's not busy. You can have really high write throughput this way. However, when you query for all the purchases of a given customer, you just read back the most current data in the indexes. If a new purchase was added, but the indexes not updated yet, then you may get stale data (eventual consistency). You're stuck with this behavior unless you use ancestors.
Now as for the ancestor version. The basic concept is essentially the same. You still have a customer entity, and separate entities for each purchase. The purchase is NOT part of the customer entity. However, when you create a purchase using a customer as an ancestor, it (roughly) means that the purchase is stored on the same machine in the datastore that the customer entity was stored on. In this case, your write performance is limited to the performance of that one machine, and is advertised as one write per second. As a benefit though, you can can query that machine using an ancestor query and get an up-to-date list of all the purchases of a given customer.
The syntax for using ancestors is a bit different. The customer part is the same. However, when you create purchases, you'd create it as:
purchase1 = Purchase(ancestor=customer_entity.key)
purchase2 = Purchase(ancestor=customer_entity.key)
This example creates two separate purchase entities. Each purchase will have a different key, and the customer has its own key as well. However, each purchase key will have the customer_entity's key embedded in it. So you can think of the purchase key being twice as long. However, you don't need to keep a separate KeyProperty() for the customer anymore, since you can find it in the purchases key.
class Purchase(ndb.Model):
# you don't need a KeyProperty for the customer anymore
# purchase data fields
price = ndb.IntegerProperty
purchase.key.parent().get()
And in order to query for all the purchases of a given customer:
Purchase.query(ancestor=customer_entity.key).fetch()
The actual of structure of the entities don't change much, mostly the syntax. But the ancestor queries are fully consistent.
The third option that you kinda describe is not recommended. I'm just including it for completeness. It's a bit confusing, and would go something like this:
class Purchase(ndb.Model):
# purchase data fields
price = ndb.IntegerProperty()
class Customer(ndb.Model):
purchases = ndb.StructuredProperty(Purchase, repeated=True)
This is a special case which uses ndb.StructuredProperty. In this case, you will only have a single Customer entity in the datastore. While there's a class for purchases, your purchases won't get stored as separate entities - they'll just be stored as data within the Customer entity.
There may be a couple of reasons to do this. You're only dealing with one entity, so your data fetch will be fully-consistent. You also have reduced write costs when you have to update a bunch of purchases, since you're only writing a single entity. And you can still query on the properties of the Purchase class. However, this was designed for only having a limited number or repeated objects, not hundreds or thousands. And each entity is limited to ta total size of 1MB, so you'll eventually hit that and you won't be able to add more purchases.

(from your personal tags I assume you are a java guy, using GAE+java)
First, don't use the ancestor relationships - this has a special purpose to define the transaction scope (aka Entity Groups). It comes with several limitations and should not be used for normal relationships between entities.
Second, do use an ORM instead of low-level API: my personal favourite is objectify. GAE also offers JDO or JPA.
In GAE relations between entities are simply created by storing a reference (a Key) to an entity inside another entity.
In your case there are two possibilities to create one-to-many relationship between Customer and it's Purchases.
public class Customer {
#Id
public Long customerId; // 'Long' identifiers are autogenerated
// first option: parent-to-children references
public List<Key<Purchase>> purchases; // one-to-many parent-to-child
}
public class Purchase {
#Id
public Long purchaseId;
// option two: child-to-parent reference
public Key<Customer> customer;
}
Whether you use option 1 or option 2 (or both) depends on how you plane to access the data. The difference is whether you use get or query. The difference between two is in cost and speed, get being always faster and cheaper.
Note: references in GAE Datastore are manual, there is no referential integrity: deleting one part of a relationship will produce no warning/error from Datastore. When you remove entities it's up to your code to fix references - use transactions to update two entities consistently (hint: no need to use Entity Groups - to update two entities in a transaction you can use XG transactions, enabled by default in objectify).

I think the best approach in this specific case would be to use a parent structure.
class Customer(ndb.Model):
pass
class Purchase(ndb.Model):
pass
customer = Customer()
customer_key = customer.put()
purchase = Purchase(parent=customer_key)
You could then get all purchases of a customer using
purchases = Purchase.query(ancestor=customer_key)
or get the customer who bough the purchase using
customer = purchase.key.parent().get()
It might be a good idea to keep track of the purchase count indeed when you use that value a lot.
You could do that using a _pre_put_hook or _post_put_hook
class Customer(ndb.Model):
count = ndb.IntegerProperty()
class Purchase(ndb.Model):
def _post_put_hook(self):
# TODO check whether this is a new entity.
customer = self.key.parent().get()
customer.count += 1
customer.put()
It would also be good practice to do this action in a transacion, so the count is reset when putting the purchase fails and the other way around.
#ndb.transactional
def save_purchase(purchase):
purchase.put()

creating a compound or composite key on google app engine

I have two models:
Car(ndb.Model) and Branch(ndb.Model) each with a key method.
#classmethod
def car_key(cls, company_name, car_registration_id):
if not (company_name.isalnum() and car_registration_id.isalnum()):
raise ValueError("Company & car_registration_id must be alphanumeric")
key_name = company_name + "-" + car_registration_id
return ndb.Key("Car", key_name)
Branch Key:
#classmethod
def branch_key(cls, company_name, branch_name):
if not (company_name.isalnum() and branch_name.isalnum()):
raise ValueError("Company & Branch names must be alphanumeric")
key_name = company_name + "-" + branch_name
return ndb.Key("Branch", key_name)
However I'm thinking this is a bit ugly and not really how you're supposed to use keys.
(the car registration is unique to a car but sometimes one company may sell a car to another company and also cars move between branches).
Since a company may many cars or many branches, I suppose I don't want large entity groups because you can only write to an entity group once per second.
How should I define my keys?
e.g. I'm considering car_key = ndb.Key("Car", car_reg_id, "Company", company_name)
since it's unlikely for a car to have many companies so the entity group wont be too big.
However I'm not sure what to do about the branch key since many companies may have the same branch name, and many branches may have the same company.

You've rightly identified that ancestor relationships in GAE should not be based on the logical structure of your data.
They need to be based on the transactional behavior of your application. Ancestors make your life difficult. For example, once you use a compound key, you won't be able to fetch that entity by key unless you happen to know all the elements of the key. If you knew the Car id, you wouldn't be able to fetch it without also knowing the other component.
Consider what queries you would need to have strong consistency for. If you do happen to need strong consistency when querying all the cars in a given branch, then you should consider using that as an ancestor.
Consider what operations need to be done in a transaction, that's another good reason for using an entity group.
Keep in mind also, you might not need any entity group at all (probably the answer for your situation).
Or, on the flip side, you might need an entity group that might not exactly fit any logical conceptual model, but the ancestor might be an entity that exists purely to exists because you need an ancestor for a certain transaction.

NDB Modeling One-to-one with KeyProperty

I'm quite new to ndb but I've already understood that I need to rewire a certain area in my brain to create models. I'm trying to create a simple model - just for the sake of understanding how to design an ndb database - with a one-to-one relationship: for instance, a user and his info. After searching around a lot - found documentation but it was hard to find different examples - and experimenting a bit (modeling and querying in a couple of different ways), this is the solution I found:
from google.appengine.ext import ndb
class Monster(ndb.Model):
name = ndb.StringProperty()
#classmethod
def get_by_name(cls, name):
return cls.query(cls.name == name).get()
def get_info(self):
return Info.query(Info.monster == self.key).get()
class Info(ndb.Model):
monster = ndb.KeyProperty(kind='Monster')
address = ndb.StringProperty()
a = Monster(name = "Dracula")
a.put()
b = Info(monster = a.key, address = "Transilvania")
b.put()
print Monster.get_by_name("Dracula").get_info().address
NDB doesn't accept joins, so the "join" we want has to be emulated using class methods and properties. With the above system I can easily reach a property in the second database (Info) through a unique property in the first (in this case "name" - suppose there are no two monsters with the same name).
However, if I want to print a list with 100 monster names and respective addresses, the second database (Info) will be hit 100 times.
Question: is there a better way to model this to increase performance?

If its truly a one to one relationship, why are creating 2 models. Given your example the Address entity cannot be shared with any Monster so why not put the Address details in the monster.
There are some reasons why you wouldn't.
Address could become large and therefore less efficient to retrieve 100's of properties when you only need a couple - though project queries may help there.
You change your mind and you want to see all monsters that live in Transylvania - in which case you would create the address entity and the Monster would have the key property that points to the Address. This obviously fails when you work out that some monsters can live in multiple places (Werewolfs - London, Transylvania, New York ;-) , in which case you either have a repeating KeyProperty in the monstor or an intermediate entity that points to the monster and the address. In your case I don't think that monsters on the whole have that many documented Addresses ;-)
Also if you are uniquely identifying monsters by name you should consider storing the name as part of the key. Doing a Monster.get_by_id("dracula") is quicker than a query by name.
As I wrote (poorly) in the comment. If 1. above holds and it is a true one to one relationship. I would then create Address as a child entity (Monster is the parent/ancestor in the key) when creating address. This allows you to,
allow other entities to point to the Address,
If you create a bunch of child entities, fetch them with a single
ancestor query). 3 If you have get monster and it's owned entities
again it's an ancestor query.
If you have a bunch of entities that
should only exist if Monster instance exists and they are not
children, then you have to do querys on all the entity types with
KeyProperty's matching the key, and if theses entities are not
PolyModels, then you have to perform a query for each entity
type (and know you need to perform the query on a given entity,
which involves a registry of some type, or hard coding things)

I suspect what you may be trying could be achieved by using elements described in the link below
Have a look at "Operations on Multiple Keys or Entities" "Expando Models" "Model Hooks"
https://developers.google.com/appengine/docs/python/ndb/entities
(This is probably more a comment than an answer)

Google Appengine: Is This a Good set of Entity Groups?

I am trying to wrap my head around Entity Groups in Google AppEngine. I understand them in general, but since it sounds like you can not change the relationships once the object is created AND I have a big data migration to do, I want to try to get it right the first time.
I am making an Art site where members can sign up as regular a regular Member or as one of a handful of non-polymorphic Entity "types" (Artist, Venue, Organization, ArtistRepresentative, etc). Artists, for example can have Artwork, which can in turn have other Relationships (Gallery, Media, etc). All these things are connected via References and I understand that you don't need Entity Groups to merely do References. However, some of the References NEED to exist, which is why I am looking at Entity Groups.
From the docs:
"A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller."
That said, I have a couple hopefully yes/no questions.
Question 0: I gather you don't need Entity Groups just to do transactions. However, since Entity Groups are stored in the same region of Big Table, this helps cut down on consistency issues and race conditions. Is this a fair look at Entity Groups and Transactions together?
Question 1: When a child Entity is saved, do any parent objects get implicitly accessed/saved? i.e. If I set up an Entity Group with path Member/Artist/Artwork, if I save an Artwork object, do the Member and Artist objects get updated/accessed? I would think not, but I am just making sure.
Question 2: If the answer to Question 1 is yes, does the accessing/updating only travel up the path and not affect other children. i.e. If I update Artwork, no other Artwork child of Member is updated.
Question 3: Assuming it is very important that the Member and its associated account type entity exist when a user signs up and that only the user will be updating its Member and associated account type Entity, does it make sense to put these in Entity Groups together?
i.e. Member/Artist, Member/Organization, Member/Venue.
Similarly, assuming only the user will be able to update the Artwork entities, does it make sense to include those as well? Note: Media/Gallery/etc which are references to Artwork may be related to lots of Artwork, not just those owned by the user (i.e. many to many relations).
It makes sense to have all the user's bits in an entity group if it works the way I suspect (i.e. Q1/Q2 are "no"), since they will all be in the same region of BigTable. However, adding the Artwork to the entity group seems like it might violate the "keep it small" principal and honestly, may not need to be in Transactions aside from saving bandwidth/retrys when users are uploading artwork images.
Any thoughts? Am I approaching Entity Groups wrong?

0: You do need entity groups for transactions among multiple entities
1: Modifying/accessing children does not modify/access a parent
2: N/A
3: Sounds reasonable. My feeling is, entity groups should not be used unless you need transactions among them.
It is not necessary to have the the Artwork as a child for permission purposes. But if you need transactional modification to them (including e.g. creation and deletion) it might be better. For example: if you delete an account, you delete the user entity but before you delete the child, you get DeadlineExceeded or the server crashes. Now you have an orphaned Artwork. If you have more than 1,000 Artworks for an Artist, you must delete in batches.
Good luck!

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight