Is it the closest or the most distant ancestor of the entity being written that determines the entity group? (Question 1) For example, suppose I have two simultaneous requests to write two different entities, both having the Data entity (with key '2') as their immediate parent, and having ancestor chains of:
Person:9 > Collection:3 > Script:4 > Data:2 > Record of Tom Cruise
Person:9 > Collection:3 > Script:4 > Data:2 > Record of Shia LaBeouf
In either case they both belong to the same entity group, anchored either at Person:9 or at Data:2. Which is the correct determiner of the entity group, Person:9 or Data:2? Also, if there are two kinds of entity descended from Data:2, say Record and Scheme, will all the Record and Scheme entities belong to one entity group anchored by Data:2, or, by virtue of being different kinds, belong to separate entity groups? (Question 2)
Incidentally, if it is Person:9 that determines the entity group, and different kinds under a parent do not form different entity groups, then everything descended from Person:9 is in the same entity group and will have to be written serially. The horror.
Since in this example I am performing these writes of the same kind of entity to the same entity group concurrently, they will be applied serially, and will therefore take double the time they would if they could be applied concurrently.
What is a good solution for this 'doubling' of time taken? (Question 3 -- optional!)
I have thought of the following:
Since I know that the two separate writes must be initiated by two separate client instances, I can add a further ancestor to the chain, which represents the client instance doing the writing, like so:
Person:9 > Collection:3 > Script:4 > Data:2 > **Client:92** > Record of Tom Cruise
Person:9 > Collection:3 > Script:4 > Data:2 > **Client:37** > Record of Shia LaBeouf
This way the writes will belong to different entity groups (so long as the hypothesis of Person:9 anchoring the group is wrong), and therefore can always be performed concurrently. Can an AppEngineer/expert weigh in on this? (Question 4)
Further, since I enforce the restriction that separate clients can only make serial requests to the datastore, and I can guarantee without any performance impact that writes made by a single client never need to occur more than once per second, the above method, if it works, means there is zero performance impact; as long as I have enough separate Clients (and, heh, enough quota) I can make as many writes to the datastore as fast as HTTP can carry them. Can an AppEngineer/expert weigh in on this? (Question 5)
The only issue I see with this group-splitting approach is that querying for the Record entities under the Data:2 parent is now complicated by the fact that, even though the records are related semantically, they are separated under the different clients. So in order to collect all records, I first need to collect all clients, and then collect all their records. Can anyone see whether this kind of "query all the children of the children you just queried" query would create a stupendously horrible performance impact? Can an AppEngineer/expert weigh in on this? (Question 6)
You have some misconceptions here.
Firstly, the documentation is quite explicit on what constitutes an entity group: it is everything under a root entity.
However I don't know where you got the idea that writes within an entity group are in some way more "serial" than ones outside. The documentation doesn't say that, or imply it. The only thing that it does say about this is that writes to a single entity group take place at no more than 1 per second.
The rest of your questions make no sense at all: adding another element to the chain doesn't change the root entity.
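To illustrate, here is a minimal ndb sketch using the example chain (the Record ids are made up): the root of a key path stays the same no matter how many intermediate elements you insert.

from google.appengine.ext import ndb

tom = ndb.Key('Person', 9, 'Collection', 3, 'Script', 4,
              'Data', 2, 'Record', 'tom-cruise')
shia = ndb.Key('Person', 9, 'Collection', 3, 'Script', 4,
               'Data', 2, 'Client', 37, 'Record', 'shia-labeouf')

# root() walks to the top of the ancestor path; both entities are in
# the entity group anchored at Person:9, extra Client element or not.
print(tom.root())   # Key('Person', 9)
print(shia.root())  # Key('Person', 9)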
I'm not sure why you need such deep entity group chains in the first place. The documentation's advice on scaling is to keep entity groups small. If each leaf entity will only ever be written to by one client, it sounds like the client itself should be the root, and the rest of the path should not be part of the ancestry at all: perhaps you could use a ReferenceProperty to refer to one or more of those entities via its key.
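As a rough sketch of that layout, using ndb's KeyProperty (the counterpart of db's ReferenceProperty); the fields shown are illustrative, not taken from the question:

from google.appengine.ext import ndb

class Record(ndb.Model):
    name = ndb.StringProperty()
    # Plain reference to the Data entity by key, instead of making the
    # whole Person > Collection > Script > Data chain an ancestor path.
    data = ndb.KeyProperty(kind='Data')

# The client is the root, so each client writes to its own small group.
record = Record(parent=ndb.Key('Client', 92),
                name='Tom Cruise',
                data=ndb.Key('Person', 9, 'Collection', 3,
                             'Script', 4, 'Data', 2))
record.put()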
Related
I have two models which naturally exist in a parent-child relationship. IDs for the child are unique within the context of a single parent, but not necessarily globally, and whenever I want to query a specific child, I'll have the IDs for both parent and child available.
I can implement this two ways.
1. Make the datastore key name of each child entity the string "<parent_id>,<child_id>", and do joins and splits to process the IDs.
2. Use parent keys.
Option 2 sounds like the obvious winner from a code perspective, but will it hurt performance on writes? If I never use transactions, is there still overhead for concurrent writes to different children of the same parent? Is the datastore smart enough to know that if I do two transactions in the same entity group which can't affect each other, they should both still apply? Or should parent keys be avoided if locking isn't necessary?
In terms of the datastore itself, parent/child relationships are conceptual only. That is, the actual entities are not joined in any way.
A key consists of a Parent Key, a Kind and Id. This is the only link between them.
Therefore, there isn't any real impact beyond the ability to do things transactionally. Similarly, siblings have no actual relationship, just a conceptual one.
For example, you can put an entity into the datastore referencing a parent which doesn't actually exist. That is entirely legitimate and oftentimes very useful.
So, the only difference between option 1 and option 2 is that with option 1 you have to do more heavy lifting and cannot take advantage of transactions or strongly consistent queries.
Edit: The points above do not mention the limitation of 1 write per entity group per second. So, to directly answer the original question, using parent keys limits throughput if you want to write to many entities sharing the same parent key within a second, outside of a single transaction.
In general, if you don't need two entities to be updated or read in the same transaction, they should not be in the same entity group, i.e. have similar roots in their key paths, as they would if one were a key-parent of the other. If they're in the same entity group, then concurrent updates to either entity will contend for the entire group, and some updates may need to be retried.
From your question, it sounds like "<parent_id>,<child_id>" is an appropriate key name for the child. If you need to access these IDs separately (such as to get all entities with a particular "<child_id>"), you can store them as indexed properties, and perform queries as needed.
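A minimal sketch of that approach, assuming ndb (the kind and property names are placeholders):

from google.appengine.ext import ndb

class Child(ndb.Model):
    # The key name encodes both ids, so a direct get needs no ancestor.
    parent_id = ndb.StringProperty()  # indexed copies for querying
    child_id = ndb.StringProperty()

Child(id='p1,c7', parent_id='p1', child_id='c7').put()

# Direct lookup when both ids are known:
entity = ndb.Key(Child, 'p1,c7').get()

# All children with a particular child_id, via the indexed property:
matches = Child.query(Child.child_id == 'c7').fetch()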
As for transactions, you cannot do multiple concurrent writes:
https://developers.google.com/appengine/docs/java/datastore/transactions#Java_What_can_be_done_in_a_transaction
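The linked page is the Java docs; for completeness, here is a minimal Python/ndb transaction sketch with a hypothetical Counter model:

from google.appengine.ext import ndb

class Counter(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(counter_key):
    # Reads and writes inside the function form one transaction; if a
    # concurrent transaction touches the same entity group, ndb retries.
    counter = counter_key.get() or Counter(key=counter_key)
    counter.count += 1
    counter.put()

increment(ndb.Key(Counter, 'page-hits'))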
I am structuring my datastore 'schema' and I have created a root entity that has many child entities. My application will potentially do thousands of writes to the child entities. (The reason for this was some simplicity in terms of transactions - I can save child entities in one transaction, since they are all one entity group - but let's forget transactions for now.)
I am afraid that as my application grows and there are many more writes, wouldn't it be better to opt for a 'schema' where the child entities were root entities, thus writing to many entity groups?
In terms of performance - writes/second - is it different to save a batch of entities that are all root entities versus the same batch when they all belong to one entity group (abstracting from contention and transactions)?
Besides that, is there a difference in performance if those child entities are of one kind or of different kinds?
There is a limit:
This approach achieves strong consistency by writing to a single entity group per guestbook, but it also limits changes to the guestbook to no more than 1 write per second (the supported limit for entity groups).
(from Structuring Data for Strong Consistency)
There is no reason to put entities into the same group unless you need transactions. Besides performance considerations, the size of storage data will dramatically increase: a key of a child entity contains a key of every ancestor entity.
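To make the storage point concrete, a small sketch (ndb, hypothetical kinds): a root entity's key is just its own kind and id, while a child's key repeats the full ancestor path.

from google.appengine.ext import ndb

root_key = ndb.Key('Child', 42)
child_key = ndb.Key('GrandParent', 7, 'Parent', 1, 'Child', 42)

# flat() shows the full key path that gets stored with the entity.
print(root_key.flat())   # ('Child', 42)
print(child_key.flat())  # ('GrandParent', 7, 'Parent', 1, 'Child', 42)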
I have two models:
Car(ndb.Model) and Branch(ndb.Model) each with a key method.
@classmethod
def car_key(cls, company_name, car_registration_id):
    if not (company_name.isalnum() and car_registration_id.isalnum()):
        raise ValueError("Company & car_registration_id must be alphanumeric")
    key_name = company_name + "-" + car_registration_id
    return ndb.Key("Car", key_name)
Branch Key:
@classmethod
def branch_key(cls, company_name, branch_name):
    if not (company_name.isalnum() and branch_name.isalnum()):
        raise ValueError("Company & Branch names must be alphanumeric")
    key_name = company_name + "-" + branch_name
    return ndb.Key("Branch", key_name)
However I'm thinking this is a bit ugly and not really how you're supposed to use keys.
(the car registration is unique to a car but sometimes one company may sell a car to another company and also cars move between branches).
Since a company may have many cars or many branches, I suppose I don't want large entity groups, because you can only write to an entity group once per second.
How should I define my keys?
e.g. I'm considering car_key = ndb.Key("Car", car_reg_id, "Company", company_name)
since it's unlikely for a car to have many companies, so the entity group won't be too big.
However I'm not sure what to do about the branch key since many companies may have the same branch name, and many branches may have the same company.
You've rightly identified that ancestor relationships in GAE should not be based on the logical structure of your data.
They need to be based on the transactional behavior of your application. Ancestors make your life difficult. For example, once you use a compound key, you won't be able to fetch that entity by key unless you happen to know all the elements of the key. If you knew the Car id, you wouldn't be able to fetch it without also knowing the other component.
Consider what queries you would need to have strong consistency for. If you do happen to need strong consistency when querying all the cars in a given branch, then you should consider using that as an ancestor.
Consider what operations need to be done in a transaction; that's another good reason for using an entity group.
Keep in mind also, you might not need any entity group at all (probably the answer for your situation).
Or, on the flip side, you might need an entity group that doesn't exactly fit any logical conceptual model: the ancestor might be an entity that exists purely because you need an ancestor for a certain transaction.
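If it turns out you don't need an ancestor at all, one possible sketch (property names are my own) keeps Car and Branch as root entities and models the movable relationships as ordinary properties:

from google.appengine.ext import ndb

class Car(ndb.Model):
    # Registration id as the key name; company and branch are plain
    # properties, so selling or moving the car is a simple update.
    company = ndb.StringProperty()
    branch = ndb.KeyProperty(kind='Branch')

car = Car(id='ABC123', company='Acme',
          branch=ndb.Key('Branch', 'Acme-Downtown'))
car.put()

# Eventually consistent query, but no entity-group write contention:
acme_cars = Car.query(Car.company == 'Acme').fetch()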
Say I have an Entity A and an Entity B, A is a parent of B, that is, many B's can be part of a single A's entity group.
Now, say I put to the HRD a bunch of B's across many entity groups - i.e. across many A parents. If I now query for all B's within a single entity group (i.e. the same A parent), am I guaranteed strong consistency? The subtlety here is that although I'm querying over a single entity group, the original PUT was over multiple entity groups.
Yes - queries over a single entity group (provided you specified an ancestor filter - simply having all the results coincidentally in the same group is not sufficient) are always strongly consistent.
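A minimal sketch of such a strongly consistent ancestor query, assuming ndb and the A/B kinds from the question:

from google.appengine.ext import ndb

class B(ndb.Model):
    value = ndb.IntegerProperty()

a_key = ndb.Key('A', 1)
ndb.put_multi([B(parent=a_key, value=i) for i in range(3)])

# The ancestor filter makes the query strongly consistent: it is
# guaranteed to see the puts above, even immediately afterwards.
results = B.query(ancestor=a_key).fetch()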
I am trying to wrap my head around Entity Groups in Google AppEngine. I understand them in general, but since it sounds like you can not change the relationships once the object is created AND I have a big data migration to do, I want to try to get it right the first time.
I am making an Art site where members can sign up as a regular Member or as one of a handful of non-polymorphic Entity "types" (Artist, Venue, Organization, ArtistRepresentative, etc). Artists, for example, can have Artwork, which can in turn have other Relationships (Gallery, Media, etc). All these things are connected via References, and I understand that you don't need Entity Groups to merely do References. However, some of the References NEED to exist, which is why I am looking at Entity Groups.
From the docs:
"A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller."
That said, I have a couple hopefully yes/no questions.
Question 0: I gather you don't need Entity Groups just to do transactions. However, since Entity Groups are stored in the same region of Big Table, this helps cut down on consistency issues and race conditions. Is this a fair look at Entity Groups and Transactions together?
Question 1: When a child Entity is saved, do any parent objects get implicitly accessed/saved? i.e. If I set up an Entity Group with path Member/Artist/Artwork, if I save an Artwork object, do the Member and Artist objects get updated/accessed? I would think not, but I am just making sure.
Question 2: If the answer to Question 1 is yes, does the accessing/updating only travel up the path and not affect other children. i.e. If I update Artwork, no other Artwork child of Member is updated.
Question 3: Assuming it is very important that the Member and its associated account type entity exist when a user signs up and that only the user will be updating its Member and associated account type Entity, does it make sense to put these in Entity Groups together?
i.e. Member/Artist, Member/Organization, Member/Venue.
Similarly, assuming only the user will be able to update the Artwork entities, does it make sense to include those as well? Note: Media/Gallery/etc which are references to Artwork may be related to lots of Artwork, not just those owned by the user (i.e. many to many relations).
It makes sense to have all the user's bits in an entity group if it works the way I suspect (i.e. Q1/Q2 are "no"), since they will all be in the same region of BigTable. However, adding the Artwork to the entity group seems like it might violate the "keep it small" principle, and honestly it may not need to be in Transactions aside from saving bandwidth/retries when users are uploading artwork images.
Any thoughts? Am I approaching Entity Groups wrong?
0: You do need entity groups for transactions among multiple entities
1: Modifying/accessing children does not modify/access a parent
2: N/A
3: Sounds reasonable. My feeling is, entity groups should not be used unless you need transactions among them.
It is not necessary to have the Artwork as a child for permission purposes. But if you need transactional modification of them (including e.g. creation and deletion) it might be better. For example: if you delete an account, you delete the user entity, but before you delete the children you get DeadlineExceeded or the server crashes. Now you have orphaned Artwork. If you have more than 1,000 Artworks for an Artist, you must delete in batches.
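A sketch of such a batched cleanup (ndb, kind names assumed), deleting an Artist's Artwork with a keys-only ancestor query so each pass does a bounded amount of work:

from google.appengine.ext import ndb

def delete_artworks(artist_key, batch_size=500):
    # Keys-only fetches are cheap; delete in fixed-size batches so a
    # crash or DeadlineExceededError leaves less work stranded.
    while True:
        keys = ndb.Query(kind='Artwork', ancestor=artist_key).fetch(
            batch_size, keys_only=True)
        if not keys:
            break
        ndb.delete_multi(keys)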
Good luck!