As far as I understand from the App Engine tutorial, entity groups exist only for the purpose of transactions:
"Only use entity groups when they are needed for transactions" (from the tutorial)
The definition of being in the same entity group is having the same root. In that case, what is the use of having more than one hierarchy level?
That is, why should I use "A -> B -> C" (A is the root, B its child, C its grandchild)
instead of "A -> B ; A -> C"? (A, B and C are still in the same entity group since A is their root.)
If the only purpose of entity groups is to make transactions possible between entities, why should I use more than one hierarchy level (what do I gain from the Root -> Grandchild linkage)?
When you're doing queries, you can use ancestor() to restrict the query to children of a particular entity - in your example, you could look for only descendants of B, which you couldn't do if they were all at the top level.
There's more on Ancestor Queries in Programming Google App Engine
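To make the difference between the two layouts concrete, here is a minimal in-memory sketch. It is not the App Engine API: keys are modelled as plain tuples of (kind, id) pairs, and `root_of`, `is_descendant` and `ancestor_query` are hypothetical helper names invented for illustration.

```python
# In-memory model only: a key is a tuple of (kind, id) pairs from root to entity.
# None of these helpers are App Engine APIs; they are assumptions for illustration.

def root_of(key):
    """The entity group is identified by the first element of the key path."""
    return key[:1]

def is_descendant(key, ancestor):
    """True if `ancestor` is a strict prefix of `key`."""
    return len(key) > len(ancestor) and key[:len(ancestor)] == ancestor

def ancestor_query(keys, ancestor):
    """Mimics an ancestor filter by keeping only keys below `ancestor`.

    Simplification: as far as I know the real filter also matches the
    ancestor entity itself; this sketch keeps strict descendants only.
    """
    return [k for k in keys if is_descendant(k, ancestor)]

A = (("A", 1),)
B = A + (("B", 1),)
C_nested = B + (("C", 1),)   # layout "A -> B -> C"
C_flat = A + (("C", 2),)     # layout "A -> C"

keys = [A, B, C_nested, C_flat]

# All four keys share the root A, so they are in one entity group either way.
same_group = all(root_of(k) == A for k in keys)

# Only the nested C can be reached through an ancestor filter on B.
under_b = ancestor_query(keys, B)
```

Both layouts give the same transactional grouping; the extra level only buys the ability to scope a query to B's subtree.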
The Keys and Entity Groups doc also says that:
Entity group relationships tell App Engine to store several entities in the same part of the distributed network ... All entities in a group are stored in the same datastore node
edit: The same document also lists some of the reasons why you don't want your entity groups to grow too large:
The more entity groups your application has—that is, the more root entities there are—the more efficiently the datastore can distribute the entity groups across datastore nodes. Better distribution improves the performance of creating and updating data. Also, multiple users attempting to update entities in the same entity group at the same time will cause some users to retry their transactions, possibly causing some to fail to commit changes. Do not put all of the application's entities under one root.
Any transaction on an entity in a group will cause any other concurrent writes to the same entity group to fail. If you have a large entity group with lots of writes, this causes lots of contention, and your app then has to handle the expected write failures. Avoiding datastore contention goes into more detail on the strategies you can use to minimise the contention.
Actually, transactions are a side effect of entity groups. Because the rows of an entity group are co-located, transactions on them are possible at all.
I would even go as far as claiming that entity groups are an intrinsic property of the datastore that makes it similar to hierarchical databases.
When you store A -> B -> C, A has many Bs, and a B has many Cs. When you store A -> B and A -> C, A has many Bs, and many Cs. In other words, a C doesn't belong to a single B.
Which structure you use really depends on the data you're storing.
When doing lots of writes, you might have to do unintuitive things to your entity groups; see Sharding Counters for an example of this.
I am structuring my datastore 'schema' and I have created a root entity that has many child entities. My application will potentially do thousands of writes to the child entities. (The reason for this was some simplicity in terms of transactions: I can save the child entities in one transaction because they are all in one entity group. But let's forget transactions for now.)
I am afraid that as my application grows there will be many more writes. Would it be better to opt for a 'schema' where the child entities are root entities, thus writing to many entity groups?
In terms of performance (writes per second, abstracting from contention and transactions), is saving a batch of different root entities any different from saving the same batch when they all belong to one entity group?
Besides that, is there a difference in performance if those child entities are of one kind or all of different kinds?
There is a limit:
This approach achieves strong consistency by writing to a single
entity group per guestbook, but it also limits changes to the
guestbook to no more than 1 write per second (the supported limit for
entity groups).
(from Structuring Data for Strong Consistency)
There is no reason to put entities into the same group unless you need transactions. Besides performance considerations, the size of storage data will dramatically increase: a key of a child entity contains a key of every ancestor entity.
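As a rough back-of-the-envelope model of the quoted limit, the sketch below estimates a lower bound on how long a batch of writes takes when each entity group accepts about one write per second. The function name and the sample root names are hypothetical, and the model ignores batching, latency and contention; it only illustrates why spreading writes over many roots raises the ceiling.

```python
from collections import Counter

# Assumption taken from the quoted documentation: about 1 write/second per entity group.
WRITES_PER_SECOND_PER_GROUP = 1

def min_seconds(write_roots):
    """Lower bound on elapsed seconds for a list of writes.

    `write_roots` holds the root key of the entity group each write targets.
    Groups are assumed to proceed in parallel, so the busiest group sets the bound.
    """
    per_group = Counter(write_roots)
    if not per_group:
        return 0
    return max(per_group.values()) / WRITES_PER_SECOND_PER_GROUP

# 1000 writes under one root versus 1000 writes spread over 1000 roots.
one_group = min_seconds(["Root:1"] * 1000)
many_groups = min_seconds(["Root:%d" % i for i in range(1000)])
```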
I'm trying to wrap my head around how I can represent a many-to-many relationship inside of AppEngine's Datastore in the Go Programming Language. I'm more used to traditional relational databases.
I have two types of entities in my system. Let's call them A and B. Every A entity is related to some number of B entities. Similarly, every B entity is related to some number of A entities. I'd like to be able to efficiently query for all B entities given an A entity, and for all A entities given a B entity.
In the Python SDK, there seems to be a way to note fields in an entity can be ReferencePropertys which reference some other entity. However, I can't find something similar in Go's AppEngine SDK. Go seems to just use basic structs to represent entities.
What's the best practice for dealing with this?
A Python ReferenceProperty essentially stores a key to another entity. It's similar to using a Key field in Go.
There are at least two ways to solve your problem: a cheap way to store a limited number of references, and an expensive way for larger data sets.
fmt.Println.MKO provided the answer for the cheap way, except the query is simpler than what he suggests, it should actually be:
SELECT * FROM B where AIds = 'A1'
This method is limited by the number of indexed entries per entity, as well as by the entity size. So the list of AIds or BIds will limit the number of related entities to 20000 or fewer.
If you have an insane amount of data, you would want a mapping entity to represent the M2M relationship between a given A & B entity. It would simply contain a key to an A and a key to a B. You would then query for map entities, and then fetch the corresponding A or B entities you need. This would be much more expensive, but breaks past the entity size limit.
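A minimal in-memory sketch of the mapping-entity idea follows. It does not use the datastore API: `ABLink`, `b_keys_for` and `a_keys_for` are hypothetical names, and the list scan stands in for a query on the mapping kind. The keys returned would then be used to fetch the actual A or B entities in a second step.

```python
from collections import namedtuple

# Hypothetical mapping entity: one record per (A, B) relationship.
ABLink = namedtuple("ABLink", ["a_key", "b_key"])

links = [
    ABLink("A1", "B1"),
    ABLink("A1", "B2"),
    ABLink("A2", "B2"),
]

def b_keys_for(a_key, links):
    """Stand-in for 'query the mapping kind where a_key = ...'."""
    return [link.b_key for link in links if link.a_key == a_key]

def a_keys_for(b_key, links):
    """Stand-in for 'query the mapping kind where b_key = ...'."""
    return [link.a_key for link in links if link.b_key == b_key]

bs_of_a1 = b_keys_for("A1", links)
as_of_b2 = a_keys_for("B2", links)
```

The extra cost mentioned above shows up as two steps per lookup: one query over the mapping records, then one fetch of the entities they point to.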
Depending on how you wish to query, you could do the following.
In your struct A, add a field:
BIds []int64
In your struct B, add a field:
AIds []int64
Now, any time you add a relation between an A and a B, you just need to add the corresponding IDs to both fields.
When you need to query for all Bs related to A1, you write the query like this:
SELECT * FROM B where AIds = 'A1'
For all As related to B1, you do it similarly:
SELECT * FROM A where BIds = 'B1'
update:
altered queries on suggestion from dragonx
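The ID-list approach can be sketched in memory as follows. This is an illustration in Python rather than the Go or Python datastore API; the classes `A` and `B`, the helper `relate`, and the filter functions are all hypothetical. The membership test stands in for an equality filter on a list-valued property, which matches when any element of the list equals the given value.

```python
from dataclasses import dataclass, field

@dataclass
class A:
    id: int
    b_ids: list = field(default_factory=list)  # mirrors "BIds []int64"

@dataclass
class B:
    id: int
    a_ids: list = field(default_factory=list)  # mirrors "AIds []int64"

def relate(a, b):
    """Record the relationship on both sides, avoiding duplicate IDs."""
    if b.id not in a.b_ids:
        a.b_ids.append(b.id)
    if a.id not in b.a_ids:
        b.a_ids.append(a.id)

def bs_related_to(a_id, all_bs):
    """Stand-in for: SELECT * FROM B WHERE AIds = <a_id>."""
    return [b for b in all_bs if a_id in b.a_ids]

def as_related_to(b_id, all_as):
    """Stand-in for: SELECT * FROM A WHERE BIds = <b_id>."""
    return [a for a in all_as if b_id in a.b_ids]

a1, a2 = A(1), A(2)
b1, b2 = B(10), B(20)
relate(a1, b1)
relate(a1, b2)
relate(a2, b2)

ids_for_a1 = [b.id for b in bs_related_to(1, [b1, b2])]
ids_for_b2 = [a.id for a in as_related_to(20, [a1, a2])]
```

Note that both sides must be updated together, so adding a relation means writing two entities.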
Is it the closest or the most distant ancestor of the entity being written that determines the entity group? (Question 1) Suppose I have
two simultaneous requests to write two different entities, both having the Data entity (with key '2') as their immediate parent, with the following ancestor chains:
Person:9 > Collection:3 > Script:4 > Data:2 > Record of Tom Cruise
Person:9 > Collection:3 > Script:4 > Data:2 > Record of Shia La Boef
In either case they both belong to the same entity group, either anchored at entity Person:9, or at entity Data:2. Which is the correct determiner of the entity group, Person:9 or Data:2? Also if there are two kinds of entities descended from Data:2, say Record and Scheme, will all the Record and Scheme entities belong to the same entity group, anchored by Data:2, or, by virtue of being different kinds, belong to separate entity groups? (Question 2)
Incidentally, if it is Person:9 which determines the entity group, and different kinds under a parent do not form different entity groups under that parent, then everything descended from Person:9 is in the same entity group and is going to have to be written serially. The horror.
Since in this example I am performing these writes of the same kind of entity to the same entity group concurrently, they will be applied serially, and will therefore take 'double the time' compared with being applied concurrently.
What is a good solution for this 'doubling' of time taken? (Question 3 -- optional!)
I have thought of the following:
Since I know that the two separate writes must be initiated by two separate client instances, I can add a further ancestor to the chain, which represents the client instance doing the writing, like so:
Person:9 > Collection:3 > Script:4 > Data:2 > **Client:92** > Record of Tom Cruise
Person:9 > Collection:3 > Script:4 > Data:2 > **Client:37** > Record of Shia La Boef
This way the writes will belong to different entity groups (so long as the hypothesis of Person:9 anchoring the group is wrong), and therefore can always be performed concurrently. Can an AppEngineer/expert weigh in on this? (Question 4)
Further since I enforce the restriction that separate clients can only make serial requests to the datastore, and I can guarantee without any performance impact that any writes made by a single client never need to occur more than 1 time per second, the above method, if it works, will mean there is zero performance impact and as long as I have enough separate Clients (and, he, enough quota) I can make as many writes to the datastore as fast as the HTTP can carry them. Can an AppEngineer/expert weigh in on this? (Question 5)
The only issue I see with this group-splitting approach is that querying for the Record entities under the Data:2 parent is now complicated by the fact that, even though the records are related semantically, they are separated by the different clients. So in order to collect all records, I need to first collect all clients, and then collect all their records. Can anyone see whether this kind of "query all the children of the children you just queried" query would create a stupendously horrible performance impact? Can an AppEngineer/expert weigh in on this? (Question 6)
You have some misconceptions here.
Firstly, the documentation is quite explicit on what constitutes an entity group: it is everything under a root entity.
However I don't know where you got the idea that writes within an entity group are in some way more "serial" than ones outside. The documentation doesn't say that, or imply it. The only thing that it does say about this is that writes to a single entity group take place at no more than 1 per second.
The rest of your questions make no sense at all: adding another element to the chain doesn't change the root entity.
I'm not sure why you need such deep entity group chains in the first place. The documentation's advice on scaling is to keep entity groups small. If each leaf entity will only ever be written to by one client, it sounds like the client itself should be the root, and the rest of the path should not be part of the ancestry at all: perhaps you could use a ReferenceProperty to refer to one or more of those entities via its key.
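The point about the root can be checked with a small sketch. It only parses the path notation used in the question; `parse_path` and `root_of` are hypothetical helpers, not datastore calls. It compares the question's proposed chains with the alternative of making the client itself the root.

```python
def parse_path(text):
    """Turn 'Person:9 > Data:2' into (('Person', 9), ('Data', 2))."""
    parts = [p.strip() for p in text.split(">")]
    return tuple((kind, int(ident)) for kind, ident in (p.split(":") for p in parts))

def root_of(path):
    """The entity group is determined by the first element of the path."""
    return path[0]

# Parents proposed in the question: an extra Client level under Data:2.
tom_parent = parse_path("Person:9 > Collection:3 > Script:4 > Data:2 > Client:92")
shia_parent = parse_path("Person:9 > Collection:3 > Script:4 > Data:2 > Client:37")

# The root is still Person:9 for both, so both records stay in one entity group.
still_same_group = root_of(tom_parent) == root_of(shia_parent) == ("Person", 9)

# Alternative from the answer: the client is the root of its own group.
tom_client_root = parse_path("Client:92")
shia_client_root = parse_path("Client:37")
now_different_groups = root_of(tom_client_root) != root_of(shia_client_root)
```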
Where, that is at what level, do the locking and collisions occur in transactional operations on the same entity group? At the root? At some common, wide enough parent?
It's not clear to me what an "Entity Group" is for a transaction. Is it always a group originating at a root entity (one without a parent), or is there a mechanism that selects a group wide enough for the transaction?
For example when I have a model structure like this:
- School
- Teacher
- Class
- Course
- Lesson
- Evaluation
- Student
- Guardian
- Grade
- PresenceMarker
- TextBook
Do my transactional operations always refer to the "Entity Group" as a group originating at the School level (regardless of the level where the actual operation occurs), or, when I update student entities in the same class, can I only collide with other transactional operations occurring below the same Class entity?
In other words, is there only one entity group starting from School, or are there sub entity groups originating at every other level in the hierarchy? If there are sub entity groups, are they used for transaction isolation?
UPDATE:
Taking sharded counters as another example: will the sharding work if all the shards have a common parent? Will updating a single counter shard result in transaction collisions with updates on other shards?
Transactions in App Engine happen at the Entity Group level. (see docs here and here)
are there sub entity groups originating at every other level in the hierarchy?
There are no "sub entity groups". Every entity is in exactly one Entity Group, because it has one ultimate ancestor. In your example, all your models ultimately belong in the School's group.
Will the sharding work if all the shards have common parent?
For sharding to work as intended, each shard must be in its own Entity Group. If you look at the sample code, you can see that each shard is in its own group. You can also see that while the increment() method uses a transaction, the get_count() does not. The increment is only affecting one group, while the get_count is grabbing data from multiple groups.
Note: The latest release of App Engine allows for cross group transactions, but those are a special case, and the definition of a group has not changed.
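The sharding idea described in this answer can be sketched in memory as follows. This is not the sample code from the documentation and it does not call the datastore; `ShardedCounter`, `NUM_SHARDS` and the key naming are assumptions made for illustration. Each shard gets its own single-element key path, meaning it is its own root and therefore its own entity group.

```python
import random

NUM_SHARDS = 5  # hypothetical shard count

class ShardedCounter:
    def __init__(self, name, num_shards=NUM_SHARDS):
        self.name = name
        # Each shard key is a one-element path, so every shard is its own
        # root entity and hence its own entity group.
        self.shards = {("%s-shard-%d" % (name, i),): 0 for i in range(num_shards)}

    def increment(self):
        """Touch exactly one randomly chosen shard.

        In the datastore this step would run in a transaction on that
        shard's entity group only.
        """
        key = random.choice(list(self.shards))
        self.shards[key] += 1

    def get_count(self):
        """Sum over all shards; this read spans many groups and is not transactional."""
        return sum(self.shards.values())

counter = ShardedCounter("hits")
for _ in range(100):
    counter.increment()
total = counter.get_count()
```

If all shards shared a common parent, every increment would land in the same entity group and the contention the shards are meant to avoid would return.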
As far as I understand the docs, GAE always uses the root of the entity group tree for its journal, which manages locks and transactions.
Say I have an entity A and an entity B, where A is a parent of B; that is, many Bs can be part of a single A's entity group.
Now, say I put a bunch of Bs to the HRD (across many entity groups, i.e. they are spread across many A parents). If I now query for all Bs within a single entity group (i.e. the same A parent), am I guaranteed strong consistency? The subtlety here is that although I'm querying over a single entity group, the original PUT was over multiple entity groups.
Yes - queries over a single entity group (provided you specified an ancestor filter - simply having all the results coincidentally in the same group is not sufficient) are always strongly consistent.