I am trying to wrap my head around Entity Groups in Google AppEngine. I understand them in general, but since it sounds like you can not change the relationships once the object is created AND I have a big data migration to do, I want to try to get it right the first time.
I am making an Art site where members can sign up as a regular Member or as one of a handful of non-polymorphic Entity "types" (Artist, Venue, Organization, ArtistRepresentative, etc). Artists, for example, can have Artwork, which can in turn have other Relationships (Gallery, Media, etc). All these things are connected via References and I understand that you don't need Entity Groups to merely do References. However, some of the References NEED to exist, which is why I am looking at Entity Groups.
From the docs:
"A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller."
That said, I have a couple hopefully yes/no questions.
Question 0: I gather you don't need Entity Groups just to do transactions. However, since Entity Groups are stored in the same region of Big Table, this helps cut down on consistency issues and race conditions. Is this a fair look at Entity Groups and Transactions together?
Question 1: When a child Entity is saved, do any parent objects get implicitly accessed/saved? i.e. If I set up an Entity Group with path Member/Artist/Artwork, if I save an Artwork object, do the Member and Artist objects get updated/accessed? I would think not, but I am just making sure.
Question 2: If the answer to Question 1 is yes, does the accessing/updating only travel up the path and not affect other children. i.e. If I update Artwork, no other Artwork child of Member is updated.
Question 3: Assuming it is very important that the Member and its associated account type entity exist when a user signs up and that only the user will be updating its Member and associated account type Entity, does it make sense to put these in Entity Groups together?
i.e. Member/Artist, Member/Organization, Member/Venue.
Similarly, assuming only the user will be able to update the Artwork entities, does it make sense to include those as well? Note: Media/Gallery/etc which are references to Artwork may be related to lots of Artwork, not just those owned by the user (i.e. many to many relations).
It makes sense to have all the user's bits in an entity group if it works the way I suspect (i.e. Q1/Q2 are "no"), since they will all be in the same region of BigTable. However, adding the Artwork to the entity group seems like it might violate the "keep it small" principle and, honestly, the Artwork may not need to be in Transactions aside from saving bandwidth/retries when users are uploading artwork images.
Any thoughts? Am I approaching Entity Groups wrong?
0: You do need entity groups for transactions among multiple entities
1: Modifying/accessing children does not modify/access a parent
2: N/A
3: Sounds reasonable. My feeling is, entity groups should not be used unless you need transactions among them.
It is not necessary to have the Artwork as a child for permission purposes. But if you need transactional modification to them (including e.g. creation and deletion) it might be better. For example: if you delete an account, you delete the user entity, but before you delete the child you get DeadlineExceeded or the server crashes. Now you have an orphaned Artwork. If you have more than 1,000 Artworks for an Artist, you must delete in batches.
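A minimal sketch of batched deletion, in plain Python rather than the App Engine API. The names `batches`, `delete_account`, and `delete_fn` are hypothetical, as is the batch size of 500; `delete_fn` stands in for whatever call deletes a list of keys. As one possible ordering (an assumption, not something the datastore requires), it deletes the children first and the account last, so an interrupted run leaves the account in place to retry from instead of leaving orphaned Artwork.

```python
from itertools import islice


def batches(keys, size):
    """Yield successive lists of at most `size` keys."""
    iterator = iter(keys)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def delete_account(user_key, artwork_keys, delete_fn, batch_size=500):
    """Delete child keys in batches, then the parent key last.

    `delete_fn` is a hypothetical callable that deletes a list of keys.
    Returns the number of child keys deleted.
    """
    deleted = 0
    for chunk in batches(artwork_keys, batch_size):
        delete_fn(chunk)
        deleted += len(chunk)
    # Parent goes last: a crash before this line leaves no orphans,
    # only an account whose deletion can be retried.
    delete_fn([user_key])
    return deleted


if __name__ == "__main__":
    calls = []  # records each batch handed to the fake delete function
    artwork = ["Artwork:%d" % i for i in range(1200)]
    count = delete_account("Member:1", artwork, calls.append)
    print(count, [len(c) for c in calls])
```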
Good luck!
I am designing a database project for gym management. There are two users: the clerk, who can add, remove, and edit all trainers, centers, and members, and the member, who can only see and edit certain attributes related to himself. Member, center, and trainer are three entities in the ER diagram. So the question is: should I introduce an entity for the clerk, and if so, should it have a relationship with any of the three entities described above?
I wouldn't split up the two Entities based on the Fact that they have different permissions in your system.
I recommend you focus on the concepts behind the entities:
First, if all Attributes are equal I would start considering building 1 Entity out of the two. Once you end up with multiple columns that are mainly null it might have been a mistake to "merge" two entities.
In addition to that you should check if there is a central name that you can give your merged entity. For example if you have the two Entities: Manager, Employee and you want to merge them I would maybe just call it User and check if the Properties still make sense in that context.
Last but not least, you should think about how the Entities are used later in the development. If you need two Joins instead of one once you split up your Entities, that could be an argument for merging them. On the other hand, your 'clerk' Entity may later be extended by a few columns, and a merged entity would then end up with null columns again.
I think a general answer is not suitable since the Domain is unclear. Just collect arguments for and against merging the entities and compare those.
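As an illustration of the "merge" side of that comparison, here is a minimal sketch in plain Python. All names (`Role`, `User`, `can_edit_member`) are hypothetical, and it assumes the clerk and the member share the same attributes, so the permission difference is carried by a role value rather than by a separate entity.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    CLERK = "clerk"
    MEMBER = "member"


@dataclass
class User:
    """One merged entity; the role distinguishes clerk from member."""
    user_id: int
    name: str
    role: Role


def can_edit_member(actor: User, target_member_id: int) -> bool:
    """A clerk may edit any member; a member may edit only themselves."""
    if actor.role is Role.CLERK:
        return True
    return actor.user_id == target_member_id


if __name__ == "__main__":
    clerk = User(1, "Chris", Role.CLERK)
    member = User(2, "Morgan", Role.MEMBER)
    print(can_edit_member(clerk, 2))   # clerk editing a member
    print(can_edit_member(member, 3))  # member editing someone else
```

If the clerk later gains attributes that members never have, the same comparison may tip back toward a separate entity.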
My app should contain several users, each of them having a list of objects (only one user owns each object).
My question is: would it be better to have a User entity that references the IDs of its objects, or should I make the user the ancestor of the objects? Please be kind, I am just beginning with NoSQL and Datastore!
Which approach you take will depend heavily on your access patterns: what makes sense for easy retrieval, frequency of writes, etc. Start your design process by building a basic entity relationship model, then elaborate on what information you need to get to, how frequently it is required, and what security restrictions apply. Then look at how you need to adjust the real model to reflect these access use cases, taking into account performance, ease of use, and security requirements.
Which approach you should choose depends mainly on the consistency model (strong vs eventual) you require for your entities. In Google Cloud Datastore, an entity group (an entity and its descendants) is a unit with strong consistency, transactionality, and locality.
You can read more on the topic here and here.
There is one more important thing to take into account. If you model a parent-child relationship between a user and an object, the parent becomes part of the object's key; hence, if you change the object's owner later, you will end up with a different object in terms of its key.
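The effect of the parent being part of the key can be sketched with a toy model in plain Python. This is not the Datastore API: `make_key` and `parent_of` are hypothetical helpers that represent a key as a tuple of (kind, id) pairs.

```python
def make_key(*path):
    """Build a key as a tuple of (kind, id) pairs from a flat kind/id path."""
    if not path or len(path) % 2:
        raise ValueError("path must be a non-empty sequence of kind/id pairs")
    return tuple(zip(path[0::2], path[1::2]))


def parent_of(key):
    """Return the parent key, or None for a root key."""
    return key[:-1] or None


# Ancestor modelling: the owner is baked into the object's key.
item_under_alice = make_key("User", "alice", "Item", 1)
item_under_bob = make_key("User", "bob", "Item", 1)

# Reference modelling: the key stands alone, the owner is just a property.
item_key = make_key("Item", 1)
item = {"key": item_key, "owner": make_key("User", "alice")}
item["owner"] = make_key("User", "bob")  # owner changes, key does not

if __name__ == "__main__":
    print(item_under_alice == item_under_bob)  # same Item id, different keys
    print(parent_of(item_under_alice))
    print(item["key"] == item_key)
```

In the ancestor model, "moving" an object to another user means creating a new entity under the new parent and deleting the old one.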
Is it the closest or the most distant ancestor of the entity being written that determines the entity group? (Question 1) Suppose I have
two simultaneous requests to write two different entities, in this example both having the Data entity (with key '2') as immediate parent, and having the following ancestors:
Person:9 > Collection:3 > Script:4 > Data:2 > Record of Tom Cruise
Person:9 > Collection:3 > Script:4 > Data:2 > Record of Shia La Boef
In either case they both belong to the same entity group, either anchored at entity Person:9, or at entity Data:2. Which is the correct determiner of the entity group, Person:9 or Data:2? Also if there are two kinds of entities descended from Data:2, say Record and Scheme, will all the Record and Scheme entities belong to the same entity group, anchored by Data:2, or, by virtue of being different kinds, belong to separate entity groups? (Question 2)
Incidentally, if it is Person:9 which determines the entity group, and different kinds under a parent do not form different entity groups under that parent, then everything descended from Person:9 is in the same entity group and will have to be written serially. The horror!
Since in this example I am performing these writes of the same kind of entity to the same entity group concurrently, they will be applied serially, and will therefore take 'double the time' compared with being applied concurrently.
What is a good solution for this 'doubling' of time taken? (Question 3 -- optional!)
I have thought of the following:
Since I know that the two separate writes must be initiated by two separate client instances, I can add a further ancestor to the chain, which represents the client instance doing the writing, like so:
Person:9 > Collection:3 > Script:4 > Data:2 > **Client:92** > Record of Tom Cruise
Person:9 > Collection:3 > Script:4 > Data:2 > **Client:37** > Record of Shia La Boef
This way the writes will belong to different entity groups (so long as the hypothesis of Person:9 anchoring the group is wrong), and therefore can always be performed concurrently. Can an AppEngineer/expert weigh in on this? (Question 4)
Further since I enforce the restriction that separate clients can only make serial requests to the datastore, and I can guarantee without any performance impact that any writes made by a single client never need to occur more than 1 time per second, the above method, if it works, will mean there is zero performance impact and as long as I have enough separate Clients (and, he, enough quota) I can make as many writes to the datastore as fast as the HTTP can carry them. Can an AppEngineer/expert weigh in on this? (Question 5)
The only issue I see with this group-splitting approach is that querying for the Record entities under the Data:2 parent is now complicated by the fact that, even though the records are related semantically, they are separated by the different clients. So in order to collect all records, I need to first collect all clients, and then collect all their records. Can anyone see whether this would create a stupendously horrible performance impact, doing this kind of "query all the children of the children you just queried" query...? Can an AppEngineer/expert weigh in on this? (Question 6)
You have some misconceptions here.
Firstly, the documentation is quite explicit on what constitutes an entity group: it is everything under a root entity.
However I don't know where you got the idea that writes within an entity group are in some way more "serial" than ones outside. The documentation doesn't say that, or imply it. The only thing that it does say about this is that writes to a single entity group take place at no more than 1 per second.
The rest of your questions make no sense at all: adding another element to the chain doesn't change the root entity.
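To make the point about the root concrete, here is a toy model in plain Python (not the Datastore API). `parse_path` and `root_of` are hypothetical helpers, and the record ids `tom` and `shia` are made up for illustration. The root is simply the first element of the ancestor path, so inserting a Client element further down leaves it unchanged.

```python
def parse_path(text):
    """Parse 'Person:9 > Data:2' into [('Person', '9'), ('Data', '2')]."""
    return [tuple(part.strip().split(":", 1)) for part in text.split(">")]


def root_of(text):
    """The entity group is identified by the first (root) element."""
    return parse_path(text)[0]


# The two paths from the question, with the extra Client ancestor added.
with_client_a = "Person:9 > Collection:3 > Script:4 > Data:2 > Client:92 > Record:tom"
with_client_b = "Person:9 > Collection:3 > Script:4 > Data:2 > Client:37 > Record:shia"

# Alternative layout: the client itself is the root.
client_root_a = "Client:92 > Record:tom"
client_root_b = "Client:37 > Record:shia"

if __name__ == "__main__":
    print(root_of(with_client_a) == root_of(with_client_b))  # same group
    print(root_of(client_root_a) == root_of(client_root_b))  # different groups
```

Only the second layout, where each client is its own root, puts the two records in different entity groups.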
I'm not sure why you need such deep entity group chains in the first place. The documentation's advice on scaling is to keep entity groups small. If each leaf entity will only ever be written to by one client, it sounds like the client itself should be the root, and the rest of the path should not be part of the ancestry at all: perhaps you could use a ReferenceProperty to refer to one or more of those entities via its key.
Currently, a lot of my code makes extensive use of ancestors to put and fetch objects. However, I'm looking to change some stuff around.
I initially thought that ancestors helped make querying faster if you knew who the ancestor of the entity you're looking for was. But I think it turns out that ancestors are mostly useful for transaction support. I don't make use of transactions, so I'm wondering if ancestors are more of a burden on the system here than a help.
What I have is a User entity, and a lot of other entities such as say Comments, Tags, Friends. A User can create many Comments, Tags, and Friends, and so whenever a user does so, I set the ancestor for all these newly created objects as the User.
So when I create a Comment, I set the ancestor as the user:
comment = Comment(aUser, key_name = commentId)
Now the only reason I'm doing this is strictly for querying purposes. I thought it would be faster when I wanted to get all comments by a certain user to just get all comments with a common ancestor rather than querying for all comments where authorEmail = userEmail.
So when I want to get all comments by a certain user, I do:
commentQuery = db.GqlQuery('SELECT * FROM Comment WHERE ANCESTOR IS :1', userKey)
So my question is, is this a good use of ancestors? Should each Comment instead have a ReferenceProperty that references the User object that created the comment, and filter by that?
(Also, my thinking was that using ancestors instead of an indexed ReferenceProperty would save on write costs. Am I mistaken here?)
You are right about the write cost: an ancestor is part of the key, which comes "free". Using a reference property will increase your write cost if the reference property is indexed.
Since you query on that reference property, it will need to be indexed.
Ancestors are not only important for transactions: in the HRD (the default datastore implementation), if you don't create each comment with the same ancestor, the queries will not be strongly consistent.
-- Adding Nick's comment ---
Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down iff you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern.
I am working on a project (based in Django although that's not really relevant to my question) and I am struggling to work out the best way to represent the data models.
I have the four following models:
User,
Client,
Meeting,
Location
User and Client have a many-to-many relationship through the Meeting model. The Meeting model has a one-to-one relationship with the Location model.
Meetings will take place at either:
The address defined in the User (or UserProfile) model
The address defined in the Client model.
Some other location which has to be defined at a later date.
I'm struggling to work out the best way to store the Location data in order to make it as clean and reusable as possible.
I considered making Location as a field in the Meetings model rather than a model in its own right - although this could also lead to redundant data if lots of Meetings are created at the same location, so this is probably a non-starter.
I could automatically create Location records for each User and Client that gets created and use a generic relationship between the relevant records, however, I understand that this can lead to inefficient database performance. Also, not every Client / User would be able to hold meetings at their Location.
Can anyone see a tidier alternative?
Any advice appreciated.
Thanks.
I considered making Location as a field in the Meetings model rather
than a model in its own right - although this could also lead to
redundant data if lots of Meetings are created at the same location,
so this is probably a non-starter.
No, that's a really good thought, because it points you straight at the real problem.
The real problem is that there's a difference between a meeting and the parties that attend a meeting. A meeting has some attributes that have nothing to do with the attendees: it has at the very least a time and a place.
So I think you should change your thinking about the Meeting model.
Instead of users having a M:N relationship with clients through the Meeting model, they should have a M:N relationship through, say, an Attendance model. (A Registration or Reservation or MightAttend model might be more appropriate for you.) And the Meeting model should change to reflect the unique attributes of a real-world meeting: time and place.
I would expect Meetings and Locations to have a many-to-one relationship. Can't a location be used for more than one meeting? (at different times, of course)
It seems to me that a location has attributes that persist beyond its use for a single meeting. Example: seating capacity.
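The structure suggested in these answers can be sketched as follows, using plain Python dataclasses rather than Django models so the example runs on its own. The field names (`starts_at`, `seating_capacity`, and so on) are assumptions for illustration; in Django the `location` and `meeting` fields would presumably become foreign keys.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Location:
    """Persists beyond any single meeting."""
    name: str
    seating_capacity: int


@dataclass(frozen=True)
class Meeting:
    """A real-world meeting: a time and a place, nothing about attendees."""
    starts_at: datetime
    location: Location  # many meetings may share one location


@dataclass(frozen=True)
class Attendance:
    """Links a user and a client to a meeting (the M:N 'through' record)."""
    meeting: Meeting
    user: str
    client: str


if __name__ == "__main__":
    office = Location("Client office", seating_capacity=8)
    monday = Meeting(datetime(2024, 1, 8, 10, 0), office)
    friday = Meeting(datetime(2024, 1, 12, 14, 0), office)
    attendances = [
        Attendance(monday, user="alice", client="acme"),
        Attendance(friday, user="bob", client="acme"),
    ]
    # One Location record serves both meetings, so nothing is duplicated.
    print(monday.location is friday.location)
    print(len(attendances))
```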