I need to add a value to a PropertyList for 50 entities, and I have to make sure that no other code changes the PropertyList for a particular entity at the same time. Is it better to have one big transaction changing all 50 entities, or 50 small ones, each changing only one entity?
If you need exactly what your post says (updates to many entities, and transaction safety only on each entity), then you can use many small transactions.
If you must guarantee that none of the many entities is changed during this period, you should use one transaction, with all of your entities in the same entity group. Beware that the recommended write limit for an entity group is one update per second. If you really have to update 50 entities transactionally, and for some reason you cannot put them into the same entity group, you should consider reorganizing your data.
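For illustration, here is a minimal sketch of the "many small transactions" approach using the Python db client library; Item, keys_to_update and value are made-up stand-ins for your actual model and data:

    from google.appengine.ext import db

    class Item(db.Model):
        # hypothetical entity holding the property list
        property_list = db.StringListProperty()

    def append_value(key, value):
        item = db.get(key)
        item.property_list.append(value)
        item.put()

    # one small transaction per entity: each append is atomic for that
    # entity, but the 50 updates are independent of one another
    for key in keys_to_update:
        db.run_in_transaction(append_value, key, value)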
This requirement most likely implies that you might want to reconsider your design.
Currently you are solving the 'how do I implement this' question.
Perhaps you want to share your original problem, so there can be a better answer to the question of 'this is how it should work'.
Looking forward.
-J
I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user (grandparent) → post (parent) → comment (child)
I'm a little bit confused about ancestors. I read in the documentation and in searches that ancestors are used for transactions, that all entities under an ancestor are in the same entity group, and that an entity group is stored on the same datastore node, which makes it less scalable. Is this right?
Is making the user the parent of posts, and the post the parent of comments, a good thing?
Rather than this, we could add one extra property to the post entity, like user_id, as shown in the example, and filter by it.
Which is better/more scalable: filtering posts by ancestor, or adding an extra user_id property to the post entity and filtering by it?
I know both approaches can get the same results, but I want to know which one is better in performance and scalability.
Sorry, I'm new to Datastore.
Update 11/4/2017
A large number of users is using this app. It's quite possible that there is more than one post per second overall. A single user cannot create more than one post per second, but multiple users might. As described in the documentation, the maximum entity group write rate is 1/s. Is it still possible to use ancestors?
The same goes for comments: multiple users can add comments in the same entity group, so it's quite possible there is more than one comment per second.
Are ancestor queries faster?
I read in many places that ancestor queries are much faster than others.
As I understand it, the reason they are fast is that an entity group stores related data on the same node, so it takes less time to get data from a single node than from multiple nodes.
For example: if posts are stored on a node in Asia and comments on a node in Europe, and I want to get posts and comments, the datastore API needs to fetch from two nodes to complete the request, which makes it slow. Whereas if I create an ancestor relation and form an entity group, I get better performance.
But what if I don't need to get post and comment data at the same time? If I show posts on one web page and comments on a separate page, the datastore API needs to fetch from only one node at a time, so it doesn't matter whether the data is saved on a single node or on multiple nodes. What about query performance then: can ancestors make it faster in this case?
Yes, you are correct: all ancestry-related entities are in the same entity group, which raises two scalability issues: data contention and the maximum entity group write rate of 1/s. See the somewhat related Is there an Entity Group Max Size?
There are advantages to using ancestry, and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical for every new user/post/comment to appear in random searches immediately after it is created (i.e. strong consistency); the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.
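To illustrate, a minimal sketch of that ancestor-free modelling with the ndb client library (the model and property names are just examples):

    from google.appengine.ext import ndb

    class Post(ndb.Model):
        # cross-reference by key property instead of ancestry, so each
        # post is its own entity group and the 1 write/s limit doesn't
        # apply across a user's posts
        user_key = ndb.KeyProperty(kind='User')
        content = ndb.TextProperty()

    class Comment(ndb.Model):
        post_key = ndb.KeyProperty(kind='Post')
        body = ndb.TextProperty()

    # eventually consistent (non-ancestor) query for a user's posts
    posts = Post.query(Post.user_key == some_user_key).fetch(20)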
I think the question to ask is: are you expecting:
a user to create posts more than once per second (I doubt it :)
people to comment on a post more than once per second (could happen)
If not, then ancestor queries will be faster than normal queries, so it depends on your use case. I'd go for query speed unless you know you will have thousands of comments per post.
I'm searching for the best practice for storing a large number of Comment entities which have a one-to-many relationship to another entity.
I read a lot about the limitations of the datastore and don't know how to solve this.
I can't store them as structured properties due to the 1 MB entity size limit.
Also, Guido van Rossum answered a question about repeated properties by saying not to use them "if you have more than 100-1000 values".
So repeated properties are no solution for my comments either.
Final question: what is the best practice for solving this problem? Are ancestors an option?
Edit: In this question about ancestor or reference properties, Nick Johnson mentioned that "Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down if you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern."
What exactly does "writing multiple entities concurrently" mean? When different users comment on that entity at the same time?
It depends on how much you read/write per billing cycle.
You can store more than 1,000 references (up to an amount that depends on the key size and how you reference them) as JSON in compressed, unindexed properties. But take care when referencing and dereferencing that many keys: the overhead, and the amount of data you transfer on each request, will be big. You don't want to be doing operations on 1,000,000 compressed entity keys on the server just for a simple request. If you take this route, optimize the approach on the client as smartly as you can.
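As a sketch of that idea (the names are made up, using ndb, whose BlobProperty is unindexed and can compress transparently):

    import json
    from google.appengine.ext import ndb

    class CommentIndex(ndb.Model):
        # unindexed blob holding a JSON list of comment key IDs,
        # compressed on write and decompressed on read
        ids_blob = ndb.BlobProperty(compressed=True)

    def store_ids(index_key, comment_ids):
        index = index_key.get() or CommentIndex(key=index_key)
        index.ids_blob = json.dumps(comment_ids)
        index.put()

    def load_ids(index_key):
        index = index_key.get()
        return json.loads(index.ids_blob) if index else []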
Go for ancestors, and/or design your logic so it doesn't need strong consistency (e.g. it doesn't matter if a comment isn't shown immediately), and use iterators or cursors (the query "pointers" for resuming where you left off).
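A minimal sketch of cursor-based paging over comments, assuming an illustrative Comment model with a post_key property:

    from google.appengine.ext import ndb
    from google.appengine.datastore.datastore_query import Cursor

    class Comment(ndb.Model):
        post_key = ndb.KeyProperty(kind='Post')
        created = ndb.DateTimeProperty(auto_now_add=True)
        body = ndb.TextProperty()

    def comment_page(post_key, cursor_str=None, page_size=20):
        # eventually consistent, non-ancestor query paged with a cursor
        cursor = Cursor(urlsafe=cursor_str) if cursor_str else None
        page, next_cursor, more = (
            Comment.query(Comment.post_key == post_key)
                   .order(-Comment.created)
                   .fetch_page(page_size, start_cursor=cursor))
        return page, next_cursor.urlsafe() if (more and next_cursor) else None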
I am learning App Engine, have started developing a new app, and want to clarify something.
I understood that:
a. To achieve atomicity of updates/deletes across several entities we need to do them in a transaction, and hence all of them should fall under the same entity group.
b. Having big entity groups is not scalable as it causes contention.
(Q1: Correct?)
So here is an entity model of an online examination system for sake of discussion:
Entities:
Subject
Exam
Page
Question
Answer
As you can see from the top, each entity has a 1-many relationship with the one immediately below it, i.e. 1 subject can have many exams, 1 exam many pages, 1 page many questions...
As you can see, I would like to establish cascading update/delete relationships among these entities (the JPA DataNucleus App Engine implementation supports this under the hood by putting all entities in the same entity group (Q2: correct?), though App Engine natively doesn't support this constraint), so naturally all of them would go into the same entity group so that:
a. I can delete a page (if my user does) in a transaction and be sure that all its questions and answers are deleted too
b. or I can delete a subject altogether in a transaction and clear all the stuff underneath it.
So when I extend this to my real app, I see that all (or at least most) of my entities are interrelated and would have to fit into the same entity group to be transacted together, making my model inefficient.
Q3: Please advise on how to rethink this design (and the best practice) so I can still achieve what I need. Ask me for more details if needed.
Would be great if you could point me to relevant examples.
p.s. One solution I could think of is having each entity in a separate entity group, plus a persistent field in each entity (say Exam) named IS_DELETED, defaulting to FALSE (value 0). Once a user deletes an exam, I set the field to 1 (TRUE) so that I no longer load it. I would then write a cron job which clears all related entities in separate transactions in the background, retrying upon failure if needed. But I'm sure this is not elegant, and I'm not sure whether it will work out.
Thanks all for your responses,
Hari
One of the simplest ways to improve things is to just have fewer entities in the first place. I can't really think of a terribly good reason why pages, questions and answers need to be separate entities. I suspect you normally display all of the questions on a single page in the same request, without exception. If that's really the case, just keep them in one entity.
It does make a lot of sense to use the Exam entities as the parent for pages; for one thing, each exam is probably limited to a reasonable, small number of pages, so scaling this up probably won't hurt much.
On the other hand, there probably are a great many exams per subject, and for that reason, subjects should not appear in the ancestry of exams (and by extension, pages).
If, for some reason you needed to delete all of the exams in the subject of math, even if they were in the same entity group, you'd probably be unable to complete the whole delete in one transaction without timing out. You might even have trouble completing the delete in a single request.
That suggests that you should be using the Task Queue for this operation. When a cascading change on a subject occurs, the request handler needs to insert a new task and then just return successfully. Don't forget to update the subject entity itself right there in the request handler.
The task queue pulls a block of affected entities from the datastore, updates them, and then checks the time. If there is still more time available for continued updates, it pulls another block of entities, and so on, until none remain. If time is almost up, the task just adds itself back to the queue so it can restart where it left off when it respawns.
It's a good idea to schedule the first task at least a few seconds into the future of the initial request, so that if, for instance, the subject was deleted, the delete can propagate to future requests and no new exams in that subject can be created by the time the task starts.
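A rough sketch of that pattern with the deferred library (Exam and its subject property are assumptions; this version re-enqueues after every block instead of watching the clock):

    from google.appengine.ext import db, deferred

    BATCH = 100

    def delete_exams_for_subject(subject_key):
        # pull one block of affected entities, delete it, re-enqueue
        keys = (db.Query(Exam, keys_only=True)
                  .filter('subject =', subject_key)
                  .fetch(BATCH))
        if keys:
            db.delete(keys)
            deferred.defer(delete_exams_for_subject, subject_key)

    # in the request handler: update the subject right away, then
    # schedule the first cleanup task a few seconds out
    deferred.defer(delete_exams_for_subject, subject_key, _countdown=5)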
I want to do several operations on a user's data in a single transaction, but won't need to update multiple users' data in a single transaction. I see from http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html#Entity_Groups_Ancestors_and_Paths that "A good rule of thumb for entity groups is that [entity groups] should be about the size of a single user's worth of data or smaller," so I think the correct choice is to use a single parent key when building the keys for the other entities related to a user.
Does this seem like a good idea?
Is it easy to code? Something like KeyBuilder.setParent(theKeyOfMyUserEntity)?
1) It is hard to comment without some additional details about the data. There are several things you should be aware of with entity groups; the biggest is that the group will be stored together. That means if you are trying to do many (separate) updates you could face contention, limiting your app's performance.
2) Yes, it is easy to code. The syntax is pretty close to what you posted.
There are other options for transactions. Check out Nick Johnson's article on distributed transactions. If you want transactions for aggregates, you should also check out Brett Slatkin's IO talk on high-throughput data pipelines.
Yes, it seems reasonable to store some user data as child entities of a User entity.
Why do you need to create keys manually? The db.Model() constructor already has a convenient "parent" argument which automatically puts both the parent entity and the child entity in the same entity group.
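For example (a minimal sketch; User and Setting are made-up kinds):

    from google.appengine.ext import db

    class User(db.Model):
        name = db.StringProperty()

    class Setting(db.Model):
        value = db.StringProperty()

    user = User(name='alice')
    user.put()

    # parent= places the child in the user's entity group, so the
    # two entities can later be updated in one transaction
    setting = Setting(parent=user, value='dark')
    setting.put()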
I know all details about how entity groups work in GAE's storage, but yesterday (at the App Engine meetup in Palo Alto), as a presenter was explaining his use of entity groups, it struck me that I've never really made use of them in my own GAE apps, and I don't recall seeing them used in open-source GAE apps I've used.
So, I suspect I've just been overlooking (not noticing or remembering) such examples because I'm simply not used to them enough to immediately connect "use of entity group" to "kind of application problems being solved" -- and I think I should remedy that by studying such sources with this goal in mind, focusing on what problem the EG use is solving (i.e., why the app works with it, but wouldn't work or wouldn't work well without it).
Can anybody suggest good URLs to such code? (Essays would also be welcome, if they focus on application-level problem solving, but not if, like most I've seen, they just focus on the details of how EGs work!-).
The main use of entity groups is to provide the means to update more than one entity in a transaction.
If you haven't had to use them, count your blessings. Either you have been designing your data models such that no two entities ever need to be updated at the same time in order to remain consistent, or else you do need them but you've gotten lucky :)
Imagine that I have an Invoice entity type, and a LineItem entity type. One Invoice can have multiple LineItems associated with it. My Invoice entity has a field called LastUpdated. Any time a LineItem gets added to my Invoice, I want to store the current date in the LastUpdated field.
My update function might look like this (pseudocode):

    invoice.lastUpdated = now()
    line_item = LineItem()   # the new line item
    invoice.put()            # suppose this put() succeeds...
    line_item.put()          # ...and this one fails
What happens if the invoice put() succeeds and the lineitem put() fails? My invoice date will show that something was updated, but the actual update (the new LineItem) wouldn't be there. The solution is to put both puts() inside a transaction.
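Roughly like this, as a sketch with the db library (making the LineItem a child of the Invoice so both are in one entity group):

    from datetime import datetime
    from google.appengine.ext import db

    class Invoice(db.Model):
        lastUpdated = db.DateTimeProperty()

    class LineItem(db.Model):
        description = db.StringProperty()

    def add_line_item(invoice_key, description):
        invoice = db.get(invoice_key)
        invoice.lastUpdated = datetime.now()
        # parent= keeps the line item in the invoice's entity group,
        # which is what allows both puts in one transaction
        item = LineItem(parent=invoice, description=description)
        db.put([invoice, item])

    # both writes commit, or neither does
    db.run_in_transaction(add_line_item, invoice_key, '3 x widget')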
An alternative solution would be to use a query to find the date of the last inserted LineItem, instead of storing this data in the lastUpdated field. But that would involve fetching both the Invoice and all the LineItems every time you wanted to know the last time a lineitem was added, costing you precious datastore quota.
EDIT TO RESPOND TO POSTER'S COMMENTS
Ah. I think I understand your confusion. The above paragraphs establish why transactions are important. But you say you still don't care about entity groups, because you don't see how they relate to transactions. But if you are using db.run_in_transaction, then you are using entity groups, perhaps without realizing it! Every transaction involves one and only one entity group, and any given transaction can only affect entities belonging to the same group. See here:
"All datastore operations in a
transaction must operate on entities
in the same entity group".
What kind of stuff are you doing in your transactions? There are plenty of good reasons to use transactions with just one entity, which by default is in its own entity group. But sometimes you need to keep two or more entities in sync, like in my example above. If the Invoice and LineItem entities are not in the same entity group, then you cannot wrap the modifications to them in a db.run_in_transaction call. So any time you want to operate on two or more entities transactionally, you need to first make sure they are in the same group. Hope that makes it clearer why they are useful.
I've used them here. I'm setting my customer object as the parent of the map markers. This creates an entity group for each customer and gives me two advantages:
Getting the markers of a customer is much faster, because they're stored physically with the customer object. (On the same server, probably on the same disk.)
I can change the markers for a customer in a transaction. I suspect the reason transactions require all objects that they operate on to be in the same group is because they're stored in the same physical location, which makes it easier to implement a lock on the data.
I've used them here in this simple wiki system. The latest version of a page is always a root entity and past versions have the latest version as ancestor. The copy operation is done in a transaction to keep the version consistency and avoid losing a version in case of concurrency.
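Sketched in code (the model and function names are mine, not the wiki's actual source):

    from google.appengine.ext import db

    class Page(db.Model):
        content = db.TextProperty()
        version = db.IntegerProperty(default=1)

    def save_revision(page_key, new_content):
        latest = db.get(page_key)
        # archive the current latest as a child of the root entity
        old = Page(parent=latest, content=latest.content,
                   version=latest.version)
        latest.content = new_content
        latest.version += 1
        db.put([latest, old])

    # the copy and the update commit atomically, so no version is
    # lost if two edits race
    db.run_in_transaction(save_revision, page_key, 'new text')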