Evaluate a condition BEFORE put() in NDB and GAE - google-app-engine

I have a model in my GAE-based application which is updated quite frequently, and in some cases the same entity is updated nearly at the same time. The entity update functionality in the app works like this:
1. The user enters an ID and the other properties of the entity that are to be updated.
2. The entity is retrieved from the DB by this ID (to make sure the ID is valid).
3. A bunch of validations are run on the properties that are going to be updated (e.g. if the group_id property is being updated, make sure a group exists in the DB for that ID and that the ID is an integer).
4. After the validations, put() is called on the entity that was retrieved at step #2.
As mentioned, the same entity can be updated multiple times nearly at the same time, so I ran into the classic race condition. Say two update calls are made almost back to back: in the first call the entity is retrieved and the validations are still in progress when the second call fires, retrieves the same entity, updates its properties, executes put(), and writes the entity to the DB. The first call, which had not finished yet (due to some delay), then completes, calls put(), and overwrites the entity in the DB.
The end result in the DB is that of the first update call, but the expected result was that of the second call!
I researched this for GAE and found pre-put hooks. I think I could use an "updated" timestamp to solve the issue, i.e. make sure the second call only updates the entity after the first call has. But I would like a better approach; some databases (such as those on AWS) attach a tag to each row and the DB itself can be asked to validate that tag before actually writing the record. Is there any way in GAE to do the same, i.e. ask the datastore to do a conditional put() instead of wiring this up manually with pre_put hooks?
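For reference, one common way to get the effect of a conditional put in NDB is to wrap the read-validate-write cycle in a transaction: the datastore uses optimistic concurrency for transactions, so a conflicting commit to the same entity causes the function to retry rather than silently overwrite. A minimal sketch, assuming a hypothetical MyEntity model and update handler (names are made up for illustration):

from google.appengine.ext import ndb

class MyEntity(ndb.Model):                # hypothetical model for illustration
    group_id = ndb.IntegerProperty()
    updated = ndb.DateTimeProperty(auto_now=True)

@ndb.transactional(retries=3)
def update_entity(entity_id, new_group_id):
    # The get and the put happen inside one transaction, so a concurrent
    # commit against the same entity makes this function retry instead of
    # clobbering the other write.
    entity = MyEntity.get_by_id(entity_id)
    if entity is None:
        raise ValueError('No entity with id %r' % entity_id)
    # ... run the property validations here ...
    entity.group_id = new_group_id
    entity.put()
    return entity

This is not a datastore-level "tag" like the AWS-style conditional writes mentioned above, but it gives similar protection against lost updates without manual pre_put hooks.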

Related

How to verify that every call of a load test generated a successful result at the end of a process chain?

I have an application that goes like this:
ingestion --> queue --> validation --> persistence --> database
I want to load test the ingestion and at the end verify that every submitted entry is stored in the database.
I have an Artillery script that posts to ingestion and retrieves the same item from the database, but it does so as part of the same scenario, and since the two components are implemented separately I'm actually measuring combined performance instead of that of each component.
I would like to load test the ingestion component while keeping hold of some search key that allows me to recover all sent items from the database. I've tried this by writing a JavaScript function that I call at the beginning of the ingestion scenario to generate a random search key and store it in Artillery's context, and then at the end of the scenario calling another function to recover all entries from the database.
The problem I found is that Artillery runs one copy of the scenario in each virtual client, so it calls the function each time it starts the scenario and recovers only one entry at the end. And the call to the database happens in the same scenario as the post to ingestion, so I'm again mixing the two measurements.
What I would like to do, I suppose, would be to generate the search key in a scenario, run the posts in another scenario, and then retrieve the results in a third one. How can I do that?
Also, when I retrieve the results from the database, I would like to compare the quantity with the number of posts to ingestion. I couldn't find if expect works with variables returned in the context from function calls. Is this possible?
I don't believe this is possible. I have been reading the documentation and any examples I can find about Artillery scripts, and I don't see that there is any way to "chain" flows together.

Datastore sometimes fails to fetch all required entities, but works the second time

I have a datastore entity called lineItems, which consists of individual line items to be invoiced. The users find the line items and attach a purchase order number to them. These are then displayed on the web page where they can create the invoice.
I would show my code for fetching the entities, but I don't think it matters, as this also happened a couple of times when I was using managed VMs a few months ago and the code was completely different (I was using Objectify before; now I am using the Datastore API). In a nutshell, I am currently just using something like StructuredQuery.setFilter(new PropertyFilter.eq("POnum",ponum)).setFilter(new PropertyFilter.eq("Invoiced", false)); (this is pseudocode: you can't chain two setFilter() calls like this; the real code accepts a list of PropertyFilters and builds a composite filter properly).
What happened this morning was the admin person created the invoice, and all but two of the lines were on the invoice. There were two lines which the code never fetched, and those lines were stuck in the "invoices to create" section.
The admin person simply created the invoice again for the given purchase order number, but the second time it DID pick up the two remaining lines and created a second invoice.
Note that the entities were created/edited almost 24 hours earlier (when she assigned the purchase order number to them), so they had been sitting in the database for quite a while (I checked my logs). This is not a case where they were just created and then accessed within a short period of time. It is also NOT a case of failing to update the entities: the code creates the invoice in a third-party accounting package, and the lines simply were not there. On success of the invoice creation, all of the entities are then updated with invoiced = true and written to the datastore, so the lines which were not on the invoice in the accounting program are the ones that weren't updated in the datastore. (This is not a "smart" check either; it does not check line by line. It simply checks whether the invoice creation was successful and then updates all of the entities it has in memory.)
As far as I can tell, the datastore simply did not return all of the entities which matched the query the first time but it did the second time.
There are approximately 40,000 lineItem entities.
What are the conditions which can cause a datastore fetch to randomly fail to grab all of the entities which meet the search parameters of a StructuredQuery? (Note that this also happened twice while using Objectify on the now deprecated Managed VM architecture.) How can I stop this from happening, or check to see if it has happened?
You may be seeing eventual consistency because you are not using an ancestor query.
See: https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/
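For what it's worth, a rough sketch of what an ancestor query looks like in Python NDB (the kind, property, and key names here are made up; the asker's code is Java, but the idea is the same). Queries that specify an ancestor are strongly consistent, at the cost of keeping the entities in one entity group:

from google.appengine.ext import ndb

class LineItem(ndb.Model):                    # hypothetical kind for illustration
    po_num = ndb.StringProperty()
    invoiced = ndb.BooleanProperty(default=False)

# Entities written with this parent key live in one entity group, so an
# ancestor query over them reflects all previously committed writes.
parent_key = ndb.Key('Customer', 'acme')

items = LineItem.query(
    LineItem.po_num == 'PO-12345',
    LineItem.invoiced == False,
    ancestor=parent_key).fetch()

The trade-off is the write-rate limit on a single entity group (roughly one sustained write per second), which may or may not be acceptable for ~40,000 line items.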

How do I do a batch operation with RIA?

Most of the operations in my Silverlight client add/update/insert/delete multiple entities in one go. E.g.:
CreateStandardCustomer adds a Customer, Address, Person and Contract record.
CreateEnterpriseCustomer adds a Customer, Address, 2x Person and a CreditLimit record.
It looks like with a DomainService you can only do one thing at a time, e.g. add a customer record, add an address etc. How can I do a batch operation?
You might say to simply add the relevant records from the Silverlight client and call the SubmitChanges() method. However, this is difficult to validate against (server side), because only certain groups of records can be added/updated/deleted at a time. E.g. in the example above, an Address record added alone would not be valid in this system.
Another example would be something like Renew which updates a Customer record and adds a Renewal. These operations aren't valid individually.
Thanks for your help,
Kurren
EDIT: The server-side validation needs to check that the correct operations in the batch have taken place. E.g. from the example above, if we Renew then a Renewal should be created and a Customer should have been updated (one without the other is invalid).
I may be missing something here, but you update a batch of entities the same way you do individual entities: namely perform all the operations on your context and then call SubmitChanges on that context. At the server your insert/delete/update methods for the types will be called as appropriate for all the changes you're submitting.
We use RIA/EF in Silverlight to do exactly that. It doesn't matter if you just create a single entity in your client context (complete with graph) or 100, because as soon as you submit those changes the complete changeset for that context is operated upon.
EDIT: If setting up your entity metadata with Required and Composition attributes on the appropriate properties isn't enough, you can also use the DomainService.ChangeSet object to inspect what has been submitted and decide which changes you want to accept or not.

Concurrency with Objectify in GAE

I created a test web application to test persist-read-delete of entities: a simple loop that persists an entity, retrieves and modifies it, then deletes it, 100 times.
On some iterations of the loop there's no problem; however, on other iterations there is an error that the entity already exists and thus can't be persisted (custom exception handling I added).
Also, on some iterations the entity can't be modified because it does not exist, and on others the entity can't be deleted because it does not exist.
I understand that the loop may run so fast that the previous operation against the App Engine datastore is not yet complete, causing errors like "entity does not exist" when trying to access it, or, because the delete operation is not yet finished, an entity with the same ID can't be created yet, and so forth.
However, I want to understand how to handle these kinds of situations where concurrent operations are being performed on an entity.
From what I understand you are doing something like the following:
for i in range(0, 100):
    ent = My_Entity()            # create and save an entity
    db.put(ent)
    ent = db.get(ent.key())      # get, modify and save the entity
    ent.property = 'foo'
    db.put(ent)
    ent = db.get(ent.key())      # get and delete the entity
    db.delete(ent)
with some error checking to make sure you have entities to delete and modify, and you are running into a bunch of errors about finding the entity to delete or modify. As you say, this is because the calls aren't guaranteed to be executed in order.
However, I want to understand how to handle these kinds of situations where concurrent operations are being performed on an entity.
Your best bet for this is to batch any modifications you are making when persisting an entity. For example, if you are going to be creating/saving/modifying/saving, or modifying/saving/deleting, wherever possible try to combine these steps (i.e. create/modify/save, or modify/delete). Not only will this avoid the errors you're seeing, but it will also cut down on your RPCs. Following this strategy, the above loop would be reduced to...
prop = None
for i in range(0, 100):
    prop = 'foo'
Put another way: for anything that requires setting/deleting that quickly, just use a local variable; that's GAE's answer for you. After you've worked out all the quick stuff, you can then persist that information in an entity.
Other than that there isn't much you can do. Transactions can help if you need to make sure a bunch of entities are updated together, but they won't help if you're trying to do multiple things to one entity at once.
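For the "bunch of entities updated together" case, a rough sketch with the old db API (the Account model and Ledger parent are made up for illustration); it won't fix the single-entity loop above, but it shows where transactions do fit:

from google.appengine.ext import db

class Account(db.Model):                 # hypothetical model for illustration
    balance = db.IntegerProperty(default=0)

def transfer(src_key, dst_key, amount):
    # Both gets and both puts commit or fail together inside the transaction.
    src = db.get(src_key)
    dst = db.get(dst_key)
    src.balance -= amount
    dst.balance += amount
    db.put([src, dst])

parent = db.Key.from_path('Ledger', 'main')   # shared parent => one entity group
a = Account(parent=parent)
b = Account(parent=parent, balance=500)
db.put([a, b])

db.run_in_transaction(transfer, b.key(), a.key(), 100)

(With entities in different groups you would need the cross-group transaction options instead.)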
EDIT: You could also look at the pipelines API.

What's the difference between findAndModify and update in MongoDB?

I'm a little bit confused by the findAndModify method in MongoDB. What's the advantage of it over the update method? For me, it seems that it just returns the item first and then updates it. But why do I need to return the item first? I read MongoDB: The Definitive Guide, and it says that it is handy for manipulating queues and performing other operations that need get-and-set style atomicity. But I didn't understand how it achieves this. Can somebody explain this to me?
If you fetch an item and then update it, there may be an update by another thread between those two steps. If you update an item first and then fetch it, there may be another update in-between and you will get back a different item than what you updated.
Doing it "atomically" means you are guaranteed that you are getting back the exact same item you are updating - i.e. no other operation can happen in between.
findAndModify returns the document, update does not.
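In pymongo terms (a rough sketch; the jobs collection and its fields are made up, and the Python driver spells these methods update_one and find_one_and_update), the difference looks like this:

from pymongo import MongoClient, ReturnDocument

jobs = MongoClient().test.jobs           # hypothetical collection for illustration

# update_one only reports what happened; it does not hand back the document.
result = jobs.update_one({'state': 'queued'}, {'$set': {'state': 'running'}})
print(result.modified_count)

# find_one_and_update atomically claims one matching document and returns it,
# so no other worker can grab the same queued job in between.
job = jobs.find_one_and_update(
    {'state': 'queued'},
    {'$set': {'state': 'running'}},
    return_document=ReturnDocument.AFTER)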
If I understood Dwight Merriman (one of the original authors of MongoDB) correctly, using update to modify a single document (i.e. {multi: false}) is also atomic. Currently, it should also be faster than doing the equivalent update using findAndModify.
From the MongoDB docs (emphasis added):
By default, both operations modify a single document. However, the update() method with its multi option can modify more than one document.
If multiple documents match the update criteria, for findAndModify(), you can specify a sort to provide some measure of control on which document to update.
With the default behavior of the update() method, you cannot specify which single document to update when multiple documents match.
By default, the findAndModify() method returns the pre-modified version of the document. To obtain the updated document, use the new option.
The update() method returns a WriteResult object that contains the status of the operation. To return the updated document, use the find() method. However, other updates may have modified the document between your update and the document retrieval. Also, if the update modified only a single document but multiple documents matched, you will need to use additional logic to identify the updated document.
Before MongoDB 3.2 you cannot specify a write concern to findAndModify() to override the default write concern whereas you can specify a write concern to the update() method since MongoDB 2.6.
When modifying a single document, both findAndModify() and the update() method atomically update the document.
One useful class of use cases is counters and similar cases. For example, take a look at this code (one of the MongoDB tests):
find_and_modify4.js.
Thus, with findAndModify you increment the counter and get its incremented value in one step. Compare: if you (A) perform this operation in two steps and somebody else (B) does the same operation between your steps, then A and B may get the same last counter value instead of two different values (just one example of the possible issues).
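A rough pymongo version of that counter pattern (the collection and field names are made up):

from pymongo import MongoClient, ReturnDocument

counters = MongoClient().test.counters   # hypothetical collection for illustration

# Increment and read back the new value in a single atomic step; two clients
# doing this concurrently are guaranteed to see two different values.
doc = counters.find_one_and_update(
    {'_id': 'invoice_number'},
    {'$inc': {'seq': 1}},
    upsert=True,
    return_document=ReturnDocument.AFTER)
print(doc['seq'])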
This is an old question but an important one and the other answers just led me to more questions until I realized: The two methods are quite similar and in many cases you could use either.
Both findAndModify and update perform atomic changes within a single request, such as incrementing a counter; in fact the <query> and <update> parameters are largely identical.
With both, the atomic change takes place directly on a document matching the query when the server finds it, i.e. an internal write lock is held on that document for the fraction of a millisecond in which the server confirms the query matches and applies the update.
There is no system-level write lock or semaphore which a user can acquire. Full stop. MongoDB deliberately doesn't make it easy to check out a document then change it then write it back while somehow preventing others from changing that document in the meantime. (While a developer might think they want that, it's often an anti-pattern in terms of scalability and concurrency ... as a simple example imagine a client acquires the write lock then is killed while holding it. If you really want a write lock, you can make one in the documents and use atomic changes to compare-and-set it, and then determine your own recovery process to deal with abandoned locks, etc. But go with caution if you go that way.)
From what I can tell there are two main ways the methods differ:
If you want a copy of the document as of when your update was made: only findAndModify allows this, returning either the original (default) or the new record after the update, as mentioned; with update you only get a WriteResult, not the document, and of course reading the document immediately before or after doesn't guard you against another process also changing the record between your read and your update.
If there are potentially multiple matching documents: findAndModify only changes one, and allows you to customize the sort to indicate which one should be changed; update can change all with multi (although it defaults to just one), but does not let you say which one.
Thus it makes sense what HungryCoder says: update is more efficient where you can live with its restrictions (e.g. you don't need to read the document, or you are changing multiple records). But for many atomic updates you do want the document, and findAndModify is necessary there.
We used findAndModify() for counter operations (inc or dec) and other single-field mutation cases. Migrating our application from Couchbase to MongoDB, I found this API could replace the code which did GetAndLock(), modified the content locally, called Replace() to save, and called Get() again to fetch the updated document. With MongoDB, I just used this single API, which returns the updated document.
