Concurrency with Objectify in GAE - google-app-engine

I created a test web application to test persist-read-delete of entities. It runs a simple loop that persists an entity, retrieves and modifies it, then deletes it, 100 times.
On some iterations of the loop there is no problem; on other iterations, however, I get an error that the entity already exists and thus can't be persisted (from some custom exception handling I added).
On other iterations the entity can't be modified because it does not exist, and on still others it can't be deleted because it does not exist.
I understand that the loop may be running faster than the operations against the App Engine datastore can complete. That would cause errors like the entity not existing when I try to access it, or the delete not yet being finished so that an entity with the same ID can't be created yet, and so forth.
However, I want to understand how to handle these kinds of situations, where concurrent operations are being performed on an entity.

From what I understand, you are doing something like the following:
for i in range(0, 100):
    ent = My_Entity()  # create and save the entity
    db.put(ent)
    ent = db.get(ent.key())  # get, modify and save the entity
    ent.property = 'foo'
    db.put(ent)
    ent = db.get(ent.key())  # get and delete the entity
    db.delete(ent)
with some error checking to make sure you have entities to delete and modify, and you are running into a bunch of errors about not being able to find the entity to delete or modify. As you say, this is because the calls aren't guaranteed to be executed in order.
"However, I want to understand how to handle these kinds of situations, where concurrent operations are being performed on an entity."
Your best bet for this is to batch any modifications you are making when persisting an entity. For example, if you are going to create/save/modify/save, or modify/save/delete, combine those steps wherever possible (i.e. create/modify/save, or modify/delete). Not only will this avoid the errors you're seeing, it will also cut down on your RPCs. Following this strategy, the loop above reduces to...
prop = None
for i in range(0, 100):
    prop = 'foo'  # every datastore round-trip cancels out; only local state remains
Put another way: for anything that needs to be set and deleted that quickly, just use a local variable; that's GAE's answer for you. Once you've worked out all the quick changes, you can then persist that information in an entity.
Other than that there isn't much you can do. Transactions can help if you need to make sure a bunch of entities are updated together, but they won't help if you're trying to do multiple things to one entity at once.
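For reference, a minimal sketch of grouping related updates into one transaction with the db API; the model and function names here are invented for illustration:

from google.appengine.ext import db

class Account(db.Model):
    balance = db.IntegerProperty(default=0)

class AuditEntry(db.Model):
    note = db.StringProperty()

def deposit(account_key, amount):
    # Both writes commit together or not at all. The AuditEntry is
    # created as a child of the Account so that both entities share
    # one entity group, as a (non-XG) transaction requires.
    account = db.get(account_key)
    account.balance += amount
    db.put([account, AuditEntry(parent=account_key, note='deposit')])

account_key = Account().put()
db.run_in_transaction(deposit, account_key, 10)  # retried on contention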
EDIT: You could also look at the pipelines API.

Related

Is it OK to ignore DbUpdateConcurrencyException?

I have a table that effectively implements a cache - we put stuff in, we take stuff out. It has a last_accessed field that indicates when it was inserted or read, and it has a max_age field indicating how long to keep the record.
We want to purge old records rather than having them accumulate. It seemed to me that the easiest way to handle that was just to call a purge routine every time we accessed a record.
This is simple enough to do, in EF:
var oldCacheItems =
    this.xxDbContext.XXtokencaches.Where(
        t => EntityFunctions.AddMinutes(t.lastAccessed, t.expirationMinutes) < DateTime.Now);
this.xxDbContext.XXtokencaches.RemoveRange(oldCacheItems);
this.xxDbContext.SaveChanges();
This works fine in isolation. But when I have multiple clients running, I occasionally get DbUpdateConcurrencyExceptions:
System.Data.Entity.Infrastructure.DbUpdateConcurrencyException was unhandled by user code
HResult=-2146233087
Message=Store update, insert, or delete statement affected an unexpected number of rows (0). Entities may have been modified or deleted since entities were loaded. Refresh ObjectStateManager entries.
Source=EntityFramework
Browsing around, I see a lot of discussion about concurrency issues in EF. But in this case, I'm not sure any of it matters. In essence, this particular error is EF complaining that it can't delete the records because someone else already deleted them.
So I'm considering simply catching DbUpdateConcurrencyException and ignoring it.
Are there any possible consequences to this that I might not have considered?
Or some alternative approach that would eliminate the problem?
I think it depends on your business logic. In a case like this you can assume that the database's version wins over yours: the rows you tried to delete are already gone, so catching and ignoring the exception is reasonable.

most efficient way to get, modify and put a batch of entities with ndb

In my app I have a few batch operations I perform.
Unfortunately it sometimes takes forever to update 400-500 entities.
What I have is all the entity keys; I need to get the entities, update a property, and save them to the datastore, and saving them can take up to 40-50 seconds, which is not what I'm looking for.
I'll simplify my model to explain what I do (which is pretty simple anyway):
class Entity(ndb.Model):
    title = ndb.StringProperty()

keys = [key1, key2, key3, key4, ..., key500]
entities = ndb.get_multi(keys)
for e in entities:
    e.title = 'the new title'
ndb.put_multi(entities)
Getting and modifying does not take too long. I tried get_async, getting in a tasklet, and whatever else is possible, which only changes whether the get or the for loop takes longer.
What really bothers me is that the put can take up to 50 seconds...
What is the most efficient way to do this operation (or operations) in a decent amount of time? Of course I know that it depends on many factors, like the complexity of the entity, but the time the put takes is really beyond the acceptable limit for me.
I've already tried async operations, tasklets...
I wonder if doing smaller batches of e.g. 50 or 100 entities would be faster. If you make that into a tasklet, you can try running those tasklets concurrently.
I also recommend looking at this with Appstats to see if it shows something surprising.
Finally, assuming this uses the HRD, you may find that there is a limit on the number of entity groups per batch. This limit defaults very low; try raising it.
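A minimal sketch of that chunking idea with ndb (the chunk size of 100 is a guess to tune, not a recommendation):

from google.appengine.ext import ndb

CHUNK = 100  # tune this; 50-100 is a reasonable starting point

@ndb.tasklet
def put_chunk(chunk):
    # put_multi_async returns one future per entity; yielding the
    # list waits for all of them inside this tasklet.
    yield ndb.put_multi_async(chunk)

def put_in_chunks(entities):
    chunks = [entities[i:i + CHUNK] for i in range(0, len(entities), CHUNK)]
    # Start every chunk concurrently, then block until all have finished.
    ndb.Future.wait_all([put_chunk(c) for c in chunks])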
Sounds like what MapReduce was designed for. You can do this quickly by getting and modifying all the entities in parallel, scaled across multiple server instances. Your cost goes up by using more instances, though.
I'm going to assume that you have the entity design that you want (i.e. I'm not going to ask you what you're trying to do and how maybe you should have one big entity instead of a bunch of small ones that you have to update all the time). Because that wouldn't be very nice. ( =
What if you used the Task Queue? You could create multiple tasks, and each task could take as URL params the keys it is responsible for updating, plus the property and value that should be set. That way the work is broken up into manageable chunks, and the user's request can return immediately while the work happens in the background. Would that work? A rough sketch of the fan-out is below.
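A rough sketch of that fan-out, assuming ndb keys; the worker handler URL and parameter names are invented for illustration:

from google.appengine.api import taskqueue

def enqueue_title_updates(keys, new_title, chunk_size=100):
    # One task per chunk of keys; each task is retried independently
    # if its worker fails.
    for i in range(0, len(keys), chunk_size):
        chunk = keys[i:i + chunk_size]
        taskqueue.add(
            url='/tasks/update-title',  # hypothetical worker handler
            params={'key': [k.urlsafe() for k in chunk],
                    'title': new_title})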

Enforcing Unique Constraint in GAE

I am trying out Google App Engine Java; however, the absence of a unique constraint is making things difficult.
I have been through this post, and this blog suggests a method to implement something similar. My background is in MySQL. Moving to the datastore without a unique constraint makes me jittery, because I never had to worry about duplicate values before, and checking each value before inserting a new one still leaves room for error.
"No, you still cannot specify unique
during schema creation."
-- David Underhill talks about GAE and the unique constraint (post link)
What are you guys using to implement something similar to a unique or primary key?
I heard about an abstract datastore layer, built on the low-level API, that worked like a regular RDB; it was not free, however (and I don't remember the name of the software).
Schematic view of my problem:
1. sNo = biggest serial_number in the db
2. sNo++
3. Insert new entry with sNo as the serial_number value // checkpoint
4. User adds data pertaining to the current serial_number
5. Update entry with data where serial_number is sNo
However, at line number 3 (the checkpoint), I fear that two users might add the same sNo, and that is what is preventing me from working with App Engine.
This and other similar questions come up often when talking about transitioning from a traditional RDB to a BigTable-like datastore like App Engine's.
It's often useful to discuss why the datastore doesn't support unique keys, since it informs the mindset you should be in when thinking about your data storage schemes. The reason unique constraints are not available is that they greatly limit scalability. As you've said, enforcing the constraint means checking all other entities for that property. Whether you do it manually in your code or the datastore does it automatically behind the scenes, it still needs to happen, and that means lower performance. Some optimizations can be made, but one way or another the check still has to happen.
The answer to your question is: really think about why you need that unique constraint.
Secondly, remember that keys do exist in the datastore, and are a great way of enforcing a simple unique constraint.
my_user = MyUser(key_name=users.get_current_user().email())
my_user.put()
This will guarantee that no MyUser will ever be created with that email ever again, and you can also quickly retrieve the MyUser with that email:
my_user = MyUser.get_by_key_name(users.get_current_user().email())
In the Python runtime you can also do:
my_user = MyUser.get_or_insert(key_name=users.get_current_user().email())
This will insert or retrieve the user with that email.
Anything more complex than that will not be scalable, though. So really think about whether you need that property to be globally unique, or whether there are ways to remove the need for that unique constraint. Often you'll find, with some small workarounds, that you didn't need the property to be unique after all.
You can generate unique serial numbers for your products without needing to enforce unique IDs or querying the entire set of entities to find out what the largest serial number currently is. You can use transactions and a singleton entity to generate the 'next' serial number. Because the operation occurs inside a transaction, you can be sure that no two products will ever get the same serial number.
This approach will, however, be a potential performance chokepoint and limit your application's scalability. If it is the case that the creation of new serial numbers does not happen so often that you get contention, it may work for you.
EDIT:
To clarify, the singleton that holds the current -- or next -- serial number to be assigned is completely independent of any entities that actually have serial numbers assigned to them. They do not all need to be part of one entity group. You could have entities from multiple models using the same mechanism to get a new, unique serial number.
I don't remember Java well enough to provide sample code, and my Python example might be meaningless to you, but here's pseudo-code to illustrate the idea:
1. Receive a request to create a new inventory item.
2. Enter a transaction.
3. Retrieve the current value from the single entity of the SerialNumber model.
4. Increment the value and write it to the datastore.
5. Return the value as you exit the transaction.
Now, the code that does the actual work of creating the inventory item and storing it along with its new serial number DOES NOT need to run in a transaction.
Caveat: as stated above, this could be a major performance bottleneck, since only one serial number can be created at any one time. However, it does give you certainty that the serial number you just generated is unique and not in use.
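Since the answer offers only pseudo-code, here is a minimal sketch of those steps in Python with the db API. The model name SerialNumber comes from the answer; the key name and function are invented:

from google.appengine.ext import db

class SerialNumber(db.Model):
    next_value = db.IntegerProperty(default=1)

def allocate_serial_number():
    def txn():
        # One well-known entity acts as the counter singleton.
        counter = SerialNumber.get_by_key_name('singleton')
        if counter is None:
            counter = SerialNumber(key_name='singleton')
        value = counter.next_value
        counter.next_value += 1
        counter.put()
        return value
    # The transaction guarantees no two callers receive the same number.
    return db.run_in_transaction(txn)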
I encountered this same issue in an application where users needed to reserve a timeslot. I needed to "insert" exactly one unique timeslot entity while expecting users to simultaneously request the same timeslot.
I have isolated an example of how to do this on App Engine, and I blogged about it. The blog post has canonical code examples using the Datastore, and also Objectify. (BTW, I would advise avoiding JDO.)
I have also deployed a live demonstration where you can advance two users toward reserving the same resource. In this demo you can experience the exact behavior of app engine datastore click by click.
If you are looking for the behavior of a unique constraint, these should prove useful.
-broc
My first thought for an alternative to the transaction technique in broc's blog was to make a singleton class containing a synchronized method (say addUserName(String name)) responsible for adding a new entry only if it is unique, or throwing an exception otherwise. Then make a ServletContextListener which instantiates a single instance of this singleton and adds it as an attribute to the ServletContext. Servlets can then call the addUserName() method on the singleton instance, which they obtain through getServletContext().
However, this is NOT a good idea, because GAE is likely to split the app across multiple JVMs, so multiple instances of the singleton could still occur, one in each JVM. See this thread.
A more GAE-like alternative would be to write a GAE module responsible for checking uniqueness and adding new entries, then use manual or basic scaling with...
<max-instances>1</max-instances>
Then you have a single instance running on GAE which acts as a single point of authority, adding users one at a time to the datastore. If you are concerned about this instance being a bottleneck, you could improve the module by adding queuing or an internal master/slave architecture.
This module-based solution would allow many unique usernames to be added to the datastore in a short space of time without risking entity group contention issues.

Unit of Work - What is the best approach to temporary object storage on a web farm?

I need to design and implement something similar to what Martin Fowler calls the "Unit of Work" pattern. I have heard others refer to it as a "Shopping Cart" pattern, but I'm not convinced the needs are the same.
The specific problem is that users (and our UI team) want to be able to create and assign child objects (with referential integrity constraints in the database) before the parent object is created. I met with another of our designers today and we came up with two alternative approaches.
a) First, create a dummy parent object in the database, and then create dummy children and dummy assignments. We could use negative keys (our normal keys are all positive) to distinguish between the sheep and the goats in the database. Then, when the user submits the entire transaction, we have to update the data and get the real keys added and aligned.
I see several drawbacks to this one.
It causes perturbations to the indexes.
We still need to come up with something to satisfy unique constraints on columns that have them.
We have to modify a lot of existing SQL and code that generates SQL to add yet another predicate to a lot of WHERE clauses.
Altering the primary keys in Oracle can be done, but it's a challenge.
b) Create transient tables for objects and assignments that need to be able to participate in these reverse transactions. When the user hits Submit, we generate the real entries and purge the old ones.
I think this is cleaner than the first alternative, but still involves increased levels of database activity.
Both methods require that I have some way to expire transient data if the session is lost before the user executes submit or cancel requests.
Has anyone solved this problem in a different way?
Thanks in advance for your help.
I don't understand why these objects need to be created in the database before the transaction is committed, so you might want to clarify with your UI team before proceeding with a solution. You may find that all they want to do is read information previously saved by the user on another page.
So, assuming that the objects don't need to be stored in the database before the commit, I give you plan C:
Store initialized business objects in the session. You can then create all the children you want, and only touch the database (and set up references) when the transaction needs to be committed. If the session data is going to be large (either individually or collectively), store the session information in the database (you may already be doing this).

Creating a Notifications type feed in GAE Objectify

I'm working on a notification feed for my mobile app and am looking for some help on an issue.
The app is a Twitter/Facebook-like app where users can post statuses and other users can like, comment on, or subscribe to them.
One thing I want to have in my app is a notifications feed where users can see who liked/commented on their posts or subscribed to them.
The first part of this system I have figured out: when a user likes/comments/subscribes, a Notification entity is written to the datastore with details about the event. To show a user's notifications, all I have to do is query for all Notifications for that user, sorted by date created descending, and we have a nice little feed of actions other users took on a specific user's account.
The issue I have is what to do when someone unlikes a post, unsubscribes, or deletes a comment. Currently, if I were to query for that specific Notification, it is possible that nothing would be returned from the datastore because of eventual consistency. Imagine someone liking and then immediately unliking a post (b/c who hasn't done that? =P). The query to find that Notification might return null, and nothing would get deleted when calling ofy().delete().entity(notification).now(); and now the user has a notification in their feed saying Sally liked his post when in reality she liked it and then quickly unliked it!
A wrench in this whole system is that I cannot delete by Key<Notification>, because I don't really have a way to know the ID of the Notification when trying to delete it.
A potential solution I am experimenting with is to not delete any Notifications. Instead I would always write Notifications and simply indicate whether the notification was positive or negative. Then, in my query to display notifications to a specific user, I could somehow display only the net-positive Notifications. This would also save some money on the datastore, because deleting entities is expensive.
There are three main ways I've solved this problem before:
1. Deterministic key
For example:
{user-id}-{post-id}-{liked-by} for likes
{user-id}-{post-id}-{comment-by}-{comment-index} for comments
This will work for most basic use cases for the problem you defined, but you'll have some hairy edge cases to figure out (like managing the indexes of comments as they get edited and deleted). This approach allows get and delete by key; a minimal sketch follows.
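A sketch of the deterministic-key idea, shown with ndb to match the other Python examples in this thread (the Objectify version has the same shape; model and helper names are invented):

from google.appengine.ext import ndb

class Like(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)

def like_key(user_id, post_id, liked_by):
    # The key encodes who liked what, so it can be rebuilt later
    # without running a query.
    return ndb.Key(Like, '%s-%s-%s' % (user_id, post_id, liked_by))

# Both create and delete are by key, so both are strongly consistent.
Like(key=like_key('u1', 'p1', 'u2')).put()
like_key('u1', 'p1', 'u2').delete()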
2. Parallel data structures
The idea here is to create more than one entity at a time in a transaction, but to make sure they have related keys. For example, when someone comments on a feed item, create a Comment entity, then create a CommentedOn entity which has the same ID but has the commenting user as its parent key.
Then you can make a strongly consistent query for the CommentedOn, and use the same ID to do a get by key on the Comment. You can also just store a key rather than having matching IDs, if that's too hard; in practice, though, having matching IDs was easier each time I did this.
The main limitation of this approach is that you're effectively creating an index yourself out of entities, and while this can give you strongly consistent queries where you need them, the throughput limitations of transactional writes can become harder to understand. You also need to manage state changes (like deletes) carefully.
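A sketch of that pairing in ndb; the model names Comment and CommentedOn come from the answer, everything else is an illustrative assumption:

from google.appengine.ext import ndb

class Comment(ndb.Model):
    text = ndb.StringProperty()

class CommentedOn(ndb.Model):
    # Parented under the commenting user, so it can be found with a
    # strongly consistent ancestor query.
    comment = ndb.KeyProperty(kind='Comment')

@ndb.transactional(xg=True)  # two entity groups, so cross-group
def add_comment(commenter_key, text):
    comment_key = Comment(text=text).put()
    CommentedOn(parent=commenter_key, id=comment_key.id(),
                comment=comment_key).put()
    return comment_key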
3. State flags on entities
Assuming the Notification object just shows the user that something happened, but links to another entity for the actual data, you could store a state flag (deleted, hidden, private, etc.) on that entity. Listing your notifications would then be a matter of loading the entities server-side and filtering in code (or possibly with subsequent filtered queries), as sketched below.
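A small sketch of the state-flag filtering, again in ndb. It assumes notifications are stored with the user as their ancestor; the field and state names are invented:

from google.appengine.ext import ndb

class Notification(ndb.Model):
    text = ndb.StringProperty()
    state = ndb.StringProperty(default='active',
                               choices=['active', 'deleted', 'hidden'])

def list_notifications(user_key):
    # The ancestor query is strongly consistent; dead entries are
    # filtered out in code rather than deleted.
    notes = Notification.query(ancestor=user_key).fetch(50)
    return [n for n in notes if n.state == 'active']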
At the end of the day, the complexity of the solution should mirror the complexity of the problem. I would start with approach 3, then migrate to approach 2 once the fuller set of requirements is understood. Approach 2 is more robust and flexible, but the complexity of XG transaction limitations will rear its head; ultimately, a distributed feed like this is a hard problem.
What I ended up doing, and what worked for my specific model, was to allocate an ID for the Notification entity before creating it:
// Allocate an ID for a Notification
final Key<Notification> notificationKey = factory().allocateId(Notification.class);
final Long notificationId = notificationKey.getId();
Then, when creating my Like or Follow entity, I would set the property Like.notificationId = notificationId; or Follow.notificationId = notificationId;
Then I would save both entities.
Later, when I want to delete the Like or Follow, I can do so and at the same time read the ID of the Notification from it, load the Notification by key (which is strongly consistent), and delete it too. A sketch of the same flow is below.
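For anyone working in Python, the same flow sketched with the db API (model names are invented; db.allocate_ids reserves IDs without writing anything):

from google.appengine.ext import db

class Notification(db.Model):
    text = db.StringProperty()

class Like(db.Model):
    notification_id = db.IntegerProperty()

# Reserve an ID for the Notification up front.
start, _end = db.allocate_ids(db.Key.from_path('Notification', 1), 1)

notification_key = db.Key.from_path('Notification', start)
like = Like(notification_id=start)
notification = Notification(key=notification_key, text='Sally liked your post')
db.put([like, notification])

# Deleting the Like tells us exactly which Notification to remove;
# both deletes are by key, so no eventually consistent query is needed.
db.delete([like, db.Key.from_path('Notification', like.notification_id)])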
Just another approach that may help someone =D
