If two processes modify the same entity concurrently, but only modify different properties, can they potentially overwrite the changes made by the other process when calling DatastoreService.put?
Process A:
theSameEntity.setProperty ("foo", "abc");
DatastoreService.put (theSameEntity);
Process B:
theSameEntity.setProperty ("bar", 123);
DatastoreService.put (theSameEntity);
Yes, it's possible they'll overwrite each other's changes, since the entire entity is sent to the datastore (serialized using protocol buffers) with each write (not just a diff).
You'll need to use transactions if you want to avoid this.
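For completeness, here is a minimal sketch with the low-level Java Datastore API (the key is assumed to exist, and the method name is mine). Because the read and the write happen inside one transaction, a conflicting concurrent commit to the same entity group makes commit() throw a ConcurrentModificationException instead of silently losing the update:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Transaction;

void updateFoo(Key key) throws EntityNotFoundException {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    Transaction txn = datastore.beginTransaction();
    try {
        Entity entity = datastore.get(txn, key); // fresh read inside the txn
        entity.setProperty("foo", "abc");
        datastore.put(txn, entity);
        txn.commit(); // fails on a conflicting concurrent write
    } finally {
        if (txn.isActive()) {
            txn.rollback(); // clean up if commit never happened or failed
        }
    }
}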
Yes, I have observed this (though in my case the concurrent requests modified the same property).
I don't think transactions will help, because they don't lock the datastore; they only guarantee that the operations inside the transaction see a consistent view of the data. I would like to know if anyone has found a solution to this.
Suppose you have a method like this:
Entity entity = ofy().load().type(Entity.class)
        .ancestor(key)
        .filter("someField", someValue)
        .first().now();
if (entity == null) {
    // Entity does not exist yet.
    // Write new entity to Datastore.
}
The method checks if a certain entity exists via an ancestor query. If it does not, it writes it to the datastore.
One thing I have found is that if the above method is executed twice at the exact same time, then two entities are written. It appears that entity is still seen as null due to a race condition.
I thought that ancestor queries are supposed to be strongly consistent and would thus prevent the above scenario. It appears that strong consistency does not help when the entity does not exist yet.
Is there a way to prevent two entities from being written in the above case?
I could be wrong, but from the way you have worded the question, it sounds like you are confusing "strongly consistent" with "atomic". A strongly consistent read is guaranteed to reflect all updates that came before it, but is also stale the moment you execute it - someone else could have overwritten the data one microsecond later.
The only way to prevent race conditions in the datastore is to use transactions. There is a fair amount of documentation on that online so I won't belabor it here.
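For example, here is a hedged sketch in the style of the Objectify snippet above (MyEntity and its constructor are placeholders I made up). Wrapping the check-then-write in a transaction makes it atomic: if two requests race, one transaction is retried and its query then sees the entity the other wrote. Note that queries inside a transaction must be ancestor queries, which this one already is:

// Objectify 5-style transaction; VoidWork runs the whole block atomically.
ofy().transact(new VoidWork() {
    public void vrun() {
        MyEntity existing = ofy().load().type(MyEntity.class)
                .ancestor(key)
                .filter("someField", someValue)
                .first().now();
        if (existing == null) {
            // No other transaction can commit a conflicting write to this
            // entity group without forcing this transaction to retry.
            ofy().save().entity(new MyEntity(key, someValue)).now();
        }
    }
});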
I am aware of how concurrency can easily create issues in a program. I have a program that requires an asynchronous task that will save information to a database. In order to avoid making many calls to the database, I cache the data and only save to the database as needed. Before I commit to adding this functionality, I just want a bit of input on this method.
My question is: Is it safe to read static data in an asynchronous task, or is it more inclined to produce bugs? Should I go about doing this another way?
My apologies if this is a novice question. I searched for this question and couldn't find the information I needed.
Yes, it's pretty safe if done right. Just try to follow several rules (a sketch follows the list):
Use a readers-writer lock if you want your collection to be consistent at every moment. The benefit of this lock type is that it allows readers to read data with almost no performance penalty.
Have only one entry point for modifying the collection or the actual data. If you use any locking technique, then, of course, don't allow data to be read through entry points that bypass the locking.
If you use any locking technique, use one lock per resource (collection, variable, property). Don't expose this lock to the outside world; use it privately.
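A minimal Java sketch of those rules, assuming the cached data is a simple map (the class and names are mine, not from the question):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SharedCache {
    // One private lock per resource, never exposed to callers.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> data = new HashMap<>();

    public String get(String key) {
        lock.readLock().lock(); // many readers may hold this concurrently
        try {
            return data.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The single entry point for modification.
    public void put(String key, String value) {
        lock.writeLock().lock(); // exclusive: blocks readers and writers
        try {
            data.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}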
After several months of evaluating, reevaluating, and planning different data structures and web/application servers, I'm now at the point where I need to bang my head against implementation details. The (at the moment theoretical) question I'm facing is this:
Say I'm using G-WAN's KV store to store C structs for Users and the like (works fine, tested). How should I go about removing these objects from the KV store, and later on from memory, without encountering a race condition?
This is where I am at the moment:
Thread A:
grab other objects referencing the one to be deleted
set references to NULL
delete object
Thread B:
try to get object -> kv could return object, as it's not yet deleted
try to do something with the object -> it could already be deleted here, so I would be accessing already-freed memory?
or something else which could happen:
Thread B:
get thing referencing object
follow reference -> object might not be deleted here
do something with reference -> object might be deleted here -> problem
or
Thread B:
get some other object which could reference the to-be-deleted object
grab object which isn't yet deleted
set reference to object -> object might be deleted here -> problem
Is there a way to avoid these kinds of conditions, except for using locks? I've found a multitude of documents describing algorithms for different producer/consumer situations, hashtables, and so on, sometimes even with wait-free implementations (I haven't yet found a good example showing the difference between lock-free and wait-free, though I get it conceptually), but I haven't been able to figure out how to deal with the situations above.
Am I overthinking this, or is there maybe an easy way to avoid all these situations? I'm free to change the data and storage layout in any way I want, and I can use processor-specific instructions freely (e.g. CAS).
Thanks in advance
Several questions there:
deleting a GWAN KV stored struct
When removing a value from the KV store or freeing the KV store itself, you have to make sure that nobody is still dereferencing the freed data.
This is application-dependent. You can introduce some tolerance by using G-WAN memory pools, which will let data survive a KV deletion as long as the memory is not overwritten (or the pool freed).
deleting a GWAN KV key-value pair
G-WAN's KV store does the bookkeeping (using atomic intrinsics) to protect values fetched by threads and unprotects them after the request has been processed.
If you need to keep data for a longer time, make a copy.
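To illustrate the kind of atomic bookkeeping this implies, here is a hedged sketch of CAS-based reference counting, in Java rather than G-WAN's C (the names are mine): a reader may only touch an object while it holds a reference, and acquisition fails once the count has dropped to zero, so a deleter can never free data out from under a reader:

import java.util.concurrent.atomic.AtomicInteger;

final class RefCounted {
    private final AtomicInteger refs = new AtomicInteger(1); // owner's ref

    // Returns true if the caller may safely use the object.
    boolean tryAcquire() {
        for (int n = refs.get(); n > 0; n = refs.get()) {
            if (refs.compareAndSet(n, n + 1)) { // CAS: no lock needed
                return true;
            }
        }
        return false; // count already hit zero: object is logically deleted
    }

    void release() {
        if (refs.decrementAndGet() == 0) {
            // Last reference dropped; in C this is where free() would run.
        }
    }
}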
Other storage tools, like in-memory SQLite, use locks. In that case, lock granularity is very important (see the sketch below).
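As a generic illustration of lock granularity (a sketch of the idea, not SQLite's actual mechanism): striping the key space across several locks lets operations on unrelated keys proceed in parallel, where one global lock would serialize everything:

import java.util.concurrent.locks.ReentrantLock;

final class StripedLocks {
    private final ReentrantLock[] stripes = new ReentrantLock[16]; // power of two

    StripedLocks() {
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    // Map a key to one of the 16 stripes; unrelated keys rarely collide.
    ReentrantLock lockFor(Object key) {
        return stripes[key.hashCode() & (stripes.length - 1)];
    }
}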
I created a test web application to test persist-read-delete of Entities. I created a simple loop to persist an Entity, retrieve and modify it, then delete it, 100 times.
On some iterations of the loop there's no problem; however, on other iterations there is an error that the Entity already exists and thus can't be persisted (via custom exception handling I added).
Also, on some iterations the Entity can't be modified because it does not exist, and finally, on some iterations the Entity can't be deleted because it does not exist.
I understand that the loop may be so fast that the operations against the App Engine datastore are not yet complete. This causes errors like "Entity does not exist" when trying to access it, or failures to create an Entity with the same ID because the delete operation has not yet finished, and so forth.
However, I want to understand how to handle these kinds of situations where concurrent operations are being done on an Entity.
From what I understand you are doing something like the following:
for i in range(0, 100):
    ent = My_Entity()        # create and save entity
    db.put(ent)
    ent = db.get(ent.key())  # get, modify and save the entity
    ent.property = 'foo'
    db.put(ent)
    ent = db.get(ent.key())  # get and delete the entity
    db.delete(ent)
with some error checking to make sure the entities you want to delete or modify actually exist, and you are running into a bunch of errors about finding the entity to delete or modify. As you say, this is because the calls aren't guaranteed to be executed in order.
However, I want to understand how to handle these kinds of situations where concurrent operations are being done on an Entity.
Your best bet for this is to batch the modifications you are making to an entity before persisting it. For example, if you are going to be creating/saving/modifying/saving, or modifying/saving/deleting, wherever possible try to combine these steps (i.e. create/modify/save, or modify/delete). Not only will this avoid the errors you're seeing, but it will also cut down on your RPCs. Following this strategy, the above loop would be reduced to...
prop = None
for i in range(0, 100):
    prop = 'foo'
In other words, for anything that changes that quickly, just use a local variable; that's GAE's answer for you. Once you've worked through all the quick changes locally, you can then persist that information in an entity.
Other than that there isn't much you can do. Transactions can help if you need to make sure a bunch of entities are updated together, but they won't help if you're trying to do multiple things to one entity at once.
EDIT: You could also look at the pipelines API.
In our web-based project we always have some static data, which can be stored in a file as an array or in a database table. So which one should be preferred?
In my opinion, arrays have some advantages:
More flexible (the array can be any structure, which can express really complex relations)
Better performance (the array is loaded in memory, giving better read/write performance than a database's I/O operations)
But my colleague argued that he preferred DB approach, since it can keep a uniform data persistence interface, and be more flexible.
So which should be preferred? How should we choose? Or should we prefer one in some scenarios and the other in other scenarios? What are those scenarios?
EDIT:
Let me clarify something. As Benjamin's change to the title reflects, the data we want to store in an array (file) won't change frequently, which means the code won't change the value of the array at runtime. If the data changed frequently I would undoubtedly use a DB. That's why I made this post.
And sometimes it's hard to store some really complex relations like:
Task = {
    "1": {
        "name": "xx",
        "requirement": {
            "level": 5,
            "money": 100,
        },
        # ...
    },
}
As in the above code sample (a Python dict, though you can think of it as an array), the requirement field is hard to store in a DB (store a structure like a pickled object directly in the DB? not so good, I think). So in such a condition, I prefer arrays.
So what's your idea? In such a scenario, we should prefer arrays to a DB, right?
Regards.
Let's be pragmatic/objective:
Do you write to your data at runtime? Yes: DB; No: File
Do you update your data more than once per week? Yes: DB; No: File
Is it a pain to release an updated data file? Yes: DB; No: File
Do you read that data often? Yes: File/Cache; No: DB
Is it a pain to update that data file, requiring extra tools? Yes: DB; No: File
For sure I've forgotten other points, but I guess the basics are there.
The "flexiable" array in a file is fraught with a zillion issues already delt with by using a DB. Unless you can prove that the DB is really going to way slower than using the other approach use a DB. Move on and start solving business problems.
Edit
A comment from the OP asks what the issues with using a file might be; here are a handful (pause to take a deep breath).
Concurrency: You have to manage the situation where multiple requests may be trying to write back to the file. Not too hard, but it becomes a bottleneck.
Performance: Yes, modifying an in-memory array is quicker, but how do you determine how much of the array needs to be persisted to a file, and when? Note that using a DB doesn't preclude the use of an appropriate in-memory cache. Writing a file back each time a small modification is made isn't going to perform well.
Scalability: Really a function of the first two. In order to achieve any scalability goals you need to be able to quickly modify small bits of the persisted data. In other words, if you don't use a DB you will end up writing one. If you find you need more than one web server to support growing demand, where are you going to store the file(s)? Now you've got file I/O over a network (albeit likely a very quick one).
Structure: If you use an array, your code will be responsible for managing the structure of the data, querying it, and so on. How will you do that in a way which achieves greater "flexibility" than using a DB? All manner of choices and complexity are needed here.
Reliability: You need to ensure the integrity of your persisted data. In the event of some failure, your array/file code would need to ensure that the data is at least not so corrupt that the application cannot continue.
Your colleague is correct, BUT this is where you need to put aside the comp-sci textbook and be pragmatic. How often will you be accessing this data from your application? If it's fairly frequent, then don't incur the access overhead on every read. Instead of reading from a flat file you could still gain the advantages of a DB, but use a caching strategy in your application. Depending on your development language you could look at something like memcache or jtreecache. A sketch of the idea follows.
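The read-through caching idea, as a hedged sketch (all names are mine; a real memcache client would replace the in-process map):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ReadThroughCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> slowLookup; // e.g. a DB query

    ReadThroughCache(Function<String, String> slowLookup) {
        this.slowLookup = slowLookup;
    }

    String get(String key) {
        // A hit is served from memory; a miss falls through to the DB
        // and the result is stored for subsequent reads.
        return cache.computeIfAbsent(key, slowLookup);
    }
}

You would construct it with whatever DB call you already have as the fallback, e.g. new ReadThroughCache(k -> queryDatabase(k)), where queryDatabase is a hypothetical stand-in for your data access code.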
It depends on what kind of data you are looking at, and whether or not it needs to be updated regularly.
I tend to keep most things (non-config data) in the database, even if the data isn't going to be repeating (e.g. thousands of rows). Databases will scale much more easily than a flat file; if your system starts to grow fast, your flat file might become a burden on your system.
If the data doesn't change very often, and you're programming in Java, why not use Spring to hold the values?
They can be injected into your bean, and changed easily.
But that's only if you're developing in Java. A sketch follows.
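For illustration, a hedged sketch of that idea with modern Spring (the property names are made up): the values live in a properties file and are injected into a bean, so changing them is an edit and a restart rather than a schema change:

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class TaskConfig {
    @Value("${task.requirement.level:5}")   // read from application.properties
    private int requirementLevel;

    @Value("${task.requirement.money:100}") // default used if the key is absent
    private int requirementMoney;

    public int getRequirementLevel() { return requirementLevel; }
    public int getRequirementMoney() { return requirementMoney; }
}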
Yeah, I agree with your implied assessment that databases are overused and that basic flat files can work in a multitude of scenarios. If your application is read-only (and writes are done by the admin when the app restarts), I would definitely go with the file. Even if the application writes to the file, but only in append mode (vs. random inserts/updates) and from one thread, I would also use a file. Anything else needs a real database with random updates, queries, concurrency control, etc.