Using the Azure-Search-Client-Library with parallel tasks - azure-cognitive-search

I am using the Azure Search Client Library, and I want to run a query multiple times in parallel with different query parameters. When I try this I get the exception:
"Collection was modified; enumeration operation may not execute."
I worked around the problem by adding a SemaphoreSlim before the call, which prevents multiple threads from executing the query at the same time. However, this solution doubles the execution time.
private static readonly SemaphoreSlim syncLock = new SemaphoreSlim(1);
....
await syncLock.WaitAsync();
try
{
    result = await SearchClient.Indexes[IndexName].QueryAsync<MyIndex>(queryParams);
}
finally { syncLock.Release(); } // release even if the query throws
Since each query is an independent call, I assumed the threads would not affect each other?

Behind the scenes, there's a shared enumerable object that lists the indexes created within a service. If there isn't a reference in memory to the index you're trying to retrieve, one is created after fetching the index's statistics, schema and some other properties on your behalf, completely transparently to the user. However, if this operation runs on multiple threads in parallel, it will throw this exception.
Thanks a lot for this feedback; I'll try to update the library ASAP and have this situation handled accordingly. Until then (I suspect this might take a couple of days), please keep using your Semaphore solution, which works great.
Thanks again!
Alex

Related

Reading static data in asynchronous task

I am aware of how easily concurrency can create issues in a program. I have a program that requires an asynchronous task that saves information to a database. To avoid making many calls to the database, I cache the data and only save to the database as needed. Before I commit to adding this functionality, I just want a bit of input on this method.
My question is: is it safe to read static data in an asynchronous task, or is it more inclined to produce bugs? Should I go about this another way?
My apologies if this is a novice question. I searched for this question and couldn't find the information I needed.
Yes, it's pretty safe if done right. Just try to follow a few rules:
Use a readers-writer lock if you want your collection to be consistent at every moment (see the sketch after this list). The benefit of this lock type is that it lets readers read the data with almost no performance penalty.
Have only one entry point for modifying the collection or the actual data. If you use any locking technique, then, of course, don't allow data to be read through entry points that bypass the lock.
If you use any locking technique, use one lock per resource (collection, variable, property). Don't expose this lock to the outside world; use it privately.
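As a concrete illustration of the readers-writer idea, here is a minimal sketch in Python. It is only an assumption-level example (Python's standard library has no built-in readers-writer lock, so this is the classic readers-preference construction), not tied to any particular framework:

import threading

class ReadWriteLock(object):
    # Many readers may hold the lock at once; a writer gets exclusive access.
    # Readers-preference: a steady stream of readers can starve writers.
    def __init__(self):
        self._readers = 0
        self._read_lock = threading.Lock()   # guards the reader count
        self._write_lock = threading.Lock()  # held while readers or a writer are active

    def acquire_read(self):
        with self._read_lock:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()   # first reader blocks writers

    def release_read(self):
        with self._read_lock:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()   # last reader lets writers in

    def acquire_write(self):
        self._write_lock.acquire()

    def release_write(self):
        self._write_lock.release()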

Concurrency with Objectify in GAE

I created a test web application to test persist-read-delete of Entities. I created a simple loop that persists an Entity, retrieves and modifies it, then deletes it, 100 times.
On some iterations of the loop there's no problem; on others, however, there is an error that the Entity already exists and thus can't be persisted (custom exception handling I added).
Also, on some iterations the Entity can't be modified because it does not exist, and finally, on some iterations the Entity can't be deleted because it does not exist.
I understand that the loop may be so fast that the operation against the App Engine datastore is not yet complete, causing errors like "Entity does not exist" when trying to access it, or the delete operation not yet being finished so an Entity with the same ID can't be created yet, and so forth.
However, I want to understand how to handle these kinds of situations where concurrent operations are being done on an Entity.
From what I understand you are doing something like the following:
for i in range(0, 100):
    ent = My_Entity()          # create and save entity
    db.put(ent)
    ent = db.get(ent.key())    # get, modify and save the entity
    ent.property = 'foo'
    db.put(ent)
    ent = db.get(ent.key())    # get and delete the entity
    db.delete(ent)
with some error checking to make sure you have entities to delete and modify, and you are running into a bunch of errors about finding the entity to delete or modify. As you say, this is because the calls aren't guaranteed to be executed in order.
However, I want to understand how to handle these kinds of situations where concurrent operations are being done on an Entity.
Your best bet for this is to batch any modifications you are doing for an entity into a single persist. For example, if you are going to be creating/saving/modifying/saving, or modifying/saving/deleting, wherever possible try to combine these steps (i.e. create/modify/save, or modify/delete). Not only will this avoid the errors you're seeing, but it will also cut down on your RPCs. Following this strategy, the above loop would be reduced to...
prop = None
for i in range(0, 100):
    prop = 'foo'
Put another way: for anything that is set and deleted that quickly, just use a local variable. That's GAE's answer for you. After you figure out all the quick stuff, you can then persist that information in an entity.
Other than that there isn't much you can do. Transactions can help if you need to make sure a bunch of entities are updated together, but they won't help if you're trying to do multiple things to one entity at once.
EDIT: You could also look at the pipelines API.
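For reference, the shape of a transactional read-modify-write with the old db API is a function run under db.run_in_transaction. A minimal sketch, assuming ent_key is a key from the loop above and property is its placeholder property:

from google.appengine.ext import db

def set_property(key, value):
    ent = db.get(key)
    ent.property = value  # placeholder property from the loop above
    ent.put()

# get/modify/put execute atomically; conflicting writers cause a retry
db.run_in_transaction(set_property, ent_key, 'foo')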

What's the difference between findAndModify and update in MongoDB?

I'm a little bit confused by the findAndModify method in MongoDB. What's the advantage of it over the update method? To me, it seems that it just returns the item first and then updates it. But why do I need to return the item first? I read MongoDB: The Definitive Guide, and it says that findAndModify is handy for manipulating queues and performing other operations that need get-and-set style atomicity. But I didn't understand how it achieves this. Can somebody explain this to me?
If you fetch an item and then update it, there may be an update by another thread between those two steps. If you update an item first and then fetch it, there may be another update in-between and you will get back a different item than what you updated.
Doing it "atomically" means you are guaranteed that you are getting back the exact same item you are updating - i.e. no other operation can happen in between.
findAndModify returns the document, update does not.
If I understood Dwight Merriman (one of the original authors of MongoDB) correctly, using update to modify a single document, i.e. {"multi": false}, is also atomic. Currently, it should also be faster than doing the equivalent update using findAndModify.
From the MongoDB docs (emphasis added):
By default, both operations modify a single document. However, the update() method with its multi option can modify more than one document.
If multiple documents match the update criteria, for findAndModify(), you can specify a sort to provide some measure of control on which document to update.
With the default behavior of the update() method, you cannot specify which single document to update when multiple documents match.
By default, the findAndModify() method returns the pre-modified version of the document. To obtain the updated document, use the new option.
The update() method returns a WriteResult object that contains the status of the operation. To return the updated document, use the find() method. However, other updates may have modified the document between your update and the document retrieval. Also, if the update modified only a single document but multiple documents matched, you will need to use additional logic to identify the updated document.
Before MongoDB 3.2 you cannot specify a write concern to findAndModify() to override the default write concern whereas you can specify a write concern to the update() method since MongoDB 2.6.
When modifying a single document, both findAndModify() and the update() method atomically update the document.
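To make the contrast concrete, here is a minimal PyMongo sketch (find_one_and_update is PyMongo's wrapper around findAndModify; the collection and field names are invented for the example):

from pymongo import MongoClient, ReturnDocument

jobs = MongoClient().test.jobs  # assumes a local mongod; names are made up

# update_one is atomic, but you only get back a result object
res = jobs.update_one({"state": "ready"}, {"$set": {"state": "running"}})
print(res.modified_count)  # 1 if a document matched and was changed

# find_one_and_update makes the same atomic change but hands back the
# document itself, so no other client can slip in between update and read
job = jobs.find_one_and_update(
    {"state": "ready"},
    {"$set": {"state": "running"}},
    sort=[("priority", -1)],               # choose which match is taken
    return_document=ReturnDocument.AFTER,  # return the post-update version
)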
One useful class of use cases is counters and similar cases. For example, take a look at this code (one of the MongoDB tests):
find_and_modify4.js.
Thus, with findAndModify you increment the counter and get its incremented value in one step. Compare: if you (A) perform this operation in two steps and somebody else (B) does the same operation between your steps, then A and B may get the same last counter value instead of two different ones (just one example of the possible issues).
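In PyMongo terms the counter looks like this (the counter name is invented; same caveats as the sketch above):

from pymongo import MongoClient, ReturnDocument

counters = MongoClient().test.counters  # assumes a local mongod

# Increment and read back in one atomic step; two clients doing this
# concurrently are guaranteed to see two different values.
doc = counters.find_one_and_update(
    {"_id": "page_views"},
    {"$inc": {"value": 1}},
    upsert=True,                           # create the counter on first use
    return_document=ReturnDocument.AFTER,
)
print(doc["value"])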
This is an old question but an important one and the other answers just led me to more questions until I realized: The two methods are quite similar and in many cases you could use either.
Both findAndModify and update perform atomic changes within a single request, such as incrementing a counter; in fact the <query> and <update> parameters are largely identical.
With both, the atomic change takes place directly on a document matching the query when the server finds it, i.e. an internal write lock is held on that document for the fraction of a millisecond in which the server confirms the query matches and applies the update.
There is no system-level write lock or semaphore which a user can acquire. Full stop. MongoDB deliberately doesn't make it easy to check out a document then change it then write it back while somehow preventing others from changing that document in the meantime. (While a developer might think they want that, it's often an anti-pattern in terms of scalability and concurrency ... as a simple example imagine a client acquires the write lock then is killed while holding it. If you really want a write lock, you can make one in the documents and use atomic changes to compare-and-set it, and then determine your own recovery process to deal with abandoned locks, etc. But go with caution if you go that way.)
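If you do go that way, the compare-and-set itself can be one atomic call; a rough PyMongo sketch (all document and field names are invented):

import datetime
from pymongo import MongoClient

docs = MongoClient().test.docs  # assumes a local mongod

# The claim succeeds only if nobody holds the lock; match-and-set is atomic.
claimed = docs.find_one_and_update(
    {"_id": "some-doc", "locked_by": None},
    {"$set": {"locked_by": "worker-42",
              "locked_at": datetime.datetime.utcnow()}},
)
if claimed is None:
    pass  # someone else holds the lock: back off, retry, or reap stale locks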
From what I can tell there are two main ways the methods differ:
If you want a copy of the document as of when your update was made: only findAndModify allows this, returning either the original (default) or the new record after the update, as mentioned; with update you only get a WriteResult, not the document, and of course reading the document immediately before or after doesn't guard you against another process also changing the record in between your read and update.
If there are potentially multiple matching documents: findAndModify only changes one, and lets you customize the sort to indicate which one should be changed; update can change all of them with multi, although it defaults to just one, but it does not let you say which one.
Thus it makes sense, as HungryCoder says, that update is more efficient where you can live with its restrictions (e.g. you don't need to read the document, or you are changing multiple records). But for many atomic updates you do want the document, and findAndModify is necessary there.
We used findAndModify() for counter operations (inc or dec) and other single-field mutation cases. Migrating our application from Couchbase to MongoDB, I found this API to replace code that did GetAndLock(), modified the content locally, called Replace() to save, and then Get() again to fetch the updated document back. With MongoDB, I just used this single API, which returns the updated document.

most efficient way to get, modify and put a batch of entities with ndb

In my app I have a few batch operations I perform.
Unfortunately this sometimes takes forever to update 400-500 entities.
What I have is all the entity keys; I need to get them, update a property and save them to the datastore, and saving them can take up to 40-50 seconds, which is not what I'm looking for.
I'll simplify my model to explain what I do (which is pretty simple anyway):
class Entity(ndb.Model):
    title = ndb.StringProperty()

keys = [key1, key2, key3, key4, ..., key500]
entities = ndb.get_multi(keys)
for e in entities:
    e.title = 'the new title'
ndb.put_multi(entities)
Getting and modifying does not take too long. I tried get_async, getting in a tasklet, and whatever else is possible, which only changes whether the get or the for loop takes longer.
But what really bothers me is that a put can take up to 50 seconds...
What is the most efficient way to do this operation (or operations) in a decent amount of time? Of course I know that it depends on many factors, like the complexity of the entity, but the time it takes to put is really over the acceptable limit to me.
I already tried async operations, tasklets...
I wonder if doing smaller batches of e.g. 50 or 100 entities would be faster. If you make that into a tasklet, you can try running those tasklets concurrently.
I also recommend looking at this with Appstats to see if it shows something surprising.
Finally, assuming this uses the HRD, you may find that there is a limit on the number of entity groups per batch. This limit defaults very low; try raising it.
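A rough sketch of that batching idea with asynchronous puts (the batch size of 100 is just a starting point to tune):

from google.appengine.ext import ndb

def put_in_batches(entities, batch_size=100):
    # Kick off the put for each batch without waiting, then block once at
    # the end; the underlying RPCs run concurrently.
    futures = []
    for i in range(0, len(entities), batch_size):
        futures.extend(ndb.put_multi_async(entities[i:i + batch_size]))
    ndb.Future.wait_all(futures)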
Sounds like what MapReduce was designed for. You can do this quickly by getting and modifying all the entities at the same time, scaled across multiple server instances. Your cost goes up by using more instances, though.
I'm going to assume that you have the entity design that you want (i.e. I'm not going to ask you what you're trying to do and how maybe you should have one big entity instead of a bunch of small ones that you have to update all the time). Because that wouldn't be very nice. ( =
What if you used the Task Queue? You could create multiple tasks, and each task could take as URL params the keys it is responsible for updating, plus the property and value that should be set. That way the work is broken up into manageable chunks, and the user's request can return immediately while the work happens in the background. Would that work?
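One way that could look, sketched with the deferred library (a thin wrapper over the Task Queue; update_chunk and all_keys are invented names):

from google.appengine.ext import deferred, ndb

def update_chunk(keys, new_title):
    # Runs later on the task queue, outside the user-facing request.
    entities = ndb.get_multi(keys)
    for e in entities:
        e.title = new_title
    ndb.put_multi(entities)

# In the request handler: fan the work out and return immediately.
for i in range(0, len(all_keys), 100):
    deferred.defer(update_chunk, all_keys[i:i + 100], 'the new title')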

Delayed Table Initialization

Using the net-snmp API, with mib2c generating the skeleton code, is it possible to support delayed initialization of tables? What I mean is, the table would not be initialized until any of its members were queried directly. The reason for this is that the member data is obtained from another server, and I'd like to be able to start the snmpd daemon without requiring the other server to be online/ready for requests. I thought of maybe initializing the table with dummy data that gets updated with the real values when a member is queried, but I'm not sure if this is the best way.
The table also has only one row of entries, so using mib2c.iterate.conf to generate table iterators and dealing with all of that just seems unnecessary. I thought of maybe just implementing the sequence defined in the MIB and not the actual table, but that's not usually how it's done in all the examples I've seen. I looked at /mibgroup/examples/delayed_instance.c, but that's not quite what I'm looking for. Using mib2c with the mib2c.create-dataset.conf config file was the closest I got to getting this to work easily, but this config file assumes the data is static and not external (both of which are not true in my case), so it won't work. If it's not easily done, I'll probably just implement the sequence and not the table, but I'm hoping there's an easy way. Thanks in advance.
The iterator method will work just fine. It won't load any data until it calls your _first and _next routines. So it's up to you, in those routines and in the _handler routine, to request the data from the remote server. In fact, by default, it doesn't cache data at all so it'll make you query your remote server for every request. That can be slow if you have a lot of data in the table, so adding a cache to store the data for N seconds is advisable in that case.
