I have a logic app that uses a For_each to iterate through email attachments and save them to Azure Files.
Depending on some conditions, I want the attachment stored under a different name or at a different path, but the default concurrency of For_each concerns me: I have path and file name variables declared at the top level, and I set them inside the loop as the conditions are met.
Is there a way to be sure that these variables will hold the value I'm setting in that iteration without setting concurrency to 1?
It seems like it's working correctly with default concurrency, but I'm going to set concurrency to 1 until I'm sure about whether or not these iterations can interfere with each other in terms of setting variables.
If you are changing a variable's value inside the for-each loop (and potentially also consuming it in the same loop), you should set concurrency to 1 so the loop runs sequentially; otherwise you risk race conditions.
Since variables can be declared only at the top level, some scenarios simply cannot work without setting concurrency to 1. Consider JSON with this schema:
```
[
  {
    [JSON Object]
  }
]
```
where we need to iterate over each inner object, read a value, increase a counter, and take some action based on that counter. Here we would really want a loop-local variable, but all we have are global ones, so the iterations must not overlap.
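The race being warned about is easiest to see outside Logic Apps. Below is a minimal Python analogy (plain threads, not Logic Apps code, and the attachment names are made up): one "top level" variable shared by all iterations, set and then read inside each iteration.

```python
import threading
import time

# Hypothetical attachment names; current_path plays the role of a workflow-level
# ("top level") variable that every iteration sets and then reads.
attachments = [f"file_{i}.pdf" for i in range(5)]
current_path = None
results = {}


def handle(attachment):
    global current_path
    current_path = f"/archive/{attachment}"   # the "Set variable" step
    time.sleep(0.01)                          # another iteration can run here
    results[attachment] = current_path        # the step that reads the variable back


# Parallel iterations (the default for-each concurrency): a read can observe
# another iteration's write, so results[a] often points at the wrong path.
threads = [threading.Thread(target=handle, args=(a,)) for a in attachments]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)

# Sequential iterations (concurrency = 1): every read sees its own write.
results.clear()
for a in attachments:
    handle(a)
print(results)
```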
Picture the following sequence: two clients (Client_A and Client_B) store values into a key-value datastore.
The problem I'm trying to solve is how to prevent overwriting keys. The way the applications prevent this is by checking whether the key exists before storing. The issue is that if both clients get the same "does not exist" result, either of them can overwrite the other's value.
What strategy can be used in a database client design to prevent this from happening?
A "key-value store", as it's usually defined, doesn't store duplicate keys at all. If two clients write to the same key, then only one value is stored -- the one from which ever client wrote "latest".
In order to reliably update values in a consistent way (where the new value depends on the old value associated with a key, or even whether or not there was an old value), your key-value store needs to support some kinds of atomic operations other than simple get and set.
Memcache, for example, supports atomic compare-and-set operations that will only set a value if it hasn't been set by someone else since you read it. Amazon's DynamoDB supports atomic transactions, atomic counters, etc.
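For example, with memcached from Python (a sketch assuming the pymemcache client and a memcached instance on localhost:11211), both the "store only if absent" and the compare-and-set primitives look like this:

```python
from pymemcache.client.base import Client

client = Client(("localhost", 11211))

# 1) "Create only if it does not exist": add is atomic on the server, so only
#    one of two racing clients succeeds -- no separate existence check needed.
created = client.add("user:42", b"Client_A data", noreply=False)
if not created:
    print("key already exists; another client won the race")

# 2) Compare-and-set for read-modify-write: the write is rejected if anyone
#    else changed the value after our read.
value, cas_token = client.gets("user:42")
updated = client.cas("user:42", value + b" (updated)", cas_token)
if not updated:
    print("value changed underneath us; re-read and retry")
```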
START TRANSACTION;
SELECT ... FOR UPDATE;
-- take action depending on the result
UPDATE ...;
COMMIT;
The "transaction" makes the pair. SELECT and UPDATE, "atomic".
Write this sort of code for any situation where another connection can sneak in and mess up what you are doing.
Note: The code written here uses MySQL's InnoDB syntax and semantics; adjust accordingly for other brands.
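From application code, the same pattern might look roughly like this in Python, assuming the mysql-connector-python driver and a hypothetical accounts table:

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app", password="secret",
                               database="mydb")
try:
    conn.start_transaction()
    cur = conn.cursor()
    # The row stays locked from the SELECT ... FOR UPDATE until COMMIT,
    # so no other connection can sneak in between the read and the write.
    cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (42,))
    (balance,) = cur.fetchone()

    if balance >= 100:  # take action depending on the result
        cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (42,))
    conn.commit()
except Exception:
    conn.rollback()
    raise
finally:
    conn.close()
```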
For my project I need IDs that can be easily shared, so Firestore's default auto-generated IDs won't work.
I am looking for a way to auto-generate an ID like 8329423 that would be incremented, or randomly chosen, in the range 0 to 9999999.
Firestore's auto-ID fields are designed to statistically guarantee that no two clients will ever generate the same value. This is why they're as long as they are: it's to ensure there is enough randomness (entropy) in them.
This allows Firestore to generate these keys entirely client-side, without needing to check on the server whether the key was already generated by another client. That in turn has these main benefits:
Since the keys are generated client-side, they can also be generated when the client is not connected to any server.
Since the keys are generated client-side, there is no need for a roundtrip to the server to generate a new key. This significantly speeds up the process.
Since the keys are generated client-side, there is no contention between clients generating keys. Each client just generates keys as needed.
If these benefits are important to your use-case, then you should strongly consider whether you're likely to create a better unique ID than Firestore already does. For example, Firestore's IDs have 62^20 possible values, which is why they're statistically guaranteed to never generate the same value over a very long period of time. Your proposed range of 0 - 9999999 has only 10 million unique values, which is much more likely to generate duplicates.
If you really want this scheme for IDs, you will need to store the IDs that you've already given out on the server (likely in Firestore), so that you can check against it when generating a new key. A very common way to do this is to keep a counter of the last ID you've already handed out in a document. To generate a new unique ID, you:
1. Read the latest counter value from the document.
2. Increment the counter.
3. Write the updated counter value to the document.
4. Use the updated counter value in your code.
Since this read-update-write happens from multiple clients, you will need to use a transaction for it. Also note that the clients now are coordinating the key-generation, so you're going to experience throughput limits on the number of keys you can generate.
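A sketch of that read-increment-write counter inside a transaction, using the google-cloud-firestore Python client (the counters/ids document and its last_id field are names made up for this example):

```python
from google.cloud import firestore

db = firestore.Client()
counter_ref = db.collection("counters").document("ids")


@firestore.transactional
def next_id(transaction, ref):
    snapshot = ref.get(transaction=transaction)              # 1. read the latest counter value
    last_id = (snapshot.to_dict() or {}).get("last_id", 0)
    new_id = last_id + 1                                      # 2. increment it
    transaction.set(ref, {"last_id": new_id})                 # 3. write it back atomically
    return new_id                                             # 4. use it in your code


shared_id = next_id(db.transaction(), counter_ref)
print(shared_id)
```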
We're loading a list of emails from a file while concurrently putting a large number of datastore entities, one per email address, and we get occasional errors of the form:
BadRequestError: cross-group transaction need to be explicitly specified, see TransactionOptions.Builder.withXG
The failing Python method call:
EmailObj.get_or_insert(key_name=email, source=source, reason=reason)
where email is the address string and source and reason are StringProperties.
Question: how can the get_or_insert call, for one simple datastore model (2 string properties), start a transaction that ends up involving entities from different entity groups? I expect the method to either read the existing object matching the given key or store the new entity.
Note: I don't know the exact internal implementation, this is just a theory...
Since you didn't specify a parent, there is no entity group that could be "locked"/established from the beginning as the group to which the transaction would be limited.
The get_or_insert operation would normally check whether an entity with that keyname exists and, if not, create a new entity. Without a parent, this new entity would be in its own group, let's call it new_group.
Now, to complete the transaction automatically associated with get_or_insert, a conflict check would need to be done.
A conflict would mean that an entity with the same keyname was created by one of the concurrent tasks (after our existence check but before our transaction ended); that entity would also have its own, different group, let's call it other_group.
This conflict check, only in the case where the conflict is real, would effectively access both new_group and other_group, causing the exception.
By specifying a parent the issue doesn't exist, since both new_group and other_group would actually be the same group - the parent's group. The get_or_insert transaction would be restricted to this group.
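For illustration, specifying a parent in the old Python db API could look like this sketch (EmailRoot is a hypothetical placeholder kind whose only purpose is to give every EmailObj the same ancestor, and the root entity never needs to be stored):

```python
from google.appengine.ext import db

# One well-known ancestor key shared by all EmailObj entities.
root_key = db.Key.from_path('EmailRoot', 'all-emails')

# Same call as before, but every entity now lands in the parent's entity group.
EmailObj.get_or_insert(key_name=email, parent=root_key,
                       source=source, reason=reason)
```

Keep in mind that this puts all EmailObj entities in a single entity group, which in the legacy datastore limits sustained writes to roughly one per second for that group.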
The theory could be verified, I think, even in production (the errors are actually harmless if the theory is correct - the addresses are, after all, duplicates).
The first step would be to confirm that the occurrences are related to the same email address being inserted twice in a very short time - concurrently.
Wrap the call in try/except for that exception and log the corresponding email address (a sketch follows below), then check the logged addresses against the input files - they should be duplicates, and I imagine located quite close to each other in the file.
Then you could force that condition - intentionally inserting duplicate addresses every 5-10 emails (or some other rate you consider relevant).
Depending on how the app is actually implemented, you could insert the duplicates in the input files, or, if using tasks, create/enqueue duplicate tasks, or simply call get_or_insert twice (not sure the last one will be effective enough, as it may happen on the same thread/instance). Either way, log or keep track of the injected duplicate addresses.
The occurrence rate of the exception should now increase, potentially in proportion to the forced duplicate insertion rate. Most if not all of the corresponding logged addresses should match the tracked duplicates being injected.
If you get that far, duplicate entries being the trigger is confirmed. What would be left to confirm is the different entity groups for the same email address - I can't think of a way to check that exactly, but I also can't think of any other group related to the duplicate email addresses.
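The logging step mentioned earlier could be as simple as this sketch (assuming the classic db API, where the failure surfaces as datastore_errors.BadRequestError):

```python
import logging

from google.appengine.api import datastore_errors

try:
    EmailObj.get_or_insert(key_name=email, source=source, reason=reason)
except datastore_errors.BadRequestError:
    # Expect these addresses to be duplicates sitting close together in the input file.
    logging.warning('cross-group BadRequestError for email %s', email)
```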
Anyway, just add xg=True in the get_or_insert calls' context options and, if my theory is correct, the errors should disappear :)
I am using the Azure Search Client Library and I want to call a query multiple times in parallel with different query parameters. When I try this I get the exception:
"Collection was modified; enumeration operation may not execute."
I worked around the problem by adding a SemaphoreSlim before the call, which prevents multiple threads from executing the query at the same time. However, this solution doubles the execution time.
private static readonly SemaphoreSlim syncLock = new SemaphoreSlim(1);
....
await syncLock.WaitAsync();
try
{
    result = await SearchClient.Indexes[IndexName].QueryAsync<MyIndex>(queryParams);
}
finally { syncLock.Release(); }   // release even if the query throws
Since each query is an independent call, I assumed the threads would not affect each other?
Behind the scenes, there's a common enumerable object that lists the indexes created within a service. If there isn't a reference in memory to the index you're trying to retrieve, one will be created after fetching the index's statistics, schema and some other properties on your behalf, totally transparently to the user. However, if this operation is done on another thread in parallel multiple times, it will throw this exception.
Thanks a lot for this feedback. I'll try to update the library ASAP and have this situation handled accordingly. Until then (I suspect this might take a couple of days), please keep using your semaphore solution, which works great.
Thanks again!
Alex
I'm a little bit confused by the findAndModify method in MongoDB. What's the advantage of it over the update method? To me, it seems that it just returns the item first and then updates it. But why do I need to return the item first? I read MongoDB: The Definitive Guide and it says that it is handy for manipulating queues and performing other operations that need get-and-set style atomicity. But I didn't understand how it achieves this. Can somebody explain this to me?
If you fetch an item and then update it, there may be an update by another thread between those two steps. If you update an item first and then fetch it, there may be another update in-between and you will get back a different item than what you updated.
Doing it "atomically" means you are guaranteed that you are getting back the exact same item you are updating - i.e. no other operation can happen in between.
findAndModify returns the document, update does not.
If I understood Dwight Merriman (one of the original authors of MongoDB) correctly, using update to modify a single document (i.e. {multi: false}) is also atomic. Currently, it should also be faster than doing the equivalent update using findAndModify.
From the MongoDB docs (emphasis added):
By default, both operations modify a single document. However, the update() method with its multi option can modify more than one document.
If multiple documents match the update criteria, for findAndModify(), you can specify a sort to provide some measure of control on which document to update.
With the default behavior of the update() method, you cannot specify which single document to update when multiple documents match.
By default, the findAndModify() method returns the pre-modified version of the document. To obtain the updated document, use the new option.
The update() method returns a WriteResult object that contains the status of the operation. To return the updated document, use the find() method. However, other updates may have modified the document between your update and the document retrieval. Also, if the update modified only a single document but multiple documents matched, you will need to use additional logic to identify the updated document.
Before MongoDB 3.2 you cannot specify a write concern to findAndModify() to override the default write concern whereas you can specify a write concern to the update() method since MongoDB 2.6.
When modifying a single document, both findAndModify() and the update() method atomically update the document.
One useful class of use cases is counters and similar cases. For example, take a look at this code (one of the MongoDB tests):
find_and_modify4.js.
Thus, with findAndModify you increment the counter and get its incremented value in one step. Compare: if you (A) perform this operation in two steps and somebody else (B) does the same operation between your steps, then A and B may get the same last counter value instead of two different ones (just one example of the possible issues).
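In PyMongo the same counter pattern looks like this sketch (find_one_and_update is the driver-level findAndModify; the counters collection and its value field are assumptions made for the example):

```python
from pymongo import MongoClient
from pymongo.collection import ReturnDocument

counters = MongoClient()["mydb"]["counters"]

doc = counters.find_one_and_update(
    {"_id": "page_views"},
    {"$inc": {"value": 1}},                # increment the counter...
    upsert=True,                           # ...creating it if it doesn't exist yet
    return_document=ReturnDocument.AFTER,  # ...and return the incremented value
)
print(doc["value"])
```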
This is an old question but an important one and the other answers just led me to more questions until I realized: The two methods are quite similar and in many cases you could use either.
Both findAndModify and update perform atomic changes within a single request, such as incrementing a counter; in fact the <query> and <update> parameters are largely identical.
With both, the atomic change takes place directly on a document matching the query when the server finds it, i.e. an internal write lock is held on that document for the fraction of a millisecond that the server takes to confirm the query matches and to apply the update.
There is no system-level write lock or semaphore which a user can acquire. Full stop. MongoDB deliberately doesn't make it easy to check out a document then change it then write it back while somehow preventing others from changing that document in the meantime. (While a developer might think they want that, it's often an anti-pattern in terms of scalability and concurrency ... as a simple example imagine a client acquires the write lock then is killed while holding it. If you really want a write lock, you can make one in the documents and use atomic changes to compare-and-set it, and then determine your own recovery process to deal with abandoned locks, etc. But go with caution if you go that way.)
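If you do go the roll-your-own-lock route described above, the compare-and-set itself can be a single findAndModify. A PyMongo sketch (the locked_by/locked_at fields are made-up names for this illustration, and abandoned-lock recovery is left out):

```python
from datetime import datetime, timezone

from pymongo import MongoClient

docs = MongoClient()["mydb"]["documents"]


def try_lock(doc_id, worker_id):
    """Atomically claim the document if, and only if, nobody else holds it."""
    return docs.find_one_and_update(
        {"_id": doc_id, "locked_by": None},   # compare: currently unlocked (or field absent)
        {"$set": {"locked_by": worker_id,     # set: mark it as ours
                  "locked_at": datetime.now(timezone.utc)}},
    )  # returns the document if we won the race, None if someone else holds the lock


def unlock(doc_id, worker_id):
    docs.update_one({"_id": doc_id, "locked_by": worker_id},
                    {"$set": {"locked_by": None}})
```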
From what I can tell there are two main ways the methods differ:
If you want a copy of the document as of when your update was made: only findAndModify allows this, returning either the original (default) or the new record after the update, as mentioned; with update you only get a WriteResult, not the document, and of course reading the document immediately before or after doesn't guard you against another process changing the record in between your read and your update.
If there are potentially multiple matching documents: findAndModify only changes one, and allows you to customize the sort to indicate which one should be changed; update can change all of them with multi (although it defaults to just one), but does not let you say which one.
Thus it makes sense what HungryCoder says: update is more efficient where you can live with its restrictions (e.g. you don't need the document back, or you are changing multiple records). But for many atomic updates you do want the document, and findAndModify is necessary there.
We used findAndModify() for counter operations (increment or decrement) and other single-field mutation cases. Migrating our application from Couchbase to MongoDB, I found this API replaced code that did GetAndLock(), modified the content locally, called Replace() to save, and then Get() again to fetch the updated document back. With MongoDB, I just used this single API, which returns the updated document.