Summary
I have an issue where the database writes from my task queue (approximately 60 tasks, at 10/s) are somehow being overwritten/discarded during a concurrent database read of the same data. I will explain how it works. Each task in the task queue assigns a unique ID to a specific datastore entity of a model.
If I run an indexed datastore query on the model and loop through the entities while the task queue is in progress, I would expect that some of the entities will have been operated on by the task queue (i.e. assigned an ID) while others are yet to be affected. Unfortunately, what seems to be happening is that during the loop through the query results, entities that were already operated on (i.e. successfully assigned an ID) show up as if they were never operated on, even though, according to my logs, they were.
Why is this happening? I need to be able to read the status of my data without affecting the task queue's write operations in the background. I thought it might be a caching issue, so I tried enforcing use_cache=False and use_memcache=False on the query, but that did not solve the issue. Any help would be appreciated.
Other interesting notes:
If I allow the task queue to complete fully before doing a datastore query, and then do a datastore query, it acts as expected and nothing is overwritten/discarded.
This is typically an indication that the write operations on the entities are not performed in transactions. Transactions can detect such concurrent write (and read!) operations and retry them, ensuring that the data remains consistent.
You also need to be aware that queries (unless they are ancestor queries) are eventually consistent, meaning their results lag a bit behind the actual datastore information: it takes some time from the moment the datastore information is updated until the corresponding indexes that the queries use are updated accordingly. So when processing entities from query results you should also transactionally verify their content. Personally, I prefer to make keys_only queries and then obtain the entities via key lookups, which are always consistent (also in transactions if I intend to update the entities and, on reads, if needed).
For example if you query for entities which don't have a unique ID you may get entities which were in fact recently operated on and have an ID. So you should (transactionally) check if the entity actually has an ID and skip its update.
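The check-then-skip pattern can be sketched with sqlite3 standing in for the datastore (on App Engine the same shape would live inside an ndb/db transaction function; the table and column names here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE entity (key INTEGER PRIMARY KEY, unique_id TEXT)")
con.execute("INSERT INTO entity (key, unique_id) VALUES (1, NULL)")

def assign_id_if_unset(con, key, new_id):
    """Transactionally re-read the entity and skip it if an ID was
    already assigned by a concurrent task."""
    con.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        (current,) = con.execute(
            "SELECT unique_id FROM entity WHERE key = ?", (key,)).fetchone()
        if current is not None:
            con.execute("ROLLBACK")
            return False            # already operated on: skip the update
        con.execute("UPDATE entity SET unique_id = ? WHERE key = ?",
                    (new_id, key))
        con.execute("COMMIT")
        return True
    except Exception:
        con.execute("ROLLBACK")
        raise

assert assign_id_if_unset(con, 1, "id-123")      # first writer wins
assert not assign_id_if_unset(con, 1, "id-456")  # second writer skips
```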
Also make sure you're not updating entities obtained from projection queries: results obtained from such queries may not represent the entire entities, and writing them back will wipe out the properties not included in the projection.
Related
Stages of MongoDB aggregation pipeline are always executed sequentially. Can the documents that the pipeline processes be changed between the stages? E.g. if stage1 matches some docs from collection1 and stage2 matches some docs from collection2 can some documents from collection2 be written to during or just after stage1 (i.e. before stage2)? If so, can such behavior be prevented?
Why this is important: say stage2 is a $lookup stage. Lookup is the NoSQL equivalent of a SQL join. In a typical SQL database, a join query is isolated from writes, meaning that while the join is being resolved, the data affected by it cannot change. I would like to know if I have the same guarantee in MongoDB. Please note that I come from the NoSQL world (just not MongoDB) and understand the paradigm well. No need to suggest e.g. duplicating the data; if there were such a solution, I would not be asking on SO.
Based on my research, a MongoDB read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved. However, the MongoDB documentation does not say anything about aggregation pipeline locks. Does an aggregation pipeline hold read (shared) locks on all the collections it reads, or just on the collection used by the current pipeline stage?
More context: I need to run a "query" with multiple "joins" through several collections. The query is generated dynamically; I do not know upfront which collections will be "joined". The aggregation pipeline is the supposed way to do that. However, to get consistent "query" data, I need to ensure that no writes are interleaved between the stages of the pipeline.
E.g. a delete between $match and $lookup stage could remove one of the joined ("lookuped") documents making the entire result incorrect/inconsistent. Can this happen? How to prevent it?
#user20042973 already provided a link to https://www.mongodb.com/docs/manual/reference/read-concern-snapshot/#mongodb-readconcern-readconcern.-snapshot- in the very first comment, but considering the follow-up comments and questions from the OP regarding transactions, it seems a full answer is required for clarity.
So first of all, transactions are all about writes, not reads. I can't stress this enough, so please read it again: transactions, or what MongoDB introduced as "multi-document transactions", are there to ensure that multiple updates commit as a single atomic operation. No changes made within a transaction are visible outside of it until it is committed, and all of the changes become visible at once when it is committed. The docs: https://www.mongodb.com/docs/manual/core/transactions/#transactions-and-atomicity
The OP is concerned that any concurrent writes to the database can affect results of his aggregation operation, especially for $lookup operations that query other collections for each matching document from the main collection.
It's a very reasonable consideration, as MongoDB has always been eventually consistent and did not provide guarantees that such lookups will return the same results if the linked collection was changed during aggregation. Generally speaking, it does not even guarantee that a unique key is unique within a cursor that uses its index: if a document is deleted and a new one with the same unique key is inserted, there is a non-zero chance of retrieving both.
The instrument to work around this limitation is called "read concern", not "transaction". There are a number of read concerns available to balance speed against reliability/consistency: https://www.mongodb.com/docs/v6.0/reference/read-concern/ The OP is after the most expensive one, "snapshot", as https://www.mongodb.com/docs/v6.0/reference/read-concern-snapshot/ puts it:
A snapshot is a complete copy of the data in a mongod instance at a specific point in time.
mongod in this context means "the whole thing": all databases, the collections within those databases, and the documents within those collections.
All operations within a query with "snapshot" concern are executed against the same version of data as it was when the node accepted the query.
Transactions use this snapshot read isolation under the hood and can be used to guarantee consistent results for $lookup queries even if there are no writes within the transaction. I'd recommend using the read concern explicitly instead: it has less overhead and, more importantly, it clearly shows the intent to the devs who are going to maintain your app.
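This is not MongoDB code, but a toy pure-Python model of what "snapshot" read concern buys you: a reader pinned to a snapshot keeps seeing the data as of the moment it started, no matter what is written concurrently:

```python
class VersionedStore:
    """Toy multi-version store: every write produces a new immutable version,
    so a pinned reader is never affected by later writes."""

    def __init__(self, data=None):
        self.versions = [dict(data or {})]

    def write(self, key, value):
        new = dict(self.versions[-1])   # copy-on-write: old versions survive
        new[key] = value
        self.versions.append(new)

    def snapshot(self):
        """Pin a reader to the latest committed version."""
        return self.versions[-1]

store = VersionedStore({"a": 1, "b": 2})
snap = store.snapshot()    # the "aggregation" starts here
store.write("a", 99)       # concurrent write while the "pipeline" runs
store.write("c", 3)

assert snap["a"] == 1              # the snapshot reader is unaffected
assert "c" not in snap
assert store.snapshot()["a"] == 99 # a new reader sees the new data
```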
Now, regarding this part of the question:
Based on my research, MongoDb read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved.
It would be nice to have a source for this claim. As of today (v5.0+), aggregation is lock-free, i.e. it is not blocked even if another operation holds an exclusive X lock on the collection: https://www.mongodb.com/docs/manual/faq/concurrency/#what-are-lock-free-read-operations-
When it cannot use a lock-free read, it takes an intent shared (IS) lock on the collection. This lock prevents only collection-level write locks, like these: https://www.mongodb.com/docs/manual/faq/concurrency/#which-administrative-commands-lock-a-collection-
An IS lock on a collection still allows X locks on documents within the collection: inserting, updating or deleting a document requires only an intent exclusive (IX) lock on the collection, plus an exclusive X lock on the single document affected by the write.
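These are the standard multi-granularity lock compatibility rules the answer relies on, written out as a table (IS/IX are intent locks at the collection level, S/X the actual shared/exclusive locks):

```python
# Multi-granularity lock compatibility matrix (True = the two locks coexist).
COMPAT = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

# A reader's IS lock on the collection coexists with a writer's IX lock,
# so single-document writes (IX on collection + X on one document) proceed:
assert COMPAT[("IS", "IX")]
# ...but a collection-level X lock (e.g. some admin commands) blocks the read:
assert not COMPAT[("IS", "X")]
```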
The final note: if such read isolation is critical to the business and you must guarantee strict consistency, I'd advise considering SQL databases. They might be more performant than snapshot queries. There are many more factors to consider, so I'll leave that to you. The point is, Mongo shines where eventual consistency is acceptable. It does pretty well with causal consistency within a server session, which gives enough guarantees for a much wider range of use cases. I encourage you to test how well it does with snapshot queries, especially if you are running multiple lookups, which can be slow enough on their own on larger datasets and might not even work without allowing disk use.
Q: Can MongoDB documents processed by an aggregation pipeline be affected by external write during pipeline execution?
A: It depends on how the transactions are isolated from each other.
Snapshot isolation refers to transactions seeing a consistent view of data: transactions can read data from a “snapshot” of data committed at the time the transaction starts. Any conflicting updates will cause the transaction to abort.
MongoDB transactions support a transaction-level read concern and transaction-level write concern. Clients can set an appropriate level of read & write concern, with the most rigorous being snapshot read concern combined with majority write concern.
To achieve this, set readConcern=snapshot and writeConcern=majority on the connection string/session/transaction (not on the database/collection/operation, as concern settings at those levels are ignored inside a transaction).
Q: Do transactions apply to all aggregation pipeline stages as well?
A: Not all operations are allowed in a transaction.
For example, according to the MongoDB docs, db.collection.aggregate() is allowed in a transaction, but some stages (e.g. $merge) are excluded.
For the full list of operations supported inside transactions, refer to the MongoDB docs.
Yes, MongoDB documents processed by an aggregation pipeline can be affected by external writes during pipeline execution. This is because the MongoDB aggregation pipeline operates on the data at the time it is processed, and it does not take into account any changes made to the data after the pipeline has started executing.
For example, if a document has already been processed by the pipeline when an external write modifies or deletes it, the pipeline will not reflect those changes in its results. In some cases, this may result in incorrect or incomplete data being returned by the pipeline.
To avoid this situation, you can use MongoDB's "snapshot" read concern, which guarantees that the documents read by the pipeline are a snapshot of the data as it existed at the start of the pipeline's execution, regardless of any external writes that occur during it. However, this option can affect the performance of the pipeline.
Alternatively, in MongoDB 4.0 and later you can use a transaction, which gives you atomicity and consistency for the write operations on the documents during the pipeline execution.
In my application I run a cron job that loops over all users (2,500 users) to choose an item for every user out of 4k items, considering that:
- choosing the item is based on some user info,
- I need to make sure that each user takes a unique item that wasn't taken by anyone else, so the relation is one-to-one
To achieve this I have to run this cron job and loop over the users one by one, sequentially: pick the item for each, remove it from the list (so it's not chosen by the next user(s)), then move on to the next user.
Actually, in my system the number of users/items is getting bigger every single day; this cron job now takes 2 hours to assign items to all users.
I need to improve this. One of the things I've thought about is using threads, but I can't do that since I'm using automatic scaling, so I started thinking about push queues. When the cron job runs, it will loop like this:
for (User user : users) {
    getMyItem(user.getId());
}
where getMyItem will push a task to a servlet that handles it and chooses the best item for this person based on his data.
Let's say I start doing that; what would be the best/most robust solution to avoid assigning an item to more than one user?
Since I'm using basic scaling with 8 instances, I can't rely on static variables.
One of the things that came to mind is to create a table in the DB that accepts only unique items and insert the taken items into it; if the insertion succeeds, it means nobody else took the item, so I can assign it to that person. But this lowers performance a bit, because I need a DB write with every call (which I want to avoid).
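For what it's worth, that unique-table idea is how many systems implement claim semantics, and the extra write is the price of the guarantee. A minimal sketch with sqlite3 standing in for the database (any engine with a unique/primary-key constraint behaves the same way; the table name is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE taken (item_id TEXT PRIMARY KEY, user_id TEXT)")

def try_claim(con, item_id, user_id):
    """Atomically claim an item: the unique constraint guarantees that
    exactly one user's insert succeeds, no matter how many race."""
    try:
        with con:   # commit on success, roll back on error
            con.execute("INSERT INTO taken (item_id, user_id) VALUES (?, ?)",
                        (item_id, user_id))
        return True
    except sqlite3.IntegrityError:
        return False   # someone else claimed it first

assert try_claim(con, "item-1", "alice")
assert not try_claim(con, "item-1", "bob")   # second claim is rejected
```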
I also thought about Memcache. It's really fast but not robust enough: if I save a Set of items into it (which accepts only unique items) and more than one thread tries to access and update this Set at the same time, only one thread will be able to save its data, and the other threads' data might be overwritten and lost.
I hope you guys can help to find a solution for this problem, thanks in advance :)
First, I would advise against using memcache alone for such an algorithm: the key thing to remember about memcache is that it is volatile and might disappear at any time, breaking the algorithm.
From Service levels:
Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the cache's LRU policy. Changes in the cache configuration or datacenter maintenance events can also flush some or all of the cache.
And from How cached data expires:
Under rare circumstances, values can also disappear from the cache prior to expiration for reasons other than memory pressure. While memcache is resilient to server failures, memcache values are not saved to disk, so a service failure can cause values to become unavailable.
I'd suggest adding a property, let's say called assigned, to the item entities, by default unset (or set to null/None) and, when it's assigned to a user, set to the user's key or key ID. This allows you:
- to query for unassigned items when you want to make assignments
- to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
- to be certain that an item can uniquely be assigned to only a single user
- to easily find items assigned to a certain user if/when you're doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes
Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.
As for the growing execution time of the cron jobs: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch processing task, which would then enqueue an additional such task if there are remaining batches for processing.
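The batch-and-continue scheme can be sketched in plain Python, with an integer offset standing in for the query cursor and a direct function call standing in for enqueueing the continuation task:

```python
BATCH_SIZE = 10

def process_batch(items, cursor, handle):
    """Process one fixed-size batch, then 'enqueue' a continuation task
    if there are remaining batches."""
    batch = items[cursor:cursor + BATCH_SIZE]
    for item in batch:
        handle(item)
    next_cursor = cursor + len(batch)
    if next_cursor < len(items):
        # On App Engine this would be taskqueue.add(...) with the query
        # cursor serialized into the task payload; here we just recurse.
        process_batch(items, next_cursor, handle)

# The cron job only kicks off the first batch:
done = []
process_batch(list(range(25)), 0, done.append)
assert len(done) == 25   # three "tasks" of 10 + 10 + 5 covered everything
```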
To get a general idea of how such a solution works, take a peek at Google appengine: Task queue performance (it's Python, but the general idea is the same).
If you are planning to push jobs inside a cron and you want the jobs to update key-value pairs to improve speed and performance, you can split the users and items into multiple key-(list of values) pairs, so that each push job picks a key at random (write logic to pick one key out of 4 or 5), removes an item from that key's list, and updates the key again; use locking around that part. Example key-value pairs:
Userlist1: ["vijay",...]
Userlist2: ["ramana",...]
In one place of code I do something like this:
FormModel(.. some data here..).put()
And a couple lines below I select from the database:
FormModel.all().filter(..).fetch(100)
The problem I noticed - sometimes the fetch doesn't notice the data I just added.
My theory is that this happens because I'm using high replication storage, and I don't give it enough time to replicate the data. But how can I avoid this problem?
Unless the data is in the same entity group, there is no way to guarantee that the data will be the most up to date (if I understand this section correctly).
Shay is right: there's no way to know when the datastore will be ready to return the data you just entered.
However, you are guaranteed that the data will be entered eventually, once the call to put completes successfully. That's a lot of information, and you can use it to work around this problem. When you get the data back from fetch, just append/insert the new entities that you know will be in there eventually! In most cases it will be good enough to do this on a per-request basis, I think, but you could do something more powerful that uses memcache to cover all requests (except cases where memcache fails).
The hard part, of course, is figuring out when you should append/insert which entities. It's obnoxious to have to do this workaround, but a relatively low price to pay for something as astonishingly complex as the HRD.
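One way to sketch that workaround: keep the entities you just wrote in a per-request overlay and merge them into the (possibly stale) query results, letting the fresh versions win (the dict-shaped entities and the merge_fresh_writes helper are made up for illustration):

```python
def merge_fresh_writes(query_results, fresh_entities, key=lambda e: e["id"]):
    """Overlay entities we know were just written onto eventually
    consistent query results: fresh versions replace stale ones, and
    entities the index hasn't caught up with yet are appended."""
    merged = {key(e): e for e in query_results}
    for e in fresh_entities:
        merged[key(e)] = e
    return list(merged.values())

stale = [{"id": 1, "name": "old"}]                  # index hasn't caught up
fresh = [{"id": 1, "name": "new"},                  # we just updated this
         {"id": 2, "name": "just added"}]           # and just created this
result = merge_fresh_writes(stale, fresh)
assert {e["name"] for e in result} == {"new", "just added"}
```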
From https://developers.google.com/appengine/docs/java/datastore/transactions#Java_Isolation_and_consistency
This consistent snapshot view also extends to reads after writes
inside transactions. Unlike with most databases, queries and gets
inside a Datastore transaction do not see the results of previous
writes inside that transaction. Specifically, if an entity is modified
or deleted within a transaction, a query or get returns the original
version of the entity as of the beginning of the transaction, or
nothing if the entity did not exist then.
So I am building a Java webapp with Spring and Hibernate. In the application, users can add points to an object, and I'd like to count the points given in order to rank my objects. The objects are also stored in the database. Hopefully hundreds of people will give points to the objects at the same time.
But how do I count the points and save them in the database at the same time? Usually I would just have a property on my object and increase the points. But that would mean I have to lock the data in the database with a pessimistic transaction to prevent concurrency issues (reading the number of points while another thread is halfway through changing it). That would possibly make my app much slower (at least I imagine it would).
The other solution would be to store the amount of given points in an associated object and store them separately in the database while counting the points in memory within a "small" synchronized block or something.
Which solution has the least performance impact when handling many concurrent operations on the same objects. Or are there any other fitting solutions?
If you would like the values to be persisted, then you should persist them in your database.
Given that, try the following:
- create a very narrow row, like just OBJ_ID and POINTS
- create an index only on OBJ_ID, so not a lot of time is spent updating indexes when values are inserted, updated or deleted
- use InnoDB, which has row-level locking, so the locks will be smaller
- MySQL will give you the last committed (consistent) value
That's all pretty simple. Give it a whirl! Set up a test case that mimics your expected load and see how it performs. Post back if you get stuck.
Good luck.
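As a sketch of why the narrow row helps: with a single-statement increment, the database serializes concurrent updates under its row lock and the application never does a read-modify-write (sqlite3 here just to make the example runnable; the same UPDATE works against InnoDB):

```python
import sqlite3
import threading

con = sqlite3.connect(":memory:", check_same_thread=False)
con.execute("CREATE TABLE points (obj_id INTEGER PRIMARY KEY, points INTEGER)")
con.execute("INSERT INTO points VALUES (1, 0)")
con.commit()

# sqlite3 needs app-side serialization for one shared connection;
# a real MySQL/InnoDB setup would rely on its row lock instead.
lock = threading.Lock()

def give_point(obj_id):
    # Single-statement increment: no read-modify-write race in app code.
    with lock, con:
        con.execute("UPDATE points SET points = points + 1 WHERE obj_id = ?",
                    (obj_id,))

threads = [threading.Thread(target=give_point, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

(total,) = con.execute("SELECT points FROM points WHERE obj_id = 1").fetchone()
assert total == 100   # no increments were lost
```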
I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of say 10, then transactionally update the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? I.e., is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
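The scattering effect is easy to see in isolation: hash the incrementing sequence number and lexically adjacent work items stop being lexically adjacent (md5 is used here purely for illustration; the talk's actual hashing scheme may differ):

```python
import hashlib

def work_index(seq):
    """Scatter a monotonically increasing sequence number across the
    lexical keyspace by hashing it."""
    return hashlib.md5(str(seq).encode()).hexdigest()

# Raw sequence numbers sort in insertion order: every write lands at the
# head of the index, i.e. on the same Bigtable tablet server.
sequential = [str(n).zfill(8) for n in range(5)]
assert sequential == sorted(sequential)

# Hashed work indexes are spread across the keyspace, so consecutive
# enqueues hit different tablet servers (given enough data).
scattered = [work_index(n) for n in range(5)]
assert scattered != sorted(scattered)
```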
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!