Datastore Entity read (keys_only) inside ndb transactions - google-app-engine

I have a question regarding the datastore entity reads inside a ndb transaction.
I know that when we read an entity inside an ndb transaction, that specific entity gets locked and no other thread can put/update/write the same entity because it will result in a contention error.
That totally makes sense.
However, what happens when we read only the key of an entity instead of the whole entity itself inside the transaction? This can be done by passing the keys_only flag as True in ndb.query().fetch().
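For concreteness, here is a minimal sketch of the kind of read I mean (using the legacy google.appengine.ext.ndb API; the Account model and parent_key are made up for illustration):

```python
from google.appengine.ext import ndb

class Account(ndb.Model):
    balance = ndb.IntegerProperty()

@ndb.transactional()
def read_keys_only(parent_key):
    # keys_only=True makes fetch() return ndb.Key objects instead of
    # full entities; the ancestor filter keeps the query transactional.
    return Account.query(ancestor=parent_key).fetch(10, keys_only=True)
```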
In that case, will the entity again get locked?

The Datastore documentation for Transaction Locks says:
Read-write transactions use reader/writer locks to enforce isolation and serializability.
And it does not mention anything specific about the use of keys_only during transactions. So I would assume that the same applies to that situation, which does make sense if you consider that you are still making a read nevertheless; you are just ignoring the entity data.
That being said, maybe this is something that could be improved in Datastore, or at least made clearer in the documentation. If you wish, you could consider opening a Feature Request for Google to implement that by following this link.

In general, it's better to think of transactions in terms of their guarantee - serializability - rather than their implementation details - in this case, read/write locks. The implementation details (how queries are executed, locking granularity, exactly what gets locked, etc) can potentially change at any time, while the guarantee will not change.
For this specific question, and assuming the current implementation of Firestore in Datastore Mode: to ensure serializability, a keys-only query in a transaction T1 locks the ranges of index entries examined by the query. If such a query returned a key K for entity E, then an attempt to delete E in a different transaction T2 must remove all of E's index entries, including the one in the range locked by the query. So in this example T1 and T2 require the same locks and one of the two transactions will be delayed or aborted.
Note that there are other ways for T2 to conflict with T1: it could also be creating a new entity that would match T1's query (which would require writing an index entry in the range locked by T1's query).
Finally, if T2 were to update (rather than delete) E in a way that does not require any updates to index entries in the range examined by T1's query (e.g., if the query is something like 'select * from X where a = 5' and the update to E does not change the value of its 'a' property), then T1 and T2 will not conflict. This is an optimisation - behaviour would still be correct if these two transactions did conflict, and in fact for "Datastore Native" databases they can conflict.
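To make this concrete, a hedged sketch of the two transactions using the legacy ndb API (reusing the made-up Account model from the sketch in the question; balance plays the role of the 'a' property, and the keys passed in are placeholders):

```python
@ndb.transactional()
def t1_keys_only(parent_key):
    # The keys-only query locks the range of index entries it examines.
    return Account.query(Account.balance == 5,
                         ancestor=parent_key).fetch(keys_only=True)

@ndb.transactional()
def t2_delete(entity_key):
    # Deleting the entity must remove all of its index entries, including
    # the one in the range locked by t1's query, so if these two run
    # concurrently one of them may be delayed or retried.
    entity_key.delete()
```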

Related

Can MongoDB documents processed by an aggregation pipeline be affected by external write during pipeline execution?

Stages of a MongoDB aggregation pipeline are always executed sequentially. Can the documents that the pipeline processes be changed between the stages? E.g. if stage1 matches some docs from collection1 and stage2 matches some docs from collection2, can some documents from collection2 be written to during or just after stage1 (i.e. before stage2)? If so, can such behavior be prevented?
Why this is important: Say stage2 is a $lookup stage. Lookup is the NoSQL equivalent of a SQL join. In a typical SQL database, a join query is isolated from writes, meaning that while the join is being resolved, the data affected by the join cannot change. I would like to know if I have the same guarantee in MongoDB. Please note that I am coming from the NoSQL world (just not MongoDB) and I understand the paradigm well. No need to suggest e.g. duplicating the data; if there was such a solution, I would not be asking on SO.
Based on my research, a MongoDB read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved. However, the MongoDB documentation does not say anything about aggregation pipeline locks. Does the aggregation pipeline hold read (shared) locks on all the collections it reads? Or just on the collection used by the current pipeline stage?
More context: I need to run a "query" with multiple "joins" through several collections. The query is generated dynamically, I do not know upfront what collections will be "joined". Aggregation pipeline is the supposed way to do that. However, to get consistent "query" data, I need to ensure that no writes are interleaved between the stages of the pipeline.
E.g. a delete between $match and $lookup stage could remove one of the joined ("lookuped") documents making the entire result incorrect/inconsistent. Can this happen? How to prevent it?
#user20042973 already provided a link to https://www.mongodb.com/docs/manual/reference/read-concern-snapshot/#mongodb-readconcern-readconcern.-snapshot- in the very first comment, but considering followup comments and questions from OP regarding transactions, it seems it requires full answer for clarity.
So first of all, transactions are all about writes, not reads. I can't stress it enough, so please read it again: transactions, or what MongoDB introduced as "multi-document transactions", are there to ensure that multiple updates are committed as a single atomic operation. No changes made within a transaction are visible outside of the transaction until it is committed, and all of the changes become visible at once when the transaction is committed. The docs: https://www.mongodb.com/docs/manual/core/transactions/#transactions-and-atomicity
The OP is concerned that any concurrent writes to the database can affect results of his aggregation operation, especially for $lookup operations that query other collections for each matching document from the main collection.
It's a very reasonable concern, as MongoDB has always been eventually consistent and does not guarantee that such lookups will return the same results if the linked collection is changed during the aggregation. Generally speaking, it doesn't even guarantee that a unique key is unique within a cursor that uses its index - if a document was deleted and a new one with the same unique key was then inserted, there is a non-zero chance of retrieving both.
The instrument to work around this limitation is called "read concern", not "transaction". There are a number of read concerns available to balance speed against reliability/consistency: https://www.mongodb.com/docs/v6.0/reference/read-concern/ OP is after the most expensive one - "snapshot" - as https://www.mongodb.com/docs/v6.0/reference/read-concern-snapshot/ puts it:
A snapshot is a complete copy of the data in a mongod instance at a specific point in time.
mongod in this context means "the whole thing" - all databases, collections within these databases, and documents within these collections.
All operations within a query with "snapshot" concern are executed against the same version of data as it was when the node accepted the query.
Transactions use this snapshot read isolation under the hood and can be used to guarantee consistent results for $lookup queries even if there are no writes within the transaction. I'd recommend using the read concern explicitly instead - it has less overhead and, more importantly, it clearly shows the intent to the devs who are going to maintain your app.
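For example, a hedged sketch of running an aggregation with an explicit "snapshot" read concern from PyMongo (requires MongoDB 5.0+ and a replica set; the connection string, database, collections and fields are placeholders):

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
# Bind the collection with read concern "snapshot" so every stage of the
# aggregation sees the same point-in-time view of the data.
orders = client.shop.get_collection("orders",
                                    read_concern=ReadConcern("snapshot"))

pipeline = [
    {"$match": {"status": "paid"}},
    {"$lookup": {"from": "customers", "localField": "customerId",
                 "foreignField": "_id", "as": "customer"}},
]
results = list(orders.aggregate(pipeline))
```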
Now, regarding this part of the question:
Based on my research, MongoDb read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved.
It would be nice to have a source for this claim. As of today (v5.0+), aggregation is lock-free, i.e. it is not blocked even if another operation holds an exclusive X lock on the collection: https://www.mongodb.com/docs/manual/faq/concurrency/#what-are-lock-free-read-operations-
When it cannot use a lock-free read, it takes an intent shared (IS) lock on the collection. This lock prevents only collection-level write locks, like these ones: https://www.mongodb.com/docs/manual/faq/concurrency/#which-administrative-commands-lock-a-collection-
An IS lock on a collection still allows X locks on documents within the collection - an insert, update or delete of a document requires only an intent exclusive (IX) lock on the collection, plus an exclusive X lock on the single document affected by the write operation.
The final note - if such read isolation is critical to the business and you must guarantee strict consistency, I'd advise considering SQL databases. It might be more performant than snapshot queries. There are many more factors to consider, so I'll leave it to you. The point is that Mongo shines where eventual consistency is acceptable. It does pretty well with causal consistency within a server session, which gives enough guarantees for a much wider range of use cases. I encourage you to test how well it does with snapshot queries, especially if you are running multiple lookups, which can on their own be slow enough on larger datasets and might not even work without allowing disk use.
Q: Can MongoDB documents processed by an aggregation pipeline be affected by external write during pipeline execution?
A: It depends on how the transactions are isolated from each other.
Snapshot isolation refers to transactions seeing a consistent view of data: transactions can read data from a “snapshot” of data committed at the time the transaction starts. Any conflicting updates will cause the transaction to abort.
MongoDB transactions support a transaction-level read concern and transaction-level write concern. Clients can set an appropriate level of read & write concern, with the most rigorous being snapshot read concern combined with majority write concern.
To achieve this, set readConcern=snapshot and writeConcern=majority on the connection string/session/transaction (but not on the database/collection/operation, since concern settings at those levels are ignored inside a transaction).
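For instance, a hedged PyMongo sketch of setting these at the transaction level (connection string, database and collection names are placeholders):

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.shop

with client.start_session() as session:
    # Concern settings go on the transaction itself, not on the collection
    # or the individual operations (those are ignored inside a transaction).
    with session.start_transaction(read_concern=ReadConcern("snapshot"),
                                   write_concern=WriteConcern("majority")):
        docs = list(db.orders.aggregate(
            [{"$match": {"status": "paid"}}], session=session))
```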
Q: Do transactions apply to all aggregation pipeline stages as well?
A: Not all operations are allowed in a transaction.
For example, according to the MongoDB docs, db.collection.aggregate() is allowed in a transaction, but some stages (e.g. $merge) are excluded.
For the full list of operations supported inside a transaction, refer to the MongoDB docs.
Yes, MongoDB documents processed by an aggregation pipeline can be affected by external writes during pipeline execution. This is because the MongoDB aggregation pipeline operates on the data at the time it is processed, and it does not take into account any changes made to the data after the pipeline has started executing.
For example, if a document is being processed by the pipeline and an external write operation modifies or deletes the same document, the pipeline will not reflect those changes in its results. In some cases, this may result in incorrect or incomplete data being returned by the pipeline.
To avoid this situation, you can use MongoDB's snapshot option, which guarantees that the documents returned by the pipeline are a snapshot of the data as it existed at the start of the pipeline execution, regardless of any external writes that occur during the execution. However, this option can affect the performance of the pipeline.
Alternatively, in MongoDB 4.0 and later you can use a transaction, which gives you atomicity and consistency of the write operations on the documents during the pipeline execution.

Sql Server transactions - usage recommendations

I saw this sentence not only in one place:
"A transaction should be kept as short as possible to avoid concurrency issues and to enable maximum number of positive commits."
What does this really mean?
It puzzles me now because I want to use transactions for my app which in normal use will deal with inserting of hundreds of rows from many clients, concurrently.
For example, I have a service which exposes a method: AddObjects(List<Objects>), and of course these objects contain other, nested objects.
I was thinking of starting a transaction for each call from the client, performing the appropriate actions (a bunch of inserts/updates/deletes for each object with its nested objects). EDIT1: I meant a transaction for the entire "AddObjects" call in order to prevent undefined states/behaviour.
Am I going in the wrong direction? If yes, how would you do that and what are your recommendations?
EDIT2: Also, I understood that transactions are fast for bulk operations, but that somehow contradicts the quoted sentence. What is the conclusion?
Thanks in advance!
A transaction has to cover a business-specific unit of work. It has nothing to do with generic 'objects'; it must always be expressed in domain-specific terms: 'debit of account X and credit of account Y must be in a transaction', 'subtraction from inventory and the sale must be in a transaction', etc. Everything that must either succeed together or fail together must be in a transaction. If you are going down an abstract path of 'adding objects to a list is a transaction' then yes, you are on the wrong path. The fact that all inserts/updates/deletes triggered by an object save are in a transaction is not a purpose, but a side effect. The correct semantics should be 'update of object X and update of object Y must be in a transaction'. Even the degenerate case of a single 'object' being updated should still be regarded in domain-specific terms.
That recommendation is best understood as Do not allow user interaction in a transaction. If you need to ask the user during a transaction, roll back, ask and run again.
Other than that, do use transaction whenever you need to ensure atomicity.
It is not the transactions' fault that they may cause "concurrency issues"; rather, it is a sign that the database might need some more thought, a better set of indices, or a more standardized data access order.
"A transaction should be kept as short as possible to avoid concurrency issues and to enable maximum number of positive commits."
The longer a transaction is kept open the more likely it will lock resources that are needed by other transactions. This blocking will cause other concurrent transactions to wait for the resources (or fail depending on the design).
SQL Server is usually set up in autocommit mode. This means that every SQL statement is a distinct transaction. Many times you want to use a multi-statement transaction so you can commit or roll back multiple updates together. The longer the updates take, the more likely other transactions will conflict.
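For example, a hedged sketch of wrapping the whole AddObjects call in one multi-statement transaction from Python with pyodbc (the connection string, table and column names are invented placeholders):

```python
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;",
                      autocommit=False)

def add_objects(objects):
    """Insert a batch of objects and their children as one unit of work."""
    cur = conn.cursor()
    try:
        for obj in objects:
            cur.execute("INSERT INTO Objects (Id, Name) VALUES (?, ?)",
                        obj["id"], obj["name"])
            for child in obj["children"]:
                cur.execute("INSERT INTO ChildObjects (ObjectId, Value) "
                            "VALUES (?, ?)", obj["id"], child)
        conn.commit()    # the whole AddObjects call becomes visible at once
    except Exception:
        conn.rollback()  # nothing from this call is persisted
        raise
```

The longer the loop runs, the longer the locks taken by the inserts are held, which is exactly why the advice is to keep such a transaction as short as the unit of work allows.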

Can a database support "Atomicity" but not "Consistency" or vice-versa?

I am reading about ACID properties of a database. Atomicity and Consistency seem to be very closely related. I am wondering if there are any scenarios where we need to just support Atomicity but not Consistency or vice-versa. An example would really help!
They are somewhat related but there's a subtle difference.
Atomicity means that your transaction either happens or doesn't happen.
Consistency means that things like referential integrity are enforced.
Let's say you start a transaction to add two rows (a credit and a debit which form a single bank transaction). The atomicity of this has nothing to do with the consistency of the database. All it means is that either both rows or neither row will be added.
On the consistency front, let's say you have a foreign key constraint from orders to products. If you try to add an order that refers to a non-existent product, that's when consistency kicks in to prevent you from doing it.
Both are about maintaining the database in a workable state, hence their similarity. The former example will ensure the bank doesn't lose money (or steal it from you), the latter will ensure your application doesn't get surprised by orders for products you know nothing about.
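A toy illustration of the atomicity half, using Python's built-in sqlite3 (the ledger table is made up): either both rows are written or neither is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (account TEXT, amount INTEGER)")

try:
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("INSERT INTO ledger VALUES ('X', -100)")  # debit
        conn.execute("INSERT INTO ledger VALUES ('Y',  100)")  # credit
except sqlite3.Error:
    pass  # on failure, neither the debit nor the credit is persisted
```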
Atomicity:
In an atomic transaction, a series of database operations either all occur, or nothing occurs. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright.
Consistency:
In database systems, a consistent transaction is one that does not violate any integrity constraints during its execution. If a transaction leaves the database in an illegal state, it is aborted and an error is reported.
A database that supports atomicity but not consistency would allow transactions that leave the database in an inconsistent state (that is, violate referential or other integrity checks), provided the transaction completes successfully. For instance, you could add a string to an int column provided that the transaction performing this completed successfully.
Conversely, a database that supports consistency but not atomicity would allow partial transactions to complete, so long as the effects of that transaction didn't break any integrity checks (e.g. foreign keys must match an existing identity).
For instance, you could try adding a new row that included string and int values, and even if the insertion failed halfway through, losing half the data, the row would be allowed provided that none of the lost data was for required columns and no data was inserted into an incorrectly typed column.
Having said that, consistency relies on atomicity for the reversal of inconsistent transactions.
There is indeed a strong relation between Atomicity and Consistency, but they are not the same:
A DBMS can (theoretically) support Consistency and not Atomicity: for example, consider a transaction that consists of SQL operations O1, O2, and O3. Now, assume that after O1 and O2 the DB is already in a consistent state. Then the DBMS can stop the transaction after O1 and O2, without O3, and still preserve consistency. Clearly, such a DBMS does not support atomicity (as O3 was not executed while O1 and O2 were).
A DBMS can (theoretically) support Atomicity and not Consistency: this can occur in a multi-user scenario, where atomicity only ensures that all actions of a transaction will be performed (or none of them), but it does not guarantee that the actions of one transaction done concurrently with another transaction may not end up in an inconsistent state.
However, what I do believe (but have not proven formally) is that if your DBMS guarantees both Atomicity and Isolation, then it must also guarantee Consistency.
I was also getting confused when reading about atomicity & consistency. Let's say there is a scenario to do a batch insert of 1000 records into the account table.
Atomicity of the batch means that either all 1000 records are inserted, or none of them are if there is an error.
Consistency of the batch would be violated if, at the account record level, an insert were allowed to succeed even though a data type didn't match, or if the related record that was inserted into the foreign key table was later deleted after the successful account record update.
Hopefully this example clears the confusion.
I have a different understanding of consistency in the ACID context:
Within a transaction, if a given item of data is retrieved and retrieved again later in the same transaction, no changes are seen. That is, the transaction is given a consistent state of the database throughout the transaction. The only updates that can change data visible to the transaction are updates done by the transaction itself.
In my mind, this is tantamount to serializability.

When I update/insert a single row should it lock the entire table?

I have two long-running queries that are both in transactions and access the same table, but completely separate rows in that table. These queries also perform some updates and inserts based on those queries.
It appears that when these run concurrently they encounter a lock of some kind, which prevents the task from finishing; it locks up when it goes to update one of the rows. I'm using an exclusive row lock on the rows being read, and the lock that shows up on the process is an LCK_M_IX lock.
Two questions:
When I update/insert a single row does it lock the entire table?
What can be done to work around this sort of issue?
Typically no, but it depends (most often used answer for SQL Server!)
SQL Server will have to lock the data involved in a transaction in some way. It has to lock the data in the table itself, and the data in any affected indexes, while you perform a modification. In order to improve concurrency, there are several "granularities" of locking that the server might decide to use, in order to allow multiple processes to run: row locks, page locks, and table locks are common (there are more). Which scale of locking is in play depends on how the server decides to execute a given update. Complicating things, there are also classifications of locks like shared, exclusive, and intent exclusive, which control whether the locked object can be read and/or modified.
It's been my experience that SQL Server mainly uses page locks for changes to small portions of tables, and past some threshold will automatically escalate to a table lock, if a larger portion of a table seems (from stats) to be affected by an update or delete. The idea is that it is faster to lock a table (one lock) than obtaining and managing thousands of individual row or page locks for a big update.
To see what is happening in your specific case, you'd need to look at the query logic and, while your stuff is running, examine the locking/blocking conditions in sys.dm_tran_locks, sys.dm_os_waiting_tasks or other DMV's. You would want to discover what exactly is getting locked by what step in each of your processes, to discover why one is blocking the other.
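A hedged sketch of that kind of inspection from Python with pyodbc (connection string is a placeholder), dumping the current contents of sys.dm_tran_locks while the two processes are running:

```python
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 18 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;")

# Show who holds, or is waiting for, which locks in the current database.
rows = conn.execute("""
    SELECT request_session_id, resource_type, request_mode, request_status,
           resource_associated_entity_id
    FROM sys.dm_tran_locks
    WHERE resource_database_id = DB_ID()
    ORDER BY request_session_id
""").fetchall()

for r in rows:
    print(r.request_session_id, r.resource_type,
          r.request_mode, r.request_status)
```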
The short version:
No
Fix your code.
The long version:
LCK_M_IX is an intent lock, meaning the operation will place an X lock on a subordinate element. E.g. when updating a row in a table, the operation takes an IX lock on the table before locking X the row being updated/inserted/deleted. Intent locks are a common strategy for dealing with hierarchies, like table/page/row, because the lock manager cannot understand the physical structure of the resources requested to be locked (i.e. it cannot know that an X lock on page P1 is incompatible with an S lock on row R1 because R1 is contained in P1). For more details, see Lock Modes.
The fact that you are seeing contention on intent locks means you are trying to obtain high-level object locks, like table locks. You will need to analyze your source code for the request being blocked (the one requesting the lock incompatible with LCK_M_IX) and remove the cause of the object-level lock request. What that means will depend on your source code; I cannot know what you're doing there. My guess is that you are using an erroneous lock hint.
A more general approach is to rely on SNAPSHOT ISOLATION. But this, most likely, will not solve the problem you're seeing, since snapshot isolation can only benefit row level contention issues, not applications that request table locks.
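If you do go down that road, a hedged sketch of what enabling and using snapshot isolation looks like from Python with pyodbc (the database, table and connection string are placeholders; the ALTER DATABASE has to run outside a transaction, hence autocommit):

```python
import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 18 for SQL Server};"
            "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;")

# One-time setup: allow snapshot isolation on the database.
admin = pyodbc.connect(CONN_STR, autocommit=True)
admin.execute("ALTER DATABASE mydb SET ALLOW_SNAPSHOT_ISOLATION ON")

# In the application session: readers see a row-versioned snapshot and do
# not block on concurrent row-level writers.
work = pyodbc.connect(CONN_STR)
cur = work.cursor()
cur.execute("SET TRANSACTION ISOLATION LEVEL SNAPSHOT")
cur.execute("SELECT Id, Name FROM Objects WHERE Id = ?", 1)
row = cur.fetchone()
work.commit()
```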
A frequent aim when using transactions: keep them as short and sweet as possible. I get the sense from your wording in the question that you are opening a transaction, then doing all kinds of things, some of which take a long time, and then expecting multiple users to be able to run this same code concurrently. Unfortunately, if you perform an insert at the beginning of that set of code, then do 40 other things before committing or rolling back, it is possible that that insert will block everyone else from running the same type of insert, essentially turning your operation from a free-for-all into serial.
Find out what each query is doing, and whether you are getting lock escalations that you wouldn't expect. Just because you say WITH (ROWLOCK) on a query doesn't mean SQL Server will be able to comply... if you are touching multiple indexes, indexed views, persisted computed columns, etc., then there are all kinds of reasons why your rowlock may not hold any water. You also might have things later in the transaction that are taking longer than you think, and maybe you don't realize that the locks on all of the objects involved in the transaction (not just the statement that is currently running) can be held for the duration of the transaction.
Different databases have different locking mechanisms, but ones like SQL Server and Oracle have different types of locking.
The default on SQL Server appears to be pessimistic Page locking - so if you have a small number of records then all of them may get locked.
Most databases should not lock when running a script, so I'm wondering whether you're potentially running multiple queries concurrently without transactions.

Database deadlocks

One of the classical reasons we have a database deadlock is when two transactions are inserting and updating tables in a different order.
For example, transaction A inserts in Table A then Table B.
And transaction B inserts in Table B followed by A.
Such a scenario is always at risk of a database deadlock (assuming you are not using serializable isolation level).
My questions are:
What kind of patterns do you follow in your design to make sure that all transactions insert and update in the same order?
A book I was reading had a suggestion that you can sort the statements by the name of the table. Have you done something like this, or something different, that would enforce that all inserts and updates are in the same order?
What about deleting records? Deletes need to start from child tables, while updates and inserts need to start from parent tables. How do you ensure that this would not run into a deadlock?
- All transactions insert/update in the same order.
- Deletes: identify records to be deleted outside a transaction, then attempt the deletion in the smallest possible transaction, e.g. looking up by the primary key or similar identified during the lookup stage.
- Small transactions generally.
- Indexing and other performance tuning, both to speed up transactions and to promote index lookups over table scans.
- Avoid 'hot tables', e.g. one table with incrementing counters for other tables' primary keys. Any other 'switchboard' type configuration is risky.
- Especially if not using Oracle, learn the locking behaviour of the target RDBMS in detail (optimistic/pessimistic, isolation levels, etc.). Ensure you do not allow row locks to escalate to table locks, as some RDBMSes will.
Deadlocks are no biggie. Just be prepared to retry your transactions on failure.
And keep them short. Short transactions consisting of queries that touch very few records (via the magic of indexing) are ideal to minimize deadlocks - fewer rows are locked, and for a shorter period of time.
You need to know that modern database engines don't lock tables; they lock rows; so deadlocks are a bit less likely.
You can also avoid locking by using MVCC and the CONSISTENT READ transaction isolation level: instead of locking, some threads will just see stale data.
Carefully design your database processes to eliminate as much as possible transactions that involve multiple tables. When I've had database design control there has never been a case of deadlock for which I could not design out the condition that caused it. That's not to say they don't exist and perhaps abound in situations outside my limited experience; but I've had no shortage of opportunities to improve designs causing these kinds of problems. One obvious strategy is to start with a chronological write-only table for insertion of new complete atomic transactions with no interdependencies, and apply their effects in an orderly asynchronous process.
Always use the database default isolation levels and locking settings unless you are absolutely sure what risks they incur, and have proven it by testing. Redesign your process if at all possible first. Then, impose the least increase in protection required to eliminate the risk (and test to prove it.) Don't increase restrictiveness "just in case" - this often leads to unintended consequences, sometimes of the type you intended to avoid.
To repeat the point from another direction, most of what you will read on this and other sites advocating the alteration of database settings to deal with transaction risks and locking problems is misleading and/or false, as demonstrated by how they conflict with each other so regularly. Sadly, especially for SQL Server, I have found no source of documentation that isn't hopelessly confusing and inadequate.
I have found that one of the best investments I ever made in avoiding deadlocks was to use an Object Relational Mapper that could order database updates. The exact order is not important, as long as every transaction writes in the same order (and deletes in exactly the reverse order).
The reason that this avoids most deadlocks out of the box is that your operations are always table A first, then table B, then table C (which perhaps depends on table B).
You can achieve a similar result as long as you exercise care in your stored procedures or data layer's access code. The only problem is that it requires great care to do it by hand, whereas an ORM with a Unit of Work concept can automate most cases.
UPDATE: A delete should run forward to verify that everything is the version you expect (you still need record version numbers or timestamps) and then delete backwards once everything verifies. As this should all happen in one transaction, the possibility of something changing out from under you shouldn't exist. The only reason for the ORM doing it backwards is to obey the key requirements, but if you do your check forward, you will have all the locks you need already in hand.
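A toy sketch of the same "consistent order" idea done by hand rather than by an ORM (sqlite3 is used only to make it runnable; the buffered statements are hypothetical): buffer the statements for a unit of work, then apply inserts/updates sorted by table name and deletes in the reverse order, all within one transaction.

```python
import sqlite3

def flush_unit_of_work(conn, writes, deletes):
    """writes/deletes are lists of (table_name, sql, params) tuples."""
    with conn:  # one transaction: commit on success, roll back on error
        # Inserts/updates always in ascending table-name order...
        for _table, sql, params in sorted(writes, key=lambda w: w[0]):
            conn.execute(sql, params)
        # ...and deletes in exactly the reverse order.
        for _table, sql, params in sorted(deletes, key=lambda w: w[0],
                                          reverse=True):
            conn.execute(sql, params)
```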
I analyze all database actions to determine, for each one, whether it needs to be in a multiple-statement transaction, and then for each such case, what minimum isolation level is required to prevent deadlocks... As you said, serializable will certainly do so...
Generally, only a very few database actions require a multiple statement transaction in the first place, and of those, only a few require serializable isolation to eliminate deadlocks.
For those that do, set the isolation level for that transaction before you begin, and reset it whatever your default is after it commits.
Your example would only be a problem if the database locked the ENTIRE table. If your database is doing that...run :)
