Here are two potential workflows I would like to perform in a web application.
Variation 1:
user sends request
server reads data
server modifies data
server saves modified data
Variation 2:
user sends request
server reads data
server sends data to user
user sends request with modifications
server saves modified data
In each of these cases, I am wondering: what are the standard approaches to ensuring that concurrent access to this service will produce sane results? (i.e. nobody's edit gets clobbered, values correspond to some ordering of the edits, etc.)
The situation is hypothetical, but here are some details of where I would likely need to deal with this in practice:
web application, but language unspecified
potentially, using a web framework
data store is a SQL relational database
the logic involved is too complex to express well in a single query, e.g. value = value + 1
I feel like I would prefer not to try to reinvent the wheel here. Surely these are well-known problems with well-known solutions. Please advise.
Thanks.
To the best of my knowledge, there is no general solution to the problem.
The root of the problem is that the user may retrieve data and stare at it on the screen for a long time before making an update and saving.
I know of three basic approaches:
1. When the user reads the database, lock the record, and don't release it until the user saves any updates. In practice, this is wildly impractical. What if the user brings up a screen and then goes to lunch without saving? Or goes home for the day? Or is so frustrated trying to update this stupid record that he quits and never comes back?
2. Express your updates as deltas rather than destinations. To take the classic example, suppose you have a system that records stock in inventory. Every time there is a sale, you must subtract 1 (or more) from the inventory count.
So say the present quantity on hand is 10. User A creates a sale. Current quantity = 10. User B creates a sale. He also gets current quantity = 10. User A enters that two units are sold. New quantity = 10 - 2 = 8. Save. User B enters one unit sold. New quantity = 10 (the value he loaded) - 1 = 9. Save. Clearly, something went wrong.
Solution: Instead of writing "update inventory set quantity=9 where itemid=12345", write "update inventory set quantity=quantity-1 where itemid=12345". Then let the database queue the updates. This is very different from strategy #1, as the database only has to lock the record long enough to read it, make the update, and write it. It doesn't have to wait while someone stares at the screen.
Of course, this is only usable for changes that can be expressed as a delta. If you are, say, updating the customer's phone number, it's not going to work. (Like, old number is 555-1234. User A says to change it to 555-1235. That's a change of +1. User B says to change it to 555-1243. That's a change of +9. So the total change is +10, and the customer's new number is 555-1244. :-) ) But in cases like that, "last user to hit the enter key wins" is probably the best you can do anyway.
3. On update, check that the relevant fields in the database match your "from" values. For example, say you work for a law firm negotiating contracts for your clients. You have a screen where a user can enter notes about negotiations. User A brings up a contract record. User B brings up the same contract record. User A enters that he just spoke to the other party on the phone and they are agreeable to the proposed terms. User B, who has also been trying to call the other party, enters that they are not responding to phone calls and he suspects they are stonewalling. User A clicks save. Do we want User B's comments to overwrite User A's? Probably not. Instead, we display a message indicating that the notes have changed since he read the record, allowing him to see the new value before deciding whether to proceed with the save, abort, or enter something different.
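For illustration, here is a minimal sketch of that check in plain JDBC. The contracts table, its columns, and the helper method are hypothetical; you could equally compare a version counter or a last-modified timestamp instead of the field itself. The UPDATE succeeds only if the notes still hold the value the user originally read.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Returns true if the save went through, false if the record changed underneath us.
static boolean saveNotes(Connection conn, int contractId,
                         String notesAsRead, String newNotes) throws SQLException {
    String sql = "UPDATE contracts SET notes = ? "
               + "WHERE contract_id = ? AND notes = ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, newNotes);
        ps.setInt(2, contractId);
        ps.setString(3, notesAsRead);
        return ps.executeUpdate() == 1;  // 0 rows means someone else changed it first
    }
}

If this returns false, re-read the record, show the user the new value, and let him decide whether to overwrite, abort, or merge.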
If you do not have transactions in MySQL, you can use a conditional UPDATE statement to ensure that the data is not corrupted.
UPDATE tableA SET status=2 WHERE status = 1
If status is 1, then only one process will get the result that a record was updated. In the code below, executeUpdate() returns the number of rows changed, so a return of 0 means there was no row with status = 1 left to claim; -1 is returned only if the statement failed with an exception.
String s = "UPDATE tableA SET status = 2 WHERE status = 1";
int rows = -1;
try (PreparedStatement query = connection.prepareStatement(s))
{
    // 1 if this process won the race and claimed the row; 0 if another did
    rows = query.executeUpdate();
}
catch (SQLException e)
{
    e.printStackTrace();
}
return rows;
Things are simple in the application layer - every request is served by a different thread (or process), so unless you have state in your processing classes (services), everything is safe.
Things get more complicated when you reach the database - i.e. where the state is held. There you need transactions to ensure that everything is ok.
Transactions have a set of properties, known as ACID, that "guarantee database transactions are processed reliably".
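As a concrete example, here is a minimal JDBC sketch (table and column names are hypothetical) that makes two updates atomic: either both commit or, on any failure, both roll back.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Move an amount between two accounts atomically.
static void transfer(Connection conn, int fromId, int toId, long cents) throws SQLException {
    conn.setAutoCommit(false);  // begin the transaction
    try (PreparedStatement debit = conn.prepareStatement(
             "UPDATE accounts SET balance = balance - ? WHERE id = ?");
         PreparedStatement credit = conn.prepareStatement(
             "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
        debit.setLong(1, cents);
        debit.setInt(2, fromId);
        debit.executeUpdate();
        credit.setLong(1, cents);
        credit.setInt(2, toId);
        credit.executeUpdate();
        conn.commit();  // make both updates visible at once
    } catch (SQLException e) {
        conn.rollback();  // undo both updates on any failure
        throw e;
    } finally {
        conn.setAutoCommit(true);
    }
}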
Related
We have a booking system where tens of thousands of reservations are made every day. Because a customer can create a reservation without being logged in, a new customer id/row is created for every reservation, even if the very same customer has reserved in the system before. That results in a lot of customer duplicates.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script, every day, which checks for these duplicates based on some business rules (email, address, etc.). The deduplication logic is then:
If a new reservation is created, check whether the (newly created) customer for this reservation already has an old customer id (by comparing email and other aspects).
If there are one or more old reservations, detach each old reservation from the old customer id and link it to the new customer id, literally by changing the customer ID of that old reservation to the newly created customer.
I don't have a very strong technical background, but this smells like terrible design to me. As we have several operational applications relying on that data, it creates a massive sync issue. Beyond that, I was hoping to understand why exactly, in terms of application architecture, this is bad design, and what a better solution for this deduplication problem would be (if it even has to be solved in "this" application domain).
I would very much appreciate any help so I can steer the engineering team in the right direction.
In General
What's the problem you're trying to solve - freeing up disk space, getting accurate analytics of user behavior, or being more user friendly?
It feels a bit risky, and depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse?" - not because you should be paranoid, but because not thinking that through feels a bit negligent. E.g. if you were a government department matching private citizens' records, that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job: if your backend is an old legacy system, then I can appreciate why a nightly batch job might be the easiest option; that said, if done correctly and with the right architecture, you should be able to do that check on the fly, as needed.
Specifics
...check if the (newly created) customer for this reservation has already an old customer id (by comparing email...
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
... and other aspects.
What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone makes a typo in the phone number which matches some other customer - so you simultaneously match with more than one customer?
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
Feels dangerous. What happens if the detaching process screws up? I've seen situations where, instead of updating the delta, the system did a total purge then a full re-import... when the second part failed, the entire system was blank. It's not your exact situation, but you are creating the possibility of similar types of issue.
As we have several operational applications relying on that data, this creates a massive sync issue.
...case in point.
In your case, doing the swap in a transaction would be wise. You may want to consider tracking all customer ID swaps so that you can revert if something goes wrong.
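A hedged sketch of what that could look like in JDBC - the reservations and customer_id_swaps tables are assumptions, not your actual schema. The re-link and the audit row succeed or fail together, and the audit row records everything needed to revert the swap.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Re-link one reservation to the surviving customer id, recording the old link.
static void swapCustomer(Connection conn, long reservationId,
                         long oldCustomerId, long newCustomerId) throws SQLException {
    conn.setAutoCommit(false);
    try (PreparedStatement relink = conn.prepareStatement(
             "UPDATE reservations SET customer_id = ? "
           + "WHERE id = ? AND customer_id = ?");  // guard against concurrent edits
         PreparedStatement audit = conn.prepareStatement(
             "INSERT INTO customer_id_swaps (reservation_id, old_customer_id, "
           + "new_customer_id, swapped_at) VALUES (?, ?, ?, CURRENT_TIMESTAMP)")) {
        relink.setLong(1, newCustomerId);
        relink.setLong(2, reservationId);
        relink.setLong(3, oldCustomerId);
        relink.executeUpdate();
        audit.setLong(1, reservationId);
        audit.setLong(2, oldCustomerId);
        audit.setLong(3, newCustomerId);
        audit.executeUpdate();
        conn.commit();
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    } finally {
        conn.setAutoCommit(true);
    }
}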
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records; just make a copy of what the new data would be. Do this in production - you'll get a much better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What you could consider building is a "scoring" algorithm: if you are checking more than one piece of data, you'll get different combinations with different likelihoods of accuracy. You can use this to gauge how good your matching is, and then decide in which circumstances it's safe to do the ID switch and when it's not.
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.
I wouldn't say that's a terrible design; it's just a simple approach to solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the number of new bookings received during the day, which may vary from day to day, so other workflows that depend on it will be impacted.
This approach can be improved by processing new bookings in parallel, and by using an index for a fast lookup when checking whether an e-mail already exists.
You can also check out Bloom filters - an efficient data structure that can tell you definitively whether an element is not in a given set.
The way I would do it is to store the bookings in a NoSQL table keyed off the user's email. You get the user's email in both situations - when the user has an account and when they book without one - so you just have to do a lookup to get the bookings by email, which makes that deduplication job redundant.
I need a solution for a special scenario:
Each user has many (millions to billions) database rows for his products. Each row is a product.
Each user can only change his own products.
Each subset of those products can be changed differently.
Each user can change different values (price, amount, ...) by adding or subtracting a fixed entered value.
Each user can also change those values by adding or subtracting a percentage (add 3% to all selected values, or to a subset of all products).
Each change can consist of any number of individual modifications before the user saves his changes.
In addition, users need the ability to roll back their changes to the initial state, or to any state they defined at some earlier time, choosing which state they want restored (like undo/redo functionality where each state is stored with a timestamp).
If a user has defined 12 (or any other number of) change states and decides to roll them back in reverse order, then all values must be restored to their initial state.
Given the massive amount of data for each user, it is not really practical to store every change to each product.
It will be used in a web based application written in PHP, Javascript and MySQL.
Is there any way (database functionality, another database, an API, ...) to realize this?
Maybe something like the command pattern, applied in a different way?
I hope someone has an idea of how I can realize this.
I have a situation where some information is valid only for a limited period of time.
One example is conversion rates stored in the DB with validFrom and validTo timestamps.
Imagine the situation where the user starts the process and I render a pre-receipt for him with one conversion rate, but by the time he finally hits the button, another rate is already valid.
Some solutions I see for now:
Show the user a message about the new rate, render an updated pre-receipt, and ask him to submit the form again.
Have overlapping validity periods for rates, so that transactions started with one rate can finish with it, while new ones start with the new rate.
While the first solution seems the most logical, I've never seen such messages on websites. I wonder whether there are other solutions and what the best practice is.
So this is a question best posed to the product owner of your application. If I were wearing my product owner hat, I would want the data being displayed never to be out of sync, such that option (2) above never occurs. This is to make sure the display is fair in all respects.
Ways to handle this:
As you say: display an alert that something changed and allow a refresh.
Handle updates to the data tables using DHTML/AJAX so that the data is usually fresh.
To summarize: it's a business decision, but generally speaking it's a bad choice to show unfair and/or out-of-date data on a page.
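For the first option, the server-side check at submit time can be as simple as comparing the rate the pre-receipt was rendered with against the rate that is valid now. A minimal sketch; findRateIdValidNow() is a hypothetical lookup of the currently valid row in the rates table:

// Returns true if it's safe to complete at the displayed rate,
// false if the pre-receipt must be re-rendered with the new rate.
static boolean rateStillValid(long rateIdShownToUser, String currencyPair) {
    long currentRateId = findRateIdValidNow(currencyPair);  // hypothetical DAO call
    return currentRateId == rateIdShownToUser;
}

The form just needs to carry the id of the rate it displayed, so the server can make this comparison when the user finally hits the button.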
I'm working on a viewer program that formats the contents of a database. So far it's all been read-only, and I have a Refresh button that re-queries the database if I want to make sure to use current data.
Now I'm looking at changing the viewer to an editor (read-write) which involves writing back to the database, and am realizing there are potential concurrency issues: if more than one user is working on the database then there are the possibilities of stale data & other concurrency bugaboos.
What I'm wondering is, what are appropriate design patterns both for the database and the application UI to avoid concurrency problems?
To be bulletproof, I could force the user to use an explicit transaction (e.g. it's in read-only mode most of the time, then they have to push an Edit button to start a transaction, with Commit and Revert buttons to commit or revert it), but that seems clunky and wouldn't work well with large sets of changes (Edit, then an hour's worth of changes, yields an overly large transaction and may prevent other people from making changes). Also, it would suck if someone made a bunch of changes and then the transaction failed - what should they do then to avoid losing that work?
It seems like I'd want to notify the user when the relevant data is being changed, so that the granularity of changes is small and they get cued to refresh from the database and get in the habit of doing so.
Also, if there are updates, should I automatically bring them into the application display (assuming they don't clobber what the user is working on)? Or should the user be forced to refresh explicitly?
A great example, which is sort of close to the situation I'm working on, is filesystem explorers (e.g. Windows Explorer) which show a hierarchy of folders/directories and a list of objects within them. Windows Explorer lets you refresh, but there's also some notification from the filesystem to the Explorer window, so that if a new file is created, it will just appear in the viewport without you having to hit F5 to refresh.
I found these StackOverflow posts, but they're not exactly the same question:
Web services and database concurrency
Distributed Concurrency Control
C# Database Application
Only display one record for editing at a time.
Submit new values conditionally, after applying whatever domain-specific validation is appropriate. If the record has changed in the meantime (most DAL-type software will throw an exception so you don't need to check manually), display the current (changed) values, advise the user, and accept another try (or abandon). You may want to indicate the source and timestamp of the change you are displaying.
That's the simplest reliable standard pattern I know of. Trying to induce the user to explicitly choose "Display" vs. "Edit" mode is problematic: it locks the record for some indeterminate amount of time, and you can't always reliably know when the user (for instance) gives up, turns off their computer, and goes home.
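If you're using a JPA-style ORM rather than hand-rolled SQL, this pattern usually comes built in via a version column; a minimal sketch (the entity is hypothetical):

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class ContractNote {
    @Id
    private Long id;

    private String notes;

    // JPA increments this on every update and adds "AND version = ?" to the
    // UPDATE; a stale save throws OptimisticLockException instead of clobbering.
    @Version
    private long version;
}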
If you have a case where you have a parent record with editable child records (e.g. the line items on a purchase order), it gets more complex but let's worry about that later? There are patterns for those too.
A good working approach I use:
Don't open a transaction until you are actually applying changes to the DB (after the user presses the Save button).
You don't even need to refresh the record before opening the user's edit dialog.
Just before applying the changes, check in your application code whether the record has been changed by another user.
This is done through a SELECT statement just before the UPDATE statement.
If a record with the old field values (in the DataSet) no longer exists in the database, alert the user that "the record has been changed by another user"; the user must close the dialog, refresh the record, and start editing again.
Otherwise, open the transaction and proceed.
Optimistic locking works fine for most cases where your records are composed of short, simple fields (e.g., a short string or single numeric value per field). It gives users the greatest access to the data and doesn't force them to worry about locks and such. Apply a write lock only when actually in the process of saving a record; no records are locked while anyone is merely editing.
If the app finds that a record it's trying to save is already locked, it simply retries a short time (<500 ms) later. There's no need to alert the user (other than maybe hourglass/pointer feedback if it lasts longer than 500 ms), since no lock is ever in place long enough to matter to the user.
When User A saves a record, the database only updates the fields that User A has changed (along with any other fields that depend on those changed values). This avoids overwriting with old values the fields changed by User B since User A retrieved the record.
The implicit assumption is that whoever edits a field of a record last has the final say, which is not an unreasonable way of doing business. For example, User A retrieves a record and edits a field, then User B retrieves the record and edits the same field. User B saves, then User A saves. User A's changes overwrite User B's. User B's work was "a waste," but that sort of thing is going to happen anyway when users share data. Locks can only prevent wasted work when users happen to try to edit the same record in the same thin slice of time. The more likely event is that User B edits the record's field and saves, then User A edits the field and saves, again wasting User B's work; there's nothing you can do with locks to prevent that. If there's really a high chance of wasted work through user interactions, it's better to prevent it through the design of the business process rather than database locks.
As for the UI, there are two server-synchronization styles I recommend: (1) Real Time, and (2) Transactional.
In Real Time style, the users' displays automatically correspond as closely as practical to what's in the database. Refreshes are automatic, either based on a short period (every five seconds) or "pushed" to the user when changes are made by others. When the user enters a field and makes an edit, the app suppresses refreshes for that field, but continues to refresh other fields and records. There is no Save button or menu item. The app saves a record anytime a user edits a field and then leaves it or hits Enter. When the user starts to edit a field, the app changes the field's appearance to indicate that things are tentative (e.g., changing the border around the field to a dashed line) in order to encourage the user to hit Enter or Tab when done.
In Transactional, the users' displays are presented as a snapshot of what's in the database. The user must explicitly save and manually refresh data with buttons or menu items, except the app should automatically refresh a record when the user starts to edit it or after the user saves it. The user can edit any number of fields or records before saving. However, you can encourage frequent saves by changing the appearance of edited fields to indicate their tentative state, as recommended for Real Time. You can also display a timestamp or other indication of the last refresh to encourage users to refresh frequently.
Generally, Real Time is preferred. Users don’t have to worry about stale data or losing a lot of work by forgetting to save. However, use Transactional if it is necessary to maintain sufficient database performance. You probably don’t want Real Time if updating a field typically takes more than 1.0 second for server response. You should also consider Transactional if users’ edits trigger major events that are difficult to reverse or can produce wasted work (e.g., changing a budget value triggers notice to superior for approval). An explicit Save command is good for saying, “Okay, I’ve checked my work, let ‘er rip.”
I need to do transactions (begin, commit, or rollback) and locks (SELECT FOR UPDATE).
How can I do that in a document-model DB?
Edit:
The case is this:
I want to run an auctions site.
And I'm thinking about how to handle direct purchases as well.
In a direct purchase I have to decrement the quantity field in the item record, but only if the quantity is greater than zero. That is why I need locks and transactions.
I don't know how to address that without locks and/or transactions.
Can I solve this with CouchDB?
No. CouchDB uses an "optimistic concurrency" model. In the simplest terms, this just means that you send a document version along with your update, and CouchDB rejects the change if the current document version doesn't match what you've sent.
It's deceptively simple, really. You can reframe many normal transaction based scenarios for CouchDB. You do need to sort of throw out your RDBMS domain knowledge when learning CouchDB, though. It's helpful to approach problems from a higher level, rather than attempting to mold Couch to a SQL based world.
Keeping track of inventory
The problem you outlined is primarily an inventory issue. If you have a document describing an item, and it includes a field for "quantity available", you can handle concurrency issues like this:
Retrieve the document, take note of the _rev property that CouchDB sends along
Decrement the quantity field, if it's greater than zero
Send the updated document back, using the _rev property
If the _rev matches the currently stored number, be done!
If there's a conflict (when _rev doesn't match), retrieve the newest document version
In this instance, there are two possible failure scenarios to think about. If the most recent document version has a quantity of 0, you handle it just like you would in a RDBMS and alert the user that they can't actually buy what they wanted to purchase. If the most recent document version has a quantity greater than 0, you simply repeat the operation with the updated data, and start back at the beginning. This forces you to do a bit more work than an RDBMS would, and could get a little annoying if there are frequent, conflicting updates.
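To make the loop concrete, here is a hedged sketch in Java against CouchDB's documented HTTP API (GET a document, PUT it back with its _rev, get 409 Conflict if someone else won). It assumes Java 11+'s HttpClient, the org.json library, a database named shop, and a quantity field on the document - all illustrative choices, not requirements.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.json.JSONObject;

public class InventoryDecrement {
    private static final HttpClient http = HttpClient.newHttpClient();
    private static final String BASE = "http://localhost:5984/shop/";

    // Try to sell one unit; returns false if the item is sold out.
    static boolean buyOne(String docId) throws Exception {
        while (true) {
            // 1. Retrieve the document; the body carries the current _rev.
            HttpResponse<String> get = http.send(
                HttpRequest.newBuilder(URI.create(BASE + docId)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            JSONObject doc = new JSONObject(get.body());

            // 2. Give up if there is nothing left to sell.
            int quantity = doc.getInt("quantity");
            if (quantity <= 0) return false;

            // 3. Send the update back, _rev included, and check the outcome.
            doc.put("quantity", quantity - 1);
            HttpResponse<String> put = http.send(
                HttpRequest.newBuilder(URI.create(BASE + docId))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(doc.toString()))
                    .build(),
                HttpResponse.BodyHandlers.ofString());
            if (put.statusCode() == 201 || put.statusCode() == 202) return true;
            if (put.statusCode() != 409) throw new RuntimeException(put.body());
            // 409 Conflict: someone updated the doc first - loop and re-read.
        }
    }
}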
Now, the answer I just gave presupposes that you're going to do things in CouchDB in much the same way that you would in an RDBMS. I might approach this problem a bit differently:
I'd start with a "master product" document that includes all the descriptor data (name, picture, description, price, etc). Then I'd add an "inventory ticket" document for each specific instance, with fields for product_key and claimed_by. If you're selling a model of hammer, and have 20 of them to sell, you might have documents with keys like hammer-1, hammer-2, etc, to represent each available hammer.
Then, I'd create a view that gives me a list of available hammers, with a reduce function that lets me see a "total". These are completely off the cuff, but should give you an idea of what a working view would look like.
Map
function(doc)
{
    if (doc.type == 'inventory_ticket' && doc.claimed_by == null) {
        // doc._id (not doc.id) is the document's id inside a map function
        emit(doc.product_key, { 'inventory_ticket': doc._id, '_rev': doc._rev });
    }
}
This gives me a list of available "tickets", by product key. I could grab a group of these when someone wants to buy a hammer, then iterate through sending updates (using the id and _rev) until I successfully claim one (previously claimed tickets will result in an update error).
Reduce
function (keys, values, rereduce)
{
    // On rereduce, values are partial counts from earlier passes; sum them.
    return rereduce ? sum(values) : values.length;
}
This reduce function simply returns the total number of unclaimed inventory_ticket items, so you can tell how many "hammers" are available for purchase.
Caveats
This solution represents roughly 3.5 minutes of total thinking for the particular problem you've presented. There may be better ways of doing this! That said, it does substantially reduce conflicting updates, and cuts down on the need to respond to a conflict with a new update. Under this model, you won't have multiple users attempting to change data in the primary product entry. At the very worst, you'll have multiple users attempting to claim a single ticket, and if you've grabbed several of those from your view, you simply move on to the next ticket and try again.
Reference: https://wiki.apache.org/couchdb/Frequently_asked_questions#How_do_I_use_transactions_with_CouchDB.3F
Expanding on MrKurt's answer: for lots of scenarios you don't need to have stock tickets redeemed in order. Instead of selecting the first ticket, you can select randomly from the remaining tickets. Given a large number of tickets and a large number of concurrent requests, you will get much reduced contention on those tickets, versus everyone trying to get the first ticket.
A design pattern for RESTful transactions is to create a "tension" in the system. For the popular example use case of a bank account transfer, you must ensure that the totals for both involved accounts are updated:
Create a transaction document: "transfer USD 10 from account 11223 to account 88733". This creates the tension in the system. (A sketch of such a document follows below.)
To resolve any tension, scan all transaction documents and:
If the source account is not updated yet update the source account (-10 USD)
If the source account was updated but the transaction document does not show this, then update the transaction document (e.g. set a "sourcedone" flag in the document)
If the target account is not updated yet update the target account (+10 USD)
If the target account was updated but the transaction document does not show this, then update the transaction document
If both accounts have been updated you can delete the transaction document or keep it for auditing.
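For illustration, the tension document for the transfer above might look like this (the id and field names are hypothetical):

{
  "_id": "txn-0001",
  "type": "transfer",
  "amount_usd": 10,
  "source_account": "11223",
  "target_account": "88733",
  "sourcedone": false,
  "targetdone": false
}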
The scanning for tension should be done in a backend process over all "tension documents" to keep the periods of tension in the system short. In the example above, there will be a short period of anticipated inconsistency, when the first account has been updated but the second has not been updated yet. This must be taken into account the same way you'd deal with eventual consistency if your CouchDB is distributed.
Another possible implementation avoids the need for transactions completely: just store the tension documents and evaluate the state of your system by evaluating every involved tension document. In the example above, this would mean that the total for an account is determined only as the sum of the values in the transaction documents where that account is involved. In CouchDB you can model this very nicely as a map/reduce view.
No, CouchDB is not generally suitable for transactional applications because it doesn't support atomic operations in a clustered/replicated environment.
CouchDB sacrificed transactional capability in favor of scalability. In order to have atomic operations you need a central coordination system, which limits your scalability.
If you can guarantee that you only have one CouchDB instance, or that everyone modifying a particular document connects to the same CouchDB instance, then you could use the conflict detection system to create a sort of atomicity using the methods described above. But if you later scale up to a cluster or use a hosted service like Cloudant, it will break down and you'll have to redo that part of the system.
So my suggestion would be to use something other than CouchDB for your account balances; it will be much easier that way.
As a response to the OP's problem, Couch is probably not the best choice here. Using views is a great way to keep track of inventory, but clamping to 0 is more or less impossible. The problem is the race condition between reading the result of a view, deciding you're OK to use a "hammer-1" item, and then writing a doc to use it: there's no atomic way to write the doc only if the view says there are > 0 hammer-1's.
If 100 users all query the view at the same time and see 1 hammer-1, they can all write a doc to use a hammer-1, resulting in -99 hammer-1's. In practice, the race window will be fairly small - really small if your DB is running on localhost. But once you scale, and have an off-site DB server or cluster, the problem will get much more noticeable. Regardless, it's unacceptable to have a race condition of that sort in a critical, money-related system.
An update to MrKurt's response (it may just be dated, or he may have been unaware of some CouchDB features)
A view is a good way to handle things like balances / inventories in CouchDB.
You don't need to emit the docid and rev in a view; you get both of those for free when you retrieve view results. Emitting them - especially in a verbose format like a dictionary - will just make your view unnecessarily large.
A simple view for tracking inventory balances should look more like this (also off the top of my head)
function(doc)
{
    if (doc.InventoryChange != undefined) {
        for (var product_key in doc.InventoryChange) {
            // Emit the signed change so the _sum reduce yields a net total
            emit(product_key, doc.InventoryChange[product_key]);
        }
    }
}
And the reduce function is even simpler:
_sum
This uses a built-in reduce function that simply sums the values of all rows with matching keys, giving a net total per product.
In this view, any doc can have an "InventoryChange" member that maps product_keys to a change in their total inventory, e.g.:
{
"_id": "abc123",
"InventoryChange": {
"hammer_1234": 10,
"saw_4321": 25
}
}
Would add 10 hammer_1234's and 25 saw_4321's.
{
"_id": "def456",
"InventoryChange": {
"hammer_1234": -5
}
}
Would burn 5 hammers from the inventory.
With this model, you're never updating any data, only appending. This means there's no opportunity for update conflicts. All the transactional issues of updating data go away :)
Another nice thing about this model is that ANY document in the DB can both add and subtract items from the inventory. These documents can have all kinds of other data in them: you might have a "Shipment" document with a bunch of data about the date and time received, the warehouse, the receiving employee, etc., and as long as that doc defines an InventoryChange, it'll update the inventory. As could a "Sale" doc, a "DamagedItem" doc, etc. Looking at each document, they read very clearly, and the view handles all the hard work.
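Querying the view with ?group=true then returns the net total per product key. With the two example documents above, the result would look like this:

{"rows": [
  {"key": "hammer_1234", "value": 5},
  {"key": "saw_4321", "value": 25}
]}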
Actually, you can, in a way. Have a look at the HTTP Document API and scroll down to the heading "Modify Multiple Documents With a Single Request".
Basically, you can create/update/delete a bunch of documents in a single POST request to the URI /{dbname}/_bulk_docs, and they will either all succeed or all fail. The documentation does caution that this behaviour may change in the future, though.
EDIT: As predicted, from version 0.9 bulk docs no longer work this way.
Just use a lightweight solution like SQLite for the transactions; when a transaction completes successfully, replicate it, and mark it as replicated in SQLite.
SQLite table:
txn_id          txn_attribute1   txn_attribute2   ...   txn_status
dhwdhwu$sg1     x                y                ...   added/replicated
You can also delete the transactions which have replicated successfully.
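A minimal sketch of that bookkeeping with the SQLite JDBC driver (the table follows the answer above; the status values and file name are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class TxnLog {
    public static void main(String[] args) throws SQLException {
        // Requires the sqlite-jdbc driver on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:txns.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS txns ("
                         + "txn_id TEXT PRIMARY KEY, "
                         + "txn_attribute1 TEXT, txn_attribute2 TEXT, "
                         + "txn_status TEXT)");  // 'added' or 'replicated'
            }
            // Once a transaction has replicated successfully, mark it as such.
            try (PreparedStatement ps = conn.prepareStatement(
                     "UPDATE txns SET txn_status = 'replicated' WHERE txn_id = ?")) {
                ps.setString(1, "dhwdhwu$sg1");
                ps.executeUpdate();
            }
        }
    }
}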