Time-dependent information: showing one rate and applying another - database

I have a situation where some information is valid only for a limited period of time.
One example is conversion rates stored in the DB with validFrom and validTo timestamps.
Imagine the situation where a user starts the process and I render a pre-receipt for him with one conversion rate, but by the time he finally hits the button another rate is already valid.
Some solutions I see for now:
Show the user a message about the new rate, render an updated pre-receipt, and ask him to submit the form again.
Have overlapping validity periods for rates, so transactions started with one rate can finish with it, while new ones start with the new rate (a sketch follows below).
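For illustration, here is a minimal sketch of what option 2 could look like with JDBC. The table and column names (conversion_rates, valid_from, valid_to) are invented and now() is database-specific; the idea is just that the rate quoted on the pre-receipt is re-checked at submit time and honored as long as its (possibly overlapping) validity window still covers the current moment:

import java.math.BigDecimal;
import java.sql.*;

class RateLookup
{
    // Returns the quoted rate if its validity window still covers "now", or null if it has expired.
    static BigDecimal quotedRateIfStillValid(Connection conn, long rateId) throws SQLException
    {
        String sql = "SELECT rate FROM conversion_rates WHERE id = ? AND valid_from <= now() AND valid_to > now()";
        try (PreparedStatement ps = conn.prepareStatement(sql))
        {
            ps.setLong(1, rateId);
            try (ResultSet rs = ps.executeQuery())
            {
                return rs.next() ? rs.getBigDecimal("rate") : null; // null => re-quote and show the new rate
            }
        }
    }
}

If the lookup returns null, you fall back to option 1: render the updated pre-receipt and ask the user to confirm again.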
While the 1st solution seems the most logical, I've never seen such messages on websites. I wonder whether there are other solutions and what the best practice is.

So this is a question best posed to the product owner of your application. If I were wearing my product owner hat, I would want the data being displayed never to be out of sync, so that option (2) above never occurs. This is to make sure the display is fair in all respects.
Ways to handle this:
As you say: display an alert that something changed and allow a refresh.
Handle updates to the data tables using DHTML/AJAX so that the displayed data is usually fresh.
To summarize: it's a business decision, but generally speaking it's a bad choice to show unfair and/or out-of-date data on a page.

Related

Customer Deduplication in Booking Application

We have a booking system where tens of thousands of reservations are made every day. Because a customer can create a reservation without being logged in, a new customer id/row is created for every reservation, even if the very same customer has already reserved in the system before. That results in a lot of customer duplicates.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script, every day, which checks for these duplicates based on some business rules (email, address, etc.). The deduplication logic is then:
If a new reservation is created, check whether the (newly created) customer for this reservation already has an old customer id (by comparing email and other aspects).
If it has one or more old reservations, detach those reservations from the old customer id and link them to the new customer id, literally by changing the customer ID of each old reservation to the newly created customer.
I don't have a very strong technical background, but to me this smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Beyond that, I was hoping to understand why exactly, in terms of application architecture, this is bad design, and what a better solution would be for this deduplication problem (if it even has to be solved in "this" application domain).
I would very much appreciate any help so I can drive the engineering team in the right direction.
In General
What's the problem you're trying to solve? Freeing up disk space, getting accurate analytics of user behavior, or being more user-friendly?
It feels a bit risky, and it depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse?" - not because you should be paranoid, but because not thinking that through feels a bit negligent. E.g. if you were a government department matching private citizen records, that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job - if your backend system is an old legacy system, then I can appreciate why a nightly batch job might be the easiest option; that said, if done correctly and with the right architecture, you should be able to do that check on the fly, as needed.
Specifics
...check if the (newly created) customer for this reservation has already an old customer id (by comparing email...
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
... and other aspects.
What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone makes a typo in their phone number that matches some other customer - so you simultaneously match more than one customer?
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
Feels dangerous. What happens if the detaching process screws up? I've seen situations where, instead of updating the delta, the system did a total purge and then a full re-import... when the second part fails, the entire system is blank. It's not your exact situation, but you are creating the possibility for similar types of issues.
As we have several operational applications relying on that data, this creates a massive sync issue.
...case in point.
In your case, doing the swap in a transaction would be wise. You may want to consider tracking all Cust ID swaps so that you can revert if something goes wrong; a sketch follows.
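A minimal version of that idea with JDBC, assuming hypothetical reservations and customer_id_swaps tables: the re-link and the audit row either both commit or both roll back.

import java.sql.*;

class ReservationRelinker
{
    static void relink(Connection conn, long reservationId, long oldCustomerId, long newCustomerId) throws SQLException
    {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement move = conn.prepareStatement(
                 "UPDATE reservations SET customer_id = ? WHERE id = ? AND customer_id = ?");
             PreparedStatement audit = conn.prepareStatement(
                 "INSERT INTO customer_id_swaps (reservation_id, old_customer_id, new_customer_id, swapped_at) VALUES (?, ?, ?, now())"))
        {
            move.setLong(1, newCustomerId);
            move.setLong(2, reservationId);
            move.setLong(3, oldCustomerId); // guards against racing with another concurrent swap
            if (move.executeUpdate() != 1)
            {
                throw new SQLException("Reservation was not in the expected state");
            }
            audit.setLong(1, reservationId);
            audit.setLong(2, oldCustomerId);
            audit.setLong(3, newCustomerId);
            audit.executeUpdate();
            conn.commit();
        }
        catch (SQLException e)
        {
            conn.rollback();
            throw e;
        }
        finally
        {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}

The audit table is what lets you revert: replaying it in reverse restores the original customer ids.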
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records; just make a copy of what the new data would be. Do this in production - you'll get a much better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What's more likely, and what you could consider building, is a "scoring" algorithm. If you are checking more than one piece of data then you'll get different combinations with different likelihood of accuracy. You can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not.
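A toy illustration of such a scoring approach; the fields, weights, and thresholds here are all invented for the example:

class CustomerRecord
{
    String email, phone, postcode;
}

class DuplicateScorer
{
    // Weigh the evidence that two records are the same customer.
    static int matchScore(CustomerRecord a, CustomerRecord b)
    {
        int score = 0;
        if (a.email != null && a.email.equalsIgnoreCase(b.email)) score += 60;          // validated email: strong evidence
        if (a.phone != null && a.phone.equals(b.phone)) score += 25;                    // phone: weaker, typos happen
        if (a.postcode != null && a.postcode.equalsIgnoreCase(b.postcode)) score += 15;
        return score; // e.g. auto-merge at >= 75, queue 40-74 for manual review, ignore below 40
    }
}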
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.
I wouldn't say it's a terrible design; it's just a simple approach to solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the number of new bookings received during the day, which may vary from day to day, so other workflows that depend on it will be impacted.
This approach can be improved by processing new bookings in parallel, and by using an index to get a fast lookup when checking whether a new e-mail already exists.
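For the lookup side, a sketch of the indexed existence check; the table and column names are assumptions, and it presumes an index such as CREATE INDEX idx_customers_email ON customers (email):

import java.sql.*;

class EmailCheck
{
    // Fast existence check; with an index on email this is a single index probe.
    static boolean emailExists(Connection conn, String email) throws SQLException
    {
        try (PreparedStatement ps = conn.prepareStatement("SELECT 1 FROM customers WHERE email = ? LIMIT 1"))
        {
            ps.setString(1, email);
            try (ResultSet rs = ps.executeQuery())
            {
                return rs.next();
            }
        }
    }
}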
You can also check out Bloom Filters - an efficient data structure that can tell you with certainty when an element is not in a given set.
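A sketch using Guava's BloomFilter as a cheap pre-check in front of the database lookup (the sizing numbers are invented):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

class EmailPreFilter
{
    // Sized for ~10M distinct emails with a 1% false-positive rate.
    private final BloomFilter<String> seenEmails =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    // Returns false only when the email is definitely new, so the DB lookup can be skipped;
    // true means "possibly seen before" and you fall back to the indexed query.
    boolean maybeSeen(String email)
    {
        boolean maybe = seenEmails.mightContain(email);
        seenEmails.put(email);
        return maybe;
    }
}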
The way I would do it is to store the bookings in a NoSQL DB table keyed off the user email. You get the user email in both situations - when the customer has an account and when they make a booking without one - so you just have to do a lookup to get the bookings by email, which makes the deduplication job redundant.

NOSQL denormalization datamodel

Many times I read that data in NoSQL databases is stored denormalized. For instance, consider a chess game record. It may contain not only the ids of the players that participate in the chess game, but also the first and last name of each player. I suppose this is done because joins are not possible in NoSQL, so if you just duplicate data you can still retrieve all the data you want in one call, without manual application-level processing of the data.
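For instance, a denormalized game record might look roughly like this (the names are invented for illustration):

{
  "gameId": "g42",
  "white": { "playerId": "p1", "firstName": "Anna", "lastName": "Smith" },
  "black": { "playerId": "p2", "firstName": "Boris", "lastName": "Jones" },
  "moves": ["e4", "e5", "Nf3"]
}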
What I don't understand is that when you want to update a chess player's name, you have to write a query that updates both the chess-game records in which that player participates and the player record itself. This seems like a huge performance overhead, as the database has to search for all games in which that player participates and then update each of those records.
Is it true that data is often stored denormalized like in my example?
You are correct, the data is often stored de-normalized in NoSQL databases.
The problem with the updates is partially where the term "eventual consistency" comes from.
In your example, when you update the player's name (not a common event, but it can happen), you would issue a background job to update the name across all other records. Yes, while the update is happening you may retrieve an older value, but eventually the data will be consistent. Since we're not writing ATM software here, the performance/consistency tradeoff is acceptable.
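A minimal sketch of such a background fan-out; DocumentStore here is a stand-in for whatever NoSQL client you actually use, not a real API:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface DocumentStore
{
    void updatePlayerName(String playerId, String first, String last);
    List<String> findGameIdsByPlayer(String playerId);
    void updateGamePlayerName(String gameId, String playerId, String first, String last);
}

class PlayerRenamer
{
    private final DocumentStore store;
    private final ExecutorService background = Executors.newSingleThreadExecutor();

    PlayerRenamer(DocumentStore store) { this.store = store; }

    void renamePlayer(String playerId, String first, String last)
    {
        store.updatePlayerName(playerId, first, last); // the source of truth is updated immediately
        background.submit(() -> {                      // the duplicated copies catch up eventually
            for (String gameId : store.findGameIdsByPlayer(playerId))
            {
                store.updateGamePlayerName(gameId, playerId, first, last);
            }
        });
    }
}

Until the background task finishes, some games still show the old name; that window is the "eventual" in eventual consistency.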
You can find more info here: http://www.allbuttonspressed.com/blog/django/2010/09/JOINs-via-denormalization-for-NoSQL-coders-Part-2-Materialized-views
One way to look at it is that a user changes his/her name extremely rarely.
But the number of times that board data is read and changed is immense.
So it only makes sense to optimize for the case that happens many more times than the case that happens only rarely.
Another point to note is that by not keeping that name data duplicated under the board data, you are actually increasing the performance overhead of the read. Every time you fetch the board data, you'd have to go one step further and fetch all the user data too (even if all you really wanted was just the first and last name).
Again the reason to put that first name and last name on board data is probably that on the screen where the board data will be shown, you'll often be showing the user's name too.
For these reasons, duplicating data is accepted practice in NoSQL DBs. (You can do this in SQL DBs too, but mind you, you'll be frowned upon.) Duplication in the NoSQL world is fairly common and even promoted.
I have been working for the past 7 years with NoSQL (Firestore) on 2 fairly big projects where I was able to write code from scratch (both around 50k LoC; one has about 15k daily active users). I didn't use denormalization at all. The concept never appealed to me, and document reads are fairly cheap in Firestore.
To come back to your example: loading the other data for the chess game seems far more important than instantly being able to show the name. I would load the name based on the user id in the background and put a simple client-side memoize/cache around it to prevent fetching the same user document over and over.
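A sketch of that client-side cache idea in Java; the fetch function is a stand-in for a Firestore document read by user id:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class UserNameCache
{
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> fetchName; // e.g. wraps a Firestore read by user id

    UserNameCache(Function<String, String> fetchName) { this.fetchName = fetchName; }

    // Memoized lookup: each user document is fetched at most once, later calls hit the cache.
    String nameFor(String userId)
    {
        return cache.computeIfAbsent(userId, fetchName);
    }
}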
What I did use quite a bit to solve performance issues is generate derived data. I would set a listener on a database document "onWrite" and then store some computed data in another derived document. These documents would automatically update when the source changes, so it doesn't complicate things really. In the case of a chess game, a distilled document could be the leaderboard that is constantly shown to all users of the app.
Another optimization I had to do was to distill a long list of titles + metadata for recently opened "projects". Firestore on the web client side doesn't give the ability to select fields from a document in a query. It only fetches full documents and that was too much data for the list, so we solved this by making an API endpoint to fetch the distilled data through there.
I'm not saying you should follow my advice, but we seem to be doing well in terms of code complexity and database costs. So when I read that NoSQL requires data denormalization I become skeptical :)
That's my 2 cents.

Track time spent in MSCRM

My requirement is that I need to be able to track time spent by each salesperson on activities. I also need a report that administrators can run to see the amount of time each user spent working on calls/sales/opportunities etc.
How should I track how long a user spends on particular activities?
I think I can do it using auditing.
Do you have any better ideas?
User auditing isn't going to be much help here: for starters, it only reports every couple of hours, and secondly, you can't write reports against the audit table.
I would suggest adding a duration field to the entities you want to track time against - activities already have a field for this. Then users just have to populate it manually.
Or if you want to automate you could use form JavaScript, for example:
New Field: Number, Duration
On Form Load: Capture a start time
On Form Save: Capture an end time
On Form Save: Work out the difference between the two, then add to the duration field
You would have to do this for every entity you want to track, though. It's also not guaranteed to be accurate: for example, if a user opens a couple of records at once, goes to lunch, or just doesn't save the record immediately, a much longer duration could be recorded than actually occurred.

How do I deal with concurrent changes in a web application?

Here are two potential workflows I would like to perform in a web application.
Variation 1
user sends request
server reads data
server modifies data
server saves modified data
Variation 2:
user sends request
server reads data
server sends data to user
user sends request with modifications
server saves modified data
In each of these cases, I am wondering: what are the standard approaches to ensuring that concurrent access to this service will produce sane results? (i.e. nobody's edit gets clobbered, values correspond to some ordering of the edits, etc.)
The situation is hypothetical, but here are some details of where I would likely need to deal with this in practice:
web application, but language unspecified
potentially, using a web framework
data store is a SQL relational database
the logic involved is too complex to express well in a query, e.g. as value = value + 1
I feel like I would prefer not to try and reinvent the wheel here. Surely these are well known problems with well known solutions. Please advise.
Thanks.
To the best of my knowledge, there is no general solution to the problem.
The root of the problem is that the user may retrieve data and stare at it on the screen for a long time before making an update and saving.
I know of three basic approaches:
1. When the user reads the database, lock the record, and don't release it until the user saves any updates. In practice, this is wildly impractical. What if the user brings up a screen and then goes to lunch without saving? Or goes home for the day? Or is so frustrated trying to update this stupid record that he quits and never comes back?
2. Express your updates as deltas rather than destinations. To take the classic example, suppose you have a system that records stock in inventory. Every time there is a sale, you must subtract 1 (or more) from the inventory count.
So say the present quantity on hand is 10. User A creates a sale. Current quantity = 10. User B creates a sale. He also gets current quantity = 10. User A enters that two units are sold. New quantity = 10 - 2 = 8. Save. User B enters one unit sold. New quantity = 10 (the value he loaded) - 1 = 9. Save. Clearly, something went wrong.
Solution: Instead of writing "update inventory set quantity=9 where itemid=12345", write "update inventory set quantity=quantity-1 where itemid=12345". Then let the database queue the updates. This is very different from strategy #1, as the database only has to lock the record long enough to read it, make the update, and write it. It doesn't have to wait while someone stares at the screen.
Of course, this is only usable for changes that can be expressed as a delta. If you are, say, updating the customer's phone number, it's not going to work. (Like, old number is 555-1234. User A says to change it to 555-1235. That's a change of +1. User B says to change it to 555-1243. That's a change of +9. So the total change is +10, and the customer's new number is 555-1244. :-) ) But in cases like that, "last user to hit the enter key wins" is probably the best you can do anyway.
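As a small extension of the delta idea, the update can also guard against overselling by making the stock check part of the same statement. A sketch, assuming an inventory table like the one above:

import java.sql.*;

class InventorySale
{
    // Delta update with a guard: succeeds only if enough stock remains.
    static boolean recordSale(Connection conn, long itemId, int units) throws SQLException
    {
        String sql = "UPDATE inventory SET quantity = quantity - ? WHERE itemid = ? AND quantity >= ?";
        try (PreparedStatement ps = conn.prepareStatement(sql))
        {
            ps.setInt(1, units);
            ps.setLong(2, itemId);
            ps.setInt(3, units);
            return ps.executeUpdate() == 1; // false => not enough stock, no row was changed
        }
    }
}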
3. On update, check that the relevant fields in the database match your "from" value. For example, say you work for a law firm negotiating contracts for your clients. You have a screen where a user can enter notes about negotiations. User A brings up a contract record. User B brings up the same contract record. User A enters that he just spoke to the other party on the phone and they are agreeable to the proposed terms. User B, who has also been trying to call the other party, enters that they are not responding to phone calls and he suspects they are stonewalling. User A clicks save. Do we want user B's comments to overwrite user A's? Probably not. Instead we display a message indicating that the notes have been changed since he read the record, and allow him to see the new value before deciding whether to proceed with the save, abort, or enter something different.
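One common way to implement that "check the from value" test is a version column that every update increments. A sketch; the table, column, and method names are invented:

import java.sql.*;

class NotesSaver
{
    // Optimistic concurrency: the UPDATE succeeds only if nobody changed the row since we read it.
    static boolean saveNotes(Connection conn, long contractId, String newNotes, long versionWhenRead) throws SQLException
    {
        String sql = "UPDATE contracts SET notes = ?, version = version + 1 WHERE id = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql))
        {
            ps.setString(1, newNotes);
            ps.setLong(2, contractId);
            ps.setLong(3, versionWhenRead);
            return ps.executeUpdate() == 1; // false => someone else saved first: re-read and ask the user
        }
    }
}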
If you do not have transactions in MySQL (e.g. your storage engine doesn't support them), you can use a conditional UPDATE to ensure that the data is not corrupted.
UPDATE tableA SET status=2 WHERE status = 1
If status is 1, only one process will get the result that a record was updated. executeUpdate returns the number of affected rows, so a result of 0 means there was no row to update; the method below returns -1 only if the statement failed.
import java.sql.*;

// Returns the number of rows updated (0 if no row matched the WHERE clause),
// or -1 if the statement failed.
int claimRecord(Connection connection, String s)
{
    int rows = -1;
    try (PreparedStatement query = connection.prepareStatement(s))
    {
        rows = query.executeUpdate();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
    return rows;
}
Things are simple in the application layer - every request is served by a different thread (or process), so unless you have state in your processing classes (services), everything is safe.
Things get more complicated when you reach the database - i.e. where the state is held. There you need transactions to ensure that everything is ok.
Transactions have a set of properties - ACID - that "guarantee database transactions are processed reliably".

best practices for database + resultset concurrency (both UI and control issues)

I'm working on a viewer program that formats the contents of a database. So far it's all been read-only, and I have a Refresh button that re-queries the database if I want to make sure to use current data.
Now I'm looking at changing the viewer to an editor (read-write) which involves writing back to the database, and am realizing there are potential concurrency issues: if more than one user is working on the database then there are the possibilities of stale data & other concurrency bugaboos.
What I'm wondering is, what are appropriate design patterns both for the database and the application UI to avoid concurrency problems?
To be bulletproof, I could force the user to use an explicit transaction (e.g. it's in read-only mode most of the time; they have to push an Edit button to start a transaction, then Commit and Revert buttons to commit or revert it), but that seems clunky and wouldn't work well with large sets of changes (an Edit followed by an hour's worth of changes yields an overly large transaction and may prevent other people from making changes). Also, it would suck if someone makes a bunch of changes and then the commit fails -- what should they do then to avoid losing that work?
It seems like I'd want to notify the user when the relevant data is being changed, so that the granularity of changes stays small and they are cued to refresh from the database and get in the habit of doing so.
Also, if there are updates, should I automatically bring them into the application display? (assuming they don't clobber what the user is working on) Or should the user be forced to explicitly refresh?
A great example, which is sort of close to the situation I'm working on, is filesystem explorers (e.g. Windows Explorer) which show a hierarchy of folders/directories and a list of objects within them. Windows Explorer lets you refresh, but there's also some notification from the filesystem to the Explorer window, so that if a new file is created, it will just appear in the viewport without you having to hit F5 to refresh.
I found these StackOverflow posts, but they're not exactly the same question:
Web services and database concurrency
Distributed Concurrency Control
C# Database Application
Only display one record for editing at a time.
Submit new values conditionally, after applying whatever domain-specific validation is appropriate. If the record has changed in the meantime (most DAL-type software will throw an exception so you don't need to check manually), display the current (changed) values, advise the user, and accept another try (or abandon). You may want to indicate the source and timestamp of the change you are displaying.
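For example, with a JPA-style DAL the conditional submit is often just a version column; a stale save then surfaces as an OptimisticLockException you can catch. A sketch with invented entity and field names:

import javax.persistence.*;

@Entity
class ContractRecord
{
    @Id Long id;
    String notes;
    @Version Long version; // JPA bumps this on every update and checks it on save
}

// On save, a concurrent change surfaces as an OptimisticLockException:
// catch it, re-read the record, display the current values, and let the user retry or abandon.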
That's the simplest reliable standard pattern I know of. Trying to induce the user to explicitly choose "Display" vs. "Edit" mode is problematic. It locks the record for some indeterminate amount of time, and you can't always reliably tell when the user (for instance) gives up, turns off their computer, and goes home.
If you have a case where you have a parent record with editable child records (e.g. the line items on a purchase order), it gets more complex but let's worry about that later? There are patterns for those too.
A good working way I use:
Don't open a transaction until you're actually applying changes to the DB (after the user presses the Save button).
You don't even need to refresh the record before opening the user's edit dialog.
But just before applying the changes, check in your application code whether the record has been changed by another user.
This is done through a SELECT statement just before the UPDATE statement.
If a record with the old field values (in the DataSet) no longer exists in the database, alert the user that "the record has been changed by another user"; the user must close the dialog, refresh the record, and start editing again.
Otherwise, open the transaction and do the rest.
Optimistic locking works fine for most cases where your records are composed of short, simple fields (e.g., a short string or a single numeric value per field), giving users the greatest access to the data and not forcing them to worry about locks and such. Apply a write lock only when actually in the process of saving a record. No records are locked while anyone is merely editing. If the app finds that a record it's trying to save is already locked, it simply retries a short time (<500 ms) later. There's no need to alert the user (other than maybe hourglass/pointer feedback if it lasts longer than 500 ms), since no lock is ever in place long enough to matter to the user. When User A saves a record, the database updates only the fields that User A has changed (along with any other fields that depend on those changed values). This avoids overwriting the fields changed by User B with old values from before User A retrieved the record.
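A sketch of that field-level save with JDBC. The table name is invented, and the column names must come from a trusted whitelist in your own code, never from user input, since they are concatenated into the SQL:

import java.sql.*;
import java.util.LinkedHashMap;
import java.util.StringJoiner;

class FieldLevelSave
{
    // Update only the fields this user actually changed, so concurrent edits
    // to other fields are not overwritten with stale values.
    static int saveChangedFields(Connection conn, long id, LinkedHashMap<String, Object> dirty) throws SQLException
    {
        if (dirty.isEmpty()) return 0;
        StringJoiner set = new StringJoiner(", ");
        for (String column : dirty.keySet()) set.add(column + " = ?"); // columns from a whitelist only
        String sql = "UPDATE records SET " + set + " WHERE id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql))
        {
            int i = 1;
            for (Object value : dirty.values()) ps.setObject(i++, value);
            ps.setLong(i, id);
            return ps.executeUpdate();
        }
    }
}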
The implicit assumption is that whoever edits a field of a record last has the final say, which is not an unreasonable way of doing business. For example, User A retrieves a record and edits a field, then User B retrieves the record and edits the same field. User B saves, then User A saves. User A's changes overwrite User B's. User B's work was "a waste," but that sort of thing is going to happen anyway when users share data. Locks can only prevent wasted work when users happen to try to edit the same record in the same thin slice of time. The more likely event is that User B edits the record's field and saves, then User A edits the field and saves, again wasting User B's work. There's nothing you can do with locks to prevent that. If there's really a high chance of wasted work from user interactions, it's better to prevent it through the design of the business process rather than database locks.
As for the UI, there are two styles I recommend: (1) Real Time, and (2) Transactional.
In the Real Time style, the users' displays automatically correspond as closely as practical to what's in the database. Refreshes are automatic, either based on a short period (every five seconds) or "pushed" to the user when changes are made by others. When the user enters a field and makes an edit, the app suppresses refreshes for that field but continues to refresh other fields and records. There is no Save button or menu item; the app saves a record any time a user edits a field and then leaves it or hits Enter. When the user starts to edit a field, the app changes the field's appearance to indicate that things are tentative (e.g., changing the border around the field to a dashed line) in order to encourage the user to hit Enter or Tab when done.
In the Transactional style, the users' displays are presented as a snapshot of what's in the database. The user must explicitly save and manually refresh data with buttons or menu items, except that the app should automatically refresh a record when the user starts to edit it or after the user saves it. The user can edit any number of fields or records before saving. However, you can encourage frequent saves by changing the appearance of edited fields to indicate their tentative state, as recommended for Real Time. You can also display a timestamp or other indication of the last refresh to encourage users to refresh frequently.
Generally, Real Time is preferred. Users don’t have to worry about stale data or losing a lot of work by forgetting to save. However, use Transactional if it is necessary to maintain sufficient database performance. You probably don’t want Real Time if updating a field typically takes more than 1.0 second for server response. You should also consider Transactional if users’ edits trigger major events that are difficult to reverse or can produce wasted work (e.g., changing a budget value triggers notice to superior for approval). An explicit Save command is good for saying, “Okay, I’ve checked my work, let ‘er rip.”
