Customer Deduplication in Booking Application - database

We have a booking system where dozens of thousands of reservations are done every day. Because a customer can create a reservation without being logged in, it means that for every reservation a new customer id/row is created, even if the very same customer already have reserved in the system before. That results in a lot of customer duplicates.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script, every day, which checks for this duplicates based on some business rules (email, address, etc). The logic for the deduplication then is:
If a new reservation is created, check if the (newly created) customer for this reservation has already an old customer id (by comparing email and other aspects).
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
I don't have a too strong technical background but this for me smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Besides that, I was hoping to understand why exactly, in terms of application architecture, this is bad design and what would be a better solution for this problem of deduplication (if it even has to be solved in "this" application domain).
I would appreciate very much any help so I can drive the engineering team to the right direction.

In General
What's the problem you're trying to solve? Free-up disk space, get accurate analytics of user behavior or be more user friendly?
It feels a bit risky, and depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse" - not because you should be paranoid, but because to not think that through feels a bit negligent. E.g. if you were a govt department matching private citizen records then that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job - If your backend system is an old legacy system then I can appreciate why a nightly batch job might be the easiest option; that said, if done correctly and with the right architecture, you should be able to do that check on the fly as needed.
Specifics
...check if the (newly created) customer
for this reservation has already an old customer id (by comparing
email...
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
... and other aspects.
What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone does a typo on the phone number which matches with some other customer - so you simultaneously match with more than one customer?
If it has one or more old reservations, detach that reservation from
the old customer id, and link it to a new customer id. Literally by
changing the customer ID of that old reservation to the newly created
customer.
Feels dangerous. What happens if the detaching process screws up? I've seen situations where instead of updating the delta, the system did a total purge then full re-import... when the second part fails the entire system is blank. It's not your exact situation but you are creating the possibility for similar types of issue.
As we have several operational applications relying on that data, this creates a massive sync issue.
...case in point.
In your case, doing the swap in a transaction would be wise. You may want to consider tracking all Cust ID swaps so that you can revert if something goes wrong.
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records, just make a copy that is what the new data would be. Do this in production - you'll get a way better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What's more likely, and what you could consider building, is a "scoring" algorithm. If you are checking more than one piece of data then you'll get different combinations with different likelihood of accuracy. You can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not.
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.

I wouldn't say that's a terrible design, it's just a simple approach of solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the new bookings that are received during the day, which may vary from day to day, so other workflows that depend on that will be impacted.
This approach can be improved by processing new bookings in parallel, and using an index to get a fast lookup when checking if a new e-mail already exists or not.
You can also check out Bloom Filters - an efficient data structure that is able to tell you if an element is not in a given set.
The way I would do it is to store the bookings in a No-SQL DB table keyed-off the user email. You get the user email in both situations - when it has an account or when it makes a booking without an account, so you just have to make a lookup to get the bookings by email, which makes that deduplication job redundant.

Related

If I am having any random accountId then How to find its ultimate parent account - looking for best optimized solution (for multiple level hierarchy)

If I am having any random accountId then How to find its ultimate parent account - looking for best-optimized solution (for multiple level hierarchy)
except 10 levels of formula field solution
It depends. Optimized for what, read operations (instant simple answer when querying) or write (easy save but more work when reading).
If you want easy read - you need to put some effort when saving the data. And remember you can't get away with a simple custom lookup called "Ultimate Parent" - because for standalone account SF will not let you form a cycle, create record that looks up itself. You might need 2 text fields (Id and Name) or some convention that yes, you'll make a lookup to Account - but if it's blank - the reading process needs to check the ParentId field too to determine what exactly is going on. (you could make a formula field to simplify reading but still - don't think you're getting away with simple lookup)
How much data you have, how deep hierarchies? The basic answer is to keep track of ultimate parent on every insert, update, delete and undelete. Write a trigger, SOQL query can go "up" 5 "dots" max
SELECT ParentId,
Parent.ParentId,
Parent.Parent.ParentId,
Parent.Parent.Parent.ParentId,
Parent.Parent.Parent.Parent.ParentId,
Parent.Parent.Parent.Parent.Parent.ParentId
FROM Account
WHERE Id IN :trigger.new
It gets messier if you need multiple queries (but still, this form would be most effective). And also you might hit performance issues when something reparents close to top of the tree and you're suddenly looking at having to cascade update hundreds of accounts. Remember you have a limit of 10K rows inserted/updated/deleted in single operation. You might have to propagate the changes down as a batch/future/queueable async process.
Another option would be to have a flat helper object aside from account table, with unique id set to account id. Have a background process periodically refreshing that table, even every hour. Using a batch job or reporting snapshot. Still not great if you have milions of accounts, waste of storage... but maybe you could use Big Objects.
Have you ever used platform cache? If the ultimate parent has to be fetched via apex (instead of being a real field on Account) - you could try to make some kind of "linked list" implementation where you store Id -> ParentId in cache and can travel it without wasting any queries. Cache's max is 48h (so might still need a nightly job to rebuild it) and you'd still have to update it on every insert/update/delete/undelete...
So yeah, "it depends". Write more about your requirement.

Reusing tables for purposes other than the intended

We need to implement some new functionality for some clients. The functionality is essentially an EULA accept interface for the users. Users will open our app, will be presented with the corresponding EULA (varies from client to client). It needs to be able to store different versions of the EULA for the same client, and it also needs to store which users have accepted which version of the EULA. If a new version is saved, it will be presented to the users the next time they log in.
I've written a document suggesting to add two tables, EULAs and UserAcceptedEULA. That will allow us to store different EULAs and keep track of the accepted ones, current and previous ones.
My problem comes with how some people at the company want to do the implementation. They suggest to use a table ConstantGroups (which contains ConstantGroupID, Timestamp, ClientID and Name) that we use for grouping constants with their values that are stored in another table, e.g.: ConstantGroup would be Quality, and the values would be High, Medium, Low.
To me this is a horrible, incredibly wrong way to do it. They're suggesting it because we already have an endpoint where you pass the ClientID and you get back a string, so it "does what we need".
I wrote the document explaining the whole solution, covering DB changes, APIs needed and UI modifications, but they still don't want to accept it because they thing their way will save us time.
How do I make them understand how horribly wrong they are?
This depends somewhat on your assumptions about "good" design.
Many software folk have adopted the SOLID principles as being "good" (I am one of them). While the original thinking is about object oriented design, I think many apply to databases too.
The first element of that is "Single responsibility". A table should do one thing, and one thing only. Your colleagues are trying to get a single entity to manage different concepts; the Constants table suddenly is responsible for "constants" and "EULA acceptance".
That leads to "Open to extension, closed to change" - if you need to change the implementation of "constants" or "EULAs", you have to untangle the other. So any time you (might) save now will cost you later.
The second principle I like (especially in database design) is the Principle of Least Astonishment. Imagine a new developer joining the team, and having to figure out how EULAs work. They would naturally look for some kind of "EULA" and "Acceptance" tables, and would be astonished to learn that actually, this is managed in a thing called "constants". Any time you save now will be repaid by onboarding new people (or indeed, reminding yourself in 2 years time when you have to fix a bug).

Couchbase, two user registering with same username but different datacenters?

Let's say I have two users, Alice in North America and Bob in Europe. Both want to register a new account with the same username, at the same time, on different datacenters. The datacenters are configured to replicate between each other using eventual consistency.
How can I make sure only one of them succeeds at registering the username? Keep in mind that the connection between the datacenters might even be offline at the time (worst case, but daily occurance on spotify's cassandra setup).
EDIT:
I do realize the key uniqueness is the big problem here. The thing is that I need all usernames to be unique. Imagine using twitter if you couldn't tag a specific person, but had to tag everyone with the same username.
With any eventual consistency system, and particularly in the presence of a network partition, you essentially have two choices:
Accept collisions, and pick a winner later.
Ensure you never have a collision.
In the case of Couchbase:
For (1) that means letting two users register with the same address in both NA and EU, and then later picking one as the "winner" (when the network link is present - not a very desirable outcome for something like a user account. A slight variation on this would be something like #Robert's suggestion and putting them in a staging area (which means the account cannot be made "active" until the partition is resolved), and then telling the "winning" user they have successfully registered, and the "loser" that the name is taken and to try again.
For (2) this means making the users unique, even though they pick the same username - for example adding a NA:: / EU:: prefix to their username document. When they login the application would need some logic to try looking up both document variations - likely trying the prefix for the local region first. (This is essentially the same idea as "realms" or "servers" that many MMO games use).
There are variations of both of these, but ultimately given an AP-type system (which Couchbase across XDCR is) you've essentially chosen Availability & Partition-Tolerance over Consistancy, and hence need to reconcile that at the application layer.
Put the user name registrations into a staging table until you can perform a replication to determine if the name already exists in one of the other data centers.
You tagged Couchbase, so I will answer about that.
As long as the key for each object is different, you should be fine with Couchbase. It is the keys that would be unique and work great with XDCR. Another solution would be to have a concatenated key made up of the username and other values (company name, etc) if that suits your use case, again giving you a unique key for the object. Yet another would be to have a key/value in a JSON document that is the username.
It's not clear to me whether you're using Cassandra or Couchbase.
As far as Cassandra is concerned, since version 2.0, you can use Lightweight Transactions which are created for the goal. A Serial Consistency has been created just to achieve what you need. In the above link you can read what follows:
For example, suppose that I have an application that allows users to
register new accounts. Without linearizable consistency, I have no way
to make sure I allow exactly one user to claim a given account — I
have a race condition analogous to two threads attempting to insert
into a [non-concurrent] Map: even if I check for existence before
performing the insert in one thread, I can’t guarantee that no other
thread inserts it after the check but before I do.
As far as the missing connection between two or more cluster its your choice how to handle it. If you can't guarantee the uniqueness at insert-time you can both refuse the registration or dealing with it, accepting and apologize later.
HTH, Carlo

historical data modelling literature, methods and techniques

Last year we launched http://tweetMp.org.au - a site dedicated to Australian politics and twitter.
Late last year our politician schema needed to be adjusted because some politicians retired and new politicians came in.
Changing our db required manual (SQL) change, so I was considering implementing a CMS for our admins to make these changes in the future.
There's also many other sites that government/politics sites out there for Australia that manage their own politician data.
I'd like to come up with a centralized way of doing this.
After thinking about it for a while, maybe the best approach is to not model the current view of the politician data and how they relate to the political system, but model the transactions instead. Such that the current view is the projection of all the transactions/changes that happen in the past.
Using this approach, other sites could "subscribe" to changes (a la` pubsubhub) and submit changes and just integrate these change items into their schemas.
Without this approach, most sites would have to tear down the entire db, and repopulate it, so any associated records would need to be reassociated. Managing data this way is pretty annoying, and severely impedes mashups of this data for the public good.
I've noticed some things work this way - source version control, banking records, stackoverflow points system and many other examples.
Of course, the immediate challenges and design issues with this approach includes
is the current view cached and repersisted? how often is it updated?
what base entities must exist that never change?
probably heaps more i can't think of right now...
Is there any notable literature on this subject that anyone could recommend?
Also, any patterns or practices for data modelling like this that could be useful?
Any help is greatly appreciated.
-CV
This is a fairly common problem in data modelling. Basically it comes down to this:
Are you interesting in the view now, the view at a point in time or both?
For example, if you have a service that models subscriptions you need to know:
What services someone had at a point in time: this is needed to work out how much to charge, to see a history of the account and so forth; and
What services someone has now: what can they access on the Website?
The starting point for this kind of problem is to have a history table, such as:
Service history: id, userid, serviceid, start_date, end_date
Chain together the service histories for a user and you have their history. So how do you model what they have now? The easiest (and most denormalized view) is to say the last record or the record with a NULL end date or a present or future end date is what they have now.
As you can imagine this can lead to some gnarly SQL so this is selectively denomralized so you have a Services table and another table for history. Each time Services is changed a history record is created or updated. This kind of approach makes the history table more of an audit table (another term you'll see bandied about).
This is analagous to your problem. You need to know:
Who is the current MP for each seat in the House of Representatives;
Who is the current Senator for each seat;
Who is the current Minister for each department;
Who is the Prime Minister.
But you also need to know who was each of those things at a point in time so you need a history for all those things.
So on the 20th August 2003, Peter Costello made a press release you would need to know that at this time he was:
The Member for Higgins;
The Treasurer; and
The Deputy Prime Minister.
because conceivably someone could be interesting in finding all press releases by Peter Costello or the Treasurer, which will lead to the same press release but will be impossible to trace without the history.
Additionally you might need to know which seats are in which states, possibly the geographical boundaries and so on.
None of this should require a schema change as the schema should be able to handle it.

Names of businesses keyed differently by different people

I have this table
tblStore
with these fields
storeID (autonumber)
storeName
locationOrBranch
and this table
tblPurchased
with these fields
purchasedID
storeID (foreign key)
itemDesc
In the case of stores that have more than one location, there is a problem when two people inadvertently key the same store location differently. For example, take Harrisburg Chevron. On some of its receipts it calls itself Harrisburg Chevron, some just say Chevron at the top, and under that, Harrisburg. One person may key it into tblStore as storeName Chevron, locationoOrBranch Harrisburg. Person2 may key it as storeName Harrisburg Chevron, locationOrBranch Harrisburg. What makes this bad is that the business's name is Harrisburg Chevron. It seems hard to make a rule (that would understandably cover all future opportunities for this error) to prevent people from doing this in the future.
Question 1) I'm thinking as the instances are found, an update query to change all records from one way to the other is the best way to fix it. Is this right?
Questions 2) What would be the best way to have originally set up the db to have avoided this?
Question 3) What can I do to make future after-the-fact corrections easier when this happens?
Thanks.
edit: I do understand that better business practices are the ideal prevention, but for question 2 I'm looking for any tips or tricks that people use that could help. And question 1 and 3 are important to me too.
This is not a database design issue.
This is an issue with the processes around using the database design.
The real question I have is why are users entering in stores ad-hoc? I can think of scenarios, but without knowing your situation it is hard to guess.
The normal solution is that the tblStore table is a lookup table only. Normally users only have access to stores that have already been entered.
Then there is a controlled process to maintain the tblStore table in a consistent manner. Only a few users would have access to this process.
Of course as I alluded to above this is not always possible, so you may need a different solution.
UPDATE:
Question #1: An update script is the best approach. The best way to do this is to have a copy of the database if possible, or a close copy if not, and test the script against this data. Once you have ensured that the script runs correctly, then you can run it against the real data.
If you have transactional integrity you should use that. Use "begin" before running the script and if the number of records is what you expect, and any other tests you devise (perhaps also scripted), then you can "commit"
Do not type in SQL against a live DB.
Question #3: I suggest your first line of attack is to create processes around the creation of new stores, but this may not be wiuthin your ambit.
The second is possibly to get proactive and identify and enter new stores (if this is the problem) before the users in the field need to do so. I don't know if this works inside your scenario.
Lastly if you had a script that merged "store1" into "store2" you can standardise on that as a way of reducing time and errors. You could even possibly build that into an admin only screen that automated merging stores.
That is all I can think of off the top of my head.

Resources