Is this database normalized enough?

I was wondering if you guys could look at this database design and tell me if it is normalized as well as it can be. Likewise, let me know if you see any problems or possible improvements.
The database essentially has a one-to-many relationship and is made for households entering an assistive living program. Each year we have them complete the same form so we can see whether any changes have occurred and analyze the data.
Head of household: Information that I do not believe will be changing. Unique to the social provided.
Contact & Address: I separated address from contact because people that enter this program are required to be living in selected homes. Likewise, people do not stay in this program forever. So it isn't uncommon for us to see different households living in the same address over a period of time.
Income: basically analyzing what they had at the beginning and what they developed over time. So unearned_entry and earned_entry would be consistent but earned and unearned will be fluctuating from each form.
Household_information: for veteran status and disabilities status, although unlikely, can still change. In regards to these two attributes, it's either yes or no (either you have someone disabled in your household or you don't). For the household size and minors, this could be potentially different since relatives may join the household, or children become adults, etc.
Program Information: attributes about each form, basically. Intake date: stays consistent (when they entered the program). Transaction type: either entry, update, or close. Exit date: null until the transaction type is close.


Customer Deduplication in Booking Application

We have a booking system where tens of thousands of reservations are made every day. Because a customer can create a reservation without being logged in, a new customer id/row is created for every reservation, even if the very same customer has already reserved in the system before. That results in a lot of customer duplicates.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script which checks for these duplicates based on some business rules (email, address, etc). The logic for the deduplication then is:
If a new reservation is created, check if the (newly created) customer for this reservation has already an old customer id (by comparing email and other aspects).
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
I don't have a very strong technical background, but to me this smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Besides that, I was hoping to understand why exactly, in terms of application architecture, this is bad design and what would be a better solution for this problem of deduplication (if it even has to be solved in "this" application domain).
I would very much appreciate any help so I can steer the engineering team in the right direction.
In General
What's the problem you're trying to solve? Free-up disk space, get accurate analytics of user behavior or be more user friendly?
It feels a bit risky, and depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse" - not because you should be paranoid, but because to not think that through feels a bit negligent. E.g. if you were a govt department matching private citizen records then that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job - If your backend system is an old legacy system then I can appreciate why a nightly batch job might be the easiest option; that said, if done correctly and with the right architecture, you should be able to do that check on the fly as needed.
Specifics
...check if the (newly created) customer for this reservation has already an old customer id (by comparing email...
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
... and other aspects.
What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone does a typo on the phone number which matches with some other customer - so you simultaneously match with more than one customer?
If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.
Feels dangerous. What happens if the detaching process screws up? I've seen situations where instead of updating the delta, the system did a total purge then full re-import... when the second part fails the entire system is blank. It's not your exact situation but you are creating the possibility for similar types of issue.
As we have several operational applications relying on that data, this creates a massive sync issue.
...case in point.
In your case, doing the swap in a transaction would be wise. You may want to consider tracking all Cust ID swaps so that you can revert if something goes wrong.
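To make the transaction-plus-tracking idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`reservation`, `customer_swap_log`) are made up for illustration and are not taken from the system described in the question; a real implementation would depend on your database and schema.

```python
import sqlite3

# Hypothetical schema: names are assumptions for this sketch only.
SCHEMA = """
CREATE TABLE reservation (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
);
CREATE TABLE customer_swap_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    reservation_id INTEGER NOT NULL,
    old_customer_id INTEGER NOT NULL,
    new_customer_id INTEGER NOT NULL
);
"""


def reassign_reservations(conn, old_customer_id, new_customer_id):
    """Move all reservations from one customer to another, logging each swap.

    The log rows and the update share one transaction, so either both
    happen or neither does.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        rows = conn.execute(
            "SELECT id FROM reservation WHERE customer_id = ?",
            (old_customer_id,),
        ).fetchall()
        for (reservation_id,) in rows:
            conn.execute(
                "INSERT INTO customer_swap_log "
                "(reservation_id, old_customer_id, new_customer_id) "
                "VALUES (?, ?, ?)",
                (reservation_id, old_customer_id, new_customer_id),
            )
        conn.execute(
            "UPDATE reservation SET customer_id = ? WHERE customer_id = ?",
            (new_customer_id, old_customer_id),
        )
    return len(rows)


def revert_all_swaps(conn):
    """Undo every logged swap, newest first, then clear the log."""
    with conn:
        rows = conn.execute(
            "SELECT reservation_id, old_customer_id "
            "FROM customer_swap_log ORDER BY id DESC"
        ).fetchall()
        for reservation_id, old_customer_id in rows:
            conn.execute(
                "UPDATE reservation SET customer_id = ? WHERE id = ?",
                (old_customer_id, reservation_id),
            )
        conn.execute("DELETE FROM customer_swap_log")
    return len(rows)
```

The point of the log table is that a bad night's run can be rolled back after the fact, not only while the transaction is still open.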
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records, just make a copy that is what the new data would be. Do this in production - you'll get a way better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What's more likely, and what you could consider building, is a "scoring" algorithm. If you are checking more than one piece of data then you'll get different combinations with different likelihood of accuracy. You can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not.
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
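As a rough illustration of what a "scoring" algorithm could look like, here is a small Python sketch. The fields, weights, and thresholds are invented for the example; you would tune them against the trial data rather than take these numbers as given.

```python
from dataclasses import dataclass

# Hypothetical weights (out of 100): how much a match on each field counts.
WEIGHTS = {"email": 60, "phone": 25, "postcode": 15}


@dataclass(frozen=True)
class CustomerRecord:
    email: str
    phone: str
    postcode: str


def normalize(value: str) -> str:
    """Lower-case and strip all whitespace so trivial differences don't matter."""
    return "".join(value.lower().split())


def match_score(a: CustomerRecord, b: CustomerRecord) -> int:
    """Sum the weights of every non-empty field that matches after normalizing."""
    score = 0
    for field_name, weight in WEIGHTS.items():
        left = normalize(getattr(a, field_name))
        right = normalize(getattr(b, field_name))
        if left and left == right:
            score += weight
    return score


def decide(score: int, merge_at: int = 80, review_at: int = 50) -> str:
    """Turn a score into an action; thresholds are assumptions to be tuned."""
    if score >= merge_at:
        return "merge"
    if score >= review_at:
        return "review"
    return "keep_separate"
```

Running this over the trial copy and sampling the "review" bucket by hand is one way to find out where the threshold for a safe ID switch really sits.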
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.
I wouldn't say that's a terrible design; it's just a simple approach to solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the number of new bookings received during the day, which may vary from day to day, so other workflows that depend on it will be impacted.
This approach can be improved by processing new bookings in parallel, and using an index to get a fast lookup when checking if a new e-mail already exists or not.
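You can also check out Bloom filters: a space-efficient data structure that can tell you for certain that an element is not in a given set (a "yes" answer only means the element might be in the set, so it still needs a real lookup).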
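For anyone who hasn't seen one, this is roughly what a Bloom filter looks like. It is a toy version in plain Python, with sizes picked arbitrarily for illustration; in practice you would probably use an existing library and size it for your expected number of emails.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not seen"; True means "possibly seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

In the deduplication job, a `False` from `might_contain(email)` would let you skip the database lookup entirely, while a `True` still has to be confirmed against the real data.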
The way I would do it is to store the bookings in a NoSQL DB table keyed on the user's email. You get the email in both situations, whether the user has an account or makes a booking without one, so you just have to do a lookup to get the bookings by email, which makes that deduplication job redundant.

Database schema naming conventions and common mistakes?

(https://i.stack.imgur.com/VYkV6.png) :
I'm asked to design a relational database to keep data to answer clinic operation queries such as:
● List the patient appointments for each doctor for a given date.
● When a patient rings to make an appointment, give the available time slots for a given date.
● Retrieve the address of patients to send notices via mail services.
I have one database schema of one relation as shown below, but I was wondering whether there were any mistakes I've made?
ABC(doc-name, doc-gender, registration_num, qualification, pat-name, pat-gender, DOB, address, phone-num, appoint-date, appoint-time, type)
Is the use of words such as date and the use of hyphens generally discouraged? Are there any other weaknesses in my design?
Thank you
So, that's not a schema or a design. Not for a relational database, which, based on the tags for the question, is what you're looking for. That's the storage definition for an ID/Value style of database. If you're looking for actual relational storage, you should be building out those relationships through the process of normalization.
For example, let's start at the beginning with doc-name (I am personally not crazy about using hyphens, but it's not a showstopper, so at least on that note, be sure whichever RDBMS you're working with supports them in the name and then you're good to go). If we think about this just from a data entry stand point, we don't want to have to type in the name of the doctor every time we use that doctor. Instead, we'd want to pull that from a list. So, clearly, we can break that apart from the rest of the information. There is the beginning of our normalization process. We can also easily note the fact that a patient is likely to have more than one appointment. Under the current structure, we'd have to re-enter every bit of patient information prior to the appointment. There's another place where we'd break this apart.
There is tons more to this simple example that could be split out and normalized.
I'd suggest you read up on data normalization. My favorite teacher on the subject is Louis Davidson. Here's his book on the topic. Read that and then try to readdress the situation you're facing.
I'm assuming this isn't just homework. If it is, currently, I'd give you an "F". If it isn't, you should track down someone to give you a hand with this database design. You won't be able to quickly read Louis' book on the topic and turn around even a rough working design in any reasonable period of time.
I have to second what Grant said, this is not a relational design at all.
Stop and ask yourself for example what happens if Steven Arrow has to take an afternoon off and update his schedule. You need to be very careful updating the database lest you reassign all his patients.
Spending a total of 5 minutes on this, I see at the very least:
A Doctors table, a Patients Table, and probably a table of open appointment times (which btw, is a bit harder than you think, so you have to give some thought how to handle that and some reading up on tables for scheduling).
That's for starters. I might break out patients' phone numbers into their own table. Why? Well, how many columns do you want to have for phone numbers? 1? What if they have a work AND home number? Or a work and cell and home? And more.
The concept you're looking for is normal forms. You don't need to go overboard, but generally 3NF is about right.
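To give a feel for where that split ends up, here is one possible starting point, written with Python's sqlite3 so it can be run as-is. The table and column names are my own guesses based on the attributes in the question (underscores instead of hyphens), and it deliberately leaves out the harder "open time slots" part.

```python
import sqlite3

# A sketch only: names and constraints are assumptions, not a finished design.
SCHEMA = """
CREATE TABLE doctor (
    doctor_id        INTEGER PRIMARY KEY,
    registration_num TEXT NOT NULL UNIQUE,
    name             TEXT NOT NULL,
    gender           TEXT,
    qualification    TEXT
);
CREATE TABLE patient (
    patient_id    INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    gender        TEXT,
    date_of_birth TEXT,
    address       TEXT
);
-- One row per phone number, so a patient can have home, work, cell, ...
CREATE TABLE patient_phone (
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    kind       TEXT NOT NULL,
    phone_num  TEXT NOT NULL,
    PRIMARY KEY (patient_id, kind)
);
CREATE TABLE appointment (
    appointment_id INTEGER PRIMARY KEY,
    doctor_id      INTEGER NOT NULL REFERENCES doctor(doctor_id),
    patient_id     INTEGER NOT NULL REFERENCES patient(patient_id),
    appoint_date   TEXT NOT NULL,
    appoint_time   TEXT NOT NULL,
    visit_type     TEXT,
    -- A doctor cannot be booked twice for the same slot.
    UNIQUE (doctor_id, appoint_date, appoint_time)
);
"""


def appointments_for(conn, doctor_id, appoint_date):
    """List (time, patient name) for one doctor on one date, in time order."""
    return conn.execute(
        "SELECT a.appoint_time, p.name "
        "FROM appointment a JOIN patient p ON p.patient_id = a.patient_id "
        "WHERE a.doctor_id = ? AND a.appoint_date = ? "
        "ORDER BY a.appoint_time",
        (doctor_id, appoint_date),
    ).fetchall()
```

With this shape, changing a doctor's details touches one row in `doctor`, and a patient's details are entered once no matter how many appointments they have.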

How do I structure multiple Identity Data in a database

I am designing a database for a credit bureau and am seeking some guidance.
The data they receive from banks, MFIs, Saccos, utility companies etc. comes with various types of IDs. E.g. it is perfectly legal to open a bank account with a National ID and also with a Passport. One scenario that has me banging my head is that Customer1 will take a credit facility (call it a loan for now) in Bank1 with the passport, then go to Bank2 and take another loan with their NationalID, and to Bank3 with their MilitaryID. Eventually, when this data comes from the banks to the bureau, it would be seen as 3 different people while we know that it's actually 1 person. At this point, there is nothing we can do as a bureau.
However, one way out (for now) is using the Govt registry, which provides a repository that holds both passports and IDs. So once we query for this information and get a response, how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Again, a person's name could be captured in various orders. Bank1 could use FName, LName, OName while Bank3 uses LName, FName only. How do I store these names?
Even against one ID type, e.g. NationalID, you will often find misspelt or missing names. So one NationalID in our database could end up with about 6 different names because the person's name was captured differently by the various banks where he has transacted.
And that is just the tip of the iceberg. We have issues with addresses, telephone numbers, etc etc.
Do you have any insight as to how I'd structure my database to ensure we capture all data from all banks and provide the most accurate information possible regarding an individual? Better yet, do you have experience with this type of setup?
Thanks.
how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Trivial.
You have an identity table with an AlternateId field that is set if the identity is linked to another one. Use the first identity you created as the master. Any alternative will have its AlternateId pointing to it.
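A bare-bones sketch of that self-referencing identity table, using Python's sqlite3 so it can be tried out directly. The column names (`id_type`, `id_number`, `alternate_id`) are placeholders I picked for the example, and this ignores the versioning and name-variant problems entirely.

```python
import sqlite3

# Hypothetical minimal schema; alternate_id is NULL for the master identity.
SCHEMA = """
CREATE TABLE identity (
    identity_id  INTEGER PRIMARY KEY,
    id_type      TEXT NOT NULL,
    id_number    TEXT NOT NULL,
    alternate_id INTEGER REFERENCES identity(identity_id),
    UNIQUE (id_type, id_number)
);
"""


def master_of(conn, id_type, id_number):
    """Return the master identity_id for a given document, or None if unknown."""
    row = conn.execute(
        "SELECT identity_id, alternate_id FROM identity "
        "WHERE id_type = ? AND id_number = ?",
        (id_type, id_number),
    ).fetchone()
    if row is None:
        return None
    identity_id, alternate_id = row
    return alternate_id if alternate_id is not None else identity_id


def all_ids_for(conn, master_id):
    """Return every (id_type, id_number) tied to one master, master first."""
    return conn.execute(
        "SELECT id_type, id_number FROM identity "
        "WHERE identity_id = ? OR alternate_id = ? "
        "ORDER BY identity_id",
        (master_id, master_id),
    ).fetchall()
```

Once the registry confirms that three documents belong to one person, the second and third rows simply get `alternate_id` set to the first row's `identity_id`, and any of the three numbers resolves to the same master.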
You need to separate the identity from the data in it, so you can have alternate versions of it, possibly with an origin and timestamp. You likely need to fully support versioning and tying different identities to each other as alternatives, including generating a "master identity", possibly by algorithm, with the "official" (i.e. consolidated) version of your data.
The details are complex. Mostly you have to make a LOT of compromises without killing performance, so in the end HIRE A SPECIALIST. There is a reason there are senior database designers and architects out there with 20+ years of experience finding the optimal solution given constraints you may not even be aware of (application wise).
Better yet, do you have experience with this type of setup?
Yes. Try financial information. Stock symbols / feeds / definitions are not necessarily compatible and vary depending on whom you get them from. Any non-trivial setup has different data feeds that may show the same item slightly differently, sometimes in error. Different name, sometimes different price (example: ES, CME group, is 50 USD per point, but on TT Fix it is 5; to make up for it, the price is multiplied by 10, so instead of 1000.25 you get 10002.5). This is the same kind of consolidation, and it STINKS.
Tons of code, tons of proper database design, redoing it half a dozen times to get the proper performance. This is tricky, sadly.

What are the arguments against merging contact details into a single field?

We have a customer that insists on putting contact details, at this time first and last names, into a single field. Take, for example, Mr. Bob Smith and Mrs. Jane Smith. Mr. Bob and Mrs. Jane would be entered into the first name field and Smith would be entered into the last name. It gets messier if the contacts have different last names or if there is a hyphenated name. The customer only wants one contact record so they came up with this system and implemented it on their own.
Our system is designed around contacts, and each individual person is intended to be an individual contact, even if married. Due to some of the attributes we must assign to people and the notes we need to keep, a contact-centric approach is best. The above issue occurs in about 1/3 of the cases we handle.
Internally, my team has discussed how to sell the customer on using the database the way it was designed. We listed form letters and contact lists as being the main reasons for keeping the data clean and in the fields we designed. For example, using our recommendation, the customer will have much more granular control over form letter creation and sorting of data.
Any suggestions for how we sell this to the customer?
Tell them what they can get out of your system is only as good as what gets put in. If they want to enter inconsistent data, the cost they'll pay down the line is the inability to generate letters or mailing lists in the future.
They may need to learn this lesson the hard way for themselves. I see more problems with switching the names, for example, entering Smith as the first name and Bob as the last.
Also, can you make both fields required?
It sounds like what they want to enter is similar to AddressLine1, AddressLine2. It's just a poor design. I thought you had 2 name fields but they would only enter data in one of them (the first name).
All you can do is try to help them when they ask for it. They'll get the system they deserve.
Just show your customer the normal forms for database design:
> http://www.phlonx.com/resources/nf3/
Tell him that these normal forms are designed to make the database more manageable over time and make it more flexible.
Can't you just create a view that holds First and Last name together? For some servers you can also create editable views... So your customer will be happy and data will be stored normalized.
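For example, something along these lines, shown with Python's sqlite3 so it runs without a server. The `contact` table and the `contact_display` view are names I made up for the sketch. Note that a plain view in SQLite is read-only, so whether the customer could also edit through it depends on the database you are actually using.

```python
import sqlite3

# Names are stored normalized; the view only changes how they are presented.
SCHEMA = """
CREATE TABLE contact (
    contact_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
CREATE VIEW contact_display AS
SELECT contact_id,
       first_name || ' ' || last_name AS full_name
FROM contact;
"""


def make_demo_db():
    """Build an in-memory database with two sample contacts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    with conn:
        conn.executemany(
            "INSERT INTO contact (first_name, last_name) VALUES (?, ?)",
            [("Bob", "Smith"), ("Jane", "Smith")],
        )
    return conn


def full_names(conn):
    """Read the combined names through the view, in insertion order."""
    rows = conn.execute(
        "SELECT full_name FROM contact_display ORDER BY contact_id"
    ).fetchall()
    return [name for (name,) in rows]
```

The customer sees one combined field, while form letters and sorting can still work from `first_name` and `last_name` separately.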
I'd try to put it in terms of money and time. You're going to spend more time trying to keep duplicates out of a db with their design, more time building relevant reports or queries (constantly having to parse a block name field... do they want address all in one too?!?), more money to scrub the data (either themselves or someone else) if they ever want to send the data to a third party for analysis and metrics.
It sounds like they don't want to let go of their design, maybe partly because they understand it. You may want to try and meet them halfway somehow at first, and involve them in the process of making incremental improvements to the design. That way they can see and understand the benefits that right now may just be over their head, pushing them out of their comfort zone. They have to trust you with their baby :)
The best argument is that you won't be responsible for the behavior of the database unless they put things where they belong.
If they want to make a single mailing to each "household", then I'm sure your app can do that. (Probably already does.) Y'all just have to come to terms on what "household" means. Since there may be rented rooms or long-term guests, it doesn't always mean "only one mailing piece per address".
FWIW, I've been doing this stuff for decades, and I still find doctors and attorneys (and their staffs) the hardest people to deal with. One time, I walked out of a meeting (and, of course, lost the chance to bid on the contract) when a doctor's IT guy stood up, pounded his fist on the table, and screamed at me over and over, "Doctors are not people! Doctors are not people!".

Atomicity of field for part numbers

In our internal inventory application, we store three values (in separate fields) that become the printed "part number" in this format: PPP-NNNNN-VVVV (P = Prefix, N = Number, V = version).
So for example, if you have a part 010-00001-0001 you know it's version 1 of a part of type "010" (which let's say is a printed circuit board).
So, in the process of creating parts engineering wants to group parts together by keeping the "number" component (the middle 5 digits) the same across multiple prefixes like so:
001-00040-0001 - Overall assembly
010-00040-0001 - PCB
015-00040-0001 - Schematics
This seems problematic and frustrating as it sometimes adds extra meaning to the "number" field (but not consistently since not all parts with the same "number" component are necessarily linked).
Am I being a purist or is this fine? 1NF is awfully vague with regard to atomicity. I think I'm mostly frustrated because of the extra logic needed to ensure that the next "number" part of the overall part number is valid and available for all prefixes.
There have been a number of enterprises that have foundered, or nearly foundered, on the "part number syndrome". You might be able to find some case studies. DEC part numbers were somewhat mixed up.
The customer is not always right, but the customer is always the customer.
In this case, it sounds to me like engineering is trying to use a single number to model a relationship. I mean the relationship between Overall assembly, PCB, and Schematics. It's better to model relationships as relations. It allows you more flexibility down the road. You may have a hard time selling engineering on this point.
In my experience, regardless of database normative rules, when the client/customer/user wants something done a certain way, there is most likely a reason for it, and that reason will save them money (in some fashion). Sometimes it will save money by reducing steps, by reducing training costs, or simply because That's The Way It's Always Been. Whatever the reason, eventually you'll end up doing it because they're paying to have it done (unless it violates accounting rules).
In this instance, it sounds like an extra sorting criterion on some queries for reports, and a new 'allocated number' table with an auto-incrementing key. That doesn't sound too bad to me. Ask me sometime about the database report a client VP commissioned strictly to cast data in such a fashion as to make a different VP look bad in meetings (not that he told me that up front).
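Roughly what I have in mind for the 'allocated number' table, sketched with Python's sqlite3. The table and column names are invented for the example, and the formatting helper just follows the PPP-NNNNN-VVVV layout from the question.

```python
import sqlite3

# Hypothetical schema: one row per allocated "number", shared by all prefixes.
SCHEMA = """
CREATE TABLE allocated_number (
    number      INTEGER PRIMARY KEY AUTOINCREMENT,
    description TEXT
);
CREATE TABLE part (
    prefix  TEXT NOT NULL,
    number  INTEGER NOT NULL REFERENCES allocated_number(number),
    version INTEGER NOT NULL,
    PRIMARY KEY (prefix, number, version)
);
"""


def allocate_number(conn, description):
    """Reserve the next free "number" for every prefix at once."""
    with conn:
        cur = conn.execute(
            "INSERT INTO allocated_number (description) VALUES (?)",
            (description,),
        )
    return cur.lastrowid


def add_part(conn, prefix, number, version=1):
    """Register one part under an already allocated number."""
    with conn:
        conn.execute(
            "INSERT INTO part (prefix, number, version) VALUES (?, ?, ?)",
            (prefix, number, version),
        )
    return part_number(prefix, number, version)


def part_number(prefix, number, version):
    """Format the printed part number as PPP-NNNNN-VVVV."""
    return f"{prefix}-{number:05d}-{version:04d}"
```

Because the number is handed out once from its own table, there is no per-prefix check for "is this number still free"; the grouped parts simply share the same allocated row.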
