Logic to identify a person uniquely

Logic to identify a person uniquely - database

I am working on a medical php application which will be implemented at national level.
It will be used by multiple hospitals and the patient record will be centralized i.e every hospital will be accessing and adding the patient records into same database.
I want that there should be only 1 record of a patient without any duplication. Simply speaking no hospital can again enter the 2nd record for same patient but in order to make it possible I need to know which criteria should we use which will remain fix throughout the entire lifetime of a patient. Only 2 are there in my mind i.e Name and Date of birth.
What other criterias can be there? I dont want to use mobile numbers and phone numbers etc. Moreover infants cant be having it. I need the criteria which will be there for every patient and unique.
Please give me your suggestions or any other better way to implement this functionality?

I'll take a shot because I've been involved in some data matching and validation, although not specifically in the medical industry. You haven't specified a particular country, just mentioned Asia, so I'll use an example from my home country of Australia just because I'm familiar with the rules and I believe the same would apply to many Asian countries:
We have a unique Medicare number used for health care, but it's not mandatory and while the free / discounted care means I expect 99%+ of people would have one you can't rely on it.
There is also a tax file number, likewise not mandatory even if you
work and people who have never had a job wouldn't normally have one.
You might be dealing with foreign people that aren't residents.
Drivers licenses are of course not mandatory to get healthcare.
It's perfectly legal to have "no fixed address". Plus some people will lie to get treatments and repeats of drugs etc. Not to mention many people move often.
Changing name is common in case of marriage / divorce and unless done
for illegal purposes someone can change their name just because they
don't like their original. Not to mention people use common substitutions for various things like Jim versus James.
Typing mistakes will be very common over a large dataset.
In short I think the 'perfect' scheme you are asking for is impossible. The best you can do is apply a weighting rule to find likely duplicates. Same name / date of birth / place of birth for example is an unlikely but possible event so show a warning to the data entry operator it's a likely duplicate and let them see the details of the likely duplicate. Even things like a drivers license number that should be unique may indicate that the original entry just had a data entry error, not a new duplicate.
From my experience the best thing is a report that lists likely duplicates that must be reviewed by someone higher up the chain, and give them an easy option to merge the duplicates. Then you can start to use more vague regex expressions that throw a few false positives that can be dismissed when a human reviews them. You can also refine the model over time to get the best match results.

Combination of name, date of birth, blood group, place of birth etc., can be tried.

You need to use some national-wide ID. Like Passport ID, or health insurance number.

Social Insurance Number with country.

Related

Database design, an included attribute vs multiple joins? Confused

So I am taking a class in database design and management and am kind of confused from a design perspective. My example is an invoice system. I just made it up quick so it doesn't have a ton of complexity in it.
There are Customers, Orders, Invoices and Payments entities
Customers
CustId(PK),
Street,
Zip,
City,
..
Orders
OrderID(PK)
CustID(FK)
Date
Amt
....
Invoices
InvoiceID(PK),
OrderID(FK),
Date,
AmtDue,
AmtPaid,
....
Payments
PaymentNo(PK),
InvoiceID(FK),
PayMethod,
Date,
Amt,
...
Customer entity has a one to many relationship with Orders
Purchases entity has a one to many relationship with Invoices
Invoices Entity has a one to many relationship with Payments.
To get the results of a query to list all Payments made by a Customer the query would have to join Payments with the Invoice table, the Invoice table with the Orders table and the Orders table with the Customer table.
Is this the correct way to do it? One could also just put a custID in the payment entity which would then just require one join, but then there is unneeded information in the payment entity. Is this just a design thing or is it a performance issue?
Bonus question. Lets say there should be a report that says what the total customer balance is. Does there need to be a customer balance field in the database or can this be a calculated item that is produced by joining tables and adding up the amount billed vs amount paid?
Thanks!

Is this the correct way to do it?
Yes. Based on the information provided, it looks reasonable.
One could also just put a custID in the payment entity which would then just require one join, but then there is unneeded information in the payment entity. Is this just a design thing or is it a performance issue?
The question you're asking falls under "normal forms", often called normalization. Your target should be Boyce-Codd normal form (similar to 3NF), which should be described in your textbook. I will warn you that misinformation and misuderstanding of database design issues is very abundant on the interwebs, so beware of which answers you pay attention to.
The goal of normalization is to eliminate redundancy, and thus to eliminate "anomaliies", whereby two logically equivalent queries produce inconsistent results. If the same information is kept in two places, and is updated in only one, then two queries against the two different values will produce different -- i.e, inconsistent -- results.
In your example, if there is a Payments.CustID, should I believe that one, or the one derived from joining Payments to Orders? The same goes for total customer balance: do I believe the stored total, or the one I computed from the consituents?
If you are going to "denomalize for performance", as is so often alleged to be necessary, what are you going to do to ensure the redundant values are consistent?
Bonus question. Lets say there should be a report that says what the total customer balance is.
As a matter of fact, in practice balances are sort of a special case. It's often necessary to know the balance at points in time. While it's possible to compute, say, monthy account balances from inception based on transactions, as a practical matter applications usually "draw a line in the sand" and record the balance for future reference. Step are taken -- must be, for the sake of the business -- to ensure the historical information does not change or, if it does, that the recorded balance is updated to reflect the change. From that description alone, you can imagine that the work of enforcing consistency throughout the system is much more work than relying on the DBMS to enforce it. And that is why, insofar as is feasible, it's better to elimate all redundant data, and let the DBMS do the job it was designed to do.
In your analysis, seek Boyce-Codd normal form. Understand your data, eliminate the redundancies, and recognize the relations. Let the DBMS enforce referential integrity. Countless errors will be avoided, and time saved. Only when specific circumstances conspire to show that specific business requirements cannot be satisfied on a particular system with a given, correct design, does one begin the tedious and error-prone work of introducing redundant information and compensating for it with external controls.

"Is this the correct way to do it?" Of course, given your current design. But it's not the ONLY way. So you're studying DB "normalization" and seeing the pros and cons of the various "forms" of normalization. In the "real world" things can change on a dime, due to a management decision or whatever. I tend to use "compound primary keys" instead of simply one field for primary and others as FK. I handle my "FK" programmatically instead of relegating that responsibility to the DB.
I also create and utilize a number of "intermediate" tables, or sometimes "VIEWS", that I use more easily than a bunch of code with too many JOINs. (3rd Normal form addicts can hate, but my code runs faster than a scalded rabbit).
An Order means nothing without a Customer; an Invoice means nothing without an Order; a Payment is great, but means nothing without both an Order and Invoice. So lemme throw this out there -- what's wrong with having a "summary" type of entity that has Cust, Order, Invoice #, and Payment Id ?

Modeling Fact Tables that have direct relationships, but at a detail and not a dimension layer

This is very similar to my issue.
http://forum.kimballgroup.com/t2534-modeling-fact-tables-that-have-direct-relationships-but-at-a-detail-and-not-a-dimension-layer
I’ve got a fact table for POs, Supplier Invoices, Payments, Receipts, etc. They have some dimensions in common, others not. Problem is, for example, say if they are looking at invoices by their gl account, (using an excel pivot table connected to the cube) then they expect to be able drop in a column for the PO number, the buyer of the PO, etc. Even though the buyer dimension is only related to the PO, and the account dimension is only related to the invoice. But they say, well the PO is related to the invoice, so you should be able to pull it in.
I do have a PO Ref field on the invoice fact table, but it is only filled out 50% of the time. Even when it is, you could have a one to many relationship in either way between a PO and an invoice, as far as I understand it at least.
Anyway, they expect to be able to throw in any measure from any measure group, and every single possible dimension to work, and then be able to drill down to the detail to see the POs, Invoices, Payments and Receipts and how they match up. Best practice is to keep the fact tables separate if they are different grains according to Kimball, but then all the business problems aren't solved this way.
The only solutions I can come up with are:
to either tack on a bunch of detail related columns to the degenerate dimensions when I load them. i.e. add PO to invoice and invoice to PO etc., but have it as a comma separated list in that column when it is many to one.
Create every possible relationship with every fact and dimension table. This would be a lot of work though, and some still may not have a relationship to certain dimensions.
Create a monstrous fact table with all the current ones joined together, and somehow figure out logic to only display the measure values once for the many to one joins.
This is probably a bad idea, but thought maybe somehow I could create a relationship between every measure group and the corresponding degenerate dimensions reference field. Like create a relationship between the supplier invoice degenerate dimension PO Ref field and the purchase order line measure group PO field.
Lower their expectations, lol.
Here's a screen shot of the dimension usage tab to give an idea of what it looks like currently.

I tried option 3 once. The performance was terrible. The output was misleading. Never ever again.
Your best bet is to work with the business. Where the data is not readily available (invoice without PO, for example) agree what should be done. You could show a default value (PO not recorded on invoice). You could agree on a logic, implemented in the ETL, that extracts the most likely PO.
Whatever approach you choose you must discuss it. If you do not the business will make decisions based on false assumptions. The business will find itself looking at reporting it does not understand. You must help your users to avoid these outcomes.
Once the approach has been agreed, document it. When queries arise, share the documentation. Make sure the documentation highlights all calculations, difficulties and missing source data.
Work with the teams that generate your source date. If an important field is sparsely populated arrange a meeting. See if the capture processes can be improved. Let your users know that you are investigating this area. Keep them informed of the outcome. If the source data cannot be improved (invoices continue to be raised without a PO), inform your users of the reasons for this.
Managing your customers can be challenging. Especially those who hold senior positions in the company. Transparency and solid documentation will help you.

Common identifier for a person across disparate systems

Not sure which is the best Stack Exchange site for this, so will try my hand here.
I have a web application that stores user disciplinary data for organisations. Rather than clients enter their staff into multiple systems, some want to push the basic personnel data into ours (data such as First Name, Surname, DOB, Job Title etc) from their source (e.g. HR/ERP) databases.
Our clients are using a range of existing systems to store their data, such as Oracle, SAP, JD Edwards, etc.
I am familiar with the technical methods to get this data (e.g. web service, web API), but not for a case such as when a person's surname changes (e.g. Janet Smith gets married and becomes Janet Doe). Unless there is a unique identifier for that person across both systems, I can't see how that change can be managed reliably.
How is this process best-managed please? Is an additional field added to the destination database that contains the UID of the source data? Or, do both parties agree on a common field, e.g. employee number, that never changes?

This issue arises in many circumstances. One case is in the Texas school system where students are tracked longitudinally through numerous education subsystems. A social security number, providing a unique identifier in some cases (although not all) was considered too sensitive for use. Thus, a unique identifier has been generated for each student and staff member. This is part of the permanent information associated with each individual, regardless of employment change, location change, or name change.
This link describes the rationale for the unique id.
This link is the documentation on the Texas Student Data System (TSDS) unique identifier. You might find the XML examples at the end of the document of most interest. Much of the information involves submitting requests for an id where demographic information is needed for disambiguation.
Basically, something similar to a Java UUID as an extra field in the database should be sufficient to achieve your aim.
Hope this helps.

Yes, the UID is the only solution. This problem comes up in medical systems too, for example. Another is photos, I'm not sure which causes more problems!

I know approach which is using "external_id" field for that. Several external ids can be exploited in case of many systems.

Does this database model make sense?

I am new speaking about modelling databases. But I give my best to learn as much as possible by my own. Therefore I want to ask you, whether my first attemp make sense for the following example:
So I modeled the database as followed:
The databse is about medicine. There are several medicine items which should be dosed depending on the age of the patient. Every medicine item can belong to one ore more groups (or none).
This is just a test case to show what I learned so far. So every tip to improve my skills is welcome!
Thanks a lot!

The relationtable table name is just a placeholder, right? It should be more descriptive, maybe dosage?
Something tells me that age ranges will greatly vary. Some medicines have different rules for children under 3 years, other under 5, 10, and so on. Instead of creating a separate table, just include two extra columns (start and end) in relationtable. It will be much easier to query and I won't consider this a denormalization.
Talking about age and dose tables - get rid of unit column and use normalized, fixed unit. Years for age and mg for doses. This will make querying much simpler. Don't be afraid to use floating numbers, e.g. 0.5 to represent six months.

I agree with what Tomasz write and would like to add:
If the relationtable is the correct way to go depends on some knowledge not contained in the table. It sounds strange that one medicine can be part of different groups and that the dosage depends on that relation. I would expect that a medicine can belong to different groups (resulting in a medicine2group mapping table) and that their exist different dosages depending on the age for a medicine (so you get dosage4age table, combining the existing age and dose tables. That new table would directly reference the medicine)
Which version is correct can not be told from the table alone.
As a rule of thumb: I get skeptical when a table without a proper name and concept links more then two other table. It is possible but often hints at concept hiding somewhere.
In order to check if the proposed model is correct, ask the business experts if the table is still correct if you replace Antibiotika with Superantibiotika in one of the first three rows. If it is, this means that the dosage does not depend on the group and should not be linked to it, so the model proposed by me would be more correct.
If the altered table is not correct, your model might be the better one, but I would listen carefully about the explanation why it isn't correct.

Is normalizing the gender table going too far?

I am not a database guy, but am trying to clean up another database. So my question is would normalizing the gender table be going too far?
User table:
userid int pk,
genderid char(1) fk
etc...
gender table:
genderid char(1) pk,
gender varchar(20)
Now at first it seemed silly to me, but then I considered it because i can then have a constant data source to populate from or bind from. I will be using WPF. If it was another framework I would probably avoid it, but what do you think?

Whether or not you choose to normalize your table structure to accomodate gender is going to depend on the requirements of your application and your business requirements.
I would normalize if:
You want to be able to manage the "description" of a gender in the database, and not in code.
This allows you to quickly change the description from Man/Woman to Male/Female, for example.
Your application currently must handle, or will possible handle in the future, localization requirements, i.e. being able to specify gender in different languages.
Your business requires that everything be normalized.
I would not normalize if:
You have a relatively simple application where you can easily manage the description of the gender in code rather than in the database.
You have tight programmatic control of the data going in and out of the gender field such that you can ensure consistency of the data in that field.
You only care about the gender field for information capture, meaning, you don't have a lot of programmatic need to update this field once it is set the first time.

I'm also not a database guy but I do it. It gives me the possibility to assure that only the genders are entered, that are valid (referencial integrity) and I can also use it to populate the selection control.

I can think of applications where I'd use different columns for sex and gender, have three values for sex (male/female/decline to state) and six for gender (male/female/transgendered male/transgendered female/asexual/decline to state). Granted, I live in San Francisco, where there's an level of public discussion of transgender issues that much of the rest of the world is behind the curve on.
The point is: without a compelling reason to think otherwise, I'd assume that any simplifying assumption I made about demographics was limited and parochial. The cost of breaking sex out to its own table is small now and expensive later. I wouldn't avoid the small cost on the basis of an assumption.

Well, your company might have a requirement that, if possible, everything be normalized.
Also, depending on the business & data, you might need to include transgenders as well which would create 3+ genders (I don't know how many there are, haven't checked)

I'll remark on another aspect: sorting. Normally, 'M' sorts after 'F'; in a project one time, a database table had a gender field with either of those two values. There was a desire to be able to sort results on the gender (census data) and a further preference to have 'M' appear before 'F'. My solution was to add a separate lookup table, assigning the Male value an ID of 0, and Female an ID of 1. So queries on the main table could easily be sorted on the new genderID field.

Just thought I'd throw an opinion in here. #Ben McCormack has a great answer with a minor caveat: Regarding localization, there are sometimes better ways of handling this than having the values defined directly in your database.
For example, you mention WPF. With .Net you have various localization resources that are much better suited to managing differences in whether to emit "Male" or "Samec" (Czech).
By letting the built in localization features take care of this you don't have to worry about having multiple database records defining the exact same thing.. which could complicate reporting.
That said, I'd suggest that you might want to consider if "gender" is really what you are after. Gender is defined as "a set of characteristics distinguishing between male and female".
On the face of it this sounds like your standard Male/Female options; but it's not. Gender is much more complicated than that as it needs context in order to have meaning. For example, in the context of a relationship a Male (by Sex) could have one of several "genders": Masculine, Feminine or even Neutral. This is regardless of what sex their partner is.
In the context of just an individual, a Male (by Sex) might be Masculine, Feminine, Neutral, Transgender, Intersex or any of a number of other options acceptable to the person filling out the form.
At least one person commented that Gender is necessary in order to determine the honorific used in mailings. I'd suggest that there is no relationship between gender and those honorifics. For example, a Female (by Sex) might want to be addressed as Ms/Miss/Mrs/Dr/Madam/Professor or even Mr if they are in the process of, or have completed, surgery to become "Male". That list is by no means all inclusive and in any event it's much better to allow that person to select how they want to be addressed.
Which leads me to my last item: Before collecting any piece of data you should have a defined reason for having it. My company specializes in data collection through online forms. One of the things we do is look at what our clients are asking for and go field by field to determine if the data is even used anywhere.
More often than not an entity (company/governmental/etc) asks for far more information than they care about. This can have additional consequences in the event the data is lost, stolen or simply viewed by unauthorized individuals. Further there is a burden placed on the person filling out the forms for each field they are asked to complete.
I bring this up because "Gender" is almost never needed for any normal system. Instead, Sex is a better qualifier and even then it has little value. Exempting dating sites and governmental census.

Yes. I think that You can use enum in code and bind eventuatly to it.
null - unknow ;
0 - male ;
1 - female;
or you can use bool type to define this
null - unknow; true - male; false - female

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight