Common identifier for a person across disparate systems - database

Not sure which is the best Stack Exchange site for this, so will try my hand here.
I have a web application that stores user disciplinary data for organisations. Rather than have clients enter their staff into multiple systems, some want to push basic personnel data (First Name, Surname, DOB, Job Title, etc.) into ours from their source (e.g. HR/ERP) databases.
Our clients are using a range of existing systems to store their data, such as Oracle, SAP, JD Edwards, etc.
I am familiar with the technical methods to get this data (e.g. web service, web API), but not with how to handle a case such as a person's surname changing (e.g. Janet Smith gets married and becomes Janet Doe). Unless there is a unique identifier for that person across both systems, I can't see how that change can be managed reliably.
How is this process best managed? Is an additional field added to the destination database that contains the UID of the source data? Or do both parties agree on a common field, e.g. employee number, that never changes?

This issue arises in many circumstances. One case is the Texas school system, where students are tracked longitudinally through numerous education subsystems. A social security number, which provides a unique identifier in some cases (although not all), was considered too sensitive for use. Thus, a unique identifier has been generated for each student and staff member. This is part of the permanent information associated with each individual, regardless of employment change, location change, or name change.
This link describes the rationale for the unique id.
This link is the documentation on the Texas Student Data System (TSDS) unique identifier. You might find the XML examples at the end of the document of most interest. Much of the information involves submitting requests for an id where demographic information is needed for disambiguation.
Basically, something similar to a Java UUID as an extra field in the database should be sufficient to achieve your aim.
Hope this helps.

Yes, the UID is the only solution. This problem comes up in medical systems too, for example. Another example is photos; I'm not sure which causes more problems!

One approach I know of is to add an "external_id" field for this. If there are many source systems, several external IDs can be stored per person.
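The external ID idea above can be sketched as follows. This is only a minimal illustration using an in-memory SQLite database; the table names (person, person_external_id), the source label 'client_a_hr' and the employee number 'E1001' are all assumptions made up for the example, not anything from the systems mentioned in the question.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id  TEXT PRIMARY KEY,   -- our own UUID, never changes
    first_name TEXT NOT NULL,
    surname    TEXT NOT NULL
);
CREATE TABLE person_external_id (
    source_system TEXT NOT NULL,   -- hypothetical label for a client's HR/ERP system
    external_id   TEXT NOT NULL,   -- the stable ID agreed with that source
    person_id     TEXT NOT NULL REFERENCES person(person_id),
    PRIMARY KEY (source_system, external_id)
);
""")


def upsert_person(source_system, external_id, first_name, surname):
    """Create or update a person, matching on the external ID only (never on the name)."""
    row = conn.execute(
        "SELECT person_id FROM person_external_id "
        "WHERE source_system = ? AND external_id = ?",
        (source_system, external_id),
    ).fetchone()
    if row:
        person_id = row[0]
        conn.execute(
            "UPDATE person SET first_name = ?, surname = ? WHERE person_id = ?",
            (first_name, surname, person_id),
        )
    else:
        person_id = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO person VALUES (?, ?, ?)",
            (person_id, first_name, surname),
        )
        conn.execute(
            "INSERT INTO person_external_id VALUES (?, ?, ?)",
            (source_system, external_id, person_id),
        )
    return person_id


# The same employee is pushed twice, before and after a surname change.
first = upsert_person("client_a_hr", "E1001", "Janet", "Smith")
second = upsert_person("client_a_hr", "E1001", "Janet", "Doe")
```

Because the match is made on (source_system, external_id), the second push updates the existing row instead of creating a second Janet.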


Have I resolved my database relationships correctly?

I'm sorry if this is the wrong place for this question. I volunteer for a charity group that has to store sensitive data. As we are a new type of organisation, there are no systems that fit our needs or our budget. Someone else started building the database, but I wasn't sure he was resolving the relationships correctly, so I presented him with an alternate ER model. We haven't heard back from him since, so I am left to build it by myself.
As we have to store sensitive data, I'm reluctant to put my database design on here in its entirety, so if there is a way I can privately discuss this with someone, that would be my preference, as I would love to get someone else to check it in full to make sure it's ALL good... but for now, can someone confirm if I have resolved the relationships correctly, or if the original design was better?
The database description is: There are different types of members: Client, Staff, Professional (Offsite), Supplier, Family, General. There are different types of Staff members: Managers, Volunteer, Professional (Onsite), Admin, Committee, Lecturer. A member can be one or many types, e.g. Client/Volunteer/Family, Supplier/Volunteer, Manager/Lecturer/Volunteer/Committee/Family.
The original guy resolved this by creating a separate table for each member type, each table storing a name and address, e.g.:
Client - ClientName, ClientAddress
Professional - ProfessionalName, ProfessionalAddress
Employee - EmployeeName, EmployeeAddress
Family - FamilyName, FamilyAddress
My only problem with this is that I would ideally like one person to have one MemberID with their name and address, but with the original design each person would have a different ID for each type of person that they were, all storing name, address, phone number, email etc.
I thought that creating a Member table and having a Member Type table with a joining Member Type List table would be a better design. This is how I have resolved the issue:
[Diagram: Member Tables]
Have I done this correctly or should I continue with the original design?
Thanks
Update:
[Diagram: Staff Model]
It makes sense to store all member related data within one table.
Also for programming, I cannot imagine any use case that would support having different tables for each member type.
That being said, I advise you to look up the concept of "user roles", since this seems very similar.
You have different users (members) and they can have different roles (member type). Based on your roles you might want to show different data / allow different actions / send specific mails (or whatever else you can imagine).
So generally your approach looks good. The only thing I would think about is that right now you don't store who is a "Staff" member, for example. If you just have one flat list of member type names, the structure (which types are kinds of Staff) is not stored anywhere.
Depending on your use cases you can e.g. add another column "isStaff" to the MemberType table. Or, if you need to be more flexible and more member types are likely in the future, you can add another table, e.g. MemberTypeParent, and set a foreign key on your MemberType table to that table to make the connection.
It all depends on what you want to do with the data in the future.
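As a rough sketch of the design discussed above, here is one possible layout in SQLite, driven from Python. Member, MemberType, MemberTypeList and MemberTypeParent follow the names used in the question and answer; the column names and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Member (
    MemberID INTEGER PRIMARY KEY,        -- one ID per person
    Name     TEXT NOT NULL,
    Address  TEXT
);
CREATE TABLE MemberTypeParent (
    MemberTypeParentID INTEGER PRIMARY KEY,
    Name               TEXT NOT NULL UNIQUE   -- e.g. 'Staff'
);
CREATE TABLE MemberType (
    MemberTypeID       INTEGER PRIMARY KEY,
    Name               TEXT NOT NULL UNIQUE,
    MemberTypeParentID INTEGER REFERENCES MemberTypeParent(MemberTypeParentID)
);
-- Joining table: one row per (member, type) pair
CREATE TABLE MemberTypeList (
    MemberID     INTEGER NOT NULL REFERENCES Member(MemberID),
    MemberTypeID INTEGER NOT NULL REFERENCES MemberType(MemberTypeID),
    PRIMARY KEY (MemberID, MemberTypeID)
);

INSERT INTO MemberTypeParent VALUES (1, 'Staff');
INSERT INTO MemberType VALUES
    (1, 'Client', NULL), (2, 'Volunteer', 1), (3, 'Lecturer', 1), (4, 'Family', NULL);
INSERT INTO Member VALUES
    (1, 'Alex Example', '1 Sample St'), (2, 'Sam Example', '2 Sample St');
-- Alex is Client/Volunteer/Family, Sam is Family only
INSERT INTO MemberTypeList VALUES (1, 1), (1, 2), (1, 4), (2, 4);
""")

# Everyone who holds at least one Staff-type role
staff = conn.execute("""
    SELECT DISTINCT m.Name
    FROM Member m
    JOIN MemberTypeList l ON l.MemberID = m.MemberID
    JOIN MemberType t ON t.MemberTypeID = l.MemberTypeID
    JOIN MemberTypeParent p ON p.MemberTypeParentID = t.MemberTypeParentID
    WHERE p.Name = 'Staff'
""").fetchall()
```

Each person keeps a single MemberID with one name and address, and the parent table records which types count as Staff.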

Extendable database schema for contacts (social)

I have an old application that needs upgrading. Doesn't everything nowadays?
The existing DB schema consists of predefined fields like phone, fax, email. Obviously with the social explosion over the last 5-7 years (or longer depending on your country) end users need more control over creating contact cards the way they see fit rather than just what I think might be useful.
I'm concerned here with "digital" addresses, i.e. one-line addresses such as phone=ccc ccc ccc ccc.
Since physical addresses are pretty standard in terms of requirements, in this case users will have to use what they are given (location, postal, delivery) in order to keep the scope manageable.
So I'm wondering what the best practice format for storing digital info is. To me it seems I have two choices:
A simple 4 field table (ContactId, AddressTypeId, Address, FormatterId)
1000, "phone", "ccc ccc ccc ccc", phoneformatter
1000, "facebook", "myfacebook", facebookformatter
This would then be JOINED anywhere it's needed. The table would get massive though, and I suspect the join performance would degrade over time.
A json blob that would require additional processing once read (ContactId, Addresses)
1000, {"phone": "ccc ccc ccc ccc", "facebook": "myfacebook"}
Or ... something else.
This db is for use in a given country by customers only trading domestically with client bases ranging from 3000-12000 accounts and then however many contacts per account - averages about 10 in current system.
My primary concern is user flexibility but performance is a key consideration in that. So I dunno, just do whatever and throw heaps of hardware at it ;)
Application is in C# if that makes any difference re: post query processing.
I would not go for the JSON blob. This will be nasty if you need to answer any queries like:-
Does anyone have me in their Facebook contacts?
What's the most popular type of social media contact?
You would be forced to parse the JSON for every record and be unable to create a simple index.
Your first option (the four-field table) is nearly correct; however, FormatterId needs to be on an AddressType table. What you have is not normalised, as FormatterId would depend only on AddressTypeId. So you would have three tables:-
Contact
ContactAddress
AddressType
You haven't stated whether you need to store two addresses of the same type against a single contact, e.g. if someone has two Twitter accounts. Answering this question will allow you to define the correct primary key on ContactAddress. It would either be (ContactId, AddressTypeId), if you can only have one of each type per contact, or a synthetic key (ContactAddressId).
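A minimal sketch of the three-table layout described above, using SQLite from Python. It assumes the synthetic-key variant (several addresses of the same type per contact are allowed); the column types, the index name and the sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (
    ContactId INTEGER PRIMARY KEY,
    Name      TEXT NOT NULL
);
CREATE TABLE AddressType (
    AddressTypeId INTEGER PRIMARY KEY,
    Name          TEXT NOT NULL UNIQUE,
    FormatterId   TEXT NOT NULL          -- depends only on the type, so it lives here
);
CREATE TABLE ContactAddress (
    ContactAddressId INTEGER PRIMARY KEY,  -- synthetic key
    ContactId        INTEGER NOT NULL REFERENCES Contact(ContactId),
    AddressTypeId    INTEGER NOT NULL REFERENCES AddressType(AddressTypeId),
    Address          TEXT NOT NULL
);
-- A plain index supports lookups by type and address
CREATE INDEX ix_contactaddress_type_address
    ON ContactAddress (AddressTypeId, Address);

INSERT INTO Contact VALUES (1000, 'First'), (1001, 'Second'), (1002, 'Third');
INSERT INTO AddressType VALUES
    (1, 'phone', 'phoneformatter'),
    (2, 'facebook', 'facebookformatter'),
    (3, 'twitter', 'twitterformatter');
INSERT INTO ContactAddress (ContactId, AddressTypeId, Address) VALUES
    (1000, 1, 'ccc ccc ccc ccc'),
    (1000, 2, 'myfacebook'),
    (1001, 2, 'otherfacebook'),
    (1001, 3, 'handle_one'),
    (1001, 3, 'handle_two'),
    (1002, 2, 'thirdfacebook');
""")

# "Does anyone have me in their Facebook contacts?"
holders = conn.execute("""
    SELECT a.ContactId
    FROM ContactAddress a
    JOIN AddressType t ON t.AddressTypeId = a.AddressTypeId
    WHERE t.Name = 'facebook' AND a.Address = 'myfacebook'
""").fetchall()

# "What's the most popular type of social media contact?"
most_popular = conn.execute("""
    SELECT t.Name, COUNT(*) AS n
    FROM ContactAddress a
    JOIN AddressType t ON t.AddressTypeId = a.AddressTypeId
    GROUP BY t.Name
    ORDER BY n DESC
    LIMIT 1
""").fetchone()
```

Both questions from the answer become ordinary indexed queries, with no per-row JSON parsing.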
Well, I believe you have a table named contact
contact(contactid, contact details, other details)
and now you want to move these contact details out of the contact table, because they may contain digital addresses, phone numbers and so on.
But the table you are considering
(ContactId, AddressTypeId, Address, FormatterId) is not in normal form: you can't uniquely identify a tuple without reading all four columns, which is bad, and in this case indexing is not going to help you either.
So it would be better to have a separate table for each type of digital address, with an index on contactid:
facebookdetails(contactid, rest of the details)
phonedetails(contactid, rest of the details)
The query can then be a join of all the tables, and it will not degrade performance.
Hope this will help :)

Logic to identify a person uniquely

I am working on a medical PHP application which will be implemented at a national level.
It will be used by multiple hospitals and the patient records will be centralized, i.e. every hospital will access and add patient records to the same database.
I want there to be only one record per patient, without any duplication. Simply put, no hospital should be able to enter a second record for the same patient. To make that possible, I need to know which criteria will remain fixed throughout a patient's entire lifetime. Only two come to mind: name and date of birth.
What other criteria could there be? I don't want to use mobile or phone numbers etc.; moreover, infants won't have them. I need criteria that exist for every patient and are unique.
Please give me your suggestions, or any better way to implement this functionality.
I'll take a shot because I've been involved in some data matching and validation, although not specifically in the medical industry. You haven't specified a particular country, just mentioned Asia, so I'll use an example from my home country of Australia just because I'm familiar with the rules and I believe the same would apply to many Asian countries:
We have a unique Medicare number used for health care, but it's not mandatory and while the free / discounted care means I expect 99%+ of people would have one you can't rely on it.
There is also a tax file number, likewise not mandatory even if you work, and people who have never had a job wouldn't normally have one.
You might be dealing with foreign people that aren't residents.
Driver's licenses are of course not mandatory to get healthcare.
It's perfectly legal to have "no fixed address". Plus some people will lie to get treatments and repeats of drugs etc. Not to mention many people move often.
Changing name is common in case of marriage / divorce, and unless done for illegal purposes someone can change their name just because they don't like their original. Not to mention people use common substitutions for various things like Jim versus James.
Typing mistakes will be very common over a large dataset.
In short I think the 'perfect' scheme you are asking for is impossible. The best you can do is apply a weighting rule to find likely duplicates. Same name / date of birth / place of birth for example is an unlikely but possible event so show a warning to the data entry operator it's a likely duplicate and let them see the details of the likely duplicate. Even things like a drivers license number that should be unique may indicate that the original entry just had a data entry error, not a new duplicate.
In my experience the best thing is a report that lists likely duplicates, which must be reviewed by someone higher up the chain, together with an easy option to merge the duplicates. Then you can start to use looser regular expressions that throw a few false positives, which can be dismissed when a human reviews them. You can also refine the model over time to get the best match results.
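The weighting rule described above might look something like the sketch below. The field names, the weights and the review threshold are all assumptions picked for illustration, and difflib's similarity ratio is just one convenient stand-in for whatever fuzzy comparison you settle on.

```python
from difflib import SequenceMatcher

# Assumed weights per field; they would need tuning against real data.
WEIGHTS = {"name": 0.5, "date_of_birth": 0.3, "place_of_birth": 0.2}
REVIEW_THRESHOLD = 0.8  # assumed cut-off for "show this to a human"


def similarity(a, b):
    """Return a 0..1 similarity that tolerates small typing mistakes."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def duplicate_score(new, existing):
    """Weighted sum of the per-field similarities."""
    return sum(w * similarity(new[f], existing[f]) for f, w in WEIGHTS.items())


def likely_duplicates(new, records):
    """Return (score, record) pairs above the threshold, best match first."""
    scored = [(duplicate_score(new, r), r) for r in records]
    hits = [(s, r) for s, r in scored if s >= REVIEW_THRESHOLD]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


existing = [
    {"name": "Janet Smith", "date_of_birth": "1985-03-14", "place_of_birth": "Perth"},
    {"name": "Robert Jones", "date_of_birth": "1990-07-01", "place_of_birth": "Sydney"},
]
# Same person with a typing mistake in the surname
new_patient = {"name": "Janet Smyth", "date_of_birth": "1985-03-14", "place_of_birth": "Perth"}

matches = likely_duplicates(new_patient, existing)
```

Here the misspelt record is flagged against Janet Smith for review rather than silently rejected or accepted, which matches the "warn the operator, let a human decide" approach.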
A combination of name, date of birth, blood group, place of birth, etc. can be tried.
You need to use some nationwide ID, like a passport ID or health insurance number.
Use the Social Insurance Number together with the country.

Should I expose a user ID to public?

I have a form that reveals user IDs to the public. I was wondering whether this is dangerous. Personally I do not see anything bad about it. The ID is just used to reference a single database record.
If it were dangerous, Stack Overflow wouldn't be displaying user IDs in their URLs in order to make user profile lookups work: https://stackoverflow.com/users/104826/rfactor
Edit of seriousness of immense levels: if user IDs are themselves sensitive data, for example if your primary keys for some reason happen to be social security numbers, that'll definitely be a security and privacy liability. If your user IDs are just auto-increment numbers though, you're clear.
Generally it's not a problem, but it can give away hints about how active your site is, such as how many users you have. Whether you consider this sensitive information or maybe even good marketing is completely up to you.
There's a story that this was one of the reasons the Germans lost WW2. They had sequential serial numbers from production written on each tank. By collecting the serial numbers from tanks that had been taken out, the British could estimate how many tanks the whole German army had and plan new strategies from that.
I have found that exposing primary keys that identify physical entities can create headaches.
Imagine if two blood samples come into a laboratory and test results are generated for each sample. Many different kinds of test might be done and each record representing a test result will have the sample_id as a foreign key.
If you share the database ID with the customer and you discover that two samples were accidentally switched, you will have to update the foreign keys in all the detail records representing the tests. If you instead exposed some other unique name outside your system, you will just need to switch the two unique names on the sample records in the master table.
There are other advantages related to data migration, and further advantages when entities are represented in more than one database, where it is difficult to create records with identical database IDs.
In my experience it is always best to expose a unique identifier other than the primary key outside your system. It gives you more flexibility in resolving data mix-ups, dealing with data migration issues, and in otherwise future-proofing your system.
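The blood sample scenario above can be sketched like this. The table and column names (sample, test_result, public_ref) are assumptions for illustration, and a random UUID is only one possible choice for the externally visible reference.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id  INTEGER PRIMARY KEY,   -- internal key, never shown outside
    public_ref TEXT NOT NULL UNIQUE   -- identifier shared with the customer
);
CREATE TABLE test_result (
    test_result_id INTEGER PRIMARY KEY,
    sample_id      INTEGER NOT NULL REFERENCES sample(sample_id),
    test_name      TEXT NOT NULL,
    value          REAL NOT NULL
);
""")

ref_a, ref_b = str(uuid.uuid4()), str(uuid.uuid4())
conn.execute("INSERT INTO sample VALUES (1, ?)", (ref_a,))
conn.execute("INSERT INTO sample VALUES (2, ?)", (ref_b,))
conn.executemany(
    "INSERT INTO test_result (sample_id, test_name, value) VALUES (?, ?, ?)",
    [(1, "glucose", 5.1), (1, "iron", 14.0), (2, "glucose", 7.9)],
)


def results_for(public_ref):
    """Look up test names by the public reference, as a customer would."""
    rows = conn.execute(
        "SELECT r.test_name FROM test_result r "
        "JOIN sample s ON s.sample_id = r.sample_id "
        "WHERE s.public_ref = ? ORDER BY r.test_name",
        (public_ref,),
    ).fetchall()
    return [name for (name,) in rows]


def swap_public_refs(id_one, id_two):
    """Fix a mix-up by swapping public references in the master table only."""
    refs = dict(conn.execute(
        "SELECT sample_id, public_ref FROM sample WHERE sample_id IN (?, ?)",
        (id_one, id_two),
    ).fetchall())
    update = "UPDATE sample SET public_ref = ? WHERE sample_id = ?"
    with conn:  # one transaction
        # Temporary placeholder so the UNIQUE constraint is never violated
        conn.execute(update, (str(uuid.uuid4()), id_one))
        conn.execute(update, (refs[id_one], id_two))
        conn.execute(update, (refs[id_two], id_one))


before = results_for(ref_a)
swap_public_refs(1, 2)
after = results_for(ref_a)
```

The swap touches two rows in sample and none in test_result, which is the flexibility the answer is describing.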
As for me, an ID is as dangerous as showing a user name.
Exposing a user ID is not, in and of itself, bad. It depends on the level of privacy and security needed. If the user ID does not expose and cannot be tied to any other personal data that should otherwise be private, it may not be a problem.
But don't think that public user IDs can never be a problem.
Make sure you don't allow anyone to break in to any private data just by knowing user IDs. Facebook has had problems like that. Here's just one example. While revealing user IDs wasn't the whole story, it was part of the equation.
Will it hurt anything? Only you can decide that, and you should think that through.
But in general, it is poor form to display the user ID without having a business reason to do so. ("It saves you work" is probably not a good business reason.)
If it is a generated database id with no other meaning, it's not dangerous. Though I don't think revealing an id is elegant either. It's a technical detail and I can't understand why you would like to show it to users.

Is using Personally Identifiable Information (PII) as foreign keys discouraged in database design?

While cleansing PII from test data I have been stuck on a challenging scenario: cascading the changes through the foreign key relationships in the data. Given the focus on privacy and regulations, should this practice be discouraged? If the PII itself were not used as a key in any way, a neat trick would be to just shuffle the columns.
There are some commercial tools available to address this problem but none of them seem to handle a large variety of databases well.
Sounds dangerous and stupid and inefficient. Keys should be synthetic ids.
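To illustrate why synthetic keys make the cleansing easier: when no PII column takes part in a key, the column shuffle mentioned in the question needs no cascading at all. A small sketch, with made-up rows and a hypothetical helper name (shuffle_columns):

```python
import random

# Made-up test rows; patient_id is a synthetic key, the rest is PII.
patients = [
    {"patient_id": 1, "surname": "Smith", "ssn": "000-00-0001"},
    {"patient_id": 2, "surname": "Doe", "ssn": "000-00-0002"},
    {"patient_id": 3, "surname": "Jones", "ssn": "000-00-0003"},
]


def shuffle_columns(rows, pii_columns, seed=None):
    """Return copies of the rows with each PII column shuffled independently.

    Key columns are left untouched, so foreign keys pointing at them stay valid.
    """
    rng = random.Random(seed)
    shuffled = [dict(row) for row in rows]
    for column in pii_columns:
        values = [row[column] for row in rows]
        rng.shuffle(values)
        for row, value in zip(shuffled, values):
            row[column] = value
    return shuffled


cleansed = shuffle_columns(patients, ["surname", "ssn"], seed=42)
```

Rows in other tables that reference patient_id keep pointing at valid rows, because the key column never changes.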
HIPAA has a concept called the "Unique Patient Identifier" which can be used as we describe to link data: http://www.ncvhs.hhs.gov/app4.htm
"Unique Patient Identifier eliminates the need for the repetitive use and disclosure of an individual's personal identification information (i.e. name, age, sex, race, marital status, place of residence, etc.) for routine internal and external communications (e.g. orders, results, medication, consultation, etc.) and protects the privacy of the individual. It helps preserve the patient anonymity while facilitating communication and information sharing. Healthcare is fundamentally a multi-disciplinary process. A Unique Patient Identifier enables the integration and the availability of critically needed information from multi-disciplinary sources and multiple care settings. Therefore, the integrity and security of the patient information depend on the use of a reliable Unique Patient Identifier."
The privacy issue hinges not so much on the identifier itself, but on the security and privacy of the data that the identifier is used to access, and how that access is controlled. My understanding is that typically this means that a system querying for information via a patient identifier should only get back information that can not be pieced together to reveal private information.
Essentially you would generate an artificial key for each person. Even though it is unique to the person, it is not personally identifying unless you also release personally identifiable information along with it. For example, if you let people see only first names with a particular query, but also return the artificial key, then they now know that artificial key 00003 is associated with first name Bob. Now if you allow them to somehow go back and query with 00003 as criteria, and allow them access to the last name, you can see how they can start to accumulate information. It is important that there be no way for an unauthorized user to get the artificial key and PII returned in the same query, since that would then make the artificial key itself PII. That's my interpretation at least.
Besides the HIPAA issues, another problem with using PII as a key is that it changes. People get new SSNs when they have their identities stolen. SSNs are also often miskeyed and thus relate the information to the wrong person (thinking more of data imports from other systems here). People (especially women) often change their names. Different people also have the same name (and often, for this reason, databases hold incorrect SSN information for them as well, because they match to the wrong SSN for that name), and thus very little PII is in fact unique enough to be a key field. Further, PII should be stored in an encrypted field, making it an even worse choice for a key field.
