Obfuscating names with human readable names - obfuscation

I have to obfuscate human names with other unique human names.
So, say, the Twitter handle Jane Doe might be mapped to Sally Bates.
This mapping must be unique: 1-to-1.
I need to do this because of concerns regarding personally identifiable information.
Question:
Does anyone know of a software component that does this?
Or is there a standing algorithm to do this?
Notes:
A permutation of the characters is probably not sufficient.
The obfuscated name should be a memorable name: "johnbrown" and "john smith" are good; "nhoj" is not OK.
The obfuscated name should not be mappable by someone who is not aware of the obfuscation strategy.
The obfuscated name should be easily mappable back to the source name, either by computation or by dictionary lookup.
Thanks.
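One way the requirements in the notes could be met is a dictionary-backed mapping: build a pool of readable names, shuffle it with a seed derived from a secret, and hand out the next unused name to each new source name. The sketch below is only an illustration under those assumptions. The name pools, the `PseudonymMapper` class and its method names are hypothetical, and `random.Random` is not a cryptographic generator, so a real system would want a stronger shuffle and a much larger pool.

```python
import hashlib
import random

# Hypothetical name pools; a real deployment would need far larger lists.
FIRST_NAMES = ["Sally", "John", "Maria", "Peter", "Alice", "David"]
LAST_NAMES = ["Bates", "Brown", "Smith", "Jones", "Clark", "Young"]


class PseudonymMapper:
    """Assigns each real name a unique, readable pseudonym (1-to-1)."""

    def __init__(self, secret: str):
        # Build every first/last combination, then shuffle it with a
        # seed derived from the secret so the order is not guessable
        # from the pools alone.
        pool = [f"{first} {last}" for first in FIRST_NAMES for last in LAST_NAMES]
        seed = int.from_bytes(hashlib.sha256(secret.encode()).digest(), "big")
        random.Random(seed).shuffle(pool)
        self._unused = pool
        self._forward = {}  # real name -> pseudonym
        self._reverse = {}  # pseudonym -> real name

    def obfuscate(self, real_name: str) -> str:
        """Return the pseudonym for real_name, assigning one if needed."""
        if real_name not in self._forward:
            if not self._unused:
                raise RuntimeError("pseudonym pool exhausted")
            alias = self._unused.pop()
            self._forward[real_name] = alias
            self._reverse[alias] = real_name
        return self._forward[real_name]

    def reveal(self, alias: str) -> str:
        """Map a pseudonym back to the source name by dictionary lookup."""
        return self._reverse[alias]


mapper = PseudonymMapper(secret="change-me")
alias = mapper.obfuscate("Jane Doe")
print(alias, "->", mapper.reveal(alias))
```

Because pseudonyms are handed out in order of arrival, the two dictionaries are the real key to the mapping and would have to be stored securely; anyone without them (and without the secret and the arrival order) cannot reverse a name.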

Related

Common identifier for a person across disparate systems

Not sure which is the best Stack Exchange site for this, so will try my hand here.
I have a web application that stores user disciplinary data for organisations. Rather than having clients enter their staff into multiple systems, some want to push basic personnel data (such as First Name, Surname, DOB, Job Title, etc.) into ours from their source (e.g. HR/ERP) databases.
Our clients are using a range of existing systems to store their data, such as Oracle, SAP, JD Edwards, etc.
I am familiar with the technical methods to get this data (e.g. web service, web API), but not for a case such as when a person's surname changes (e.g. Janet Smith gets married and becomes Janet Doe). Unless there is a unique identifier for that person across both systems, I can't see how that change can be managed reliably.
How is this process best managed? Is an additional field added to the destination database that contains the UID of the source data? Or do both parties agree on a common field, e.g. employee number, that never changes?
This issue arises in many circumstances. One case is the Texas school system, where students are tracked longitudinally through numerous education subsystems. A social security number, which provides a unique identifier in some cases (although not all), was considered too sensitive for use. Thus, a unique identifier has been generated for each student and staff member. This is part of the permanent information associated with each individual, regardless of employment change, location change, or name change.
This link describes the rationale for the unique id.
This link is the documentation on the Texas Student Data System (TSDS) unique identifier. You might find the XML examples at the end of the document of most interest. Much of the information involves submitting requests for an id where demographic information is needed for disambiguation.
Basically, something similar to a Java UUID as an extra field in the database should be sufficient to achieve your aim.
Hope this helps.
Yes, the UID is the only solution. This problem comes up in medical systems too, for example. Another is photos, I'm not sure which causes more problems!
One approach I know of is to use an "external_id" field for that. Several external IDs can be stored when there are many source systems.
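The answers above could be combined roughly as follows: the destination system generates its own permanent UID per person and keeps a lookup from each source system's stable identifier (for example an employee number) to that UID. This is a minimal in-memory sketch; the `Person` and `PersonRegistry` names, the `upsert` method and the sample employee number are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Person:
    first_name: str
    surname: str
    # Internal identifier that never changes, whatever happens to the name.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))


class PersonRegistry:
    def __init__(self):
        self._people = {}    # uid -> Person
        self._external = {}  # (source_system, external_id) -> uid

    def upsert(self, source, external_id, first_name, surname):
        """Create or update a person keyed by the source system's own ID."""
        key = (source, external_id)
        uid = self._external.get(key)
        if uid is None:
            person = Person(first_name, surname)
            self._people[person.uid] = person
            self._external[key] = person.uid
            return person
        # Known person: apply the pushed changes, keep the same UID.
        person = self._people[uid]
        person.first_name, person.surname = first_name, surname
        return person


registry = PersonRegistry()
before = registry.upsert("hr", "E1001", "Janet", "Smith")
after = registry.upsert("hr", "E1001", "Janet", "Doe")  # surname change
print(before.uid == after.uid, after.surname)
```

Keying the lookup on the pair of source system and external ID lets several client systems reuse the same employee numbers without colliding.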

Logic to identify a person uniquely

I am working on a medical PHP application which will be implemented at national level.
It will be used by multiple hospitals and the patient records will be centralized, i.e. every hospital will be accessing and adding patient records in the same database.
I want there to be only one record per patient, without any duplication. Simply put, no hospital should be able to enter a second record for the same patient. To make that possible I need to know which criteria to use that will remain fixed throughout the entire lifetime of a patient. Only two come to mind: name and date of birth.
What other criteria could there be? I don't want to use mobile numbers, phone numbers, etc.; moreover, infants can't have them. I need criteria that exist for every patient and are unique.
Please give me your suggestions, or any better way to implement this functionality.
I'll take a shot because I've been involved in some data matching and validation, although not specifically in the medical industry. You haven't specified a particular country, just mentioned Asia, so I'll use an example from my home country of Australia just because I'm familiar with the rules and I believe the same would apply to many Asian countries:
We have a unique Medicare number used for health care, but it's not mandatory and while the free / discounted care means I expect 99%+ of people would have one you can't rely on it.
There is also a tax file number, likewise not mandatory even if you work, and people who have never had a job wouldn't normally have one.
You might be dealing with foreign people who aren't residents.
Driver's licenses are of course not mandatory to get healthcare.
It's perfectly legal to have "no fixed address". Plus some people will lie to get treatments and repeats of drugs etc. Not to mention many people move often.
Changing name is common in case of marriage / divorce and, unless done for illegal purposes, someone can change their name just because they don't like their original. Not to mention people use common substitutions for various things like Jim versus James.
Typing mistakes will be very common over a large dataset.
In short, I think the 'perfect' scheme you are asking for is impossible. The best you can do is apply a weighting rule to find likely duplicates. Same name / date of birth / place of birth, for example, is an unlikely but possible event, so show a warning to the data entry operator that it's a likely duplicate and let them see the details of the likely duplicate. Even things like a driver's license number that should be unique may indicate that the original entry just had a data entry error, not a new duplicate.
From my experience the best thing is a report that lists likely duplicates that must be reviewed by someone higher up the chain, with an easy option to merge the duplicates. Then you can start to use looser matching rules (for example, vaguer regular expressions) that throw a few false positives, which can be dismissed when a human reviews them. You can also refine the model over time to get the best match results.
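The weighting rule described above might look something like the following sketch, which scores a new record against existing ones and returns only the candidates worth showing to a human reviewer. The field names, weights and threshold are assumptions for illustration and would need tuning against real data; the fuzzy comparison uses the standard library's `difflib.SequenceMatcher` so that typing mistakes still score highly.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; these are not derived from real data.
WEIGHTS = {"name": 0.5, "date_of_birth": 0.3, "place_of_birth": 0.2}
REVIEW_THRESHOLD = 0.8


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1], tolerant of small typing mistakes."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def duplicate_score(new: dict, existing: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(
        weight * similarity(new[field], existing[field])
        for field, weight in WEIGHTS.items()
    )


def likely_duplicates(new: dict, records: list) -> list:
    """Return (score, record) pairs above the threshold, best match first."""
    scored = [(duplicate_score(new, record), record) for record in records]
    flagged = [pair for pair in scored if pair[0] >= REVIEW_THRESHOLD]
    return sorted(flagged, key=lambda pair: pair[0], reverse=True)


existing_records = [
    {"name": "James Smith", "date_of_birth": "1980-02-01", "place_of_birth": "Perth"},
    {"name": "Mary Jones", "date_of_birth": "1975-11-30", "place_of_birth": "Sydney"},
]
incoming = {"name": "Jamse Smith", "date_of_birth": "1980-02-01", "place_of_birth": "Perth"}

for score, record in likely_duplicates(incoming, existing_records):
    print(f"possible duplicate ({score:.2f}): {record['name']}")
```

The function only flags candidates; the decision to merge stays with the operator or reviewer, as the answer recommends.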
A combination of name, date of birth, blood group, place of birth, etc. can be tried.
You need to use some nationwide ID, like a passport ID or health insurance number.
Social Insurance Number with country.

Is normalizing the gender table going too far?

I am not a database guy, but am trying to clean up another database. So my question is would normalizing the gender table be going too far?
User table:
userid int pk,
genderid char(1) fk
etc...
gender table:
genderid char(1) pk,
gender varchar(20)
Now at first it seemed silly to me, but then I considered it because I can then have a constant data source to populate from or bind to. I will be using WPF. If it was another framework I would probably avoid it, but what do you think?
Whether or not you choose to normalize your table structure to accommodate gender is going to depend on the requirements of your application and your business requirements.
I would normalize if:
You want to be able to manage the "description" of a gender in the database, and not in code.
This allows you to quickly change the description from Man/Woman to Male/Female, for example.
Your application currently must handle, or will possibly handle in the future, localization requirements, i.e. being able to specify gender in different languages.
Your business requires that everything be normalized.
I would not normalize if:
You have a relatively simple application where you can easily manage the description of the gender in code rather than in the database.
You have tight programmatic control of the data going in and out of the gender field such that you can ensure consistency of the data in that field.
You only care about the gender field for information capture, meaning, you don't have a lot of programmatic need to update this field once it is set the first time.
I'm also not a database guy but I do it. It lets me ensure that only valid genders are entered (referential integrity), and I can also use it to populate the selection control.
I can think of applications where I'd use different columns for sex and gender, have three values for sex (male/female/decline to state) and six for gender (male/female/transgendered male/transgendered female/asexual/decline to state). Granted, I live in San Francisco, where there's a level of public discussion of transgender issues that much of the rest of the world is behind the curve on.
The point is: without a compelling reason to think otherwise, I'd assume that any simplifying assumption I made about demographics was limited and parochial. The cost of breaking sex out to its own table is small now and expensive later. I wouldn't avoid the small cost on the basis of an assumption.
Well, your company might have a requirement that, if possible, everything be normalized.
Also, depending on the business & data, you might need to include transgenders as well which would create 3+ genders (I don't know how many there are, haven't checked)
I'll remark on another aspect: sorting. Normally, 'M' sorts after 'F'; in a project one time, a database table had a gender field with either of those two values. There was a desire to be able to sort results on the gender (census data) and a further preference to have 'M' appear before 'F'. My solution was to add a separate lookup table, assigning the Male value an ID of 0, and Female an ID of 1. So queries on the main table could easily be sorted on the new genderID field.
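To make the lookup-table arguments above concrete, here is a small sketch using SQLite through Python's `sqlite3` module. It follows the question's `gender` and user tables, and adds a `sort_order` column, which is a hypothetical stand-in for the numeric ID used for sorting in the answer above. SQLite is only an assumption for the sake of a runnable example; the original database engine is not stated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE gender (
        genderid   CHAR(1) PRIMARY KEY,
        gender     VARCHAR(20) NOT NULL,
        sort_order INTEGER NOT NULL  -- hypothetical column for custom ordering
    );
    CREATE TABLE user (
        userid   INTEGER PRIMARY KEY,
        genderid CHAR(1) REFERENCES gender(genderid)
    );
    INSERT INTO gender VALUES ('M', 'Male', 0), ('F', 'Female', 1);
    INSERT INTO user VALUES (1, 'F'), (2, 'M');
""")


def insert_user(userid, genderid):
    """Return True if the row was accepted, False if the FK rejected it."""
    try:
        conn.execute("INSERT INTO user VALUES (?, ?)", (userid, genderid))
        return True
    except sqlite3.IntegrityError:
        return False


# The lookup table doubles as the data source for a selection control.
choices = conn.execute(
    "SELECT genderid, gender FROM gender ORDER BY sort_order"
).fetchall()

# Sorting users so that 'M' comes before 'F', via the lookup table.
rows = conn.execute("""
    SELECT u.userid, g.gender
    FROM user u JOIN gender g ON g.genderid = u.genderid
    ORDER BY g.sort_order
""").fetchall()

print(choices)
print(rows)
print(insert_user(3, "X"))  # rejected: 'X' is not in the gender table
```

Changing a description (Man/Woman to Male/Female, say) then becomes a single `UPDATE` on the lookup table rather than a code change.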
Just thought I'd throw an opinion in here. @Ben McCormack has a great answer with a minor caveat: regarding localization, there are sometimes better ways of handling this than having the values defined directly in your database.
For example, you mention WPF. With .Net you have various localization resources that are much better suited to managing differences in whether to emit "Male" or "Samec" (Czech).
By letting the built-in localization features take care of this you don't have to worry about having multiple database records defining the exact same thing, which could complicate reporting.
That said, I'd suggest that you might want to consider if "gender" is really what you are after. Gender is defined as "a set of characteristics distinguishing between male and female".
On the face of it this sounds like your standard Male/Female options; but it's not. Gender is much more complicated than that as it needs context in order to have meaning. For example, in the context of a relationship a Male (by Sex) could have one of several "genders": Masculine, Feminine or even Neutral. This is regardless of what sex their partner is.
In the context of just an individual, a Male (by Sex) might be Masculine, Feminine, Neutral, Transgender, Intersex or any of a number of other options acceptable to the person filling out the form.
At least one person commented that Gender is necessary in order to determine the honorific used in mailings. I'd suggest that there is no relationship between gender and those honorifics. For example, a Female (by Sex) might want to be addressed as Ms/Miss/Mrs/Dr/Madam/Professor or even Mr if they are in the process of, or have completed, surgery to become "Male". That list is by no means all inclusive and in any event it's much better to allow that person to select how they want to be addressed.
Which leads me to my last item: Before collecting any piece of data you should have a defined reason for having it. My company specializes in data collection through online forms. One of the things we do is look at what our clients are asking for and go field by field to determine if the data is even used anywhere.
More often than not an entity (company/governmental/etc) asks for far more information than they care about. This can have additional consequences in the event the data is lost, stolen or simply viewed by unauthorized individuals. Further there is a burden placed on the person filling out the forms for each field they are asked to complete.
I bring this up because "Gender" is almost never needed for any normal system. Instead, Sex is a better qualifier and even then it has little value. Dating sites and government censuses are the exceptions.
Yes. I think you can use an enum in code and eventually bind to it:
null - unknown;
0 - male;
1 - female;
Or you can use a bool type to define this:
null - unknown; true - male; false - female
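The enum idea in the last answer could be sketched as below. The thread is about WPF, so this Python version is only an illustration of the shape of the approach; the `Gender` enum, `from_db` and `choices` names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Gender(Enum):
    MALE = 0
    FEMALE = 1


def from_db(value: Optional[int]) -> Optional[Gender]:
    """Map a nullable integer column to the enum; NULL means unknown."""
    return None if value is None else Gender(value)


def choices():
    """(value, label) pairs suitable for binding to a selection control."""
    return [(g.value, g.name.title()) for g in Gender]


print(from_db(None), from_db(1), choices())
```

An enum keeps the valid values in code, which fits the "would not normalize" cases listed earlier; a nullable bool works too, but leaves no room to add further values later.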

Table and column naming conventions when plural and singular forms are odd or the same

In my search I found mostly arguments for whether to use plurality in database naming conventions, and ways to handle it in either case. I have decided I prefer plural table names, so I don't want to argue that.
I need to represent an animal's species and genus and so on in a database. The plural and singular form for 'species' are the same, and the plural of 'genus' is 'genera'.
I'm using Microsoft's Entity Data Model, by the way. My concern is mainly about whether I'll have problems later on depending on my naming choices.
I think I can get by with:
Table: Genera | Column: Genus
But I'm unsure how I should handle:
Table: Species | Column: Species
If I really wanted to be lazy about this I'd just name them 'species > specie' and 'genuses > genus', but I would prefer to read them in their correct forms.
Any advice would be appreciated.
I would go for Genera/Genus and Species/Species. That's how you say it in English, so why use an incorrect form of the word?
I generally avoid having a column name that is the same as a table name because it can be confusing to human readers. The database engine knows whether it expects a table name or column name in any given context; I don't recall that ever being a problem. (Is there some context where either would be valid? I can't think of one.)
That said, if you run into this issue, it indicates to me that you have a poorly chosen name for one or the other. Species makes good sense as a table name: this table contains data about a species. So if a field in that table is called "species" ... what about the species? Presumably everything in the table is about a species. I'd guess it was probably some sort of identifier and not, say, the number of chromosomes or method of reproduction. But is it an ID number? An abbreviation? The common name? The binomial nomenclature name? Etc. If it's, say, the common name, I'd call it "common_name" and not "species".
By the way, another naming convention you should decide on is whether column names that could be ambiguous if taken out of context should have names that specify the context, or whether you use the table name to eliminate the ambiguity. For example, you could have many things that have a "name". You could call any such field simply "name", and if there's ambiguity, qualify it, like "species.name", "laboratory.name", etc. Or you could give each field a unique name, like "species_name", "laboratory_name", etc. That's one of those questions that I think has no definitively right answer, just pros and cons and make a decision and be consistent.

Creating an address database

I am re-creating a part of my company’s database because it does not meet future needs.
Currently we have mainly a flat file and some disjointed tables that were never fully realized.
My way of thinking is we have a table for each category except maybe the zips table, which may serve as a connect it all together table.
Please refer to image below:
Database Diagram http://www.freeimagehosting.net/uploads/248cc7e884.jpg
One thing I am thinking of is removing the zip table and just putting the zip code in the zipstocities table since the zip code is almost unique and then indexing the table on the zip code. The only downside is zip code has to be a varchar to take care of zip codes with leading zeros. Just want to know if there is a flaw in my logic.
I don't know the US ZIP code and territorial division system well, but I assume it's somewhat like the German one.
A state has many counties.
A county has many cities.
A city has many zip codes.
Hence I would use the following schema.
ZipCodes           CityZipCodes
------------       ----------------        Cities
ZipCode (PK) <───  ZipCode (PK)(FK)        -----------
                   City (PK)(FK)    ───>   CityId (PK)
                                           Name
                                           County (FK) ───┐
                                                          │
                                                          │
                   Counties                               │
                   -------------                          │
States             CountyId (PK) <────────────────────────┘
-----------------  Name
StateId (PK) <───  State (FK)
Name
Abbreviation
Fixed for multiple cities per ZIP code.
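As a rough, runnable rendering of the schema above, the following sketch creates the five tables in SQLite via Python's `sqlite3` module. SQLite, the column types and the `cities_for_zip` helper are assumptions for illustration only. The ZIP code is stored as text so that leading zeros survive, and the many-to-many `CityZipCodes` table uses a composite primary key. It follows the diagram literally, so every city must belong to a county.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE States (
        StateId      INTEGER PRIMARY KEY,
        Name         TEXT NOT NULL,
        Abbreviation TEXT NOT NULL UNIQUE
    );
    CREATE TABLE Counties (
        CountyId INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL,
        State    INTEGER NOT NULL REFERENCES States(StateId)
    );
    CREATE TABLE Cities (
        CityId INTEGER PRIMARY KEY,
        Name   TEXT NOT NULL,
        County INTEGER NOT NULL REFERENCES Counties(CountyId)
    );
    CREATE TABLE ZipCodes (
        ZipCode TEXT PRIMARY KEY  -- text, so leading zeros are preserved
    );
    CREATE TABLE CityZipCodes (
        ZipCode TEXT    NOT NULL REFERENCES ZipCodes(ZipCode),
        City    INTEGER NOT NULL REFERENCES Cities(CityId),
        PRIMARY KEY (ZipCode, City)  -- one ZIP code may span several cities
    );
    INSERT INTO States   VALUES (1, 'Massachusetts', 'MA');
    INSERT INTO Counties VALUES (1, 'Suffolk', 1);
    INSERT INTO Cities   VALUES (1, 'Boston', 1);
    INSERT INTO ZipCodes VALUES ('02108');
    INSERT INTO CityZipCodes VALUES ('02108', 1);
""")


def cities_for_zip(zip_code):
    """Return (city, state abbreviation) pairs for a ZIP code."""
    return conn.execute("""
        SELECT c.Name, s.Abbreviation
        FROM CityZipCodes cz
        JOIN Cities   c  ON c.CityId    = cz.City
        JOIN Counties co ON co.CountyId = c.County
        JOIN States   s  ON s.StateId   = co.State
        WHERE cz.ZipCode = ?
        ORDER BY c.Name
    """, (zip_code,)).fetchall()


print(cities_for_zip("02108"))
```

Because a city reaches its state only through its county, a city cannot be paired with the wrong state by accident; the state is derived, not entered.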
One thing you should be aware of is that not all cities are in counties. In Virginia you are in either a city or county but never both.
Looking at the diagram you have, the state table is the only one of the 4 outside tables that is really necessary. Lookup tables with just an ID and a single value aren't worth the effort. These relationships are designed to make a single value in the main table (ziptocities) refer to a set of related data in the lookup table (states).
You'll need to ask yourself why you care about counties. In many states in the US, they have little importance beyond tradition and maps.
The other question will be how important will it be that the address be accurate? How many deaths will there be if important letters are not delivered in a timely manner (possibly many if the letter is about prescription drug recalls!)
You probably want to think about using data from the Postal Service, possibly using a product that corrects addresses. That way, when you get a good address, you'll be certain the mail can be delivered there - because the Postal Service will have said so!
There seem to be flaws in both your process and your logic.
I suggest that you stop thinking about tables and relationships for a moment. Instead, think about facts. Make a list of valid addresses that your database needs to support. Many surprises await you.
Don't confuse an address with a mailing label. They're not at all the same thing. Consider modeling carriers, too. In the US, whether an address is valid depends on the carrier. For example, my PO box is a valid address when the carrier is the USPS, but not when the carrier is UPS.
To save time, you might try browsing some international address formats on bitboost.
Will your logic work if two countries happen to have the same zip code? The two would point to different cities in that case. Here are some points to consider:
Do you want to use the zip code as a kind of primary key into the address (at least for the city, state and country fields)? In that case, you can have zipcode, city, state and country in one table and create indexes on city, state, etc. You have a functional dependency of the form zipcode -> country, state, city. As I said, this may not hold across countries.
If auto-populating is your only concern, create a materialized view and use it.
I would recommend reading 'Data Model patterns' by David C. Hay.
But not every person who has a valid medical claim is required by law to remain in the US until the claim is settled. People move.
San Francisco is a city in California; it's not a city in Alabama. Does your design prevent nonsense entries like "San Francisco, AL"?

Resources