Is normalizing the gender table going too far?

I am not a database guy, but am trying to clean up another database. So my question is would normalizing the gender table be going too far?
User table:
userid int pk,
genderid char(1) fk
etc...
gender table:
genderid char(1) pk,
gender varchar(20)
Now at first it seemed silly to me, but then I considered it because I can then have a constant data source to populate from or bind to. I will be using WPF. If it were another framework I would probably avoid it, but what do you think?

Whether or not you choose to normalize your table structure to accommodate gender is going to depend on the requirements of your application and your business.
I would normalize if:
* You want to be able to manage the "description" of a gender in the database, and not in code. This allows you to quickly change the description from Man/Woman to Male/Female, for example.
* Your application currently must handle, or will possibly handle in the future, localization requirements, i.e. being able to specify gender in different languages.
* Your business requires that everything be normalized.
I would not normalize if:
* You have a relatively simple application where you can easily manage the description of the gender in code rather than in the database.
* You have tight programmatic control of the data going in and out of the gender field such that you can ensure consistency of the data in that field.
* You only care about the gender field for information capture, meaning you don't have much programmatic need to update this field once it is set the first time.

I'm also not a database guy, but I do it. It lets me ensure that only valid genders are entered (referential integrity), and I can also use the table to populate the selection control.
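A minimal sketch of those two benefits, using Python's built-in sqlite3 module rather than the asker's actual database or WPF. The table and column names follow the question; the table name `users`, the sample rows, and the `choices` variable are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE gender (
        genderid CHAR(1) PRIMARY KEY,
        gender   VARCHAR(20) NOT NULL
    );
    CREATE TABLE users (
        userid   INTEGER PRIMARY KEY,
        genderid CHAR(1) REFERENCES gender (genderid)
    );
    INSERT INTO gender VALUES ('F', 'Female'), ('M', 'Male');
""")

# A valid code is accepted.
conn.execute("INSERT INTO users VALUES (1, 'F')")

# A code that is not in the lookup table is rejected by the foreign key.
try:
    conn.execute("INSERT INTO users VALUES (2, 'X')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The same lookup table can feed a selection control (id, label pairs).
choices = conn.execute(
    "SELECT genderid, gender FROM gender ORDER BY gender"
).fetchall()
```

In a WPF application the `choices` rows would play the role of the items source for a combo box; the binding itself is outside the scope of this sketch.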

I can think of applications where I'd use different columns for sex and gender, have three values for sex (male/female/decline to state) and six for gender (male/female/transgendered male/transgendered female/asexual/decline to state). Granted, I live in San Francisco, where there's a level of public discussion of transgender issues that much of the rest of the world is behind the curve on.
The point is: without a compelling reason to think otherwise, I'd assume that any simplifying assumption I made about demographics was limited and parochial. The cost of breaking sex out to its own table is small now and expensive later. I wouldn't avoid the small cost on the basis of an assumption.

Well, your company might have a requirement that, if possible, everything be normalized.
Also, depending on the business & data, you might need to include transgenders as well which would create 3+ genders (I don't know how many there are, haven't checked)

I'll remark on another aspect: sorting. Normally, 'M' sorts after 'F'; in a project one time, a database table had a gender field with either of those two values. There was a desire to be able to sort results on the gender (census data) and a further preference to have 'M' appear before 'F'. My solution was to add a separate lookup table, assigning the Male value an ID of 0, and Female an ID of 1. So queries on the main table could easily be sorted on the new genderID field.
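The sorting trick can be sketched as follows, again with Python's sqlite3 module. The IDs 0 and 1 come from the description above; the `person` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup table: the numeric ID doubles as the desired sort position.
    CREATE TABLE gender (genderid INTEGER PRIMARY KEY, gender TEXT NOT NULL);
    INSERT INTO gender VALUES (0, 'Male'), (1, 'Female');

    -- Hypothetical main table holding only the ID.
    CREATE TABLE person (
        name     TEXT NOT NULL,
        genderid INTEGER REFERENCES gender (genderid)
    );
    INSERT INTO person VALUES ('Ann', 1), ('Bob', 0), ('Cid', 0), ('Dee', 1);
""")

# Sorting on the ID puts Male before Female, which a plain
# alphabetical sort on 'M'/'F' would not do.
rows = conn.execute("""
    SELECT p.name, g.gender
    FROM person AS p
    JOIN gender AS g ON g.genderid = p.genderid
    ORDER BY p.genderid, p.name
""").fetchall()
```

A separate sort-order column on the lookup table would achieve the same thing without giving the key itself a meaning; which is preferable is a design choice.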

Just thought I'd throw an opinion in here. @Ben McCormack has a great answer with a minor caveat: regarding localization, there are sometimes better ways of handling this than having the values defined directly in your database.
For example, you mention WPF. With .NET you have various localization resources that are much better suited to managing differences in whether to emit "Male" or "Samec" (Czech).
By letting the built-in localization features take care of this, you don't have to worry about having multiple database records defining the exact same thing, which could complicate reporting.
That said, I'd suggest that you might want to consider if "gender" is really what you are after. Gender is defined as "a set of characteristics distinguishing between male and female".
On the face of it this sounds like your standard Male/Female options; but it's not. Gender is much more complicated than that as it needs context in order to have meaning. For example, in the context of a relationship a Male (by Sex) could have one of several "genders": Masculine, Feminine or even Neutral. This is regardless of what sex their partner is.
In the context of just an individual, a Male (by Sex) might be Masculine, Feminine, Neutral, Transgender, Intersex or any of a number of other options acceptable to the person filling out the form.
At least one person commented that Gender is necessary in order to determine the honorific used in mailings. I'd suggest that there is no relationship between gender and those honorifics. For example, a Female (by Sex) might want to be addressed as Ms/Miss/Mrs/Dr/Madam/Professor or even Mr if they are in the process of, or have completed, surgery to become "Male". That list is by no means all inclusive and in any event it's much better to allow that person to select how they want to be addressed.
Which leads me to my last item: Before collecting any piece of data you should have a defined reason for having it. My company specializes in data collection through online forms. One of the things we do is look at what our clients are asking for and go field by field to determine if the data is even used anywhere.
More often than not an entity (company/governmental/etc) asks for far more information than they care about. This can have additional consequences in the event the data is lost, stolen or simply viewed by unauthorized individuals. Further there is a burden placed on the person filling out the forms for each field they are asked to complete.
I bring this up because "Gender" is almost never needed for any normal system. Instead, Sex is a better qualifier, and even then it has little value outside of dating sites and government censuses.

Yes. I think that you can use an enum in code and eventually bind to it:
null - unknown;
0 - male;
1 - female;
or you can use a bool type to define this:
null - unknown; true - male; false - female
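As a rough illustration of the enum variant, here is a sketch in Python (the answer itself is about C#, where a nullable enum would be the equivalent). The names `Gender`, `from_db`, and `describe` are made up for this example; only the 0/1/null mapping comes from the answer.

```python
from enum import Enum
from typing import Optional


class Gender(Enum):
    MALE = 0
    FEMALE = 1


def from_db(raw: Optional[int]) -> Optional[Gender]:
    """Map a nullable integer column to the enum; NULL stays None (unknown)."""
    return None if raw is None else Gender(raw)


def describe(value: Optional[Gender]) -> str:
    """Return a display label, treating None as 'unknown'."""
    return "unknown" if value is None else value.name.lower()


labels = [describe(from_db(raw)) for raw in (None, 0, 1)]
```

With this approach the labels live in code, so changing or translating them means a new release rather than a data change.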

Related

Database design, multiple M-M tables or just one?

Today I was designing a database for a potential personal project of mine. Since I couldn't decide which option would be better, I asked my databases teacher; unfortunately he couldn't tell me which of the two options is better than the other, or why.
I designed the database for a dummy data generator. Since I want to generate multilingual data I thought of these tables. (This is a simplification of the real tables.)
(first and last)names: id, name
streets: id, name
languages: id, name
Each names.name and streets.name originates from a language; sometimes a name can have multiple origins (e.g. Nick is both a Dutch and an English name).
Each language has multiple names and streets.
These two rules result in a Many-to-Many relationship. At the moment I've got only two tables, but I know I will get between 10 and 20 of these kind of tables.
The regular way one would do this is just make 10 to 20 Many-to-Many relationship tables.
Another idea I came up with was just one Many-to-Many table with a third column which specifies which table the id relates to.
At the moment I've got the design on my other PC so I will update it with my ideas visualized after dinner (2 hours or so).
Which idea is better and why?
To make the project idea a bit clearer:
It is always a hassle to create enough good, realistic-looking working data for projects. This application will generate this data for you and return the needed SQL so you only have to run the queries.
The user comes to the site to get the data. He states his table name and his column names, and then he can link the column names to types of data, think of:
* Firstname
* Lastname
* Email address (which will be randomly generated from the name of the person)
* Address details (street, house number, zip code, place, country)
* A lot more
Then, after linking columns with the types the user can set the number of rows he wants to make. The application will then choose a country at random and generate realistic looking data according to the country they live in.

That's actually an excellent question. This sort of thing leads to a genuine problem in database design and there is a real tradeoff. I don't know what rdbms you are using but....
Basically you have four choices, all of them with serious downsides:
1. One M-M table with check constraints that only one fkey can be filled in besides language and one column per potential table. Ick....
2. One M-M table per relationship. This makes things quite hard to manage over time especially if you need to change something from an int to a bigint at some point.
3. One M-M table with a polymorphic relationship. You lose a lot of referential integrity checks when you do this and to make it safe, have fun coding (and testing!) triggers.
4. Look carefully at the advanced features in your rdbms for a solution. For example in postgresql this can be solved with table inheritance. The downside is that you lose portability and end up in advanced territory.
Unfortunately there is no single definite answer. You need to consider the tradeoffs carefully and decide what makes sense for your project. If I was just working with one RDBMS, I would do the last one. But if not, I would probably do one table per relationship and focus on tooling to manage the problems that come up. But the former preference is about my level of knowledge and confidence, and the latter is a bit more of a personal opinion.
So I hope this helps you look at the tradeoffs and select what is right for you.
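To make the trade-off between options 2 and 3 concrete, here is a small sketch using Python's sqlite3 module. The `names` and `languages` tables follow the question; the junction table names `name_language` and `item_language` and their columns are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE languages (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE names     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO languages VALUES (1, 'Dutch'), (2, 'English');
    INSERT INTO names VALUES (1, 'Nick');

    -- Option 2: one junction table per relationship, fully constrained.
    CREATE TABLE name_language (
        name_id     INTEGER NOT NULL REFERENCES names (id),
        language_id INTEGER NOT NULL REFERENCES languages (id),
        PRIMARY KEY (name_id, language_id)
    );

    -- Option 3: one polymorphic table. item_id cannot carry a foreign
    -- key because its target table varies from row to row.
    CREATE TABLE item_language (
        item_table  TEXT    NOT NULL,
        item_id     INTEGER NOT NULL,
        language_id INTEGER NOT NULL REFERENCES languages (id),
        PRIMARY KEY (item_table, item_id, language_id)
    );
""")

# 'Nick' is both Dutch and English.
conn.executemany("INSERT INTO name_language VALUES (?, ?)", [(1, 1), (1, 2)])

# The per-relationship table rejects a reference to a name that does not exist.
try:
    conn.execute("INSERT INTO name_language VALUES (99, 1)")
    per_table_rejected = False
except sqlite3.IntegrityError:
    per_table_rejected = True

# The polymorphic table stores the same dangling reference without complaint.
conn.execute("INSERT INTO item_language VALUES ('names', 99, 1)")
dangling = conn.execute(
    "SELECT COUNT(*) FROM item_language WHERE item_id = 99"
).fetchone()[0]
```

Closing that gap in the polymorphic design is what the triggers mentioned in option 3 would be for.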

Extendable database schema for contacts (social)

I have an old application that needs upgrading. Doesn't everything nowadays?
The existing DB schema consists of predefined fields like phone, fax, email. Obviously, with the social explosion over the last 5-7 years (or longer depending on your country), end users need more control over creating contact cards the way they see fit rather than just what I think might be useful.
I'm concerned here with "digital" addresses, i.e. one-line addresses such as phone=ccc ccc ccc ccc.
Since physical addresses are pretty standard in terms of requirements, in this case users will have to use what they are given (location, postal, delivery) in order to keep the scope manageable.
So I'm wondering what the best practice format for storing digital info is. To me it seems I have two choices:
A simple 4 field table (ContactId, AddressTypeId, Address, FormatterId)
1000, "phone", "ccc ccc ccc ccc", phoneformatter
1000, "facebook", "myfacebook", facebookformatter
This would then be JOINED anywhere it's needed. The table would get massive though, and the join performance would degrade over time, I suspect.
A json blob that would require additional processing once read (ContactId, Addresses)
1000, {"phone": "ccc ccc ccc ccc", "facebook": "myfacebook"}
Or ... something else.
This db is for use in a given country by customers only trading domestically with client bases ranging from 3000-12000 accounts and then however many contacts per account - averages about 10 in current system.
My primary concern is user flexibility but performance is a key consideration in that. So I dunno, just do whatever and throw heaps of hardware at it ;)
Application is in C# if that makes any difference re: post query processing.

I would not go for the JSON blob. This will be nasty if you need to answer any queries like:-
Does anyone have me in their Facebook contacts?
What's the most popular type of social media contact?
You would be forced to parse the JSON for every record and be unable to create a simple index.
Your first solution (the four-field table) is nearly correct; however, FormatterId would need to be on an AddressType table. What you have is not normalised, as FormatterId would depend only on AddressTypeId. So you would have three tables:-
Contact
ContactAddress
AddressType
You haven't stated if you need to store two addresses of the same type against a single contact, e.g. if someone has two Twitter accounts. Answering this question will allow you to define the correct primary key on ContactAddress. It would either be (ContactId, AddressTypeId), if you can only have one of each type per contact, or a synthetic key (ContactAddressId).
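A minimal sketch of the three-table layout, using Python's sqlite3 module and the synthetic-key variant of ContactAddress. The table and column names follow the answer; the `Name` column, the index name, and the sample contacts are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Contact (ContactId INTEGER PRIMARY KEY, Name TEXT NOT NULL);

    -- FormatterId depends only on the address type, so it lives here.
    CREATE TABLE AddressType (
        AddressTypeId TEXT PRIMARY KEY,
        FormatterId   TEXT NOT NULL
    );

    -- Synthetic key: a contact may have several addresses of one type.
    CREATE TABLE ContactAddress (
        ContactAddressId INTEGER PRIMARY KEY,
        ContactId     INTEGER NOT NULL REFERENCES Contact (ContactId),
        AddressTypeId TEXT    NOT NULL REFERENCES AddressType (AddressTypeId),
        Address       TEXT    NOT NULL
    );
    -- A plain index supports lookups by type and address.
    CREATE INDEX ix_address_lookup ON ContactAddress (AddressTypeId, Address);

    INSERT INTO Contact VALUES (1000, 'Alice'), (1001, 'Bob');
    INSERT INTO AddressType VALUES
        ('phone', 'phoneformatter'), ('facebook', 'facebookformatter');
    INSERT INTO ContactAddress (ContactId, AddressTypeId, Address) VALUES
        (1000, 'phone', 'ccc ccc ccc ccc'),
        (1000, 'facebook', 'myfacebook'),
        (1001, 'facebook', 'bobsfacebook');
""")

# "What's the most popular type of social media contact?"
most_popular = conn.execute("""
    SELECT AddressTypeId, COUNT(*) AS n
    FROM ContactAddress
    GROUP BY AddressTypeId
    ORDER BY n DESC
    LIMIT 1
""").fetchone()

# "Does anyone have me in their Facebook contacts?"
owners = conn.execute(
    "SELECT ContactId FROM ContactAddress WHERE AddressTypeId = ? AND Address = ?",
    ("facebook", "myfacebook"),
).fetchall()
```

Both questions become single SQL statements here, whereas the JSON blob would need every row parsed in application code.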

Well, I believe you have a table named contact
contact(contactid, contact details, other details)
and now you want to remove this contact details from the contact table because the contact details may contain digital address, phone number and all.
But the table you are considering
(ContactId, AddressTypeId, Address, FormatterId) is not in normal form, and you can't uniquely identify a tuple until you read all four columns, which is bad; in this case indexing is also not going to help you.
So it is better to have a separate table for each type of digital address, with an index on contactID:
facebookdetails(contactid, rest of the details)
phonedetails(contactid, rest of the details)
Then the query can be a join of all the tables, and it will not degrade performance.
Hope this will help :)

Logic to identify a person uniquely

I am working on a medical PHP application which will be implemented at national level.
It will be used by multiple hospitals and the patient records will be centralized, i.e. every hospital will be accessing and adding patient records in the same database.
I want there to be only one record per patient, without any duplication. Simply put, no hospital should be able to enter a second record for the same patient. To make that possible, I need to know which criteria to use that will remain fixed throughout the entire lifetime of a patient. Only two come to mind: name and date of birth.
What other criteria could there be? I don't want to use mobile numbers, phone numbers, etc.; moreover, infants won't have them. I need criteria that exist for every patient and are unique.
Please give me your suggestions, or any better way to implement this functionality.

I'll take a shot because I've been involved in some data matching and validation, although not specifically in the medical industry. You haven't specified a particular country, just mentioned Asia, so I'll use an example from my home country of Australia just because I'm familiar with the rules and I believe the same would apply to many Asian countries:
We have a unique Medicare number used for health care, but it's not mandatory and while the free / discounted care means I expect 99%+ of people would have one you can't rely on it.
There is also a tax file number, likewise not mandatory even if you work, and people who have never had a job wouldn't normally have one.
You might be dealing with foreign people that aren't residents.
Drivers licenses are of course not mandatory to get healthcare.
It's perfectly legal to have "no fixed address". Plus some people will lie to get treatments and repeats of drugs etc. Not to mention many people move often.
Changing name is common in case of marriage / divorce, and unless done for illegal purposes someone can change their name just because they don't like their original. Not to mention people use common substitutions for various things like Jim versus James.
Typing mistakes will be very common over a large dataset.
In short I think the 'perfect' scheme you are asking for is impossible. The best you can do is apply a weighting rule to find likely duplicates. Same name / date of birth / place of birth for example is an unlikely but possible event so show a warning to the data entry operator it's a likely duplicate and let them see the details of the likely duplicate. Even things like a drivers license number that should be unique may indicate that the original entry just had a data entry error, not a new duplicate.
From my experience the best thing is a report that lists likely duplicates that must be reviewed by someone higher up the chain, and give them an easy option to merge the duplicates. Then you can start to use vaguer regular expressions that throw a few false positives, which can be dismissed when a human reviews them. You can also refine the model over time to get the best match results.
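The weighting rule could look something like the sketch below, which uses Python's standard difflib for fuzzy string comparison. The field names, the weights, the threshold, and the sample records are all assumptions chosen for illustration; a real system would tune them against reviewed data.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each field contributes to the match score.
WEIGHTS = {"name": 0.5, "date_of_birth": 0.3, "place_of_birth": 0.2}

# Hypothetical cut-off above which a pair is flagged for human review.
REVIEW_THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def duplicate_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(
        weight * similarity(rec_a[field], rec_b[field])
        for field, weight in WEIGHTS.items()
    )


def is_likely_duplicate(rec_a: dict, rec_b: dict) -> bool:
    return duplicate_score(rec_a, rec_b) >= REVIEW_THRESHOLD


existing = {"name": "James Smith", "date_of_birth": "1980-04-12",
            "place_of_birth": "Perth"}
# Same person entered again with a typing mistake in the name.
typo = {"name": "Jmaes Smith", "date_of_birth": "1980-04-12",
        "place_of_birth": "Perth"}
other = {"name": "Mary Jones", "date_of_birth": "1975-11-02",
         "place_of_birth": "Sydney"}

flag_typo = is_likely_duplicate(existing, typo)
flag_other = is_likely_duplicate(existing, other)
```

A flagged pair would only produce a warning or a line in the review report, not an automatic merge, in keeping with the human-review step described above.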

Combination of name, date of birth, blood group, place of birth etc., can be tried.

You need to use some nationwide ID, like a passport ID or a health insurance number.

Social Insurance Number with country.

Does this database model make sense?

I am new to database modelling, but I am doing my best to learn as much as possible on my own. Therefore I want to ask you whether my first attempt makes sense for the following example:
So I modeled the database as follows:
The database is about medicine. There are several medicine items which should be dosed depending on the age of the patient. Every medicine item can belong to one or more groups (or none).
This is just a test case to show what I learned so far. So every tip to improve my skills is welcome!
Thanks a lot!

The relationtable table name is just a placeholder, right? It should be more descriptive, maybe dosage?
Something tells me that age ranges will vary greatly. Some medicines have different rules for children under 3 years, others under 5, 10, and so on. Instead of creating a separate table, just include two extra columns (start and end) in relationtable. It will be much easier to query, and I wouldn't consider this a denormalization.
Talking about the age and dose tables: get rid of the unit column and use a normalized, fixed unit. Years for age and mg for doses. This will make querying much simpler. Don't be afraid to use floating-point numbers, e.g. 0.5 to represent six months.
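A sketch of what that could look like, using Python's sqlite3 module. The table name `dosage`, the column names, and every number in the sample rows are invented purely to show the query shape; they are not real dosing data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE medicine (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Age range stored directly on the row, in fixed units.
    CREATE TABLE dosage (
        medicine_id INTEGER NOT NULL REFERENCES medicine (id),
        age_start   REAL NOT NULL,   -- years, inclusive
        age_end     REAL NOT NULL,   -- years, exclusive
        dose_mg     REAL NOT NULL,   -- always milligrams
        CHECK (age_start < age_end)
    );

    INSERT INTO medicine VALUES (1, 'ExampleMed');
    INSERT INTO dosage VALUES
        (1, 0.0,  0.5, 10.0),
        (1, 0.5,  3.0, 25.0),
        (1, 3.0, 12.0, 50.0);
""")


def dose_for(medicine_id, age_years):
    """Return the dose in mg for the given age, or None if no range matches."""
    row = conn.execute(
        "SELECT dose_mg FROM dosage "
        "WHERE medicine_id = ? AND age_start <= ? AND ? < age_end",
        (medicine_id, age_years, age_years),
    ).fetchone()
    return None if row is None else row[0]


infant = dose_for(1, 0.25)    # three months old
toddler = dose_for(1, 0.5)    # exactly six months: falls in the second range
adult = dose_for(1, 40)       # no range defined in the sample data
```

Using a half-open range (start inclusive, end exclusive) is one way to avoid an age matching two rows at a boundary.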

I agree with what Tomasz wrote and would like to add:
Whether the relationtable is the correct way to go depends on some knowledge not contained in the table. It sounds strange that one medicine can be part of different groups and that the dosage depends on that relation. I would expect that a medicine can belong to different groups (resulting in a medicine2group mapping table) and that there exist different dosages depending on the age for a medicine (so you get a dosage4age table, combining the existing age and dose tables; that new table would directly reference the medicine).
Which version is correct cannot be told from the table alone.
As a rule of thumb: I get skeptical when a table without a proper name and concept links more than two other tables. It is possible, but it often hints at a concept hiding somewhere.
In order to check if the proposed model is correct, ask the business experts if the table is still correct if you replace Antibiotika with Superantibiotika in one of the first three rows. If it is, this means that the dosage does not depend on the group and should not be linked to it, so the model proposed by me would be more correct.
If the altered table is not correct, your model might be the better one, but I would listen carefully to the explanation of why it isn't correct.

Should I expose a user ID to public?

I have a form that reveals user IDs to the public. I was wondering whether this is dangerous. Personally I do not see anything bad about it. The ID is just used to reference a single database record.

If it were dangerous, Stack Overflow wouldn't be displaying user IDs in their URLs in order to make user profile lookups work: https://stackoverflow.com/users/104826/rfactor
Edit of seriousness of immense levels: if user IDs are themselves sensitive data; for example your primary keys for some reason happen to be social security numbers, that'll definitely be a security and privacy liability. If your user IDs are just auto-increment numbers though, you're clear.

Generally it's not a problem but it can give away hints on how active your site is, like how many users you have etc. If you consider this sensitive information or maybe even good marketing is completely up to you.
There's a story that this was one of the reasons the Germans lost WW2. They had sequential serial numbers from production written on each tank. By collecting ID numbers from tanks taken out, the British could estimate how many tanks the whole German army had and make new strategies from that.
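The serial-number story is usually called the German tank problem, and the standard estimate is simple enough to show. Given k observed IDs with maximum m, the usual frequentist estimate of the total is m + m/k - 1. The sketch below assumes IDs start at 1, are strictly sequential, and that the observed IDs are a random sample; the function name and the sample IDs are made up for illustration.

```python
def estimate_total(observed_ids):
    """Estimate how many sequential IDs exist from a sample of observed ones.

    Uses the classic estimate m + m/k - 1, where m is the largest
    observed ID and k is the number of observations.
    """
    k = len(observed_ids)
    if k == 0:
        raise ValueError("need at least one observed ID")
    m = max(observed_ids)
    return m + m / k - 1


# Four user IDs spotted in public URLs (hypothetical values).
estimate = estimate_total([19, 40, 42, 60])
```

For a site with auto-increment user IDs this means a handful of visible IDs can be enough for an outsider to guess the size of the user base.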

I have found that exposing primary keys that identify physical entities can create headaches.
Imagine if two blood samples come into a laboratory and test results are generated for each sample. Many different kinds of test might be done and each record representing a test result will have the sample_id as a foreign key.
If you share the database ID with the customer and you discover that two samples were accidentally switched, you will have to update the foreign keys in all the detail records representing the tests. If you instead exposed some other unique name outside your system, you will just need to switch the two unique names on the sample records in the master table.
There are other advantages related to data migration and there are advantages when entities are represented in more than one database in which it is difficult to create records with identical database ID's.
In my experience it is always best to expose a unique identifier other than the primary key outside your system. It gives you more flexibility in resolving data mix-ups, dealing with data migration issues, and in otherwise future-proofing your system.
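The blood-sample scenario can be sketched with Python's sqlite3 and uuid modules. The `sample_id` foreign key comes from the description above; the `public_id` column, the `test_result` layout, the helper functions, and the sample values are assumptions for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (
        sample_id INTEGER PRIMARY KEY,   -- internal key, never shown outside
        public_id TEXT NOT NULL UNIQUE   -- identifier shared with the customer
    );
    CREATE TABLE test_result (
        sample_id INTEGER NOT NULL REFERENCES sample (sample_id),
        test_name TEXT NOT NULL,
        value     REAL NOT NULL
    );
""")

id_a, id_b = str(uuid.uuid4()), str(uuid.uuid4())
conn.executemany("INSERT INTO sample VALUES (?, ?)", [(1, id_a), (2, id_b)])
conn.executemany(
    "INSERT INTO test_result VALUES (?, ?, ?)",
    [(1, "glucose", 5.1), (1, "iron", 14.0), (2, "glucose", 7.9)],
)


def results_for(public_id):
    """Look up test results by the externally visible identifier."""
    return conn.execute(
        """SELECT r.test_name, r.value
           FROM test_result AS r
           JOIN sample AS s ON s.sample_id = r.sample_id
           WHERE s.public_id = ?
           ORDER BY r.test_name""",
        (public_id,),
    ).fetchall()


def swap_public_ids(pk_a, pk_b):
    """Fix a mix-up by swapping public IDs on two sample rows only."""
    get = "SELECT public_id FROM sample WHERE sample_id = ?"
    pub_a = conn.execute(get, (pk_a,)).fetchone()[0]
    pub_b = conn.execute(get, (pk_b,)).fetchone()[0]
    put = "UPDATE sample SET public_id = ? WHERE sample_id = ?"
    with conn:  # one transaction; a placeholder avoids a UNIQUE clash mid-swap
        conn.execute(put, ("swap-in-progress", pk_a))
        conn.execute(put, (pub_a, pk_b))
        conn.execute(put, (pub_b, pk_a))


before = results_for(id_a)
swap_public_ids(1, 2)
after = results_for(id_a)
```

Only the two `sample` rows change during the swap; none of the `test_result` rows need their foreign keys rewritten.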

As for me, showing an ID is as dangerous as showing a user name.

Exposing a user ID is not, in and of itself, bad. It depends on the level of privacy and security needed. If the user ID does not expose and cannot be tied to any other personal data that should otherwise be private, it may not be a problem.
But don't think that public user IDs can never be a problem.
Make sure you don't allow anyone to break in to any private data just by knowing user IDs. Facebook has had problems like that. Here's just one example. While revealing user IDs wasn't the whole story, it was part of the equation.

Will it hurt anything? Only you can decide that, and you should think that through.
But in general, it is poor form to display the User ID without having a business reason to do so. (Saves you work is probably not a good business reason.)

If it is a generated database id with no other meaning, it's not dangerous. Though I don't think revealing an id is elegant either. It's a technical detail and I can't understand why you would like to show it to users.