Email-primary key, is it a bad idea? - database

I've read here that using an email address field as the primary key for a user database is a very bad idea.
How, and why? The book doesn't delve into the reasons.
How can using email field as a primary key for a table be so deleterious?
Are there some horrible long-term implications that I do not see?
Edit:
This question is about the performance issues of string comparison; however, that does not concern me (at least for this question). I am interested in the long-term implications of using email as a primary key.
From experience, does it generally cause problems in the future?

Well, I guess the most obvious (not performance-related) reason is that users may want (or need) to change their email addresses.
If the email address is the primary identifier for user accounts this can get confusing pretty quickly.
From a domain modeling view, email addresses are commonly handled as attributes of persons/users, just as a user name is. While user name changes can reasonably be disallowed, email addresses are rather likely to change at some point (the user loses access to the account, the organization that maintained the account retires, etc.).
Also, an email address does not need to be eternally assigned to the same real-life person. joe@example.com could be owned by "Joe Miller" in 2005, "Joe Carlos" in 2013, and by "Joeberto Joeman" from 2020 onwards.
This possible need for change is IMO the main reason why email addresses don't make good primary keys.

There are a few attributes you look for in a primary key.
The problems with "email address" are:
It's not possible to guarantee it's unique: an email address may be used by a group of people at the same time, or by different people over time.
It's not immutable: the same person may change emails over time, and this would require you to update all the tables with foreign key relationships.
It does not uniquely identify a person: one person may have multiple email addresses.
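The immutability point can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table and column names here (users, orders, user_id) are hypothetical and chosen only for illustration. Because other tables reference the surrogate id rather than the email, changing an email touches exactly one row.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,   -- surrogate key, never changes
        email TEXT NOT NULL UNIQUE   -- ordinary attribute, may change
    );
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        item    TEXT NOT NULL
    );
""")

user_id = conn.execute(
    "INSERT INTO users (email) VALUES (?)", ("joe@example.com",)
).lastrowid
conn.execute(
    "INSERT INTO orders (user_id, item) VALUES (?, ?)", (user_id, "book")
)

# The user changes their email: one UPDATE on one table, no foreign keys touched.
conn.execute(
    "UPDATE users SET email = ? WHERE id = ?", ("joe@example.org", user_id)
)

rows = conn.execute(
    "SELECT u.email, o.item FROM orders o JOIN users u ON u.id = o.user_id"
).fetchall()
print(rows)
```

Had orders referenced the email directly, the same change would need to be propagated to every referencing row in every referencing table.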

Related

Why Do I need user id attribute?

I am currently trying to design a social network type of website and this is the class diagram that I have so far.
At the moment I have userId and username in separate tables because I wanted to normalize these tables, but now I am not sure why I need the userId attribute. I have done research and a lot of similar projects have this attribute, but I don't get why, if the username is already going to uniquely identify a particular user.
By the way, I am aware I have a problem with the requests table, because at the moment, with the attributes given, I cannot identify a primary key.
Thanks
Two big reasons I can think of:
Optimization. SQL databases typically perform far better when using integer primary keys than varchar ones. Lookup-something-by-user is one of the most common operations in this environment, so this has real performance implications. Many DBAs don't like GUID/UUIDs as PKs for exactly this reason.
Nothing dictates that a username must uniquely identify users. Case in point: Stack Exchange user handles don't have to be unique, and are freely editable.

Is it insecure to reveal a row's primary key to the user?

Why do many applications replace the primary key of a database with a seemingly random alternative id when revealing the record to the user?
My guess is that it prevents users from guessing other rows in the table. If so, isn't that just a false sense of security?
I guess you are talking about surrogate keys here. One of the desired or supposed advantages of surrogate keys is that they aren't burdened by any external meaning or dependency on anything outside the database. So for example the surrogate key values can safely be reassigned or the key can be refactored or discarded without any consequences for users of the system.
Generally surrogate keys are kept hidden from users so that they don't acquire any such external dependencies. Being hidden from users was in fact part of the original definition of a surrogate key as proposed by E. F. Codd. If key values reside in the user's browser cache or favourites list then they aren't much use as "surrogates" any more. So that's one common reason why you will see one key used only inside the database and a different key for the same table made visible in the application.
I think it may depend on the type of application you are working with. I work with enterprise software that is only used by the company I work for and is not generally available to the outside world. In this case, it is often critical to let the user see the surrogate key for people-related records because the information in the person table has no uniqueness. There can be two John Smiths (we actually have over 1000 of them) who are genuinely different people. They may even have the same business address and be different people (sons are often named for fathers and work in the same medical practice, for instance). So they need to refer to the surrogate key on forms and in reporting to ensure they are using the record they thought they wanted. Otherwise, if they wanted to research further details about the John Smith that they saw in a report, how would they look it up in the application without having to go through all 1000 to find the right one? Creating a fake id as well as the real one would be time consuming (we import millions of records at a time) and for no real gain, since the data would not be visible outside our company application.
For a web app that is open to the general public, I can see where you might not want to show this information.
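For the public web app case, one way to keep the internal key hidden is to store a second, random identifier and expose only that. The following is a rough sketch under assumptions: the documents table, the public_id column, and both helper functions are hypothetical names. A random identifier only makes rows harder to enumerate; it does not by itself control who may access them.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id        INTEGER PRIMARY KEY,   -- internal key, never shown to users
        public_id TEXT NOT NULL UNIQUE,  -- random identifier used in URLs and forms
        title     TEXT NOT NULL
    )
""")

def create_document(title):
    """Insert a row and return only its public identifier (hypothetical helper)."""
    public_id = uuid.uuid4().hex  # 32 hexadecimal characters
    conn.execute(
        "INSERT INTO documents (public_id, title) VALUES (?, ?)",
        (public_id, title),
    )
    return public_id

def find_document(public_id):
    """Look a row up by its public identifier; return the title or None."""
    row = conn.execute(
        "SELECT title FROM documents WHERE public_id = ?", (public_id,)
    ).fetchone()
    return row[0] if row else None

token = create_document("Quarterly report")
print(find_document(token))  # found via the public identifier
print(find_document("1"))    # guessing the internal key finds nothing
```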

Should User and Address be in separate tables?

Currently my users table has the below fields
Username
Password
Name
Surname
City
Address
Country
Region
TelNo
MobNo
Email
MembershipExpiry
NoOfMembers
DOB
Gender
Blocked
UserAttempts
BlockTime
Disabled
I'm not sure if I should put the address fields in another table. I have heard that I will be breaking 3NF if I don't although I can't understand why. Can someone please explain?
There are several points that are definitely not 3NF; and some questionable ones in addition:
Could there be multiple addresses per user?
Is an address optional or mandatory?
Does the information in City, Country, Region duplicate that in Address?
Could a user have multiple TelNos?
Is a TelNo optional or mandatory?
Could a user have multiple MobNos?
Is a MobNo optional or mandatory?
Could a user have multiple Emails?
Is an Email optional or mandatory?
Is NoOfMembers calculated from the count of users?
Can there be more than one UserAttempts?
Can there be more than one BlockTime per user?
If the answer to any of these questions is yes, then it indicates a problem with 3NF in that area. The reason for 3NF is to remove duplication of data; to ensure that updates, insertions and deletions leave the data in consistent form; and to minimise the storage of data - in particular there is no need to store data as "not yet known/unknown/null".
In addition to the questions asked here, there is also the question of what constitutes the primary key for your table - I would guess it is something to do with user, but name and the other information you give is unlikely to be unique, so will not suffice as a PK. (If you think name plus surname is unique are you suggesting that you will never have more than one John Smith?)
EDIT:
In the light of further information that some fields are optional, I would suggest that you separate out the optional fields into different tables, and establish 1-1 links between the new tables and the user table. This link would be established by creating a foreign key in the new table referring to the primary key of the user table. Since you say none of the fields can have multiple values, they are unlikely to give you problems at present. If, however, any of these change, then not splitting them out will give you problems in upgrading the application and the data to support the application. You still need to address the primary key issue.
As long as every user has one address and every address belongs to one user, they should go in the same table (a 1-to-1 relationship). However, if users aren't required to enter addresses (an optional relationship), a separate table would be appropriate. Also, in the odd case that many users share the same address (e.g. they're convicts in the same prison), you have a 1-to-many relationship, in which case a separate table would be the way to go. EDIT: And yes, as someone pointed out in the comments, if users have multiple addresses (a 1-to-many the other way around), there should also be separate tables.
Just as a point that I think might help someone with this question, I once had a situation where I put addresses right in the user/site/company/etc. tables because I thought, why would I ever need more than one address for them? Then, after we completed everything, it was brought to my attention by a different department that we needed the possibility of recording both a shipping address and a billing address.
The moral of the story is, this is a frequent requirement, so if you think you ever might want to record shipping and billing addresses, or can think of any other type of address you might want to record for a user, go ahead and put it in a separate table.
In today's age, I think phone numbers are a no brainer as well to be stored in a separate table. Everyone has mobile numbers, home numbers, work numbers, fax numbers, etc., and even if you only plan on asking for one, people will still put two in the field and separate them by a semi-colon (trust me). Just something else to consider in your database design.
The point is that if you can imagine having two addresses for the same user in the future, you should split now and have an address table with a FK pointing back to the users table.
P.S. Your table is missing an identity to be used as PK, something like Id or UserId or DataId, call it the way you want...
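The split suggested in these answers might look roughly like the following sketch, again using Python's sqlite3 module. The column names (user_id, kind, street) and the sample data are assumptions made for illustration, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,   -- the identity column the table was missing
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE addresses (
        address_id INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),  -- FK back to users
        kind       TEXT NOT NULL,       -- e.g. 'billing' or 'shipping'
        street     TEXT NOT NULL,
        city       TEXT NOT NULL,
        country    TEXT NOT NULL
    );
""")

uid = conn.execute(
    "INSERT INTO users (username) VALUES (?)", ("jsmith",)
).lastrowid

# One user, two addresses: no schema change needed when the requirement appears.
conn.executemany(
    "INSERT INTO addresses (user_id, kind, street, city, country) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (uid, "billing", "1 Main St", "Springfield", "US"),
        (uid, "shipping", "9 Dock Rd", "Shelbyville", "US"),
    ],
)

kinds = [
    row[0]
    for row in conn.execute(
        "SELECT kind FROM addresses WHERE user_id = ? ORDER BY kind", (uid,)
    )
]
print(kinds)
```

A user with no address simply has no rows in addresses, which also covers the optional case without storing nulls.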
By adding them to a separate table, you will have an easier time expanding your application if you decide to later. I generally have a simple user table with user_id or id, user_name, first_name, last_name, password, created_at & updated_at. I then have a profile table with the other info.
It's really all preference though.
You should never group two different types of data in a single table, period. The reason is that if your application is intended to be used in production, sooner or later different use cases will come up that require a more highly normalised table structure.
My recommendation - Adhere to SOLID principles even in DB design.

Is using Personally Identifiable Information (PII) as foreign keys discouraged in database design?

While cleansing PII from test data I have been stuck with a challenging scenario: cascading the changes through the foreign key relationships in the data. Given the focus on privacy and regulations, should this practice be discouraged? If the PII itself were not used in any key fashion, a neat trick would be to just shuffle the columns.
There are some commercial tools available to address this problem but none of them seem to handle a large variety of databases well.
Sounds dangerous and stupid and inefficient. Keys should be synthetic ids.
HIPAA has a concept called the "Unique Patient Identifier" which can be used as we describe to link data: http://www.ncvhs.hhs.gov/app4.htm
"Unique Patient Identifier eliminates the need for the repetitive use and disclosure of an individual's personal identification information (i.e. name, age, sex, race, marital status, place of residence, etc.) for routine internal and external communications (e.g. orders, results, medication, consultation, etc.) and protects the privacy of the individual. It helps preserve the patient anonymity while facilitating communication and information sharing. Healthcare is fundamentally a multi-disciplinary process. A Unique Patient Identifier enables the integration and the availability of critically needed information from multi-disciplinary sources and multiple care settings. Therefore, the integrity and security of the patient information depend on the use of a reliable Unique Patient Identifier."
The privacy issue hinges not so much on the identifier itself, but on the security and privacy of the data that the identifier is used to access, and how that access is controlled. My understanding is that typically this means that a system querying for information via a patient identifier should only get back information that can not be pieced together to reveal private information.
Essentially you would generate an artificial key for each person. Even though it is unique to the person, it is not personally identifying, unless you also were to release personally identifiable information along with it. For example, if you let people see only first names with a particular query, but also returned the artificial key, then they now know that artificial key 00003 is associated with first name Bob. Now if you allow them to somehow go back and query with 00003 as criteria, and allow them access to the last name, you can see how they can start to accumulate information. It is important that there be no way for an unauthorized user to get the artificial key and PII returned in the same query, since that would then make the artificial key itself PII. That's my interpretation at least.
Besides the HIPAA issues, another problem with using PII as a key is that it changes. People get new SSNs when they have their identities stolen. SSNs are also often miskeyed and thus relate the information to the wrong person (thinking more of data imports from other systems here). People (especially women) often change their names. Different people also have the same name (and often, for this reason, databases hold incorrect SSN information for them as well because they match to the wrong SSN for that name), and thus very little PII is in fact unique enough to be a key field. Further, PII should be stored in an encrypted field, making it an even worse choice for a key field.

Should I use a number or an email id to identify a user on website?

I have a web app where I register users based on their email id.
From a design/ease of use/flexibility point of view, should I assign a unique number to each user or identify user based on emailid?
Advantage of assigning unique number:
I can change the login itself at a later point without losing the data of the user (flexible).
Disadvantage:
I have to deal with numbers when using the sql command line (error prone).
Which is better? Do you see any other issues that need to be considered for either scheme?
The identity of your users should be unique and immutable. Choosing the email address as identity is not a good idea for several reasons:
The email is one facet of the user's identity that can change at any point in time.
You might decide to allow more than one email.
You might decide to add other facets, like OpenID or Live ID, or even just a plain old username.
There's nothing wrong with allowing multiple identities to share the same email facet. It is a rare scenario, but not unheard of.
Normalizing the email address is hard and error prone, so you might have problems enforcing the uniqueness. (Are email addresses case sensitive? Do you ignore . or + inside emails? How do you compare non-English emails?)
Btw, using the email as a public representation of the user identity can be a security and privacy problem, especially if some of your users are under 13 years old. You will need a different public facet for the user identity.
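To make the normalization difficulty concrete, here is a small sketch comparing two hypothetical normalization rules; both function names are invented for this example. It relies on the fact that the domain part of an address is case-insensitive while the local part is, strictly speaking, allowed to be case-sensitive. How "." and "+" in the local part are treated is provider-specific, so neither rule touches them.

```python
def normalize_conservative(address: str) -> str:
    """Lowercase only the domain; leave the local part exactly as typed."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return f"{local}@{domain.lower()}"

def normalize_aggressive(address: str) -> str:
    """Lowercase everything; may merge mailboxes some servers treat as distinct."""
    return normalize_conservative(address).lower()

a = "Joe.Miller+news@Example.COM"
b = "joe.miller+news@example.com"

# The two rules disagree on whether these are the same address.
print(normalize_conservative(a) == normalize_conservative(b))  # False
print(normalize_aggressive(a) == normalize_aggressive(b))      # True
```

Whichever rule is picked, a uniqueness constraint on the email column only enforces that particular rule, which is one more reason not to hang the user's identity on it.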
Use both.
You have to add an id because you really don't want other tables to use the email address as a foreign key.
Make the email address unique so that you can still use it to identify a user with sql command line.
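A minimal sketch of the "use both" approach, assuming a hypothetical users table in SQLite: the numeric id is the primary key that other tables would reference, and a UNIQUE constraint keeps the email usable for lookups from the command line.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,   -- what other tables reference
        email TEXT NOT NULL UNIQUE   -- still usable for lookups
    )
""")
conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))

# The UNIQUE constraint rejects a second account with the same email.
try:
    conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Identify the user by email when working by hand, then use the id everywhere else.
user_id = conn.execute(
    "SELECT id FROM users WHERE email = ?", ("a@example.com",)
).fetchone()[0]
print(duplicate_rejected, user_id)
```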
Unique number - ALWAYS!
But keep the number hidden from the user.
The user should be allowed to change their email. If this is used as the primary identifier then it can cause lots of complications when the key is used in multiple tables.
You should have another identifier other than the user's email address which is not visible to the user and never changes. You should then enforce uniqueness on the email address so it can be used as a candidate key.
You will find that users will want to change their email address, or really anything which they can see, so as good practice you should have an identifier which cannot be changed.
Dealing with numbers in a sql command object would not really be any more error prone than using the actual email address; if anything, I would think it would be less error prone.
Your disadvantage is not a disadvantage. Using numbers with sql is not more or less of a problem than using emails or anything else for that matter.
On the other hand, your advantage is quite a strong one: you might want to associate users with each other, different emails with one user account, etc., and always using the email will make things harder.
Think also of URLs that include user identification: an ID is much easier to handle there than an email, where you have to think about proper URL encoding.
So in favour of flexibility and ease of use, I would strongly recommend a unique userID.
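The URL point can be sketched with Python's standard urllib.parse module. The /users/ path and both helper functions are hypothetical; the sketch only shows that an email needs percent-encoding in a path segment while an integer id does not.

```python
from urllib.parse import quote, unquote

def profile_url_by_id(user_id: int) -> str:
    """An integer id can be placed in a path as-is."""
    return f"/users/{user_id}"

def profile_url_by_email(email: str) -> str:
    """An email must be percent-encoded; safe='' also encodes '/' if present."""
    return f"/users/{quote(email, safe='')}"

print(profile_url_by_id(42))
encoded = profile_url_by_email("joe+news@example.com")
print(encoded)

# The receiving side has to decode the segment again before looking the user up.
print(unquote(encoded.rsplit("/", 1)[1]))
```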
Just some points to consider.
How will you validate the email address?
How do you ensure that it is really unique? (I don't always use my real address, e.g. m.mouse at disney.com.)
I like to use a unique key generated by the database to identify the record and then add attributes which are out of my control separately
A person's email can change but the id will not
Unique numbers. As well as the reasons identified, I think it would be less error prone than using an email address. Shorter, no funny characters, easier to validate, etc.

Resources