Should primary key be constant int? - database

So in one of my school projects, we have to create a database for a pseudo e-commerce website. The project instructions asks us to implement our database with Boyce–Codd normal form, but I think there're some ambiguity about this normal form.
Let's say that we implement the entity Users like that :
Users(*email, username, password, some_other_fields)
(note: * meens primary key)
First of all, if I understood well the BCNF, this entity isn't BCNF. If the usernames are unique as well as emails, then we can also define this entity like this :
Users(*username, email, password, some_other_fields)
My first question is how to create this entity in Boyce–Codd normal form ?
Then I have another issue with this BCNF form : the missing id. Assuming an user can change his username and his email. The primary key will also be change. My issue is that I don't really have a temporal constant that define an element in my entity. This implies, for example, some issues about logging : assuming we log action from an user with the primary key, if foo#smth.com change his email to foo2#smth.com, we can have this kind of logs :
[foo#smth.com] : action xxx
[foo#smth.com] : action yyy
[foo2#smth.com] : action zzz
Then if we don't catch the email change, all our precedent logs means nothing : we don't know who is foo#smth.com.
Then, my second question is don't you think that using a temporal constant id (an integer for example) is more secure ?

Uniqueness is not enough for BCNF. BCNF stresses on Functional Dependency. That is, whether attributes are dependent on the key functionally.
In this case attributes cannot depend on the email. Emails can be changed, inactive, reclaimed by someone else. Therefore, being unique does not justify it enough to be a candidate key. Username may have a higher dependability if functionality restricts the user name to get changed.
Functional Dependency inherently depends on Functional Design. If the application you are creating the table for assumes that usernames will never be allowed to change, then the attributes can depend on username to be a candidate key. If the functional design does allow the username to be changed, then you need to introduce or combine a key that is both unique and functionally dependable.
In case of introduced additional unique ids, they are not 'inherently ' more 'secure' than username here. But they 'feel' or 'become' secure, because presumably the functionality and functional design do not expect the id to be changed. Again, if your functional design allows that id to be changed, then that will not remain secure. Eventually it all depends on your functionality, requirements, and how your attributes are expected to behave according to that functional spec.
If you must have to consider introducing an ID, being not satisfied with dependability of username, then instead of an int/integer, consider rather a GUID, for many many reasons such as the following:
int/interger are typically periodic, that is, they recycle after a limit of given platform, for example for 16 bit ints limit is 32767 to -32768. GUIDs may reappear too, but the chance is statistically much less significant.
for operations that take place at different subsystems and need to be synched later, non-unique ids may get created. Consider two shops of a chain that can register customer in offline mode, and later synches up at cloud. First store creates customer with an ID of say 3000, and second store does the same. When their data synch, you have to use a different composite key structure to accommodate that. GUIDs having higher chances to be unique, can solve them.

Related

Why Do I need user id attribute?

I am currently trying to design a social network type of website and this is the class diagram
that I have so far
at the moment I have userId and username in separate tables because I wanted to normalize these tables but now I am not sure why do I need the userId attribute? I have done research and a lot of similar projects have this attribute but I don't get why? if the username is already going to uniquely identify a particular user.
By the way I am aware I have a problem with the requests table because at the moment with the attributes given I cannot identify a primary key
Thanks
Two big reasons I can think of:
Optimization. SQL databases typically perform far better when using integer primary keys than varchar ones. Lookup-something-by-user is one of the most common operations in this environment, so this has real performance implications. Many DBAs don't like GUID/UUIDs as PKs for exactly this reason.
Nothing dictates that a username must uniquely identify users. Case in point: Stack Exchange user handles don't have to be unique, and are freely editable.

Is it insecure to reveal a row's primary key to the user?

Why do many applications replace the primary key of a database with a seemingly random alternative id when revealing the record to the user?
My guess is that it prevents users from guessing other rows in the table. If so, isn't that just false sense of security?
I guess you are talking about surrogate keys here. One of the desired or supposed advantages of surrogate keys is that they aren't burdened by any external meaning or dependency on anything outside the database. So for example the surrogate key values can safely be reassigned or the key can be refactored or discarded without any consequences for users of the system.
Generally surrogate keys are kept hidden from users so that they don't acquire any such external dependencies. Being hidden from users was in fact part of the original definition of a surrogate key as proposed by E.F.Codd. If key values reside in the user's browser cache or favourites list then they aren't much use as "surrogates" any more. So that's one common reason why you will see one key used only inside the database and a different key for the same table made visible in the application.
I think it may depend on the type of application you are working with. I work with Enterprise software that is only used by the company I work for and is not generally available to the outside world. In this case, it is often critical to let the user see the surrogate key for people-related records because the information in the person table has no uniqueness. There can be two John Smiths (we actually have over 1000 of them) who are genuinely different people. They may even have the same business address and be different people (Sons are often named for fathers and work in the same medical practice for instance). So they need to refer to the surrogate key on forms and in reporting to ensure they are using the record they thought they wanted. OItherwise if they wanted to research further details about the John Smith that they saw in a report, how would they look it up in the aaplication without having to go through all 1000 to find the right one? Creating a fake id as well as the real one would be time consuming (we import millions of records at a time) and for no real gain since the data would not be visible outside our comapny application.
For a web app that is open to the general public, I can see where you might not want to show this information.

Should I expose a user ID to public?

I have a form that reveals user IDs to public. I was wondering that is this dangerous. Personally I do not see anything bad about it. The ID is just used to reference a single database record.
If it were dangerous, Stack Overflow wouldn't be displaying user IDs in their URLs in order to make user profile lookups work: https://stackoverflow.com/users/104826/rfactor
Edit of seriousness of immense levels: if user IDs are themselves sensitive data; for example your primary keys for some reason happen to be social security numbers, that'll definitely be a security and privacy liability. If your user IDs are just auto-increment numbers though, you're clear.
Generally it's not a problem but it can give away hints on how active your site is, like how many users you have etc. If you consider this sensitive information or maybe even good marketing is completely up to you.
There's a story that this was one of the reasons the germans lost the WW2. They had sequential serial numbers from production written on each tank. By collecting id numbers from tanks taken out the british could estimate how many tanks the whole german army had and make new strategies from that.
I have found that exposing primary keys that identify physical entities can create headaches.
Imagine if two blood samples come into a laboratory and test results are generated for each sample. Many different kinds of test might be done and each record representing a test result will have the sample_id as a foreign key.
If you share the database ID with the customer and you discover that two samples were accidentally switched, you will have to update the foreign keys in all the detail records representing the tests. If you instead exposed some other unique name outside your system, you will just need to switch the two unique names on the sample records in the master table.
There are other advantages related to data migration and there are advantages when entities are represented in more than one database in which it is difficult to create records with identical database ID's.
In my experience it is always best to expose a unique identifier other than the primary key outside your system. It gives you more flexibility in resolving data mix-ups, dealing with data migration issues, and in otherwise future-proofing your system.
as For me ID is as dangerous as showing user name.
Exposing an user ID is not, in and of itself, bad. It depends on the level of privacy and security needed. If the user ID does not expose and cannot be tied to any other personal data that should otherwise be private, it may not be a problem.
But don't think that public user IDs can never be a problem.
Make sure you don't allow anyone to break in to any private data just by knowing user IDs. Facebook has had problems like that. Here's just one example. While revealing user IDs wasn't the whole story, it was part of the equation.
Will it hurt anything? Only you can decide that, and you should think that through.
But in general, it is poor form to display the User ID without having a business reason to do so. (Saves you work is probably not a good business reason.)
If it is a generated database id with no other meaning, it's not dangerous. Though I don't think revealing an id is elegant either. It's a technical detail and I can't understand why you would like to show it to users.

Should I use a number or an email id to identify a user on website?

I have a web app where I register users based on their email id.
From a design/ease of use/flexibility point of view, should I assign a unique number to each user or identify user based on emailid?
Advantage of assigning unique number:
I can change the login itself at a later point without losing the data of the user(flexible).
Disadvantage:
I have to deal with numbers when using the sql command line(error prone).
Which is better? Do you see any other issues that need to be considered for either scheme?
The identity of your users should be unique and immutable. Choosing the email address as identity is not a good idea for several reasons:
The email is one facet of the user's identity that can change at any point in time.
You might decide to allow more than one emails.
You might decide to add other facets, like OpenID or Live ID, or even just old plain username.
There's nothing wrong with allowing multiple identityies to share the same email facet. It is a rare scenario, but not unheard of.
Normalizing the email address is hard and error prone, so you might have problems enforcing the uniqueness. (Are email addresses case sensitive? Do you ignore . or + inside emails? How do you compare non-english emails?)
Btw, using the email as a public representation of the user identity can be a security and privacy problem. Especially if some of your users are under 13 years. You will need a different public facet for the user identity.
Use both.
You have to add an id because you really don't want other tables to use the email address as a foreign key.
Make the email address unique so that you can still use it to identify a user with sql command line.
Unique number - ALWAYS!
But keep the number hidden from the user.
The user should be allowed to change their email. If this is used as the primary identifier then it can cause lots of complications when the key is used in multiple tables.
You should have another identifier other then the users email address which is not visible to the user and never changes. You should then enforce uniqueness on the email address so it can be used as a candidate key.
You will find that users will want to change their email address, or anything really which they can see, so you should as good practice have an identifier which cannot be changed.
Dealing with numbers in sql command object would not really be any more error prone then using the actual email address, if anything I would think it would be less error prone.
Your disadvantage is not a disadvantage. Using numbers with sql is not more or less a problem than using emails or anything else for the matter.
On the other hand your advantage is quite a strong one, you might want to associate users with each other, different emails with one user account, etc. and always using the email will make things harder.
Think also of urls including user identication, an ID is much easier to handle there than an email where you have to think about the proper url endocing.
So in favour of flexiblity and ease of use, I would strongly recommend a unique userID.
Just some points to consider.
How will you validate the email address?
How do you ensure that it is really unique (I don't always use my real address e.g. m.mouse at disney.com
I like to use a unique key generated by the database to identify the record and then add attributes which are out of my control separately
A person's email can change but the id will not
Unique numbers. As well as the reasons identified, I think it would be less error prone than using an email address. Shorter, no funny characters, easier to validate, etc.

What is a good KISS description of Boyce-Codd normal form?

What is a KISS (Keep it Simple, Stupid) way to remember what Boyce-Codd normal form is and how to take a unnormalized table and BCNF it?
Wikipedia's info: not terribly helpful for me.
Chris Date's definition is actually quite good, so long as you understand what he means:
Each attribute
Your data must be broken into separate, distinct attributes/columns/values which do not depend on any other attributes. Your full name is an attribute. Your birthdate is an attribute. Your age is not an attribute, it depends on the current date which is not part of your birthdate.
must represent a fact
Each attribute is a single fact, not a collection of facts. Changing one bit in an attribute changes the whole meaning. Your birthdate is a fact. Is your full name a fact? Well, in some cases it is, because if you change your surname your full name is different, right? But to a genealogist you have a surname and a family name, and if you change your surname your family name does not change, so they are separate facts.
about the key,
One attribute is special, it's a key. The key is an attribute that must be unique for all information in your data and must never change. Your full name is not a key because it can change. Your Social Insurance Number is not a key because they get reused. Your SSN plus birthdate is not a key, even if the combination can never be reused, because an attribute cannot be a combination of two facts. A GUID is a key. A number you increment and never reuse is a key.
the whole key,
The key alone must be sufficient [and necessary!] to identify your values; you cannot have the same data represented by different keys, nor can a subset of the key columns be sufficient to identify the fact.
Suppose you had an address book with a GUID key, name and address values. It is OK to have the same name appearing twice with different keys if they represent different people and are not the "same data".
If Mary Jones in accounting changes her name to Mary Smith, Mary Jones in Sales does not change her name as well.
On the other hand, if Mary Smith and John Smith have the same street address and it really is the same place, this is not allowed. You have to create a new key/value pair with the street address and a new key.
You are also not allowed to use the key for this new single street address as a value in the address book since now the same street address key would be represented twice.
Instead, you have to make a third key/value pair with values of the address book key and the street address key; you find a person's street address by matching their book key and address key in this group of values.
and nothing but the key
There must be nothing other than the key that identifies your values. For example, if you are allowed an address of "The Taj Mahal" (assuming there is only one) you are not allowed a city value in the same record,
since if you know the address you would also know the city. This would also open up the possibility of there being more than one Taj Mahal in a different city.
Instead, you have to again create a secondary Location key with unique values like the Taj, the White House in DC, and so on, and their cities.
Or forbid "addresses" that are unique to a city.
So help me, Codd.
Here are some helpful excerpts from the Wikipedia page on Third Normal Form:
Bill Kent defines Third Normal Form this way:
Each non-key attribute "must provide
a fact about the key, the whole key,
and nothing but the key."
Requiring that non-key attributes be
dependent on "the whole key" ensures
that a table is in 2NF; further
requiring that non-key attributes be
dependent on "nothing but the key"
ensures that the table is in 3NF.
Chris Date adapts Kent's mnemonic to define Boyce-Codd Normal Form:
"Each attribute must represent a fact
about the key, the whole key, and
nothing but the key." Here the
requirement is concerned with every
attribute in the table, not just
non-key attributes.
This comes into play when a table has multiple compound candidate keys, and an attribute within one candidate keys has a dependency on a part of another candidate key. Third Normal Form wouldn't prohibit this, because it excludes key attributes. But BCNF applies the rule to key attributes as well.
As for how to make a table satisfy BCNF, you need to represent the extra dependency, with another attribute and possibly by splitting attributes into another table.
I googled "boyce codd normal form" and after wikipedia this is the second result. My textbook gives a very simple definition in terms of relational database management systems:
The left side of every nontrivial FD must be a superkey.
-"Database Systems The Complete Book" by Garcia-Molina, Ullman and Widom.
The best informal answer I've read is that, in BCNF, every "arrow" in every functional dependency is an "arrow" out of a candidate key. I don't recall the source, but it was probably something Chris Date wrote.
Basically Boyce-Codd is "fifth normal form". It is visually recognizable by the existance of "Attributive entities" in the data model, for things like Types (e.g. roles, status, process state, location-type, phone-type, etc).
The attributive entities (sub-subtypes) are lists of finite sets of values that further categorize a class level entity. So you may have a phone-type ('mobile', ' desk', 'VOIP') email account type ('business', 'personal', 'gaming'), role (project manager, data modeler, super model) etc.
Another morphological clue is the existance of super-types, (aka. master-classes, super-classes, meta-entities) such as Parties (subtypes being company, person, etc.).
It's basically Taxonomy gone wild (..no the video is not that exciting) to the atomic or leaf-level; see Bill Karwin's comment above for a more technical explanation.
Boyce-Codd level models are essentially highly detailed logical models, derived from more simplistic business-based conceptual models. **They are typically NOT implemented ver batim in the PHYSICAL model, because PDM optimization for performance (or functional simplicity) may result in the super-types and attributive entities being managed as drop-down lists in UIs, or in behind the scenes logic in the application, or in database constraints and methods to enforce referential integrity. (i.e. they may end up as look-up tables in the PDM schema, or they may be handled by code and not represented in the database).
So - why do them if they may not end up in the PDM? For the same reason you build a good 3NF model before you 'optimize', so that the database structure reflects the real world and is hence more stable than the typical kludges we inherit and have to do heroic acts to make work as our business/clients requirements change.
Often times it is easiest to listen to your gut and this will come naturally. Generally speaking, if you meet 3NF you have met BCNF. This doesn't cover detailed analysis of an ERD or have examples but there are thirteen rules according to Codd. I find it best to follow these rules but always remember there is no one correct way to do things so follow them loosely. So regarding the RDBMS, here are the rules:
http://www.87android.com/12-rules-of-relational-database-model-by-codd/
This may not answer the question directly, but if you are asking about how to get to BCNF or an easy way to remember it then you don't understand normalization well enough. That is of no concern though. Relational databases take many forms and very few are done well. The best thing you can do is know what it means to be relational, follow the rules above, and do not worry about the level of normalization. The process of normalization eliminates the duplication of data. Each level more so by moving into migration of functional dependencies. Keep that in mind and you will be fine, your gut and intellect will do the rest.

Resources