NULL permitted in Primary Key - why and in which DBMS?

Further to my question "Why to use 'not null primary key' in TSQL?"...
As I understood from other discussions, some RDBMS (for example SQLite, MySQL) permit "unique" NULL in the primary key.
Why is this allowed and how might it be useful?
Background: I believe it is beneficial for communication with colleagues and database professionals to know the differences in fundamental concepts, approaches and their implementations in different DBMS.
Notes
MySQL is rehabilitated and returned to the "NOT NULL PK" list.
SQLite has been added (thanks to Paul Hadfield) to "NULL PK" list:
For the purposes of determining the uniqueness of primary key values, NULL values are considered distinct from all other values, including other NULLs.
If an INSERT or UPDATE statement attempts to modify the table content so that two or more rows feature identical primary key values, it is a constraint violation. According to the SQL standard, PRIMARY KEY should always imply NOT NULL. Unfortunately, due to a long-standing coding oversight, this is not the case in SQLite.
Unless the column is an INTEGER PRIMARY KEY SQLite allows NULL values in a PRIMARY KEY column. We could change SQLite to conform to the standard (and we might do so in the future), but by the time the oversight was discovered, SQLite was in such wide use that we feared breaking legacy code if we fixed the problem.
So for now we have chosen to continue allowing NULLs in PRIMARY KEY columns. Developers should be aware, however, that we may change SQLite to conform to the SQL standard in future and should design new programs accordingly.
— SQL As Understood By SQLite: CREATE TABLE
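For illustration, a minimal sqlite3 session showing the behaviour quoted above (table and column names are made up):

-- SQLite: a non-INTEGER PRIMARY KEY column accepts NULLs, and each NULL
-- counts as distinct from every other NULL.
CREATE TABLE t (k TEXT PRIMARY KEY, v TEXT);

INSERT INTO t VALUES (NULL, 'first');   -- accepted
INSERT INTO t VALUES (NULL, 'second');  -- also accepted: the two NULL keys are "distinct"
INSERT INTO t VALUES ('a', 'third');    -- accepted
INSERT INTO t VALUES ('a', 'fourth');   -- rejected (UNIQUE/PRIMARY KEY constraint violation)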

Suppose you have a primary key containing a nullable column Kn.
If you want a second row rejected on the grounds that, in that second row, Kn is null and the table already contains a row with Kn null, then you are actually requiring the system to treat the comparison "row1.Kn = row2.Kn" as TRUE (because you somehow want the system to detect that the key values in those rows are indeed equal). However, this comparison boils down to "null = null", and the standard explicitly specifies that null doesn't compare equal to anything, including itself.
Allowing what you want would thus amount to SQL deviating from its own principles regarding the treatment of null. There are innumerable inconsistencies in SQL, but this particular one never got past the committee.
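A small query illustrating the point (it runs on most DBMSs; on Oracle, add FROM dual):

-- NULL = NULL evaluates to UNKNOWN, not TRUE, so it can never be used to
-- detect a "duplicate" NULL key.
SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'not proven equal' END;
-- returns 'not proven equal'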

I don't know whether older versions of MySQL differ on this, but as of modern versions a primary key must be on columns that are not null. See the manual page on CREATE TABLE: "A PRIMARY KEY is a unique index where all key columns must be defined as NOT NULL. If they are not explicitly declared as NOT NULL, MySQL declares them so implicitly (and silently)."
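A quick illustration of that behaviour (the table name is made up):

-- MySQL silently declares the primary key column NOT NULL.
CREATE TABLE t (k INT PRIMARY KEY);   -- no explicit NOT NULL
-- SHOW CREATE TABLE t;               -- reports the column as `k` int NOT NULL
INSERT INTO t VALUES (NULL);          -- fails: Column 'k' cannot be null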

As far as relational database theory is concerned:
The primary key of a table is used to uniquely identify each and every row in the table
A NULL value in a column indicates that you don't know what the value is
Therefore, you should never use the value of "I don't know" to uniquely identify a row in a table.
Depending upon the data you are modelling, a "made up" value can be used instead of NULL. I've used 0, "N/A", 'Jan 1, 1980', and similar values to represent dummy "known to be missing" data.
Most, if not all, DB engines do allow for a UNIQUE constraint or index, which does allow for NULL column values, though (ideally) only one row may be assigned the value null (otherwise it wouldn't be a unique value). This can be used to support the irritatingly pragmatic (but occasionally necessary) situations that don't fit neatly into relational theory.
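A hedged sketch of that workaround (names are made up; note that how many NULLs a UNIQUE column admits differs between products, e.g. SQL Server allows only one while PostgreSQL and SQLite allow many):

CREATE TABLE account (
    account_id  INT PRIMARY KEY,      -- surrogate key, never NULL
    legacy_code VARCHAR(10) NULL,     -- "known to be missing" for some rows
    CONSTRAINT account_legacy_code_uq UNIQUE (legacy_code)
);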

Well, it could allow you to implement the Null Object Pattern natively within the database. So if you were using something similar in code, which interacted very closely with the DB, you could just look up the object corresponding to the key without having to special-case a null check.
Now whether this is worthwhile functionality I'm not sure, but it's really a question of whether the pros of disallowing null pkeys in absolutely all cases outweigh the cons of obstructing someone who (for better or worse) actually wants to use null keys. This would only be worth it if you could demonstrate some non-trivial improvement (such as faster key lookup) from being able to guarantee that keys are non-null. Some DB engines would show this, others might not. And if there aren't any real pros from forcing this, why artificially restrict your clients?

As discussed in other answers, NULL was intended to mean "the information that should go in this column is unknown". However, it is also frequently used to indicate an alternative meaning of "this attribute does not exist". This is a particularly useful interpretation when looking at timestamp fields that are interpreted as the time some particular event occurred, in which case NULL is often used to indicate that the event has not yet occurred.
It is a problem that SQL doesn't support this interpretation very well -- for this to work properly, it really needs a separate value (something like "never") that doesn't behave as null does ("never" should equal "never" and should compare higher than all other values). But as SQL lacks this notion, and there is no convenient way to add it, using null for this purpose is often the best choice.
This leaves a problem when the timestamp of an event that may not have occurred should be part of a table's primary key (a common case being a natural key combined with a deletion timestamp under soft deletion, where the item must be recreatable after deletion): you really want the primary key to allow a nullable column. Alas, most databases do not allow this, and instead you have to resort to an artificial primary key (e.g. a row sequence number) plus a UNIQUE constraint for what should otherwise have been your actual primary key.
An example scenario, in order to clarify this: I have a users table. As I require each user to have a distinct username, I decide to use username as the primary key. I want to support user deletion, but as I need to track the existence of users historically for auditing purposes I use soft deletion (in the first version of the schema, I add a 'deleted' flag to the user, and ensure that the deleted flag is checked in all queries where only active users are expected).
An additional requirement, however, is that if a username is deleted, it should be available for new users to register. An attractive way to achieve this would be to have the deleted flag change to a nullable timestamp (where nulls indicate that the user has not been deleted) and put this in the primary key. Were primary keys to allow nullable columns, this would have the following effect:
Creating a new user with an existing username when that user's deleted column is null would be denied as a duplicate key entry
Deleting a user changes its key (which requires changes to cascade to foreign keys that reference the user; suboptimal, but acceptable if deletions are rare) so that the deleted column becomes a timestamp of when the deletion occurred
Now a new user (which would have a null deleted timestamp) can be successfully created.
However, this cannot actually be achieved with standard SQL, so instead one must use a different primary key (probably a generated numeric user id in this case) and use a UNIQUE constraint to enforce the uniqueness of (username,deleted).
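For concreteness, a sketch of that workaround (the column names follow the scenario above; everything else is assumed):

CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,     -- artificial, generated key
    username VARCHAR(50) NOT NULL,
    deleted  TIMESTAMP NULL,          -- NULL = not deleted
    CONSTRAINT users_username_deleted_uq UNIQUE (username, deleted)
);
-- Caveat: on products where UNIQUE treats NULLs as distinct, two undeleted rows
-- with the same username would still be accepted, so a filtered/partial unique
-- index on username WHERE deleted IS NULL may be needed as well.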

Having a NULL primary key can be beneficial in some scenarios. In one of my projects I used this feature while synchronising databases: one on the server and many on different users' devices. Considering that not all users have Internet access all the time, I decided that only the main database would be able to assign ids to my entities. SQLite has its own mechanism for numbering rows, so an additional id field would have cost extra bandwidth. A NULL id not only tells me that an entity was created on the client's device while it had no Internet access, it also reduces code complexity. The only drawback is that on the client's device I can't fetch an entity by its id unless it has previously been synchronised with the main database. However, that's not an issue, since my users care about an entity's other attributes, not its unique id.

Related

Unique Key Constraint Issue - MySQL, SQL Server, Oracle, Postgres

I have an employee table which has 10 columns, and I have to create a unique key constraint for id, name, address, mobile.
In the above case, address might come as null and mobile might come as null. However, even when they are null, uniqueness should still be maintained.
First, I created a unique constraint combining all the above columns, and the following was observed.
Actual behaviour in MySQL.
001-Thiagu-NULL-900000 - Accepted
001-Thiagu-NULL-900000 - Accepted
001-Thiagu-0001-900000 - Accepted
001-Thiagu-0001-900000 - Rejected - Duplicate Record
Expected behaviour in all the databases
001-Thiagu-NULL-900000 - Accepted
001-Thiagu-NULL-900000 - Rejected - Duplicate Record
001-Thiagu-0001-900000 - Accepted
001-Thiagu-0001-900000 - Rejected - Duplicate Record
Basically, rows should be considered duplicates regardless of whether a value is present or NULL.
To overcome this problem I dropped the idea of creating the unique constraint over the combined columns, and instead came up with a new column of string type with a unique constraint.
On each insert of a record I manually construct and supply the value for this column, so that uniqueness is maintained.
Would that be the right approach, or is there another way to fix the first approach? I am not sure.
The created constraint should work for MySQL, SQL Server, Oracle and Postgres.
In SQL, null never equals null. That's not a bug, that's a feature. NULL IS NOT DISTINCT FROM NULL is true, but key declarations employ '=' [in the equivalent longhands], not IS NOT DISTINCT FROM. The 'key' constraint that you want should employ IS NOT DISTINCT FROM, therefore you cannot get there by declaring keys.
The next option would be a CHECK constraint, but products are unlikely to support CHECK constraints accessing other rows than the one being inserted.
The next option would be to create an ASSERTION, but no product supports that [reliably], essentially for the same reason as why they don't support cross-row CHECK constraints.
The next option is to enforce this in a stored procedure, but then you're likely to bump into [some of] the products only talking their proprietary dialect of SQL/PSM language.
The next option is application code.
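For what it's worth, a hedged sketch of the "manually constructed key" idea from the question (table, column and sentinel names are illustrative only):

CREATE TABLE employee (
    id       VARCHAR(10)  NOT NULL,
    name     VARCHAR(50)  NOT NULL,
    address  VARCHAR(100) NULL,
    mobile   VARCHAR(20)  NULL,
    uniq_key VARCHAR(200) NOT NULL,   -- built by the application on every insert
    CONSTRAINT employee_uniq UNIQUE (uniq_key)
);

-- The application builds uniq_key so that NULLs are folded into a sentinel, e.g.
-- id || '-' || name || '-' || COALESCE(address, '<NULL>') || '-' || COALESCE(mobile, '<NULL>'):
INSERT INTO employee (id, name, address, mobile, uniq_key)
VALUES ('001', 'Thiagu', NULL, '900000', '001-Thiagu-<NULL>-900000');
-- A second insert with the same id/name/mobile and a NULL address now fails on all
-- four DBMSs, because the comparison is done on the non-NULL uniq_key string.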

How to store a "primary" record

Suppose I have the following tables
Companies
--CompanyID
--CompanyName
and
Locations
--LocationID
--CompanyID
--LocationName
Every company has at least one location. I want to track a primary location for each company (and yes, every company will have exactly one primary location). What's the best way to set this up? Add a primaryLocationID in the Companies table?
Add a primaryLocationID in the Companies table?
Yes; however, that creates a circular reference, which could prevent you from inserting new data.
One way to resolve this chicken-and-egg problem is to simply leave Company.PrimaryLocationID NULL-able, so you can temporarily disable one of the circular FKs. This unfortunately means the database will enforce only "1:0..1", but not the strict "1:1" relationship (so you'll have to enforce it in the application code).
However, if your DBMS supports deferred constraints (such as Oracle or PostgreSQL), you can simply defer one of the FKs to break the cycle while the transaction is still in progress. By the end of the transaction both FKs have to be in place, resulting in a real "1:1" relationship.
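For example, with a deferrable constraint (PostgreSQL syntax assumed; table names follow the question):

CREATE TABLE Companies (
    CompanyID         INT PRIMARY KEY,
    CompanyName       VARCHAR(50) NOT NULL,
    PrimaryLocationID INT NOT NULL
);

CREATE TABLE Locations (
    LocationID   INT PRIMARY KEY,
    CompanyID    INT NOT NULL REFERENCES Companies (CompanyID),
    LocationName VARCHAR(50) NOT NULL
);

ALTER TABLE Companies
    ADD CONSTRAINT Companies_PrimaryLocation_FK
    FOREIGN KEY (PrimaryLocationID) REFERENCES Locations (LocationID)
    DEFERRABLE INITIALLY DEFERRED;   -- checked at COMMIT, not per statement

BEGIN;
INSERT INTO Companies VALUES (1, 'Acme', 10);         -- FK to Locations deferred
INSERT INTO Locations VALUES (10, 1, 'Head Office');  -- FK to Companies satisfied
COMMIT;                                               -- both FKs now hold: a real 1:1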
The alternative solution is to have a flag in the Locations table that is set for the primary location, and NULL for non-primary locations (note the U1, denoting a UNIQUE constraint, which ensures a company cannot have multiple primary locations):
CREATE TABLE Location (
    LocationID   INT PRIMARY KEY,
    CompanyID    INT NOT NULL,          -- References Company table, not shown here.
    LocationName VARCHAR(50) NOT NULL,  -- Possibly UNIQUE?
    IsPrimary    INT CHECK (IsPrimary IS NULL OR IsPrimary = 1), -- Use a BIT or BOOLEAN if supported by your DBMS.
    CONSTRAINT Locations_U1 UNIQUE (CompanyID, IsPrimary)
);
Unfortunately, this has some problems:
It can only guarantee up to "1:0..1" (but not the real "1:1") even on a DBMS that supports deferred constraints.
It requires an additional index (in support of the UNIQUE constraint). Every index brings certain overhead, mostly for INSERT/UPDATE/DELETE performance. Furthermore, secondary indexes in clustered tables contain a copy of the PK, which may make them "fatter" than expected.
It depends on ANSI-compliant composite UNIQUE constraints, which allow duplicated rows if any (but not necessarily all) of the fields are NULL. Unfortunately not all DBMSes follow the standard, so the above would not work out of the box under Oracle or MS SQL Server (but would work under PostgreSQL and MySQL). You could use a filtered unique index instead of the UNIQUE constraint to work around that (sketched below), but not all DBMSes support that either.
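For instance, the filtered-index workaround mentioned in the last point might look like this (SQL Server / PostgreSQL syntax; it pairs with the Location table above):

-- At most one row per company may have IsPrimary = 1; other rows keep IsPrimary NULL.
CREATE UNIQUE INDEX Locations_OnePrimaryPerCompany
    ON Location (CompanyID)
    WHERE IsPrimary = 1;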
BaBL86's solution models M:N, while your requirement seems to be 1:N. Nonetheless, that model could be "coerced" into 1:N by placing a key on {LocationID} (and on {CompanyID, TypeOfLocation} to ensure there cannot be multiple locations of the same type for the same company), but it is probably over-engineered for a simple "is primary" requirement.
I think your own solution is the best one - this ensures that every company can only have one primary location. By making it a NOT NULL column, you can even enforce that every company should have a primary location.
Using BaBL86's solution, you don't have those constraints: a company can have 0 - unlimited 'primary locations', which obviously shouldn't be possible.
Do note that, if you use foreign key constraints AND define primaryLocationID as a NOT NULL column, you'll run into problems, because you basically have a loop (Location points to Company, Company points to location). You cannot create a new Company (because it needs a primary location), nor can you create a new Location (because it needs a company).
I do it with pivot table:
CompanyLocations
--CompanyID
--LocationID
--TypeOfLocation (primary, office, warehouse etc.)
In this case you can select all locations and then use the type as you like. If you create PrimaryLocationID, you need two joins of one table and more complex logic. That's worse than this.
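A minimal sketch of that pivot-table design (names follow the question; note that, as pointed out above, nothing here stops a company from having several 'primary' rows):

CREATE TABLE CompanyLocations (
    CompanyID      INT NOT NULL REFERENCES Companies (CompanyID),
    LocationID     INT NOT NULL REFERENCES Locations (LocationID),
    TypeOfLocation VARCHAR(20) NOT NULL,   -- 'primary', 'office', 'warehouse', ...
    PRIMARY KEY (CompanyID, LocationID)
);

-- Fetch a company's primary location:
SELECT l.*
FROM CompanyLocations cl
JOIN Locations l ON l.LocationID = cl.LocationID
WHERE cl.CompanyID = 1 AND cl.TypeOfLocation = 'primary';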

Primary keys for a table with a submission ID, a user ID and a submission order

I am building the database for a web application where users may submit data. Each datum is unique, and more than one user can submit the same datum.
It is important, from the application's standpoint, to know the order in which users submitted the datum.
I have made a table exclusively for this purpose. It has the following fields:
SubmissionID
UserID
SubmissionOrder
... # and other non-important ones
My question is, which attributes should I make primary keys?
SubmissionID and UserID would allow for duplicate SubmissionOrders for a (SubmissionID, UserID) pair.
SubmissionID and SubmissionOrder would allow the same user to submit the same thing twice.
UserID and SubmissionOrder... would limit the user considerably in terms of what he can submit :P
All three would allow duplicate SubmissionOrders for different UserIDs.
Is there another solution which I am not pondering?
Is this problem better solved at the application level? With triggers?
Thank you for your time!
PS: Some technical details which I doubt you'll find useful:
The application is written in PHP
The database runs on sqlite3
The order in which things happen is just a little fuzzy on most SQL platforms. As far as I know, no SQL platform guarantees both of these requirements:
There must be no ties.
Earlier submissions must always look like they're earlier than later submissions.
With a timestamp column, earlier submissions always look like they're earlier than later submissions. But it's easily possible to have two "submissions" with the same timestamp value, no matter how fine the resolution of your dbms's timestamp.
With a serial number (or autoincrementing number), there will be no ties. But SQL platforms don't guarantee that a row having serial number 2 committed before a row having serial number 3. That means that, if you record both a serial number and a timestamp, you're liable to find pairs of rows where the row that has the lower serial number also has the later timestamp.
But SQLite isn't quite SQL, so it might be an exception to this general rule.
In SQLite, any process that writes to the database locks the whole database. I think that locking the database means you can rely on rowid to be temporally increasing.
So I think you're looking at a primary key of {SubmissionID, UserID}, where rowid determines the submission order. But I could be wrong.
In regard to your specific question:
It is important, from the application's standpoint, to know the order in which users submitted the datum
I think there are 2 alternative options than a combined field primary key:
(1) Create an additional column - a single integer (auto-increment) primary key.
or
(2) Create a timestamp field and save the date/time the data was input.
So you have the submitted data (SubmissionId?), which you want to allow duplicates on (so that another user can submit duplicate data), but not duplicates for a single user. That calls for a unique constraint on (SubmissionId, UserId).
Your next requirement is that you "know the order in which users submitted the datum". It's unclear if that's true for all submissions, or only the submissions that have duplicates*. Solving the general case (all submissions) solves the specific - so we'll go with that.
Since this is an ordering problem, something SQL doesn't really deal with, you'll need to add an attribute that will give you absolute ordering. The 2 standard choices are an autoincrement, or a timestamp. We'll pick an autoincrement so that you don't have to worry about ties. I assume that SubmissionOrder is your placeholder for that. Since we used an autoincrement column, it's guaranteed to be unique. So, we have another unique constraint on (SubmissionOrder).
These unique constraints are now your candidate keys. Pick one as your primary key, and use a unique constraint for the other.
* Your comments about duplicate SubmissionOrder confuse the issue a bit, suggesting that SubmissionOrder is only unique per SubmissionId. Assuming you have application logic to create the next SubmissionOrder, this is a valid (though slightly harder) alternative. Your unique constraint would then end up being on (SubmissionId, SubmissionOrder).
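A minimal SQLite sketch of the general-case design (table name assumed; column names from the question):

CREATE TABLE Submissions (
    SubmissionOrder INTEGER PRIMARY KEY AUTOINCREMENT,  -- absolute ordering, no ties
    SubmissionID    INTEGER NOT NULL,
    UserID          INTEGER NOT NULL,
    UNIQUE (SubmissionID, UserID)    -- the same user cannot submit the same datum twice
);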

default value vs. null for foreign key

I have a question about using null vs. default value for foreign key columns in database. I found a lot of opposite opinions about null vs. default values when designing databases but not exactly for foreign keys (what are main pros and cons).
Currently I'm designing a new database which will store a lot of data for different web applications and other systems with different data access approaches (ORM, stored procedures), and I want to implement general rules at the lowest level possible (the database), so I don't have to worry about these rules later in the applications.
To give you an example, let's say I have a table of users, User, with a foreign key column for the user's nationality, NationalityID, which references the primary key CountryID of the table Country.
Now I have two/three options:
A: I allow the NationalityID column (and all other similar foreign key columns in the database) to be null and just stick with the common approach of checking for null always and everywhere (applying the rules in the application)
or
B: I assign a default value, let's say "-1", to every foreign key, and add to every referenced table a row with "-1" as the key and "No data" for all other columns (for this example, in the Country table I add a row with a CountryID of "-1" and a CountryName of "No data"). So every time I want to know a user's nationality I will get a result without additional code rules (no need for me to check whether it's null or not).
or
C: I can disallow null values for foreign keys. But this is really something I want to avoid (I need the option to store at least the basic data (the user's name) even when the additional data (the user's nationality) is missing).
So is B a good approach or not? What am I missing here? Do I lose more than I gain with this approach? Which problems could I have (in addition to having to be careful that every referenced table always contains the row with the ID value "-1" which says there is "No data")?
What is your good/bad experience with foreign key default values?
thank you
If you normalize, this won't be an issue.
Instead of putting nationality in the USER table, make a User_Nationality table that links users to Country_ID in the other table.
If they have an entry in that lookup table, great. If not, you don't need to store a NULL or default value for it.
You need to enforce FK relationships, and allowing NULL goes against that. You also don't want to make up information that may not be accurate just to populate a field, which negates the point of requiring the field in the first place.
Use lookup tables and you can bypass that entirely.
This will also allow you to change your mind and choose one of your options down the road.
If you use views, you can choose to treat missing data as a NULL or a default value without needing to alter the underlying data.
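A hedged sketch of that lookup-table-plus-view idea (all names are assumed):

CREATE TABLE User_Nationality (
    UserID    INT NOT NULL PRIMARY KEY REFERENCES Users (UserID),
    CountryID INT NOT NULL REFERENCES Country (CountryID)
);

-- A view can present the missing relationship as NULL or as a default,
-- without storing either in the base tables.
CREATE VIEW UserWithNationality AS
SELECT u.UserID,
       u.UserName,
       COALESCE(c.CountryName, 'No data') AS Nationality
FROM Users u
LEFT JOIN User_Nationality un ON un.UserID = u.UserID
LEFT JOIN Country c ON c.CountryID = un.CountryID;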
Personally, I would feel that even if you have a placeholder "no data" entry in your database with a key of -1, you would still be performing a check to see whether you want to display 'No Data' or not for each individual field.
I would stick to NULLs. NULL is meant to mean the absence of data, which is the case here.
B is a terrible approach. It is easier to remember to handle nulls than to have to figure out what magic number you used, and then you still have to handle them anyway. Use option A. But I like JNK's idea best.
I suggest option D. If not all users have a defined nationality then that information doesn't belong in the user table. Create a table called UserNationality keyed on UserId.
I like your B solution. Maybe it will be possible to map the values onto other entities, so you have Country, and a NullCountry that extends Country, is mapped to the row with id = -1, and has special code in its methods to make the special cases easy to handle.
One problem is probably that it will be harder to do outer joins on that foreign key.
EDIT: no, there should be no problem with outer joins, because there would be no need to do outer joins.

Char(4) as Primary key

Classic database table design would include a tableId int identity(1,1) not null, which results in an auto-incrementing int32 id field.
However, it could be useful to give these numbers some meaning, and I wanted to know what people think about using a Char(4) field for a table containing enumerables.
Specifically I was thinking of a User Role table which had data like;
"admn" - "Administrator"
"edit" - "Editor".
I could then reference these 'codes' in my code.
Update
It makes more sense when writing code to see User.IsInRole("admin") rather than User.IsInRole(UserRoles.Admin) where Admin is an int that needs to be updated/synchronised if you ever rebuild your database.
An id field (not associated with the data) is called a surrogate key. They have their advantages and disadvantages. You can see a list of those on this Wikipedia article. Personally I feel that people overuse them and have forgotten (or have never learned) how to properly normalise a database structure.
I always tend to use a surrogate primary key in my tables.
That is, a key that has no meaning in the business domain. A primary key is just an administrative piece of data that is required by the database ...
What would be the advantage of using 'admn' as primary key in this case ?
No. No, no, no, no, and no.
Keys are not data. Keys do not have meaning. That way when meaning changes, keys do not change.
Keys do not have encoded meaning. Encoded meaning is not even potentially possibly maybe useful unless you have an algorithm for decoding it.
Since there's no way to get from "admn" to "Administrator" without a lookup, and since the real meaning, "Administrator", sits right next to the SEKRET ENKODED "useful" key, why would I ever look at the key instead of the real data right next to it in the table?
If you want an abbreviated form, an enum-like name, or what have you, then call it that and realize it's data. That's perfectly acceptable: create table role (id int not null primary key, abbv char(4), name varchar(64));
But it's not a key, it doesn't hash like an integer key, and it takes four character compares plus a check for the null terminator to compare it to "edtr", as opposed to one subtraction to compare two integers. There's no decent way to generate the next key: what's next in the sequence ('admn', 'edtr', ?)?
So you've lost generate-ability, easy comparison, and possibly size (if you could have used, say, a tinyint as your key), all for an arbitrary value that's of no real use.
Use a synthetic key. If you really need an abbreviation, make that an attribute column.
A primary key is better if it's never going to change. You can change a primary key as long as you update all references to it but it's a bit of a pain.
Sometimes there's no natural non-changing column in a table and, in that case, a surrogate is useful.
But, if you have a natural non-changing value (like an employee ID that's never recycled or a set of roles that you never expect to change), it's better to use that.
Otherwise, you're introducing complexity to cater for something with a minuscule chance of happening.
That's only my opinion, my name isn't Codd or Date, so don't take this as gospel.
I think the answer is in your post. Your example of User.IsInRole("admin") will always return false as you have a primary key of char(4) and used "admn" as key for the administrator.
I would go for a surrogate Primary key which will never ever change and have the option for a 'functional primary key' to query certain roles which are used hardcoded in the code.
A key should preferably not have any special meaning in itself. Conditions tend to change, so you may have to change a key if its meaning changes, and changing keys is not something that you want to do.
Also, when the amount of information that you can put in the key is so limited, there's not much point in having it there.
You are comparing auto-increment vs. the fixed char key, but in that scenario you don't want an auto-incremented id.
There are different routes to go. One is to use an enum that maps to int ids; these are not auto-incremented and are the primary key of the table.
Hard-coded references from code to database values are generally a bad idea (unless you set them up before application start, etc.).
Besides this: should mappings from admn => administrator really be done in the database?
And you can only store 23^4 minus keywords entries in a char(4) using a 23-character Latin alphabet.
If you use an unrelated number as the primary key, it's called a surrogate key, and one school of developers believes it's the best practice.
http://en.wikipedia.org/wiki/Surrogate_key
If you use "admn" as primary key then it's called Natural key and a different class of developers believe it's the best practice.
https://en.wikipedia.org/wiki/Natural_key
The two camps are never going to agree; it's an age-old religious war (surrogate key vs. natural key). Therefore, use what you are most comfortable with. Just know that for everyone who disagrees with your choice, there are many who support it, no matter which one you choose.
