Is it normally good practice to set all database columns to NOT NULL? Justify your answer.
No. It's a good idea to allow NULL in columns where appropriate.
I kind of disagree with the "where appropriate" rule. It is actually rather safe to set any column to be NOT NULL; and then later modify the columns to allow NULL values when you need them. On the other hand, if you allow NULL values first and then later decide you don't want to allow them, it can potentially be much more difficult to do this.
It may make your database table/column descriptions quite ugly if you do this excessively, but when in doubt, go ahead and restrict the data.
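To illustrate why tightening is the harder direction, here is a minimal sketch using Python's built-in sqlite3 module. The table names customer_loose and customer_strict are made up for the example, and because SQLite cannot alter a column constraint in place, the sketch copies rows between two tables instead of running an ALTER statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table that started out permissive: email allows NULL.
conn.execute("CREATE TABLE customer_loose (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customer_loose (email) VALUES (?)",
    [("a@example.com",), (None,)],
)

# Tightening later: the stricter table rejects the row that already holds NULL.
conn.execute(
    "CREATE TABLE customer_strict (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
)
try:
    conn.execute("INSERT INTO customer_strict SELECT id, email FROM customer_loose")
    tighten_failed = False
except sqlite3.IntegrityError:
    tighten_failed = True

# Loosening is the easy direction: any NOT NULL value fits a nullable column.
conn.execute("INSERT INTO customer_strict (email) VALUES ('b@example.com')")
conn.execute("INSERT INTO customer_loose (email) SELECT email FROM customer_strict")
loose_count = conn.execute("SELECT COUNT(*) FROM customer_loose").fetchone()[0]

print(tighten_failed, loose_count)
```

Before the strict version can be adopted, somebody has to decide what to do with the rows that already contain NULL, which is exactly the difficulty described above.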
Relational theory has it that NULL is evil.
However, your question kind of referred to practice.
So, to the extent that you want your practices to conform to the heavenly ideals of theory, yes, avoid NULL as if it were the plague, Cholera and AIDS all-in-one.
To the extent that these crappy implementations called "SQL DBMSs" do not leave you any other choice, yes, (sniff) use them.
EDIT
Someone mentioned "business rules" as the guideline for "appropriateness" in the accepted answer, and some others upvoted that remark. That is total crap. Business rules can always do without NULLs and the only guideline to "appropriateness" is the very deficiencies of any SQL system that makes it a non-relational system to boot.
The inventor of the NULL reference (1965) recently called it his "billion-dollar mistake": https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/
Languages such as Scala, SML, and Haskell are non-NULL by default: a possibly missing value has an explicit type called "Option" or "Maybe", which requires special syntax and checks.
Since the time databases were invented, allowing NULL by default has been considered more and more dangerous and undesirable. Should databases follow? Probably.
Go with NOT NULL when you can.
I'm a newbie and my answer may be totally asinine, but here's my personal take on the subject.
In my humble opinion, I don't see the problem with allowing ALL fields except primary/foreign keys to be nullable. I know many of you cringed as soon as I said that, and I'm sure I heard someone cry out, "Heretic! Burn him at the stake!" But here's my reasoning:
Is it really the job of the database to enforce rules about what values should and should not be permitted - except of course as needed to enforce things like referential integrity and to control storage consumption (by having things like max chars set)? Wouldn't it be easier and better to enforce all "null vs. not null" rules at the code level prior to storing the values in the database?
After all, it's the job of the code to validate all values prior to them being stored in the database anyway, right? So why should the database try to usurp the code's authority by also setting up rules about what values are valid? (In a way, using not null constraints except where absolutely necessary almost feels like a violation of the idea of "separation of concerns.") Furthermore, any time a constraint is enforced at the database level, it must necessarily be enforced at the code level also to prevent the code from "blowing up." So why do twice as much work?
At least for me, it seems like things work out better when my database is allowed to simply be a "dumb data storage container" because inevitably in the past when I've tried to use "NOT NULL" to enforce a business rule which made sense to me at the time, I end up wishing I hadn't and end up going back and removing the constraint.
Like I said, I realize I'm a newbie and if there's something I'm overlooking, let me know - and try not to butcher me up too bad :) Thanks.
If you can't know the value at insert time, you really must allow a null. For instance, suppose you have a record that includes two fields, begin date and end date. You know the begin date when the record is inserted but not the end date. Creating a fake date to put in this field just to avoid nulls is dumb, to say the least.
In real life at least as much harm is caused by forcing data entry into a field as by not forcing it. If you have an email field and don't know the customer's email, then the user has to make something up to put into the required field. What they make up is likely not what you would want, something like "thisistupid#ass.com". Sometimes this bad info gets provided back to the client or to a vendor in a data feed and your company looks really, really stupid. I know, as I process a lot of these feeds coming in from our customers. Nice things in the email field have included "his secretary is the fat blonde", "this guy is a jerk", etc.
From my perspective, while it may be better for the database, it's not better for the user. Once you get into more interactive applications, you want to be able to persist the data in an interim state, so most of your fields will probably be null at that point.
It depends on what you're trying to do, but for many applications it's a good idea to avoid NULLs where possible — and the most foolproof way to do this is to use NOT NULL.
The problem is that the meaning of NULL is open to interpretation. It could mean “no value belongs here,” or it could mean “we haven't got the value yet, so we should keep asking the user for it.” If you are using it, you'll want to read up on SQL's 3-valued logic, and functions such as COALESCE, etc.
Nevertheless, as Cletus and others have said, if you use NULL appropriately it can be useful.
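As a rough illustration of the 3-valued logic and COALESCE mentioned above, here is a small sketch using Python's built-in sqlite3 module. The contact table and its rows are invented for the example; other engines should behave the same way for these particular queries, but check your own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (name TEXT NOT NULL, phone TEXT)")
conn.executemany(
    "INSERT INTO contact VALUES (?, ?)",
    [("Ann", "555-0100"), ("Bob", None)],
)

# A comparison with NULL is neither true nor false; the result is NULL (unknown).
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# "= NULL" therefore never matches a row; IS NULL is the test that works.
eq_rows = conn.execute("SELECT name FROM contact WHERE phone = NULL").fetchall()
is_null_rows = conn.execute(
    "SELECT name FROM contact WHERE phone IS NULL"
).fetchall()

# COALESCE returns its first non-NULL argument, handy for display defaults.
coalesced = conn.execute(
    "SELECT name, COALESCE(phone, 'unknown') FROM contact ORDER BY name"
).fetchall()

print(null_equals_null, eq_rows, is_null_rows, coalesced)
```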
In business apps I was always removing my NOT NULLS because the users did not like being forced to enter data that they didn't know. It depends on the table but I set most of my fields to NULL and only set the bare minimum number of fields to NOT NULL.
If your data can actually BE "unknown", and it's important to record that fact, then yes, use a NULL. Bear in mind that sometimes you need to differentiate between "unknown" and "not relevant" - for example, a DateTime field in one of my databases can either be the SQL Server minimum date (not applicable), NULL (unknown), or any other date (known value).
For fields which don't really have business rules depending on them - I'm talking about "Comments", "Description", "Notes" columns here - then I set them to default to empty strings, as (a) it saves dealing with nulls, and (b) they are never "unknown" - they just aren't filled in, which logically is a known empty value.
E.g.:
CREATE TABLE Computer (
      Id INT IDENTITY PRIMARY KEY
    , Name NVARCHAR(16) NOT NULL
    , ...[other fields]...
    , Comments NVARCHAR(255) NOT NULL
        CONSTRAINT DF_Computer_Comments DEFAULT (N'')
)
If you don't supply a value to Comments, it defaults to empty.
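If you want to see that behaviour without a SQL Server instance, here is a minimal sketch with Python's built-in sqlite3 module. It is a simplified stand-in for the table above (SQLite types instead of NVARCHAR, no named constraint, and the sample names are made up), so treat it as an approximation rather than the exact T-SQL semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE computer (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        comments TEXT NOT NULL DEFAULT ''
    )
    """
)

# Omitting the column picks up the default, a known empty value.
conn.execute("INSERT INTO computer (name) VALUES ('ws-01')")
comments = conn.execute("SELECT comments FROM computer").fetchone()[0]

# An explicit NULL is still rejected; the default only covers omitted columns.
try:
    conn.execute("INSERT INTO computer (name, comments) VALUES ('ws-02', NULL)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

print(repr(comments), null_rejected)
```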
Short answer: it depends on what you are storing.
I can see a table (or two) having all NOT NULLS or all NULLS. But an entire database?
Only for columns where not having a value doesn't make any sense.
Nulls can be very handy; for one thing, they compress beautifully. They can be a nasty surprise when you don't expect them, though, so if you can't have a Student without a First Name -- make that column NOT NULL. (Middle names, on the other hand... maybe you want to have a default empty string, maybe not -- decent arguments both ways)
You should not forget to set not null where needed, use check constraints if applicable, not forget about unique constraints, create proper indexes and brush your teeth after every meal and before going to bed:)
In most cases you can use NOT NULL, and you should. It is easier to change NOT NULL to NULL than the other way around. However, Oracle, for example, treats an empty string as NULL, so obviously you can't use NOT NULL all the time.
What's the alternative?
I found this question as a result of a discussion at work. Our question was:
Should we have a nullable foreign key or an association table with unique constraints?
The context was that sometimes there is an association and sometimes there isn't. (EG: Unplanned vs. planned schedules)
For me, a combination of nullable foreign key with a 'set field to null on delete' was equivalent to the association table but had two advantages:
More understandable (the schema was already complex)
Easier to find 'unplanned' schedules with an 'xxx is null' query (vs. not exists query)
In summary, sometimes 'null' (the absence of information) actually means something. Try to have non-null, but there are exceptions.
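Here is a rough sketch of the nullable foreign key variant, using Python's built-in sqlite3 module. The plan and schedule tables are hypothetical stand-ins for our real schema, and SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.execute("CREATE TABLE plan (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE schedule (
        id      INTEGER PRIMARY KEY,
        label   TEXT NOT NULL,
        plan_id INTEGER REFERENCES plan(id) ON DELETE SET NULL
    )
    """
)
conn.execute("INSERT INTO plan (id, title) VALUES (1, 'Q1 rollout')")
conn.executemany(
    "INSERT INTO schedule (label, plan_id) VALUES (?, ?)",
    [("planned run", 1), ("ad hoc run", None)],
)

def unplanned():
    # The 'xxx is null' query: schedules with no associated plan.
    sql = "SELECT label FROM schedule WHERE plan_id IS NULL ORDER BY label"
    return [row[0] for row in conn.execute(sql)]

unplanned_before = unplanned()
conn.execute("DELETE FROM plan WHERE id = 1")  # the FK column is set to NULL
unplanned_after = unplanned()

print(unplanned_before, unplanned_after)
```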
FWIW, we were using Scala / Squeryl so, in code, the field was an 'Option' and quite safe.
My take is that if you want flexible and, to some extent, "ambiguous" tables, just use NoSQL, as it is built precisely for that purpose. Otherwise, a NULL value in a row is acceptable only when it represents some piece of optional data, like Address 2 or a home phone number.
In my opinion, making foreign keys nullable undermines one of the main reasons we use relational databases: you want your data to be as tightly related and consistent as possible.
It depends (on the datatype)
Think about this: if the technology that interacts directly with the database is Python, I would make everything NOT NULL with a proper DEFAULT.
However, that only makes sense if the column is a VARCHAR with an empty string as its default.
What about NUMERIC? It is hard to come up with a default value there, and NULL can convey more than simply setting DEFAULT=0.
For BOOLEAN, NULL still makes some sense, and so on.
A similar argument can be made for other datatypes, such as spatial data types.
IMO, use of the NULLable option should be minimized. The application should designate a suitable value for the "non-existent" state. In PeopleSoft, I think, the application puts a 0 in numeric columns and a space in char columns where a value does not exist.
One could ask why the so-called suitable value couldn't be NULL.
Because SQL implementations treat NULLs totally differently.
For example:
1 = NULL and 0 = NULL both evaluate to UNKNOWN rather than true, so a WHERE clause never matches on them!
NULL = NULL is not true either!
NULL values in GROUP BY and in aggregate functions can also produce unexpected results.
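To make the aggregate and GROUP BY surprises concrete, here is a small sketch using Python's built-in sqlite3 module. The score table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE score (player TEXT NOT NULL, points INTEGER)")
conn.executemany(
    "INSERT INTO score VALUES (?, ?)",
    [("a", 10), ("b", None), ("c", 20), ("d", None)],
)

# COUNT(*) counts rows; COUNT(points) and AVG(points) silently skip NULLs.
count_all, count_points, avg_points = conn.execute(
    "SELECT COUNT(*), COUNT(points), AVG(points) FROM score"
).fetchone()

# GROUP BY puts all NULLs in one group, even though NULL = NULL is not true.
groups = dict(conn.execute("SELECT points, COUNT(*) FROM score GROUP BY points"))

# NOT IN against a set containing NULL is never true, so no row qualifies,
# even though 30 is plainly not among the stored points.
not_in_rows = conn.execute(
    "SELECT COUNT(*) FROM score WHERE 30 NOT IN (SELECT points FROM score)"
).fetchone()[0]

print(count_all, count_points, avg_points, groups, not_in_rows)
```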
Related
Here is a simple question to which I would like an answer:
We have a member table. Each member practices one, many or no sports. Initially we (the developers) created a [member] table, a [sports] table and a [member_sports] table, just as we have always done.
However our client here doesn't like this and wants to store all the sports that the member practices in a single varchar column, separated with a special character.
So if:
1 is football
2 is tennis
3 is ping-pong
4 is swimming
and I like swimming and ping-pong, my favourite sports will be stored into the varchar column as:
x3,x4
Now we don't want to just walk up to the client and claim that his system isn't right. We would like to back it up with proof that the operation to fetch the sports from [member_sports] is more efficient than simply storing the fields as a varchar.
Is there any documentation that can back our claims? Help!
Ask your client if they care about storing accurate information[1] rather than random strings.
Then set them a series of challenges. First, ensure that the sport information is in the correct "domain". For the member_sports table, that is:
sport_id int not null
         ^
         |--correct type
For their "store everything in a varchar column" solution, I guess you're writing a CHECK constraint. A regex would probably help here but there's no native support for regex in SQL Server - so you're either bodging it or calling out to a CLR function to make sure that only actual int values are stored.
Next, we not only want to make sure that the domain is correct but that the sports are actually defined in your system. For member_sports, that's:
CONSTRAINT FK_Member_Sports_Sports FOREIGN KEY (Sport_ID) references Sports (Sport_ID)
For their "store everything in a varchar column" I guess this is going to be a far more complex CHECK constraint using UDFs to query other tables. It's going to be messy and procedural. Plus if you want to prevent a row from being removed from sports while it's still referenced by any member, you're talking about a trigger on the sports table that has to query every row in members[2].
Finally, let's say that it's meaningless for the same sport to be recorded for a single member multiple times. For member_sports, that is (if it's not the PK):
CONSTRAINT UQ_Member_Sports UNIQUE (Member_ID,Sport_ID)
For their "store everything in a varchar column" it's another horrifically procedural UDF called from a CHECK constraint.
Even if the varchar variant performed better for certain values of "performs better" (unlikely, since you need to rip strings apart and T-SQL's string manipulation functions are notoriously weak; see above re: regex), how do they propose to ensure that the data is meaningful and not nonsense?
Writing the procedural variants that can also cope with nonsense is an even more challenging endeavour.
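A minimal sketch of the declarative version, using Python's built-in sqlite3 module as a stand-in for SQL Server (that substitution is my assumption, and the index name is made up). SQLite's column typing is looser than SQL Server's, so this only demonstrates the foreign key and unique constraints, not the domain check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE member (member_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sports (sport_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE member_sports (
        member_id INTEGER NOT NULL REFERENCES member (member_id),
        sport_id  INTEGER NOT NULL REFERENCES sports (sport_id),
        UNIQUE (member_id, sport_id)
    );
    -- Second index with sport_id first, for "who practices sport X?" queries.
    CREATE INDEX ix_member_sports_sport ON member_sports (sport_id, member_id);

    INSERT INTO member VALUES (1, 'Alice');
    INSERT INTO sports VALUES (1, 'football'), (2, 'tennis'),
                              (3, 'ping-pong'), (4, 'swimming');
    INSERT INTO member_sports VALUES (1, 3), (1, 4);
    """
)

def rejected(sql):
    """Return True if the engine refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

unknown_sport_rejected = rejected("INSERT INTO member_sports VALUES (1, 99)")
duplicate_rejected = rejected("INSERT INTO member_sports VALUES (1, 3)")
delete_in_use_rejected = rejected("DELETE FROM sports WHERE sport_id = 3")

swimmers = [
    row[0]
    for row in conn.execute(
        "SELECT m.name FROM member m "
        "JOIN member_sports ms ON ms.member_id = m.member_id "
        "WHERE ms.sport_id = 4"
    )
]

print(unknown_sport_rejected, duplicate_rejected, delete_in_use_rejected, swimmers)
```

None of the three rejections needed any procedural code; with "x3,x4" in a varchar column, each one would be a hand-written check.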
In case it's not clear from the above - I am a big fan of Declarative Referential Integrity (DRI). Stating what you want versus focussing on mechanisms is a huge part of why SQL appeals to me. You construct the right DRI and know that your data is always correct (or, at least, as you expect it to be)
[1] "The application will always do this correctly" isn't a good answer. If you manage to build an application and related database in which nobody ever writes some direct SQL to fix something, I guess you'll be the first.
But in most circumstances, there's always more than one application, and even if the other application is a direct SQL client only employed by developers, you're already beyond being able to trust that the application will always act correctly. And bugs in applications are far more likely than bugs in a SQL database engine's implementation of constraints, which has been tested far more times than any individual application's attempt to enforce constraints.
[2] Let alone the far more likely query - find all members who are associated with a particular sport. A second index on member_sports makes this a trivial query[3]. No indexes help the "it's somewhere in this string" solution and you're looking at a table scan with no indexing opportunities.
[3] Any index that has sport_id first should be able to satisfy such a query.
Suppose you have 2 columns in a table: FixedSalary and HourlyRate. The value for one of these columns is always going to be missing because an employee cannot be paid both a fixed salary and an hourly rate.
Now, suppose I have another table with a HairColor column. You're updating a row for a patient who is bald. The value will be missing.
For both examples, we can either use a Null value or something like "N/A" or "Not applicable". I was wondering, is there a best (database design) practice regarding these situations, where there is clearly a lack of value, but that lack of value is for a very clear logical reason?
As usual, it depends. Here, it depends on what message you want conveyed between the application that writes the data, and the application that reads it.
NULL normally conveys the message "there is no value at this location". That's often all the reader needs to know. In particular, in the case of hourly versus salaried, a well formed reader app should not be particularly surprised that one of them is missing, in every instance.
A special value to indicate "not applicable" is only really needed when the reader of the data is expected to behave one way when the data is missing because it's not applicable, and a different way when it's missing for a different reason, like a user omission on an input form. This could be the case for hair color, but it depends on your case.
The best practice is to analyze your case, and design accordingly.
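If you go with NULL for the salary case, one option (my suggestion, not something the question asked for) is to have the database state the "exactly one of the two" rule itself. A minimal sketch with Python's built-in sqlite3 module; the employee table and the snake_case column names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee (
        id           INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        fixed_salary NUMERIC,
        hourly_rate  NUMERIC,
        -- Exactly one of the two pay columns must be filled in.
        CHECK (
            (fixed_salary IS NOT NULL AND hourly_rate IS NULL)
            OR (fixed_salary IS NULL AND hourly_rate IS NOT NULL)
        )
    )
    """
)

def accepted(name, fixed_salary, hourly_rate):
    """Return True if the row passes the CHECK constraint."""
    try:
        conn.execute(
            "INSERT INTO employee (name, fixed_salary, hourly_rate) VALUES (?, ?, ?)",
            (name, fixed_salary, hourly_rate),
        )
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    accepted("salaried", 50000, None),
    accepted("hourly", None, 25),
    accepted("both", 50000, 25),
    accepted("neither", None, None),
]
print(results)
```

With the rule declared this way, a reader application can rely on one column being NULL for every row without any special marker value.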
The best practice is to use the Null value.
EDIT
The Null value is preferable because it allows the database engine and client programs to handle this value properly and automatically, without custom code (think, for example, of sorting values).
As far as I know, Null or null is a keyword, whereas Not Applicable or N/A is user defined. So it all depends on how you like to parse the result set.
Which would you prefer: rs.getString(COLUMN_NAME) returning null or "N/A"?
In my view, null would be nicer, because it is easy to check for != null.
I'm not that experienced with databases. If I have a database table containing a lot of empty cells, what's the best way to leave them (e.g. so performance isn't degraded, memory is not consumed, if this is even possible)?
I know there's a "null" value. Is there a "none" value or equivalent that has no drawbacks? Or by just not filling the cell, it's considered empty, so there's nothing left to do? Sorry if it's a silly question. Sometimes you don't know what you don't know...
Not trying to get into a discussion of normalizing the database. Just wondering what the conventional wisdom is for blank/empty/none cells.
Thanks
The convention is to use null to signify a missing value. That's the purpose of null in SQL.
Noted database researcher C. J. Date writes frequently about his objections to the handling of null in SQL at a logical level, and he would say any column that may be missing belongs in a separate table, so that the absence of a row corresponds to a missing value.
I'm not aware of any serious efficiency drawbacks of using null. The efficiency of any feature depends on the specific database implementation you use. You haven't said if you use MySQL, Oracle, Microsoft SQL Server, or other.
MySQL's InnoDB storage engine, for example, doesn't store nulls among the columns of a row; it just stores the non-null columns. Other databases may do this differently. Likewise, nulls in indexes should be handled efficiently, but it varies from product to product.
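For comparison, here is a rough sketch of the separate-table design attributed to C. J. Date above, using Python's built-in sqlite3 module. The person and person_email tables are invented for the example and are not taken from his writing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- The optional attribute lives in its own table; no row means "missing".
    CREATE TABLE person_email (
        person_id INTEGER PRIMARY KEY REFERENCES person (id),
        email     TEXT NOT NULL
    );
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO person_email VALUES (1, 'ann@example.com');
    """
)

# "Who has no email?" becomes a NOT EXISTS query instead of IS NULL.
without_email = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM person p WHERE NOT EXISTS "
        "(SELECT 1 FROM person_email e WHERE e.person_id = p.id)"
    )
]

# An outer join brings the missing value back as NULL in the result set.
joined = conn.execute(
    "SELECT p.name, e.email FROM person p "
    "LEFT JOIN person_email e ON e.person_id = p.id ORDER BY p.id"
).fetchall()

print(without_email, joined)
```

No column in either table is nullable, but as the last query shows, nulls can still turn up in query results.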
Use NULL. That's what it's for.
Normally databases are said to have rows and columns. If the column does not require a value, it holds nothing (aka NULL) until it is updated with a value. That is best practice for most databases, though not all databases have the NULL value--some use an empty string, but they are the exception.
With regard to space utilization -- disk is relatively inexpensive these days, so worries about space consumption are no longer as prevalent as they once were, except in gargantuan databases, perhaps. You can get better performance out of a database if you use all fixed-size datatypes, but once you start allowing variable-sized string types (e.g. varchar, nvarchar), that optimization is no longer possible.
In brief, don't worry about performance for the time being, at least until you get your feet wet.
It is possible, but consider:
Are they supposed to be not-empty? Should you implement not null?
Is it a workflow -- so they are empty now, but most of them will be filled in the future?
If both are NO, then you may consider re-design. Edit your question and post the schema you have now.
There are several schools of thought in this. The first is to use null when the data is not known - that's what it's for.
The second is to not allow nulls and either separate out all the fields that could be null into related tables or create "fake" values to replace null. For varchar this would usually be the empty string, but the problem arises as to what the fake value should be for a date field or a numeric. Then you have to write code to exclude the fake data, just like you have to write code to deal with the nulls.
Personally I prefer to use nulls, with some judicious moving of data to child tables if the data is truly a different entity. Often these fields turn out to need the one-to-many structure of a parent-child relationship anyway. For example, when you may or may not know a person's phone number, put it in a separate phone table, and you will often discover you needed to store multiple phone numbers anyway.
For me, the classic wisdom is to store enum values (OrderStatus, UserTypes, etc) as Lookup tables in your db. This lets me enforce data integrity in the database, preventing false or null values, etc.
However, more and more, this feels like unnecessary duplication to me. Not only do I have to create tables for these values (or have an unwieldy central lookup table), but if I want to add a value, I have to remember to add it in two or more places (the code plus each of the production, testing, and live DBs), and things can easily get out of sync.
Still I have a hard time letting go of lookup tables.
I know there are probably certain scenarios where one had an advantage over the other, but what are your general thoughts?
I've done both, but I now much prefer defining them as classes in code.
New files cost nothing, and the benefits that you seek by having it in the database should be handled as business rules.
Also, I have an aversion to holding data in a database that really doesn't change. And it seems an enum fits this description. It doesn't make sense for me to have a States lookup table, but a States enum class makes sense to me.
If it has to be maintained I would leave them in a lookup table in the DB. Even if I think they won't need to be maintained I would still go towards a lookup table so that if I am wrong it's not a big deal.
EDIT:
I want to clarify that if the Enum is not part of the DB model then I leave it in code.
I put them in the database, but I really can't defend why I do that. It just "seems right". I guess I justify it by saying there's always a "right" version of what the enums can be by checking the database.
Schema dependencies should be stored in the database itself to ensure any changes to your architecture can easily be performed transparently to the app.
I prefer enums, as they enforce early binding of values in code, so that exceptions aren't caused by missing values.
It's also helpful if you can use code generation that can bring in the associations of the integer columns to an enumeration type, so that in business logic you only have to deal with easily memorable enumeration values.
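As a rough idea of what that code generation could look like, here is a sketch in Python that builds an enumeration type from a lookup table at startup, using the built-in sqlite3 and enum modules. The order_status table and its three rows are hypothetical.

```python
import sqlite3
from enum import IntEnum

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE order_status (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    INSERT INTO order_status VALUES (1, 'PENDING'), (2, 'SHIPPED'), (3, 'CANCELLED');
    """
)

# Read (name, value) pairs from the lookup table and build the enum from them,
# so the database stays the single source of truth for the allowed values.
rows = conn.execute("SELECT name, id FROM order_status ORDER BY id").fetchall()
OrderStatus = IntEnum("OrderStatus", rows)

shipped_value = int(OrderStatus.SHIPPED)
name_for_three = OrderStatus(3).name
print(shipped_value, name_for_three, list(OrderStatus))
```

Business logic can then refer to OrderStatus.SHIPPED while the integer columns keep their foreign key to the lookup table.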
Consider it a form of documentation.
If you've already documented the enum constants properly in the code that uses the DB, do you really need a duplicate set of documentation (to use and maintain)?
I have a situation where I need to store a general piece of data (could be an int, float, or string) in my database, but I don't know ahead of time which it will be. I need a table (or less preferably tables) to store this unknown typed data.
What I think I am going to do is have a column for each data type, only use one for each record and leave the others NULL. This requires some logic above the database, but this is not too much of a problem because I will be representing these records in models anyway.
Basically, is there a best practice way to do something like this? I have not come up with anything that is less of a hack than this, but it seems like this is a somewhat common problem. Thanks in advance.
EDIT: Also, is this considered 3NF?
You could easily do that if you used SQLite as a database backend:
Any column in a version 3 database, except an INTEGER PRIMARY KEY column, may be used to store any type of value.
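A quick sketch of what that looks like in practice, using Python's built-in sqlite3 module. The setting table is made up for the example, and the value column is deliberately declared without a type so SQLite stores each value as given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No declared type on "value": SQLite keeps whatever storage class is inserted.
conn.execute("CREATE TABLE setting (key TEXT PRIMARY KEY, value)")
conn.executemany(
    "INSERT INTO setting VALUES (?, ?)",
    [("count", 42), ("ratio", 3.5), ("greeting", "hello")],
)

# typeof() reports the storage class SQLite actually used for each value.
stored = conn.execute(
    "SELECT key, value, typeof(value) FROM setting ORDER BY rowid"
).fetchall()
print(stored)
```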
For other RDBMS systems, I would go with Philip's solution.
Note that in my line of software (business applications), I cannot think of any situation where this kind of requirement would be needed (a value with an unknown datatype). Unless the domain model was flawed, of course... I can imagine that other lines of software may incur different practices, but I suggest that you consider rethinking your overall design.
If your application can reliably convert datatypes, you might consider a single column solution based on a variable-length binary column, with a second column to track original data type. (I did a very small routine based on this once before, and it worked well enough.) Testing would show if conversion is more efficiently handled on the application or database side.
If I were to do this I would choose either your method, or I would cast everything to string and use only one column. Of course there would be another column with the type (which would probably be useful for the first method too).
For faster code I would probably go with your method.
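For reference, the cast-everything-to-string variant could be sketched like this in Python with the built-in sqlite3 module. The typed_value table, the tag names, and the encode/decode helpers are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE typed_value (
        id         INTEGER PRIMARY KEY,
        type_tag   TEXT NOT NULL CHECK (type_tag IN ('int', 'float', 'str')),
        text_value TEXT NOT NULL
    )
    """
)

def encode(value):
    """Turn a Python value into a (type_tag, text) pair for storage."""
    if isinstance(value, bool):
        raise TypeError("bool is not one of the supported types")
    if isinstance(value, int):
        return ("int", str(value))
    if isinstance(value, float):
        return ("float", repr(value))  # repr round-trips floats exactly
    if isinstance(value, str):
        return ("str", value)
    raise TypeError(f"unsupported type: {type(value).__name__}")

def decode(type_tag, text):
    """Rebuild the original Python value from its stored form."""
    return {"int": int, "float": float, "str": str}[type_tag](text)

originals = [7, 2.5, "7"]
conn.executemany(
    "INSERT INTO typed_value (type_tag, text_value) VALUES (?, ?)",
    [encode(v) for v in originals],
)
decoded = [
    decode(tag, text)
    for tag, text in conn.execute(
        "SELECT type_tag, text_value FROM typed_value ORDER BY id"
    )
]
print(decoded)
```

The type column is what keeps the integer 7 and the string "7" apart after the round trip.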