Data Warehouse design - Handling NULL and empty values in the OLTP - sql-server

I am creating a DW for an OLTP that is creaking somewhat.
A problem I'm faced with is that there isn't much data integrity in the OLTP database. An example would be a Suburb field.
This suburb field is a free text field on the OLTP UI which means we've got values in the field, plus we've got empty strings and we've got NULL values.
How would we usually handle this? The scenarios I've come up with are:
Import data as is (not ideal)
In my ETL process, treat any empty string the same as a NULL and replace that with the word 'Unknown' in the DW
Import both empty strings and NULLs as empty strings in the DW
Just FYI, I'm using the Microsoft BI stack (SQL Server, SSIS, SSAS, SSRS)

The short answer is, it depends on what NULL and empty strings mean in the source system.
This general question (handling NULL) has been discussed a lot elsewhere. I think the most important point to remember is that a data warehouse is just a database; it may have a very specific type of schema and be designed for one purpose, but it's still just a database and any general advice on NULL still applies.
(As a side note, I sometimes prefer to talk about a "reporting database" rather than a "data warehouse", because it keeps things in perspective. Some DBAs and developers start making plans for huge server farms and multi-year ETL projects as soon as they hear the words "data warehouse", but in the end it's just a reporting database.)
Anyway, it isn't completely clear where you want to use NULL but it looks like it may be an attribute on a dimension.
I (probably) wouldn't use any of your three approaches, but it depends on the meaning of your data. Importing the data as-is is not useful because part of the value of a data warehouse is that the data has been cleaned and is consistent, which makes querying and comparing data along other dimensions much easier.
Replacing empty strings with 'Unknown' may or may not be correct: what does an empty string mean in the source system? There's a big difference between "it means there's no suburb" and "it means we don't know if there's a suburb". Assuming that an empty string means "no suburb" and NULL means "unknown" then I would import the empty strings as they are, but replace NULL with 'Unknown'. The main reason for doing that is that if the Suburb field will be used as a filter condition in a report, it's easier for users (and possibly your reporting tool) to work with a non-NULL value like 'UNKNOWN'. And if there is no consistency in the source system and you don't know what empty strings and NULLs mean, then you need to clarify that first and ideally fix the source system too (another benefit of a DWH is that it helps to identify inconsistencies and data handling errors in source systems).
Your last idea to convert NULLs to empty strings is the same issue: what does a NULL actually mean in the source system? If it means "no suburb" then replacing it with an empty string is probably a good idea, but if it means something else then you should handle it as something else.
So to summarize, my preference would be to import empty strings as-is, and convert NULL to 'UNKNOWN', but I can't be sure that this actually makes sense in your case. There's no single answer to this question because it all depends on your specific data and what it means. But there's no problem with using NULL in a data warehouse (or any other database) as long as you do it consistently and with a clear understanding of how the source systems handle data.

Semantically, NULL would usually mean undefined/unknown. Whereas, "" empty string would mean that the value is known to be empty. In your suburb example, NULL could mean that it is not known whether there is a suburb for the given record, while "" could mean there is for sure no suburb for the given record.
If NULL and "" mean the same thing in your situation, it is best to normalize both values to the same thing (say "") before importing to the DW. That makes your reports easier later (you avoid having NULL = 50 and "" = 34 and having to add them together).
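As a rough sketch of that normalization step, here is what it could look like in Python with the standard-library sqlite3 module standing in for the warehouse. The table name dim_customer, the sample rows, and the helper normalize_suburb are all made up for illustration, and treating whitespace-only strings as blank is an extra assumption. In SSIS the same logic would more likely live in a Derived Column transformation.

```python
import sqlite3

def normalize_suburb(value, placeholder="Unknown"):
    """Map NULL (None) and blank strings to one placeholder; trim real values."""
    if value is None or value.strip() == "":
        return placeholder
    return value.strip()

# Hypothetical source rows (customer_id, suburb) as they might arrive from the OLTP.
source_rows = [(1, "Newtown"), (2, ""), (3, None), (4, "  "), (5, "Glebe")]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, suburb TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)",
    [(cid, normalize_suburb(suburb)) for cid, suburb in source_rows],
)

# One bucket for every "missing" suburb instead of separate NULL and '' buckets.
counts = dict(conn.execute("SELECT suburb, COUNT(*) FROM dim_customer GROUP BY suburb"))
print(counts)
```

This only makes sense if NULL and "" really do mean the same thing in the source. If they don't, the helper would need two different placeholders.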

Related

How to handle NOT NULL SQL Server columns in Access forms elegantly?

I have an MS Access front-end linked to a SQL Server database.
If some column is required, then the natural thing to do is to include NOT NULL in that column's definition (at the database level). But that seems to create problems on the Access side. When you bind a form to that table, the field bound to that column ends up being pretty un-user-friendly. If the user erases the text from that field, they will not be able to leave the field until they enter something. Each time they try to leave the field while it's blank, they will get this error:
You tried to assign the Null value to a variable that is not a Variant data type.
That's a really terrible error message - even for a developer, let alone the poor user. Luckily, I can silence it or replace it with a better message with some code like this:
Private Sub Form_Error(DataErr As Integer, Response As Integer)
    If DataErr = 3162 Then
        ' Suppress the built-in message and show our own instead.
        Response = acDataErrContinue
        ' <check which field is blank>
        MsgBox "<some useful message>"
    End If
End Sub
But that's only a partial fix. Why shouldn't the user be able to leave the field? No decent modern UI restricts focus like that (think web sites, phone apps, desktop programs - anything, really). How can we get around this behavior of Access with regard to required fields?
I will post the two workarounds I have found as an answer, but I am hoping there are better ways that I have overlooked.
Rather than changing backend table definitions or trying to "trick" Access with out-of-sync linked table definitions, instead just change the control(s) for any "NOT NULL" column from a bound to an unbound field (i.e. Clear the ControlSource property and change the control name--by adding a prefix for example--to avoid annoying collisions with the underlying field name.).
This solution will definitely be less "brittle", but it will require you to manually add binding code to a number of other Form events. To provide a consistent experience as other Access controls and forms, I would at least implement Form_AfterInsert(), Form_AfterUpdate(), Form_BeforeInsert(), Form_BeforeUpdate(), Form_Current(), Form_Error(), Form_Undo().
P.S. Although I do not recall seeing such a poorly-worded error message before, the overall behavior described is identical for an Access table column with Required = True, which is the Access UI equivalent of NOT NULL column criteria.
I would suggest, if you can, simply changing all tables on SQL Server to allow nulls for those text columns. For bit and number columns, default them to 0 on the SQL Server side. Our industry tends to suggest avoiding nulls, and many a developer ALSO wants to avoid nulls, so they un-check "allow nulls" on the SQL Server side. The problem is that you can never run away from and avoid tons of nulls anyway. Take a simple query of, say, customers and their last invoice number + invoice total. It would of course be VERY common to include customers that have not bought anything in that list (customers without invoices yet, or customers in any of a gazillion possible cases where the child record(s) don't yet exist). I find about 80% or MORE of my queries in a typical application are LEFT joins. That means any parent record without child records will return ALL OF those child columns as null. You are going to work with, see, and HAVE to deal with tons and tons of nulls in an application EVEN if your table designs NEVER allow nulls. You cannot avoid them - you simply cannot run away from those nasty nulls.
Since one will see lots of nulls in code and any SQL query (those VERY common left joins), by far and away the best solution is to simply allow and set all text columns as allowing nulls. I can also state that if an application designer does not put their foot down and make a strong choice to ALWAYS use nulls, then the creeping in of both NULLS and ZLS (zero-length string) data is a much worse issue to deal with.
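To illustrate the LEFT join point, here is a small sketch using Python's built-in sqlite3 module rather than SQL Server. The customers and invoices tables and their rows are invented for the example. Every column is declared NOT NULL, yet the query result still contains nulls for the customer with no invoice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE invoices (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO invoices VALUES (10, 1, 99.5);
""")

# Bob has no invoice, so the invoice columns come back as NULL (None in Python)
# even though no column in either table allows NULL.
rows = conn.execute("""
    SELECT c.name, i.id, i.total
    FROM customers AS c
    LEFT JOIN invoices AS i ON i.customer_id = c.id
    ORDER BY c.id
""").fetchall()
print(rows)
```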
The problem and issue becomes very nasty and painful if one does not have control or one cannot make this choice.
At the end of the day, Access simply does not work well with SQL Server when ZLS columns are allowed.
For a migration to SQL Server (and I have been doing them for 10+ years), it is without question that going with nulls for all text columns is by far and away the easiest choice here.
So I recommend that you not attempt to code around this issue but simply change all your SQL tables to default to and allow nulls for empty columns.
The result of the above may require some minor modifications to the application, but the pain and effort is going to be far less than attempting to fix or code around Access's poor support (actually non-support) of ZLS columns when working with SQL Server.
I will also note that this suggestion is not a great suggestion, but it is simply the best suggestion given the limitations of how Access works with SQL Server. Some database systems (Oracle) treat a ZLS and a null as the same thing, and thus you don't have to care about, say, this:
select * from tblCustomers where (City is null) or (City = '')
As above shows, the instant you allow both ZLS and nulls into your application is the SAME instant that you created a huge monster mess. And the scholarly debate about nulls being un-defined is simply a debate for another day.
If you are developing with Access + SQL Server, then one needs to adopt a standard approach - I recommend that the approach simply be that all text columns, and date columns, are set to allow nulls. For number and bit columns, default them to 0.
This comes down to which is less pain and work.
Either attempt some MAJOR modifications to the application and, say, un-bind text columns (that can be a huge amount of work).
Or
Simply assume and set all text columns to allow nulls. It is the lesser of two evils in this case, and one has to conform to the bag of tools that has been handed to you.
So I don't have a workaround, but only a path and course to take that will result in the least amount of work and pain. That least pain road is to go with allowing nulls. This suggestion will only work of course if one can make that choice.
The two workarounds I have come up with are:
Don't make the database column NOT NULL and rely exclusively on Access forms for data integrity rather than the database. Readers of that table will be burdened with an ambiguous column that will not contain nulls in practice (as long as the form-validation code is sound) but could contain nulls in theory due to the way the column is defined within the database. Not having that 100% guarantee is bothersome but may be good enough in reality.
Verdict: easy but sloppy - proceed with caution
Abuse the fact that Access' links to external tables have to be refreshed manually. Make the column NULL in SQL Server, refresh the link in Access, and then make the column NOT NULL again in SQL Server - but this time, don't refresh the link in Access.
The result is that Access won't realize the field is NOT NULL and, therefore, will leave the user alone. They can move about the form as desired without getting cryptic error 3162 or having their focus restricted. If they try to save the form while the field is still blank, they will get an ODBC error stemming from the underlying database. Although that's not desirable, it can be avoided by checking for blank fields in Form_BeforeUpdate() and providing the user with an intelligible error message instead.
Verdict: better for data integrity but also more of a pain to maintain, sort of hacky/astonishing, and brittle in that if someone refreshes the table link, the dreaded error / focus restriction will return - then again, that worst-case scenario isn't catastrophic because the consequence is merely user annoyance, not data-integrity problems or the application breaking

What is the best solution to store a volunteers availability data in access 2016 [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
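For comparison, here is a minimal sketch of the normalized alternative, using Python's standard sqlite3 module as a stand-in for whatever RDBMS you use. The options and form_selections tables and the try_insert helper are hypothetical names for illustration. The point is that uniqueness, referential integrity, searching, and counting all come from the database rather than from string parsing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE options (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE form_selections (
    form_id   INTEGER NOT NULL,
    option_id INTEGER NOT NULL REFERENCES options(id),
    PRIMARY KEY (form_id, option_id)  -- one row per checked box, no duplicates
);
INSERT INTO options VALUES (1, 'Mon'), (2, 'Tue'), (3, 'Wed');
INSERT INTO form_selections VALUES (100, 1), (100, 2), (101, 2);
""")

def try_insert(form_id, option_id):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO form_selections VALUES (?, ?)", (form_id, option_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert(100, 1)   # same box checked twice: primary key violation
unknown_ok = try_insert(100, 99)    # option 99 does not exist: foreign key violation

# Searching and counting become ordinary queries.
forms_with_tue = [r[0] for r in conn.execute(
    "SELECT form_id FROM form_selections WHERE option_id = 2 ORDER BY form_id")]
counts = conn.execute(
    "SELECT form_id, COUNT(*) FROM form_selections GROUP BY form_id ORDER BY form_id"
).fetchall()
```

Deleting one selection is then a single DELETE on (form_id, option_id) instead of a fetch, edit, and rewrite of the whole list.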
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
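A rough sketch of the bit-flag idea, in Python with enum.IntFlag and the standard sqlite3 module. The Day flags, the volunteers table, and the encode/decode helpers are invented for illustration, and this assumes the set of check boxes is small and fixed.

```python
import sqlite3
from enum import IntFlag

class Day(IntFlag):
    # Each check box gets its own bit.
    MON = 1
    TUE = 2
    WED = 4
    THU = 8

def encode(days):
    """Combine individual flags into one integer for storage."""
    value = 0
    for day in days:
        value |= int(day)
    return value

def decode(stored):
    """Return the individual flags that are set in a stored integer."""
    return [day for day in Day if stored & day]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE volunteers (name TEXT NOT NULL, days INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO volunteers VALUES (?, ?)",
    [("Ann", encode([Day.MON, Day.WED])), ("Ben", encode([Day.TUE]))],
)

# A bitwise AND in the WHERE clause finds everyone with a given box checked.
wednesday = [r[0] for r in conn.execute(
    "SELECT name FROM volunteers WHERE (days & ?) <> 0", (int(Day.WED),))]
```

This removes the typo and separator problems, but a filter like this still cannot use an ordinary index, and adding or reordering check boxes means managing the bit assignments by hand.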
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column; it could be implemented as an XML field.
It could be converted to a comma-delimited list as necessary.
Querying an XML list in SQL Server using XQuery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well, I've been using a key/value pair tab-separated list in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/depersists the key/value pairs then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR (0) (nullable) for each. You could also use a SET (I forget the exact syntax).

is delimiting data in a database field ok

Is delimiting data in a database field something that would be ok to do?
Something like
create table column_names (
    id int identity (1,1) PRIMARY KEY,
    column_name varchar(5000)
);
and then storing data in it as follows
INSERT INTO column_names (column_name) VALUES ('stocknum|name|price');
No. this is bad:
in order to create new queries you have to track down how things are stored.
queries that join on price or name or stocknum are going to be nasty
the database can't assign data types to the data or validate it
you can't create constraints on any of this data now
Basically you're subverting the RDBMS' scheme for handling things and making up your own, so you're limiting how much the RDBMS tools can help you and you've made the system harder to understand for new people.
The only possible advantage of this kind of system that I can think of is that it can serve as a workaround to avoid dealing with a totally impossible DBA who vetoes all schema changes regardless of merit. Which can happen, unfortunately.
Of course there's an exception to everything. I'm currently on a project with audit-logging requirements that are pretty stringent. The logging is done to a database, and we're using delimited fields for storing the data because the application is never going to interact with it; it gets written once and left alone.
Almost certainly not.
It violates principles of normalization. The data stored in a particular row of a particular column should be atomic-- you shouldn't be able to parse the data into smaller component parts.
It makes it substantially more difficult to get acceptable performance. Every piece of code that queries this table will need to know how to parse the data which is generally going to mean that more data needs to be read off disk and potentially sent over the network to the client. Every query that has to parse this data is going to have to be more complex which tends to cause grief for the query optimizer. Concatenated data cannot generally be indexed effectively for searches-- you'd have to do something like a full-text index with custom delimiters rather than a nice standard index on a character string. And if you ever have to update one of the delimited values (i.e. because a product name changes), those updates are going to have to scan every row in the table, parse the data, decide whether to actually update the row, and then update a ton of rows.
It makes the application much more brittle. What happens when someone decides to include a | character in the name attribute, for example? Even if you specify an optional enclosure in the spec (i.e. | is allowed if the entire token is enclosed in double quotes), what fraction of the bits of code that actually parse this column are going to implement and test that correctly?
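The brittleness is easy to demonstrate. In this sketch (Python, with the standard sqlite3 module as a stand-in database) the parse_naive helper, the stock table, and the sample values are all hypothetical. A product name containing the delimiter silently shifts every field after it, while separate columns store the same name without any special handling.

```python
import sqlite3

def parse_naive(field):
    """Split a delimited field the way most ad hoc parsing code would."""
    return field.split("|")

ok = parse_naive("A100|Widget|9.99")            # three parts, as intended
broken = parse_naive("A101|Nut|Bolt set|4.50")  # four parts: the name was cut in two

# With real columns the same name needs no escaping rules at all.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (stocknum TEXT PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL)"
)
conn.execute("INSERT INTO stock VALUES (?, ?, ?)", ("A101", "Nut|Bolt set", 4.50))
name = conn.execute("SELECT name FROM stock WHERE stocknum = 'A101'").fetchone()[0]
```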

what's best way to leave empty database cells?

I'm not that experienced with databases. If I have a database table containing a lot of empty cells, what's the best way to leave them (e.g. so performance isn't degraded, memory is not consumed, if this is even possible)?
I know there's a "null" value. Is there a "none" value or equivalent that has no drawbacks? Or by just not filling the cell, it's considered empty, so there's nothing left to do? Sorry if it's silly question. Sometimes you don't know what you don't know...
Not trying to get into a discussion of normalizing the database. Just wondering what the conventional wisdom is for blank/empty/none cells.
Thanks
The convention is to use null to signify a missing value. That's the purpose of null in SQL.
Noted database researcher C. J. Date writes frequently about his objections to the handling of null in SQL at a logical level, and he would say any column that may be missing belongs in a separate table, so that the absence of a row corresponds to a missing value.
I'm not aware of any serious efficiency drawbacks of using null. The efficiency of any feature depends on the specific database implementation you use. You haven't said if you use MySQL, Oracle, Microsoft SQL Server, or other.
MySQL's InnoDB storage engine, for example, doesn't store nulls among the columns of a row, it just stores the non-null columns. Other databases may do this differently. Likewise nulls in indexes should be handled efficiently, but it varies from product to product.
Use NULL. That's what it's for.
Normally databases are said to have rows and columns. If the column does not require a value, it holds nothing (aka NULL) until it is updated with a value. That is best practice for most databases, though not all databases have the NULL value--some use an empty string, but they are the exception.
With regard to space utilization -- disk is relatively inexpensive these days, so worries about space consumption are no longer as prevalent as they once were, except in gargantuan databases, perhaps. You can get better performance out of a database if you use all fixed-size datatypes, but once you start allowing variable-sized string types (e.g. varchar, nvarchar), that optimization is no longer possible.
In brief, don't worry about performance for the time being, at least until you get your feet wet.
It is possible, but consider:
Are they supposed to be not-empty? Should you implement not null?
Is it a workflow -- so they are empty now, but most of them will be filled in the future?
If both are NO, then you may consider re-design. Edit your question and post the schema you have now.
There are several schools of thought in this. The first is to use null when the data is not known - that's what it's for.
The second is to not allow nulls and either separate out all the fields that could be null to relational tables or create "fake" values to replace null. For varchar this would usually be the empty string, but the problem arises as to what the fake value should be for a date or a numeric field. Then you have to write code to exclude the fake data just like you have to write code to deal with the nulls.
Personally I prefer to use nulls with some judicious moving of data to child tables if the data is truly a different entity (and often these fields turn out to need the one-to-many structure of a parent-child relationship anyway, such as when you may or may not know the phone number of a person, put it in a separate phone table and then you will often discover you needed to store multiple phone numbers anyway).

Is it good practice to set all database columns as NOT NULL?

Normally is it good practice to set all database columns as NOT NULL or not ? Justify your answer.
No. It's a good idea to set columns to NULL where appropriate.
I kind of disagree with the "where appropriate" rule. It is actually rather safe to set any column to be NOT NULL; and then later modify the columns to allow NULL values when you need them. On the other hand, if you allow NULL values first and then later decide you don't want to allow them, it can potentially be much more difficult to do this.
It may make your database table/column descriptions quite ugly if you do this excessively, but when in doubt, go ahead and restrict the data.
Relational theory has it that NULL is evil.
However, your question kind of referred to practice.
So, to the extent that you want your practices to conform to the heavenly ideals of theory, yes, avoid NULL as if it were the plague, Cholera and AIDS all-in-one.
To the extent that these crappy implementations called "SQL DBMSs" do not leave you any other choice, yes, (sniff) use them.
EDIT
Someone mentioned "business rules" as the guideline for "appropriateness" in the accepted answer, and some others upvoted that remark. That is total crap. Business rules can always do without NULLs and the only guideline to "appropriateness" is the very deficiencies of any SQL system that makes it a non-relational system to boot.
The inventor of the NULL reference (1965) recently called it his "billion-dollar mistake": https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/
Languages such as Scala, SML, and Haskell are non-NULL by default: NULL is called "Option" or "Maybe" and requires special syntax and checks.
Since the time databases were invented, allowing NULL by default has been considered more and more dangerous and undesirable. Should databases follow? Probably.
Go with NOT NULL when you can.
I'm a newbie and my answer may be totally asinine, but here's my personal take on the subject.
In my humble opinion, I don't see the problem with allowing ALL fields except primary/foreign keys to be nullable. I know many of you cringed as soon as I said that, and I'm sure I heard someone cry out, "Heretic! Burn him at the stake!" But here's my reasoning:
Is it really the job of the database to enforce rules about what values should and should not be permitted - except of course as needed to enforce things like referential integrity and to control storage consumption (by having things like max chars set)? Wouldn't it be easier and better to enforce all "null vs. not null" rules at the code level prior to storing the values in the database?
After all, it's the job of the code to validate all values prior to them being stored in the database anyway, right? So why should the database try to usurp the code's authority by also setting up rules about what values are valid? (In a way, using not null constraints except where absolutely necessary almost feels like a violation of the idea of "separation of concerns.") Furthermore, any time a constraint is enforced at the database level, it must necessarily be enforced at the code level also to prevent the code from "blowing up." So why do twice as much work?
At least for me, it seems like things work out better when my database is allowed to simply be a "dumb data storage container" because inevitably in the past when I've tried to use "NOT NULL" to enforce a business rule which made sense to me at the time, I end up wishing I hadn't and end up going back and removing the constraint.
Like I said, I realize I'm a newbie and if there's something I'm overlooking, let me know - and try not to butcher me up too bad :) Thanks.
If you can't know the value at insert time, you really must have a null allowed. For instance, suppose you have a record that includes two fields, begin date and end date. You know the begin date when the record is inserted but not the end date. Creating a fake date to put in this field just to avoid nulls is dumb to say the least.
In real life at least as much harm is caused by forcing data entry into a field as by not forcing it. If you have an email field and don't know the customer's email, then the user has to make something up to put into the required field. What they make up is likely not what you would want them to make up, something like "thisistupid@ass.com". Sometimes this bad info gets provided back to the client or to a vendor in a data feed and your company looks really really stupid. I know, as I process a lot of these feeds coming in from our customers. Nice things in the email field have included "his secretary is the fat blonde", "this guy is a jerk" etc.
From my perspective, while it may be better for the database, it's not better for the user. Once you get into more interactive applications, you want to be able to persist the data in an interim state, so most of your fields will probably be null at that point.
It depends on what you're trying to do, but for many applications it's a good idea to avoid NULLs where possible — and the most foolproof way to do this is to use NOT NULL.
The problem is that the meaning of NULL is open to interpretation. It could mean “no value belongs here,” or it could mean “we haven't got the value yet, so we should keep asking the user for it.” If you are using it, you'll want to read up on SQL's 3-valued logic, and functions such as COALESCE, etc.
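Here is a short sketch of where three-valued logic tends to surprise people, using Python's standard sqlite3 module. The people table, its rows, and the names helper are invented for the example. A comparison with NULL evaluates to unknown rather than true or false, so rows with NULL drop out of both `=` and `<>` filters unless you use IS NULL or COALESCE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (name TEXT NOT NULL, city TEXT);
INSERT INTO people VALUES ('Ann', 'Perth'), ('Ben', NULL), ('Cy', '');
""")

def names(where):
    # The WHERE text is a fixed literal from this demo, never user input.
    sql = f"SELECT name FROM people WHERE {where} ORDER BY name"
    return [r[0] for r in conn.execute(sql)]

not_perth = names("city <> 'Perth'")                  # Ben is NOT returned: NULL <> 'Perth' is unknown
equals_null = names("city = NULL")                    # always empty: nothing equals NULL
is_null = names("city IS NULL")                       # the correct test for NULL
blank_or_missing = names("COALESCE(city, '') = ''")   # treat NULL and '' alike
```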
Nevertheless, as Cletus and others have said, if you use NULL appropriately it can be useful.
In business apps I was always removing my NOT NULLS because the users did not like being forced to enter data that they didn't know. It depends on the table but I set most of my fields to NULL and only set the bare minimum number of fields to NOT NULL.
If your data can actually BE "unknown", and it's important to record that fact, then yes, use a NULL. Bear in mind that sometimes you need to differentiate between "unknown" and "not relevant" - for example, a DateTime field in one of my databases can either be the SQL Server minimum date (not applicable), NULL (unknown), or any other date (known value).
For fields which don't really have business rules depending on them - I'm talking about "Comments", "Description", "Notes" columns here - then I set them to default to empty strings, as (a) it saves dealing with nulls, and (b) they are never "unknown" - they just aren't filled in, which logically is a known empty value.
E.g.:
CREATE TABLE Computer (
    Id INT IDENTITY PRIMARY KEY
    , Name NVARCHAR(16) NOT NULL
    , ...[other fields]...
    , Comments NVARCHAR(255) NOT NULL
        CONSTRAINT DF_Computer_Comments DEFAULT (N'')
)
If you don't supply a value to Comments, it defaults to empty.
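To see that behaviour in action, here is a small sketch. It assumes SQLite (through Python's sqlite3 module) rather than SQL Server, so the column types and the DEFAULT syntax are simplified, and the row values are invented.

```python
import sqlite3

# Assumption: SQLite stand-in; TEXT replaces NVARCHAR and the named
# constraint is reduced to a plain DEFAULT ''.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Computer (
        Id INTEGER PRIMARY KEY,
        Name TEXT NOT NULL,
        Comments TEXT NOT NULL DEFAULT ''
    )
""")

# Comments is omitted, so the default empty string is stored.
conn.execute("INSERT INTO Computer (Name) VALUES ('web01')")
comments = conn.execute(
    "SELECT Comments FROM Computer WHERE Name = 'web01'").fetchone()[0]

# An explicit NULL is still rejected by the NOT NULL constraint.
try:
    conn.execute("INSERT INTO Computer (Name, Comments) VALUES ('web02', NULL)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

print(repr(comments), null_rejected)
```

Note that the default only applies when the column is left out of the INSERT; passing NULL explicitly does not fall back to it.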
Short answer: it depends on what you are storing.
I can see a table (or two) having all NOT NULLS or all NULLS. But an entire database?
Only for columns where not having a value doesn't make any sense.
Nulls can be very handy; for one thing, they compress beautifully. They can be a nasty surprise when you don't expect them, though, so if you can't have a Student without a First Name -- make that column NOT NULL. (Middle names, on the other hand... maybe you want to have a default empty string, maybe not -- decent arguments both ways)
You should not forget to set not null where needed, use check constraints if applicable, not forget about unique constraints, create proper indexes and brush your teeth after every meal and before going to bed:)
In most cases you can use NOT NULL, and you should. It is easier to change a column from NOT NULL to NULL than the other way round. But you can't use it everywhere: in Oracle, for example, an empty string is treated as NULL.
What's the alternative?
I found this question as a result of a discussion at work. Our question was:
Should we have a nullable foreign key or an association table with unique constraints?
The context was that sometimes there is an association and sometimes there isn't (e.g. unplanned vs. planned schedules).
For me, a combination of nullable foreign key with a 'set field to null on delete' was equivalent to the association table but had two advantages:
More understandable (the schema was already complex)
Easier to find 'unplanned' schedules with an 'xxx is null' query (vs. not exists query)
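A minimal sketch of that nullable foreign key setup, assuming SQLite via Python's sqlite3 module. The plan and schedule tables and their columns are hypothetical names, not the schema we actually had.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys once this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE plan (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE schedule (
        id INTEGER PRIMARY KEY,
        label TEXT NOT NULL,
        -- nullable: NULL means the schedule is unplanned
        plan_id INTEGER REFERENCES plan(id) ON DELETE SET NULL
    )
""")
conn.execute("INSERT INTO plan (id, title) VALUES (1, 'Q1 rollout')")
conn.executemany(
    "INSERT INTO schedule (label, plan_id) VALUES (?, ?)",
    [("planned-a", 1), ("adhoc-b", None)],
)

UNPLANNED = "SELECT label FROM schedule WHERE plan_id IS NULL ORDER BY label"
unplanned_before = [r[0] for r in conn.execute(UNPLANNED)]

# Deleting the plan nulls the reference instead of deleting the schedule.
conn.execute("DELETE FROM plan WHERE id = 1")
unplanned_after = [r[0] for r in conn.execute(UNPLANNED)]

print(unplanned_before, unplanned_after)
```

With an association table the same question needs a NOT EXISTS subquery (or an outer join) instead of the single IS NULL test.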
In summary, sometimes 'null' (the absence of information) actually means something. Try to have non-null, but there are exceptions.
FWIW, we were using Scala / Squeryl so, in code, the field was an 'Option' and quite safe.
My take is that if you want flexible and somewhat "ambiguous" tables, just use NoSQL, as it is built precisely for that purpose. Otherwise, a NULL value in a row is acceptable when it represents some piece of optional data, like Address 2 or a home phone number.
In my opinion, making foreign keys nullable breaks one of the main reasons we use relational databases: you want your data to be as tightly related and consistent as possible.
It depends (on the datatype)
Think about this: if the immediate technology that interacts with the database is Python, I would make everything NOT NULL with a proper DEFAULT.
However, that only makes sense if the column is a VARCHAR with an empty string as the default.
What about NUMERIC? It is hard to come up with a default value there, and NULL can convey more than simply setting DEFAULT 0.
For BOOLEAN, NULL still makes some sense, and so on.
A similar argument can be made for other datatypes, such as spatial data types.
IMO, the use of nullable columns should be minimized. The application should designate a suitable value for the "non-existent" state. In PeopleSoft, I think, the application puts a 0 in numeric columns and a space in Char columns where a value does not exist.
One could argue why the so-called suitable value couldn't be NULL.
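Because SQL implementations treat NULLs totally differently.
For example:
1 = NULL and 0 = NULL both evaluate to UNKNOWN rather than true, so in a WHERE clause they behave like false!
NULL = NULL is not true either; you have to use IS NULL to test for NULL.
NULL values in GROUP BY and in aggregate functions can also give unexpected results: GROUP BY puts all NULLs into a single group, and aggregates such as COUNT(column) and AVG skip NULLs.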
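These effects are easy to check for yourself. The sketch below assumes SQLite through Python's sqlite3 module, and the reading table and its values are invented for the example. In standard SQL a comparison against NULL comes back as NULL (unknown), which a WHERE clause then treats the same way as false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each comparison yields NULL, which Python's sqlite3 returns as None.
eq_one, eq_zero, eq_null = conn.execute(
    "SELECT 1 = NULL, 0 = NULL, NULL = NULL").fetchone()

conn.execute("CREATE TABLE reading (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO reading (sensor, value) VALUES (?, ?)",
    [("a", 10.0), ("a", None), (None, 4.0), (None, 6.0)],
)

# COUNT(*) counts rows; COUNT(value) and AVG(value) skip the NULL value.
count_all, count_value, avg_value = conn.execute(
    "SELECT COUNT(*), COUNT(value), AVG(value) FROM reading").fetchone()

# GROUP BY collects both NULL sensors into one group.
groups = dict(conn.execute(
    "SELECT sensor, COUNT(*) FROM reading GROUP BY sensor").fetchall())

print(eq_one, eq_zero, eq_null)
print(count_all, count_value, avg_value)
print(groups)
```

So the average here is taken over three values, not four, which is exactly the kind of result that surprises people when a column they assumed was always filled in turns out to be nullable.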

Resources