In SQL Server, if you have nullParam = NULL in a WHERE clause, it always evaluates to false. This is counterintuitive and has caused me many errors. I do understand that IS NULL and IS NOT NULL are the correct way to do it. But why does SQL Server behave this way?
Think of the null as "unknown" in that case (or "does not exist"). In either of those cases, you can't say that they are equal, because you don't know the value of either of them. So, null=null evaluates to not true (false or null, depending on your system), because you don't know the values to say that they ARE equal. This behavior is defined in the ANSI SQL-92 standard.
EDIT:
This depends on your ANSI_NULLS setting. If you have ANSI_NULLS off, this WILL evaluate to true. Run the following code for an example:
SET ANSI_NULLS OFF
IF NULL = NULL
    PRINT 'true'
ELSE
    PRINT 'false'

SET ANSI_NULLS ON
IF NULL = NULL
    PRINT 'true'
ELSE
    PRINT 'false'
How old is Frank? I don't know (null).
How old is Shirley? I don't know (null).
Are Frank and Shirley the same age?
The correct answer should be "I don't know" (null), not "no": Frank and Shirley might be the same age; we simply don't know.
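To see this in a query, here is a minimal sketch (the People table and its rows are hypothetical):

CREATE TABLE People (name VARCHAR(20), age INT);
INSERT INTO People VALUES ('Frank', NULL), ('Shirley', NULL);

SELECT 'same age'
FROM People f, People s
WHERE f.name = 'Frank'
  AND s.name = 'Shirley'
  AND f.age = s.age;  -- NULL = NULL is UNKNOWN, so no row comes back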
Here I will hopefully clarify my position.
That NULL = NULL evaluates to FALSE is wrong. Hacker and Mister correctly answered NULL.
Here is why. Dewayne Christensen wrote to me, in a comment to Scott Ivey:
Since it's December, let's use a
seasonal example. I have two presents
under the tree. Now, you tell me if I
got two of the same thing or not.
They can be different or they can be equal; you don't know until you open both presents. Who knows? You might have invited two people who don't know each other and who both gave you the same gift - rare, but not impossible §.
So the question: are these two UNKNOWN presents the same (equal, =)? The correct answer is: UNKNOWN (i.e. NULL).
This example was intended to demonstrate that "..(false or null, depending on your system).." is a correct answer - it is not; only NULL is correct in 3VL (or is it OK for you to accept a system which gives wrong answers?).
A correct answer to this question must emphasize these two points:
three-valued logic (3VL) is counterintuitive (see the countless other questions on this subject on Stack Overflow and in other forums to make sure);
SQL-based DBMSes often do not even respect 3VL; they sometimes give wrong answers (as, the original poster asserts, SQL Server does in this case).
So I reiterate: SQL does no one any good by forcing one to interpret the reflexive property of equality, which states that:
for any x, x = x §§ (in plain English: whatever the universe of discourse, a "thing" is always equal to itself).
.. in a 3VL (TRUE, FALSE, NULL). People's expectations conform to 2VL (TRUE, FALSE, which even in SQL is valid for all other values), i.e. x = x always evaluates to TRUE, for any possible value of x - no exceptions.
Note also that NULLs are considered valid "non-values" (as their apologists claim) which one can assign as attribute values(??) as part of relation variables. So they are acceptable values of every type (domain), not only of the type of logical expressions.
And this was my point: NULL, as a value, is a "strange beast". Without euphemism, I prefer to say: nonsense.
I think that this formulation is much clearer and less debatable - sorry for my poor English proficiency.
This is only one of the problems of NULLs. Better to avoid them entirely, when possible.
§ we are concerned with values here, so the fact that the two presents are always two different physical objects is not a valid objection; if you are not convinced, I'm sorry, but this is not the place to explain the difference between value and "object" semantics (Relational Algebra has had value semantics from the start - see Codd's information principle; I think that some SQL DBMS implementors don't even care about a common semantics).
§§ to my knowledge, this is an axiom accepted (in one form or another, but always interpreted in a 2VL) since antiquity, precisely because it is so intuitive. 3VLs (in reality a family of logics) are a much more recent development (though I'm not sure when they were first developed).
Side note: if someone introduces Bottom, Unit and Option types as attempts to justify SQL NULLs, I will be convinced only after a quite detailed examination that shows how SQL implementations with NULLs have a sound type system and that clarifies, finally, what NULLs (these "values-not-quite-values") really are.
In what follows I will quote some authors. Any error or omission is probably mine and not the original authors'.
Joe Celko on SQL NULLs
I see Joe Celko often cited on this forum. Apparently he is a much-respected author here. So, I said to myself: "What did he write about SQL NULLs? How does he explain their numerous problems?" One of my friends has an ebook version of Joe Celko's SQL for Smarties: Advanced SQL Programming, 3rd edition. Let's see.
First, the table of contents. The thing that strikes me most is the number of times that NULL is mentioned and in the most varied contexts:
3.4 Arithmetic and NULLs 109
3.5 Converting Values to and from NULL 110
3.5.1 NULLIF() Function 110
6 NULLs: Missing Data in SQL 185
6.4 Comparing NULLs 190
6.5 NULLs and Logic 190
6.5.1 NULLS in Subquery Predicates 191
6.5.2 Standard SQL Solutions 193
6.6 Math and NULLs 193
6.7 Functions and NULLs 193
6.8 NULLs and Host Languages 194
6.9 Design Advice for NULLs 195
6.9.1 Avoiding NULLs from the Host Programs 197
6.10 A Note on Multiple NULL Values 198
10.1 IS NULL Predicate 241
10.1.1 Sources of NULLs 242
...
and so on. It rings "nasty special case" to me.
I will go into some of these cases with excerpts from the book, trying to limit myself to the essential, for copyright reasons. I think these quotes fall within the "fair use" doctrine, and they may even encourage people to buy the book - so I hope that no one will complain (otherwise I will need to delete most of this, if not all). Furthermore, I shall refrain from reporting code snippets, for the same reason. Sorry about that. Buy the book to read the detailed reasoning.
Page numbers are given in parentheses in what follows.
NOT NULL Constraint (11)
The most important column constraint is the NOT NULL, which forbids
the use of NULLs in a column. Use this constraint routinely, and remove
it only when you have good reason. It will help you avoid the
complications of NULL values when you make queries against the data.
It is not a value; it is a marker that holds a place where a value might go.
Again this "value but not quite a value" nonsense. The rest seems quite sensible to me.
(12)
In short, NULLs cause a lot of irregular features in SQL, which we will discuss
later. Your best bet is just to memorize the situations and the rules for NULLs
when you cannot avoid them.
Apropos of SQL, NULLs and infinity:
(104) CHAPTER 3: NUMERIC DATA IN SQL
SQL has not accepted the IEEE model for mathematics for several reasons.
...
If the IEEE rules for math were allowed in
SQL, then we would need type conversion rules for infinite and a way to
represent an infinite exact numeric value after the conversion. People
have enough trouble with NULLs, so let’s not go there.
SQL implementations undecided on what NULL really means in particular contexts:
3.6.2 Exponential Functions (116)
The problem is that logarithms are undefined when (x <= 0). Some SQL
implementations return an error message, some return a NULL, and
DB2/400 version 3 release 1 returned *NEGINF (short for "negative
infinity") as its result.
Joe Celko quoting David McGoveran and C. J. Date:
6 NULLs: Missing Data in SQL (185)
In their book A Guide to Sybase and SQL Server, David McGoveran
and C. J. Date said: "It is this writer's opinion that NULLs, at least as
currently defined and implemented in SQL, are far more trouble than
they are worth and should be avoided; they display very strange and
inconsistent behavior and can be a rich source of error and confusion.
(Please note that these comments and criticisms apply to any system
that supports SQL-style NULLs, not just to SQL Server specifically.)”
NULLs as a drug addiction:
(186/187)
In the rest of this book, I will be urging you not to use
them, which may seem contradictory, but it is not. Think of a NULL
as a drug; use it properly and it works for you, but abuse it and it can ruin
everything. Your best policy is to avoid NULLs when you can and use
them properly when you have to.
My only objection here is to "use them properly", which interacts
badly with specific implementation behaviors.
6.5.1 NULLS in Subquery Predicates (191/192)
People forget that a subquery often hides a comparison with a NULL.
Consider these two tables:
...
The result will be empty. This is counterintuitive, but correct.
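Since the book's own snippet is omitted above, here is a sketch of my own (the tables T1 and T2 are hypothetical) showing the same trap:

CREATE TABLE T1 (x INT);
CREATE TABLE T2 (x INT);
INSERT INTO T1 VALUES (1), (2);
INSERT INTO T2 VALUES (1), (NULL);

-- "x NOT IN (1, NULL)" expands to "x <> 1 AND x <> NULL";
-- the second comparison is UNKNOWN, so the whole predicate is never TRUE:
SELECT * FROM T1
WHERE x NOT IN (SELECT x FROM T2);  -- empty result, even for x = 2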
6.5.2 Standard SQL Solutions (193)
SQL-92 solved some of the 3VL (three-valued logic) problems by adding
a new predicate of the form:
<search condition> IS [NOT] TRUE | FALSE | UNKNOWN
But UNKNOWN is a source of problems in itself, so that C. J. Date,
in his book cited below, recommends in chapter 4.5, Avoiding Nulls in SQL:
Don't use the keyword UNKNOWN in any context whatsoever.
Read "ASIDE" on UNKNOWN, also linked below.
6.8 NULLs and Host Languages (194)
However, you should know how NULLs are handled when they have
to be passed to a host program. No standard host language for
which an embedding is defined supports NULLs, which is another
good reason to avoid using them in your database schema.
6.9 Design Advice for NULLs (195)
It is a good idea to declare all your base tables with NOT NULL
constraints on all columns whenever possible. NULLs confuse people
who do not know SQL, and NULLs are expensive.
Objection: NULLs confuse even people who know SQL well;
see below.
(195)
NULLs should be avoided in FOREIGN KEYs. SQL allows this “benefit
of the doubt” relationship, but it can cause a loss of information in
queries that involve joins. For example, given a part number code in
Inventory that is referenced as a FOREIGN KEY by an Orders table, you
will have problems getting a listing of the parts that have a NULL. This is
a mandatory relationship; you cannot order a part that does not exist.
6.9.1 Avoiding NULLs from the Host Programs (197)
You can avoid putting NULLs into the database from the Host Programs
with some programming discipline.
...
Determine impact of missing data on programming and reporting:
Numeric columns with NULLs are a problem, because queries
using aggregate functions can provide misleading results.
(227)
The SUM() of an empty set is always NULL. One of the most common
programming errors made when using this trick is to write a query that
could return more than one row. If you did not think about it, you might
have written the last example as: ...
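A sketch of the empty-set behavior and the usual guard (reusing the hypothetical T1 from the earlier sketch):

SELECT SUM(x) FROM T1 WHERE 1 = 0;               -- NULL, not 0
SELECT COALESCE(SUM(x), 0) FROM T1 WHERE 1 = 0;  -- 0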
10.1.1 Sources of NULLs (242)
It is important to remember where NULLs can occur. They are more than
just a possible value in a column. Aggregate functions on empty sets,
OUTER JOINs, arithmetic expressions with NULLs, and OLAP operators
all return NULLs. These constructs often show up as columns in
VIEWs.
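For example, an OUTER JOIN manufactures NULLs for unmatched rows even when every base column is declared NOT NULL (again using the hypothetical T1 and T2 from above):

SELECT a.x, b.x
FROM T1 a
LEFT OUTER JOIN T2 b ON a.x = b.x;  -- the row (2, NULL) appears: b.x is a manufactured NULL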
(301)
Another problem with NULLs is found when you attempt to convert
IN predicates to EXISTS predicates.
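A sketch of the divergence, again with the hypothetical T1 = {1, 2} and T2 = {1, NULL}: the two forms look interchangeable but are not.

SELECT * FROM T1
WHERE x NOT IN (SELECT x FROM T2);  -- empty: the NULL poisons every comparison

SELECT * FROM T1 a
WHERE NOT EXISTS (SELECT * FROM T2 b WHERE b.x = a.x);  -- returns 2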
16.3 The ALL Predicate and Extrema Functions (313)
It is counterintuitive at first that these two predicates are not the same in SQL:
...
But you have to remember the rules for the extrema functions—they
drop out all the NULLs before returning the greater or least values. The
ALL predicate does not drop NULLs, so you can get them in the results.
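One way the two can diverge, with the hypothetical T1 = {1, 2} and T2 = {1, NULL} from before:

SELECT * FROM T1
WHERE x >= ALL (SELECT x FROM T2);   -- empty: "x >= NULL" is UNKNOWN, so ALL never yields TRUE

SELECT * FROM T1
WHERE x >= (SELECT MAX(x) FROM T2);  -- returns 1 and 2: MAX() drops the NULL first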
(315)
However, the definition in the standard is worded in the
negative, so that NULLs get the benefit of the doubt.
...
As you can see, it is a good idea to avoid NULLs in UNIQUE
constraints.
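A hedged sketch; note that this is one of the places where products disagree:

CREATE TABLE U (x INT UNIQUE);
INSERT INTO U VALUES (NULL);
INSERT INTO U VALUES (NULL);  -- accepted by most DBMSs (NULLs get the benefit of the doubt);
                              -- SQL Server is a notable exception and rejects the second NULL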
Discussing GROUP BY:
NULLs are treated as if they were all equal to each other, and
form their own group. Each group is then reduced to a single
row in a new result table that replaces the old one.
This means that for the GROUP BY clause, NULL = NULL does not
evaluate to NULL, as in 3VL, but evaluates to TRUE.
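A small sketch of that special case (the table G is hypothetical):

CREATE TABLE G (x INT);
INSERT INTO G VALUES (1), (NULL), (NULL);

SELECT x, COUNT(*) FROM G GROUP BY x;
-- x     COUNT(*)
-- 1     1
-- NULL  2        <- both NULLs land in the same group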
The SQL standard is confusing:
The ORDER BY and NULLs (329)
Whether a sort key value that is NULL is considered greater or less than a
non-NULL value is implementation-defined, but...
... There are SQL products that do it either way.
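Where the dialect supports it (e.g. PostgreSQL and Oracle; this is not available everywhere), you can pin the behavior down explicitly (reusing the hypothetical G from the sketch above):

SELECT x FROM G ORDER BY x ASC NULLS LAST;   -- NULLs sort after the regular values
SELECT x FROM G ORDER BY x ASC NULLS FIRST;  -- NULLs sort before them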
In March 1999, Chris Farrar brought up a question from one of his
developers that caused him to examine a part of the SQL Standard that
I thought I understood. Chris found some differences between the
general understanding and the actual wording of the specification.
And so on. I think that is enough from Celko.
C. J. Date on SQL NULLs
C. J. Date is more radical about NULLs: avoid NULLs in SQL, period.
In fact, chapter 4 of his SQL and Relational Theory: How to Write Accurate
SQL Code is titled "NO DUPLICATES, NO NULLS", with subchapters
"4.4 What's Wrong with Nulls?" and "4.5 Avoiding Nulls in SQL" (follow the link:
thanks to Google Books, you can read some pages on-line).
Fabian Pascal on SQL NULLs
From his Practical Issues in Database Management - A Reference
for the Thinking Practitioner (no excerpts on-line, sorry):
10.3 Practical Implications
10.3.1 SQL NULLs
... SQL suffers from the problems inherent in 3VL as well as from many
quirks, complications, counterintuitiveness, and outright errors [10, 11];
among them are the following:
Aggregate functions (e.g., SUM(), AVG()) ignore NULLs (except for COUNT()).
A scalar expression on a table without rows evaluates incorrectly to NULL, instead of 0.
The expression "NULL = NULL" evaluates to NULL, but is actually invalid in SQL; yet ORDER BY treats NULLs as equal (whatever they precede or follow "regular" values is left to DBMS vendor).
The expression "x IS NOT NULL" is not equal to "NOT(x IS NULL)", as is the case in 2VL.
...
All commercially implemented SQL dialects follow this 3VL approach, and thus
not only do they exhibit these problems, but they also have specific implementation
problems, which vary across products.
The answers here all seem to come from a CS perspective so I want to add one from a developer perspective.
For a developer NULL is very useful. The answers here say NULL means unknown, and maybe in CS theory that's true - I don't remember, it's been a while. In actual development though, at least in my experience, that happens about 1% of the time. The other 99% it is used for cases where the value is not UNKNOWN but KNOWN TO BE ABSENT.
For example:
Client.LastPurchase, for a new client. It is not unknown; it is known that they haven't made a purchase yet.
When using an ORM with a Table per Class Hierarchy mapping, some values are just not mapped for certain classes.
When mapping a tree structure, the root will usually have Parent = NULL (see the sketch after this list)
And many more...
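For the tree case, a minimal sketch (table and column names are hypothetical):

CREATE TABLE Node (
    id        INT PRIMARY KEY,
    parent_id INT NULL REFERENCES Node(id)  -- NULL = "known to have no parent", not "parent unknown"
);
INSERT INTO Node VALUES (1, NULL);  -- the root
INSERT INTO Node VALUES (2, 1);     -- a child of the root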
I'm sure most developers at some point wrote WHERE value = NULL,
didn't get any results, and that's how they learned about the IS NULL syntax. Just look how many votes this question and the linked ones have.
SQL databases are a tool, and they should be designed in the way that is easiest for their users to understand.
Just because you don't know what two things are does not mean they're equal. If when you think of NULL you think of "NULL" (the string), then you probably want a different test of equality, like PostgreSQL's IS DISTINCT FROM and IS NOT DISTINCT FROM.
From the PostgreSQL docs on "Comparison Functions and Operators"
expression IS DISTINCT FROM expression
expression IS NOT DISTINCT FROM expression
For non-null inputs, IS DISTINCT FROM is the same as the <> operator. However, if both inputs are null it returns false, and if only one input is null it returns true. Similarly, IS NOT DISTINCT FROM is identical to = for non-null inputs, but it returns true when both inputs are null, and false when only one input is null. Thus, these constructs effectively act as though null were a normal data value, rather than "unknown".
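A quick demonstration (PostgreSQL):

SELECT NULL = NULL;                     -- null (unknown)
SELECT NULL IS NOT DISTINCT FROM NULL;  -- true
SELECT 1 IS DISTINCT FROM NULL;         -- true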
Maybe it depends, but I thought NULL=NULL evaluates to NULL like most operations with NULL as an operand.
At TechNet there is a good explanation of how null values work.
Null means unknown.
Therefore the Boolean expression
value=null
does not evaluate to false, it evaluates to null, but if that is the final result of a where clause, then nothing is returned. That is a practical way to do it, since returning null would be difficult to conceive.
It is interesting and very important to understand the following:
If in a query we have
where (value = @param Or @param is null) And id = @anotherParam
and
value = 1
@param is null
id = 123
@anotherParam = 123
then
"value = @param" evaluates to null
"@param is null" evaluates to true
"id = @anotherParam" evaluates to true
So the expression to be evaluated becomes
(null Or true) And true
We might be tempted to think that "null Or true" here will evaluate to null, and thus the whole expression will become null and the row will not be returned.
This is not so. Why?
Because "null Or true" evaluates to true, which is very logical: if one operand of the Or operator is true, then no matter the value of the other operand, the operation returns true. Thus it does not matter that the other operand is unknown (null).
So we finally have true And true, and thus the row will be returned.
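Putting the whole pattern together (T-SQL syntax; the MyRows table is hypothetical):

DECLARE @param INT = NULL, @anotherParam INT = 123;

SELECT *
FROM MyRows
WHERE (value = @param OR @param IS NULL)  -- null Or true -> true
  AND id = @anotherParam;                 -- true And true -> row returned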
Note: with the same crystal clear logic that "null Or true" evaluates to true, "null And true" evaluates to null.
Update:
Ok, just to make it complete I want to add the rest here too which turns out quite fun in relation to the above.
"null Or false" evaluates to null, "null And false" evaluates to false. :)
The logic is of course still as self-evident as before.
MSDN has a nice descriptive article on nulls and the three-valued logic that they engender.
In short, the SQL-92 spec defines NULL as unknown, and NULL used with the following operators causes unexpected results for the uninitiated:
=     | NULL  | true  | false
------+-------+-------+------
NULL  | NULL  | NULL  | NULL
true  | NULL  | true  | false
false | NULL  | false | true

AND   | NULL  | true  | false
------+-------+-------+------
NULL  | NULL  | NULL  | false
true  | NULL  | true  | false
false | false | false | false

OR    | NULL  | true  | false
------+-------+-------+------
NULL  | NULL  | true  | NULL
true  | true  | true  | true
false | NULL  | true  | false
The concept of NULL is questionable, to say the least. Codd introduced the relational model and the concept of NULL in context (and went on to propose more than one kind of NULL!). However, relational theory has evolved since Codd's original writings: some of his proposals have since been dropped (e.g. primary key) and others never caught on (e.g. theta operators). In modern relational theory (truly relational theory, I should stress) NULL simply does not exist. See The Third Manifesto: http://www.thethirdmanifesto.com/
The SQL language suffers from the problem of backward compatibility. NULL found its way into SQL and we are stuck with it. Arguably, the implementation of NULL in SQL is flawed (SQL Server's implementation makes things even more complicated due to its ANSI_NULLS option).
I recommend avoiding the use of NULLable columns in base tables.
Although perhaps I shouldn't be tempted, I just wanted to assert some corrections of my own about how NULL works in SQL:
NULL = NULL evaluates to UNKNOWN.
UNKNOWN is a logical value.
NULL is a data value.
This is easy to prove e.g.
SELECT NULL = NULL
correctly generates an error in SQL Server. If the result were a data value then we would expect to see NULL, as some answers here (wrongly) suggest we would.
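Since T-SQL has no boolean SELECT, you can still observe the logical value with a CASE expression (a sketch; assumes ANSI_NULLS ON):

SELECT CASE
         WHEN NULL = NULL       THEN 'TRUE'
         WHEN NOT (NULL = NULL) THEN 'FALSE'
         ELSE 'UNKNOWN'
       END;  -- prints UNKNOWN: neither branch's condition evaluates to TRUE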
The logical value UNKNOWN is treated differently in SQL DDL and SQL DML respectively.
In SQL DDL (constraints), UNKNOWN gets the benefit of the doubt and the constraint is considered satisfied; in SQL DML, UNKNOWN causes rows to be removed from the resultset.
For example:
CREATE TABLE MyTable
(
key_col INTEGER NOT NULL UNIQUE,
data_col INTEGER
CHECK (data_col = 55)
);
INSERT INTO MyTable (key_col, data_col)
VALUES (1, NULL);
The INSERT succeeds for this row, even though the CHECK condition (data_col = 55) resolves to UNKNOWN for a NULL data_col. This is due to how it is defined in the SQL-92 ("ANSI") Standard:
11.6 table constraint definition
3)
If the table constraint is a check
constraint definition, then let SC be
the search condition immediately
contained in the check constraint
definition and let T be the table name
included in the corresponding table
constraint descriptor; the table
constraint is not satisfied if and
only if
EXISTS ( SELECT * FROM T WHERE NOT
( SC ) )
is true.
Read that again carefully, following the logic.
In plain English, our new row above is given the 'benefit of the doubt' about being UNKNOWN and allowed to pass.
In SQL DML, the rule for the WHERE clause is much easier to follow:
The search condition is applied to
each row of T. The result of the where
clause is a table of those rows of T
for which the result of the search
condition is true.
In plain English, rows that evaluate to UNKNOWN are removed from the resultset.
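The contrast is easy to see with the MyTable example above: the same predicate lets the row in through the DDL door and keeps it out at the DML door.

INSERT INTO MyTable (key_col, data_col)
VALUES (2, NULL);    -- succeeds: CHECK (data_col = 55) is UNKNOWN, not FALSE

SELECT * FROM MyTable
WHERE data_col = 55; -- returns nothing: UNKNOWN rows are filtered out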
Because NULL means 'unknown value', two unknown values cannot be said to be equal.
So, if in our logic NULL N°1 is equal to NULL N°2, then we have to express that somehow:
SELECT 1
WHERE ISNULL(nullParam1, -1) = ISNULL(nullParam2, -1)
where the known value -1 N°1 is equal to the known value -1 N°2. (Beware: this sentinel trick only works if -1 cannot occur as a real value in the data.)
NULL isn't equal to anything, not even itself. My personal solution to understanding the behavior of NULL is to avoid using it as much as possible :).
The question:
Does one unknown equal another unknown?
(NULL = NULL)
That question is something no one can answer, so the comparison defaults to true or false depending on your ANSI_NULLS setting.
However the question:
Is this unknown variable unknown?
This question is quite different and can be answered with true.
nullVariable = null is comparing the values
nullVariable is null is comparing the state of the variable
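In T-SQL terms (a sketch; assumes ANSI_NULLS ON):

DECLARE @nullVariable INT;  -- initialized to NULL by default

IF @nullVariable IS NULL PRINT 'state is NULL';        -- prints
IF @nullVariable = NULL  PRINT 'value comparison hit'; -- never prints: the comparison is UNKNOWN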
The confusion arises from the level of indirection (abstraction) that comes about from using NULL.
Going back to the "what's under the Christmas tree" analogy, "Unknown" describes the state of knowledge about what is in Box A.
So if you don't know what's in Box A, you say it's "Unknown", but that doesn't mean that "Unknown" is inside the box. Something other than unknown is in the box, possibly some kind of object, or possibly nothing is in the box.
Similarly, if you don't know what's in Box B, you can label your state of knowledge about the contents as being "Unknown".
So here's the kicker: Your state of knowledge about Box A is equal to your state of knowledge about Box B. (Your state of knowledge in both cases is "Unknown" or "I don't know what's in the Box".) But the contents of the boxes may or may not be equal.
Going back to SQL, ideally you should only be able to compare values when you know what they are. Unfortunately, the label that describes a lack of knowledge is stored in the cell itself, so we're tempted to use it as a value. But we should not use it as a value, because that would lead to "the content of Box A equals the content of Box B when we don't know what's in Box A and/or we don't know what's in Box B".
(Logically, the implication "if I don't know what's in Box A and if I don't know what's in Box B, then what's in Box A = What's in Box B" is false.)
Yay, Dead Horse.
There are two sensible ways to handle NULL = NULL comparisons in a WHERE clause, and they boil down to "What do you mean by NULL?" One way assumes NULL means "unknown," and the other assumes NULL means "data does not exist." SQL has chosen a third way which is wrong all around.
The "NULL means unknown" solution: Throw an error.
Unknown = unknown should evaluate to 3VL null. But the output of a WHERE clause is 2VL: You either return the row or you don't. It's like being asked to divide by zero and return a number: There is no correct response. So you throw an error instead, and force the programmer to explicitly handle this situation.
The "NULL means no data" solution: Return the row.
No data = no data should evaluate to true. If I'm comparing two people, and they have the same first name, and the same last name, and neither has a middle name, then it is correct to say "These people have the same name."
The SQL solution: Don't return the row.
This is always wrong. If NULL means "unknown," then you don't know if the row should be returned or not, and you should not try to guess. If NULL means "no data," then you should return the row. Either way, silently removing the row is incorrect and will cause problems. It's the worst of both worlds.
Setting aside theory and speaking in practical terms, I'm with AlexDev: I have almost never encountered a case where "return the row" was not the desired result. However, "almost never" is not "never," and SQL databases often serve as the backbones of big important systems, so I can see a fair case for being rigorous and throwing an error.
What I cannot see is a case for silently coercing 3VL null into 2VL false. Like most silent type coercions, it's a rabid weasel waiting to be set loose in your system, and when the weasel finally jumps out and bites someone, you'll have the merry devil of a time tracking it back to its nest.
null is unknown in SQL, so we can't expect two unknowns to be equal.
However, you can get that behavior by setting ANSI_NULLS to OFF (it is ON by default).
You will then be able to use the = operator for NULLs:
SET ANSI_NULLS OFF
IF NULL = NULL
    PRINT 1
ELSE
    PRINT 2

SET ANSI_NULLS ON
IF NULL = NULL
    PRINT 1
ELSE
    PRINT 2
You work for the government registering information about citizens, including the national ID of every person in the country. A child was left at the door of a church some 40 years ago; nobody knows who their parents are. This person's father ID is NULL. Two such people exist. Now count the people who share the same father ID with at least one other person (people who are siblings). Do you count those two as well?
The answer is no, you don't, because we don't know whether they are siblings or not.
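A sketch of the sibling count (the Citizens table is hypothetical); the equality in the join is UNKNOWN whenever a father ID is NULL, so the two foundlings are, correctly, left out:

SELECT DISTINCT a.id
FROM Citizens a
JOIN Citizens b
  ON a.father_id = b.father_id  -- UNKNOWN when either side is NULL
 AND a.id <> b.id;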
Suppose you don’t have a NULL option, and instead use some pre-determined value to represent “the unknown”, perhaps an empty string or the number 0 or a * character, etc. Then you would have in your queries that * = *, 0 = 0, and “” = “”, etc. This is not what you want (as per the example above), and as you might often forget about these cases (the example above is a clear fringe case outside ordinary everyday thinking), then you need the language to remember for you that NULL = NULL is not true.
Necessity is the mother of invention.
Just an addition to other wonderful answers:
AND: The result of true and unknown is unknown, false and unknown is false,
while unknown and unknown is unknown.
OR: The result of true or unknown is true, false or unknown is unknown, while unknown or unknown is unknown.
NOT: The result of not unknown is unknown.
If you are looking for an expression returning true for two NULLs you can use:
SELECT 1
WHERE EXISTS (
SELECT NULL
INTERSECT
SELECT NULL
)
It is helpful if you want to replicate data from one table to another.
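For example, a NULL-safe change detector between two tables (src and dst are hypothetical; the pattern works in SQL Server and PostgreSQL):

SELECT s.id
FROM src s
JOIN dst d ON s.id = d.id
WHERE NOT EXISTS (SELECT s.val INTERSECT SELECT d.val);  -- TRUE when the values differ,
                                                         -- treating two NULLs as "equal"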
The equality test, for example in a CASE expression's WHEN clause, can be changed from
XYZ = NULL
to
XYZ IS NULL
If I want to treat blanks and empty strings as equal to NULL, I often also use an equality test like:
(NULLIF(ltrim( XYZ ),'') IS NULL)
To quote the Christmas analogy again:
In SQL, NULL basically means "closed box" (unknown). So, the result of comparing two closed boxes will also be unknown (null).
I understand that, for a developer, this is counter-intuitive, because in programming languages NULL often means "empty box" (known) instead. And comparing two empty boxes will naturally yield true / equal.
This is why JavaScript for example distinguishes between null and undefined.
Null isn't equal to anything, not even itself.
One consequence you can use as a probe: an equality comparison of a value with itself fails to come out TRUE exactly when the value is NULL. A sketch (hypothetical table t with column x):
SELECT CASE WHEN x = x THEN 'x has a value' ELSE 'x is NULL' END
FROM t;  -- NULL = NULL is UNKNOWN, so NULL rows fall through to the ELSE branch
In practice, of course, just write x IS NULL.
I may be erring towards pedantry here, but say I have a field in a database that currently has two values (but may contain more in the future). I know I could name it as a flag (e.g. MY_FLAG) containing values 0 and 1, but should more values be required (e.g. 0,1,2,3,4), would it still be correct to call the field a flag?
I seem to recall reading somewhere that a flag should always be binary, and that anything else should be labelled more appropriately, but I may be mistaken. Does anyone know if my thinking is correct? If so, can you point me to any information on this, please? My googling has turned up nothing!
Thanks very much :o)
Flags are usually binary: when we say "flag", we mean something that is either up (1) or down (0).
The concept comes from the military use of flags, raised and lowered to signal.
Regarding what you are saying ("values be required (e.g. 0,1,2,3,4)"):
in such a situation, use an enum. Enumerations are built for exactly such cases. Alternatively, we sometimes document the meaning of the numeric values in comments or in a separate file so that memory can be saved (using a tinyint or bit field). But never call such a field a flag.
Flags have a standard meaning: up or down. Misnaming won't cause an error or anything, but it is not good practice. Hope you get it.
It's all a matter of conventions and the ability to maintain your database/code effectively. Technically, you can have a column called my_flag defined as a varchar and holding values like "batman" and "barack obama".
By convention, flags are boolean. If you intend to have other values there, it's probably a better idea to call the column something else, like some_enum, or my_code.
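For example, a sketch (all names hypothetical, T-SQL-flavored) of renaming the column and constraining its values, rather than stretching "flag":

CREATE TABLE Orders (
    id          INT PRIMARY KEY,
    status_code TINYINT NOT NULL
        CHECK (status_code IN (0, 1, 2, 3, 4))  -- a poor man's enum
);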
Very occasionally, people talk about (for example) tri-state flags, but Wikipedia and most of the dictionary definitions that I read reserve "flag" for binary / two state uses1.
Of course, neither Wikipedia nor any dictionary has the authority to say some English usage is "incorrect". "Correct" usage is really "conventional" usage; i.e. what other people say / write.
I would argue that saying or writing "tri-state flag" is unconventional, but it is unambiguous and serves its purpose of communicating a concept adequately. (And the usage can be justified ...)
1 - Most, but not all; see http://www.oxforddictionaries.com/definition/english/flag.
Don't call anything "flag". Or "count" or "mark" or "int" or "code". Name it like everything else in code: after what it means.
workday {mon..fri}
tall {yes,no}
zip_code {00000..99999}
state {AL..WY}
Notice that (something like) yes/no plays the 'flag' role of indicating a permanent dichotomy (in lieu of boolean, which does that in the rest of the universe outside SQL). Use it when the specification/contract really is about whether something is so. If a design might add more values, you should use a different type.
Of course, if you want to add more information to a name, you can. Add distinctions that are meaningful where you can.
workday {monday..friday}
workday_abbrev {mon..fri}
is_tall {yes,no}
zip_plus_5 {00000-99..99999-99}
state_name {Alabama..Wyoming}
state_2 {AL..WY}
I am writing a timetable generator in Java, using AI approaches to satisfy the hard constraints and to help find an optimal solution. So far I have implemented iterative construction (a most-constrained-first heuristic) and simulated annealing, and I'm in the process of implementing a genetic algorithm.
Some info on the problem, and how I represent it:
I have a set of events, rooms, features (that events require and rooms satisfy), students and slots.
The problem consists of assigning to each event a slot and a room, such that no student is required to attend two events in one slot, and all the rooms assigned fulfill the necessary requirements.
I have a grading function that, for each set of assignments, grades the soft constraint violations; the point is thus to minimize this.
The way I am implementing the GA is: start with a population generated by the iterative construction (which can leave events unassigned) and then do the normal steps: evaluate, select, cross, mutate and keep the best. Rinse and repeat.
My problem is that the solution appears to improve too little. No matter what I do, the population tends towards some random fitness and gets stuck there. Note that this fitness always differs, but nevertheless a lower limit will appear.
I suspect that the problem is in my crossover function, and here is the logic behind it:
Two assignments are randomly chosen to be crossed. Let's call them assignments A and B. For all of B's events, do the following procedure (the order in which B's events are selected is random):
Get the corresponding event in A and compare the assignment. 3 different situations might happen.
If only one of them is unassigned and if it is possible to replicate
the other assignment on the child, this assignment is chosen.
If both of them are assigned, but only one of them creates no
conflicts when assigning to the child, that one is chosen.
If both of them are assigned and neither creates a conflict, one of
them is randomly chosen.
In any other case, the event is left unassigned.
This creates a child with some of one parent's assignments and some of the other's, so it seems to me it is a valid function. Moreover, it does not break any hard constraints.
As for mutation, I am using the neighboring function of my SA to give me another assignment based on one of the children, and then replacing that child.
So again: with this setup and an initial population of 100, the GA runs and always tends to stabilize at some random (high) fitness value. Can someone give me a pointer as to what I could possibly be doing wrong?
Thanks
Edit: formatting, and clarified some things
I think a GA only makes sense if part of the solution (part of the vector) has significance as a stand-alone part of the solution, so that the crossover function integrates valid parts of a solution between two solution vectors. Much like a certain part of a DNA sequence controls or affects a specific aspect of the individual - eye color is one gene, for example. In this problem, however, the different parts of the solution vector affect each other, making the crossover almost meaningless. This results (my guess) in the algorithm converging on a single solution rather quickly, with the different crossovers and mutations having only a negative effect on the fitness.
I don't believe a GA is the right tool for this problem.
If you could please provide the original problem statement, I will be able to give you a better solution. Here is my answer for the present moment.
A genetic algorithm is not the best tool to satisfy hard constraints. This is an assignment problem that can be solved using an integer program, a special case of a linear program.
Linear programs allow users to minimize or maximize some goal modeled by an objective function (your grading function). The objective function is a sum of decision variables weighted by their contribution to the objective. Linear programs allow the decision variables to take fractional values, while integer programs force the decision variables to be integers.
So, what are your decisions? Your decisions are to assign students to slots. And these slots have features which events require and rooms satisfy.
In your case, you want to maximize the number of students that are assigned to a slot.
You also have constraints. In your case, a student may only attend at most one event.
The website below provides a good tutorial on how to model integer programs.
http://people.brunel.ac.uk/~mastjjb/jeb/or/moreip.html
For a java specific implementation, use the link below.
http://javailp.sourceforge.net/
SolverFactory factory = new SolverFactoryLpSolve(); // use lp_solve
factory.setParameter(Solver.VERBOSE, 0);
factory.setParameter(Solver.TIMEOUT, 100); // set timeout to 100 seconds
/**
* Constructing a Problem:
* Maximize: 143x+60y
* Subject to:
* 120x+210y <= 15000
* 110x+30y <= 4000
* x+y <= 75
*
* With x,y being integers
*
*/
Problem problem = new Problem();
Linear linear = new Linear();
linear.add(143, "x");
linear.add(60, "y");
problem.setObjective(linear, OptType.MAX);
linear = new Linear();
linear.add(120, "x");
linear.add(210, "y");
problem.add(linear, "<=", 15000);
linear = new Linear();
linear.add(110, "x");
linear.add(30, "y");
problem.add(linear, "<=", 4000);
linear = new Linear();
linear.add(1, "x");
linear.add(1, "y");
problem.add(linear, "<=", 75);
problem.setVarType("x", Integer.class);
problem.setVarType("y", Integer.class);
Solver solver = factory.get(); // you should use this solver only once for one problem
Result result = solver.solve(problem);
System.out.println(result);
/**
* Extend the problem with x <= 16 and solve it again
*/
problem.setVarUpperBound("x", 16);
solver = factory.get();
result = solver.solve(problem);
System.out.println(result);
// Results in the following output:
// Objective: 6266.0 {y=52, x=22}
// Objective: 5828.0 {y=59, x=16}
I would start by measuring what's going on directly. For example, what fraction of the assignments are falling under your "any other case" catch-all and therefore doing nothing?
Also, while we can't really tell from the information given, it doesn't seem any of your moves can do a "swap", which may be a problem. If a schedule is tightly constrained, then once you find something feasible, it's likely that you won't be able to just move a class from room A to room B, as room B will be in use. You'd need to consider ways of moving a class from A to B along with moving a class from B to A.
You can also sometimes improve things by allowing constraints to be violated. Instead of forbidding crossover from ever violating a constraint, you can allow it, but penalize the fitness in proportion to the "badness" of the violation.
Finally, it's possible that your other operators are the problem as well. If your selection and replacement operators are too aggressive, you can converge very quickly to something that's only slightly better than where you started. Once you converge, it's very difficult for mutations alone to kick you back out into a productive search.
I think there is nothing wrong with using a GA for this problem; some people just hate genetic algorithms no matter what.
Here is what I would check:
First, you mention that your GA stabilizes at a random "high" fitness value, but isn't this a good thing? Does "high" fitness correspond to good or bad in your case? It is possible you are favoring "high" fitness in one part of your code and "low" fitness in another, thus causing the seemingly random result.
I think you want to be a bit more careful about the logic behind your crossover operation. Basically, there are many situations in all 3 cases where making any of those choices would not increase the fitness of the crossed-over individual at all, yet you are still using up a "resource" (an assignment that could potentially be used for another class/student/etc.). I realize that a GA will traditionally make assignments via crossover that cause worse behavior, but you are already performing a bit of computation in the crossover phase anyway - why not choose one that actually will improve fitness, or perhaps not cross at all?
Optional comment to consider: although your iterative construction approach is quite interesting, it may give you an overly complex gene representation that is causing problems with your crossover. Is it possible to model a single individual solution as an array (or 2D array) of bits or integers? Even if the array turns out to be very long, it may be worth it to use a simpler crossover procedure. I recommend Googling "ga gene representation time tabling"; you may find an approach that you like more and that can more easily scale to many individuals (100 is a rather small population size for a GA, but I understand you are still testing; also, how many generations?).
One final note: I am not sure what language you are working in, but if it is Java and you don't NEED to code the GA by hand, I would recommend taking a look at ECJ. Even if you do have to code it by hand, ECJ could help you develop your representation or breeding pipeline.
Newcomers to GA can make any of a number of standard mistakes:
In general, when doing crossover, make sure that the child has some chance of inheriting that which made the parent or parents winner(s) in the first place. In other words, choose a genome representation where the "gene" fragments of the genome have meaningful mappings to the problem statement. A common mistake is to encode everything as a bitvector and then, in crossover, to split the bitvector at random places, splitting up the good thing the bitvector represented and thereby destroying the thing that made the individual float to the top as a good candidate. A vector of (limited) integers is likely to be a better choice, where integers can be replaced by mutation but not by crossover. Not preserving something (doesn't have to be 100%, but it has to be some aspect) of what made parents winners means you are essentially doing random search, which will perform no better than linear search.
In general, use much less mutation than you might think. Mutation is there mainly to keep some diversity in the population. If your initial population doesn't contain anything with a fractional advantage, then your population is too small for the problem at hand and a high mutation rate will, in general, not help.
In this specific case, your crossover function is too complicated. Do not ever put constraints aimed at keeping all solutions valid into the crossover. Instead the crossover function should be free to generate invalid solutions and it is the job of the goal function to somewhat (not totally) penalize the invalid solutions. If your GA works, then the final answers will not contain any invalid assignments, provided 100% valid assignments are at all possible. Insisting on validity in the crossover prevents valid solutions from taking shortcuts through invalid solutions to other and better valid solutions.
I would recommend anyone who thinks they have written a poorly performing GA to conduct the following test: Run the GA a few times, and note the number of generations it took to reach an acceptable result. Then replace the winner selection step and goal function (whatever you use - tournament, ranking, etc) with a random choice, and run it again. If you still converge roughly at the same speed as with the real evaluator/goal function then you didn't actually have a functioning GA. Many people who say GAs don't work have made some mistake in their code which means the GA converges as slowly as random search which is enough to turn anyone off from the technique.
A user has to complete ten steps to achieve a desired result. The ten steps can be completed in any order.
If there is a bug, the bug is dependent only on the steps that have been taken, not the order in which they were taken (i.e., the bug is path independent).
For example: If the user performs three steps in the order 10, 1, 2 and produces a bug the exact same bug will be produced if the user performs the same three steps in the order 1, 2, 10.
What is the maximum number of unique bugs this program can have?
You mean: what is the number of distinct sets pickable from 10 elements? That's a power set: 2**10 = 1024.
Much later: some knowledgeable others have suggested that having no bugs should not be counted as a bug. Accordingly, I revise my count: 2**10 - 1 = 1023.
hughdbrown's answer is correct, but there is another possible interpretation of the question. Suppose that a sequence of operations can never produce more than one bug (i.e. it should just be counted as one bug). For example, if the steps {3,6,2} produce a bug, then you shouldn't be allowed to count {3,6,2,5} as another bug. In that case, rather than finding the maximum possible number of subsets of {1,2,...,10}, you want the maximum number of subsets such that none contains another. The answer to this version of the question is "10 choose 5" = 252.
Edit: by the way, the result that says this is maximal is called Sperner's Theorem.
It depends entirely on how many ways there are of doing each step. If you have a process that involves only one step, but there are multiple ways of doing that step, every way could have an associated bug.
There's also the misuse of functions, which you can't prevent, and which could be considered a bug. E.g.:
If a user were to think that
rm -rf /
was short for
remove media --really fast /
ie: eject all devices1
I would guess that would be a potential bug. It's user error really, but it's still a singular thing that can occur and that produces results other than what was wanted.
You could argue the above is a bit over the top, but ultimately there is no limit to the number of ways users can do things wrong.
When users are involved, assume that anything that can go wrong will.
The only problem with the above reasoning is that you have to preemptively remove powerful things so users don't hurt themselves, which leads to less effective tools for those who know how to use them - a corks-on-forks sort of rationale.
The only way to solve this concern effectively is to give newbies blunt objects to learn with, and then give them an option that takes away all the foam padding once they learn the ropes, so experienced users don't have to keep working with blunt tools and don't have to de-blunt every tool themselves.
(If there are infinitely many ways to do one step, I don't even want to begin to think of the number of ways to do 10 steps wrong.)
1: If you don't know, this will erase lots of your hard drive and cause much pain. Don't do it.
One, a designer fault? :)