How are NULLs stored in a database?

I'm curious to know how NULLs are stored in a database.
It surely depends on the database server, but I would like to have a general idea of it.
First try:
Suppose that the server puts an undefined value (could be anything) into the field for a NULL value.
Could you be very lucky and retrieve the NULL value with
...WHERE field = 'the undefined value (remember, could be anything...)'
Second try:
Does the server have a flag or any metadata somewhere to indicate that this field is NULL?
Then the server must read this metadata to check the field.
If the meta-data indicates a NULL value and if the query doesn't have "field IS NULL",
then the record is ignored.
It seems too easy...

MySQL uses the second method. It stores an array of bits (one per column) with the data for each row to indicate which columns are null, and then leaves the data for that field blank. I'm pretty sure this is true for all other databases as well.
The problem with the first method is, are you sure that whatever value you select for your data won't show up as valid data? For some values (like dates, or floating point numbers) this is true. For others (like integers) this is false.
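Whichever storage scheme is used, the comparison in the question's "first try" can never match a stored NULL at the SQL level, because any comparison involving NULL evaluates to UNKNOWN rather than TRUE. A minimal sketch (t is a hypothetical table):
-- NULL never compares equal to anything, not even to itself
CREATE TABLE t (c VARCHAR(10));
INSERT INTO t (c) VALUES (NULL);
SELECT * FROM t WHERE c = c;      -- returns no rows
SELECT * FROM t WHERE c IS NULL;  -- returns the row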

On PostgreSQL, it uses an optional bitmap with one bit per column (0 is null, 1 is not null). If the bitmap is not present, all columns are not null.
This is completely separate from the storage of the data itself, but is on the same page as the row (so both the row and the bitmap are read together).
References:
http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html
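If you want to see that bitmap directly, the pageinspect contrib extension exposes it per row; a sketch, assuming the extension is installed and a table named t exists:
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- t_bits is the per-row null bitmap; it is NULL when every column in the row is not null
SELECT lp, t_bits
FROM heap_page_items(get_raw_page('t', 0));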

The server typically uses meta information rather than a magic value. So there's a bit somewhere that specifies whether the field is null.
-Adam

IBM Informix Dynamic Server uses special values to indicate nulls. For example, the valid range of values for a SMALLINT (16-bit, signed) is -32767..+32767. The other value, -32768, is reserved to indicate NULL. Similarly for INTEGER (4-byte, signed) and BIGINT (8-byte, signed). For other types, it uses other special representations (for example, all bits 1 for SQL FLOAT and SMALLFLOAT - aka C double and float, respectively). This means that it doesn't have to use extra space.
IBM DB2 for Linux, Unix, Windows uses extra bytes to store the null indicators; AFAIK, it uses a separate byte for each nullable field, but I could be wrong on that detail.
So, as was pointed out, the mechanisms differ depending on the DBMS.

The problem with special values to indicate NULL is that sooner or later that special value will be inserted as real data. For example, it will be inserted into a table listing the special NULL indicator values for different database servers:
| DBServer     | SpecialValue |
+--------------+--------------+
| 'Oracle'     | 'Glyph'      |
| 'SQL Server' | 'Redmond'    |
;-)

Related

ORA-22835: Buffer too small and ORA-25137: Data value out of range

We are using software that has limited Oracle capabilities. I need to filter through a CLOB field by making sure it has a specific value. Normally, outside of this software, I would do something like:
DBMS_LOB.SUBSTR(t.new_value) = 'Y'
However, this isn't supported, so I'm attempting to use CAST instead. I've tried many different approaches, but so far this is what I've found:
The software has a built-in query checker/validator and these are the ones it shows as invalid:
DBMS_LOB.SUBSTR(t.new_value)
CAST(t.new_value AS VARCHAR2(10))
CAST(t.new_value AS NVARCHAR2(10))
However, the validator does accept these:
CAST(t.new_value AS VARCHAR(10))
CAST(t.new_value AS NVARCHAR(10))
CAST(t.new_value AS CHAR(10))
Unfortunately, even though the validator lets these ones go through, when running the query to fetch data, I get ORA-22835: Buffer too small when using VARCHAR or NVARCHAR. And I get ORA-25137: Data value out of range when using CHAR.
Are there other ways I could try to check that my CLOB field has a specific value when filtering the data? If not, how do I fix my current issues?
The error you're getting indicates that Oracle is trying to apply the CAST(t.new_value AS VARCHAR(10)) to a row where new_value has more than 10 characters. That makes sense given your description that new_value is a generic audit field that has values from a large number of different tables with a variety of data lengths. Given that, you'd need to structure the query in a way that forces the optimizer to reduce the set of rows you're applying the cast to down to just those where new_value has just a single character before applying the cast.
Not knowing what sort of scope the software you're using provides for structuring your code, I'm not sure what options you have there. Be aware that depending on how robust you need this, the optimizer has quite a bit of flexibility to choose to apply predicates and functions on the projection in an arbitrary order. So even if you find an approach that works once, it may stop working in the future when statistics change or the database is upgraded and Oracle decides to choose a different plan.
Using this as sample data
create table tab1(col clob);
insert into tab1(col) values (rpad('x',3000,'y'));
You need to use dbms_lob.substr(col,1) to get the first character (from the default offset= 1)
select dbms_lob.substr(col,1) from tab1;
DBMS_LOB.SUBSTR(COL,1)
----------------------
x
Note that the default amount (= length) of the substring is 32767, so using only DBMS_LOB.SUBSTR(COL) will return more than you expect.
CAST for CLOB does not cut the string to the casted length, but (as you observed) raises the exception ORA-25137: Data value out of range if the original string is longer than the casted length.
As documented for the CAST statement
CAST does not directly support any of the LOB data types. When you use CAST to convert a CLOB value into a character data type or a BLOB value into the RAW data type, the database implicitly converts the LOB value to character or raw data and then explicitly casts the resulting value into the target data type. If the resulting value is larger than the target type, then the database returns an error.
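Putting that together with the original filter, a sketch against the tab1 sample above (dbms_lob.substr takes the amount first and then the offset):
select *
from tab1
where dbms_lob.substr(col, 1, 1) = 'x';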

How is Unicode (UTF-16) data that is out of collation stored in varchar column?

This is purely theoretical question to wrap my head around
Let's say I have the Unicode cyclone symbol (🌀, U+1F300). If I try to store it in a varchar column that has the default Latin1_General_CI_AS collation, the cyclone symbol cannot fit into the single byte that is used per character in varchar...
The ways I can see this done:
Like JavaScript does for symbols outside the Basic Multilingual Plane (BMP), where it stores them as 2 symbols (surrogate pairs), and additional processing is then needed to put them back together...
Just truncate the symbol, store the first byte and drop the second... (data is toast - you should have read the manual...)
Data is destroyed and nothing of use is saved... (data is toast - you should have read the manual....)
Some other option that is outside of my mental capacity.....
I have done some research after inserting a couple of different Unicode symbols
INSERT INTO [Table] (Field1)
VALUES ('👽')
INSERT INTO [Table] (Field1)
VALUES ('🌀')
and then reading them back as bytes:
SELECT CAST(field1 AS varbinary(10))
In both cases I got 0x3F3F.
3F in ASCII is ? (question mark), i.e. two question marks (??), which is also what I see when doing a normal SELECT *. Does that mean the data is toast and not even the first byte is being stored?
How is Unicode data that is out of collation stored in varchar column?
The data is toast and is exactly what you see, 2 x 0x3F bytes. This happens during the type conversion prior to the insert and is effectively the same as cast('👽' as varbinary(2)), which is also 0x3F3F (as opposed to casting N'👽').
When Unicode data must be inserted into non-Unicode columns, the columns are internally converted from Unicode by using the WideCharToMultiByte API and the code page associated with the collation. If a character cannot be represented on the given code page, the character is replaced by a question mark (?) Ref.
Yes the data has gone.
Varchar requires less space, compared to NVarchar. But that reduction comes at a cost. There is no space for a Varchar to store Unicode characters (at 1 byte per character the internal lookup just isn't big enough).
From Microsoft's Developer Network:
...consider using the Unicode nchar or nvarchar data types to minimize character conversion issues.
As you've spotted, unsupported characters are replaced with question marks.
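For contrast, a minimal sketch of the Unicode-safe version, assuming a hypothetical nvarchar(10) column Field2 on the same table:
-- the N prefix keeps the literal as Unicode; nvarchar stores it as a UTF-16 surrogate pair
INSERT INTO [Table] (Field2) VALUES (N'🌀');
SELECT CAST(Field2 AS varbinary(10)) FROM [Table];  -- surrogate-pair bytes, not 0x3F3F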

Error in storing values in SQL database table

In my table, there is a column called zipcode whose datatype is int. When I store a zipcode which starts with 0 (e.g. 08872), it is stored as 8872.
Can anybody explain why this is happening?
An INT value is numeric - and numerically, 08872 and 8872 are identical - both represent the value 8872.
SQL Server will not store leading zeroes for numerical values. That's just the way it is.
Either store this as CHAR(5) instead, or handle the formatting (adding leading zeroes to your zip codes) on the frontend when you need to display it.
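If the column has to stay an INT, a sketch of the display-time formatting in T-SQL (addresses is a hypothetical table name):
SELECT RIGHT('00000' + CAST(zipcode AS VARCHAR(5)), 5) AS zipcode_display
FROM addresses;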

Migrating from sql server to Oracle varchar length issues

I'm facing a strange issue trying to move from SQL Server to Oracle.
In one of my tables I have a column defined as NVARCHAR(255).
After reading a bit I understood that SQL Server counts characters while Oracle counts bytes.
So I defined my column in Oracle as VARCHAR(510) (255 * 2 = 510).
But when using sqlldr to load the data from a tab-delimited text file I get an error indicating some entries exceeded the length of this column.
After checking in SQL Server using:
SELECT MAX(DATALENGTH(column))
FROM table
I get that the max data length is 510.
I do use the Hebrew_CI_AS collation, even though I don't think it changes anything...
I also checked in SQL Server whether any of the entries contain a TAB, but no... so I guess it's not corrupted data...
Anyone have an idea?
EDIT
After further checking I've noticed that the issue is due to the data file (in addition to the issue solved by @Justin Cave's post).
I have changed the row delimiter to '^' since none of my data contains this character, and '|^|' as the column delimiter,
creating a control file as follows:
load data
infile data.txt "str '^'"
badfile "data_BAD.txt"
discardfile "data_DSC.txt"
into table table
FIELDS TERMINATED BY '|^|' TRAILING NULLCOLS
(
col1,
col2,
col3,
col4,
col5,
col6
)
The problem is that my data contains <CR> and sqlldr, expecting a stream file, therefore fails on the <CR>! I do not want to change the data since it is textual data (error messages, for example).
What is your database character set?
SELECT parameter, value
FROM v$nls_parameters
WHERE parameter LIKE '%CHARACTERSET'
Assuming that your database character set is AL32UTF8, each character could require up to 4 bytes of storage (though almost every useful character can be represented with at most 3 bytes of storage). So you could declare your column as VARCHAR2(1020) to ensure that you have enough space.
You could also simply use character length semantics. If you declare your column VARCHAR2(255 CHAR), you'll allocate space for 255 characters regardless of the amount of space that requires. If you change the NLS_LENGTH_SEMANTICS initialization parameter from the default BYTE to CHAR, you'll change the default so that VARCHAR2(255) is interpreted as VARCHAR2(255 CHAR) rather than VARCHAR2(255 BYTE). Note that the 4000-byte limit on a VARCHAR2 remains even if you are using character length semantics.
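A sketch of both options (table and column names are illustrative):
-- explicit character semantics on the column
CREATE TABLE my_table (my_col VARCHAR2(255 CHAR));
-- or change the session default before running the DDL
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
CREATE TABLE my_other_table (my_col VARCHAR2(255));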
If your data contains line breaks, do you need the TRAILING NULLCOLS parameter? That implies that sometimes columns may be omitted from the end of a logical row. If you combine columns that may be omitted with columns that contain line breaks and data that is not enclosed by at least an optional enclosure character, it's not obvious to me how you would begin to identify where a logical row ended and where it began. If you don't actually need the TRAILING NULLCOLS parameter, you should be able to use the CONTINUEIF parameter to combine multiple physical rows into a single logical row. If you can change the data file format, I'd strongly suggest adding an optional enclosure character.
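If you can add an enclosure character on export, only the FIELDS line of the control file above needs to change; a sketch, assuming every field can be wrapped in double quotes:
FIELDS TERMINATED BY '|^|' OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS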
The number of bytes used by an NVARCHAR field is equal to two times the number of characters plus two (see http://msdn.microsoft.com/en-us/library/ms186939.aspx), so if you make your VARCHAR field 512 you may be OK. There's also some indication that some character sets use 4 bytes per character, but I've found no indication that Hebrew is one of those character sets.
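As a quick sanity check of that arithmetic on the SQL Server side, DATALENGTH counts 2 bytes per UTF-16 code unit for nvarchar values, which is why 255 characters report as 510:
SELECT DATALENGTH(N'abc')  -- 6: 3 characters x 2 bytes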

SQL Server: Null VS Empty String

How are NULL and empty varchar values stored in SQL Server? And in case I have no user entry for a string field on my UI, should I store a NULL or a ''?
There's a nice article here which discusses this point. Key things to take away are that there is no difference in table size; however, some users prefer to use an empty string as it can make queries easier, since there is no NULL check to do - you just check if the string is empty. Another thing to note is what NULL means in the context of a relational database: it means that the pointer to the character field is set to 0x00 in the row's header, therefore there is no data to access.
Update
There's a detailed article here which talks about what is actually happening on a row basis
Each row has a null bitmap for columns that allow nulls. If the row in that column is null then a bit in the bitmap is 1, else it's 0.
For variable size datatypes the actual size is 0 bytes.
For fixed size datatypes the actual size is the default datatype size in bytes, set to the default value (0 for numbers, '' for chars).
The result of DBCC PAGE shows that NULL and empty strings both take up zero bytes.
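A sketch of how to look at a page yourself (the database name and page number here are hypothetical; DBCC IND can list the pages a table actually uses):
DBCC TRACEON(3604);                     -- send DBCC PAGE output to the client
DBCC IND ('MyDatabase', 'MyTable', -1); -- lists the file:page pairs for the table
DBCC PAGE ('MyDatabase', 1, 200, 3);    -- database, file id, page id, print option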
Be careful with nulls and checking for inequality in sql server.
For example
select * from foo where bla <> 'something'
will NOT return records where bla is null, even though logically it should.
So the right way to check would be
select * from foo where isnull(bla,'') <> 'something'
Which of course people often forget and then get weird bugs.
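An equivalent check that avoids wrapping the column in a function (which keeps the predicate sargable if bla is indexed):
select * from foo where bla <> 'something' or bla is null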
The conceptual differences between NULL and "empty-string" are real and very important in database design, but often misunderstood and improperly applied - here's a short description of the two:
NULL - means that we do NOT know what the value is, it may exist, but it may not exist, we just don't know.
Empty-String - means we know what the value is and that it is nothing.
Here's a simple example:
Suppose you have a table with people's names including separate columns for first_name, middle_name, and last_name. In the scenario where first_name = 'John', last_name = 'Doe', and middle_name IS NULL, it means that we do not know what the middle name is, or if it even exists. Change that scenario such that middle_name = '' (i.e. empty-string), and it now means that we know that there is no middle name.
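A small sketch of how that distinction might be read back (people is a hypothetical table):
SELECT first_name, last_name,
       CASE
           WHEN middle_name IS NULL THEN 'unknown'
           WHEN middle_name = ''    THEN 'known to have no middle name'
           ELSE middle_name
       END AS middle_name_status
FROM people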
I once heard a SQL Server instructor promote making every character type column in a database required, and then assigning a DEFAULT VALUE to each of either '' (empty-string), or 'unknown'. In stating this, the instructor demonstrated he did not have a clear understanding of the difference between NULLs and empty-strings. Admittedly, the differences can seem confusing, but for me the above example helps to clarify the difference. Also, it is important to understand the difference when writing SQL code, and properly handle for NULLs as well as empty-strings.
An empty string is a string with zero length or no character.
Null is absence of data.
NULL values are stored separately in a special bitmap space for all the columns.
If you do not distinguish between NULL and '' in your application, then I would recommend you to store '' in your tables (unless the string column is a foreign key, in which case it would probably be better to prohibit the column from storing empty strings and allow the NULLs, if that is compatible with the logic of your application).
NULL is a non-value, like undefined. '' is an empty string with 0 characters.
The value of a string in the database depends on the value coming from your UI, but generally it's an empty string '' if you pass the parameter in your query or stored procedure.
If it's not a foreign key field, not using empty strings could save you some trouble. Only allow nulls if you'll take null to mean something different than an empty string. For example, if you have a password field, a null value could indicate that a new user has not created his password yet, while an empty varchar could indicate a blank password. For a field like "address2", allowing nulls can only make life difficult. Things to watch out for include null references and the unexpected results of the = and <> operators mentioned by Vagif Verdi, and watching out for these things is often unnecessary programmer overhead.
Edit: if performance is an issue, see this related question: Nullable vs. non-null varchar data types - which is faster for queries?
In terms of having something tell you, whether a value in a VARCHAR column has something or nothing, I've written a function which I use to decide for me.
CREATE FUNCTION [dbo].[ISNULLEMPTY](@X VARCHAR(MAX))
RETURNS BIT AS
BEGIN
    DECLARE @result AS BIT
    IF @X IS NOT NULL AND LEN(@X) > 0
        SET @result = 0
    ELSE
        SET @result = 1
    RETURN @result
END
Now there is no doubt.
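Usage is then just a matter of calling it wherever the check is needed (Customers and MiddleName are hypothetical names):
SELECT MiddleName, dbo.ISNULLEMPTY(MiddleName) AS IsBlank
FROM Customers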
How are the "NULL" and "empty varchar" values stored in SQL Server.
Why would you want to know that? Or in other words, if you knew the answer, how would you use that information?
And in case I have no user entry for a string field on my UI, should I store a NULL or a ''?
It depends on the nature of your field. Ask yourself whether the empty string is a valid value for your field.
If it is (for example, house name in an address) then that might be what you want to store (depending on whether or not you know that the address has no house name).
If it's not (for example, a person's name), then you should store a null, because people don't have blank names (in any culture, so far as I know).
