Why can't I put a constraint on nvarchar(max)? - sql-server

Why can't I create a constraint on an nvarchar(max) column? SQL Server will not allow me to put a unique constraint on it. But, it allows me to create a unique constraint on an nvarchar(100) column.
Both of these columns are NOT NULL. Is there any reason I can't add a constraint to the nvarchar(max) column?

nvarchar(max) is really a different data type from nvarchar(integer-length). Its characteristics are more like those of the deprecated text data type.
If an nvarchar(max) value becomes too large then, like text, it is stored outside the row (a row is limited to roughly 8,060 bytes) and only a pointer to it is kept in the row itself. You cannot efficiently index such a large field, and the fact that the data can be stored somewhere else further complicates searching and scanning the index.
A unique constraint requires an index to enforce it, so the SQL Server designers decided to disallow creating a unique constraint on nvarchar(max).
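As a quick illustration, a minimal repro sketch (table and column names are made up):
CREATE TABLE dbo.Demo (
    ShortText nvarchar(100) NOT NULL,
    LongText  nvarchar(max) NOT NULL
);

-- Works: nvarchar(100) is small enough to be an index key
ALTER TABLE dbo.Demo ADD CONSTRAINT UQ_Demo_ShortText UNIQUE (ShortText);

-- Fails: nvarchar(max) is not valid as an index key column, so the
-- unique constraint (which is backed by an index) is rejected
ALTER TABLE dbo.Demo ADD CONSTRAINT UQ_Demo_LongText UNIQUE (LongText);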

Because MAX is really big (2^31 - 1 bytes) and could lead to a server meltdown if the server had to check for uniqueness on multi-megabyte-sized entries.
From the documentation on CREATE INDEX, I would assume this holds true for unique constraints as well:
The maximum allowable size of the combined index values is 900 bytes.
EDIT: If you really needed uniqueness, you could potentially approximate it by computing a hash of the data and storing that in a unique index. Even a large hash would be small enough to fit in an indexable column. You'd have to figure out how to handle collisions -- perhaps manually check on collisions and pad the data (changing the hash) if an errant collision is found.
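A minimal sketch of that idea, assuming a hypothetical table dbo.Documents with an nvarchar(max) column Body (note that before SQL Server 2016, HASHBYTES can only hash up to 8,000 bytes of input):
CREATE TABLE dbo.Documents (
    ID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Body nvarchar(max) NOT NULL,
    -- persisted computed hash of the large column; SHA2_256 yields 32 bytes
    BodyHash AS CAST(HASHBYTES('SHA2_256', Body) AS varbinary(32)) PERSISTED
);

-- the unique index is on the small hash, not on the nvarchar(max) column itself;
-- a genuine hash collision between two different Body values would be rejected here too
CREATE UNIQUE INDEX UQ_Documents_BodyHash ON dbo.Documents (BodyHash);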

A unique constraint is actually an index, and nvarchar(max) cannot be used as a key in an index.

Related

Converting Varbinary to Char and again to Varbinary for Primary key

I came across some SQL code that creates primary keys with the HASHBYTES function and the MD5 algorithm. The code looks like this:
SELECT
    CONVERT(VARBINARY(32),
        CONVERT(CHAR(32),
            HASHBYTES('MD5',
                (LTRIM(RTRIM(COALESCE(column1,''))) + ';' + LTRIM(RTRIM(COALESCE(column2,''))))
            ),
        2)
    )
FROM database.schema.table
I find it hard to understand why the result of the HASHBYTES function is converted to CHAR and then back to VARBINARY, when HASHBYTES already returns VARBINARY directly. Is there any good reason to do so?
Short Version
This code pads a hash with 0x20 bytes, which is rather strange and most likely due to a misunderstanding by the original author. Using hashes as keys is a terrible idea anyway.
Long Version
Hashes are completely inappropriate for generating primary keys. Since the same hash can be generated from different original data, this code can produce duplicate key values, causing collisions at best.
Worst case, you end up updating or deleting the wrong row, resulting in data loss. In fact, given that MD5 was broken over 20 years ago, one can calculate the values that would result in collisions. This has been used to hack systems in the past and even generate rogue CA certificates as far back as 2008.
And even worse, the concatenation expression:
(LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
will produce the same string for different column values, e.g. column1 = 'a;b' with column2 = 'c' and column1 = 'a' with column2 = 'b;c' both concatenate to 'a;b;c'.
On top of that, given the random nature of hash values, this results in table fragmentation and an index that can't be used for range queries. Primary keys most of the time are clustered keys as well, which means they specify the order rows are stored on disk. Using essentially random values for a PK means new rows can be added at the middle or even the start of a table's data pages.
This also harms caching, as data is loaded from disk in pages. With a meaningful clustered key, it's highly likely that loading a specific row will also load rows that will be needed very soon. Loading eg 50 rows while paging may only need to load a single page. With an essentially random key, you could end up loading 50 pages.
Using a GUID generated with NEWID() would provide a key value without collisions. Using NEWSEQUENTIALID() would generate sequential GUID values eliminating fragmentation and once again allowing range searches.
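For example, a sketch with made-up names (NEWSEQUENTIALID() can only be used in a DEFAULT constraint):
CREATE TABLE dbo.ThatTableGuid (
    ID uniqueidentifier NOT NULL
        CONSTRAINT DF_ThatTableGuid_ID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_ThatTableGuid PRIMARY KEY,
    Column1 nvarchar(50) NOT NULL,
    Column2 nvarchar(50) NOT NULL
);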
An even better solution would be to just create a PK from the two columns:
ALTER TABLE ThatTable ADD PRIMARY KEY (Column1,Column2);
Or just add an IDENTITY-generated ID column. A bigint is large enough to handle all scenarios:
CREATE TABLE ThatTable (
    ID bigint NOT NULL IDENTITY(1,1) PRIMARY KEY,
    ...
)
If the intention was to ignore spaces in column values there are better options:
The easiest solution would be to clean up the values when inserting them.
A CHECK constraint can be added to each column to ensure the columns can't have leading or trailing spaces.
An INSTEAD OF trigger can be used to trim them.
Computed, persisted columns can be added that trim the originals, e.g. Column1_Cleaned AS TRIM(Column1) PERSISTED. Persisted columns can be used in indexes and primary keys (see the sketch below).
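A sketch of the CHECK constraint and persisted computed column options against the hypothetical ThatTable, using LTRIM/RTRIM so it also works on versions older than SQL Server 2017 (where TRIM was introduced):
-- reject values with leading or trailing spaces outright
ALTER TABLE ThatTable
    ADD CONSTRAINT CK_ThatTable_Column1_Trimmed
        CHECK (Column1 = LTRIM(RTRIM(Column1)));

-- or expose a trimmed, persisted copy that can be indexed or used in a key
ALTER TABLE ThatTable
    ADD Column1_Cleaned AS LTRIM(RTRIM(Column1)) PERSISTED;

CREATE INDEX IX_ThatTable_Column1_Cleaned ON ThatTable (Column1_Cleaned);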
As for what it does:
It generates deprecation warnings (MD5 is deprecated)
It pads the MD5 hash with 0x20 bytes. A rather ... unusual way of padding data. I suspect whoever first wrote this wanted to pad the hash to 32 bytes but used some copy-pasta code without understanding the implications.
You can check the results by hashing any value. The following queries show this:
select hashbytes('md5','banana')
----------------------------------
0x72B302BF297A228A75730123EFEF7C41
select cast(hashbytes('md5','banana') as char(32))
--------------------------------
r³¿)z"Šus#ïï|A
A space in ASCII is the byte 0x20. The CHAR(32) cast pads the 16-byte hash with trailing spaces, so converting back to binary leaves 0x20 padding bytes rather than 0x00:
select cast(cast(hashbytes('md5','banana') as char(32)) as varbinary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4120202020202020202020202020202020
If one wanted to pad a 16-byte value to 32 bytes, it would make more sense to use 0x00. The result is no better than the original, though:
select cast(hashbytes('md5','banana') as binary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4100000000000000000000000000000000
To get a real 32-byte hash, SHA2_256 can be used:
select hashbytes('sha2_256','banana')
------------------------------------------------------------------
0xB493D48364AFE44D11C0165CF470A4164D1E2609911EF998BE868D46ADE3DE4E

NVARCHAR or INT Primary Key with UNIQUE constraint

I have alphanumeric data with a max length of 20 characters. I'm going to store this data in a column with type NVARCHAR(20).
These data are CODES and must be unique, so I decided to make it a primary key column.
But when asking another question, someone "suggested" that I use an INT column as the primary key.
What do you think? An INT primary key plus a column with a UNIQUE constraint, or my current design?
I think I would be adding a new column that I'm not going to use, because I need the NVARCHAR(20) column for searching and for avoiding duplicates. In other words, 99% of my WHERE clauses will use that NVARCHAR column.
I am a strong fan of numeric, synthetic primary keys. Something like the key you want can be declared unique and kept as an attribute of the table.
Here are some reasons:
Numeric keys occupy 4 or 8 bytes and are of fixed length. This is more efficient for building indexes.
Numeric keys are often shorter than string keys. This saves space for foreign key references.
Synthetic keys are usually inserted using auto-incrementing columns. This lets you know the insert order. Note: In some applications, knowing the order may be a drawback, but those are unusual.
If the value of the unique string changes, you only have to change the value in one place -- rather than in every table with a foreign key reference. And, if you leave out a foreign key reference, then the database integrity is at risk.
If a row would otherwise be identified by multiple columns, then a single numeric key is more efficient.
A synthetic key can help in maintaining security.
These are just guidelines -- your question asks why synthetic numeric keys are a good idea, and there are competing concerns. If space usage is a really big concern, for instance, then the additional space for a numeric key plus a unique index may overrule the other considerations.
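Concretely, the suggested design looks something like this (hypothetical table and column names):
CREATE TABLE dbo.Items (
    ItemID int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Items PRIMARY KEY,
    Code   nvarchar(20) NOT NULL
        CONSTRAINT UQ_Items_Code UNIQUE
);

-- searches on the code still use the unique index created by the constraint
-- SELECT ItemID FROM dbo.Items WHERE Code = N'ABC123';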

PK Index fragmentation on IDENTITY columns with VARBINARY(MAX) column in table

I have some tables (Table A and Table B) with a BIGINT IDENTITY column as the primary key.
In those tables I have 2 VARBINARY(MAX) columns. Updates and deletes are very rare.
They have almost the same row count; Table B has a bit fewer rows but significantly more data in the VARBINARY(MAX) columns.
I was surprised to see that the storage used by PK in Table B was much higher than the storage used by PK in Table A.
Doing some reading on the subject (correct me if I am wrong) suggested that it has something to do with the maximum row size of around 8 KB: data that doesn't fit in the row is pushed off-row, and a pointer to it is kept in the row and therefore in the index. Hence the larger storage used by the PK in Table B, which is around 30 percent of the total size of the DB.
I was of the assumption that only the BIGINT was part of the index.
My question is whether there is a workaround for that? Any designs, techniques or hacks that can prevent this?
Regards
Vilma
A PK is, by default, a CLUSTERED index: the data is stored with the key. You can have only one clustered index per table, because the data can only be stored in one place. So any clustered index (such as a PK) will take up more space than a non-clustered index.
If there is more varbinary data in Table B, then I would expect its PK to take up more space.
However, since this varbinary is (MAX), the initial thought is that only a data pointer should be stored with the key. But if the row is small enough (i.e. under about 8,000 bytes) I imagine that SQL Server optimises the store/retrieve by keeping the data with the key, thus increasing the size of the index. I do not know for certain that this happens, but I was unable to find anything to say it doesn't; as an optimisation it seems reasonable.
Take that for what it's worth!
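One thing that may be worth testing (this is an assumption on my part, not something I have verified against your tables) is forcing the large values off-row with sp_tableoption, so that only a small pointer stays on the clustered index pages:
-- store varbinary(max) values off-row so the clustered PK pages stay small;
-- existing rows are only moved off-row when the large value is next updated
EXEC sp_tableoption 'dbo.TableB', 'large value types out of row', 1;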

Is there a way to create a unique constraint on a column larger than 900 bytes?

I'm fairly new to SQL Server, so if anything I say doesn't make sense, there's a good chance I'm just confused by something. Anyway...
I have a simple mapping table. It has two columns, Before and After. All I want is a constraint that the Before column is unique. Originally it was set to be a primary key, but this created errors when the value was too large. I tried adding an ID column as a primary key and then adding UNIQUE to the Before column, but I have the same problem with the max length exceeding 900 bytes (I guess the constraint creates an index).
The only option I can think of is to change the ID column to a checksum column and make that the primary key, but I dislike this option. Is there a different way to do this? I just need two simple columns.
The only way I can think of to guarantee uniqueness inside the database is to use an INSTEAD OF trigger. The link I provided to MSDN has an example for checking uniqueness. This solution will most likely be quite slow indeed, since you won't be able to index on the column being checked.
You could speed it up somewhat by using a computed column to create a hash, perhaps using the HASHBYTES function, of the Before column. You could then create a non-unique index on that hash column, and inside your trigger check for the negative case -- that is, check to see if a row with the same hash doesn't exist. If that happens, exit the trigger. In the case there is another row with the same hash, you could then do the more expensive check for an exact duplicate, and raise an error if the user enters a duplicate value. You also might be able to simplify your check by simply comparing both the hash value and the Before value in one EXISTS() clause, but I haven't played around with the performance of that solution.
(Note that the HASHBYTES function I referred to itself can hash only up to 8000 bytes. If you want to go bigger than that, you'll have to roll your own hash function or live with the collisions caused by the CHECKSUM() function)
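A rough sketch of that combination, assuming a hypothetical dbo.Mapping table; the hash is recomputed in the trigger rather than read from the inserted rows, and duplicates within a single multi-row INSERT are not checked here:
CREATE TABLE dbo.Mapping (
    ID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [Before] nvarchar(max) NOT NULL,
    [After]  nvarchar(max) NOT NULL,
    -- persisted hash used only to narrow the search; it is not unique by itself
    BeforeHash AS CAST(HASHBYTES('SHA2_256', [Before]) AS varbinary(32)) PERSISTED
);

CREATE INDEX IX_Mapping_BeforeHash ON dbo.Mapping (BeforeHash);
GO

CREATE TRIGGER trg_Mapping_UniqueBefore
ON dbo.Mapping
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- cheap, indexed probe on the hash first; the exact comparison only runs on hash matches
    IF EXISTS (
        SELECT 1
        FROM inserted i
        JOIN dbo.Mapping m
          ON m.BeforeHash = CAST(HASHBYTES('SHA2_256', i.[Before]) AS varbinary(32))
         AND m.[Before] = i.[Before]
    )
    BEGIN
        RAISERROR('Duplicate Before value.', 16, 1);
        RETURN;
    END;

    INSERT INTO dbo.Mapping ([Before], [After])
    SELECT i.[Before], i.[After]
    FROM inserted i;
END;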

SQL server - worth indexing large string keys?

I have a table that has a large string key (varchar(1024)) that I was thinking of indexing on SQL Server (I want to be able to search over it quickly, but inserts are also important). In SQL 2008 I don't get a warning for this, but under SQL Server 2005 it tells me that it exceeds 900 bytes and that inserts/updates where the column exceeds this size will fail (or something along those lines).
What are my alternatives if I would want to index on this large column ? I don't know if it would worth it if I could anyway.
An index with all the keys near 900 bytes would be very large and very deep (very few keys per page result in very tall B-Trees).
It depends on how you plan to query the values. An index is useful in several cases:
when a value is probed. This is the most typical use: an exact value is searched for in the table. Typical examples are WHERE column='ABC' or a join condition ON a.column = B.someothercolumn.
when a range is scanned. This is also fairly typical when a range of values is searched in the table. Besides the obvious example of WHERE column BETWEEN 'ABC' AND 'DEF' there are other less obvious examples, like a partial match: WHERE column LIKE 'ABC%'.
an ordering requirement. This use is less known, but indexes can help a query that has an explicit ORDER BY column requirement to avoid a stop-and-go sort, and also can help certain hidden sort requirement, like a ROW_NUMBER() OVER (ORDER BY column).
So, what do you need the index for? What kind of queries would use it?
For range scans and for ordering requirements there is no other solution but to have the index, and you will have to weigh the cost of the index vs. the benefits.
For probes you can, potentially, use hash to avoid indexing a very large column. Create a persisted computed column as column_checksum = CHECKSUM(column) and then index on that column. Queries have to be rewritten to use WHERE column_checksum = CHECKSUM('ABC') AND column='ABC'. Careful consideration would have to be given to weighing the advantage of a narrow index (32 bit checksum) vs. the disadvantages of collision double-check and lack of range scan and order capabilities.
after the comment
I once had a similar problem and I used a hash column. The value was too large to index (>1K) and I also needed to convert the value into an ID to store (basically, a dictionary). Something along the lines:
-- Dictionary table clustered on the checksum so colliding values stay together
create table values_dictionary (
    id int not null identity(1,1),
    value varchar(8000) not null,
    value_hash as checksum(value) persisted,   -- persisted computed hash, indexable
    constraint pk_values_dictionary_id
        primary key nonclustered (id));

create unique clustered index cdx_values_dictionary_checksum
    on values_dictionary (value_hash, id);
go

-- Returns the id for @value, inserting it first if it is not in the dictionary yet
create procedure usp_get_or_create_value_id (
    @value varchar(8000),
    @id int output)
as
begin
    declare @hash int = CHECKSUM(@value);
    set @id = NULL;

    -- probe by hash first, then confirm the exact value to rule out collisions
    select @id = id
      from values_dictionary
      where value_hash = @hash
        and value = @value;

    if @id is null
    begin
        insert into values_dictionary (value)
        values (@value);
        set @id = scope_identity();
    end
end
In this case the dictionary table is organized as a clustered index on the value_hash column, which groups all the colliding hash values together. The id column is added to make the clustered index unique, avoiding the need for a hidden uniqueifier column. This structure makes the lookup for @value as efficient as possible, w/o a hugely inefficient index on value, and bypasses the 900-byte limitation. The primary key on id is non-clustered, which means that looking up the value from an id incurs the overhead of one extra probe into the clustered index.
Not sure if this answers your problem; you obviously know more about your actual scenario than I do. Also, the code does not handle error conditions and can actually insert duplicate @value entries, which may or may not be correct.
General Index Design Guidelines
When you design an index consider the following column guidelines:
Keep the length of the index key short for clustered indexes. Additionally, clustered indexes benefit from being created on unique or nonnull columns. For more information, see Clustered Index Design Guidelines.
Columns that are of the ntext, text, image, varchar(max), nvarchar(max), and varbinary(max) data types cannot be specified as index key columns. However, varchar(max), nvarchar(max), varbinary(max), and xml data types can participate in a nonclustered index as nonkey index columns. For more information, see Index with Included Columns.
Examine data distribution in the column. Frequently, a long-running query is caused by indexing a column with few unique values, or by performing a join on such a column. This is a fundamental problem with the data and query, and generally cannot be resolved without identifying this situation. For example, a physical telephone directory sorted alphabetically on last name will not expedite locating a person if all people in the city are named Smith or Jones.
