Confused about nvarchar limit - sql-server

I've got a quick question that I can't find an answer to anywhere. I frequently need to convert one type of database to another, therefore I'm writing a program to convert MS SQL Server databases back and forth. The problem I'm having is that I can't declare an nvarchar variable with a max length above 4000. I get:
"The size (6000) given to the parameter 'description' exceeds the maximum allowed (4000)."
Yet that column is clearly defined as nvarchar(6000) in the original database - at least I think so, because max_length is 6000, and if you use max, max_length is -1, right? I know I could just use nvarchar(max), but if I'm writing software that converts databases I want to stay as true to the original as possible.
Was the nvarchar max limit changed recently or is it some setting that I've missed?

The size you are seeing (6000) is in bytes, whereas the length you declare is in Unicode characters. nvarchar stores 2 bytes per character, so max_length = 6000 actually corresponds to nvarchar(3000). The 4000-character limit exists because the internal storage of nvarchar(xxxx) and nvarchar(max) is different. If you need more than 4000 characters, use nvarchar(max).
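To see this in action, here is a minimal sketch (dbo.Demo is a made-up table): sys.columns reports max_length in bytes, so an nvarchar(3000) column shows 6000 and an nvarchar(max) column shows -1.

CREATE TABLE dbo.Demo
(
    col_3000 nvarchar(3000),  -- sys.columns will report max_length = 6000 (bytes)
    col_max  nvarchar(max)    -- sys.columns will report max_length = -1
);

SELECT name, max_length
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.Demo');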

Related

Choosing best datatype for numeric column in SQL Server

I have a table in SQL Server with a large amount of data - around 40 million rows. The base structure is like this:
Title       | type           | length | Null distribution
Customer-Id | number         | 8      | 60%
Card-Serial | number         | 5      | 70%
-           | -              | -      | -
-           | -              | -      | -
Note        | string-unicode | 2000   | 40%
Both numeric columns are filled with numbers of a fixed length.
I have no idea which data types to choose to keep the database as small as possible while still getting good performance from an index on the Customer-Id column. According to this post, if I choose CHAR(8), the database consumes 8 bytes per row even for null data.
I decided to use INT to reduce the database size and get a good index, but null values will still use 4 bytes per row. To reduce that, I could use VARCHAR(8), but I don't know whether an index on that type performs well. The main question is whether reducing the database size matters more than having a good index on a numeric type.
Thanks.
If it is a number - then by all means choose a numeric datatype!! Don't store your numbers as char(n) or varchar(n)!! That'll just cause you immeasurable grief and headaches later on.
The choice is pretty clear:
if you have whole numbers - use TINYINT, SMALLINT, INT or BIGINT - depending on the number range you need
if you need fractional numbers - use DECIMAL(p,s) for the best and most robust behaviour (none of the rounding errors you get with FLOAT or REAL)
Picking the most appropriate datatype is much more important than any micro-optimization for storage. Even with 40 million rows - that's still not a big issue, whether you use 4 or 8 bytes. Whether you use a numeric type vs. a string type - that makes a huge difference in usability and handling of your database!
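Applied to the table in the question, a rough sketch (table and column names are assumed): an 8-digit Customer-Id tops out at 99,999,999, well within INT's range, and a 5-digit Card-Serial exceeds SMALLINT's 32,767 maximum, so INT fits both.

CREATE TABLE dbo.Cards
(
    CustomerId INT NULL,            -- 4 bytes; holds any 8-digit number
    CardSerial INT NULL,            -- 4 bytes; 5-digit numbers exceed SMALLINT's 32,767
    Note       NVARCHAR(2000) NULL  -- variable-length Unicode text
);

CREATE INDEX IX_Cards_CustomerId ON dbo.Cards (CustomerId);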

SQL Server : create table columns for most efficient size

My SQL Server database was created & designed by a freelance developer.
I see the database getting quite big, and I want to ensure that the column datatypes are as efficient as possible at keeping the size small.
Most columns were created as
VARCHAR (255), NULL
This covers those where they are
Numerics with a length of 2 numbers maximum
Numerics where a length will never be more than 3 numbers or blank
Alpha which will contain just 1 letter or are blank
Then there are a number of alphanumeric columns with a maximum of 10 characters, and others with a maximum of 25.
There is one big alphanumeric column which can be up to 300 characters.
There has been an amendment for a column which shows the time taken, in seconds, to race an event: under 1,000 seconds and up to 2 decimal places.
This is set as DECIMAL (18,2) NULL
The question is: can I reduce the size of the database by changing the column data types, or was the original design optimal for the purpose?
You should definitely strive to use the most appropriate data types for all columns - and in this regard, that freelance developer did a very poor job - both in terms of consistency and usability (just try to sum up the numbers in a VARCHAR(255) column, or sort by numeric value - horribly bad design...) and from a performance point of view.
Numerics with a length of 2 numbers maximum
Numerics where a length will never be more than 3 numbers or blank
-> if you don't need any fractional decimal points (only whole numbers) - use INT
Alpha which will contain just 1 letter or are blank
-> in this case, I'd use CHAR(1) (or NCHAR(1) if you need to be able to handle Unicode characters, like Hebrew, Arabic, Cyrillic or East Asian languages). Since it's really only ever 1 character (or nothing), there's no need or point in using a variable-length string datatype, which only adds at least 2 bytes of overhead per string stored
There is one big alphanumeric column which can be up to 300 characters.
-> That's a great candidate for a VARCHAR(300) column (or again: NVARCHAR(300) if you need to support Unicode). Here I'd definitely use a variable-length string type, to avoid padding the column with spaces up to the defined length when you store fewer characters.
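Putting those recommendations together, a sketch of the reworked table (the names are made up, and DECIMAL(5,2) is an assumed tightening of the original DECIMAL(18,2), since it holds up to 999.99 and the times stay under 1,000 seconds):

CREATE TABLE dbo.RaceResults
(
    SmallNum    INT NULL,           -- whole numbers of up to 2 digits
    MediumNum   INT NULL,           -- whole numbers of up to 3 digits, or NULL for blank
    Flag        CHAR(1) NULL,       -- a single letter, or blank
    Code10      VARCHAR(10) NULL,   -- up to 10 alphanumeric characters
    Code25      VARCHAR(25) NULL,   -- up to 25 alphanumeric characters
    Description VARCHAR(300) NULL,  -- the big alphanumeric column
    RaceSeconds DECIMAL(5,2) NULL   -- under 1,000 seconds, 2 decimal places; 5 bytes vs. 9 for DECIMAL(18,2)
);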

Oracle pro*c indicator variables vs NVL in query execution

The Oracle Pro*C documentation recommends using indicator variables as "NULL flags" attached to host variables. Per the documentation, we can associate every host variable with an optional indicator variable (of type short). For example:
short indicator_var;
EXEC SQL SELECT xyz INTO :host_var:indicator_var
FROM ...;
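The indicator is then checked by hand after the fetch. A minimal sketch of the standard semantics (-1 = NULL, 0 = value assigned intact, > 0 = value truncated, with the indicator holding the original length):

if (indicator_var == -1)
{
    /* xyz was NULL - the contents of host_var are undefined */
}
else if (indicator_var > 0)
{
    /* value was truncated; indicator_var holds the original length */
}
else
{
    /* host_var holds the complete, non-NULL value */
}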
Alternatively, we can use NVL, as documented in https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions105.htm, for example:
EXEC SQL SELECT NVL(TO_CHAR(xyz), '') INTO :host_var
FROM ...;
Which one is better in terms of performance?
Ah, Pro*C. It's been a while, over 20 years, but I think my memory serves me well here.
Using the indicator variables will be better in terms of performance, for two reasons:
The SQL is simpler, so there is less parsing and fewer bytes will be transferred over the network to the database server.
In Oracle in general, a NULL value is encoded in 0 bytes, whereas an empty string needs a length field (N bytes) plus storage (0 bytes). So a NULL value is encoded more efficiently in the returned result set.
Now in practice, you won't notice the difference much. But you asked :-)
In my experience NVL was much slower than indicator variables, especially if nested (yes, you can nest them), for INSERT or UPDATE of fields. It was a long time ago and I don't remember the exact circumstances, but I remember the performance gain was real.
On SELECT it was not that obvious, but using indicator variables also lets you detect cases where truncation happened.
If you use VARCHAR or UVARCHAR columns, there is a third option to detect NULL/empty strings in Oracle: the len field will be set to 0, which means the value is empty. As Oracle does not distinguish between NULL and zero-length strings, it is more or less the same.
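For reference, a sketch of the VARCHAR pseudo-type that len field belongs to (the Pro*C precompiler expands the declaration into a struct):

EXEC SQL BEGIN DECLARE SECTION;
    VARCHAR name[40];  /* expands to: struct { unsigned short len; unsigned char arr[40]; } name; */
EXEC SQL END DECLARE SECTION;

/* after a fetch: */
if (name.len == 0)
{
    /* zero length - NULL/empty, which Oracle treats as the same thing */
}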

Can I limit specific characters in a SQL Server column? Will it improve size and query speed?

I couldn't figure out the correct terminology for what I am asking so I apologize if this is in the wrong place or format.
I am rebuilding a database, call it aspsessionsv2. It consists of a single table with over 11 billion rows. Column 1 is a string and has no limits other than being under 20 characters. The other columns all contain HEX data, so there isn't any reason for those fields to store characters outside of A-F and 0-9. So...
Is there a way I can configure SQL Server to limit the field to those characters?
Will that reduce the overall size of the database?
Will that speed up queries to a database of this size?
What got me thinking about this was WinRAR. I compressed a 50 GB file containing only HEX characters down to 206 MB. That blows my mind even though I understand how it works, so I am curious whether I can do the same kind of "compression" on a SQL Server database.
Thank you!
Been a little bit since I have asked a question. Here is some tech info: Windows Server 2008 R2, SQL Server 2008, 10 Columns, 11 Billion Rows
You could use a blob (binary large object); that would cut the size of the hexadecimal-data fields in half. Hexadecimal encoding is often used to circumvent character-encoding issues.
You could also use Base64 encoding rather than base-16 (hexadecimal); it carries 6 bits per character rather than 4, so the storage only grows 4:3 relative to a blob, instead of 2-fold as with hexadecimal strings.
If you are using varchar or nvarchar to store strings of the characters 0-9 and A-F, then you should really be using the varbinary type instead. Each pair of hexadecimal characters represents one byte, so with varbinary each byte of data needs 1 byte on disk, with varchar each byte of data needs 2 bytes on disk, and with nvarchar each byte of data needs 4 bytes on disk.
Using varbinary instead of varchar will reduce the overall size of the database and will speed up queries, because fewer bytes need to be read from disk.
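A quick sketch of the conversion (CONVERT style 2 parses hex digits without a 0x prefix and is available in SQL Server 2008):

DECLARE @hex varchar(20) = '1A2B3C';
SELECT CONVERT(varbinary(10), @hex, 2);    -- 0x1A2B3C: 3 bytes instead of 6
SELECT CONVERT(varchar(20), 0x1A2B3C, 2);  -- back to '1A2B3C' for display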
Hex values are just numbers, so you are likely better off storing them as such. For example, 123abc converts nicely to 1194684 and would only require 4 bytes instead of 8 (6 characters + 2 bytes of varchar overhead). So provided the number isn't going to go above 2147483647, you can store them all as int.
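For illustration, the same conversion done in T-SQL (a sketch; when a shorter varbinary is converted to int, SQL Server pads it on the left with zeros):

SELECT CONVERT(int, CONVERT(varbinary(4), '123abc', 2));  -- 1194684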
However, if you wanted to restrict the column to only containing the values 0-9 and a-f, then you could use a check constraint, something like this:
ALTER TABLE YourTable
ADD CONSTRAINT CK_YourTable_YourColumn CHECK (YourColumn NOT LIKE '%[^0-9a-f]%')
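With the constraint in place, a hypothetical test against the same table:

INSERT INTO YourTable (YourColumn) VALUES ('1a2b3c');  -- succeeds
INSERT INTO YourTable (YourColumn) VALUES ('1a2b3g');  -- fails: violates CK_YourTable_YourColumn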

Is there any benefit to my rather quirky character sizing convention?

I love things that are a power of 2. I celebrated my 32nd birthday knowing it was the last time in 32 years I'd be able to claim that my age was a power of 2. I'm obsessed. It's like being some Z-list Batman villain, except without the colourful adventures and a face full of batarangs.
I ensure that all my enum values are powers of 2, if only for future bitwise operations, and I'm reasonably assured that there is some purpose (even if latent) for doing it.
Where I'm less sure, is in how I define the lengths of database fields. Again, I can't help it. Everything ends up being a power of 2.
CREATE TABLE Person
(
PersonID int IDENTITY PRIMARY KEY
,Firstname varchar(64)
,Surname varchar(128)
)
Can any SQL super-boffins who know about the internals of how stuff is stored and retrieved tell me whether there is any benefit to my inexplicable obsession? Is it more efficient to size character fields this way? Can anyone pop in with an "actually, what you're doing works because ....."?
I suspect I'm just getting crazier in my older age, but it'd be nice to know that there is some method to my madness.
Well, if I'm your coworker and I'm reading your code, I don't have to use SVN blame to find out who wrote it. That's kind of cool. :)
The only relevant powers of two are 512 and 4096, which are the default disk block size and memory page size, respectively. If your total row length crosses these boundaries, you might notice disproportionate jumps and drops in performance if you look very closely. For example, if your row is 513 bytes long, you need to read twice as many blocks for a single row as for a row that is 512 bytes long.
The problem is calculating the proper row size, as the internal storage format is not very well documented.
Also, I do not know whether SQL Server actually keeps the rows block-aligned, so you might be out of luck there anyway.
With varchar, you only store the number of characters plus 2 bytes for the length.
Generally, the maximum row size is 8,060 bytes:
CREATE TABLE dbo.bob (c1 char(3000), c2 char(3000), c3 char(3000))
Msg 1701, Level 16, State 1, Line 1
Creating or altering table 'bob' failed because the minimum row size would be 9007, including 7 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.
The power of 2 stuff is frankly irrational and that isn't good in a programmer...
