How are varchar values stored in a SQL Server database? - sql-server

My fellow programmer has a strange requirement from his team leader: he insists on creating varchar columns with lengths of 16*2^n.
What is the point of such restriction?
I suppose that short strings (less than 128 chars, for example) are stored directly in the table record, and from this point of view the restriction would help to align fields within the record, while larger strings are stored in the database "heap" and only a reference to the string is saved in the table record.
Is it so?
Does this requirement have a reasonable background?
BTW, the DBMS is SQL Server 2008.

A completely pointless restriction as far as I can see. Assuming the standard FixedVar format (as opposed to the formats used with row/page compression or sparse columns), and assuming you are talking about varchar(1-8000) columns:
All varchar data is stored at the end of the row in a variable-length section (or on off-row pages if it can't fit in the row). The amount of space it consumes in that section (and whether or not it ends up off-row) depends entirely on the length of the actual data, not on the column declaration.
SQL Server will use the length declared in the column definition when allocating memory (e.g. for sort operations). The assumption it makes in that case is that varchar columns will be filled to 50% of their declared size on average, so this might be a better thing to consider when choosing a size.
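To make the first point concrete, here is a minimal sketch (the table and column names are made up, not taken from the question). Two columns with very different declared lengths store the same value in exactly the same amount of space:
CREATE TABLE dbo.NarrowDecl (val varchar(100));
CREATE TABLE dbo.WideDecl   (val varchar(8000));

INSERT INTO dbo.NarrowDecl VALUES (REPLICATE('x', 50));
INSERT INTO dbo.WideDecl   VALUES (REPLICATE('x', 50));

-- Both report 50 bytes of actual data; only the declared maximum differs.
SELECT DATALENGTH(val) AS bytes_used FROM dbo.NarrowDecl;
SELECT DATALENGTH(val) AS bytes_used FROM dbo.WideDecl;
The declared length still matters for memory grants on sorts and similar operations, as noted above, which is a better reason to keep it realistic.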

I have heard of this practice before, but after researching this question a bit I don't think there is a practical reason for having varchar lengths in multiples of 16. I think this requirement probably comes from trying to optimize the space used on each page. In SQL Server, pages are 8 KB in size. Rows are stored in pages, so perhaps the thinking is that you could conserve space on the pages if the size of each row divided evenly into 8 KB (a more detailed description of how SQL Server stores data can be found here). However, since the amount of space used by a varchar field is determined by its actual content, I don't see how using lengths in multiples of 16, or any other scheme, could help you optimize the amount of space used by each row on the page. The length of the varchar fields should simply be set to whatever the business requirements dictate.
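If you want to see how full a table's pages actually are, a hedged sketch is to query the physical stats (the table name here is just a placeholder):
SELECT index_id, alloc_unit_type_desc, page_count, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.SomeTable'), NULL, NULL, 'DETAILED');
Whatever those numbers show, they are driven by the actual data in the rows, not by whether the declared column lengths are multiples of 16.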
Additionally, this question covers similar ground and the conclusion also seems to be the same:
Database column sizes for character based data

You should always size columns to match the data being stored; it is part of how the database helps maintain integrity. For instance, suppose you are storing email addresses. If your column size is the size of the maximum allowable email address, then you will not be able to store bad data that is larger than that. That is a good thing. Some people want to make everything nvarchar(max) or varchar(max). However, that only causes problems, not least with indexing.
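As a rough illustration of the email example (254 characters as a practical ceiling is my assumption, and the table name is invented):
CREATE TABLE dbo.Subscriber (
    SubscriberId int IDENTITY(1,1) PRIMARY KEY,
    Email        varchar(254) NOT NULL
);

-- An oversized value is rejected instead of silently accepted:
-- INSERT INTO dbo.Subscriber (Email) VALUES (REPLICATE('x', 300) + '@example.com');
-- fails with "String or binary data would be truncated."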
Personally, I would have gone back to the person who made this requirement and asked for a reason. Then I would have presented my reasons as to why it might not be a good idea. I would never just blindly implement something like this. In pushing back on a requirement like this, I would first do some research into how SQL Server organizes data on the disk, so I could show the impact the requirement is likely to have on performance. I might even be surprised to find out the requirement made sense, but I doubt it at this point.

Related

Why not use nvarchar(max) also for small fields instead of nvarchar(123)

Why not use nvarchar(max) for small fields too, instead of nvarchar(123)?
Let us assume we do not have any values larger than 4000 bytes.
Is there any difference in terms of performance when we use nvarchar(max) for smaller fields as well? Or why do people then use nvarchar(SOME_FIX_VALUE)?
The most important reason is indexing.
Index keys can be at most 900 bytes, so with max you would never be able to put an index on the column.
This will cause performance issues for many workloads.
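A quick sketch of that limit (the names here are made up; the 900-byte key limit is the one that applies to SQL Server 2008):
CREATE TABLE dbo.Customer (
    CustomerId int IDENTITY(1,1) PRIMARY KEY,
    Region     nvarchar(max) NULL,
    RegionCode nvarchar(50)  NULL
);

-- CREATE INDEX IX_Customer_Region ON dbo.Customer (Region);
--   fails: an nvarchar(max) column is not valid as an index key column.

CREATE INDEX IX_Customer_RegionCode ON dbo.Customer (RegionCode);  -- 100 bytes, well under 900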
Another reason is data consistency. A lot of databases communicate one way or another with other systems and, of course, with users. It might be via web services, applications or similar.
And there a fixed length might encode a business rule, for example that the field "region" can only be X letters long. This means that if you use max you'll never have any built-in control over your data integrity and will have to build additional safety layers.
So even if you add validation in the UI, what happens when an import misbehaves, a manual script contains an error, etc.?
Another reason is how the database engine handles variable-length text. Data pages in SQL Server are 8 KB, so the engine has to make assumptions when you start using variable-length text. For example, check out: http://technet.microsoft.com/en-us/library/ms190969%28v=sql.105%29.aspx
But now we are getting very technical, and at that point you're probably better off taking this to the database version of Stack Overflow.
The main reason for a coder/user is the index in my opinion.
Yes, there are differences. First, varchar(max) columns can end up stored out of row, as a LOB. Second, you can fool the optimizer into thinking there's a lot more data than there actually is, and in some cases produce suboptimal query plans.
If a table with varchar(max) columns grows to 1,000,000 rows, it becomes a huge table, and a lot of disk space can end up wasted.

How can I store more than 8000 bytes of data inline in a SQL Server 2012 row?

I am trying to write a small blog engine. I would love to find a sample SQL Server schema to give me some ideas but have yet to find one.
I would like to have a blog table that allows me to store more than 8000 bytes of data. Can anyone tell me if a good way to do this would be with two fields like this:
CREATE TABLE [Blog](
    [BlogId] [int] IDENTITY(1,1) NOT NULL,
    [BlogText1] [nvarchar](8000) NOT NULL,
    [BlogText2] [nvarchar](8000),
    ..
What I was thinking was to store the text in two fields and have my application append the contents of the two fields when displaying the data; when storing data, the first xxx characters would go into BlogText1 and any remainder into BlogText2.
Is this a reasonable thing to do or should I just use a nvarchar(max)?
If I use nvarchar(8000) how many characters can I fit into that?
What I am concerned about is the time it will take to retrieve a row. Am I correct in assuming that if I use nvarchar(max) it will take much longer to retrieve the row?
The short version - use NVARCHAR(MAX) until you identify that there is a definite performance problem to solve - attempting to manually split up large blog entries so that they are saved "inline" is almost certainly going to result in worse performance than leaving it up to SQL Server.
The long version - SQL Server stores data in 8 KB pages, of which roughly 8,060 bytes are available for a single row. Normally a row cannot exceed this size; however, certain large-value types (e.g. TEXT) are handled specially: the in-row value is replaced with a small pointer to the actual data, which is stored elsewhere in a separate allocation unit.
The NVARCHAR(MAX) data type actually provides a hybrid approach - when the data is small enough, the value is stored in the data pages as it normally would be; when the data is too large, it is seamlessly treated as a large-value type for you. This generally means you get the best of both worlds.
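A minimal sketch of that recommendation, reusing the names from the question (the single BlogText column is my simplification of the question's two columns, and the physical-stats query is just one way to see where the text ends up):
CREATE TABLE dbo.Blog (
    BlogId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    BlogText nvarchar(max) NOT NULL
);

-- Shows whether the text is being kept in-row or has spilled to LOB pages:
SELECT alloc_unit_type_desc, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.Blog'), NULL, NULL, 'DETAILED');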

Why do I have to set the max length of every single text column in the database?

Why is it that every RDBMS insists that you tell it what the max length of a text field is going to be... why can't it just infer this information from the data that's put into the database?
I've mostly worked with MS SQL Server, but every other database I know also demands that you set these arbitrary limits on your data schema. The reality is that this is not particularly helpful or friendly to work with, because business requirements change all the time and almost every day some end user is trying to put a lot of text into that column.
Does anyone with some knowledge of the inner workings of an RDBMS know why we don't just infer the limits from the data that's put into storage? I'm not talking about guessing the type information, but guessing the limits of a particular text column.
I mean, there's a reason why I don't use nvarchar(max) on every text column in the database.
Because computers (and databases) are stupid. Computers don't guess very well and, unless you tell them, they can't tell that a column is going to be used for a phone number or a copy of War and Peace. Obviously, the DB could be designed so that every column could contain an infinite amount of data -- or at least as much as disk space allows -- but that would be a very inefficient design. In order to get efficiency, then, we make a trade-off and make the designer tell the database how much we expect to put in the column. Presumably, there could be a default so that if you don't specify a length, it simply uses that default. Unfortunately, any default would probably be inappropriate for the vast majority of people from an efficiency perspective.
This post not only answers your question about whether to use nvarchar(max) everywhere, but it also gives some insight into why databases historically didn't allow this.
It has to do with speed. If the max size of a string is specified, you can optimize the way information is stored for faster I/O on it. When speed is key, the last thing you want is a sudden shuffling of all your data just because you changed a state abbreviation to the full name.
With the max size set, the database can allocate the maximum space to every entry in that column, and no matter how the value changes, no address space needs to change.
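To illustrate that allocation point with a small example of my own (the char/varchar contrast is mine, not the answer's):
DECLARE @fixed    char(10)    = 'NY';
DECLARE @variable varchar(10) = 'NY';
SELECT DATALENGTH(@fixed)    AS fixed_bytes,     -- 10: padded to the declared size
       DATALENGTH(@variable) AS variable_bytes;  -- 2: just the actual data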
This is like saying, why can't we just tell the database we want a table and let it infer what type and how many columns we need from the data we give it.
Simply put, we know better than the database will. Suppose you have a one-in-a-million chance of putting a 2,000-character string into the database, and most of the time it's 100 characters. The database would probably blow up or refuse the 2,000-character string. It simply cannot know that you're going to need a length of 2,000 if for the first three years you've only entered strings of length 100.
Also, the declared character lengths are used to optimize row placement so that rows can be read/skipped faster.
I think it is because an RDBMS uses random data access. To do random data access, it must know which address on the disk it has to jump to in order to read the data quickly. If every row of a single column had a different data length, it could not infer the starting address to jump to directly; the only way would be to load all the data and scan through it.
If the RDBMS changed the data length of a column to a fixed number (for example, the max length across all rows) every time you add, update or delete, it would be extremely time-consuming.
What would the DB base its guess on? If the business requirements change regularly, it's going to be just as surprised as you. If there's a reason you don't use nvarchar(max), there's probably a reason it doesn't default to that as well...
Check this thread: http://www.sqlservercentral.com/Forums/Topic295948-146-1.aspx
For the sake of an example, I'm going to step into some quicksand and suggest you compare it with applications allocating memory (RAM). Why don't programmers ask for/allocate all the memory they need when the program starts up? Because often they don't know how much they'll need. This can lead to apps grabbing more and more memory as they run, and perhaps also releasing memory. And you have multiple apps running at the same time, and new apps starting, and old apps closing. And apps always want contiguous blocks of memory, they work poorly (if at all) if their memory is scattered all over the address space. Over time, this leads to fragmented memory, and all those garbage collection issues that people have been tearing their hair out over for decades.
Jump back to databases. Do you want that to happen to your hard drives? (Remember, hard drive performance is very, very slow when compared with memory operations...)
Sounds like your business rule is: Enter as much information as you want in any text box so you don't get mad at the DBA.
You don't allow users to enter 5000 character addresses since they won't fit on the envelope.
That's why Twitter has a text limit and saves everyone the trouble of reading through a bunch of mindless drivel that just goes on and on and never gets to the point, but only manages to infuriate the reader, making them wonder why you have such disregard for their time by choosing a self-centered and inhumane lifestyle focused on promoting the act of copying and pasting as much data as the memory buffer gods will allow...

Should Data types be sizes of powers of 2 in SQL Server?

What are good sizes for data types in SQL Server? When defining columns, I see data types with a size of 50 as one of the defaults (e.g. nvarchar(50), binary(50)). What is the significance of 50? I'm tempted to use sizes that are powers of 2 - is that better, or just pointless?
Update 1
Alright thanks for your input guys. I just wanted to know the best way of defining the size of a datatype for a column.
There is no reason to use powers of 2 for performance etc. Data length should be determined by the data being stored.
Why not the traditional powers of 2 minus 1, such as 255...
Seriously, the length should match what you need and what is suitable for your data.
Nothing else: not how the client uses it, not alignment to a 32-bit word boundary, not powers of 2, birthdays, Scorpio rising in Uranus, or a roll of the dice...
The reason so many fields have a length of 50 is that SQL Server defaults to 50 as the length for most data types where length is an issue.
As has been said, the length of a field should be appropriate to the data being stored there, not least because there is a limit to the length of a single record in SQL Server (roughly 8,060 bytes in-row). It is possible to blow past that limit.
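For example (made-up table name), the declared lengths below add up to more than a page, so SQL Server warns at CREATE time and pushes the excess of a fully populated row onto row-overflow pages:
CREATE TABLE dbo.WideRow (
    a varchar(4000),
    b varchar(4000),
    c varchar(4000)
);

INSERT INTO dbo.WideRow
VALUES (REPLICATE('x', 4000), REPLICATE('x', 4000), REPLICATE('x', 4000));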
Also, the length of your fields can be considered part of your documentation. I don't know how many times I've met lazy programmers who claim that they don't need to document because the code is self documenting and then they don't bother doing the things that would make the code self documenting.
You won't gain anything from using powers of 2. Make the fields as long as your business needs really require them to be - let SQL Server handle the rest.
Also, since the SQL Server page size is limited to 8K (of which 8060 bytes are available to user data), making your variable length strings as small as possible (but as long as needed, from a requirements perspective) is a plus.
That 8K limit is a fixed SQL Server system setting which cannot be changed.
Of course, SQL Server these days can handle more than 8K of data in a row, using so called "overflow" pages - but it's less efficient, so trying to stay within 8K is generally a good idea.
Marc
The size of a field should be appropriate for the data you are planning to store there, global defaults are not a good idea.
It's a good idea for the whole row to fit into a page several times over without leaving too much free space.
A row cannot span two pages, and a page has 8,096 bytes available for data, so two rows that take 4,049 bytes each will occupy two pages.
See the docs on how to calculate the space occupied by one row.
Also note that the VAR in VARCHAR and VARBINARY stands for "varying", so if you put a 1-byte value into a 50-byte column, it will take just 1 byte (plus a couple of bytes of variable-length overhead).
This totally depends on what you are storing.
If you need x chars use x not some arbitrarily predefined amount.

Strategy for storing a string of unspecified length in SQL Server?

So a column will hold some text, and I won't know beforehand how long the string can be. Realistically, 95% of the time it will probably be between 100 and 500 chars, but there can be that one case where it will be 10,000 chars long. I have no control over the size of this string, and neither does the user. Besides varchar(max), what other strategies have you found useful? Also, what are some cons of varchar(max)?
Varchar(max) in SQL Server 2005 is what I use.
SQL Server handles large string fields in a peculiar way: if you specify "text" or a large varchar, but not max, it can store part of the data in the record and the rest outside it.
To my knowledge, with varchar(max) it stores the entire contents out of the record, which makes it less efficient than a small text input. But it's more efficient than a "text" field, since it does not have to look the information up twice by getting part inline and the rest from a pointer.
One inelegant but effective approach would be to have two columns in your table, one a varchar big enough to cover your majority of cases, and another of a CLOB/TEXT type to store the freakishly large ones. When inserting/updating, you can get the size of your string, and store it in the appropriate column.
Like I say, not pretty, but it would give you the performance of varchar for the majority case, without breaking when you have larger values.
Have you considered using the BLOB type?
Also, out of curiosity, if you don't control the size of the string, and neither does the user, who does?
nvarchar(max) is definitely your best bet - as I'm sure you know, it will only use the space required for the data you actually store in each row, not the declared maximum of the data type.
The only con I would see is if you are constantly updating a row and it frequently switches between less than 8,000 bytes and more than 8,000 bytes, in which case SQL Server will change the storage to a LOB and keep a pointer to the data whenever you go over the limit. Changing back and forth would be expensive, but I don't see any other options here - so it's kind of a moot point.
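If a table really does flip across that boundary all the time, one option worth knowing about (only after measuring, and 'dbo.Posts' is just a placeholder name) is to force large value types off-row so updates stop bouncing between in-row and LOB storage:
EXEC sp_tableoption 'dbo.Posts', 'large value types out of row', 1;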
