Performance implication on nvarchar(4000)? - sql-server

I have a column declared as nvarchar(4000), and 4000 is SQL Server's limit on the length of nvarchar. The column will not be used as a key to sort rows.
Are there any implications that I should be aware of before setting the length of the nvarchar field to 4000?
Update
I'm storing XML, an XML-serialized object to be exact. I know this isn't ideal, but we will most likely never need to query it, and this implementation dramatically decreases development time for certain features we plan on extending. I expect the XML data to be about 1500 characters long on average, but there can be exceptions where it is longer. Could it be longer than 4000 characters? It could, but only on very rare occasions, if ever. Is this application mission critical? Nope, not at all.

SQL Server has three types of storage: in-row, LOB and Row-Overflow, see Table and Index Organization. The in-row storage is fastest to access. LOB and Row-Overflow are similar to each other, both slightly slower than in-row.
If you have a column of NVARCHAR(4000), it will be stored in row if possible; if not, it will be stored in the row-overflow storage. Having such a column does not necessarily indicate future performance problems, but it raises the question: why nvarchar(4000)? Is your data likely to always be near 4000 characters long? Can it be 4001, and how will your application handle that case? Why not nvarchar(max)? Have you measured performance and found that nvarchar(max) is too slow for you?
My recommendation would be to either use a small nvarchar length, appropriate for the real data, or nvarchar(max) if it is expected to be large. nvarchar(4000) smells like unjustified, untested premature optimisation.
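As a sketch of that recommendation (table and column names are hypothetical), a definition sized to the real data rather than to the nvarchar limit might look like:

```sql
-- Hypothetical table: each column sized for its actual data,
-- with nvarchar(max) reserved for the one column that can be large.
CREATE TABLE dbo.Feature
(
    FeatureId int IDENTITY PRIMARY KEY,
    Name      nvarchar(200) NOT NULL,  -- realistic cap for the real data
    Payload   nvarchar(max) NULL       -- can exceed the in-row limit
);
```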
Update
For XML, use the XML data type. It has many advantages over varchar or nvarchar, like the fact that it supports XML indexes, it supports XML methods and can actually validate the XML for a compliance to a specific schema or at least for well-formed XML compliance.
XML will be stored in the LOB storage, outside the row.
Even if the data is not XML, I would still recommend LOB storage (nvarchar(max)) for something around 1500 characters long. There is a cost associated with retrieving LOB-stored data, but it is more than compensated for by making the table narrower. The width of a table row is a primary factor in performance, because wider tables fit fewer rows per page, so any operation that has to scan a range of rows or the entire table needs to fetch more pages into memory, and this shows up in the query cost (it is actually the driving factor of the overall cost). A LOB-stored column only expands the row by the size of a pointer (16 bytes for an in-row LOB pointer), so you get a much better density of rows per page, and hence faster queries.
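A sketch of the XML-typed approach recommended above (table and element names are illustrative); the xml type supports XML indexes and methods such as .value():

```sql
-- Illustrative table storing a serialized object as a typed XML column.
CREATE TABLE dbo.SerializedObject
(
    ObjectId int IDENTITY PRIMARY KEY,  -- clustered PK, required for XML indexes
    Payload  xml NULL
);

-- An XML index speeds up queries into the document.
CREATE PRIMARY XML INDEX IX_SerializedObject_Payload
    ON dbo.SerializedObject (Payload);

-- XML methods can query inside the document, e.g. pull an attribute value
-- (the /Object/@Version path is a hypothetical shape for the stored object).
SELECT Payload.value('(/Object/@Version)[1]', 'int') AS Version
FROM dbo.SerializedObject;
```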

Are you sure that you'll actually need as many as 4000 characters? If 1000 is the practical upper limit, why not set it to that? Conversely, if you're likely to get more than 4000 bytes, you'll want to look at nvarchar(max).
I like to "encourage" users not to use storage space too freely. The more space required to store a given row, the fewer rows you can store per page, which potentially results in more disk I/O when the table is read or written. Even though only as many bytes are stored as the data actually needs (i.e. not the full 4000 per row), once rows hold a bit more than 2000 characters of nvarchar data (over 4000 bytes each), only one row fits per page, and performance can really suffer.
This of course assumes you need to store Unicode (two bytes per character) data, and that you only have one such column per row. If you don't, drop down to varchar.
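The two-bytes-per-character difference is easy to verify with DATALENGTH, which is what makes the Unicode choice matter for row width:

```sql
-- nvarchar stores 2 bytes per character (UTF-16); varchar stores 1 byte
-- per character for plain ASCII data.
SELECT DATALENGTH('hello')  AS varchar_bytes,   -- 5
       DATALENGTH(N'hello') AS nvarchar_bytes;  -- 10
```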

Do you really need nvarchar, or can you go with varchar? The limitation applies mainly to SQL Server 2000. Are you using 2005 / 2008?


Redshift: Disadvantages of having a lot of nulls/empties in a large varchar column

I have a varchar column of max size 20,000 in my Redshift table. About 60% of the rows will have this column null or empty. What is the performance impact in such cases?
From this documentation I read:
Because Amazon Redshift compresses column data very effectively,
creating columns much larger than necessary has minimal impact on the
size of data tables. During processing for complex queries, however,
intermediate query results might need to be stored in temporary
tables. Because temporary tables are not compressed, unnecessarily
large columns consume excessive memory and temporary disk space, which
can affect query performance.
So this means query performance might be bad in this case. Is there any other disadvantage apart from this?
For storage in a Redshift table, there is no significant performance degradation; as the documentation suggests, compression encodings keep the data compact.
When you query the column with null values, however, extra processing is required, for instance when using it in a WHERE clause. This might impact the performance of your query, so the impact depends on your query.
EDIT (answer to your comment) - Redshift stores each column in "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned. Query your disk space for the particular column and check its size against the other columns.
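One way to check per-column disk usage, as suggested above, is the STV_BLOCKLIST system table (each block is 1 MB; the table name here is a placeholder, and visibility of these system tables may require elevated privileges):

```sql
-- Count 1 MB blocks used per column of a given table.
SELECT b.col, COUNT(*) AS mb_used
FROM stv_blocklist b
JOIN stv_tbl_perm p ON b.tbl = p.id
WHERE p.name = 'my_table'
GROUP BY b.col
ORDER BY b.col;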
If I’ve made a bad assumption please comment and I’ll refocus my answer.

Snowflake - Performance when column size is not specified

Currently we are using the Snowflake DWH for our project. The columns in the tables are defined without any size specification. I'm not sure why it was done this way, as it was done long ago.
Will there be a performance hit in Snowflake when the size is not specified? For example, by default the size of VARCHAR is 16777216 and of NUMBER is (38,0).
Will there be any performance hit from leaving the size at the default in Snowflake?
Actually, we're just about to add more info about it to our doc, coming very soon.
In short, the length for VARCHAR and the precision (the "15" in DECIMAL(15,2)) for DECIMAL/NUMBER work only as constraints and have no effect on performance. Snowflake automatically detects the range of values and optimizes storage and processing for it. The scale (the "2" in DECIMAL(15,2)) for NUMBER and TIMESTAMP can influence storage size and performance, though.
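In practice that means these two definitions should behave identically in Snowflake; the explicit length acts only as a check on incoming values (table names are illustrative):

```sql
-- Both columns are stored and processed the same way;
-- the declared length is only a constraint on inserts.
CREATE TABLE t_default  (c VARCHAR);      -- implicitly VARCHAR(16777216)
CREATE TABLE t_explicit (c VARCHAR(50));  -- rejects strings longer than 50

-- Scale, by contrast, can matter for storage:
CREATE TABLE t_scaled (n NUMBER(38,2));   -- vs NUMBER(38,0)
```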

Is there a benefit to decreasing the size of my NVARCHAR columns

I have a SQL Server 2008 database that stores millions of rows. There are several NVARCHAR columns that will never exceed the current max length of the column, nor get close to it due to application constraints.
i.e.
The Address NVARCHAR field has a length of 50 characters, but it'll never exceed 32 characters.
Is there a performance or space-saving benefit to reducing the size of the NVARCHAR column to what its actual max length will be (i.e. in the case of the Address field, 32 characters)? Or will it not make a difference, since it's a variable-length field?
Setting the number of characters in NVARCHAR is mainly for validation purposes. If there is some reason why you don't want the data to exceed 50 characters then the database will enforce that rule for you by not allowing extra data.
If the total row size exceeds a threshold then it can affect performance, so by restricting the length you could benefit by not allowing your row size to exceed that threshold. But in your case, that does not seem to matter.
The reason for this is that SQL Server can fit more rows onto a page, which results in less disk I/O, and more rows can be held in memory.
Also, the maximum in-row size in SQL Server is roughly 8 KB (8,060 bytes), since that is the size of a page and rows cannot cross page boundaries. If you insert a row that exceeds this, the extra data will be stored in a row-overflow page, which will likely have a negative effect on performance.
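Whether a table has actually spilled into row-overflow pages can be checked with sys.dm_db_index_physical_stats (the table name is a placeholder):

```sql
-- Page counts per allocation unit type:
-- IN_ROW_DATA, ROW_OVERFLOW_DATA, LOB_DATA.
SELECT alloc_unit_type_desc, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'DETAILED');
```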
There is no expected performance or space saving benefit for reducing your n/var/char column definitions to their maximum length. However, there may be other benefits.
The column won't accidentally have a longer value inserted without generating an error (desirable for the "fail fast" characteristic of well-designed systems).
The column communicates to the next developer examining the table something about the data, that aids in understanding. No developer will be confused about the purpose of the data and have to expend wasted time determining if the code's field validation rules are wrong or if the column definition is wrong (as they logically should match).
If your column does need to be extended in length, you can do so with potential consequences ascertained in advance. A professional who is well-versed in databases can use the opportunity to see if upcoming values that will need the new column length will have a negative impact on existing rows or on query performance—as the amount of data per row affects the number of reads required to satisfy queries.

Changing char(8) to char(32) in SQL Server

We have a table T which contains several char(8) columns (implicitly) which under some conditions need to be changed into something like char(64).
We don't want to waste space, so here is the question:
Is it an expensive operation from the RDBMS computational point of view (extending column data type)? We'd like to have this answered theoretically, no benchmarks. Does database need to rearrange the physical layout of the table because of this?
Yes, quite expensive - every single row in that table must be touched, modified, and stored again, and all the non-clustered indexes using any of those columns will need to be rebuilt.
Since it's CHAR(x), it's fixed-width - so changing its size requires every single value to be rewritten. Also, with the change from 8 to 64 characters, there's a chance that some pages won't be able to hold all their rows anymore, and page splits, with all their overhead, will occur.
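The change itself is a single statement, but as described above it is a size-of-data operation (table and column names are illustrative):

```sql
-- Rewrites every existing row: each value is padded out to 64 characters.
-- Keep NULL/NOT NULL matching the column's current definition.
ALTER TABLE dbo.T ALTER COLUMN SomeCode char(64) NOT NULL;

-- Non-clustered indexes touching the column should be rebuilt afterwards.
ALTER INDEX ALL ON dbo.T REBUILD;
```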
If you increase the length of a char column, page splits will be required, and these are costly in terms of CPU utilization and memory, and can also affect data compression.

How are varchar values stored in a SQL Server database?

My fellow programmer has a strange requirement from his team leader; he insists on creating varchar columns with lengths of 16*2^n (16, 32, 64, ...).
What is the point of such restriction?
I can suppose that short strings (less than 128 chars, for example) are stored directly in the table record, and from this point of view the restriction would help to align fields within the record, while larger strings are stored in the database "heap" and only a reference to the string is saved in the table record.
Is that so?
Does this requirement have a reasonable basis?
BTW, the DBMS is SQL Server 2008.
A completely pointless restriction as far as I can see, assuming the standard FixedVar format (as opposed to the formats used with row/page compression or sparse columns) and assuming you are talking about varchar(1-8000) columns.
All varchar data is stored at the end of the row in a variable-length section (or in off-row pages if it can't fit in row). The amount of space it consumes in that section (and whether or not it ends up off row) is entirely dependent upon the length of the actual data, not the column declaration.
SQL Server will, however, use the length declared in the column definition when allocating memory (e.g. for sort operations). The assumption it makes in that instance is that varchar columns will be filled to 50% of their declared size on average, so this might be a better thing to consider when choosing a size.
I have heard of this practice before, but after researching this question a bit I don't think there is a practical reason for having varchar values in multiples of 16. I think this requirement probably comes from trying to optimize the space used on each page. In SQL Server, pages are set at 8 KB per page. Rows are stored in pages, so perhaps the thinking is that you could conserve space on the pages if the size of each row divided evenly into 8 KB (a more detailed description of how SQL Server stores data can be found here). However, since the amount of space used by a varchar field is determined by its actual content, I don't see how using lengths in multiples of 16 or any other scheme could help you optimize the amount of space used by each row on the page. The length of the varchar fields should just be set to whatever the business requirements dictate.
Additionally, this question covers similar ground and the conclusion also seems to be the same:
Database column sizes for character based data
You should always store data with a size that matches the data being stored. It is part of how the database maintains integrity. For instance, suppose you are storing email addresses. If your column size is the maximum allowable email address length, then you will not be able to store bad data that is larger than that. That is a good thing. Some people want to make everything nvarchar(max) or varchar(max). However, this causes problems, such as with indexing (a max column cannot be used as an index key).
Personally, I would have gone back to the person who made this requirement and asked for a reason. Then I would have presented my reasons as to why it might not be a good idea. I would never just blindly implement something like this. In pushing back on a requirement like this, I would first do some research into how SQL Server organizes data on disk, so I could show the impact the requirement is likely to have on performance. I might even be surprised to find out the requirement made sense, but I doubt it at this point.
