Database column sizes for character based data


I've just come across a database where all the character based columns have their size specified in multiples of 8 (e.g. Username = varchar(32), Forename = nvarchar(128) etc.) Would this have any bearing on the performance of the database or is it completely pointless?
Thanks.
N.B. This is in a SQL 2000 database.

Since they are VARchars the actual space used is based on the content. I would start to worry if they were CHARs.
-Edoode

It's probably just a habit of some old-school developer. As already said, a varchar is only as long as it needs to be, so 32 or 33 doesn't matter when the string length is, for example, 22.
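To make the point about 32 versus 33 concrete, here is a minimal Python sketch of the storage arithmetic. It assumes a single-byte code page and the commonly documented rule that a varchar value takes its actual length plus 2 bytes, while a char value always takes its full declared size. The function names are hypothetical, and row and page overhead are ignored.

```python
def varchar_storage_bytes(value: str, declared_length: int) -> int:
    """Approximate bytes stored for one varchar(declared_length) value."""
    data = value.encode("latin-1")  # assumption: one byte per character
    if len(data) > declared_length:
        raise ValueError("value does not fit in the declared length")
    return len(data) + 2  # actual data plus a 2-byte length entry


def char_storage_bytes(declared_length: int) -> int:
    """A char(declared_length) value is padded to its full declared size."""
    return declared_length


username = "a" * 22
print(varchar_storage_bytes(username, 32))  # declared as varchar(32)
print(varchar_storage_bytes(username, 33))  # declared as varchar(33), same size
print(char_storage_bytes(32))               # char(32) always uses the full width
```

Under these assumptions the declared length only matters for CHAR columns, which is why the multiples-of-8 habit has no storage effect on VARCHAR.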

"Would this have any bearing on the performance of the database or is it completely pointless?"
It has little bearing on performance (not none, but little). It can have an impact on spare space allocation, since the "largest" row size can be quite large with this scheme. Fewer rows per page can slow down large table retrievals.
The difference, however, is usually microscopic compared to improper indexing.
And getting the sizes "right" is not worth the time. In the olden days, old-school DBAs sweated over every character. Disks used to be expensive, and every byte had a real $$$ impact on cost.
Now that disk is so cheap, DBA time wasted in fussing over sizes costs more than the disk.

I've never seen it make a difference in the table storage, joins or basic operations.
But I have seen it make a difference where string processing is involved.
In SQL Server 2005, the only case I've seen where varchar size makes significant differences is when things are cast to varchar(MAX) or in UDFs. There appears to be some difference there - for instance, I have a system/process I'm porting with a number of key columns that have to be concatenated together into another pseudo keyfield (until I am able to refactor out this abomination) and the query performed significantly better casting the result as varchar(45) as soon as possible in the intermediate CTEs or nested queries.
I've seen cases where a UDF taking and returning a varchar(MAX) performs significantly worse than one taking and returning varchar(50), say. For instance, a trimming or padding function which someone (perhaps me!) was trying to make future-proof. varchar(MAX) has its place, but in my experience it can be dangerous to performance.
I do not think I profiled a difference between varchar(50) and varchar(1000).

Looks like premature optimisation to me.
I don't think this will make any performance change on the database.

Would you have the same concern if they were all multiples of 5 or 10, which is what happens in the normal case?

Related

Efficiency difference between varchar(n) and varchar(m) in SQL Server -- this is NOT about varchar(n) and varchar(MAX)

If I have a column in SQL Server that I know will be about 10 characters usually. Is there any disadvantage in making it varchar(50) to cover all my cases? Any loss in efficiency, storage, computation / query pull between varchar(n) and varchar(n+x)?
Note: this is not a comparison between varchar(n) and varchar(MAX). There is another question already asked about that.
Related but different question:
difference between varchar(500) vs varchar(max) in sql server

Yes, oversizing "just in case" can lead to performance issues, namely memory grants that are too high because they are based on the assumption that every value is half-populated.
But it's up to you to balance that impact with other potential issues:
Making it "just right" can lead to memory grants that are too low if most/all values are more than half full.
Making it 5X just in case seems like a lot, but think about how likely it is that the business requirement will evolve over time. Changing the column later could be more disruptive; consider not just the column itself but all the parameter and variable definitions, indexes, constraints, etc. that might also be affected, as well as application code that would have to change to accommodate a larger string.
What you should do depends on how confident you are that this will always be varchar(10). If you're not very confident, you might consider making it varchar(20) or varchar(32) but constraining it to the current requirement of 10 (with params to insert procedures and an actual constraint). If the requirement changes later, you only have to remove the constraint, change the parameter definitions, and then the apps can change at their own pace to start supporting the longer value.
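The trade-off described in this answer can be sketched with some rough arithmetic. The Python below is a simplified model, not SQL Server's actual costing: it only encodes the assumption stated above that each variable-length value is treated as half of its declared length. The function names and the row count are hypothetical.

```python
from typing import Tuple


def estimated_value_bytes(declared_length: int) -> float:
    """Size assumed for one varchar(n) value under the half-full assumption."""
    return declared_length / 2


def grant_vs_actual(declared_length: int, actual_length: int, rows: int) -> Tuple[float, int]:
    """Return (estimated bytes, actual bytes) for a column over a number of rows."""
    estimated = estimated_value_bytes(declared_length) * rows
    actual = actual_length * rows
    return estimated, actual


# Values are about 10 characters long, stored in 1,000 rows.
print(grant_vs_actual(50, 10, 1000))  # varchar(50): estimate is above the real size
print(grant_vs_actual(10, 10, 1000))  # varchar(10): estimate is below the real size
```

In this model the oversized column is overestimated by 2.5x, while the "just right" column is underestimated by half, which is the balance the answer asks you to weigh.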

Why not use nvarchar(max) also for small fields instead of nvarchar(123)

Why not use nvarchar(max) also for small fields instead of nvarchar(123).
Let us assume we do not have any values larger than 4000 bytes.
Is there any difference in terms of performance when we use nvarchar(max) for smaller fields as well? If not, why do people use nvarchar(SOME_FIX_VALUE)?

The most important reason is indexing.
Index keys can only be as large as 900 bytes, and a max column cannot be an index key column at all. So with max you would never be able to put an index on the column.
This will cause issues with performance for many.
Another reason is to keep data consistency. A lot of databases communicate one way or the other with other systems and of course users. It might be via webservices, applications or similar.
And there a fixed length might be a business rule that field "region" can only be X letters long. This means if you use max you'll never have any inbuilt control regarding your data integrity and have to build additional security layers.
So while you add validation to the UI, what happens if an import causes issues, a manual scripting error etc.
Other reasons have to do with how the database engine handles variable-length text. For example, data pages in SQL Server are 8 KB pages, so the engine has to make assumptions when you start using variable-length text. For example, check out: http://technet.microsoft.com/en-us/library/ms190969%28v=sql.105%29.aspx
But now we are becoming very technical, and you're probably better off taking this to the database version of Stack Overflow.
The main reason for a coder/user is the index in my opinion.

Yes, there are differences. First, the varchar(max) columns could end up stored out of row, as a LOB. Second, you can fool the optimizer into thinking that there's a lot more data than there actually is, and in some cases produce suboptimal query plans.

If a table with varchar(max) columns went to 1,000,000 rows then that's a huge table and most of the disk space is wasted.

Choosing SQL Server data types for maximum speed

I'm designing a database that will need to be optimized for maximum speed.
All the database data is generated once from something I call an input database (which holds the data I'm editing, mainly some polylines, markers, etc for google maps).
So the database is not subject to editing, but it needs to hold as many data as it can for quickly displaying results to the user (routes across town, custom polylines, etc).
The question is: will choosing smaller data types (for example, smallint over int) improve performance or hurt it? Space is not really a problem: after some quick calculations, the database will not exceed 200 MB, and there will not be tables with more than 100.000 rows (average will be around 5.000).
I'm asking this because I read some articles around the internet and some say that smaller data types improve performance others say that it affects it because additional processing must be done. I'm aware that for smaller databases probably results are not noticeable, but I'm interested in every bit because I'm expecting many requests which will trigger a lot more queries.
The hosting environment is gonna be Windows Server 2008 R2 with SQL Server 2008 R2.
EDIT 1: Just to give you an example because I don't have a proper table structure yet:
I'm going to have a table which will hold public transportation lines (somewhere around 200), identified by a unique number in real life, and which is going to be referenced in all sorts of tables and on which all sorts of operations are going to be made. These referencing tables will hold the largest amount of data.
Because lines have unique numbers, I have thought of 3 examples of designs:
The PK is the line number of datatype: smallint
The PK is the line number of datatype: int
The PK is something different (identity for example) and the line number is stored in a different field.
Just for the sake of argument, because I used this on the 'input database' which is not subject to optimization, the PK is a GUID (16 bytes); if you like, you can make a comparison of how bad is this compared to others, if it really is
So keep in mind that the PK is going to be referenced in at least 15 tables, some of which will have over 50.000 rows (the rest averaging 5.000 as I said above) which are going to be subject to constant querying and manipulation, and I'm interested in every bit of speed that I can get.
I can detail this even more if you need. Thanks
EDIT 2: And another question related to this came to my mind, think it fits into this discussion:
Will I see any performance improvements in this specific scenario if I use native SQL queries from inside my .NET application rather than using LINQ to SQL? I know LINQ is strongly optimized and generates very good queries performance-wise, but still, sure worth asking. Thanks again.

Can you point to some articles that say that smaller data types = more processing? Keeping in mind that even with SSDs most workloads today are I/O-bound (or memory-bound) and not CPU-bound.
Particularly in cases where the PK is going to be referenced in many tables, it will be beneficial to use the smallest data type possible. In this case if that's a SMALLINT then that's what I would use (though you say there are about 200 values, so theoretically you could use TINYINT which is half the size and supports 0-255). Where you need to exercise caution is if you aren't 100% sure that there will always be ~200 values. Once you need 256 you're going to have to change the data type in all of the affected tables, and this is going to be a pain. So sometimes a trade-off is made between accommodating future growth and squeezing the absolute most performance today. If you don't know for certain that you will never exceed 255 or 32,000 values then I would probably just use an INT. Unless you also don't know that you won't ever exceed 2 billion values, in which case you would use BIGINT.
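The thresholds mentioned above follow directly from the integer type ranges. As a small illustration, this Python sketch picks the narrowest SQL Server integer type for a given value range; the helper name is hypothetical, while the sizes and ranges are the standard ones for TINYINT, SMALLINT, INT, and BIGINT.

```python
# (type name, storage bytes, minimum value, maximum value)
INTEGER_TYPES = [
    ("TINYINT", 1, 0, 255),
    ("SMALLINT", 2, -2**15, 2**15 - 1),
    ("INT", 4, -2**31, 2**31 - 1),
    ("BIGINT", 8, -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value: int, max_value: int):
    """Return (type name, bytes) of the narrowest type that holds the range."""
    for name, size, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name, size
    raise ValueError("range does not fit in BIGINT")


print(smallest_integer_type(1, 200))            # about 200 line numbers
print(smallest_integer_type(1, 256))            # one value past TINYINT
print(smallest_integer_type(1, 3_000_000_000))  # past the INT limit
```

Note that a single value past 255 already forces the next type up, which is the migration pain the answer warns about.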
The difference between INT/SMALLINT/TINYINT is going to be more noticeable in disk space than in performance. (And if you're on Enterprise, the differences in both disk space and performance can be offset quite a bit using data compression - particularly while your INT values all fit within SMALLINT/TINYINT, though in the latter case it really will be negligible because the values are unique.) On the other hand, the difference between any of these and GUID is going to be much more noticeable in both performance and disk space. Marc gave some great links from Kimberly; I wrote this article in 2003 and while it's a little dated it does contain most of the salient points that are still relevant today.
Another trade-off that sometimes needs to be considered (though not in your specific case, it seems) is whether values need to be unique across multiple systems. This is where you might need to sacrifice some performance in order to meet business requirements. In a lot of cases folks take the easy way and resign themselves to GUID. But there are other solutions too, such as identity ranges, a central custom sequence generator, and the new SEQUENCE object in SQL Server 2012. I wrote about SEQUENCE back in 2010 when the first public beta of SQL Server 2012 was released.

I think you will need to provide some more details about the table structure and sample queries that will be running against them. Based on the information that you have provided, I believe the impact of choosing smaller data types will be just a couple of percent, and I would suggest paying more attention to the indexes that you will have. SQL Server does a good job of suggesting which indexes to create by providing you with execution plans for your queries and the tuning advisor tool.

One suggestion that I have is to incorporate a decimal datatype instead of using a combination of fields. For example, instead of having a table with Date (YYYYMMDD), Store (SSSS), and Item (IIII), I would recommend...YYYYMMDD.SSSSIIII. Especially when querying multiple tables with this same key combination, it dramatically improves processing time.

SQL Better performance: char(10) and trim or varchar(10)

I have a database that uses codes. Each code can be anywhere from two characters to ten characters long.
In MS SQL Server, is it better for performance to use char(10) for these codes and RTRIM them as they come in, or should I use varchar(10) and not have to worry about trimming the extra whitespace? I need to get rid of the whitespace because the codes will then be used in application logic for comparisons and what not.
As for the average code length, hard to tell exactly. Assume all codes are a random length between one and ten. Edit: A rough estimation is about 4.7 characters for the average length of a code.

I'd vote for varchar.
I say varchar to avoid the TRIM which would invalidate index usage (unless you use a computed column etc which defeats the purpose, no?).
Otherwise, at length 10 it would be 50/50, but the TRIM tips the balance towards varchar and outweighs the fixed-length benefit.

As a general rule, always favor smaller storage over extra CPU. Because the driving factor of database performance is always IO and smaller data records means more records per page and this in turn means fewer IO requests. The extra CPU involved in handling the variable length is not going to be a factor. Historically, in the dark ages of '80s and even in the '90s it may have been a measurable factor, but today is just noise. Because the CPU and memory access have increased tremendously, but the IO speed has stayed pretty much constant. That's why 'old books' advice does not apply today. Unless you have a constant field like char(2) or similar, just use varchar, always.
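For the codes in the question, the storage side of this argument can be checked with simple arithmetic. The Python sketch below assumes single-byte characters and the usual 2 bytes of length overhead per varchar value; the function names are hypothetical and per-row overhead is ignored.

```python
def char_bytes(declared_length: int) -> int:
    """char(n) always stores n bytes, padded with trailing spaces."""
    return declared_length


def varchar_average_bytes(average_length: float) -> float:
    """varchar stores the actual characters plus 2 bytes of length overhead."""
    return average_length + 2


def varchar_is_smaller(declared_length: int, average_length: float) -> bool:
    """True when varchar(n) uses less space on average than char(n)."""
    return varchar_average_bytes(average_length) < char_bytes(declared_length)


print(char_bytes(10))               # char(10)
print(varchar_average_bytes(4.7))   # varchar(10) at the quoted 4.7 average
print(varchar_is_smaller(10, 4.7))  # codes from the question
print(varchar_is_smaller(2, 2))     # a constant two-character field
```

With an average of 4.7 characters the variable-length column comes out smaller, while a constant field such as char(2) does not benefit, which matches the exception the answer makes.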

I'm confident that you wouldn't be able to tell a speed difference between the two.

Your requirements are a textbook definition of someone who needs to use varchar.
If you want to worry about performance, worry about DB design and writing good SQL. Char vs VarChar internals are well-optimized by the DB vendors.

In one old book I read that in general char is a better choice when for the most of the records the real string length is at least 60% of maximum; in your example - if more than half of all records have length 6 or greater. Otherwise, use varchar.

What are the main performance differences between varchar and nvarchar SQL Server data types?

I'm working on a database for a small web app at my school using SQL Server 2005.
I see a couple of schools of thought on the issue of varchar vs nvarchar:
Use varchar unless you deal with a lot of internationalized data, then use nvarchar.
Just use nvarchar for everything.
I'm beginning to see the merits of view 2. I know that nvarchar does take up twice as much space, but that isn't necessarily a huge deal since this is only going to store data for a few hundred students. To me it seems like it would be easiest not to worry about it and just allow everything to use nvarchar. Or is there something I'm missing?

Disk space is not the issue... but memory and performance will be.
Double the page reads, double index size, strange LIKE and = constant behaviour etc
Do you need to store Chinese etc script? Yes or no...
And from MS BOL "Storage and Performance Effects of Unicode"
Edit:
Recent SO question highlighting how bad nvarchar performance can be...
SQL Server uses high CPU when searching inside nvarchar strings

Always use nvarchar.
You may never need the double-byte characters for most applications. However, if you need to support double-byte languages and you only have single-byte support in your database schema it's really expensive to go back and modify throughout your application.
The cost of migrating one application from varchar to nvarchar will be much more than the little bit of extra disk space you'll use in most applications.

Be consistent! JOIN-ing a VARCHAR to NVARCHAR has a big performance hit.

nvarchar is going to have significant overhead in memory, storage, working set and indexing, so if the specs dictate that it really will never be necessary, don't bother.
I would not have a hard and fast "always nvarchar" rule because it can be a complete waste in many situations - particularly ETL from ASCII/EBCDIC or identifiers and code columns which are often keys and foreign keys.
On the other hand, there are plenty of cases of columns, where I would be sure to ask this question early and if I didn't get a hard and fast answer immediately, I would make the column nvarchar.

I hesitate to add yet another answer here as there are already quite a few, but a few points need to be made that have either not been made or not been made clearly.
First: Do not always use NVARCHAR. That is a very dangerous, and often costly, attitude / approach. And it is no better to say "Never use cursors" since they are sometimes the most efficient means of solving a particular problem, and the common work-around of doing a WHILE loop will almost always be slower than a properly done Cursor.
The only time you should use the term "always" is when advising to "always do what is best for the situation". Granted that is often difficult to determine, especially when trying to balance short-term gains in development time (manager: "we need this feature -- that you didn't know about until just now -- a week ago!") with long-term maintenance costs (manager who initially pressured team to complete a 3-month project in a 3-week sprint: "why are we having these performance problems? How could we have possibly done X which has no flexibility? We can't afford a sprint or two to fix this. What can we get done in a week so we can get back to our priority items? And we definitely need to spend more time in design so this doesn't keep happening!").
Second: @gbn's answer touches on some very important points to consider when making certain data modeling decisions when the path isn't 100% clear. But there is even more to consider:
size of transaction log files
time it takes to replicate (if using replication)
time it takes to ETL (if ETLing)
time it takes to ship logs to a remote system and restore (if using Log Shipping)
size of backups
length of time it takes to complete the backup
length of time it takes to do a restore (this might be important some day ;-)
size needed for tempdb
performance of triggers (for inserted and deleted tables that are stored in tempdb)
performance of row versioning (if using SNAPSHOT ISOLATION, since the version store is in tempdb)
ability to get new disk space when the CFO says that they just spent $1 million on a SAN last year and so they will not authorize another $250k for additional storage
length of time it takes to do INSERT and UPDATE operations
length of time it takes to do index maintenance
etc, etc, etc.
Wasting space has a huge cascade effect on the entire system. I wrote an article going into explicit detail on this topic: Disk Is Cheap! ORLY? (free registration required; sorry I don't control that policy).
Third: While some answers are incorrectly focusing on the "this is a small app" aspect, and some are correctly suggesting to "use what is appropriate", none of the answers have provided real guidance to the O.P. An important detail mentioned in the Question is that this is a web page for their school. Great! So we can suggest that:
Fields for Student and/or Faculty names should probably be NVARCHAR since, over time, it is only getting more likely that names from other cultures will be showing up in those places.
But for street address and city names? The purpose of the app was not stated (it would have been helpful) but assuming the address records, if any, pertain to just a particular geographical region (i.e. a single language / culture), then use VARCHAR with the appropriate Code Page (which is determined from the Collation of the field).
If storing State and/or Country ISO codes (no need to store INT / TINYINT since ISO codes are fixed length, human readable, and well, standard :) use CHAR(2) for two letter codes and CHAR(3) if using 3 letter codes. And consider using a binary Collation such as Latin1_General_100_BIN2.
If storing postal codes (i.e. zip codes), use VARCHAR since it is an international standard to never use any letter outside of A-Z. And yes, still use VARCHAR even if only storing US zip codes and not INT since zip codes are not numbers, they are strings, and some of them have a leading "0". And consider using a binary Collation such as Latin1_General_100_BIN2.
If storing email addresses and/or URLs, use NVARCHAR since both of those can now contain Unicode characters.
and so on....
Fourth: Now that you have NVARCHAR data taking up twice as much space as it needs to for data that fits nicely into VARCHAR ("fits nicely" = doesn't turn into "?") and somehow, as if by magic, the application did grow and now there are millions of records in at least one of these fields where most rows are standard ASCII but some contain Unicode characters so you have to keep NVARCHAR, consider the following:
If you are using SQL Server 2008 - 2016 RTM and are on Enterprise Edition, OR if using SQL Server 2016 SP1 (which made Data Compression available in all editions) or newer, then you can enable Data Compression. Data Compression can (but won't "always") compress Unicode data in NCHAR and NVARCHAR fields. The determining factors are:
NCHAR(1 - 4000) and NVARCHAR(1 - 4000) use the Standard Compression Scheme for Unicode, but only starting in SQL Server 2008 R2, AND only for IN ROW data, not OVERFLOW! This appears to be better than the regular ROW / PAGE compression algorithm.
NVARCHAR(MAX) and XML (and I guess also VARBINARY(MAX), TEXT, and NTEXT) data that is IN ROW (not off row in LOB or OVERFLOW pages) can at least be PAGE compressed, but not ROW compressed. Of course, PAGE compression depends on size of the in-row value: I tested with VARCHAR(MAX) and saw that 6000 character/byte rows would not compress, but 4000 character/byte rows did.
Any OFF ROW data, LOB or OVERFLOW = No Compression For You!
If using SQL Server 2005, or 2008 - 2016 RTM and not on Enterprise Edition, you can have two fields: one VARCHAR and one NVARCHAR. For example, let's say you are storing URLs which are mostly all base ASCII characters (values 0 - 127) and hence fit into VARCHAR, but sometimes have Unicode characters. Your schema can include the following 3 fields:
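...
    URLa VARCHAR(2048) NULL,   -- used when the value fits in VARCHAR
    URLu NVARCHAR(2048) NULL,  -- used only when the value needs Unicode
    URL AS (ISNULL(CONVERT(NVARCHAR(2048), [URLa]), [URLu])),
    -- exactly one of the two columns must be populated
    CONSTRAINT [CK_TableName_OneUrlMax] CHECK (
        ([URLa] IS NOT NULL OR [URLu] IS NOT NULL)
        AND ([URLa] IS NULL OR [URLu] IS NULL))
);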
In this model you only SELECT from the [URL] computed column. For inserting and updating, you determine which field to use by seeing if converting alters the incoming value, which has to be of NVARCHAR type:
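-- @URL must be NVARCHAR; IIF needs SQL Server 2012 or newer (use CASE on older versions)
INSERT INTO TableName (..., URLa, URLu)
VALUES (...,
    IIF (CONVERT(VARCHAR(2048), @URL) = @URL, @URL, NULL),   -- survives conversion: store as VARCHAR
    IIF (CONVERT(VARCHAR(2048), @URL) <> @URL, @URL, NULL)   -- altered by conversion: store as NVARCHAR
);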
You can GZIP incoming values into VARBINARY(MAX) and then unzip on the way out:
For SQL Server 2005 - 2014: you can use SQLCLR. SQL# (a SQLCLR library that I wrote) comes with Util_GZip and Util_GUnzip in the Free version
For SQL Server 2016 and newer: you can use the built-in COMPRESS and DECOMPRESS functions, which are also GZip.
If using SQL Server 2017 or newer, you can look into making the table a Clustered Columnstore Index.
While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.
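Two of the options in the list above, the split VARCHAR / NVARCHAR columns and GZip compression, can be illustrated outside the database. The Python sketch below is only an analogy under stated assumptions: cp1252 stands in for the column's code page, the helper names are hypothetical, and SQL Server's own conversion and comparison rules (collation, best-fit mappings) can differ from Python's.

```python
import gzip


def fits_single_byte(value: str, code_page: str = "cp1252") -> bool:
    """True if the value survives a round trip through the code page unchanged."""
    converted = value.encode(code_page, errors="replace").decode(code_page)
    return converted == value


def choose_column(value: str) -> str:
    """Pick the column name used in the two-column schema above."""
    return "URLa" if fits_single_byte(value) else "URLu"


def gzip_utf16(value: str) -> bytes:
    """Compress the UTF-16 bytes of a string, as one might before storing it."""
    return gzip.compress(value.encode("utf-16-le"))


def gunzip_utf16(blob: bytes) -> str:
    """Reverse of gzip_utf16."""
    return gzip.decompress(blob).decode("utf-16-le")


print(choose_column("http://example.com/page"))  # plain ASCII
print(choose_column("http://例え.jp/"))            # needs Unicode

long_url = "http://example.com/path/" * 50
blob = gzip_utf16(long_url)
print(len(long_url.encode("utf-16-le")), len(blob))  # raw size vs compressed size
```

The round-trip test plays the same role as comparing CONVERT(VARCHAR(2048), @URL) with @URL: characters that cannot be represented come back as "?", so the comparison fails and the Unicode column is used.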

For your application, nvarchar is fine because the database size is small. Saying "always use nvarchar" is a vast oversimplification. If you're not required to store things like Kanji or other crazy characters, use VARCHAR, it'll use a lot less space. My predecessor at my current job designed something using NVARCHAR when it wasn't needed. We recently switched it to VARCHAR and saved 15 GB on just that table (it was highly written to). Furthermore, if you then have an index on that table and you want to include that column or make a composite index, you've just made your index file size larger.
Just be thoughtful in your decision; in SQL development and data definitions there seems to rarely be a "default answer" (other than avoid cursors at all costs, of course).

Since your application is small, there is essentially no appreciable cost increase to using nvarchar over varchar, and you save yourself potential headaches down the road if you have a need to store unicode data.

Generally speaking: start out with the most expensive datatype that has the fewest constraints. Put it in production. If performance starts to be an issue, find out what's actually being stored in those nvarchar columns. Are there any characters in there that wouldn't fit into varchar? If not, switch to varchar. Don't try to pre-optimize before you know where the pain is. My guess is that the choice between nvarchar/varchar is not what's going to slow down your application in the foreseeable future. There will be other parts of the application where performance tuning will give you much more bang for the buck.

For the last few years all of our projects have used NVARCHAR for everything, since all of these projects are multilingual. Imported data from external sources (e.g. an ASCII file, etc.) is up-converted to Unicode before being inserted into the database.
I've yet to encounter any performance-related issues from the larger indexes, etc. The indexes do use more memory, but memory is cheap.
Whether you use stored procedures or construct SQL on the fly, ensure that all string constants are prefixed with N (e.g. SET @foo = N'Hello world.';) so the constant is also Unicode. This avoids any string type conversion at runtime.
YMMV.

I can speak from experience on this: beware of nvarchar. Unless you absolutely require it, this data type destroys performance on larger databases. I inherited a database that was hurting in terms of performance and space. We were able to reduce a 30 GB database in size by 70%! There were some other modifications made to help with performance, but I'm sure the varchars helped out significantly with that as well. If your database has the potential for growing tables to a million+ records, stay away from nvarchar at all costs.

I deal with this question at work often:
FTP feeds of inventory and pricing - Item descriptions and other text were in nvarchar when varchar worked fine. Converting these to varchar reduced file size almost in half and really helped with uploads.
The above scenario worked fine until someone put a special character in the item description (maybe trademark, can't remember)
I still do not use nvarchar every time over varchar. If there is any doubt or potential for special characters, I use nvarchar. I find I use varchar mostly when I am in 100% control of what is populating the field.

Why, in all this discussion, has there been no mention of UTF-8? Being able to store the full unicode span of characters does not mean one has to always allocate two-bytes-per-character (or "code point" to use the UNICODE term). All of ASCII is UTF-8. Does SQL Server check for VARCHAR() fields that the text is strict ASCII (i.e. top byte bit zero)? I would hope not.
If then you want to store unicode and want compatibility with older ASCII-only applications, I would think using VARCHAR() and UTF-8 would be the magic bullet: It only uses more space when it needs to.
For those of you unfamiliar with UTF-8, might I recommend a primer.
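The space argument can be checked directly with Python's built-in codecs. This is only a sketch of the encoding sizes; it says nothing about how any particular SQL Server version stores VARCHAR data, and the function name is hypothetical.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in UTF-8 and in UTF-16 (little endian)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }


print(encoded_sizes("hello"))   # ASCII text: UTF-8 needs half the bytes of UTF-16
print(encoded_sizes("日本語"))    # East Asian text: UTF-8 needs more bytes than UTF-16
print("hello".encode("ascii") == "hello".encode("utf-8"))  # ASCII is valid UTF-8
```

So UTF-8 saves space for mostly-ASCII data, but for text dominated by East Asian characters it can take more room than two-byte UTF-16.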

There'll be exceptional instances when you'll want to deliberately restrict the data type to ensure it doesn't contain characters from a certain set. For example, I had a scenario where I needed to store the domain name in a database. Internationalisation for domain names wasn't reliable at the time so it was better to restrict the input at the base level, and help to avoid any potential issues.

If you are using NVARCHAR just because a system stored procedure requires it (the most frequent occurrence being, inexplicably, sp_executesql) and your dynamic SQL is very long, you would be better off from a performance perspective doing all string manipulations (concatenation, replacement, etc.) in VARCHAR, then converting the end result to NVARCHAR and feeding it into the proc parameter. So no, do not always use NVARCHAR!
