sql server data fields nvarchar(x) or nvarchar(max) - sql-server

Is it good practice to set all text fields to nvarchar(MAX) if I'm not sure what the size of the field will be?
Will it take up extra space?

varchar(max) won't take extra space; storage is based on the actual length of the data.
One dangerous thing about varchar(max) is that someone can potentially put a huge amount of data in it. This can hurt if a malicious user, or a faulty program, enters a 1MB address line. No client or web application will expect that, and it might cause them to crash.
So, as a best practice, create varchar fields with a specified maximum length.
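For illustration, a minimal sketch of what "specified maximum length" columns look like (the table and column names here are hypothetical, not from the question):
-- Sketch: explicit maximum lengths instead of nvarchar(max) everywhere
CREATE TABLE dbo.Customer
(
    CustomerId   int           NOT NULL PRIMARY KEY,
    FirstName    nvarchar(50)  NOT NULL,  -- bounded: protects clients from oversized input
    AddressLine1 nvarchar(100) NULL,
    Notes        nvarchar(max) NULL       -- genuinely open-ended text can stay MAX
);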

If you're not sure, and it's for something like a comments field that may be quite large, an NVARCHAR(MAX) type is the way to go. If it is for something with a reasonable maximum, like a postal address, then something like NVARCHAR(100) might be better.
The space used is up to the maximum length of the value that is put into it.
What I mean by that is if you insert an 80 character string, then update it with a 40 character replacement, the allocated space may remain as 80 characters until the database is compacted, table is rebuilt, etc.

I'm not sure what you're storing in the columns but you may want to consider varchar instead of nvarchar. nvarchar stores unicode data which is used for multilingual data and takes up more space than varchar. If the field will only store English (for example) then varchar might be a better choice.

Related

Type for storing text in a database

I would like to create a table to store short text.
What is the best type to use:
char
varchar
I can't guarantee the text length. Any best practices?
VARCHAR is the newer-style storage type for strings. Speaking from an MSSQL point of view, VARCHAR offers many benefits over CHAR, such as dynamic sizing, off-row storage for large values, etc. I'd go with VARCHAR every time - unless I knew that the text was always going to be a fixed size.
If you are not sure about the text length, then I suggest you go with varchar.
You can check this discussion:
http://bytes.com/topic/db2/answers/183253-varchar-vs-char

Char vs Varchar when not always populated

I have a database with a field that holds permit numbers associated with requests. The permit numbers are 13 digits, but a permit may not be issued.
With that said, I currently have the field defined as a char(13) that allows NULLs. I have been asked to change it to varchar(13) because char columns, even if NULL, still use the full length.
Is this advisable? Other than space usage, are there any other advantages or disadvantages to this?
I know in an ideal relational system, the permit numbers would be stored in another related table to avoid the use of NULLs, but it is what it is.
Well, if you don't have to use as much space, then you can fit more pages in memory. If you can do that, then your system will run faster. This may seem trivial, but I just recently tweaked the data types on a table at a client that reduced the amount of reads by 25% and the CPU by about 20%.
As for which is easier to work with, the benefits David Stratton mentioned are noteworthy. I hate having to use trim functions in string building.
If the field should always be exactly 13 characters, then I'd probably leave it as CHAR(13).
Also, an interesting note from BOL:
If SET ANSI_PADDING is OFF when either CREATE TABLE or ALTER TABLE is executed, a char column that is defined as NULL is handled as varchar.
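A quick way to see the fixed-length padding that motivates the varchar suggestion - a sketch using variables, where DATALENGTH returns the storage size in bytes:
-- Sketch: char pads to the declared length, varchar stores only what you supply
DECLARE @c char(13)    = '12345';
DECLARE @v varchar(13) = '12345';
SELECT DATALENGTH(@c) AS CharBytes,     -- 13 (padded with trailing spaces)
       DATALENGTH(@v) AS VarcharBytes;  -- 5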
Edit: How frequently would you expect the field to be NULL? If it will be populated 95% of the time, it's hardly worth it to make this change.
The biggest advantage (in general, not necessarily your specific case) I know of is that in code, if you use varchar, you don't have to use a Trim function every time you want it displayed. I run into this a lot when taking FirstName fields and LastName fields and combining them into a FullName. It's just annoying and makes the code less readable.
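For example (a sketch with hypothetical column names), concatenating CHAR columns drags the trailing spaces along unless you trim them, which VARCHAR columns wouldn't need:
-- Sketch: char columns carry trailing spaces into concatenation
SELECT RTRIM(FirstName) + ' ' + RTRIM(LastName) AS FullName  -- trimming needed for char columns
FROM dbo.Person;                                             -- with varchar, FirstName + ' ' + LastName suffices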
If you are using SQL Server 2008, you should look at row compression, and perhaps sparse columns if the column is more than ~60% NULLs (see the sketch after the links below).
I would keep the datatype a char(13) if all of the populated fields use that amount.
Row Compression Information:
http://msdn.microsoft.com/en-us/library/cc280449.aspx
Sparse columns:
http://msdn.microsoft.com/en-us/library/cc280604.aspx
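A minimal sketch of both options, assuming a hypothetical dbo.Request table holding the permit number:
-- Sketch: enable row compression on the existing table (SQL Server 2008+)
ALTER TABLE dbo.Request REBUILD WITH (DATA_COMPRESSION = ROW);

-- Sketch: mark the mostly-NULL permit number as a sparse column
ALTER TABLE dbo.Request ALTER COLUMN PermitNumber varchar(13) SPARSE NULL;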

How BIG do you make your Nvarchar()

When designing a database, what decisions do you consider when deciding how big your nvarchar should be.
If I was to make an address table, my gut reaction would be for address line 1 to be nvarchar(255), like an old Access database.
I have found that using this has got me in bother with the old 'The string would be truncated' error. I know that this can be prevented by limiting the input box, but if a user really has an address line one that is over 255 characters, this should be allowed.
How big should I make my nvarchar(????)
My recommendation: make them just as big as you REALLY need them.
E.g. for a zip code column, 10-20 chars are definitely enough. Ditto for a phone number. E-mails might be longer, 50-100 chars. Names - well, I usually get by with 50 chars, ditto for first names. You can always and easily extend fields if you really need to - that's not a big undertaking at all.
There's really no point in making all varchar/nvarchar fields as big as they can be. After all, a SQL Server page is fixed and limited to 8060 bytes per row. Having 10 fields of NVARCHAR(4000) is just asking for trouble.... (since if you actually try to fill them with too much data, SQL Server will barf at you).
If you REALLY need a really big field, use NVARCHAR/VARCHAR(MAX) - those are stored in your page, as long as they fit, and will be sent to "overflow" storage if they get too big.
NVARCHAR vs. VARCHAR: this really boils down to whether you really need "exotic" characters, such as Japanese, Chinese, or other non-ASCII characters. In Europe, even some of the Eastern European characters cannot be represented by VARCHAR fields anymore (they will be stripped of their háček). Western European languages (English, German, French, etc.) are all very well served by VARCHAR.
BUT: NVARCHAR does use twice as much space - on disk and in your SQL Server memory - at all times. If you really need it, you need it - but do you REALLY ? :-) It's up to you.
Marc
I don't use nvarchar personally :-) I always use varchar.
However, I tend to use 100 for name and 1000 for comments. Trapping and dealing with longer strings is something the client can do, say via regex, so SQL only gets the data it expects.
You can avoid truncation errors by parameterising the calls, for example via stored procs.
If the parameter is defined as varchar(200), say, then truncation happens silently if you send more than 200 characters. The truncation error is thrown only for an INSERT or UPDATE statement: with parameters it won't happen.
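A small sketch of that difference, assuming the default ANSI_WARNINGS ON setting (names are made up):
-- Silent truncation when assigning to a shorter parameter/variable
DECLARE @p varchar(10) = 'this string is longer than ten characters';
SELECT @p;  -- returns 'this strin' with no error

-- INSERT into a too-small column raises the truncation error instead
CREATE TABLE #t (Comment varchar(10));
INSERT INTO #t (Comment) VALUES ('this string is longer than ten characters');  -- fails: string would be truncated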
The 255 "limit" for SQL Server goes back to 6.5 because vachar was limited to 255. SQL Server 7.0 + changed to 8000 and added support for unicode
Edit:
Why I don't use nvarchar: Double memory footprint, double index size, double disk size, simply don't need it. I work for a big Swiss company with offices globally so I'm not being parochial.
Also discussed here: varchar vs nvarchar performance
On further reflection, I'd suggest unicode appeals to client developers but as a developer DBA I focus on performance and efficiency...
It depends on what the field represents. If I'm doing a quick prototype I leave the defaults of 255. For anything like comments etc I'd probably put it to 1000.
The only way I'd make it smaller, really, is on things I definitely know the size of: zip codes or NI numbers, etc.
For columns that you need to have certain constraints on - like names, emails, addresses, etc. - you should put a reasonably high max length. For instance, a first name of more than 50 characters seems a bit suspicious, and an input above that size will probably contain more than just a first name. But for the initial design of a database, take that reasonable size and double it. So for first names, set it to 100 (or 200 if 100 is your 'reasonable size'). Then put the app in production, let the users play around for a sufficiently long time to gather data, and then check the actual max(len(FirstName)). Are there any suspicious values there? Anything above 50 chars? Find out what's in there and see if it's actually a first name or not. If it's not, the input form probably needs better explanations/validations.
Do the same for comments; set them to nvarchar(max) initially. Then come back when your database has grown enough for you to start optimizing performance. Take the max length of the comments, double it, and you have a good max length for your column.
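The audit described above is a couple of one-liners (hypothetical table and column names):
-- Sketch: find the longest value actually stored, then inspect the outliers
SELECT MAX(LEN(FirstName)) AS LongestFirstName FROM dbo.Person;
SELECT FirstName FROM dbo.Person WHERE LEN(FirstName) > 50;  -- suspicious values to review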

What is the difference between varchar and nvarchar?

Is it just that nvarchar supports multibyte characters? If that is the case, is there really any point, other than storage concerns, to using varchars?
An nvarchar column can store any Unicode data. A varchar column is restricted to an 8-bit codepage. Some people think that varchar should be used because it takes up less space. I believe this is not the correct answer. Codepage incompatibilities are a pain, and Unicode is the cure for codepage problems. With cheap disk and memory nowadays, there is really no reason to waste time mucking around with code pages anymore.
All modern operating systems and development platforms use Unicode internally. By using nvarchar rather than varchar, you can avoid doing encoding conversions every time you read from or write to the database. Conversions take time, and are prone to errors. And recovery from conversion errors is a non-trivial problem.
If you are interfacing with an application that uses only ASCII, I would still recommend using Unicode in the database. The OS and database collation algorithms will work better with Unicode. Unicode avoids conversion problems when interfacing with other systems. And you will be preparing for the future. And you can always validate that your data is restricted to 7-bit ASCII for whatever legacy system you're having to maintain, even while enjoying some of the benefits of full Unicode storage.
varchar: Variable-length, non-Unicode character data. The database collation determines which code page the data is stored using.
nvarchar: Variable-length Unicode character data. Dependent on the database collation for comparisons.
Armed with this knowledge, use whichever one matches your input data (ASCII v. Unicode).
I always use nvarchar as it allows whatever I'm building to withstand pretty much any data I throw at it. My CMS system does Chinese by accident, because I used nvarchar. These days, any new applications shouldn't really be concerned with the amount of space required.
It depends on how Oracle was installed. During the installation process, the NLS_CHARACTERSET option is set. You may be able to find it with the query SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET'.
If your NLS_CHARACTERSET is a Unicode encoding like UTF8, great. Using VARCHAR and NVARCHAR are pretty much identical. Stop reading now, just go for it. Otherwise, or if you have no control over the Oracle character set, read on.
VARCHAR — Data is stored in the NLS_CHARACTERSET encoding. If there are other database instances on the same server, you may be restricted by them; and vice versa, since you have to share the setting. Such a field can store any data that can be encoded using that character set, and nothing else. So for example if the character set is MS-1252, you can only store characters like English letters, a handful of accented letters, and a few others (like € and —). Your application would be useful only to a few locales, unable to operate anywhere else in the world. For this reason, it is considered A Bad Idea.
NVARCHAR — Data is stored in a Unicode encoding. Every language is supported. A Good Idea.
What about storage space? VARCHAR is generally efficient, since the character set / encoding was custom-designed for a specific locale. NVARCHAR fields store either in UTF-8 or UTF-16 encoding, based on the NLS setting, ironically enough. UTF-8 is very efficient for "Western" languages, while still supporting Asian languages. UTF-16 is very efficient for Asian languages, while still supporting "Western" languages. If you are concerned about storage space, pick an NLS setting that causes Oracle to use UTF-8 or UTF-16 as appropriate.
What about processing speed? Most new coding platforms use Unicode natively (Java, .NET, even C++ std::wstring from years ago!) so if the database field is VARCHAR it forces Oracle to convert between character sets on every read or write, not so good. Using NVARCHAR avoids the conversion.
Bottom line: Use NVARCHAR! It avoids limitations and dependencies, is fine for storage space, and usually best for performance too.
nvarchar stores data as Unicode, so, if you're going to store multilingual data (more than one language) in a data column you need the N variant.
varchar is used for non-Unicode characters only; nvarchar, on the other hand, is used for both Unicode and non-Unicode characters. Some other differences between them are given below.
VARCHAR vs. NVARCHAR
Character data type: VARCHAR holds variable-length, non-Unicode characters; NVARCHAR holds variable-length Unicode characters, including Japanese, Korean, and Chinese.
Maximum length: VARCHAR up to 8,000 characters; NVARCHAR up to 4,000 characters.
Character size: VARCHAR takes up 1 byte per character; NVARCHAR takes up 2 bytes per character.
Storage size: VARCHAR uses the actual length in bytes; NVARCHAR uses 2 times the actual length in bytes.
Usage: use VARCHAR when data length is variable and the actual data is always well below capacity; because of the extra storage, use NVARCHAR only when you need Unicode support, such as Japanese Kanji or Korean Hangul characters.
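You can see the storage difference directly; a tiny sketch using literals:
-- Sketch: nvarchar stores 2 bytes per character, varchar stores 1
SELECT DATALENGTH('hello')  AS VarcharBytes,   -- 5
       DATALENGTH(N'hello') AS NvarcharBytes;  -- 10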
The main differences between varchar(n) and nvarchar(n) are:
varchar (variable-length, non-Unicode character data), with a size of up to 8,000:
It is a variable-length data type
Used to store non-Unicode characters
Occupies 1 byte of space for each character
nvarchar (variable-length Unicode character data):
It is a variable-length data type
Used to store Unicode characters
Data is stored in a Unicode encoding, so every language is supported (for example Arabic, German, Hindi, and so on)
My two cents
Indexes can fail when not using the correct datatypes:
In SQL Server: when you have an index over a VARCHAR column and you compare it against a Unicode string, SQL Server does not make use of the index. The same thing happens when you present a BigInt to an indexed column containing SmallInt. Even if the BigInt is small enough to be a SmallInt, SQL Server is not able to use the index. The other way around you do not have this problem (when providing a SmallInt or an ANSI string to an indexed BigInt or NVARCHAR column).
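A sketch of the anti-pattern with hypothetical names: the N prefix makes the literal nvarchar, which forces an implicit conversion of the varchar column and, depending on the column's collation, can prevent an efficient index seek:
-- Anti-pattern: nvarchar literal compared against an indexed varchar column
SELECT * FROM dbo.Customer WHERE LastName = N'Smith';  -- implicit conversion, often a scan

-- Better: match the column's type so the index can be used normally
SELECT * FROM dbo.Customer WHERE LastName = 'Smith';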
Datatypes can vary between different DBMSs (database management systems):
Know that every database has slightly different datatypes and VARCHAR does not mean the same thing everywhere. While SQL Server has VARCHAR and NVARCHAR, an Apache Derby database has only VARCHAR, and there VARCHAR is Unicode.
Mainly, nvarchar stores Unicode characters and varchar stores non-Unicode characters.
"Unicode" here means a 16-bit character encoding scheme that allows characters from many other languages, such as Arabic, Hebrew, Chinese and Japanese, to be encoded in a single character set.
That means Unicode uses 2 bytes per character to store data, while non-Unicode uses only one byte per character, so Unicode needs double the capacity of non-Unicode storage.
Since SQL Server 2019, varchar columns support UTF-8 encoding (see the sketch below).
Thus, from now on, the difference is size.
In a database system that translates to a difference in speed.
Less data = less IO + less memory = more speed in general.
Go for varchar in UTF-8 from now on!
Only if a big percentage of your data contains characters in the ranges 2048 - 16383 and 16384 - 65535 (where UTF-8 needs three bytes per character instead of UTF-16's two) will you have to measure.
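A sketch of opting in to UTF-8 on SQL Server 2019+, using one of the standard _UTF8 collations (table and column names are illustrative):
-- Sketch: varchar with a UTF-8 collation can store Unicode text (SQL Server 2019+)
CREATE TABLE dbo.Post
(
    PostId int NOT NULL PRIMARY KEY,
    Body   varchar(8000) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NULL
);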
You're right. nvarchar stores Unicode data while varchar stores single-byte character data. Other than storage differences (nvarchar requires twice the storage space as varchar), which you already mentioned, the main reason for preferring nvarchar over varchar would be internationalization (i.e. storing strings in other languages).
nVarchar will help you to store Unicode characters. It is the way to go if you want to store localized data.
I would say, it depends.
If you develop a desktop application, where the OS works in Unicode (like all current Windows systems) and language does natively support Unicode (default strings are Unicode, like in Java or C#), then go nvarchar.
If you develop a web application, where strings come in as UTF-8, and language is PHP, which still does not support Unicode natively (in versions 5.x), then varchar will probably be a better choice.
If a single byte is used to store a character, there are 256 possible combinations, and thereby you can save 256 different characters. Collation is the pattern which defines the characters and the rules by which they are compared and sorted.
1252, which is the Latin1 (ANSI), is the most common. Single-byte character sets are also inadequate to store all the characters used by many languages. For example, some Asian languages have thousands of characters, so they must use two bytes per character.
Unicode standard
When systems using multiple code pages are used in a network, it becomes difficult to manage communication. To standardize things, ISO and the Unicode Consortium introduced Unicode. Unicode uses two bytes to store each character; that means 65,536 different characters can be defined, so almost all characters can be covered by Unicode. If two computers use Unicode, every symbol is represented in the same way and no conversion is needed - this is the idea behind Unicode.
SQL Server has two categories of character datatypes:
non-Unicode (char, varchar, and text)
Unicode (nchar, nvarchar, and ntext)
If we need to save character data from multiple countries, always use Unicode.
Although NVARCHAR stores Unicode, you should consider that, with the help of collation, you can also use VARCHAR and save data in your local language.
Just imagine the following scenario.
The collation of your DB is Persian and you save a value like 'علی' (the Persian spelling of Ali) in a VARCHAR(10) column. There is no problem, and the DBMS uses only three bytes to store it.
However, if you want to transfer your data to another database and see the correct result, your destination database must have the same collation as the source, which is Persian in this example.
If your target collation is different, you will see question marks (?) in the target database.
Finally, remember that if you are using a huge database that only serves your local language, I would recommend using the right collation instead of wasting the extra space.
I believe the design can be different. It depends on the environment you work on.
I had a look at the answers and many seem to recommend using nvarchar over varchar, because space is not a problem anymore, so there is no harm in enabling Unicode for a little extra storage. Well, this is not always true when you want to apply an index over your column. SQL Server has a limit of 900 bytes on the size of an index key (1,700 bytes for nonclustered index keys from SQL Server 2016). So if you have a varchar(900) you can still index it, but not varchar(901). With nvarchar, the number of characters is halved, so you can index up to nvarchar(450). So if you are confident you don't need nvarchar, I don't recommend using it.
In general, in databases, I recommend sticking to the size you need, because you can always expand. For example, a colleague at work once thought that there is no harm in using nvarchar(max) for a column, as we have no problem with storage at all. Later on, when we tried to apply an index over this column, SQL Server rejected it. If, however, he had started with even varchar(5), we could simply have expanded it later to what we needed, without a field migration plan to fix the problem.
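A sketch of the 900-byte clustered key limit in action (table and column names are made up):
-- Sketch: clustered index keys are limited to 900 bytes
CREATE TABLE dbo.Doc
(
    Title varchar(901) NOT NULL,
    CONSTRAINT PK_Doc PRIMARY KEY CLUSTERED (Title)
);
-- SQL Server warns that the key can exceed the 900-byte limit, and any row
-- whose Title is longer than 900 bytes will fail to insert.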
Jeffrey L Whitledge, with ~47,000 reputation, recommends the use of nvarchar.
Solomon Rutzky, with ~33,200 reputation, recommends: do NOT always use NVARCHAR. That is a very dangerous, and often costly, attitude/approach.
What are the main performance differences between varchar and nvarchar SQL Server data types?
https://www.sqlservercentral.com/articles/disk-is-cheap-orly-4
Both persons of such a high reputation, what does a learning sql server database developer choose?
There are many warnings in answers and comments about performance issues if you are not consistent in choices.
There are comments pro/con nvarchar for performance.
There are comments pro/con varchar for performance.
I have a particular requirement for a table with many hundreds of columns, which in itself is probably unusual ?
I'm choosing varchar to avoid going close to the 8060-byte table record size limit of SQL Server 2012.
Use of nvarchar, for me, goes over this 8060 byte limit.
I'm also thinking that I should match the data types of the related code tables to the data types of the primary central table.
I have seen use of varchar column at this place of work, South Australian Government, by previous experienced database developers, where the table row count is going to be several millions or more (and very few nvarchar columns, if any, in these very large tables), so perhaps the expected data row volumes becomes part of this decision.
I have to say here (and I realise that I'm probably going to open myself up to a slating!), but surely the only time when NVARCHAR is actually more useful (notice the "more" there!) than VARCHAR is when all of the collations on all of the dependent systems and within the database itself are the same...? If not, then collation conversion has to happen anyway, which makes VARCHAR just as viable as NVARCHAR.
To add to this, some database systems, such as SQL Server (before 2012), have a page size of approx. 8K. So, if you're looking at storing searchable data not held in something like a TEXT or NTEXT field, then VARCHAR provides the full 8K's worth of space, whereas NVARCHAR only provides 4K (double the bytes, double the space).
I suppose, to summarise, the use of either is dependent on:
Project or context
Infrastructure
Database system
Follow the article "Difference Between Sql Server VARCHAR and NVARCHAR Data Type". There you can see the differences laid out very descriptively.
In general, nvarchar stores data as Unicode, so if you're going to store multilingual data (more than one language) in a data column, you need the N variant.
nvarchar is safer to use than varchar for keeping our code free of type-mismatch errors, because nvarchar also allows Unicode characters.
When we use a WHERE condition in a SQL Server query with the = operator, it will sometimes throw an error. A probable reason for this is that our mapping column is defined as varchar; if we define it as nvarchar, this problem may not happen. If we stick with varchar and want to avoid this issue, it is better to use the LIKE keyword rather than =.
varchar is suitable for storing non-Unicode data, which means a limited set of characters. nvarchar is a superset of varchar, so along with the characters we can store using varchar, we can store even more without losing any functionality.
Someone commented that storage/space is not an issue nowadays. Even if space is not an issue for one, identifying an optimal data type should be a requirement.
It's not only about storage! "Data moves", and you can see where I am leading with this!

What datatype should be used for storing phone numbers in SQL Server 2005?

I need to store phone numbers in a table. Please suggest which datatype should I use?
Wait. Please read on before you hit reply..
This field needs to be indexed heavily as Sales Reps can use this field for searching (including wild character search).
As of now, we are expecting phone numbers to come in a number of formats (from an XML file). Do I have to write a parser to convert them to a uniform format? There could be millions of records (with duplicates), and I don't want to tie up server resources (in activities like too much preprocessing) every time some source data comes through...
Any suggestions are welcome..
Update: I have no control over source data. Just that the structure of xml file is standard. Would like to keep the xml parsing to a minimum.
Once it is in database, retrieval should be quick. One crazy suggestion going on around here is that it should even work with Ajax AutoComplete feature (so Sales Reps can see the matching ones immediately). OMG!!
Does this include:
International numbers?
Extensions?
Other information besides the actual number (like "ask for bobby")?
If all of these are no, I would use a 10 char field and strip out all non-numeric data. If the first is a yes and the other two are no, I'd use two varchar(50) fields, one for the original input and one with all non-numeric data stripped out and used for indexing. If 2 or 3 are yes, I think I'd do two fields and some kind of crazy parser to determine what is extension or other data and deal with it appropriately. Of course you could avoid the 2nd column by doing something with the index where it strips out the extra characters when creating the index, but I'd just make a second column and probably do the stripping of characters with a trigger.
Update: to address the AJAX issue, it may not be as bad as you think. If this is realistically the main way anything is done to the table, store only the digits in a secondary column as I said, and then make the index for that column the clustered one.
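One way to sketch the two-column idea (hypothetical names; the digits-only value would be populated by the parser or trigger described above):
-- Sketch: keep the raw input plus a digits-only copy used for indexed searches
CREATE TABLE dbo.Contact
(
    ContactId   int         NOT NULL PRIMARY KEY NONCLUSTERED,
    PhoneRaw    varchar(50) NULL,   -- as received from the XML feed
    PhoneDigits varchar(50) NULL    -- digits only, e.g. '18005551234'
);
CREATE CLUSTERED INDEX CX_Contact_PhoneDigits ON dbo.Contact (PhoneDigits);

-- Searches (including wildcard searches) can then use the clustered index
SELECT ContactId, PhoneRaw FROM dbo.Contact WHERE PhoneDigits LIKE '1800555%';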
We use varchar(15) and certainly index on that field.
The reason is that international standards can support up to 15 digits.
Wikipedia - Telephone Number Formats
If you do support International numbers, I recommend the separate storage of a World Zone Code or Country Code to better filter queries by so that you do not find yourself parsing and checking the length of your phone number fields to limit the returned calls to USA for example
Use CHAR(10) if you are storing US Phone numbers only. Remove everything but the digits.
I'm probably missing the obvious here, but wouldn't a varchar just long enough for your longest expected phone number work well?
If I am missing something obvious, I'd love it if someone would point it out...
I would use a varchar(22). Big enough to hold a north american phone number with extension. You would want to strip out all the nasty '(', ')', '-' characters, or just parse them all into one uniform format.
Alex
nvarchar with preprocessing to standardize them as much as possible. You'll probably want to extract extensions and store them in another field.
SQL Server 2005 is pretty well optimized for substring queries for text in indexed varchar fields. 2005 introduced string summary statistics for index fields, which helps significantly with this kind of search.
Using varchar is pretty inefficient. Use the money type, create a user-defined type "phonenumber" out of it, and create a rule to only allow positive numbers.
If you declare it as (19,4), you can even store a 4-digit extension, it is big enough for international numbers, and it only takes 9 bytes of storage. Also, indexes are speedy.
Normalise the data then store as a varchar. Normalising could be tricky.
That should be a one-time hit. Then as a new record comes in, you're comparing it to normalised data. Should be very fast.
Since you need to accommodate many different phone number formats (and probably include things like extensions etc.) it may make the most sense to just treat it as you would any other varchar. If you could control the input, you could take a number of approaches to make the data more useful, but it doesn't sound that way.
Once you decide to simply treat it as any other string, you can focus on overcoming the inevitable issues regarding bad data, mysterious phone number formatting, and whatever else will pop up. The challenge will be in building a good search strategy for the data, not how you store it, in my opinion. It's always a difficult task having to deal with a large pile of data which you had no control over collecting.
Use SSIS to extract and process the information. That way you will have the processing of the XML files separated from SQL Server. You can also do the SSIS transformations on a separate server if needed. Store the phone numbers in a standard format using VARCHAR. NVARCHAR would be unnecessary since we are talking about numbers and maybe a couple of other chars, like '+', ' ', '(', ')' and '-'.
Use a varchar field with a length restriction.
It is fairly common to use an "x" or "ext" to indicate extensions, so allow 15 characters (for full international support) plus 3 (for "ext") plus 4 (for the extension itself) giving a total of 22 characters. That should keep you safe.
Alternatively, normalise on input so any "ext" gets translated to "x", giving a maximum of 20.
It is always better to have separate tables for multi-valued attributes like phone numbers.
As you have no control over the source data, you can parse the data from the XML file, convert it into a proper format (so that there will not be any issues with the formats of particular countries), and store it in a separate table so that both indexing and retrieval will be efficient.
Thank you.
I realize this thread is old, but it's worth mentioning an advantage of storing as a numeric type for formatting purposes, specifically in .NET framework.
i.e.
.DefaultCellStyle.Format = "(###)###-####" // Will not work on a string
Use the bigint data type instead. Don't use int, because it only allows whole numbers between -2,147,483,648 and 2,147,483,647, which is too small for an 11-digit phone number, whereas bigint can hold values up to 9,223,372,036,854,775,807.
For most cases, it will be done with bigint
Just save unformatted phone numbers like: 19876543210, 02125551212, etc.
Check the topic about bigint vs varchar
