Should NVARCHAR be used to save 'accented characters' into SQL Server? - sql-server

I have the following two fields in a Sql Server table:
When I add some test data with accented characters into the field, it actually stores them! I thought I had to change the column from VARCHAR to NVARCHAR to accept accented characters, etc?
Basically, I thought:
VARCHAR = ASCII
NVARCHAR = Unicode
So is this a case where façade etc are actually ASCII .. while some other characters would error (if VARCHAR)?
I can see the ç and é characters in the extended ASCII chart (link above) .. so does this mean ASCII includes 0->127 or 0->255?
(Side thought: I guess I'm happy with accepting 0->255 and stripping out anything else.)
Edit
DB collation: Latin1_General_CI_AS
Server Version: 12.0.5223.6
Server Collation: SQL_Latin1_General_CP1_CI_AS

First the details of what Sql Server is doing.
VARCHAR stores single-byte characters using a specific collation. ASCII only uses 7 bits, or half of the possible values in a byte. A collation references a specific code page (along with sorting and equating rules) to use the other half of the possible values in each byte. These code pages often include support for a limited and specific set of accented characters. If the code page used for your data supports an accented character, you can store it; if it doesn't, you see weird results (unprintable "box" or ? characters). You can even output data stored in one collation as if it had been stored in another, and get really weird stuff that way (but don't do this).
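A quick way to see this in action is a sketch like the following (the collation here is just an example of a code page 1252 collation; the Cyrillic letter stands in for a character outside that code page):
DECLARE @t TABLE (
    v VARCHAR(20) COLLATE Latin1_General_CI_AS,   -- code page 1252
    n NVARCHAR(20)
);
INSERT INTO @t (v, n) VALUES (N'façade Ж', N'façade Ж');
SELECT v, n FROM @t;
-- v comes back as 'façade ?' (Ж is not in code page 1252), n as 'façade Ж'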
NVARCHAR is Unicode, but there is still some reliance on collations. In most situations, you will end up with UTF-16, which does allow for the full range of unicode characters. Certain collations will result instead in UCS-2, which is slightly more limited. See the nchar/nvarchar documentation for more information.
As an additional quirk, SQL Server 2019 (upcoming at the time of writing) will include support for UTF-8 in the char and varchar types when using the correct collation.
Now to answer the question.
In some rare cases, where you are sure your data only needs to support accent characters originating from a single specific (usually local) culture, and only those specific accent characters, you can get by with the varchar type.
But be very careful making this determination. In an increasingly global and diverse world, where even small businesses want to take advantage of the internet to increase their reach, even within their own community, using an insufficient encoding can easily result in bugs and even security vulnerabilities. The majority of situations where it seems like a varchar encoding might be good enough are really not safe anymore.
Personally, about the only place I use varchar today is mnemonic code strings that are never shown to or provided by an end user; things that might be enum values in procedural code. Even then, this tends to be legacy code, and given the option I'll use integer values instead, for faster joins and more efficient memory use. However, the upcoming UTF-8 support may change this.

VARCHAR is limited to the code page of the column's collation - so the set of characters you can save depends on which code page that is.
NVARCHAR is Unicode, so you can store all the characters.

Related

NVARCHAR storing characters not supported by UCS-2 encoding on SQL Server

According to SQL Server's documentation (and legacy documentation), an nvarchar field without an _SC collation should use the UCS-2 ENCODING.
Starting with SQL Server 2012 (11.x), when a Supplementary Character
(SC) enabled collation is used, these data types store the full range
of Unicode character data and use the UTF-16 character encoding. If a
non-SC collation is specified, then these data types store only the
subset of character data supported by the UCS-2 character encoding.
It also states that the UCS-2 ENCODING stores only the subset of characters supported by UCS-2. From Wikipedia's UCS-2 description:
UCS-2, uses a single code value [...] between 0 and 65,535 for each
character, and allows exactly two bytes (one 16-bit word) to represent
that value. UCS-2 thereby permits a binary representation of every
code point in the BMP that represents a character. UCS-2 cannot
represent code points outside the BMP.
So, by the specifications above, it seems that I shouldn't be able to store an emoji like 😍, which has a value of 0x1F60D (128525 in decimal, well above the 65,535 limit of UCS-2). But on SQL Server 2008 R2 or SQL Server 2019 (both with the default SQL_Latin1_General_CP1_CI_AS COLLATION), in an nvarchar field, it's perfectly stored and returned (although not supported in comparisons with LIKE or =):
SSMS doesn't render the emoji correctly, but here is the value copied and pasted from the query result: 😍
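The test amounts to something like the following sketch (the table name is hypothetical; the column collation matches the field collation noted below):
CREATE TABLE dbo.EmojiTest (Val NVARCHAR(10) COLLATE Latin1_General_CS_AS);
INSERT INTO dbo.EmojiTest (Val) VALUES (N'😍');
SELECT Val,                       -- the emoji comes back intact
       DATALENGTH(Val) AS Bytes   -- 4 bytes: one UTF-16 surrogate pair
FROM dbo.EmojiTest;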
So my questions are:
Is the nvarchar field really using UCS-2 on SQL Server 2008 R2 (I also tested on SQL Server 2019, with the same non-_SC collations, and got the same results)?
Is Microsoft's documentation of nchar/nvarchar misleading about "then these data types store only the subset of character data supported by the UCS-2 character encoding"?
Does UCS-2 ENCODING support or not code points beyond 65535?
How was SQL Server able to correctly store and retrieve this field's data, when it's outside the support of the UCS-2 ENCODING?
NOTE: Server's Collation is SQL_Latin1_General_CP1_CI_AS and Field's Collation is Latin1_General_CS_AS.
NOTE 2: The original question stated tests on SQL Server 2008. I tested and got the same results on SQL Server 2019, with the same respective COLLATIONs.
NOTE 3: Every other character I tested, outside the UCS-2 supported range, behaves the same way. Some are: 𝕂, 😂, 𨭎, 𝕬, 𝓰
There are several clarifications to make here regarding the MS documentation snippets posted in the question, and for the sample code, for the questions themselves, and for statements made in the comments on the question. Most of the confusion can be cleared up, I believe, by the information provided in the following post of mine:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide
First things first (which is the only way it can be, right?): I'm not insulting the people who wrote the MS documentation as SQL Server alone is a huge product and there is a lot to cover, etc, but for the moment (until I get a chance to update it), please read the "official" documentation with a sense of caution. There are several misstatements regarding Collations / Unicode.
UCS-2 is an encoding that handles a subset of the Unicode character set. It works in 2-byte units. With 2 bytes, you can encode values 0 - 65535. This range of code points is known as the BMP (Basic Multilingual Plane). The BMP is all of the characters that are not Supplementary Characters (because those are supplementary to the BMP), but it does contain a set of code points that are exclusively used to encode Supplementary Characters in UTF-16 (i.e. the 2048 surrogate code points). This is a complete subset of UTF-16.
UTF-16 is an encoding that handles all of the Unicode character set. It also works in 2-byte units. In fact, there is no difference between UCS-2 and UTF-16 regarding the BMP code points and characters. The difference is that UTF-16 makes use of those 2048 surrogate code points in the BMP to create surrogate pairs which are the encodings for all Supplementary Characters. While Supplementary Characters are 4-bytes (in UTF-8, UTF-16, and UTF-32), they are really two 2-byte code units when encoding in UTF-16 (likewise, they are four 1-byte units in UTF-8, and one 4-byte in UTF-32).
Since UTF-16 merely extends what can be done with UCS-2 (by actually defining the usage of the surrogate code points), there is absolutely no difference in the byte sequences that can be stored in either case. All 2048 surrogate code points used to create Supplementary Characters in UTF-16 are valid code points in UCS-2, they just don't have any defined usage (i.e. interpretation) in UCS-2.
NVARCHAR, NCHAR, and the deprecated-so-do-NOT-use-it-NTEXT datatypes all store Unicode characters encoded in UCS-2 / UTF-16. From a storage perspective there is absolutely NO difference. So, it doesn't matter if something (even outside of SQL Server) says that it can store UCS-2. If it can do that, then it can inherently store UTF-16. In fact, while I have not had a chance to update the post linked above, I have been able to store and retrieve, as expected, emojis (most of which are Supplementary Characters) in SQL Server 2000 running on Windows XP. There were no Supplementary Characters defined until 2003, I think, and certainly not in 1999 when SQL Server 2000 was being developed. In fact (again), UCS-2 was only used in Windows / SQL Server because Microsoft pushed ahead with development prior to UTF-16 being finalized and published (and as soon as it was, UCS-2 became obsolete).
The only difference between UCS-2 and UTF-16 is that UTF-16 knows how to interpret surrogate pairs (comprised of a pair of surrogate code points, so at least they're appropriately named). This is where the _SC collations (and, starting in SQL Server 2017, also version _140_ collations which include support for Supplementary Characters so none of them have the _SC in their name) come in: they allow the built-in SQL Server functions to correctly interpret Supplementary Characters. That's it! Those collations have nothing to do with storing and retrieving Supplementary Characters, nor do they even have anything to do with sorting or comparing them (even though the "Collation and Unicode Support" documentation says specifically that this is what those collations do - another item on my "to do" list to fix). For collations that have neither _SC nor _140_ in their name (though the new-as-of-SQL Server 2019 Latin1_General_100_BIN2_UTF8 might be grey-area, at least, I remember there being some inconsistency either there or with the Japanese_*_140_BIN2 collations), the built-in functions only handle BMP code points (i.e. UCS-2).
Not "handling" Supplementary Characters means not interpreting a valid sequence of two surrogate code points as actually being a singular supplementary code point. So, for non-"SC" collations, BMP surrogate code point 1 (B1) and BMP surrogate code point 2 (B2) are just those two code points, neither one of which is defined, hence they appear as two "nothing"s (i.e. B1 followed by B2). This is why it is possible to split a Supplementary Character in two using SUBSTRING / LEFT / RIGHT because they won't know to keep those two BMP code points together. But an "SC" collation will read those code points B1 and B2 from disk or memory and see a single Supplementary code point S. Now it can be handled correctly via SUBSTRING / CHARINDEX / etc.
The NCHAR() function (not the datatype; yes, poorly named function ;) is also sensitive to whether or not the default collation of the current database supports Supplementary Characters. If yes, then passing in a value between 65536 and 1114111 (the Supplementary Character range) will return a non-NULL value. If not, then passing in any value above 65535 will return NULL. (Of course, it would be far better if NCHAR() just always worked, given that storing / retrieving always works, so please vote for this suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation ).
Fortunately, you don't need an "SC" collation to output a Supplementary Character. You can either paste in the literal character, or convert the UTF-16 Little Endian encoded surrogate pair, or use the NCHAR() function to output the surrogate pair. The following works in SQL Server 2000 (using SSMS 2005) running on Windows XP:
SELECT N'💩', -- 💩
CONVERT(VARBINARY(4), N'💩'), -- 0x3DD8A9DC
CONVERT(NVARCHAR(10), 0x3DD8A9DC), -- 💩 (regardless of DB Collation)
NCHAR(0xD83D) + NCHAR(0xDCA9) -- 💩 (regardless of DB Collation)
For more details on creating Supplementary Characters when using non-"SC" collations, please see my answer to the following DBA.SE question:
How do I set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character?
None of this affects what you see. If you store a code point, then it's there. How it behaves - sorting, comparison, etc - is controlled by collations. But, how it appears is controlled by fonts and the OS. No font can contain all characters, so different fonts contain different sets of characters, with a lot of overlap on the more widely used characters. However, if a font has a particular byte sequence mapped, then it can display that character. This is why the only work required to get Supplementary Characters displaying correctly in SQL Server 2000 (using SSMS 2005) running on Windows XP was to add a font containing the characters and doing one or two minor registry edits (no changes to SQL Server).
Supplementary Characters in SQL_* collations and collations without a version number in their name have no sort weights. Hence, they all equate to each other as well as to any other BMP code points that have no sort weights (including "space" (U+0020) and "null" (U+0000)). They started to fix this in the version _90_ collations.
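Assuming the behaviour described above, a sketch of the effect looks like this (the collation names are just examples of a versionless SQL collation and a version _100_ collation):
SELECT CASE WHEN N'😍' = N'' COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal' ELSE 'not equal' END AS SqlCollation,        -- 'equal': no sort weight
       CASE WHEN N'😍' = N'' COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'not equal' END AS Version100Collation; -- 'not equal'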
SSMS has nothing to do with any of this, outside of possibly needing the font used for the query editor and/or grid results and/or errors + messages changed to one that has the desired characters. (SSMS doesn't render anything outside of maybe spatial data; characters are rendered by the display driver + font definitions + maybe something else).
Therefore, the following statement in the documentation (from the question):
If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.
is both nonsensical and incorrect. They were probably intending to say the datatypes would only store a subset of the UTF-16 encoding (since UCS-2 is the subset). Also, even if it said "UTF-16 character encoding" it would still be wrong because the bytes that you pass in will be stored (assuming enough free space in the column or variable).

How big should my nvarchar column be to store a maximum 255 characters?

I need to store 255 characters in a database column of type nvarchar. The characters are UTF-8 and can be multibyte. I am not the best with character encodings, so I'm not sure if that makes sense. I want to hold 255 characters that can be in any language, etc.
You can find some simple-to-understand background information about different Unicode encodings in this, which is a chapter I wrote in a manual for an open-source project. That background information will help you to understand some of the details in my answer.
The link to documentation about nvarchar provided by Simmo states that nvarchar is stored in UCS-2 format. Because of this, you will need to convert the UTF-8 strings into UCS-2 strings before storing them in the database. You can find C++ code to do that conversion here.
A subtle but important point is that the conversion code will actually convert into UTF-16, which is a superset of UCS-2 (UTF-16 supports the use of surrogate pairs, while UCS-2 doesn't). I don't use SQL Server so I don't know if it will complain if you try to insert some surrogate pairs into it. (Perhaps somebody else here can confirm whether or not it will).
If SQL Server disallows surrogate pairs, then there will be a limit on the range of languages your application can support, but at least you know that nvarchar(255) is sufficient for your needs.
On the other hand, if SQL Server allows the use of surrogate pairs, then you might want to use nvarchar(510) to allow for the (remote) possibility that every single character will be composed of surrogate pairs.
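A quick way to see how SQL Server counts this is a minimal sketch (results assume a non-SC database collation): LEN counts UTF-16 code units here, DATALENGTH counts bytes, and the n in nvarchar(n) is counted in byte-pairs, so a surrogate-pair character consumes two units of n.
SELECT LEN(N'aé😍')        AS CodeUnits,  -- 4: 'a' + 'é' + the two halves of the surrogate pair
       DATALENGTH(N'aé😍') AS Bytes;      -- 8 bytes: 2 + 2 + 4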
http://msdn.microsoft.com/en-us/library/ms186939.aspx
255 characters.

Why does SQL Server consider N'㐒㐒㐒㐒' and N'㐒㐒㐒' to be equal?

We are testing our application for Unicode compatibility and have been selecting random characters outside the Latin character set for testing.
On both Latin and Japanese-collated systems the following equality is true (U+3422):
N'㐒㐒㐒㐒' = N'㐒㐒㐒'
but the following is not (U+30C1):
N'チチチチ' = N'チチチ'
This was discovered when a test case using the first example (using U+3422) violated a unique index. Do we need to be more selective about the characters we use for testing? Obviously we don't know the semantic meaning of the above comparisons. Would this behavior be obvious to a native speaker?
Michael Kaplan has a blog post where he explains how Unicode strings are compared. It all comes down to the point that a string needs to have a weight; if it doesn't, it will be considered equal to the empty string.
Sorting it all Out: The jury will give this string no weight
In SQL Server this weight is influenced by the defined collation. Microsoft has added appropriate collations for CJK Unified Ideographs in Windows XP/2003 and SQL Server 2005. This post recommends using Chinese_Simplified_Pinyin_100_CI_AS or Chinese_Simplified_Stroke_Order_100_CI_AS:
You can always use any binary and binary2 collations although it wouldn't give you Linguistic correct result. For SQL Server 2005, you SHOULD use Chinese_PRC_90_CI_AS or Chinese_PRC_Stroke_90_CI_AS which support surrogate pair comparison (but not linguistic). For SQL Server 2008, you should use Chinese_Simplified_Pinyin_100_CI_AS and Chinese_Simplified_Stroke_Order_100_CI_AS which have better linguistic surrogate comparison. I do suggest you use these collations as your server/database/table collation instead of passing the collation name during comparison.
So the following SQL statement would work as expected:
select * from MyTable where N'' = N'㐀' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS;
A list of all supported collations can be found in MSDN:
SQL Server 2008 Books Online: Windows Collation Name
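A sketch of the behaviour from the question (the collation names are examples; the first comparison uses a collation that gives U+3422 no weight, the second a _100_ Chinese collation that does):
SELECT CASE WHEN N'㐒㐒㐒㐒' = N'㐒㐒㐒' COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'not equal' END AS LatinCollation,        -- 'equal'
       CASE WHEN N'㐒㐒㐒㐒' = N'㐒㐒㐒' COLLATE Chinese_Simplified_Stroke_Order_100_CI_AS
            THEN 'equal' ELSE 'not equal' END AS StrokeOrderCollation;  -- 'not equal'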
That character U+3422 is from the CJK Unified Ideographs tables, which are a relatively obscure (and politically loaded) part of the unicode standard. My guess is that SQL Server simply does not know that part - or perhaps even intentionally does not implement it due to political considerations.
Edit: looks like my guess was wrong and the real problem was that neither Latin nor Japanese collation define weights for that character.
If you look at the Unihan data page, the character appears to only have the "K-Source" field which corresponds to the South Korean government's mappings.
My guess is that MS SQL asks "is this character a Chinese character?" If so then use the Japanese sorting standard, discarding the character if the collation number isn't available - likely a SQL server-specific issue.
I very much doubt it's a political dispute as another poster suggested as the character doesn't even have a Taiwan or Hong Kong encoding mapping.
More technical info: The J-Source (the Japanese sorting order prescribed by the Japanese government) is blank as it probably was only used in classical Korean Hanja (chinese characters which are now only used in some contexts.)
The Japanese government's JIS sorting standards generally sort Kanji characters by the Japanese On reading (which is usually the approximated Chinese pronunciation when the characters were imported into Japan.) But this character probably isn't used much in Japanese and may not even have a Japanese pronunciation to associate with it, so hasn't been added to the data.

How BIG do you make your Nvarchar()

When designing a database, what do you consider when deciding how big your nvarchar columns should be?
If I were to make an address table, my gut reaction would be for address line 1 to be nvarchar(255), like an old Access database.
I have found that using this has got me into bother with the old 'string would be truncated' error. I know that this can be prevented by limiting the input box, but if a user really has an address line one that is over 255 characters, this should be allowed.
How big should I make my nvarchar(????)
My recommendation: make them just as big as you REALLY need them.
E.g. for a zip code column, 10-20 chars are definitely enough. Ditto for a phone number. E-mails might be longer, 50-100 chars. Names - well, I usually get by with 50 chars, ditto for first names. You can always and easily extend fields if you really need to - that's not a big undertaking at all.
There's really no point in making all varchar/nvarchar fields as big as they can be. After all, a SQL Server data page is fixed in size and limited to 8060 bytes per row. Having 10 fields of NVARCHAR(4000) is just asking for trouble... (since if you actually try to fill them with too much data, SQL Server will barf at you).
If you REALLY need a really big field, use NVARCHAR/VARCHAR(MAX) - those are stored in your page, as long as they fit, and will be sent to "overflow" storage if they get too big.
NVARCHAR vs. VARCHAR: this really boils down to: do you really need "exotic" characters, such as Japanese, Chinese, or other non-ASCII style characters? In Europe, even some of the Eastern European characters cannot be represented by VARCHAR fields anymore (they will be stripped of their háček). Western European languages (English, German, French, etc.) are all very well served by VARCHAR.
BUT: NVARCHAR does use twice as much space - on disk and in your SQL Server memory - at all times. If you really need it, you need it - but do you REALLY ? :-) It's up to you.
Marc
I don't use nvarchar personally :-) I always use varchar.
However, I tend to use 100 for name and 1000 for comments. Trapping and dealing with longer strings is something the client can do, say via regex, so SQL only gets the data it expects.
You can avoid truncation errors by parameterising the calls, for example via stored procs.
If the parameter is defined as varchar(200), say, then truncation happens silently if you send more than 200 characters. The truncation error is thrown only for an INSERT or UPDATE statement: with parameters it won't happen (see the sketch below).
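A sketch of both behaviours (the table and variable names are just examples):
DECLARE @p VARCHAR(5) = 'this string is too long';
SELECT @p;   -- silently truncated to 'this '

CREATE TABLE dbo.TruncDemo (c VARCHAR(5));
INSERT INTO dbo.TruncDemo (c) VALUES ('this string is too long');
-- fails with the 'String or binary data would be truncated' error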
The 255 "limit" for SQL Server goes back to 6.5 because vachar was limited to 255. SQL Server 7.0 + changed to 8000 and added support for unicode
Edit:
Why I don't use nvarchar: Double memory footprint, double index size, double disk size, simply don't need it. I work for a big Swiss company with offices globally so I'm not being parochial.
Also discussed here: varchar vs nvarchar performance
On further reflection, I'd suggest unicode appeals to client developers but as a developer DBA I focus on performance and efficiency...
It depends on what the field represents. If I'm doing a quick prototype I leave the defaults of 255. For anything like comments etc. I'd probably put it to 1000.
The only way I'd make it smaller really is on things I definitely know the size of: zip codes or NI numbers etc.
For columns that you need to have certain constraints on - like names, emails, addresses, etc - you should put a reasonably high max length. For instance a first name of more than 50 characters seems a bit suspicious, and an input above that size will probably contain more than just a first name. But for the initial design of a database, take that reasonable size and double it. So for first names, set it to 100 (or 200 if 100 is your 'reasonable size'). Then put the app in production, let the users play around for a sufficiently long time to gather data, and then check the actual max(len(FirstName)), as in the sketch below. Are there any suspicious values there? Anything above 50 chars? Find out what's in there and see if it's actually a first name or not. If it's not, the input form probably needs better explanations/validations.
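That check might look something like this (table and column names are hypothetical):
SELECT MAX(LEN(FirstName)) AS LongestFirstName
FROM dbo.Person;

-- and to inspect the suspicious rows:
SELECT FirstName
FROM dbo.Person
WHERE LEN(FirstName) > 50;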
Do the same for comments: set them to nvarchar(max) initially. Then come back when your database has grown enough for you to start optimizing performance. Take the max length of the comments, double it, and you have a good max length for your column.

What is the difference between varchar and nvarchar?

Is it just that nvarchar supports multibyte characters? If that is the case, is there really any point, other than storage concerns, to using varchars?
An nvarchar column can store any Unicode data. A varchar column is restricted to an 8-bit codepage. Some people think that varchar should be used because it takes up less space. I believe this is not the correct answer. Codepage incompatibilities are a pain, and Unicode is the cure for codepage problems. With cheap disk and memory nowadays, there is really no reason to waste time mucking around with code pages anymore.
All modern operating systems and development platforms use Unicode internally. By using nvarchar rather than varchar, you can avoid doing encoding conversions every time you read from or write to the database. Conversions take time, and are prone to errors. And recovery from conversion errors is a non-trivial problem.
If you are interfacing with an application that uses only ASCII, I would still recommend using Unicode in the database. The OS and database collation algorithms will work better with Unicode. Unicode avoids conversion problems when interfacing with other systems. And you will be preparing for the future. And you can always validate that your data is restricted to 7-bit ASCII for whatever legacy system you're having to maintain, even while enjoying some of the benefits of full Unicode storage.
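Such a validation could be a simple check like the one below (a sketch; the table and column names are hypothetical, and the pattern also flags control characters such as tabs and line breaks):
-- find rows containing anything outside the printable 7-bit ASCII range
SELECT Id, SomeText
FROM dbo.LegacyExport
WHERE SomeText LIKE '%[^ -~]%' COLLATE Latin1_General_BIN;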
varchar: Variable-length, non-Unicode character data. The database collation determines which code page the data is stored using.
nvarchar: Variable-length Unicode character data. Dependent on the database collation for comparisons.
Armed with this knowledge, use whichever one matches your input data (ASCII v. Unicode).
I always use nvarchar as it allows whatever I'm building to withstand pretty much any data I throw at it. My CMS system does Chinese by accident, because I used nvarchar. These days, any new applications shouldn't really be concerned with the amount of space required.
It depends on how Oracle was installed. During the installation process, the NLS_CHARACTERSET option is set. You may be able to find it with the query SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET'.
If your NLS_CHARACTERSET is a Unicode encoding like UTF8, great. VARCHAR and NVARCHAR are pretty much identical. Stop reading now, just go for it. Otherwise, or if you have no control over the Oracle character set, read on.
VARCHAR - Data is stored in the NLS_CHARACTERSET encoding. If there are other database instances on the same server, you may be restricted by them; and vice versa, since you have to share the setting. Such a field can store any data that can be encoded using that character set, and nothing else. So for example if the character set is MS-1252, you can only store characters like English letters, a handful of accented letters, and a few others (like € and —). Your application would be useful only to a few locales, unable to operate anywhere else in the world. For this reason, it is considered A Bad Idea.
NVARCHAR - Data is stored in a Unicode encoding. Every language is supported. A Good Idea.
What about storage space? VARCHAR is generally efficient, since the character set / encoding was custom-designed for a specific locale. NVARCHAR fields store either in UTF-8 or UTF-16 encoding, based on the NLS setting ironically enough. UTF-8 is very efficient for "Western" languages, while still supporting Asian languages. UTF-16 is very efficient for Asian languages, while still supporting "Western" languages. If concerned about storage space, pick an NLS setting to cause Oracle to use UTF-8 or UTF-16 as appropriate.
What about processing speed? Most new coding platforms use Unicode natively (Java, .NET, even C++ std::wstring from years ago!) so if the database field is VARCHAR it forces Oracle to convert between character sets on every read or write, not so good. Using NVARCHAR avoids the conversion.
Bottom line: Use NVARCHAR! It avoids limitations and dependencies, is fine for storage space, and usually best for performance too.
nvarchar stores data as Unicode, so, if you're going to store multilingual data (more than one language) in a data column you need the N variant.
varchar is used for non-Unicode characters only; nvarchar, on the other hand, is used for both Unicode and non-Unicode characters. Some other differences between them are given below.
VARCHAR vs. NVARCHAR
Character data type: VARCHAR holds variable-length, non-Unicode characters; NVARCHAR holds variable-length characters, both Unicode and non-Unicode, such as Japanese, Korean, and Chinese.
Maximum length: VARCHAR up to 8,000 characters; NVARCHAR up to 4,000 characters.
Character size: VARCHAR takes up 1 byte per character; NVARCHAR takes up 2 bytes per character (Unicode or not).
Storage size: VARCHAR uses the actual length in bytes; NVARCHAR uses 2 times the actual length in bytes.
Usage: VARCHAR is used when the data length is variable and the actual data is always well below capacity; NVARCHAR, because of the extra storage, is used only if you need Unicode support, such as Japanese Kanji or Korean Hangul characters.
The main difference between varchar(n) and nvarchar(n) is:
Varchar (variable-length, non-Unicode character data), size up to 8,000:
It is a variable-length data type
Used to store non-Unicode characters
Occupies 1 byte of space for each character
Nvarchar (variable-length Unicode character data):
It is a variable-length data type
Used to store Unicode characters
Data is stored in a Unicode encoding. Every language is supported (for example Arabic, German, Hindi, and so on)
My two cents
Indexes can fail when not using the correct datatypes:
In SQL Server: when you have an index over a VARCHAR column and present it with a Unicode string, SQL Server does not make use of the index. The same thing happens when you present a BigInt to an indexed column containing SmallInt. Even if the BigInt is small enough to be a SmallInt, SQL Server is not able to use the index. The other way around you do not have this problem (when providing a SmallInt or ANSI code to an indexed BigInt or NVARCHAR column).
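A sketch of the VARCHAR/NVARCHAR mismatch (table, index, and variable names are hypothetical; the scan-versus-seek behaviour is most pronounced with a SQL collation like the one used here):
CREATE TABLE dbo.Customer (
    CustomerId INT IDENTITY PRIMARY KEY,
    Email      VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS
);
CREATE INDEX IX_Customer_Email ON dbo.Customer (Email);

DECLARE @email NVARCHAR(100) = N'someone@example.com';
-- the VARCHAR column is implicitly converted to NVARCHAR, so the index seek is lost:
SELECT CustomerId FROM dbo.Customer WHERE Email = @email;
-- matching the datatype keeps the index seek:
SELECT CustomerId FROM dbo.Customer WHERE Email = CAST(@email AS VARCHAR(100));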
Datatypes can vary between different DBMSs (database management systems):
Know that every database has slightly different datatypes and VARCHAR does not mean the same everywhere. While SQL Server has VARCHAR and NVARCHAR, an Apache Derby database has only VARCHAR, and there VARCHAR is Unicode.
Mainly, nvarchar stores Unicode characters and varchar stores non-Unicode characters.
"Unicode" means a 16-bit character encoding scheme allowing characters from lots of other languages, like Arabic, Hebrew, Chinese and Japanese, to be encoded in a single character set.
That means Unicode uses 2 bytes per character to store, and non-Unicode uses only one byte per character, so Unicode needs double the capacity compared to non-Unicode.
Since SQL Server 2019 varchar columns support UTF-8 encoding.
Thus, from now on, the difference is size.
In a database system that translates to a difference in speed.
Less data = Less IO + Less Memory = More speed in general. Read the article above for the numbers.
Go for varchar in UTF8 from now on!
Only if you have a big percentage of data with characters in the ranges 2048-16383 and 16384-65535 (which take three bytes in UTF-8 instead of two in UTF-16) will you have to measure.
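A sketch of that measurement (requires SQL Server 2019 or later; the UTF-8 collation name is just an example):
DECLARE @t TABLE (
    Utf8  VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
    Utf16 NVARCHAR(50)
);
INSERT INTO @t (Utf8, Utf16) VALUES (N'Hello', N'Hello'), (N'日本語', N'日本語');
SELECT Utf8,  DATALENGTH(Utf8)  AS Utf8Bytes,   -- 5 for 'Hello', 9 for '日本語'
       Utf16, DATALENGTH(Utf16) AS Utf16Bytes   -- 10 for 'Hello', 6 for '日本語'
FROM @t;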
You're right. nvarchar stores Unicode data while varchar stores single-byte character data. Other than storage differences (nvarchar requires twice the storage space as varchar), which you already mentioned, the main reason for preferring nvarchar over varchar would be internationalization (i.e. storing strings in other languages).
nVarchar will help you to store Unicode characters. It is the way to go if you want to store localized data.
I would say, it depends.
If you develop a desktop application, where the OS works in Unicode (like all current Windows systems) and the language natively supports Unicode (default strings are Unicode, like in Java or C#), then go nvarchar.
If you develop a web application, where strings come in as UTF-8, and language is PHP, which still does not support Unicode natively (in versions 5.x), then varchar will probably be a better choice.
If a single byte is used to store a character, there are 256 possible combinations, and so you can represent 256 different characters. Collation is the pattern which defines the characters and the rules by which they are compared and sorted.
Code page 1252, which is Latin1 (ANSI), is the most common. Single-byte character sets are also inadequate to store all the characters used by many languages. For example, some Asian languages have thousands of characters, so they must use two bytes per character.
Unicode standard
When systems using multiple code pages are used in a network, it becomes difficult to manage communication. To standardize things, the ISO and the Unicode Consortium introduced Unicode. Unicode uses two bytes to store each character; that is, 65,536 different characters can be defined, so almost all characters can be covered with Unicode. If two computers use Unicode, every symbol will be represented in the same way and no conversion is needed - this is the idea behind Unicode.
SQL Server has two categories of character datatypes:
non-Unicode (char, varchar, and text)
Unicode (nchar, nvarchar, and ntext)
If we need to save character data from multiple countries, always use Unicode.
Although NVARCHAR stores Unicode, you should consider that, with the help of collations, you can also use VARCHAR and save data in your local language.
Just imagine the following scenario.
The collation of your DB is Persian and you save a value like 'علی' (the Persian writing of Ali) in the VARCHAR(10) datatype. There is no problem and the DBMS only uses three bytes to store it.
However, if you want to transfer your data to another database and see the correct result, your destination database must have the same collation as the source, which is Persian in this example.
If your target collation is different, you see some question marks(?) in the target database.
Finally, remember that if you have a huge database which is used only for your local language, I would recommend using the local collation (with VARCHAR) instead of using twice the space.
I believe the design can be different. It depends on the environment you work on.
I had a look at the answers and many seem to recommend using nvarchar over varchar, because space is not a problem anymore, so there is no harm in enabling Unicode for little extra storage. Well, this is not always true when you want to apply an index over your column. SQL Server has a limit of 900 bytes on the size of the field you can index. So if you have a varchar(900) you can still index it, but not varchar(901). With nvarchar, the number of characters is halved, so you can index up to nvarchar(450). So if you are confident you don't need nvarchar, I don't recommend using it.
In general, in databases, I recommend sticking to the size you need, because you can always expand. For example, a colleague at work once thought that there is no harm in using nvarchar(max) for a column, as we have no problem with storage at all. Later on, when we tried to apply an index over this column, SQL Server rejected this. If, however, he had started with even varchar(5), we could have simply expanded it later to what we need, without such a problem that would require us to do a field migration plan to fix it.
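A sketch of running into that limit (the names are hypothetical; the exact limit depends on the version and on whether the index is clustered or nonclustered):
CREATE TABLE dbo.IndexDemo (LongKey NVARCHAR(500) NOT NULL);
CREATE CLUSTERED INDEX IX_LongKey ON dbo.IndexDemo (LongKey);
-- warns that the 900-byte clustered key limit can be exceeded (NVARCHAR(500) can be
-- up to 1,000 bytes), and inserting a value longer than 450 characters will then fail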
Jeffrey L Whitledge, with a ~47,000 reputation score, recommends the use of nvarchar.
Solomon Rutzky, with a ~33,200 reputation score, recommends: Do NOT always use NVARCHAR. That is a very dangerous, and often costly, attitude / approach.
What are the main performance differences between varchar and nvarchar SQL Server data types?
https://www.sqlservercentral.com/articles/disk-is-cheap-orly-4
Both persons are of such high reputation, so what does a learning SQL Server database developer choose?
There are many warnings in answers and comments about performance issues if you are not consistent in choices.
There are comments pro/con nvarchar for performance.
There are comments pro/con varchar for performance.
I have a particular requirement for a table with many hundreds of columns, which in itself is probably unusual.
I'm choosing varchar to avoid going close to the 8060-byte table record size limit of SQL Server 2012.
Use of nvarchar, for me, goes over this 8060-byte limit.
I'm also thinking that I should match the data types of the related code tables to the data types of the primary central table.
I have seen the use of varchar columns at this place of work, the South Australian Government, by previous experienced database developers, where the table row count is going to be several million or more (and very few nvarchar columns, if any, in these very large tables), so perhaps the expected data row volumes become part of this decision.
I have to say here (I realise that I'm probably going to open myself up to a slating!), but surely the only time when NVARCHAR is actually more useful (notice the more there!) than VARCHAR is when all of the collations on all of the dependent systems and within the database itself are the same...? If not then collation conversion has to happen anyway and so makes VARCHAR just as viable as NVARCHAR.
To add to this, some database systems, such as SQL Server (before 2012) have a page size of approx. 8K. So, if you're looking at storing searchable data not held in something like a TEXT or NTEXT field then VARCHAR provides the full 8k's worth of space whereas NVARCHAR only provides 4k (double the bytes, double the space).
I suppose, to summarise, the use of either is dependent on:
Project or context
Infrastructure
Database system
Follow Difference Between SQL Server VARCHAR and NVARCHAR Data Type, where you can see the differences described in a very clear way.
In general, nvarchar stores data as Unicode, so, if you're going to store multilingual data (more than one language) in a data column you need the N variant.
nvarchar is safer to use than varchar for making our code error-free (avoiding type mismatches), because nvarchar allows Unicode characters as well.
When we use a where condition in a SQL Server query with the = operator, it will sometimes throw an error. A probable reason for this is that our mapping column is defined as varchar. If we defined it as nvarchar, this problem may not happen. If we still stick to varchar and want to avoid this issue, it is better to use the LIKE keyword rather than =.
varchar is suitable for storing non-Unicode data, which means a limited set of characters. nvarchar is a superset of varchar, so along with the characters we can store using varchar, we can store even more without losing any functionality.
Someone commented that storage/space is not an issue nowadays. Even if space is not an issue for one, identifying an optimal data type should be a requirement.
It's not only about storage! "Data moves", and you can see where I am leading with this!
