Is it just that nvarchar supports multibyte characters? If that is the case, is there really any point, other than storage concerns, to using varchars?
An nvarchar column can store any Unicode data. A varchar column is restricted to an 8-bit code page. Some people think that varchar should be used because it takes up less space. I believe this is not the correct answer. Code page incompatibilities are a pain, and Unicode is the cure for code page problems. With cheap disk and memory nowadays, there is really no reason to waste time mucking around with code pages anymore.
All modern operating systems and development platforms use Unicode internally. By using nvarchar rather than varchar, you can avoid doing encoding conversions every time you read from or write to the database. Conversions take time, and are prone to errors. And recovery from conversion errors is a non-trivial problem.
If you are interfacing with an application that uses only ASCII, I would still recommend using Unicode in the database. The OS and database collation algorithms will work better with Unicode. Unicode avoids conversion problems when interfacing with other systems. And you will be preparing for the future. And you can always validate that your data is restricted to 7-bit ASCII for whatever legacy system you're having to maintain, even while enjoying some of the benefits of full Unicode storage.
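For that last point, here is a minimal sketch of such a validation, assuming SQL Server; the table, constraint name, and collation choice are illustrative, not anything from the original answer:

-- Hypothetical table: the CHECK rejects anything outside printable ASCII.
CREATE TABLE dbo.LegacyExport
(
    Code NVARCHAR(50) NOT NULL,
    -- Under a binary collation, [ -~] is the code-point range from
    -- space (0x20) to tilde (0x7E); any character outside it fails.
    CONSTRAINT CK_LegacyExport_Code_Ascii
        CHECK (Code NOT LIKE N'%[^ -~]%' COLLATE Latin1_General_100_BIN2)
);

INSERT INTO dbo.LegacyExport (Code) VALUES (N'ABC-123'); -- succeeds
INSERT INTO dbo.LegacyExport (Code) VALUES (N'façade');  -- violates the constraint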
varchar: Variable-length, non-Unicode character data. The database collation determines which code page the data is stored using.
nvarchar: Variable-length Unicode character data. Dependent on the database collation for comparisons.
Armed with this knowledge, use whichever one matches your input data (ASCII v. Unicode).
I always use nvarchar as it allows whatever I'm building to withstand pretty much any data I throw at it. My CMS system does Chinese by accident, because I used nvarchar. These days, any new applications shouldn't really be concerned with the amount of space required.
It depends on how Oracle was installed. During the installation process, the NLS_CHARACTERSET option is set. You may be able to find it with the query SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET'.
If your NLS_CHARACTERSET is a Unicode encoding like UTF8, great: VARCHAR and NVARCHAR are pretty much identical. Stop reading now, just go for it. Otherwise, or if you have no control over the Oracle character set, read on.
VARCHAR — Data is stored in the NLS_CHARACTERSET encoding. If there are other database instances on the same server, you may be restricted by them, and vice versa, since you have to share the setting. Such a field can store any data that can be encoded using that character set, and nothing else. So for example if the character set is Windows-1252, you can only store characters like English letters, a handful of accented letters, and a few others (like € and —). Your application would be useful only to a few locales, unable to operate anywhere else in the world. For this reason, it is considered A Bad Idea.
NVARCHAR — Data is stored in a Unicode encoding. Every language is supported. A Good Idea.
What about storage space? VARCHAR is generally efficient, since the character set / encoding was custom-designed for a specific locale. NVARCHAR fields store in either the UTF-8 or UTF-16 encoding, based on the NLS setting, ironically enough. UTF-8 is very efficient for "Western" languages, while still supporting Asian languages. UTF-16 is very efficient for Asian languages, while still supporting "Western" languages. If concerned about storage space, pick an NLS setting that causes Oracle to use UTF-8 or UTF-16 as appropriate.
What about processing speed? Most new coding platforms use Unicode natively (Java, .NET, even C++ std::wstring from years ago!) so if the database field is VARCHAR it forces Oracle to convert between character sets on every read or write, not so good. Using NVARCHAR avoids the conversion.
Bottom line: Use NVARCHAR! It avoids limitations and dependencies, is fine for storage space, and usually best for performance too.
nvarchar stores data as Unicode, so, if you're going to store multilingual data (more than one language) in a data column you need the N variant.
varchar is used for non-Unicode characters only; nvarchar, on the other hand, is used for both Unicode and non-Unicode characters. Some other differences between them are given below.
VARCHAR vs. NVARCHAR

Character data type:
- VARCHAR: variable-length, non-Unicode characters
- NVARCHAR: variable-length, both Unicode and non-Unicode characters, such as Japanese, Korean, and Chinese

Maximum length:
- VARCHAR: up to 8,000 characters
- NVARCHAR: up to 4,000 characters

Character size:
- VARCHAR: 1 byte per character
- NVARCHAR: 2 bytes per character

Storage size:
- VARCHAR: the actual length, in bytes
- NVARCHAR: 2 times the actual length, in bytes

Usage:
- VARCHAR: use when data length varies and the actual data is usually well under capacity
- NVARCHAR: because of the doubled storage, use only when you need Unicode support, such as Japanese Kanji or Korean Hangul characters
The main difference between varchar(n) and nvarchar(n) is:

varchar (variable-length, non-Unicode character data; size up to 8,000):
- It is a variable-length data type
- Used to store non-Unicode characters
- Occupies 1 byte of space per character

nvarchar (variable-length Unicode character data):
- It is a variable-length data type
- Used to store Unicode characters
- Data is stored in a Unicode encoding, so every language is supported (for example Arabic, German, Hindi, and so on)
My two cents
Indexes can fail when not using the correct datatypes:
In SQL Server: when you have an index over a VARCHAR column and present it with a Unicode string, SQL Server does not make use of the index. The same thing happens when you present a BigInt to an indexed column containing SmallInt. Even if the BigInt is small enough to fit in a SmallInt, SQL Server is not able to use the index. The other way around you do not have this problem (when providing a SmallInt or an ANSI string to an indexed BigInt or NVARCHAR column).
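To illustrate that first point, here is a hedged sketch (the table and index are made up; whether the mismatch costs a full scan or only a less efficient seek depends on the column's collation):

-- A VARCHAR column probed with an NVARCHAR value.
CREATE TABLE dbo.Customers
(
    Id       INT IDENTITY(1,1) PRIMARY KEY,
    LastName VARCHAR(50) NOT NULL
);
CREATE INDEX IX_Customers_LastName ON dbo.Customers (LastName);

-- Index seek: the literal matches the column's type.
SELECT Id FROM dbo.Customers WHERE LastName = 'Smith';

-- NVARCHAR has higher type precedence, so SQL Server implicitly converts
-- the column side (CONVERT_IMPLICIT in the plan). With a SQL_* collation
-- this typically forces a scan; with a Windows collation it degrades to a
-- range seek plus a residual predicate.
SELECT Id FROM dbo.Customers WHERE LastName = N'Smith';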
Datatypes can vary between different DBMSs (database management systems):
Know that every database has slightly different datatypes, and VARCHAR does not mean the same thing everywhere. While SQL Server has VARCHAR and NVARCHAR, an Apache Derby database has only VARCHAR, and there VARCHAR is Unicode.
Mainly nvarchar stores Unicode characters and varchar stores non-Unicode characters.
"Unicodes" means 16-bit character encoding scheme allowing characters from lots of other languages like Arabic, Hebrew, Chinese, Japanese, to be encoded in a single character set.
That means unicodes is using 2 bytes per character to store and nonunicodes uses only one byte per character to store. Which means unicodes need double capacity to store compared to non-unicodes.
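A quick way to see that doubling for yourself, as a minimal sketch:

DECLARE @a VARCHAR(10)  = 'Hello';
DECLARE @n NVARCHAR(10) = N'Hello';
-- DATALENGTH returns bytes, not characters.
SELECT DATALENGTH(@a) AS VarcharBytes,   -- 5
       DATALENGTH(@n) AS NvarcharBytes;  -- 10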
Since SQL Server 2019 varchar columns support UTF-8 encoding.
Thus, from now on, the difference is size.
In a database system that translates to difference in speed.
Less data = less IO + less memory = more speed in general.
Go for varchar in UTF8 from now on!
The exception: if a big percentage of your data consists of characters in the ranges 2048-16383 and 16384-65535 (which take three bytes in UTF-8 but only two in UTF-16), you will have to measure.
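A sketch of what that looks like in practice, assuming SQL Server 2019 or later and one of the built-in _UTF8 collations:

CREATE TABLE dbo.Articles
(
    Title VARCHAR(200) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);
-- ASCII letters cost 1 byte each, é costs 2, and each CJK character costs 3.
INSERT INTO dbo.Articles (Title) VALUES (N'Résumé 履歴書');
SELECT Title, DATALENGTH(Title) AS Bytes FROM dbo.Articles;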
You're right. nvarchar stores Unicode data while varchar stores single-byte character data. Other than storage differences (nvarchar requires twice the storage space as varchar), which you already mentioned, the main reason for preferring nvarchar over varchar would be internationalization (i.e. storing strings in other languages).
nVarchar will help you to store Unicode characters. It is the way to go if you want to store localized data.
I would say, it depends.
If you develop a desktop application, where the OS works in Unicode (like all current Windows systems) and language does natively support Unicode (default strings are Unicode, like in Java or C#), then go nvarchar.
If you develop a web application, where strings come in as UTF-8, and language is PHP, which still does not support Unicode natively (in versions 5.x), then varchar will probably be a better choice.
If a single byte is used to store a character, there are 256 possible combinations, and thereby you can save 256 different characters. Collation is the pattern which defines the characters and the rules by which they are compared and sorted.
Code page 1252, which is Latin1 (ANSI), is the most common. Single-byte character sets are also inadequate to store all the characters used by many languages. For example, some Asian languages have thousands of characters, so they must use two bytes per character.
Unicode standard
When systems using multiple code pages are used in a network, it becomes difficult to manage communication. To standardize things, the ISO and the Unicode consortium introduced Unicode. Unicode uses two bytes to store each character, so 65,536 different characters can be defined, and almost all characters can be covered. If two computers use Unicode, every symbol will be represented in the same way and no conversion is needed - this is the idea behind Unicode.
SQL Server has two categories of character datatypes:
non-Unicode (char, varchar, and text)
Unicode (nchar, nvarchar, and ntext)
If we need to save character data from multiple countries, always use Unicode.
Although NVARCHAR stores Unicode, you should consider that, with the help of collation, you can also use VARCHAR and save the data of your local language.
Just imagine the following scenario.
The collation of your DB is Persian and you save a value like 'علی' (Persian writing of Ali) in the VARCHAR(10) datatype. There is no problem and the DBMS only uses three bytes to store it.
However, if you want to transfer your data to another database and see the correct result, your destination database must have the same collation as the source, which is Persian in this example.
If your target collation is different, you will see some question marks (?) in the target database.
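A sketch of that round-trip, assuming the Persian_100_CI_AS collation (which maps to Arabic code page 1256) is available on the server:

-- The same Unicode value converted to VARCHAR under two collations.
SELECT CAST(N'علی' COLLATE Persian_100_CI_AS    AS VARCHAR(10)) AS Kept,
       CAST(N'علی' COLLATE Latin1_General_CI_AS AS VARCHAR(10)) AS Lost;
-- Kept: the three letters survive (one byte each in code page 1256).
-- Lost: '???', because code page 1252 has no Arabic letters.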
Finally, remember that if you are using a huge database that only holds your local language, I would recommend using a local collation with VARCHAR instead of using twice the space.
I believe the design can differ; it depends on the environment you work in.
I had a look at the answers and many seem to recommend to use nvarchar over varchar, because space is not a problem anymore, so there is no harm in enabling Unicode for little extra storage. Well, this is not always true when you want to apply an index over your column. SQL Server has a limit of 900 bytes on the size of the field you can index. So if you have a varchar(900) you can still index it, but not varchar(901). With nvarchar, the number of characters is halved, so you can index up to nvarchar(450). So if you are confident you don't need nvarchar, I don't recommend using it.
In general, in databases, I recommend sticking to the size you need, because you can always expand. For example, a colleague at work once thought that there is no harm in using nvarchar(max) for a column, as we have no problem with storage at all. Later on, when we tried to apply an index over this column, SQL Server rejected this. If, however, he started with even varchar(5), we could have simply expanded it later to what we need without such a problem that will require us to do a field migration plan to fix this problem.
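A minimal sketch of that limit (900 bytes is the classic key-size cap; SQL Server 2016 and later raise it to 1,700 bytes for nonclustered index keys):

CREATE TABLE dbo.T
(
    A VARCHAR(900),  -- up to 900 bytes: fits the key limit
    B NVARCHAR(450), -- 450 * 2 = 900 bytes: fits
    C NVARCHAR(451)  -- up to 902 bytes: over the limit
);
CREATE INDEX IX_A ON dbo.T (A); -- OK
CREATE INDEX IX_B ON dbo.T (B); -- OK
CREATE INDEX IX_C ON dbo.T (C); -- created with a warning; any insert whose
                                -- key exceeds 900 bytes will then fail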
Jeffrey L Whitledge, with a ~47,000 reputation score, recommends usage of nvarchar.
Solomon Rutzky, with a ~33,200 reputation score, recommends: Do NOT always use NVARCHAR. That is a very dangerous, and often costly, attitude / approach.
What are the main performance differences between varchar and nvarchar SQL Server data types?
https://www.sqlservercentral.com/articles/disk-is-cheap-orly-4
With both persons at such high reputation, what does a learning SQL Server database developer choose?
There are many warnings in answers and comments about performance issues if you are not consistent in choices.
There are comments pro/con nvarchar for performance.
There are comments pro/con varchar for performance.
I have a particular requirement for a table with many hundreds of columns, which in itself is probably unusual.
I'm choosing varchar to avoid going close to the 8,060-byte table record size limit of SQL Server 2012.
Use of nvarchar, for me, goes over this 8,060-byte limit.
I'm also thinking that I should match the data types of the related code tables to the data types of the primary central table.
I have seen varchar columns used at this place of work (the South Australian Government) by previous experienced database developers, where the table row count is going to be several million or more (and very few nvarchar columns, if any, in these very large tables), so perhaps the expected data row volumes become part of this decision.
I have to say here (and I realise that I'm probably going to open myself up to a slating!), but surely the only time when NVARCHAR is actually more useful (notice the more there!) than VARCHAR is when all of the collations on all of the dependent systems and within the database itself are the same? If not, then collation conversion has to happen anyway, which makes VARCHAR just as viable as NVARCHAR.
To add to this, some database systems, such as SQL Server (before 2012), have a page size of approximately 8 KB. So, if you're looking at storing searchable data not held in something like a TEXT or NTEXT field, then VARCHAR provides the full 8 KB's worth of space, whereas NVARCHAR provides only 4 KB (double the bytes, double the space).
I suppose, to summarise, the use of either is dependent on:
Project or context
Infrastructure
Database system
Follow Difference Between Sql Server VARCHAR and NVARCHAR Data Type, where you can see it explained in a very descriptive way.
In general, nvarchar stores data as Unicode, so if you're going to store multilingual data (more than one language) in a data column, you need the N variant.
nvarchar is safer to use than varchar for keeping code free of type-mismatch errors, because nvarchar also allows Unicode characters. When we use a WHERE condition with the = operator in a SQL Server query, it will sometimes throw an error; the probable reason is that the mapped column is defined as varchar. If we define it as nvarchar, this problem may not happen. If we stick to varchar and want to avoid the issue, it is better to use the LIKE keyword rather than =.
varchar is suitable for storing non-Unicode data, which means a limited set of characters. nvarchar is a superset of varchar, so along with the characters we can store using varchar, we can store even more without losing any functionality.
Someone commented that storage/space is not an issue nowadays. Even if space is not an issue for one, identifying an optimal data type should be a requirement.
It's not only about storage! "Data moves" and you see where I am leading to!
Related
According to SQL Server's documentation (and legacy documentation), an nvarchar field without an _SC collation should use the UCS-2 encoding.
Starting with SQL Server 2012 (11.x), when a Supplementary Character (SC) enabled collation is used, these data types store the full range of Unicode character data and use the UTF-16 character encoding. If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.
It also states that the UCS-2 encoding stores only the subset of characters supported by UCS-2. From the Wikipedia UCS-2 specification:
UCS-2 uses a single code value [...] between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP that represents a character. UCS-2 cannot represent code points outside the BMP.
So, by the specifications above, it seems that I won't be able to store an emoji like 😍, which has a value of 0x1F60D (128,525 in decimal, way above the 65,535 limit of UCS-2). But on SQL Server 2008 R2 and SQL Server 2019 (both with the default SQL_Latin1_General_CP1_CI_AS collation), in an nvarchar field, it's perfectly stored and returned (although not supported in comparisons with LIKE or =):
SSMS doesn't render the emoji correctly, but here is the value copied and pasted from the query result: 😍
So my questions are:
Is the nvarchar field really using UCS-2 on SQL Server 2008 R2? (I also tested on SQL Server 2019, with the same non-_SC collations, and got the same results.)
Is Microsoft's documentation of nchar/nvarchar misleading about "then these data types store only the subset of character data supported by the UCS-2 character encoding"?
Does the UCS-2 encoding support code points beyond 65,535 or not?
How was SQL Server able to correctly store and retrieve this field's data when it's outside what the UCS-2 encoding supports?
NOTE: Server's Collation is SQL_Latin1_General_CP1_CI_AS and Field's Collation is Latin1_General_CS_AS.
NOTE 2: The original question described tests on SQL Server 2008. I tested and got the same results on SQL Server 2019, with the same respective collations.
NOTE 3: Every other character I tested outside the UCS-2 supported range behaves the same way. Some are: 𝕂, 😂, 𨭎, 𝕬, 𝓰
There are several clarifications to make here regarding the MS documentation snippets posted in the question, and for the sample code, for the questions themselves, and for statements made in the comments on the question. Most of the confusion can be cleared up, I believe, by the information provided in the following post of mine:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide
First things first (which is the only way it can be, right?): I'm not insulting the people who wrote the MS documentation as SQL Server alone is a huge product and there is a lot to cover, etc, but for the moment (until I get a chance to update it), please read the "official" documentation with a sense of caution. There are several misstatements regarding Collations / Unicode.
UCS-2 is an encoding that handles a subset of the Unicode character set. It works in 2-byte units. With 2 bytes, you can encode values 0 - 65535. This range of code points is known as the BMP (Basic Multilingual Plane). The BMP is all of the characters that are not Supplementary Characters (because those are supplementary to the BMP), but it does contain a set of code points that are exclusively used to encode Supplementary Characters in UTF-16 (i.e. the 2048 surrogate code points). This is a complete subset of UTF-16.
UTF-16 is an encoding that handles all of the Unicode character set. It also works in 2-byte units. In fact, there is no difference between UCS-2 and UTF-16 regarding the BMP code points and characters. The difference is that UTF-16 makes use of those 2048 surrogate code points in the BMP to create surrogate pairs which are the encodings for all Supplementary Characters. While Supplementary Characters are 4-bytes (in UTF-8, UTF-16, and UTF-32), they are really two 2-byte code units when encoding in UTF-16 (likewise, they are four 1-byte units in UTF-8, and one 4-byte in UTF-32).
Since UTF-16 merely extends what can be done with UCS-2 (by actually defining the usage of the surrogate code points), there is absolutely no difference in the byte sequences that can be stored in either case. All 2048 surrogate code points used to create Supplementary Characters in UTF-16 are valid code points in UCS-2, they just don't have any defined usage (i.e. interpretation) in UCS-2.
NVARCHAR, NCHAR, and the deprecated-so-do-NOT-use-it-NTEXT datatypes all store Unicode characters encoded in UCS-2 / UTF-16. From a storage perspective there is absolutely NO difference. So, it doesn't matter if something (even outside of SQL Server) says that it can store UCS-2. If it can do that, then it can inherently store UTF-16. In fact, while I have not had a chance to update the post linked above, I have been able to store and retrieve, as expected, emojis (most of which are Supplementary Characters) in SQL Server 2000 running on Windows XP. There were no Supplementary Characters defined until 2003, I think, and certainly not in 1999 when SQL Server 2000 was being developed. In fact (again), UCS-2 was only used in Windows / SQL Server because Microsoft pushed ahead with development prior to UTF-16 being finalized and published (and as soon as it was, UCS-2 became obsolete).
The only difference between UCS-2 and UTF-16 is that UTF-16 knows how to interpret surrogate pairs (comprised of a pair of surrogate code points, so at least they're appropriately named). This is where the _SC collations (and, starting in SQL Server 2017, also version _140_ collations which include support for Supplementary Characters so none of them have the _SC in their name) come in: they allow the built-in SQL Server functions to correctly interpret Supplementary Characters. That's it! Those collations have nothing to do with storing and retrieving Supplementary Characters, nor do they even have anything to do with sorting or comparing them (even though the "Collation and Unicode Support" documentation says specifically that this is what those collations do — another item on my "to do" list to fix). For collations that have neither _SC nor _140_ in their name (though the new-as-of-SQL Server 2019 Latin1_General_100_BIN2_UTF8 might be grey-area, at least, I remember there being some inconsistency either there or with the Japanese_*_140_BIN2 collations), the built-in functions only handle BMP code points (i.e. UCS-2).
Not "handling" Supplementary Characters means not interpreting a valid sequence of two surrogate code points as actually being a singular supplementary code point. So, for non-"SC" collations, BMP surrogate code point 1 (B1) and BMP surrogate code point 2 (B2) are just those two code points, neither one of which is defined, hence they appear as two "nothing"s (i.e. B1 followed by B2). This is why it is possible to split a Supplementary Character in two using SUBSTRING / LEFT / RIGHT because they won't know to keep those two BMP code points together. But an "SC" collation will read those code points B1 and B2 from disk or memory and see a single Supplementary code point S. Now it can be handled correctly via SUBSTRING / CHARINDEX / etc.
The NCHAR() function (not the datatype; yes, poorly named function ;) is also sensitive to whether or not the default collation of the current database supports Supplementary Characters. If yes, then passing in a value between 65536 and 1114111 (the Supplementary Character range) will return a non-NULL value. If not, then passing in any value above 65535 will return NULL. (Of course, it would be far better if NCHAR() just always worked, given that storing / retrieving always works, so please vote for this suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation ).
Fortunately, you don't need an "SC" collation to output a Supplementary Character. You can either paste in the literal character, or convert the UTF-16 Little Endian encoded surrogate pair, or use the NCHAR() function to output the surrogate pair. The following works in SQL Server 2000 (using SSMS 2005) running on Windows XP:
SELECT N'💩',                              -- 💩
       CONVERT(VARBINARY(4), N'💩'),       -- 0x3DD8A9DC
       CONVERT(NVARCHAR(10), 0x3DD8A9DC),  -- 💩 (regardless of DB Collation)
       NCHAR(0xD83D) + NCHAR(0xDCA9)       -- 💩 (regardless of DB Collation)
For more details on creating Supplementary Characters when using non-"SC" collations, please see my answer to the following DBA.SE question:
How do I set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character?
None of this affects what you see. If you store a code point, then it's there. How it behaves — sorting, comparison, etc — is controlled by collations. But, how it appears is controlled by fonts and the OS. No font can contain all characters, so different fonts contain different sets of characters, with a lot of overlap on the more widely used characters. However, if a font has a particular byte sequence mapped, then it can display that character. This is why the only work required to get Supplementary Characters displaying correctly in SQL Server 2000 (using SSMS 2005) running on Windows XP was to add a font containing the characters and doing one or two minor registry edits (no changes to SQL Server).
Supplementary Characters in SQL_* collations and collations without a version number in their name have no sort weights. Hence, they all equate to each other as well as to any other BMP code points that have no sort weights (including "space" (U+0020) and "null" (U+0000)). They started to fix this in the version _90_ collations.
SSMS has nothing to do with any of this, outside of possibly needing the font used for the query editor and/or grid results and/or errors + messages changed to one that has the desired characters. (SSMS doesn't render anything outside of maybe spatial data; characters are rendered by the display driver + font definitions + maybe something else).
Therefore, the following statement in the documentation (from the question):
If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.
is both nonsensical and incorrect. They were probably intending to say the datatypes would only store a subset of the UTF-16 encoding (since UCS-2 is the subset). Also, even if it said "UTF-16 character encoding" it would still be wrong because the bytes that you pass in will be stored (assuming enough free space in the column or variable).
I have the following two fields in a Sql Server table:
When I add some test data with accented characters into the field, it actually stores them! I thought I had to change the column from VARCHAR to NVARCHAR to accept accented characters, etc?
Basically, I thought:
VARCHAR = ASCII
NVARCHAR = Unicode
So is this a case where façade etc are actually ASCII .. while some other characters would error (if VARCHAR)?
I can see the ç and é characters in the extended ASCII chart (link above) .. so does this mean ASCII includes 0->127 or 0->255?
(Side thought: I guess I'm happy with accepting 0->255 and stripping out anything else.)
Edit
DB collation: Latin1_General_CI_AS
Server Version: 12.0.5223.6
Server Collation: SQL_Latin1_General_CP1_CI_AS
First the details of what Sql Server is doing.
VARCHAR stores single-byte characters using a specific collation. ASCII only uses 7 bits, or half of the possible values in a byte. A collation references a specific code page (along with sorting and comparison rules) to use the other half of the possible values in each byte. These code pages often include support for a limited and specific set of accented characters. If the code page used for your data supports an accented character, you can store it; if it doesn't, you see weird results (unprintable "box" or ? characters). You can even output data stored in one collation as if it had been stored in another, and get really weird stuff that way (but don't do this).
NVARCHAR is Unicode, but there is still some reliance on collations. In most situations, you will end up with UTF-16, which does allow for the full range of Unicode characters. Certain collations will result instead in UCS-2, which is slightly more limited. See the nchar/nvarchar documentation for more information.
As an additional quirk, the upcoming Sql Server 2019 will include support for UTF-8 in char and varchar types when using the correct collation.
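A short sketch of the varchar behavior described above, assuming a Latin1_General (code page 1252) database collation:

DECLARE @kept VARCHAR(20) = N'façade'; -- ç and é exist in code page 1252
DECLARE @lost VARCHAR(20) = N'日本語';  -- CJK does not: stored as '???'
SELECT @kept AS Kept, @lost AS Lost;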
Now to answer the question.
In some rare cases, where you are sure your data only needs to support accent characters originating from a single specific (usually local) culture, and only those specific accent characters, you can get by with the varchar type.
But be very careful making this determination. In an increasingly global and diverse world, where even small businesses want to take advantage of the internet to increase their reach, even within their own community, using an insufficient encoding can easily result in bugs and even security vulnerabilities. The majority of situations where it seems like a varchar encoding might be good enough are really not safe anymore.
Personally, about the only place I use varchar today is mnemonic code strings that are never shown to or provided by an end user; things that might be enum values in procedural code. Even then, this tends to be legacy code, and given the option I'll use integer values instead, for faster joins and more efficient memory use. However, the upcoming UTF-8 support may change this.
VARCHAR is ASCII using the current system code page, so the set of characters you can save depends on the code page.
NVARCHAR is Unicode, so you can store all the characters.
I need to store 255 characters in a database column of type nvarchar. The characters are UTF-8 and can be multibyte. I am not the best with character encodings, so I'm not sure if that makes sense. I want to hold 255 characters that can be in any language, etc.
You can find some simple-to-understand background information about different Unicode encodings in this, which is a chapter I wrote in a manual for an open-source project. That background information will help you to understand some of the details in my answer.
The link to documentation about nvarchar provided by Simmo states that nvarchar is stored in UCS-2 format. Because of this, you will need to convert the UTF-8 strings into UCS-2 strings before storing them in the database. You can find C++ code to do that conversion here.
A subtle but important point is that the conversion code will actually convert into UTF-16, which is a superset of UCS-2 (UTF-16 supports the use of surrogate pairs, while UCS-2 doesn't). I don't use SQL Server so I don't know if it will complain if you try to insert some surrogate pairs into it. (Perhaps somebody else here can confirm whether or not it will).
If SQL Server disallows surrogate pairs, then there will be a limit on the range of languages your application can support, but at least you know that nvarchar(255) is sufficient for your needs.
On the other hand, if SQL Server allows the use of surrogate pairs, then you might want to use nvarchar(510) to allow for the (remote) possibility that every single character will be composed of surrogate pairs.
http://msdn.microsoft.com/en-us/library/ms186939.aspx
255 characters.
I'm using SQL Express 2008 edition. I've planned my database tables/relationships on a piece of paper and was ready to begin.
Unfortunately, when I clicked on data type for my pkID_Officer I was offered more than I expected, and therefore doubt has set in.
Can anyone point me to a place where these data types are explained and examples are given as to which fields work better with which data types?
Examples:
is int for an ID (primary key) still the obvious choice, or does the uniqueidentifier take its crown?
Telephone numbers where the digits are separated by '.' (01.02.03.04.05)
Email
items that will be hyper-links
nChar and varChar?
Any help is always appreciated.
Thanks
Mike.
The MSDN site has a good overview of the SQL 2008 datatypes.
http://msdn.microsoft.com/en-us/library/ms187752.aspx
For the ID field use a guid if it needs to be unique across separate systems or tables or you want to be able to generate the ID outside of the database. Otherwise an int/identity value works just fine.
I store telephone numbers as character data since I won't ever be doing calculations on them. I would think email would be stored the same way.
As for hyperlinks you can basically store the hyperlink by itself as a varchar and render the link on the client or store the markup itself in the database. Really depends on the circumstances.
Use nvarchar if you ever think you'll need to support double byte languages now or in the future.
For a primary key I always prefer (as a starting point) to use an auto-increment int. It makes everything more usable and you don't have any "natural" relationship with the real data. Of course there can be exceptions to this...
is int for an ID (primary key) still the obvious choice, or does the uniqueidentifier take its crown?
I personally prefer INT IDENTITY over GUIDs - especially for your clustered index. GUIDs are random in nature and thus lead to a lot of index fragmentation and therefore poor performance when used as a clustered index on SQL Server. INT doesn't have this trouble, plus it's only 4 bytes vs. 16 bytes, so if you have lots of rows, and lots of non-clustered indexes (the clustered key gets added to each and every entry in each and every non-clustered index), using a GUID will unnecessarily bloat your space requirements (on disk and also in your machine's RAM)
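A sketch of the two key styles (table names are hypothetical); if a GUID key is unavoidable, NEWSEQUENTIALID() as a column default mitigates the fragmentation:

-- Sequential, narrow key: friendly to the clustered index.
CREATE TABLE dbo.OfficersInt
(
    Id   INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED, -- 4 bytes
    Name NVARCHAR(100) NOT NULL
);

-- Random, wide key: 16 bytes, and random inserts cause page splits.
CREATE TABLE dbo.OfficersGuid
(
    Id   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY CLUSTERED,
    Name NVARCHAR(100) NOT NULL
);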
Telephone numbers where the digits are separated by '.' (01.02.03.04.05)
Email
items that will be hyper-links
I'd use string fields for all of these.
VARCHAR is fine, as long as you don't need any "foreign" language support, e.g. it's okay for English and Western European languages, but fails on Eastern European and Asian languages (Cyrillic, Chinese etc.).
NVARCHAR will handle all those extra pesky languages at a price - each character is stored in 2 bytes, e.g. a string of 100 chars will use 200 bytes of storage - ALWAYS.
Hope this helps a bit !
Marc
uniqueidentifiers are GUIDs:
http://de.wikipedia.org/wiki/Globally_Unique_Identifier
GUIDs:
- are unique worldwide
- have their own objects in .NET (e.g. System.Guid)
- are 16 bytes long
If you want such things, then use it. If you are fine with int IDs, then that's OK.
Telephone numbers, emails, and hyperlinks are normal strings.
NCHAR/NVARCHAR are the Unicode counterparts of the CHAR/VARCHAR datatypes. I almost always use them in my apps - unless I have compelling reasons not to.
I need to store phone numbers in a table. Please suggest which datatype should I use?
Wait. Please read on before you hit reply..
This field needs to be indexed heavily as Sales Reps can use this field for searching (including wild character search).
As of now, we are expecting phone numbers to come in a number of formats (from an XML file). Do I have to write a parser to convert them to a uniform format? There could be millions of records (with duplicates), and I don't want to tie up server resources (with activities like too much preprocessing) every time some source data comes through.
Any suggestions are welcome..
Update: I have no control over source data. Just that the structure of xml file is standard. Would like to keep the xml parsing to a minimum.
Once it is in database, retrieval should be quick. One crazy suggestion going on around here is that it should even work with Ajax AutoComplete feature (so Sales Reps can see the matching ones immediately). OMG!!
Does this include:
International numbers?
Extensions?
Other information besides the actual number (like "ask for bobby")?
If all of these are no, I would use a 10-character field and strip out all non-numeric data. If the first is a yes and the other two are no, I'd use two varchar(50) fields: one for the original input, and one with all non-numeric data stripped out and used for indexing. If 2 or 3 are yes, I think I'd do two fields and some kind of crazy parser to determine what is an extension or other data and deal with it appropriately. Of course you could avoid the second column by doing something with the index where it strips out the extra characters when creating the index, but I'd just make a second column and probably do the stripping of characters with a trigger.
Update: to address the AJAX issue, it may not be as bad as you think. If this is realistically the main way anything is done to the table, store only the digits in a secondary column as I said, and then make the index for that column the clustered one.
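One hedged way to get that digits-only secondary column without a trigger, assuming SQL Server 2017 or later for TRANSLATE (the table name and the separator list are made up for illustration):

CREATE TABLE dbo.PhoneNumbers
(
    RawNumber  VARCHAR(50) NOT NULL,
    -- Map the common separators to '#', then strip the '#'s.
    DigitsOnly AS REPLACE(TRANSLATE(RawNumber, '()-. +', '######'), '#', '')
                  PERSISTED
);
CREATE INDEX IX_PhoneNumbers_DigitsOnly ON dbo.PhoneNumbers (DigitsOnly);

INSERT INTO dbo.PhoneNumbers (RawNumber) VALUES ('(012) 345-6789');
SELECT DigitsOnly FROM dbo.PhoneNumbers; -- 0123456789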
We use varchar(15) and certainly index on that field.
The reason is that international standards can support up to 15 digits.
Wikipedia - Telephone Number Formats
If you do support international numbers, I recommend storing the World Zone Code or Country Code separately to better filter queries, so that you do not find yourself parsing and checking the length of your phone number fields just to limit the returned calls to the USA, for example.
Use CHAR(10) if you are storing US Phone numbers only. Remove everything but the digits.
I'm probably missing the obvious here, but wouldn't a varchar just long enough for your longest expected phone number work well?
If I am missing something obvious, I'd love it if someone would point it out...
I would use a varchar(22). Big enough to hold a north american phone number with extension. You would want to strip out all the nasty '(', ')', '-' characters, or just parse them all into one uniform format.
Alex
nvarchar with preprocessing to standardize them as much as possible. You'll probably want to extract extensions and store them in another field.
SQL Server 2005 is pretty well optimized for substring queries on text in indexed varchar fields. In 2005 they introduced new string-summary statistics for indexed fields. This helps significantly with full-text searching.
Using varchar is pretty inefficient. Use the money type, create a user-defined type "phonenumber" out of it, and create a rule to allow only positive numbers.
If you declare it as (19,4), you can even store a 4-digit extension, it's big enough for international numbers, and it only takes 9 bytes of storage. Also, indexes are speedy.
Normalise the data, then store it as a varchar. Normalising could be tricky.
That should be a one-time hit. Then, as a new record comes in, you're comparing it to normalised data. Should be very fast.
Since you need to accommodate many different phone number formats (and probably include things like extensions etc.) it may make the most sense to just treat it as you would any other varchar. If you could control the input, you could take a number of approaches to make the data more useful, but it doesn't sound that way.
Once you decide to simply treat it as any other string, you can focus on overcoming the inevitable issues regarding bad data, mysterious phone number formatting, and whatever else will pop up. The challenge will be in building a good search strategy for the data, not in how you store it, in my opinion. It's always a difficult task having to deal with a large pile of data which you had no control over collecting.
Use SSIS to extract and process the information. That way you will have the processing of the XML files separated from SQL Server. You can also do the SSIS transformations on a separate server if needed. Store the phone numbers in a standard format using VARCHAR. NVARCHAR would be unnecessary since we are talking about numbers and maybe a couple of other chars, like '+', ' ', '(', ')' and '-'.
Use a varchar field with a length restriction.
It is fairly common to use an "x" or "ext" to indicate extensions, so allow 15 characters (for full international support) plus 3 (for "ext") plus 4 (for the extension itself) giving a total of 22 characters. That should keep you safe.
Alternatively, normalise on input so any "ext" gets translated to "x", giving a maximum of 20.
It is always better to have separate tables for multi-valued attributes like phone numbers.
Since you have no control over the source data, you can parse the data from the XML file, convert it into a proper format (so that there are no issues with the formats of a particular country), and store it in a separate table so that both indexing and retrieval will be efficient.
Thank you.
I realize this thread is old, but it's worth mentioning an advantage of storing phone numbers as a numeric type for formatting purposes, specifically in the .NET Framework. For example:
.DefaultCellStyle.Format = "(###)###-####" // Will not work on a string
Use the long data type instead; don't use int, because a 16-bit integer only allows whole numbers between -32,768 and 32,767, whereas a 32-bit long allows numbers between -2,147,483,648 and 2,147,483,647. (Even that cannot hold every 10-digit phone number, so bigint is the safer numeric choice.)
For most cases, it will be done with bigint
Just save unformatted phone numbers like: 19876543210, 02125551212, etc.
Check the topic about bigint vs varchar