This question already has answers here:
Weird SQL Server 2005 Collation difference between varchar() and nvarchar()
I have this query:
select ' ' C union
select '*' C union
select '-' C
order by C
The result is space, asterisk, dash. But if I use Unicode (nvarchar) literals like this:
select N' ' C union
select N'*' C union
select N'-' C
order by C
I get space, dash, asterisk.
Can anyone explain why?
Thanks!
I thought at first that this would be down to different default collations for varchar and nvarchar, but that doesn't seem to be the case. It does seem to vary a bit by collation (I don't see it with Latin1_General_CI_AS but I do if I use SQL_Latin1_General_CP1_CI_AS).
Looking into it further, I found this answer here on Stack Overflow, which references this article on MSDN, which has this to say about hyphens and Unicode:
A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules. Because the comparison rules for non-Unicode and Unicode data are different, when you use a SQL collation you might see different results for comparisons of the same characters, depending on the underlying data type. For example, if you are using the SQL collation "SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less than the string 'ab' because the hyphen ("-") is sorted as a separate character that comes before "b". However, if you convert these strings to Unicode and you perform the same comparison, the Unicode string N'a-c' is considered to be greater than N'ab' because the Unicode sorting rules use a "word sort" that ignores the hyphen.
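That behavior is easy to reproduce directly. This is a minimal sketch using the values from the quote, with the SQL_Latin1_General_CP1_CI_AS collation forced explicitly so the database default doesn't interfere:
SELECT
    CASE WHEN 'a-c' COLLATE SQL_Latin1_General_CP1_CI_AS < 'ab'
         THEN 'hyphen sorts first' ELSE 'b sorts first' END AS [varchar comparison],
    CASE WHEN N'a-c' COLLATE SQL_Latin1_General_CP1_CI_AS < N'ab'
         THEN 'hyphen sorts first' ELSE 'b sorts first' END AS [nvarchar comparison];
-- expected: 'hyphen sorts first' for varchar, 'b sorts first' for nvarchar (the word sort ignores the hyphen)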
So I've marked this answer as community wiki, as it is really just a duplicate of that answer (and this question a duplicate of that question).
Related
My SQL Server database was created & designed by a freelance developer.
I see the database getting quite big and I want to ensure that the column datatypes are the most efficient in preserving the size as small as possible.
Most columns were created as
VARCHAR (255), NULL
This covers columns that are:
Numerics with a length of 2 numbers maximum
Numerics where a length will never be more than 3 numbers or blank
Alpha which will contain just 1 letter or are blank
Then there are a number of columns which are alphanumeric, some with a maximum of 10 characters and others with a maximum of 25.
There is one big alphanumeric column which can be up to 300 characters.
There has been an amendment for a column which shows the time taken, in seconds, to race an event: under 1000 seconds and up to 2 decimal places.
This is set as DECIMAL (18,2) NULL
The question is: can I reduce the size of the database by changing the column data types, or was the original design optimal for the purpose?
You should definitely strive to use the most appropriate data types for all columns - and in this regard, that freelance developer did a very poor job - both from a consistency and usability point of view (just try to sum up the numbers in a VARCHAR(255) column, or sort by their numeric value - horribly bad design), but also from a performance point of view.
Numerics with a length of 2 numbers maximum
Numerics where a length will never be more than 3 numbers or blank
-> if you don't need any fractional decimal points (only whole numbers) - use INT
Alpha which will contain just 1 letter or are blank
-> in this case, I'd use a CHAR(1) (or NCHAR(1) if you need to be able to handle Unicode characters, like Hebrew, Arabic, Cyrillic or East Asian languages). Since it's really only ever 1 character (or nothing), there's no need or point in using a variable-length string datatype, since that only adds at least 2 bytes of overhead per string stored.
There is one big alphanumeric column which can be up to 300 characters.
-> That's a great candidate for a VARCHAR(300) column (or again: NVARCHAR(300) if you need to support Unicode). Here I'd definitely use a variable-length string type to avoid padding the column with spaces up to the defined length when you store fewer characters.
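As a rough sketch of what the narrower design could look like, assuming hypothetical column names (the real schema isn't shown) and following the recommendations above:
CREATE TABLE dbo.RaceResults
(
    TwoDigitValue    INT          NULL,  -- numerics with a maximum of 2 digits
    ThreeDigitValue  INT          NULL,  -- numerics with a maximum of 3 digits, or blank
    CategoryCode     CHAR(1)      NULL,  -- single letter or blank (NCHAR(1) if Unicode is needed)
    ShortCode        VARCHAR(10)  NULL,  -- alphanumeric, maximum of 10 characters
    MediumCode       VARCHAR(25)  NULL,  -- alphanumeric, maximum of 25 characters
    BigText          VARCHAR(300) NULL,  -- the one big alphanumeric column
    TimeSeconds      DECIMAL(5,2) NULL   -- under 1000 seconds, 2 decimal places (an assumption;
                                         -- the answer above doesn't cover this column)
);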
We have one column in SQL Server where we need to keep all the digits, as it is a transaction table. The column data type is decimal(38,35). SQL Server appends zeroes: when the value is 1.2369 it displays as 1.236900000000000000... This can be avoided by casting through float, like
select cast(cast('1.2369' as decimal(38,35)) as float)
which drops all the trailing zeroes. The real question is that when we use the same expression for a bigger decimal value like 1.236597879646479444896645, it truncates the trailing digits and keeps only up to a scale of 15 digits. If anybody knows the logic behind this, please help me. Thank you.
Note: the values are always dynamic.
Because float has a precision of 15 digits.
See documentation: float and real
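You can see the 15-digit cutoff directly; this is a minimal sketch using the longer value from the question:
SELECT CAST(CAST('1.236597879646479444896645' AS DECIMAL(38,35)) AS FLOAT) AS [as_float];
-- returns approximately 1.23659787964648: only about 15 significant digits survive the conversion to FLOAT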
To format a DECIMAL(38,35) without insignificant zeroes, use an explicit FORMAT string, e.g.
SELECT FORMAT(1.23690000000000000, '0.' + REPLICATE('#', 35))
gives 1.2369 (SQL Server 2012 and up). Note, however, that the resulting type is a string, not a number, and so this should only ever be done as the final step (and only if your client software isn't smart enough to format the values on its own). While you're calculating with it, there is either no need to cut off digits, or else you need to be explicit about it by converting to the appropriate DECIMAL (e.g. 1.2369 fits in a DECIMAL(5, 4)). SQL Server can't do this automatically because it doesn't know what kind of precision you're going for in your calculations, but it is definitely something you must take into account, because combining DECIMALs of different scales and precisions can give unexpected results when you're close to the edge.
If you want to remove trailing 0s, recognize that this is now string formatting, not any numeric processing. So we'll convert it to a string and then trick RTRIM into doing the job for us:
select REPLACE(
           RTRIM(
               REPLACE(
                   CONVERT(varchar(40), convert(decimal(38,35), '1.2069')),
                   '0', ' ')),
           ' ', '0')
As I said in a comment though, it's usually more appropriate to put these presentation concerns in your presentation layer, rather than doing it down in the database - especially if whatever is consuming this result set wants to work with this data numerically. In that case, leave it alone - the trailing zeroes are only there because that's how Management Studio chooses to format decimals for display.
All 3 options are case and accent sensitive, and support Unicode.
According to the documentation:
NVarchar sorts and compares data based on the "dictionaries for the associated language or alphabet" (?)
Bin sorts and compares data based on the "bit patterns" (?)
Bin2 sorts and compares data based on "Unicode code points for Unicode data" (?)
To make complex things simple, can I say that Bin is an improvement over NVarchar and Bin2 is an improvement over Bin; and that unless I am restricted by backwards compatibility, it is always recommended to use Bin2 or at least Bin in order to enjoy better performance?
=========================================================================
I will try to explain myself again.
Have a look:
If Object_ID('words2','U') Is Not Null Drop Table words2;
Create Table words2(word1 NVarchar(20),
word2 NVarchar(20) Collate Cyrillic_General_BIN,
word3 NVarchar(20) Collate Cyrillic_General_BIN2);
Insert
Into words2
Values (N'ھاوتایی',N'ھاوتایی',N'ھاوتایی'),
(N'Συμμετρία',N'Συμμετρία',N'Συμμετρία'),
(N'אבַּג',N'אבַּג',N'אבַּג'),
(N'対称性',N'対称性',N'対称性');
Select * From words2;
All 3 options support all kinds of alphabets, no matter what the collation is.
The question is: what is the practical difference between the 3 options? Suppose I want to store private names in different alphabets; which option should I use? I guess I will have to find specific names (Select .. From .. Where ..) and order names (Select .. From .. Order By ..).
All 3 options are case and accent sensitive, and support Unicode.
NVARCHAR is a datatype (like INT, DATETIME, etc.) and not an option. It stores Unicode characters in the UCS-2 / UTF-16 (Little Endian) encoding. UCS-2 and UTF-16 are identical for code points U+0000 through U+FFFF (decimal values 0 - 65535). UTF-16 handles code points U+10000 and above (known as Supplementary Characters), all of which are defined as pairs of code points (known as Surrogate Pairs) that exist in the UCS-2 range. Since the byte sequences are identical between the two, the only difference is in the handling of the data. Meaning, built-in functions do not know how to interpret Supplementary Characters when using Collations that do not end in _SC, whereas they do work correctly for the full UTF-16 range when using Collations that do end in _SC. The _SC Collations were added in SQL Server 2012, but you can still store and retrieve Supplementary Characters in prior versions; it is only the built-in functions that do not behave as expected when operating on Supplementary Characters.
More directly:
NVARCHAR, being a datatype, is not inherently case or accent (or any other sensitivity) sensitive or insensitive. The exact behavior depends on the collation set for the column, or the database's default collation, or the COLLATE clause, depending on the context of the expression.
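For example, this is a minimal sketch with explicit COLLATE clauses so the database default doesn't interfere; the very same NVARCHAR values compare as equal or not depending purely on the collation in effect:
SELECT
    CASE WHEN N'abc' = N'ABC' COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'not equal' END AS [case-insensitive],
    CASE WHEN N'abc' = N'ABC' COLLATE Latin1_General_100_CS_AS
         THEN 'equal' ELSE 'not equal' END AS [case-sensitive];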
While it is an extremely common misconception, binary collations are neither case- nor accent-sensitive. It only appears that they are when viewed simplistically. Being "sensitive" means being able to detect differences for a particular sensitivity (case, accent, width, Kana type, and starting in SQL Server 2017: variation selector) while still allowing for differences in other sensitivities and/or underlying byte representations. For more details and examples, please see: No, Binary Collations are not Case-Sensitive.
Collations, while literally being about how characters sort and compare to each other, in SQL Server also imply the Locale / LCID (which determines the cultural rules that override the default handling of those comparisons) and the Code Page used for VARCHAR data.
Non-binary collations are considered "dictionary" sorting / comparisons because they take into account the rules of the particular culture specified by the Collation (specifically the associated LCID). On the other hand, binary collations do not deal with any culture-specific rules and only sort and compare based on the numeric value of each 2-byte sequence. For this reason binary collations are much faster, because they don't need to apply a large list of rules, but they also have no way to know that the single two-byte code point for an accented "u" (ü) is not the same as the two two-byte sequences for a "u" followed by a separate combining accent, even though the latter renders on screen the same as the single code point and compares as equal when using a non-binary collation.
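To illustrate that last point, here is a sketch comparing the precomposed ü (U+00FC) with u followed by a combining diaeresis (U+0308); the collation names are just common Windows collations picked for the example:
DECLARE @Precomposed NVARCHAR(10) = NCHAR(0x00FC);         -- single code point: ü
DECLARE @Combined    NVARCHAR(10) = N'u' + NCHAR(0x0308);  -- u followed by a combining diaeresis
SELECT
    CASE WHEN @Precomposed = @Combined COLLATE Latin1_General_100_CI_AS
         THEN 'equal' ELSE 'not equal' END AS [linguistic comparison],  -- expected: equal
    CASE WHEN @Precomposed = @Combined COLLATE Latin1_General_100_BIN2
         THEN 'equal' ELSE 'not equal' END AS [binary comparison];      -- expected: not equal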
The difference between _BIN and _BIN2 is sorting accuracy, not performance. The older _BIN collations do a simplistic byte-by-byte sorting and comparison (after the first character, which is seen as a code point and not two bytes, thus it sorts correctly) whereas the newer _BIN2 collations (starting in SQL Server 2005) compare each Code "Unit" (Supplementary Characters are made up of two Code Units, and _BIN2 collations see each Code Unit individually instead of seeing the combination of them as a Code Point). There is a difference in sort order between these two approaches mainly due to SQL Server being "Little Endian" which stores bytes (for a single entity: UTF-16 code unit, INT value, BIGINT value, etc) in reverse order. Hence, code point U+0206 will actually sort after U+0402 when using a _BIN collation:
SELECT *, CONVERT(VARBINARY(20), tmp.[Thing]) AS [ThingBytes]
FROM (VALUES (1, N'a' + NCHAR(0x0206)), (2, N'a' + NCHAR(0x0402))) tmp ([ID], [Thing])
ORDER BY tmp.[Thing] COLLATE Latin1_General_100_BIN;
/*
ID Thing ThingBytes
2 aЂ 0x61000204
1 aȆ 0x61000602 <-- U+0206, stored as 0x06 then 0x02, should sort first
*/
SELECT *, CONVERT(VARBINARY(20), tmp.[Thing]) AS [ThingBytes]
FROM (VALUES (1, N'a' + NCHAR(0x0206)), (2, N'a' + NCHAR(0x0402))) tmp ([ID], [Thing])
ORDER BY tmp.[Thing] COLLATE Latin1_General_100_BIN2;
/*
ID Thing ThingBytes
1 aȆ 0x61000602
2 aЂ 0x61000204
*/
For more details and examples of this distinction, please see: Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2).
Also, all binary collations sort and compare in exactly the same manner when it comes to Unicode / NVARCHAR data. Code Points are numerical values and there are no linguistic / cultural variations to consider when comparing them. Hence the only purpose in having more than a single, global "BINARY" Collation is the need to still specify the Code Page to use for VARCHAR data.
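A quick sketch of that last point: Ђ (U+0402) exists in the Cyrillic code page (1251) but not in the Latin1 one (1252), so converting it to VARCHAR only survives under the Cyrillic collation, while the NVARCHAR value itself is unaffected by the collation choice:
SELECT
    CAST(N'Ђ' COLLATE Cyrillic_General_BIN2   AS VARCHAR(10)) AS [Cyrillic code page],  -- Ђ
    CAST(N'Ђ' COLLATE Latin1_General_100_BIN2 AS VARCHAR(10)) AS [Latin1 code page];    -- ?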
Suppose I want to store private names in different alphabets, which option may I use?
If you were using VARCHAR fields, then the specific Collation (regardless of binary or non-binary) would determine which characters are available since that is 8-bit Extended ASCII which typically has a range of 256 different characters (unless using a Double-Byte Character Set, in which case it can handle many more, but those are still mostly of a single culture / alphabet). If using NVARCHAR to store the data, since that is Unicode it has a single character set comprised of all characters from all languages, plus lots of other stuff.
So choosing NVARCHAR takes care of the problem of being able to hold the proper characters of names coming from various languages. HOWEVER, you still need to pick a particular culture's dictionary rules in order to sort in a manner that each particular culture expects. This is a problem because Collations cannot be set dynamically. So pick the one that is used the most. Binary collations will not help you here, and in fact would go against what you are trying to do. They are, however, quite handy when you need to distinguish between characters that would otherwise equate, such as in this case: SQL server filtering CJK punctuation characters (here on S.O.).
Another related scenario in which I have used a _BIN2 collation was detecting case changes in URLs. Some parts of a URL are case-insensitive, such as the hostname / domain name. But, in the QueryString, the values being passed in are potentially sensitive. If you compare URL values in a case-insensitive operation, then http://domain.tld/page.ext?var1=val would equate to http://domain.tld/page.ext?var1=VAL, and those values should not be assumed to be the same. Using a case-sensitive Collation would also typically work, but I use Latin1_General_100_BIN2 because it's faster (no linguistic rules) and would not ignore a change of ü to u + combining diaeresis (which renders as ü).
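For instance, a sketch using the URLs from the paragraph above:
DECLARE @url1 NVARCHAR(200) = N'http://domain.tld/page.ext?var1=val';
DECLARE @url2 NVARCHAR(200) = N'http://domain.tld/page.ext?var1=VAL';
SELECT
    CASE WHEN @url1 = @url2 COLLATE Latin1_General_100_CI_AS
         THEN 'treated as the same URL' ELSE 'different' END AS [case-insensitive],
    CASE WHEN @url1 = @url2 COLLATE Latin1_General_100_BIN2
         THEN 'treated as the same URL' ELSE 'different' END AS [binary];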
I have more explanations of Collations spread across the following answers (so won't duplicate here as most of them contain several examples):
UCS-2 and SQL Server
SQL Server default character encoding
What is the point of COLLATIONS for nvarchar (Unicode) columns?
Unicode to Non-unicode conversion
NVARCHAR storing characters not supported by UCS-2 encoding on SQL Server
And these are on DBA.StackExchange:
How To Strip Hebrew Accent Marks
Latin1_General_BIN performance impact when changing the database default collation
Storing Japanese characters in a table
For more info on working with Collations, Encodings, Unicode, etc, please visit: Collations Info
nvarchar is a data type, and the "BIN" or "BIN2" collations are just that - collation sequences. They are two different things.
You use an nvarchar column to store unicode character data:
nchar and nvarchar (Transact-SQL)
String data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.
https://msdn.microsoft.com/en-GB/library/ms186939(v=sql.105).aspx
An nvarchar column will have an associated collation sequence that defines how the characters sort and compare. This can also be set for the whole database.
COLLATE (Transact-SQL)
Is a clause that can be applied to a database definition or a column definition to define the collation, or to a character string expression to apply a collation cast.
https://msdn.microsoft.com/en-us/library/ms184391(v=sql.105).aspx
So, when working with character data in SQL server, you always use both a character data-type (nvarchar, varchar, nchar or char) along with an appropriate collation according to your needs for case-sensitivity, accent-sensitivity etc.
For example, in my work I normally use the "Latin1_General_CI_AI" collation. This is suitable for latin character sets, and provides case-insensitive and accent-insensitive matching for queries.
That means that the following strings are all considered to be equal:
Höller, höller, Holler, holler
This is ideal for systems where there may be words containing accented characters (as above), but you can't be sure your users will enter the accents when searching for something.
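For instance, a minimal sketch showing that the variants above compare as equal under that collation:
SELECT CASE WHEN N'Höller' = N'holler' COLLATE Latin1_General_CI_AI
            THEN 'equal' ELSE 'not equal' END AS [CI_AI comparison];
-- expected: equal, since both the case and the accent are ignored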
If you only wanted case-insensitivity then you would use a "CI_AS" (accent sensitive) collation instead.
The "_BIN" collations are for binary comparisons that treat every distinct character as different, and wouldn't be used for general text comparisons.
Edit for updated question:
Provided that you always use nvarchar (as opposed to varchar) columns then you always have support for all unicode code points, no matter what collation is used.
There is no practical difference in your example query, as it is only a simple insert and select. Also bear in mind that your first "word1" column will be using the database or server's default collation - there's always a collation in use!
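If in doubt, you can check which collation each column actually ended up with; this sketch queries the catalog views (word1 will show the database default):
SELECT c.name AS column_name, c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID('words2');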
Where the differences will occur is if you use criteria against your nvarchar columns, or sort by them. This is what collations are for - they define which characters should be treated as equivalent for comparisons and sorting.
I can't say anything about Cyrillic, but in the case of Latin characters, using the "Latin1_General_CI_AI" collation, then characters such as A a á â etc are all equivalent - the case and the accent are ignored.
Imagine if you have the string Aaáâ stored in your "word1" column, then the query SELECT * FROM words2 WHERE word1 = 'aaaa' will return your row.
If you use a "_BIN" collation then all these characters are treated as distinct, and the query above would not return a row. I can't think of a situation where you'd want to use a "_BIN" collation when working with textual data. Edit 2: Actually I can - storing password hashes would be a good place to use a binary collation, so that comparisons are exact. That's about all.
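Here is the same comparison written out directly, as a sketch with explicit collations so it doesn't depend on the table's or database's default:
SELECT
    CASE WHEN N'Aaáâ' = N'aaaa' COLLATE Latin1_General_CI_AI
         THEN 'row would be returned' ELSE 'no row' END AS [CI_AI],
    CASE WHEN N'Aaáâ' = N'aaaa' COLLATE Latin1_General_BIN
         THEN 'row would be returned' ELSE 'no row' END AS [BIN];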
I hope this makes it clearer.
Has anyone encountered the following, where dividing a number in SQL appends a seemingly random number of trailing zeros?...
SELECT 8.95/10 ... results in 0.895000
If you have encountered this, what is the reason for the addition of the zeros?
UPDATE: I am aware that casting the result to FLOAT will remove the 0's
First of all, seeing trailing zeros (or anything else) when querying in SSMS is not because of something special in the DB engine; it is always the result of the internal query result formatting used for display. After all, all numbers are just binary values in some representation that at some point gets translated to strings for displaying.
Anyway, the real reason is because of the datatypes involved, and how SSMS decides to display them. When doing those calculations, SQL Server must decide what datatype the result will be, based on the types of the inputs, and in that particular case it was numeric(7,6). You can easily see the result types by saving the result to a temp table and running sp_columns on that:
SELECT 8.95 AS dividend,10 AS divider,8.95/10 AS result INTO #temp ;
EXEC tempdb..sp_columns '#temp' ;
SELECT * FROM #temp ;
DROP TABLE #temp ;
In my case it returned this (among other uninteresting things for now):
COLUMN_NAME  TYPE_NAME  PRECISION  LENGTH  SCALE
dividend     numeric    3          5       2
divider      int        10         4       0
result       numeric    7          9       6
Playing with castings in various places in the division will only change the resulting data types. The interesting fact is the Scale for the result column; note that it's a 6. That's exactly the number of decimal places that SSMS decides to display for the NUMERIC data type, regardless of the actual value. FLOAT doesn't have this formatting in SSMS, which is why the casting eliminates the trailing zeros. Of course, when using the DB from outside SSMS, the formatting will depend on the calling application and will not be subject to all this.
As another example of this behavior, just try SELECT CAST(1 AS NUMERIC(18,10)) and see that it shows 1.0000000000.
I had (perhaps naively) assumed that in SQL Server, an nvarchar would store each character in two bytes. But this does not always seem to be the case. The documentation out there suggests that some characters might take more bytes. Does someone have a definitive answer?
Yes, it uses 2 bytes per character. Use DATALENGTH to get the storage size; you can't use LEN because LEN just counts the characters. See here: The differences between LEN and DATALENGTH in SQL Server
DECLARE @n NVARCHAR(10)
DECLARE @v VARCHAR(10)
SELECT @n = 'A', @v = 'A'
SELECT DATALENGTH(@n), DATALENGTH(@v)
---------
2 1
Here is what Books On Line has: http://msdn.microsoft.com/en-us/library/ms186939.aspx
Character data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.
nchar [ ( n ) ]
Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000. The storage size is two times n bytes. The ISO synonyms for nchar are national char and national character.
nvarchar [ ( n | max ) ]
Variable-length Unicode character data. n can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size, in bytes, is two times the number of characters entered + 2 bytes. The data entered can be 0 characters in length. The ISO synonyms for nvarchar are national char varying and national character varying.
That said, Unicode compression was introduced in SQL Server 2008 R2, so it might store ASCII characters as 1 byte. You can read about Unicode compression here:
SQL Server 2008 R2 : A quick experiment in Unicode Compression
SQL Server 2008 R2 : Digging deeper into Unicode compression
More testing of Unicode Compression in SQL Server 2008 R2
Given that there are more than 65536 characters, it should be obvious that a character cannot possibly fit in just two octets (i.e. 16 bits).
SQL Server, like most of Microsoft's products (Windows, .NET, NTFS, …), uses UTF-16 to store text, in which a character takes up either two or four octets, although as @SQLMenace points out, current versions of SQL Server use compression to reduce that.
My understanding of this issue is that SQL Server uses UCS-2 internally, but that its UCS-2 implementation has been hacked to support a subset of characters of up to 4 bytes in the GB18030 character set, which are stored as UCS-2 but are transparently converted by the database engine back to multibyte characters when queried.
Surrogate/supplementary characters aren't fully supported - the implementation of a number of SQL Server string functions doesn't support surrogate pairs, as detailed here.
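As a sketch of that limitation (assuming a non-_SC database collation), the supplementary character U+1D11E (musical G clef) has to be built from its surrogate pair and is stored as 4 bytes, yet LEN counts it as 2 characters:
DECLARE @clef NVARCHAR(10) = NCHAR(0xD834) + NCHAR(0xDD1E);  -- surrogate pair for U+1D11E
SELECT DATALENGTH(@clef) AS [bytes],  -- 4: two UTF-16 code units
       LEN(@clef)        AS [chars];  -- 2: the two code units are counted separately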