unable to update nvarchar(50) having czech letters in it [duplicate] - sql-server

I have seen prefix N in some insert T-SQL queries. Many people have used N before inserting the value in a table.
I searched, but I was not able to understand what is the purpose of including the N before inserting any strings into the table.
INSERT INTO Personnel.Employees
VALUES(N'29730', N'Philippe', N'Horsford', 20.05, 1),
What purpose does this 'N' prefix serve, and when should it be used?

It's declaring the string as the nvarchar data type, rather than varchar.
You may have seen Transact-SQL code that passes strings around using
an N prefix. This denotes that the subsequent string is in Unicode
(the N actually stands for National language character set), which
means that you are passing an NCHAR, NVARCHAR or NTEXT value, as
opposed to CHAR, VARCHAR or TEXT.
To quote from Microsoft:
Prefix Unicode character string constants with the letter N. Without
the N prefix, the string is converted to the default code page of the
database. This default code page may not recognize certain characters.
If you want to know the difference between these two data types, see this SO post:
What is the difference between varchar and nvarchar?
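One way to see the difference for yourself is to ask SQL Server which type it assigned to each literal. This is only a minimal sketch using the built-in SQL_VARIANT_PROPERTY and DATALENGTH functions; the literal 'abc' is an arbitrary illustrative value, not something from the question.

```sql
-- Inspect the data type and byte length SQL Server assigns to each literal.
SELECT
    SQL_VARIANT_PROPERTY('abc',  'BaseType') AS WithoutPrefix,       -- varchar
    SQL_VARIANT_PROPERTY(N'abc', 'BaseType') AS WithPrefix,          -- nvarchar
    DATALENGTH('abc')                        AS BytesWithoutPrefix,  -- 3
    DATALENGTH(N'abc')                       AS BytesWithPrefix;     -- 6
```

The doubled byte length for the N'' literal reflects the 2-byte UTF-16 storage used by nvarchar.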

Let me tell you about an annoying thing that happened with the N'' prefix. I wasn't able to fix it for two days.
My database collation is SQL_Latin1_General_CP1_CI_AS.
It has a table with a column called MyCol1, which is an nvarchar.
This query fails to match an exact value that exists:
SELECT TOP 1 * FROM myTable1 WHERE MyCol1 = 'ESKİ'
-- 0 results
Using the N'' prefix fixes it:
SELECT TOP 1 * FROM myTable1 WHERE MyCol1 = N'ESKİ'
-- 1 result, found!
Why? I suppose it fails because the code page behind the Latin1_General collation (1252) has no capital dotted İ, so without the N prefix the literal no longer contains that character by the time it is compared.
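To check what actually happens to the literal, you can look at its bytes with and without the prefix. This is a rough sketch that assumes the database's default collation uses code page 1252 (such as SQL_Latin1_General_CP1_CI_AS above); the column aliases are made up for illustration.

```sql
-- Assumes a database whose default collation uses code page 1252.
SELECT
    CAST('ESKİ'  AS varbinary(8)) AS WithoutPrefix,  -- İ is converted to whatever code page 1252 can offer
    CAST(N'ESKİ' AS varbinary(8)) AS WithPrefix;     -- 0x450053004B003001 (UTF-16LE, İ = U+0130)
```

If the two results do not describe the same characters, the un-prefixed literal cannot match the value stored in the nvarchar column.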

1. Performance:
Assume your where clause is like this:
WHERE NAME='JON'
If the NAME column is of any type other than nvarchar or nchar, then you should not specify the N prefix. However, if the NAME column is of type nvarchar or nchar and you do not specify the N prefix, then 'JON' is treated as non-Unicode. The data types of the NAME column and the string 'JON' then differ, so SQL Server implicitly converts one operand's type to the other. If SQL Server converts the literal's type to the column's type there is no issue, but if it converts the column instead, performance can suffer because the column's index (if there is one) may not be used efficiently. Because nvarchar has higher data type precedence than varchar, the side that gets converted is the varchar one, so the costly case is a varchar or char column compared with an N'...' literal.
2. Character set:
If the column is of type nvarchar or nchar, then always use the N prefix when specifying the character string in a WHERE criterion or an UPDATE/INSERT statement. If you do not, and one of the characters in your string is outside the database's code page (international characters such as ā), the comparison can fail to match or the stored data can be silently corrupted, because the character is replaced with a best-fit substitute or a question mark.
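The performance point can be observed with a small experiment. The sketch below uses a hypothetical temp table (#People) and index name, neither of which comes from the answer; whether the second query degrades to a scan or a less efficient seek depends on the collation, so treat it as something to inspect in the actual execution plan rather than a guaranteed outcome.

```sql
-- Hypothetical table: a varchar column with an index on it.
CREATE TABLE #People (Name varchar(50) NOT NULL);
CREATE INDEX IX_People_Name ON #People (Name);
INSERT INTO #People (Name) VALUES ('JON'), ('ANN');

-- Literal type matches the column: no conversion is needed on the column side.
SELECT Name FROM #People WHERE Name = 'JON';

-- nvarchar literal against a varchar column: the column side is converted
-- (look for CONVERT_IMPLICIT in the actual execution plan).
SELECT Name FROM #People WHERE Name = N'JON';

DROP TABLE #People;
```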

The N'' prefix is used only when the value is of nvarchar type.

Related

SQL Server string comparison with equals sign and equals or greater in the strings [duplicate]


cast to NVARCHAR(MAX) causes "chinese"/UTF encoded characters

I am using code like this in my SELECT statement:
CAST(HASHBYTES(N'SHA1', Bla) AS NVARCHAR(MAX)) AS hashed_bla
and end up with "chinese"/UTF encoded characters in the SSMS grid but also in upstream apps. Is there a way to change this? Does this have to do with the collation? Thanks!
What you have is working as expected. Take the following example:
SELECT HASHBYTES('SHA1','B8187F0D-5DBA-4D43-95FC-CD5A009DB98C');
This returns the varbinary value 0xA04B9CB18A2DC4BC08B83FCCE48A0AF1A1390756. You are then converting that value to an nvarchar, so you get a result like N'䮠놜ⶊ별레찿諤㦡嘇' (on my collation). In the hex display of a varbinary, every 4 hex digits (2 bytes) are read as a single nvarchar character. So, for the above, A04B is the first character (which is N'䮠').
It appears what you are after is a varchar representing a varbinary value (you don't need an nvarchar here, as there will be no Unicode characters). To do so, you need to use CONVERT and a style code. For the example I gave above that would be:
SELECT CONVERT(varchar(100),HASHBYTES('SHA1','B8187F0D-5DBA-4D43-95FC-CD5A009DB98C'),1);
Which returns the varchar value '0xA04B9CB18A2DC4BC08B83FCCE48A0AF1A1390756'. If you don't want the '0x' at the start, use style code 2, rather than 1.
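Putting both style codes next to each other, together with the byte interpretation described above, might look like the following. It is a sketch only: the @hash variable and the column aliases are illustrative names, and the DECLARE with an initializer assumes SQL Server 2008 or later.

```sql
-- SHA1 produces 20 bytes, i.e. 40 hex digits (42 characters with the 0x prefix).
DECLARE @hash varbinary(20) = HASHBYTES('SHA1', 'B8187F0D-5DBA-4D43-95FC-CD5A009DB98C');

SELECT
    CONVERT(varchar(42), @hash, 1) AS WithPrefix,     -- hex text starting with 0x
    CONVERT(varchar(40), @hash, 2) AS WithoutPrefix,  -- same hex digits, no 0x
    CAST(0xA04B AS nchar(1))       AS FirstTwoBytes;  -- bytes A0 4B read as UTF-16LE = U+4BA0
```

The last column shows why the original CAST to NVARCHAR(MAX) looked like Chinese text: the raw hash bytes were simply reinterpreted as UTF-16 code units.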

How is Unicode (UTF-16) data that is out of collation stored in varchar column?

This is a purely theoretical question to help me wrap my head around this.
Let's say I have the Unicode cyclone symbol (🌀, U+1F300). If I try to store it in a varchar column that has the default Latin1_General_CI_AS collation, the cyclone symbol cannot fit into the one byte per symbol that varchar uses...
The ways I can see this being done:
1. Like JavaScript does for symbols outside the Basic Multilingual Plane (BMP), where it stores them as 2 symbols (surrogate pairs), and then additional processing is needed to put them back together...
2. Just truncate the symbol, store the first byte and drop the second.... (data is toast - you should have read the manual....)
3. Data is destroyed and nothing of use is saved... (data is toast - you should have read the manual....)
4. Some other option that is outside of my mental capacity.....
I have done some research after inserting couple of different unicode symbols
INSERT INTO [Table] (Field1)
VALUES ('👽')
INSERT INTO [Table] (Field1)
VALUES ('🌀')
and then read them back as bytes with SELECT CAST(field1 AS varbinary(10)). In both cases I got 0x3F3F.
3F in ASCII is ? (question mark), i.e. two question marks (??), which is also what I see when doing a normal SELECT *. Does that mean that the data is toast and not even the first byte is being stored?
How is Unicode data that is out of collation stored in varchar column?
The data is toast and is exactly what you see, 2 x 0x3F bytes. This happens during the type conversion prior to the insert and is effectively the same as cast('👽' as varbinary(2)), which is also 0x3F3F (as opposed to casting N'👽').
When Unicode data must be inserted into non-Unicode columns, the columns are internally converted from Unicode by using the WideCharToMultiByte API and the code page associated with the collation. If a character cannot be represented on the given code page, the character is replaced by a question mark (?).
Yes, the data has gone.
Varchar requires less space, compared to NVarchar. But that reduction comes at a cost. There is no space for a Varchar to store Unicode characters (at 1 byte per character the internal lookup just isn't big enough).
From Microsoft's Developer Network:
...consider using the Unicode nchar or nvarchar data types to minimize character conversion issues.
As you've spotted, unsupported characters are replaced with question marks.
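The loss can be reproduced without any table by converting the same character both ways. This sketch assumes the database's default collation uses a single-byte code page (for example Latin1_General_CI_AS) rather than a UTF-8 collation; the column aliases are illustrative.

```sql
-- Assumes a default collation with a single-byte code page, not a UTF-8 collation.
SELECT
    CAST(N'👽' AS varbinary(10))                      AS AsNvarchar,  -- 0x3DD87DDC: surrogate pair D83D DC7D in UTF-16LE
    CAST(CAST(N'👽' AS varchar(10)) AS varbinary(10)) AS AsVarchar;   -- 0x3F3F: two question marks
```

Each of the two UTF-16 code units is replaced by its own question mark, which is why one symbol turns into two 0x3F bytes and nothing of the original value can be recovered.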

Unable to return query Thai data

I have a table with NVARCHAR(255) columns that contain both Thai and English text data.
In SSMS I can query the table and return all the rows easy enough. But if I then query specifically for one of the Thai results it returns no rows.
SELECT TOP 1000 [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
Returns
Province District SubDistrict Branch
อุตรดิตถ์ ลับแล ศรีพนมมาศ Northern
Bangkok Khlong Toei Khlong Tan SSS1
But this query:
SELECT [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
where [Province] LIKE 'อุตรดิตถ์'
Returns no rows.
What do I need to do to get the expected results?
The collation set is Latin1_General_CI_AS.
The data is displayed and inserted with no errors just can't search.
Two problems:
First, the string being passed into the LIKE clause is VARCHAR because it is not prefixed with a capital "N". For example:
SELECT 'อุตรดิตถ์' AS [VARCHAR], N'อุตรดิตถ์' AS [NVARCHAR]
-- ?????????    อุตรดิตถ์
What is happening here is that when SQL Server is parsing the query batch, it needs to determine the exact type and value of all literals / constants. So it figures out that 12 is an INT and 12.0 is a NUMERIC, etc. It knows that N'ดิ' is NVARCHAR, which is an all-inclusive character set, so it takes the value as is. BUT, as noted before, 'ดิ' is VARCHAR, which is an 8-bit encoding, which means that the character set is controlled by a Code Page. For string literals and variables / parameters, the Code Page used for VARCHAR data is the Database's default Collation. If there are characters in the string that are not available on the Code Page used by the Database's default Collation, they are either converted to a "best fit" mapping, if such a mapping exists, else they become the default replacement character: ?.
Technically speaking, since the Database's default Collation controls string literals (and variables), and since there is a Code Page for "Thai" (available in Windows Collations), then it would be possible to have a VARCHAR string containing Thai characters (meaning: 'ดิ', without the "N" prefix, would work). But that would require changing the Database's default Collation, and that is A LOT more work than simply prefixing the string literal with "N".
For an in-depth look at this behavior, please see my two-part series:
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part A of 2: “Duck”)
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part B of 2: “Rabbit”)
Second, you need to add the wildcard characters to both ends:
N'%อุตรดิตถ์%'
The end result will look like:
WHERE [Province] LIKE N'%อุตรดิตถ์%'
EDIT:
I just edited the question to format the "results" to be more readable. It now appears that the following might also work (since no wildcards are being used in the LIKE predicate in the question):
WHERE [Province] = N'อุตรดิตถ์'
EDIT 2:
A string (i.e. something inside of single-quotes) is VARCHAR if there is no "N" prefixed to the string literal. It doesn't matter what the destination datatype is (e.g. an NVARCHAR(255) column). The issue here is the datatype of the source data, and that source is a string literal. And unlike a string in .NET, SQL Server handles 'string' as an 8-bit encoding (VARCHAR; ASCII values 0 - 127 same across all Code Pages, Extended ASCII values 128 - 255 determined by the Code Page, and potentially 2-byte sequences for Double-Byte Character Sets) and N'string' as UTF-16 Little Endian (NVARCHAR; Unicode character set, 2-byte sequences for BMP characters 0 - 65535, two 2-byte sequences for Code Points above 65535). Using 'string' is the same as passing in a VARCHAR variable. For example:
DECLARE @ASCII VARCHAR(20);
SET @ASCII = N'อุตรดิตถ์';
SELECT @ASCII AS [ImplicitlyConverted]
-- ?????????
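The whole problem can be reproduced in isolation. This sketch uses a hypothetical temp table (#Districts) rather than the table from the question, and assumes the current database's default collation has no Thai code page (for example Latin1_General_CI_AS), in which case the first count is expected to be 0 and the second 1.

```sql
-- Hypothetical stand-in for the table in the question.
CREATE TABLE #Districts (Province nvarchar(255));
INSERT INTO #Districts (Province) VALUES (N'อุตรดิตถ์'), (N'Bangkok');

-- Without N: the literal is VARCHAR and is turned into question marks before the comparison.
SELECT COUNT(*) AS MatchesWithoutPrefix FROM #Districts WHERE Province = 'อุตรดิตถ์';

-- With N: the literal stays NVARCHAR and matches the stored value.
SELECT COUNT(*) AS MatchesWithPrefix FROM #Districts WHERE Province = N'อุตรดิตถ์';

DROP TABLE #Districts;
```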
Could be a number of things!
First off, print out the value of the column and your query string in hex.
SELECT convert(varbinary(20), Province) as stored, convert(varbinary(20), 'อุตรดิตถ์') as query from allDistricsBranches;
This should give you some insight into the problem. I think the most likely cause is the ั, ิ characters being typed in the wrong sequence. They are displayed as part of the main letter but are stored internally as separate characters.

Do I have to use the prefix N in the "insert into" statement for unicode?

Like:
insert into table (col) values (N'multilingual unicode strings')
I'm using SQL Server 2008 and I already use nVarChar as the column data type.
You need the N'' syntax only if the string contains characters which are not inside the default code page. "Best practice" is to have N'' whenever you insert into an nvarchar or ntext column.
Yes, you do if you have unicode characters in the strings.
From books online (http://msdn.microsoft.com/en-us/library/ms191313.aspx)...
"Unicode string constants that appear in code executed on the server, such as in stored procedures and triggers, must be preceded by the capital letter N. This is true even if the column being referenced is already defined as Unicode. Without the N prefix, the string is converted to the default code page of the database. This may not recognize certain characters. The requirement to use the N prefix applies to both string constants that originate on the server and those sent from the client."
It is preferable for compatibility's sake.
Best practice is to use parameterisation in which case you don't need the N prefix.
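In client code, parameterisation means the driver sends the value as a typed NVARCHAR parameter, so no literal (and no prefix) appears in the SQL text. Within T-SQL itself the same idea can be sketched with sys.sp_executesql; the parameter name @val and the sample value are made up for illustration, and note that a literal written in the script, like the one assigned to @val here, still needs the N prefix.

```sql
-- The parameter is declared as nvarchar, so the value keeps its Unicode
-- characters regardless of the database's default code page.
DECLARE @sql nvarchar(200) = N'SELECT @val AS EchoedValue, DATALENGTH(@val) AS Bytes;';

EXEC sys.sp_executesql
    @sql,
    N'@val nvarchar(50)',
    @val = N'žluťoučký';
```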
