Difference between CHAR & NCHAR in database WITH UTF-8 collation - sql-server

In SAP SQL Anywhere (where data types and most structures are very similar to SQL Server), the default database collation is set to UTF-8 - settings in detail below:
I have a set of special characters that the database needs to store and work with (range: U+1400 - U+167F). After a test insert, both the VARCHAR and NVARCHAR data types were able to accommodate these special characters with no visible difference (except the allocated space) - see below:
Do I understand correctly that when the DB collation is set to UTF-8 (with the UTF8BIN charset), the CHAR/VARCHAR data type is by default able to store UTF-8 and NCHAR/NVARCHAR stores UTF-16? Meaning, I do not have to convert all CHAR/VARCHAR objects into NCHAR/NVARCHAR if all I need is the Unicode range U+1400 - U+167F?

To answer my own question:
Yes, CHAR and VARCHAR in a UTF-8 collation will store all characters, but the length specification of the data type behaves differently. When defining a VARCHAR length, e.g. VARCHAR(100), we expect a limit of 100 characters. That only holds for characters where 1 char = 1 byte (ASCII). The number actually specifies the length in bytes, so for multi-byte UTF-8 characters (2-4 bytes each) fewer characters fit, e.g. VARCHAR(100) can hold only a 25-character string of 4-byte UTF-8 text.
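The byte arithmetic above can be checked outside the database. The following is a minimal Python sketch, not SQL Anywhere code; the helper name max_chars_for_byte_limit is hypothetical, and it assumes the column limit is counted in UTF-8 bytes as described in the answer. It also shows that characters in the range from the question (U+1400 - U+167F) take 3 bytes each in UTF-8.

```python
# Sketch: how many characters fit in a byte-limited column such as VARCHAR(100)
# when the text is stored as UTF-8. The helper name is hypothetical.

def max_chars_for_byte_limit(sample_char: str, byte_limit: int) -> int:
    """Return how many copies of sample_char fit within byte_limit UTF-8 bytes."""
    bytes_per_char = len(sample_char.encode("utf-8"))
    return byte_limit // bytes_per_char

ascii_fit = max_chars_for_byte_limit("A", 100)            # 1 byte per character
syllabics_fit = max_chars_for_byte_limit("\u1400", 100)   # U+1400, 3 bytes per character
emoji_fit = max_chars_for_byte_limit("\U0001F60A", 100)   # U+1F60A, 4 bytes per character

print(ascii_fit, syllabics_fit, emoji_fit)  # 100 33 25
```

So for the range in the question, a VARCHAR(100) declared in bytes would hold at most 33 characters.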
Please feel free to correct me or improve my answer.

Related

How does SQL Server store these Unicode characters into a column that is VARCHAR(MAX) and not NVARCHAR(MAX)

I have some data which I believe is Unicode, and I am seeing what happens when I store it in my database column, which is of the VARCHAR(MAX) datatype.
And here's the source, from the file which is UTF-8...
looking for that ‘X’ and • 3 large bedrooms with 2 ensuites and • Main bedroom with ensuite & surround with plantation shutters
and using the Visual Studio debugger:
=> so 2x apostrophes and 2x bullets.
I thought SQL Server can only store Unicode if the column is of type NVARCHAR?
I'm assuming my source data is not Unicode and therefore, I totally suck at all this Unicode/UTF-8 stuff :(
I thought SQL Server can only store Unicode if the column is of type NVARCHAR?
That's correct. As far as I can guess from your example, it is not storing Unicode. Probably it is storing bytes encoded in Windows code page 1252, which would be the default encoding for a Western install of SQL Server.
Code page 1252 happens to include mappings for characters ‘, ’ and •, so those characters can be safely stored. But step outside that limited repertoire and you'll start losing characters.
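As a rough illustration of that point, here is a small Python sketch that checks which characters have a mapping in Windows-1252. It uses Python's cp1252 codec as a stand-in for the server's code page, which is an assumption; the helper name fits_cp1252 is hypothetical, and SQL Server's own conversion rules (such as best-fit mappings) are not modelled.

```python
# Sketch: test whether text can be represented in Windows code page 1252.

def fits_cp1252(text: str) -> bool:
    """Return True if every character in text has a cp1252 mapping."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False

sample = "\u2018X\u2019 \u2022 3 large bedrooms"  # curly quotes and a bullet

print(fits_cp1252(sample))            # True
print(sample.encode("cp1252")[:3])    # b'\x91X\x92'
print(fits_cp1252("\u0e14"))          # False (a Thai letter is outside cp1252)
```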

Remove emoji or smiley characters in SQL Server ntext column?

I have a mobile chat conversation text area which is stored in ntext data type in SQL Server 2008. I am doing some process character by character. I need to do something I do not know to pass these kind of emoji characters. Should I eliminate them or collate to different collation or encode to different char-set. My table's collation type is Latin1_General_CI_AS. I need something like this:
IF(SUBSTRING(@chat_Conversation, @i, 1) = 'Emoji')
CONTINUE;
As a first guess I'd suggest to place an N in front of your literal
Compare the results:
SELECT '😊'
,N'😊';
The result
ExtASCII Unicode
?? 😊
Without the N, the literal is read as extended ASCII and unknown characters are returned as question marks. With N you are dealing with UNICODE (to be exact: UCS-2)...
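The two question marks in the result are consistent with the emoji occupying two 2-byte code units in a UTF-16 encoded string. A minimal Python sketch of that, outside SQL Server: the helper strip_non_bmp is hypothetical and only removes code points above U+FFFF, so emoji that live inside the Basic Multilingual Plane (for example U+263A) would not be caught by it.

```python
import struct

emoji = "\U0001F60A"  # U+1F60A, the smiley used in the example above

# Encode as UTF-16 little endian and split into 16-bit code units.
utf16 = emoji.encode("utf-16-le")
code_units = struct.unpack(f"<{len(utf16) // 2}H", utf16)
print([f"{u:04X}" for u in code_units])  # ['D83D', 'DE0A']

def strip_non_bmp(text: str) -> str:
    """Drop every code point above U+FFFF (hypothetical helper)."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)

cleaned = strip_non_bmp("ok " + emoji + " bye")
print(repr(cleaned))  # 'ok  bye'
```

A character-by-character loop over such a string therefore sees two units for one visible emoji, which is worth keeping in mind when comparing a single SUBSTRING result against a literal.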
UPDATE
As pointed out in comments: Do not use NTEXT!
NTEXT, TEXT and IMAGE are deprecated for centuries! These types will not be supported in future versions!
Convert all your work (columns, variables...) to
NTEXT -> NVARCHAR(MAX) (covering UCS-2 characters)
TEXT -> VARCHAR(MAX) (covering extended ASCII, depending on COLLATION and code page)
IMAGE -> VARBINARY(MAX) (covering BLOBs)
Hint
If you are dealing with special characters like foreign alphabets or emojis you should always use the N with literals and with types...

Unable to return query Thai data

I have a table with columns that contain both thai and english text data. NVARCHAR(255).
In SSMS I can query the table and return all the rows easy enough. But if I then query specifically for one of the Thai results it returns no rows.
SELECT TOP 1000 [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
Returns
Province District SubDistrict Branch
อุตรดิตถ์ ลับแล ศรีพนมมาศ Northern
Bangkok Khlong Toei Khlong Tan SSS1
But this query:
SELECT [Province]
,[District]
,[SubDistrict]
,[Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
where [Province] LIKE 'อุตรดิตถ์'
Returns no rows.
What do I need to do to get the expected results?
The collation set is Latin1_General_CI_AS.
The data is displayed and inserted with no errors just can't search.
Two problems:
First, the string being passed into the LIKE clause is VARCHAR due to not being prefixed with a capital "N". For example:
SELECT 'อุตรดิตถ์' AS [VARCHAR], N'อุตรดิตถ์' AS [NVARCHAR]
-- ????????? อุตรดิตถ์
What is happening here is that when SQL Server is parsing the query batch, it needs to determine the exact type and value of all literals / constants. So it figures out that 12 is an INT and 12.0 is a NUMERIC, etc. It knows that N'ดิ' is NVARCHAR, which is an all-inclusive character set, so it takes the value as is. BUT, as noted before, 'ดิ' is VARCHAR, which is an 8-bit encoding, which means that the character set is controlled by a Code Page. For string literals and variables / parameters, the Code Page used for VARCHAR data is the Database's default Collation. If there are characters in the string that are not available on the Code Page used by the Database's default Collation, they are either converted to a "best fit" mapping, if such a mapping exists, else they become the default replacement character: ?.
Technically speaking, since the Database's default Collation controls string literals (and variables), and since there is a Code Page for "Thai" (available in Windows Collations), then it would be possible to have a VARCHAR string containing Thai characters (meaning: 'ดิ', without the "N" prefix, would work). But that would require changing the Database's default Collation, and that is A LOT more work than simply prefixing the string literal with "N".
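The effect of the code page can be sketched in Python, outside SQL Server. This assumes Python's cp1252 and cp874 codecs approximate the Latin1 and Thai Windows code pages; SQL Server's "best fit" mappings are not modelled, so only the replacement-character case is shown.

```python
# The province name from the question, written as explicit code points:
# O ANG, SARA U, TO TAO, RO RUA, DO DEK, SARA I, TO TAO, THO THUNG, THANTHAKHAT
province = "\u0e2d\u0e38\u0e15\u0e23\u0e14\u0e34\u0e15\u0e16\u0e4c"

# Latin1 code page: no Thai mappings, so every character becomes "?".
as_latin1 = province.encode("cp1252", errors="replace")
print(as_latin1)  # b'?????????'

# Thai code page: each character maps to one byte and round-trips.
as_thai = province.encode("cp874")
print(len(as_thai), as_thai.decode("cp874") == province)  # 9 True
```

Nine code points in, nine question marks out, which matches the VARCHAR column of the result shown earlier.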
For an in-depth look at this behavior, please see my two-part series:
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part A of 2: “Duck”)
Which Collation is Used to Convert NVARCHAR to VARCHAR in a WHERE Condition? (Part B of 2: “Rabbit”)
Second, you need to add the wildcard characters to both ends:
N'%อุตรดิตถ์%'
The end result will look like:
WHERE [Province] LIKE N'%อุตรดิตถ์%'
EDIT:
I just edited the question to format the "results" to be more readable. It now appears that the following might also work (since no wildcards are being used in the LIKE predicate in the question):
WHERE [Province] = N'อุตรดิตถ์'
EDIT 2:
A string (i.e. something inside of single-quotes) is VARCHAR if there is no "N" prefixed to the string literal. It doesn't matter what the destination datatype is (e.g. an NVARCHAR(255) column). The issue here is the datatype of the source data, and that source is a string literal. And unlike a string in .NET, SQL Server handles 'string' as an 8-bit encoding (VARCHAR; ASCII values 0 - 127 same across all Code Pages, Extended ASCII values 128 - 255 determined by the Code Page, and potentially 2-byte sequences for Double-Byte Character Sets) and N'string' as UTF-16 Little Endian (NVARCHAR; Unicode character set, 2-byte sequences for BMP characters 0 - 65535, two 2-byte sequences for Code Points above 65535). Using 'string' is the same as passing in a VARCHAR variable. For example:
DECLARE @ASCII VARCHAR(20);
SET @ASCII = N'อุตรดิตถ์';
SELECT @ASCII AS [ImplicitlyConverted]
-- ?????????
Could be a number of things!
First off, print out the value of the column and your query string in hex.
SELECT CONVERT(varbinary(20), Province) AS [stored], CONVERT(varbinary(20), 'อุตรดิตถ์') AS [query] FROM allDistricsBranches;
This should give you some insight into the problem. I think the most likely cause is characters such as ั and ิ being typed in the wrong sequence. They are displayed as part of the main letter but are stored internally as separate characters.
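To illustrate the point about separately stored marks, here is a small Python sketch. The three-code-point syllable is a made-up example, not data from the question: it compares two strings that contain the same Thai consonant and the same two marks in a different order, and prints their UTF-16 bytes in hex, similar in spirit to the varbinary comparison above.

```python
# Hypothetical example: THO THAHAN followed by SARA II and MAI EK,
# once in the usual order and once with the two marks swapped.
usual = "\u0e17\u0e35\u0e48"
swapped = "\u0e17\u0e48\u0e35"

usual_hex = usual.encode("utf-16-le").hex()
swapped_hex = swapped.encode("utf-16-le").hex()

print(usual_hex)         # 170e350e480e
print(swapped_hex)       # 170e480e350e
print(usual == swapped)  # False
```

The two strings can look alike on screen, yet an equality or LIKE comparison treats them as different values.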

Inserting special characters (greater/less than or equal symbol) into SQL Server database

I am trying to insert ≤ and ≥ into a symbol table where the column is of type nvarchar.
Is this possible or are these symbols not allowed in SQL Server?
To make it work, prefix the string with N
create table symboltable
(
val nvarchar(10)
)
insert into symboltable values(N'≥')
select *
from symboltable
Further Reading:
You must precede all Unicode strings with a prefix N when you deal with Unicode string constants in SQL Server
Why do some SQL strings have an 'N' prefix?
To add to gonzalo's answer, both the string literal and the field need to support unicode characters.
String Literal
Per Marc Gravell's answer on What does N' stands for in a SQL script ?:
'abcd' is a literal for a [var]char string, occupying 4 bytes memory, and using whatever code-page the SQL server is configured for.
N'abcd' is a literal for a n[var]char string, occupying 8 bytes of memory, and using UTF-16.
Where the N prefix stands for "National" Language in the SQL-92 standard and is used for representing unicode characters. For example, in the following code, any unicode characters in the basic string literal are first encoded into SQL Server's "code page":
Aside: You can check your code page with the following SQL:
SELECT DATABASEPROPERTYEX('dbName', 'Collation') AS dbCollation;
SELECT COLLATIONPROPERTY( 'SQL_Latin1_General_CP1_CI_AS' , 'CodePage' ) AS [CodePage];
The default is Windows-1252 which only contains these 256 characters
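The byte counts quoted above can be reproduced with a short Python sketch. It assumes Python's cp1252 and utf-16-le codecs stand in for the server's code page and for N'...' literals respectively; it does not model how SQL Server itself converts characters that are missing from the code page.

```python
literal = "abcd"

varchar_bytes = literal.encode("cp1252")      # 8-bit code page encoding
nvarchar_bytes = literal.encode("utf-16-le")  # 2 bytes per BMP character
print(len(varchar_bytes), len(nvarchar_bytes))  # 4 8

# The symbols from the question have no slot in Windows-1252 ...
symbols = "\u2264\u2265"  # less-than-or-equal, greater-than-or-equal
try:
    symbols.encode("cp1252")
    in_code_page = True
except UnicodeEncodeError:
    in_code_page = False
print(in_code_page)  # False

# ... but each fits in a single UTF-16 code unit.
print(len(symbols.encode("utf-16-le")))  # 4
```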
Field Type
Once the values are capable of being passed, they'll also need to be capable of being stored into a column that supports unicode types, for example:
nchar
nvarchar
ntext
Further Reading:
Why do we need to put N before strings in Microsoft SQL Server?
What is the meaning of the prefix N in T-SQL statements?
You must precede all Unicode strings with a prefix N when you deal with Unicode string constants in SQL Server
Why do some SQL strings have an 'N' prefix?

What data type use instead of 'ntext' data type?

I want to write a trigger for one of my tables which has an ntext datatype field and, as you know, a trigger can't be written for the ntext datatype.
Now I want to replace the ntext with the nvarchar datatype. The ntext maximum length is 2,147,483,647 characters whereas nvarchar(max) is 4000 characters.
What datatype can I use instead of the ntext datatype?
Or are there any ways to write a trigger when I have an ntext datatype?
I should mention that my database was designed earlier with SQL 2000 and it is full of data.
You're out of luck with SQL Server 2000, but you can possibly chain together a bunch of nvarchar(4000) variables. It's a hack, but it may be the only option you have. I would also do an assessment of your data and see what the largest value you actually have in that column is. A lot of times, columns are made in anticipation of a large data set, but in the end it doesn't have them.
In MSDN I see this:
* Important *
ntext, text, and image data types will be removed in a future version of Microsoft SQL Server. Avoid using these data types in new development work, and plan to modify applications that currently use them. Use nvarchar(max), varchar(max), and varbinary(max) instead.
Fixed and variable-length data types for storing large non-Unicode and Unicode character and binary data. Unicode data uses the UNICODE UCS-2 character set.
and it prefers nvarchar(MAX). You can see the details below:
nvarchar [ ( n | max ) ]
Variable-length Unicode string data. n defines the string length and can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size, in bytes, is two times the actual length of data entered + 2 bytes. The ISO synonyms for nvarchar are national char varying and national character varying.
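The storage rule in the quoted documentation can be turned into a quick calculation. This is a Python sketch under the assumptions stated in the quote (two bytes per stored character plus 2 bytes, and a 2^31-1 byte ceiling for max); the function name nvarchar_storage_bytes is hypothetical and any other per-row overhead is ignored.

```python
def nvarchar_storage_bytes(text: str) -> int:
    """Storage size per the quoted rule: 2 bytes per UTF-16 code unit + 2."""
    code_units = len(text.encode("utf-16-le")) // 2
    return 2 * code_units + 2

print(nvarchar_storage_bytes("abcd"))  # 10

# Rough upper bound on characters for nvarchar(max), at 2 bytes each.
max_bytes = 2**31 - 1
max_chars = max_bytes // 2
print(max_chars)  # 1073741823
```

Under those assumptions nvarchar(max) is far from limited to 4,000 characters; the 4,000 figure applies to the n in nvarchar(n).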
