Cannot store particular Unicode code points / characters in NVARCHAR fields - sql-server

I'm doing some tests with SQL Server 2017.
I'm trying to store arbitrary Unicode code points in an NVARCHAR column.
I've tried different collations.
I have no problem with common characters in the BMP plane of Unicode.
For more exotic symbols, for example if I try to store the "𝌹" character (U+1D33), the following happens:
If I do it within Management Studio, I only see the infamous square symbol. But Management Studio has the proper font since I can paste it in the query editor.
If I send the text from Visual Studio, the value I see in Management Studio is "??", that's what I retrieve from Visual Studio, too, after performing a query.
My understanding is, for non-supplementary character collations, characters outside the UCS-2 subset shouldn't be interpreted correctly because NCHAR fields are limited to 2 bytes.
But, I tried with Latin1_General_100_CS_AS_KS_WS_SC, both at the DB level and column level, and it doesn't seem to work either.
Any ideas?
Thanks

I can't reproduce any data loss or encoding issue. I can reproduce a squares that becomes 𝌹 when copied. It's probably caused by the font used to display results in the SSMS grid or the Visual Studio debugger windows.
SQL Server and Windows use UTF16 for some time now, not UCS-2. Few fonts support the full UTF16 range though.
When I tried this in SSMS :
create table #tc(name nvarchar(20));
insert into #tc values (N'𝌹');
select name,len(name),DATALENGTH(name) from #tc;
I saw a square, 2 and 4 in the grid. This means the character was stored properly and took 4 bytes. When I tried to copy those results to SO though I saw :
name (No column name) (No column name)
𝌹 2 4
When I used Result to Text I got the actual character :
name
-------------------- ----------- -----------
𝌹 2 4
The correct character is there but the SSMS grid's font can't display it
Update
As Dan Guzman noted,the font can be changed from Tools-->Options-->Environment-->Fonts and Colors-->Show settings for:-->Grid Results. The default font is Microsoft Sans Serif, a small font (855KB) used as the default font on Windows. It contains "only" 3000 glyphs. Chinese characters aren't included, which is why squares are displayed.
Chinese computers use SimShun as the default though, whose file is 17.1MB. They wouldn't have any problem displaying chinese characters.

I'm trying to store arbitrary unicode points in an nvarchar column. I've tried different collations. I have no problem with common characters in the PBS plane of Unicode.
Collations have nothing to do with what code points you can store in an NVARCHAR / NCHAR / NTEXT (deprecated) column, variable, or literal. Those datatypes can store all 1,114,112 Unicode code points (even though most haven't been mapped to a character yet).
if I try to store 𝌹 character(U+1D33), ... within Management Studio, i only see the infamous square symbol. But management studio has the proper font since i can paste it in the query editor.
As others have explained already: this is merely a font issue. Fonts can hold a max of 65k characters, so you might need multiple fonts to cover all of the characters you are trying to use. I prefer Code2003 which you can find on FontSpace.com.
If i send the text from Visual Studio, the value i see in management studio is '??'
This should be due to forgetting to prefix the string literal with an upper-case "N" ;-).
SELECT '𝌹' AS [Oops], N'𝌹' AS [No Oops];
-- ?? 𝌹
My understanding is, for non supplementary character collations, characters outside the UCS-2 subset shouldn't be interpreted correctly because nchar fields are limited to 2 bytes.
The Supplementary Character-Aware (SCA) collations — those ending with _SC or with _140_ in their names — do support supplementary characters. BUT, "support" only means that the built-in functions handle the surrogate pair as a single, supplementary code point instead a pair of surrogate code points. But, support for sorting and comparison of supplementary characters actually started in SQL Server 2005 with the introduction of the version 90 collations.
All code units in UCS-2 and UTF-16 are 16 bits / 2 bytes. Supplementary characters are merely two of those 2-byte code units. Hence, being able to store supplementary characters should have been available back in SQL Server 7.0 when NVARCHAR was introduced. Even though no supplementary characters were defined until years later (after SQL Server 2000 was released), the NVARCHAR types were still capable of storing and retrieving them. I don't have SQL Server 7.0 to test with, but I have confirmed this on SQL Server 2000.
For more info, please see:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide
Collations Info

Related

SQL Server appears to correctly interpret non-supported string literal formats

Have an instance of SQL Server 2012 that appears to correctly interpret string literal dates whose formats are not listed in the docs (though note these docs are for SQL Server 2017).
Eg. I have a TSV with a column of dates of the format %d-%b-%y (see https://devhints.io/datetime#date-1) which looks like "25-FEB-93". However, this throws type errors when trying to copy the data into the SQL Server table (via mssql-tools bcp binary). Yet, when testing on another table in SQL Server, I can do something like...
select top 10 * from account where BIRTHDATE > '25-FEB-93'
without any errors. All this, even though the given format is not listed in the docs for acceptable date formats and it apparently also can't be used as a castable string literal when writing in new records. Can anyone explain what is going on here?
the given format is not listed in the docs for acceptable date formats
That means it's not supported, and does not have documented behavior. There's lots of strings that under certain regional settings will convert due to quirks in the parsing implementation.
It's a performance-critical code path, and so the string formats are not rigorously validated on conversion. You're expected to ensure that the strings are in a supported format.
So you may need to load the column as a varchar(n) and then convert it. eg
declare #v varchar(200) = '25-FEB-93'
select convert(datetime,replace(#v,'-',' '),6)
Per the docs format 6 is dd mon YY, but note that this conversion "works" without replacing the - with , but that's an example of the behavior you observed.

How does SQL Server store these Unicode characters into a column that is VARCHAR(MAX) and not NVARCHAR(MAX)

I have some data which I believe is Unicode and seeing what happens when I store it into my database column which is of VARCHAR(MAX) datatype.
And here's the source, from the file which is UTF-8...
looking for that ‘X’ and • 3 large bedrooms with 2 ensuites and • Main bedroom with ensuite & surround with plantation shutters`
and using the Visual Studio debugger:
=> so 2x apostrophes and 2x bullets.
I thought SQL Server can only store Unicode if the column is of type NVARCHAR?
I'm assuming my source data is not Unicode and therefore, I totally suck at all this Unicode/UTF-8 stuff :(
I thought SQL Server can only store Unicode if the column is of type NVARCHAR?
That's correct. As far as I can guess from your example, it is not storing Unicode. Probably it is storing bytes encoded in Windows code page 1252, which would be the default encoding for a Western install of SQL Server.
Code page 1252 happens to include mappings for characters ‘, ’ and •, so those characters can be safely stored. But step outside that limited repertoire and you'll start losing characters.

PDO DBLIB multibyte (chinese) character encoding - SQL server

On a Linux machine, I am using PDO DBLIB to connect to an MSSQL database and insert data in a SQL_Latin1_General_CP1_CI_AS table. The problem is that when I am trying to insert chinese characters (multibyte) they are inserted as 哈市香åŠåŒºç æ±Ÿè·¯å·.
My (part of) code is as follows:
$DBH = new PDO("dblib:host=$myServer;dbname=$myDB;", $myUser, $myPass);
$query = "
INSERT INTO UserSignUpInfo
(FirstName)
VALUES
(:firstname)";
$STH = $DBH->prepare($query);
$STH->bindParam(':firstname', $firstname);
What I've tried so far:
Doing mb_convert_encoding to UTF-16LE on $firstname and CAST as VARBINARY in the query like:
$firstname = mb_convert_encoding($firstname, 'UTF-16LE', 'UTF-8');
VALUES
(CAST(:firstname AS VARBINARY));
Which results in inserting the characters properly, until there are some not-multibyte characters, which break the PDO execute.
Setting my connection as utf8:
$DBH = new PDO("dblib:host=$myServer;dbname=$myDB;charset=UTF-8;", $myUser, $myPass);
$DBH->exec('SET CHARACTER SET utf8');
$DBH->query("SET NAMES utf8");
Setting client charset to UTF-8 in my freetds.conf
Which had no impact.
Is there any way at all, to insert multibyte data in that SQL database? Is there any other workaround? I've thought of trying PDO ODBC or even mssql, but thought it's better to ask here before wasting any more time.
Thanks in advance.
EDIT:
I ended up using MSSQL and the N data type prefix. I will swap for and try PDO_ODBC when I have more time. Thanks everyone for the answers!
Is there any way at all, to insert multibyte data in [this particular] SQL
database? Is there any other workaround?
If you can switch to PDO_ODBC, Microsoft provides free SQL Server ODBC drivers for Linux (only for 64-bit Red Hat Enterprise Linux, and 64-bit SUSE Linux Enterprise) which support Unicode.
If you can change to PDO_ODBC, then the N-prefix for inserting Unicode is going to work.
If you can change the affected table from SQL_Latin1_General_CP1_CI_AS to UTF-8 (which is the default for MSSQL), then that would be ideal.
Your case is more restricted. This solution is suited for the case when you have mixed multibyte and non-multibyte characters in your input string, and you need to save them to a Latin table, and the N data type prefix isn't working, and you don't want to change away from PDO DBLIB (because Microsoft's Unicode PDO_ODBC is barely supported on linux). Here is one workaround.
Conditionally encode the input string as base64. After all, that's how we can safely transport pictures in line with emails.
Working Example:
$DBH = new PDO("dblib:host=$myServer;dbname=$myDB;", $myUser, $myPass);
$query = "
INSERT INTO [StackOverflow].[dbo].[UserSignUpInfo]
([FirstName])
VALUES
(:firstname)";
$STH = $DBH->prepare($query);
$firstname = "输入中国文字!Okay!";
/* First, check if this string has any Unicode at all */
if (strlen($firstname) != strlen(utf8_decode($firstname))) {
/* If so, change the string to base64. */
$firstname = base64_encode($firstname);
}
$STH->bindParam(':firstname', $firstname);
$STH->execute();
Then to go backwards, you can test for base64 strings, and decode only them without damaging your existing entries, like so:
while ($row = $STH->fetch()) {
$entry = $row[0];
if (base64_encode(base64_decode($entry , true)) === $entry) {
/* Decoding and re-encoding a true base64 string results in the original entry */
print_r(base64_decode($entry) . PHP_EOL);
} else {
/* Previous entries not encoded will fall through gracefully */
print_r($entry . PHP_EOL);
}
}
Entries will be saved like this:
Guan Tianlang
5pys6Kqe44KS5a2maGVsbG8=
But you can easily convert them back to:
Guan Tianlang
输入中国文字!Okay!
Collation shouldn't matter here.
Double-byte characters need to be stored in nvarchar, nchar, or ntext fields. You don't need to perform any casting.
The n data type prefix stands for National, and it causes SQL Server to store text as Unicode (UTF-16).
Edit:
PDO_DBLIB does not support Unicode, and is now deprecated.
If you can switch to PDO_ODBC, Microsoft provides free SQL Server ODBC drivers for Linux which support Unicode.
Microsoft - SQL Server ODBC Driver Documentation
Blog - Installing and Using the Microsoft SQL Server ODBC Driver for Linux
You can use Unicode compatible data-type for the table column for supporting foreign languages(exceptions are shown in EDIT 2).
(char, varchar, text) Versus (nchar, nvarchar, ntext)
Non-Unicode :
Best suited for US English: "One problem with data types that use 1 byte to encode each character is that the data type can only represent 256 different characters. This forces multiple encoding specifications (or code pages) for different alphabets such as European alphabets, which are relatively small. It is also impossible to handle systems such as the Japanese Kanji or Korean Hangul alphabets that have thousands of characters
Unicode
Best suited for systems that need to support at least one foreign language: "The Unicode specification defines a single encoding scheme for most characters widely used in businesses around the world. All computers consistently translate the bit patterns in Unicode data into characters using the single Unicode specification. This ensures that the same bit pattern is always converted to the same character on all computers. Data can be freely transferred from one database or computer to another without concern that the receiving system will translate the bit patterns into characters incorrectly.
Example :
Also i have tried one example you can view its screens below,it would be helpful for issues relating the foreign language insertions as the question is right now.The column as seen below in nvarchar and it do support the Chinese language
EDIT 1:
Another related issue is discussed here
EDIT 2 :
Unicode unsupported scripts are shown here
just use nvarchar, ntext, nChar and when you want to insert then
use
INSERT INTO UserSignUpInfo
(FirstName)
VALUES
(N'firstname');
N will refer to Unicode charactor and it is standard world wide.
Ref :
https://aalamrangi.wordpress.com/2012/05/13/storing-and-retrieving-non-english-unicode-characters-hindi-czech-arabic-etc-in-sql-server/
https://technet.microsoft.com/en-us/library/ms191200(v=sql.105).aspx
https://irfansworld.wordpress.com/2011/01/25/what-is-unicode-and-non-unicode-data-formats/
This link Explain of chinese character in MYSQL. Can't insert Chinese character into MySQL .
You have to create table table_name () CHARACTER SET = utf8;
Use UTF-8 when you insert to table
set username utf8; INSERT INTO table_name (ABC,VAL);
abd create Database in CHARACTER SET utf8 COLLATE utf8_general_ci;
then You can insert in chinese character in table

Why is Turkish Lira symbol ₺ replaced with ? in SQL server 2008 database

Any idea why the Turkish Lira symbol is replaced by a question mark when I insert it in a table in the database. See the image below
This is not a font issue. This is a Unicode (UTF-16) vs 8-bit Code Page character set issue (i.e. NVARCHAR vs VARCHAR). The character you are trying to use does not exist in the particular Code Page indicated by the default Collation of the DB in which you are executing this query. The Code Page used by the DB's default Collation is relevant here since your string literal is not prefixed with an upper-case "N". If it was, then the string would be interpreted as being Unicode and no conversion would take place. But since you are passing in a non-Unicode string, it will be forced into the current DB's default Collation's Code Page as the query is parsed. Any characters not available in that Code Page, and not having a Best-fit mapping, get turned into "?".
You can run the following to see for yourself:
SELECT '₺';
PRINT '₺';
It both prints AND displays in the results grid as ?
If you want to see what character SQL Server thinks it is, run the following:
SELECT ASCII('₺');
And it will return: 63
If you want to see what character has an ASCII value of 63, run this:
SELECT CHAR(63);
And it will return: ?
Now run this:
SELECT N'₺';
PRINT N'₺';
This will both print and display in the results grid correctly.
To see what character value the symbol really is, run the following:
SELECT UNICODE(N'₺'), UNICODE('₺');
This will return: 8378 and 63
But isn't 63 the question mark? Yes. That is because not prefixing the string literal '₺' with a capital "N" tells SQL Server that it is VARCHAR and so it gets translated to the default unknown character.
Now, if you were to execute this VARCHAR version in a DB that had a Collation tied to a Code Page that had this character, then it would work even when not prefixing the string literal with an upper-case "N". However, at the moment, I cannot find any Code Page used within SQL Server that supports this character. So, it might be a Unicode-only character, at least at far as SQL Server is concerned.
The way to fix this is:
Change the datatype of the field to NVARCHAR (I see in a comment on the question that the field is currently VARCHAR). If the field is VARCHAR then even if you use the N prefix on the string, the character will still get stored as ?, unless the Code Page specified by the Collation of the column supports this character, but again, I think this might be a Unicode-only character.
Change your INSERT statement to prefix the string field with a capital "N": (73, 4, N'(3) ₺'). Even if you change the field to NVARCHAR, if you don't prefix the string with N then SQL Server will translate the character to ? first and then insert the ?. This is because the query gets parsed before it gets executed, and parsing (for non-Unicode string literals and variables) is done in the Code Page of the DB's default Collation
Probably for the same reason my browser isn't displaying it in the title for this question: It isn't in the application's character set (or maybe not supported by the font).
In this case, my browser shows some numbers in a box (denoting the character code).
SQL-server is translating it to a known character instead.
Ensure you're storing it in a field that supports the character in it's character set (I think UTF-8 is sufficient)

Automatic character encoding handling in Perl / DBI / DBD::ODBC

I'm using Perl with DBI / DBD::ODBC to retrieve data from an SQL Server database, and have some issues with character encoding.
The database has a default collation of SQL_Latin1_General_CP1_CI_AS, so data in varchar columns is encoded in Microsoft's version of Latin-1, AKA windows-1252.
There doesn't seem to be a way to handle this transparently in DBI/DBD::ODBC. I get data back still encoded as windows-1252, for instance, € “ ” are encoded as bytes 0x80, 0x93 and 0x94. When I write those to an UTF-8 encoded XML file without decoding them first, they are written as Unicode characters 0x80, 0x93 and 0x94 instead of 0x20AC, 0x201C, 0x201D, which is obviously not correct.
My current workaround is to call $val = Encode::decode('windows-1252', $val) on every column after every fetch. This works, but hardly seems like the proper way to do this.
Isn't there a way to tell DBI or DBD::ODBC to do this conversion for me?
I'm using ActivePerl (5.12.2 Build 1202), with DBI (1.616) and DBD::ODBC (1.29) provided by ActivePerl and updated with ppm; running on the same server that hosts the database (SQL Server 2008 R2).
My connection string is:
dbi:ODBC:Driver={SQL Server Native Client 10.0};Server=localhost;Database=$DB_NAME;Trusted_Connection=yes;
Thanks in advance.
DBD::ODBC (and ODBC API) does not know the character set of the underlying column so DBD::ODBC cannot do anything with 8 bit data returned, it can only return it as it is and you need to know what it is and decode it. If you bind the columns as SQL_WCHAR/SQL_WVARCHAR the driver/sql_server should translate the characters to UCS2 and DBD::ODBC should see the columns as SQL_WCHAR/SQL_WVARCHAR. When DBD::ODBC is built in unicode mode SQL_WCHAR columns are treat as UCS2 and decoded and re-encoded in UTF-8 and Perl should see them as unicode characters.
You need to set SQL_WCHAR as the bind type after bind_columns as bind types are not sticky like parameter types.
If you want to continue reading your varchar data which windows 1252 as bytes then currently you have no choice but to decode them. I'm not in a rush to add something to DBD::ODBC to do this for you since this is the first time anyone has mentioned this to me. You might want to look at DBI callbacks as decoding the returned data might be more easily done in those (say the fetch method).
You might also want to investigate the "Perform Translation for character data" setting in newer SQL Server ODBC Drivers although I have little experience with it myself.

Resources