PDO DBLIB multibyte (Chinese) character encoding - SQL Server

On a Linux machine, I am using PDO DBLIB to connect to an MSSQL database and insert data into a table with the SQL_Latin1_General_CP1_CI_AS collation. The problem is that when I try to insert Chinese (multibyte) characters, they are stored as 哈市香åŠåŒºç æ±Ÿè·¯å·.
Part of my code is as follows:
$DBH = new PDO("dblib:host=$myServer;dbname=$myDB;", $myUser, $myPass);
$query = "
INSERT INTO UserSignUpInfo
(FirstName)
VALUES
(:firstname)";
$STH = $DBH->prepare($query);
$STH->bindParam(':firstname', $firstname);
$STH->execute();
What I've tried so far:
Doing mb_convert_encoding to UTF-16LE on $firstname and CAST as VARBINARY in the query like:
$firstname = mb_convert_encoding($firstname, 'UTF-16LE', 'UTF-8');
VALUES
(CAST(:firstname AS VARBINARY));
This inserts the characters properly, until the string contains some non-multibyte characters, which break the PDO execute.
Setting my connection as utf8:
$DBH = new PDO("dblib:host=$myServer;dbname=$myDB;charset=UTF-8;", $myUser, $myPass);
$DBH->exec('SET CHARACTER SET utf8');
$DBH->query("SET NAMES utf8");
Setting the client charset to UTF-8 in my freetds.conf.
None of which had any impact.
Is there any way at all to insert multibyte data into that SQL Server database? Is there any other workaround? I've thought of trying PDO_ODBC or even the mssql extension, but thought it's better to ask here before wasting any more time.
Thanks in advance.
EDIT:
I ended up using the mssql extension and the N data type prefix. I will switch to PDO_ODBC and try it when I have more time. Thanks everyone for the answers!

Is there any way at all to insert multibyte data in [this particular] SQL database? Is there any other workaround?
If you can switch to PDO_ODBC, Microsoft provides free SQL Server ODBC drivers for Linux (only for 64-bit Red Hat Enterprise Linux and 64-bit SUSE Linux Enterprise) which support Unicode.
With PDO_ODBC, the N prefix for inserting Unicode literals will also work.
If you can change the affected table from SQL_Latin1_General_CP1_CI_AS (the default collation for MSSQL) to a Unicode-capable type such as NVARCHAR, that would be ideal.
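For illustration, here is a minimal sketch of the PDO_ODBC route, assuming Microsoft's driver is installed and FirstName is an NVARCHAR column (the driver name and DSN format below are assumptions, not taken from the question):
$DBH = new PDO("odbc:Driver={ODBC Driver 11 for SQL Server};Server=$myServer;Database=$myDB", $myUser, $myPass);
$STH = $DBH->prepare("INSERT INTO UserSignUpInfo (FirstName) VALUES (:firstname)");
/* With a Unicode-capable ODBC driver the bound UTF-8 value should reach the
   NVARCHAR column intact; the N prefix is only needed for inline literals. */
$STH->bindParam(':firstname', $firstname);
$STH->execute();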
Your case is more restricted: you have mixed multibyte and single-byte characters in your input string, you need to save them to a Latin-collation table, the N data type prefix isn't working, and you don't want to change away from PDO_DBLIB (because Microsoft's Unicode-capable ODBC driver is barely supported on Linux). Here is one workaround:
Conditionally encode the input string as base64. After all, that is how pictures are safely transported inline with emails.
Working Example:
$DBH = new PDO("dblib:host=$myServer;dbname=$myDB;", $myUser, $myPass);
$query = "
INSERT INTO [StackOverflow].[dbo].[UserSignUpInfo]
([FirstName])
VALUES
(:firstname)";
$STH = $DBH->prepare($query);
$firstname = "输入中国文字!Okay!";
/* First, check whether this string contains any multibyte characters at all */
if (strlen($firstname) != strlen(utf8_decode($firstname))) {
/* If so, change the string to base64. */
$firstname = base64_encode($firstname);
}
$STH->bindParam(':firstname', $firstname);
$STH->execute();
Then, to go backwards, you can test for base64 strings and decode only those, without damaging your existing entries, like so:
$STH = $DBH->query('SELECT FirstName FROM UserSignUpInfo'); /* source query assumed */
while ($row = $STH->fetch()) {
$entry = $row[0];
if (base64_encode(base64_decode($entry, true)) === $entry) {
/* Decoding and re-encoding a true base64 string results in the original entry */
print_r(base64_decode($entry) . PHP_EOL);
} else {
/* Previous entries not encoded will fall through gracefully */
print_r($entry . PHP_EOL);
}
}
Entries will be saved like this:
Guan Tianlang
5pys6Kqe44KS5a2maGVsbG8=
But you can easily convert them back to:
Guan Tianlang
输入中国文字!Okay!

Collation shouldn't matter here.
Double-byte characters need to be stored in nvarchar, nchar, or ntext fields. You don't need to perform any casting.
The n data type prefix stands for National, and it causes SQL Server to store text as Unicode (UTF-16).
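As a minimal sketch of that approach, assuming you are free to alter the column (the column size here is illustrative):
$DBH->exec("ALTER TABLE UserSignUpInfo ALTER COLUMN FirstName NVARCHAR(100)");
/* The N prefix marks the literal as Unicode, so it is not forced
   through the connection's 8-bit code page. */
$DBH->exec("INSERT INTO UserSignUpInfo (FirstName) VALUES (N'输入中国文字')");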
Edit:
PDO_DBLIB does not support Unicode, and is now deprecated.
If you can switch to PDO_ODBC, Microsoft provides free SQL Server ODBC drivers for Linux which support Unicode.
Microsoft - SQL Server ODBC Driver Documentation
Blog - Installing and Using the Microsoft SQL Server ODBC Driver for Linux

You can use a Unicode-compatible data type for the table column to support foreign languages (exceptions are shown in EDIT 2):
(char, varchar, text) versus (nchar, nvarchar, ntext)
Non-Unicode:
Best suited for US English: "One problem with data types that use 1 byte to encode each character is that the data type can only represent 256 different characters. This forces multiple encoding specifications (or code pages) for different alphabets such as European alphabets, which are relatively small. It is also impossible to handle systems such as the Japanese Kanji or Korean Hangul alphabets that have thousands of characters."
Unicode:
Best suited for systems that need to support at least one foreign language: "The Unicode specification defines a single encoding scheme for most characters widely used in businesses around the world. All computers consistently translate the bit patterns in Unicode data into characters using the single Unicode specification. This ensures that the same bit pattern is always converted to the same character on all computers. Data can be freely transferred from one database or computer to another without concern that the receiving system will translate the bit patterns into characters incorrectly."
Example:
I also tried an example: with the column as nvarchar, Chinese text is inserted and retrieved correctly, which is relevant to foreign-language insertion issues like the one in this question. (Screenshots from the original answer are omitted here.)
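As a small code sketch of the same comparison, assuming an open PDO connection $DBH on a Unicode-capable driver such as PDO_ODBC (the scratch temp table is illustrative):
$DBH->exec("CREATE TABLE #t (v VARCHAR(20), nv NVARCHAR(20))");
$DBH->exec("INSERT INTO #t VALUES (N'输入中国文字', N'输入中国文字')");
foreach ($DBH->query("SELECT v, nv FROM #t") as $row) {
    /* The varchar column typically comes back as '??????' (lossy);
       the nvarchar column preserves the Chinese text. */
    printf("varchar: %s | nvarchar: %s\n", $row['v'], $row['nv']);
}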
EDIT 1:
Another related issue is discussed here
EDIT 2 :
Unicode unsupported scripts are shown here

Just use nvarchar, ntext, or nchar, and when you want to insert, use:
INSERT INTO UserSignUpInfo
(FirstName)
VALUES
(N'firstname');
N marks the value as Unicode (National) characters, and it is a worldwide standard.
Ref:
https://aalamrangi.wordpress.com/2012/05/13/storing-and-retrieving-non-english-unicode-characters-hindi-czech-arabic-etc-in-sql-server/
https://technet.microsoft.com/en-us/library/ms191200(v=sql.105).aspx
https://irfansworld.wordpress.com/2011/01/25/what-is-unicode-and-non-unicode-data-formats/

This link explains Chinese characters in MySQL: Can't insert Chinese character into MySQL.
You have to create the table with CHARACTER SET = utf8:
CREATE TABLE table_name (...) CHARACTER SET = utf8;
Use UTF-8 on the connection when you insert into the table:
SET NAMES utf8;
INSERT INTO table_name (ABC) VALUES (VAL);
And create the database with CHARACTER SET utf8 COLLATE utf8_general_ci.
Then you can insert Chinese characters into the table.
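A minimal PHP sketch of the same setup, assuming PDO's MySQL driver (all names are placeholders):
$DBH = new PDO("mysql:host=$myServer;dbname=$myDB;charset=utf8mb4", $myUser, $myPass);
/* utf8mb4 covers all of Unicode, including 4-byte characters such as emoji. */
$DBH->exec("CREATE TABLE IF NOT EXISTS table_name (ABC VARCHAR(100)) CHARACTER SET utf8mb4");
$STH = $DBH->prepare("INSERT INTO table_name (ABC) VALUES (?)");
$STH->execute(["输入中国文字"]);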

Related

How does SQL Server store Unicode characters in a column that is VARCHAR(MAX) and not NVARCHAR(MAX)?

I have some data which I believe is Unicode, and I am seeing what happens when I store it into my database column, which is of the VARCHAR(MAX) datatype.
And here's the source, from the file which is UTF-8...
looking for that ‘X’ and • 3 large bedrooms with 2 ensuites and • Main bedroom with ensuite & surround with plantation shutters`
and inspecting it with the Visual Studio debugger (screenshot omitted here): so two curly apostrophes and two bullets.
I thought SQL Server can only store Unicode if the column is of type NVARCHAR?
I'm assuming my source data is not Unicode and therefore, I totally suck at all this Unicode/UTF-8 stuff :(
I thought SQL Server can only store Unicode if the column is of type NVARCHAR?
That's correct. As far as I can guess from your example, it is not storing Unicode. Probably it is storing bytes encoded in Windows code page 1252, which would be the default encoding for a Western install of SQL Server.
Code page 1252 happens to include mappings for characters ‘, ’ and •, so those characters can be safely stored. But step outside that limited repertoire and you'll start losing characters.
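To illustrate the repertoire point, here is a small PHP sketch using iconv (the character list is just an illustration; CP1252 is the code page named above):
foreach (['‘', '’', '•', '€', '中', '😊'] as $ch) {
    /* iconv() returns false (with a notice) when the character has no
       mapping in the target code page; @ suppresses the notice. */
    $cp1252 = @iconv('UTF-8', 'CP1252', $ch);
    printf("%s -> %s\n", $ch,
        $cp1252 === false ? '(not representable in CP1252)' : 'byte 0x' . bin2hex($cp1252));
}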

Remove emoji or smiley characters in SQL Server ntext column?

I have a mobile chat conversation text area which is stored in an ntext column in SQL Server 2008. I am doing some processing character by character, and I need some way to skip over emoji characters like these. Should I eliminate them, collate to a different collation, or encode to a different character set? My table's collation is Latin1_General_CI_AS. I need something like this:
IF (SUBSTRING(@chat_Conversation, @i, 1) = 'Emoji')
CONTINUE;
As a first guess, I'd suggest placing an N in front of your literal.
Compare the results:
SELECT '😊'
,N'😊';
The result:
ExtASCII   Unicode
??         😊
Without the N, the literal is read as extended ASCII, and unknown characters are returned as question marks. With N you are dealing with Unicode (to be exact: UCS-2)...
UPDATE
As pointed out in comments: Do not use NTEXT!
NTEXT, TEXT and IMAGE are deprecated for centuries! These types will not be supported in future versions!
Convert all your work (columns, variables...) to
NTEXT -> NVARCHAR(MAX) (covering UCS-2 characters)
TEXT -> VARCHAR(MAX) (covering extended ASCII, depending on COLLATION and code page)
IMAGE -> VARBINARY(MAX) (covering BLOBs)
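For example, a minimal migration sketch via PDO (the table and column names here are assumptions):
$DBH->exec("ALTER TABLE dbo.ChatConversation ALTER COLUMN Conversation NVARCHAR(MAX)");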
Hint
If you are dealing with special characters like foreign alphabets or emojis, you should always use the N prefix with literals, and the n types for columns and variables...

PostgreSQL: unable to save special character (regional language) in blob

I am using PostgreSQL 9.0 and am trying to store a bytea file which contains certain special characters (regional-language characters, UTF-8 encoded), but I am not able to store the data as input by the user.
For example :
what I get in request while debugging:
<sp_first_name_gu name="sp_first_name_gu" value="ઍયેઍ"></sp_first_name_gu><sp_first_name name="sp_first_name" value="aaa"></sp_first_name>
This is what is stored in DB:
<sp_first_name_gu name="sp_first_name_gu" value="\340\252\215\340\252\257\340\253\207\340\252\215"></sp_first_name_gu><sp_first_name name="sp_first_name" value="aaa"></sp_first_name>
Note the difference in the value tag. Because of this issue, I am not able to retrieve the proper text input by the user.
Please suggest what I need to do.
PS: My DB is UTF8 encoded.
The value is stored correctly, but it is escaped into octal escape sequences upon retrieval.
To fix that, change the settings of the DB driver or choose a different encoding/escaping for bytea.
Or just use a proper field type for the XML data, like varchar or xml.
Your string \340\252\215\340\252\257\340\253\207\340\252\215 is exactly ઍયેઍ in octal encoding, so Postgres stores your data correctly. PostgreSQL escapes all non-printable characters; for more details see the PostgreSQL documentation, especially section 8.4.2.
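For instance, a minimal PHP sketch of reading such a value back, assuming the pgsql extension (connection string, table, and column names are placeholders):
$conn = pg_connect('host=localhost dbname=mydb');
$res  = pg_query($conn, 'SELECT xml_data FROM profiles LIMIT 1');
$raw  = pg_fetch_result($res, 0, 0);
/* pg_unescape_bytea() reverses the octal escaping shown above. */
echo pg_unescape_bytea($raw);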

How to read Arabic characters from varchar datatype?

I have an old system that uses the varchar datatype in its database to store Arabic names. The names now appear in the database like this:
"ãíÓÇÁ ÇáãÈíÖíä"
Now I am building a new system using VB.NET; how can I read these names so they appear in Arabic characters?
I should also point out that although the old system stores the data as shown above, it displays the characters in the correct format.
How can I display them properly in the new system and in SQL Server Management Studio?
Have you tried nvarchar? You may find some useful information at the link below:
When must we use NVARCHAR/NCHAR instead of VARCHAR/CHAR in SQL Server?
I faced the same problem, and I solved it in two steps:
1. Change the datatype of the column in the DB to nvarchar.
2. Use encoding to convert the existing data into Arabic.
I used the following function:
private string GetDataWithArabic(string srcData)
{
    // Take the mis-decoded text back to raw bytes via Latin-1 (ISO-8859-1),
    // then re-read those bytes using the system's default (Arabic) code page.
    Encoding iso = Encoding.GetEncoding("iso-8859-1");
    Encoding unicode = Encoding.Default;
    byte[] unicodeBytes = iso.GetBytes(srcData);
    return unicode.GetString(unicodeBytes);
}
But make sure you run this method only once on the DB data, because it will corrupt the data if used twice.
I think your answer is here: "storing and retrieving non english characters" http://aalamrangi.wordpress.com/2012/05/13/storing-and-retrieving-non-english-unicode-characters-hindi-czech-arabic-etc-in-sql-server/

Automatic character encoding handling in Perl / DBI / DBD::ODBC

I'm using Perl with DBI / DBD::ODBC to retrieve data from an SQL Server database, and have some issues with character encoding.
The database has a default collation of SQL_Latin1_General_CP1_CI_AS, so data in varchar columns is encoded in Microsoft's version of Latin-1, AKA windows-1252.
There doesn't seem to be a way to handle this transparently in DBI/DBD::ODBC. I get data back still encoded as windows-1252; for instance, € “ ” are encoded as bytes 0x80, 0x93 and 0x94. When I write those to a UTF-8 encoded XML file without decoding them first, they are written as Unicode characters U+0080, U+0093 and U+0094 instead of U+20AC, U+201C, U+201D, which is obviously not correct.
My current workaround is to call $val = Encode::decode('windows-1252', $val) on every column after every fetch. This works, but hardly seems like the proper way to do this.
Isn't there a way to tell DBI or DBD::ODBC to do this conversion for me?
I'm using ActivePerl (5.12.2 Build 1202), with DBI (1.616) and DBD::ODBC (1.29) provided by ActivePerl and updated with ppm; running on the same server that hosts the database (SQL Server 2008 R2).
My connection string is:
dbi:ODBC:Driver={SQL Server Native Client 10.0};Server=localhost;Database=$DB_NAME;Trusted_Connection=yes;
Thanks in advance.
DBD::ODBC (and the ODBC API) does not know the character set of the underlying column, so DBD::ODBC cannot do anything with 8-bit data returned; it can only return it as-is, and you need to know what it is and decode it. If you bind the columns as SQL_WCHAR/SQL_WVARCHAR, the driver/SQL Server should translate the characters to UCS-2, and DBD::ODBC should see the columns as SQL_WCHAR/SQL_WVARCHAR. When DBD::ODBC is built in Unicode mode, SQL_WCHAR columns are treated as UCS-2 and decoded and re-encoded in UTF-8, and Perl should see them as Unicode characters.
You need to set SQL_WCHAR as the bind type after bind_columns as bind types are not sticky like parameter types.
If you want to continue reading your varchar data, which is windows-1252, as bytes, then currently you have no choice but to decode it. I'm not in a rush to add something to DBD::ODBC to do this for you, since this is the first time anyone has mentioned it to me. You might want to look at DBI callbacks, as decoding the returned data might be more easily done in those (say, in the fetch method).
You might also want to investigate the "Perform Translation for character data" setting in newer SQL Server ODBC Drivers although I have little experience with it myself.
