Japanese characters in Oracle Database not displayed properly

I recently created an Oracle database with the JA16SJIS character set.
I then tried to insert some data that includes Japanese characters by running an external SQL file from SQL*Plus. The file is encoded in Shift-JIS (and the Japanese characters display properly when I open it in Notepad++).
The insert succeeded, but when I select the data (using SQL*Plus), the Japanese characters are not displayed properly (I get alphabetic characters mixed with question marks).
Even when I use SQL Developer to view the data, the Japanese characters are still unreadable.
I'm using Windows 7 Professional SP1 and Oracle Database 11g R2, with the system locale set to Japan.

First, try inserting some text directly from the SQL Developer data view. That should work no matter what, so you can use it to check your imports.
Then, before you connect with SQL*Plus, you must declare the encoding of what you are going to send by setting the NLS_LANG environment variable.
NLS_LANG=ENGLISH_FRANCE.JA16SJIS
The syntax for setting it depends on your OS. The only important part is the last one, JA16SJIS, which means Shift-JIS, as you already know.
You can then connect with SQL*Plus and import your file.
Note that the encoding you specify must match the encoding of your file, but not necessarily the character set of the database, since Oracle will convert between the two if necessary. So your database could be in UTF8 and it would still work (because UTF8 can hold Japanese characters).
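To see why the declared client encoding has to match the file, here is a minimal Python sketch (Python is only a neutral illustration here, not part of Oracle's tooling). It encodes a sample string as Shift-JIS and then decodes the same bytes under a different code page, standing in for a client whose NLS_LANG names the wrong character set. The sample text and the choice of cp1252 as the "wrong" code page are assumptions.

```python
# Sample text and the "wrong" code page are illustrative assumptions.
SAMPLE = "日本語"


def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes under the given charset, replacing undecodable bytes."""
    return raw.decode(charset, errors="replace")


# Bytes as they would sit in a Shift-JIS encoded SQL file.
sjis_bytes = SAMPLE.encode("shift_jis")

# Declared encoding matches the file: the text round-trips unchanged.
matched = decode_as(sjis_bytes, "shift_jis")

# Declared encoding does not match: the same bytes are read as
# unrelated Western characters, one per byte.
mismatched = decode_as(sjis_bytes, "cp1252")

if __name__ == "__main__":
    print(sjis_bytes.hex(" "))
    print(matched == SAMPLE, mismatched == SAMPLE)
```

The bytes themselves never change between the two reads; only the declared interpretation does, which is the role the last part of NLS_LANG plays.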

In these cases the first thing I do is to have a look at what byte values are stored in the database. You can use the dump function for that.
select dump(<column>) from <table>
If you know what byte values your characters should have you can check if the correct values are in your table.
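If you are not sure what byte values to expect, a short Python sketch like the following can compute them for comparison with the DUMP output, which by default lists the stored bytes in decimal. The helper name `expected_dump_bytes` and the sample string are hypothetical, and the sketch assumes the column is stored in Shift-JIS (JA16SJIS).

```python
def expected_dump_bytes(text: str, charset: str = "shift_jis") -> str:
    """Return the comma-separated decimal byte values of text in charset."""
    return ",".join(str(b) for b in text.encode(charset))


if __name__ == "__main__":
    # Byte values to look for after "Len=...:" in the DUMP result.
    print(expected_dump_bytes("日本語"))
    # A literal question mark is byte 63; seeing it where a Japanese
    # character should be suggests the character was replaced on insert.
    print(expected_dump_bytes("?", "ascii"))
```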

Related

Unicode conversion, database woes (Delphi 2007 to XE2)

Currently, I am in the process of updating all of our Delphi 2007 code base to Delphi XE2. The biggest consideration is the ANSI to Unicode conversion, which we've dealt with by re-defining all base types (char/string) to ANSI types (ansichar/ansistring). This has worked in many of our programs, until I started working with the database.
The problem started when I converted a program that stores information read from a file into an SQL Server 2008 database. Suddenly simple queries that used a string to locate data would fail, such as:
SELECT id FROM table WHERE name = 'something'
The name field is a varchar. I found that I was able to complete the query successfully by prefixing the string name with an N. I was under the impression that varchar could only store ANSI characters, but it appears to be storing Unicode?
Some more information: the name field in Delphi is string[13], but I've tried dropping the [13]. The database collation is SQL_Latin1_General_CP1_CI_AS. We use ADO to interface with the database. The connection information is stored in the ODBC Administrator.
NOTE: I've solved my actual problem thanks to a bit of direction from Panagiotis. The name we read from our map file is an array[1..24] of AnsiChar. This value was being implicitly converted to string[13], which was including null characters. So a name with 5 characters was really being stored as the 5 characters + 8 null characters in the database.
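The padding problem described in the note can be reproduced outside Delphi. The following Python sketch is only an analogy, with hypothetical helper names and a made-up 5-character name: it pads a value with null bytes to a fixed width of 13 and shows that the padded value no longer compares equal to the plain one until everything from the first null onward is cut off.

```python
def pad_fixed(name: bytes, width: int) -> bytes:
    """Simulate a fixed-size char buffer: pad with NUL bytes up to width."""
    return name.ljust(width, b"\x00")


def clean(buffer: bytes) -> bytes:
    """Keep only the bytes before the first NUL, as a C-style string would."""
    return buffer.split(b"\x00", 1)[0]


# 5 characters + 8 NUL bytes, as in the string[13] case above.
stored = pad_fixed(b"ROUTE", 13)

if __name__ == "__main__":
    print(len(stored))                 # total width including padding
    print(stored == b"ROUTE")          # padded value does not match
    print(clean(stored) == b"ROUTE")   # matches once the padding is removed
```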
varchar fields do NOT store Unicode characters. They store non-Unicode values in the code page specified by the field's collation. SQL Server will try to convert characters to the correct code page when you try to store Unicode or data from a different code page. You can disable this feature, but the best option is to avoid the whole mess by using nvarchar fields and UnicodeString in your application.
You mention that you changed all character types in your application to ANSI types, not Unicode types. If you want to use Unicode, you should be using a Unicode type like UnicodeString. Otherwise your values will be converted to ANSI when they are sent to your server. This conversion is done by your code when you create the AnsiString that is sent to the server.
BTW, the literal in your SELECT statement is passed as a non-Unicode value. You have to prefix the literal with N if you want it to be treated as a Unicode value, e.g.
SELECT id FROM table WHERE name = N'something'
Even this will not guarantee that your data will reach the server in a Unicode form. If you store the statement in an AnsiString the entire statement is converted to ANSI before it is sent to the server. If your app makes a wrong conversion, you will end up with mangled data on the server.
The solution is very simple: use parameterized statements to pass Unicode values as Unicode parameters, and store them in nvarchar fields. This avoids the conversion errors and prevents SQL injection attacks, and it is typically faster as well.
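As a rough illustration of the parameterized approach, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for ADO and SQL Server, so the driver, the `places` table and the `find_ids` helper are all assumptions rather than the asker's setup. The point it shows is that the value travels as a bound parameter and is never spliced into the SQL text.

```python
import sqlite3


def find_ids(conn: sqlite3.Connection, name: str) -> list:
    """Look up ids by name using a bound parameter, not string concatenation."""
    rows = conn.execute(
        "SELECT id FROM places WHERE name = ?", (name,)
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical in-memory table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO places (name) VALUES (?)",
    [("Cañon",), ("القاهرة",), ("O'Brien",)],
)

if __name__ == "__main__":
    print(find_ids(conn, "القاهرة"))
    print(find_ids(conn, "O'Brien"))        # quote needs no escaping
    print(find_ids(conn, "x' OR '1'='1"))   # treated as data, matches nothing
```

Because the statement text contains only the placeholder, no N prefix or manual quoting is involved; the driver is responsible for sending the value in the right form.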

How to script tables in SSMS that contain non-unicode text

I'm working with some tables in SQL Server that store text using 8-bit characters rather than unicode -- varchar rather than nvarchar. A certain amount of the text contains characters with values outside the ASCII range, for example curly quotes, em-dashes, and international characters such as ñ. Presumably this works because all our PCs and servers use the same code page.
However, when I use the Task > Generate Scripts scripting tool in SSMS to script such a table, the resulting script translates the special characters in such a way that if I use the script to reconstruct the table, the special characters are corrupted. For example "cañon" becomes "ca±on." I can see this in the INSERT statements that the script contains, where "cañon" from the database appears as "ca±on" in the INSERT statement.
In SSMS, how do I script a table that contains varchar data outside the ASCII range so that round-tripping will work?
In the Script Wizard's Output Option, you need to save as ANSI text, if you want to script extended ASCII data in CHAR and VARCHAR fields correctly.
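The "ca±on" symptom is what you get when a byte written under one code page is read back under another. The Python sketch below reproduces it under the assumption that the text was written as Windows-1252 (ANSI) and read as OEM code page 437; which pair of code pages was actually involved on the asker's machines is not stated, and the helper name `misread` is hypothetical.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one code page and decode the bytes in another."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    # In cp1252 the letter ñ is byte 0xF1; in cp437 that byte is ±.
    print(misread("cañon", "cp1252", "cp437"))
    # Same code page on both sides: the text survives the round trip.
    print(misread("cañon", "cp1252", "cp1252"))
```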

SQL Server Data Types for Chinese Characters and Yale Romanization

I'm building a game for learning Cantonese. A core component is a database table with the following columns:
Chinese Character(s) | Yale Romanization | English Equivalent
What SQL Server data type should I choose for the first and second columns?
I do not yet know where my source data will come from. So I can't yet tell you what encoding it will use. My best guess is UTF-8.
*EDIT - I now know where my source data will come from. Someone will manually enter it into an Excel spreadsheet that I will then import. This raises two related questions. First, what format should the Excel spreadsheet be saved in to preserve the accent marks that are part of Yale romanization? Second, is any font that supports the requisite accent marks acceptable? Or are only certain fonts compatible with the necessary character encoding?
nvarchar would be the choice for unicode, variable length strings. And you can set collation for each field in the table as well.
As for Excel, I would test it out. My guess would be that Excel would preserve collation, but the best way would be to test it out.
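One way to see why a Unicode type is the safer choice for both columns is to check which sample strings fit in a single-byte code page at all. This Python sketch is only an approximation of the varchar/nvarchar distinction: cp1252 stands in for a Latin1 varchar column and UTF-16 for nvarchar, and the sample strings and the `fits` helper are illustrative assumptions.

```python
def fits(text: str, charset: str) -> bool:
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples: Chinese characters, Yale with acute accents,
# and Yale with a macron.
SAMPLES = ["你好", "néih hóu", "sī"]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(ascii(sample), fits(sample, "cp1252"), fits(sample, "utf-16-le"))
```

Acute accents happen to exist in cp1252, but the macron and the Chinese characters do not, so a column limited to that code page would lose them.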

How can I recover Unicode data which displays in SQL Server as ?????

I have a database in SQL Server containing a column which needs to contain Unicode data (it contains user's addresses from all over the world e.g. القاهرة‎ for Cairo)
This column is an nvarchar column with the database default collation (Latin1_General_CI_AS), but I've noticed that data containing non-English characters, inserted into it via SQL statements, displays as ?????.
The cause seems to be that I wasn't using the N prefix, i.e. I wrote:
INSERT INTO table (address) VALUES ('القاهرة')
Instead of:
INSERT INTO table (address) VALUES (n'القاهرة')
I was under the impression that Unicode would automatically be converted for nvarchar columns and I didn't need this prefix, but this appears to be incorrect.
The problem is I still have some data in this column which appears as ????? in SQL Server Management Studio and I don't know what it is!
Is the data still there but in an incorrect character encoding preventing it from displaying but still salvageable (and if so how can I recover it?), or is it gone for good?
Thanks,
Tom
To find out what SQL Server really stores, use
SELECT CONVERT(VARBINARY(MAX), 'some text')
I just tried this with umlauted characters and Arabic (copied from Wikipedia; I have no idea what it says), both as plain strings and as N'' Unicode strings.
The results are that Arabic non-Unicode strings really end up as question marks (0x3F) in the conversion to VARCHAR.
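The same effect can be reproduced with a small Python sketch, assuming cp1252 as the code page behind a Latin1 collation (the helper name `to_varchar_bytes` is hypothetical). Characters with no equivalent in the code page are replaced by a literal question mark, byte 0x3F, so if that is what was stored, the original text cannot be reconstructed from those bytes.

```python
ARABIC = "القاهرة"
UMLAUT = "Müller"


def to_varchar_bytes(text: str, codepage: str = "cp1252") -> bytes:
    """Encode text into a code page, replacing unsupported characters with '?'."""
    return text.encode(codepage, errors="replace")


if __name__ == "__main__":
    # Every Arabic character becomes 0x3F; nothing of the original remains.
    print(to_varchar_bytes(ARABIC).hex(" "))
    # The umlaut exists in cp1252, so this string survives the conversion.
    print(to_varchar_bytes(UMLAUT).decode("cp1252") == UMLAUT)
```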
SSMS sometimes won't display all characters. I just tried what you had and it worked for me. Copy and paste it into Word and it might display correctly.
Usually, if SSMS can't display a character, it shows boxes rather than ?.
Try writing a small client that retrieves the data to a file or web page. Check ALL your code for other inserts or updates that might convert the data to varchar before storing it in tables.

Storing UTF-16/Unicode data in SQL Server

According to this, SQL Server 2K5 uses UCS-2 internally. It can store UTF-16 data in UCS-2 (with appropriate data types, nchar etc), however if there is a supplementary character this is stored as 2 UCS-2 characters.
This brings the obvious issues with the string functions, namely that what is one character is treated as 2 by SQL Server.
I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some of these characters may not be all that common.
Aside from the functions suggested in the article, are there any suggestions on the best approach for dealing with the (broken) string functions and UTF-16 data in SQL Server 2K5?
SQL Server 2012 now supports UTF-16 including surrogate pairs. See http://msdn.microsoft.com/en-us/library/ms143726(v=sql.110).aspx, especially the section "Supplementary characters".
So one fix for the original problem is to adopt SQL Server 2012.
The string functions work fine with unicode character strings; the ones that care about the number of characters treat a two-byte character as a single character, not two characters. The only ones to watch for are len() and datalength(), which return different values when using unicode. They return the correct values of course - len() returns the length in characters, and datalength() returns the length in bytes. They just happen to be different because of the two-byte characters.
So, as long as you use the proper functions in your code, everything should work transparently.
EDIT: Just double-checked Books Online; Unicode data has worked seamlessly with string functions since SQL Server 2000.
EDIT 2: As pointed out in the comments, SQL Server's string functions do not support the full Unicode character set due to lack of support for parsing surrogates outside of plane 0 (or, in other words, SQL Server's string functions only recognize up to 2 bytes per character.) SQL Server will store and return the data correctly, however any string function that relies on character counts will not return the expected values. The most common way to bypass this seems to be either processing the string outside SQL Server, or else using the CLR integration to add Unicode aware string processing functions.
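The difference between the three ways of counting can be sketched in Python, which is used here only to illustrate the arithmetic and does not call SQL Server. The helper names are hypothetical, and the sample string (one ASCII letter followed by U+20000, a character outside plane 0) is an assumption chosen to include a surrogate pair.

```python
def code_points(text: str) -> int:
    """Number of Unicode characters (code points) in text."""
    return len(text)


def utf16_units(text: str) -> int:
    """Number of 16-bit code units, which is what a UCS-2 based count sees."""
    return len(text.encode("utf-16-le")) // 2


def utf16_bytes(text: str) -> int:
    """Number of bytes when stored as UTF-16."""
    return len(text.encode("utf-16-le"))


# "a" followed by one supplementary character (stored as a surrogate pair).
SAMPLE = "a\U00020000"

if __name__ == "__main__":
    print(code_points(SAMPLE), utf16_units(SAMPLE), utf16_bytes(SAMPLE))
```

For text that stays within plane 0 the first two counts agree, which is why the mismatch only shows up once supplementary characters appear.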
Something to add that I just learned the hard way:
If you use an "N" field in Oracle (I'm running 9i) and access it via the .NET OracleClient, it seems that only parameterized SQL will work. The N'string' Unicode prefix doesn't seem to do the trick if you have inline SQL.
By "work" I mean not losing characters: with inline SQL, any characters not supported by the base charset are lost. In my case, English characters work fine, while Cyrillic turns into question marks/garbage.
There is a fuller discussion on the subject here: http://forums.oracle.com/forums/thread.jspa?threadID=376847
I wonder if the ORA_NCHAR_LITERAL_REPLACE variable can be set in the connection string or something.

Resources