We are migrating some data from SQL Server to Oracle. For columns defined as NVARCHAR in SQL Server we started creating NVARCHAR columns in Oracle, thinking them to be similar. But it looks like they are not.
I have read a couple of posts on Stack Overflow and want to confirm my findings.
Oracle VARCHAR2 already supports Unicode if the database character set is, say, AL32UTF8 (which is true for our case).
SQL Server VARCHAR does not support Unicode. SQL Server explicitly requires columns to be of NCHAR/NVARCHAR type to store data in Unicode (specifically in the 2-byte UCS-2 format).
Hence, would it be correct to say that SQL Server NVARCHAR columns can/should be migrated as Oracle VARCHAR2 columns?
Yes, if your Oracle database is created using a Unicode character set, an NVARCHAR in SQL Server should be migrated to a VARCHAR2 in Oracle. In Oracle, the NVARCHAR2 data type exists to allow applications to store Unicode data when the database character set does not support Unicode.
One thing to be aware of in migrating, however, is character length semantics. In SQL Server, an NVARCHAR(20) allocates space for 20 characters, which requires up to 40 bytes in UCS-2. In Oracle, by default, a VARCHAR2(20) allocates 20 bytes of storage. In the AL32UTF8 character set, that is potentially only enough space for 6 characters, though most likely it will handle much more (a single character in AL32UTF8 requires between 1 and 3 bytes). You probably want to declare your Oracle types as VARCHAR2(20 CHAR), which indicates that you want to allocate space for 20 characters regardless of how many bytes that requires. That tends to be much easier to communicate than trying to explain why some 20-character strings are allowed while other 10-character strings are rejected.
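To make the difference concrete, here is a minimal Oracle DDL sketch (table and column names are hypothetical) showing the two length semantics side by side:
-- Oracle: byte vs. character length semantics
CREATE TABLE migration_demo (
    name_bytes VARCHAR2(20 BYTE),  -- room for 20 bytes; some 20-character strings will not fit under AL32UTF8
    name_chars VARCHAR2(20 CHAR)   -- room for 20 characters, regardless of how many bytes they need
);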
You can change the default length semantics at the session level so that any tables you create without specifying any length semantics will use character rather than byte semantics
ALTER SESSION SET nls_length_semantics=CHAR;
That lets you avoid typing CHAR every time you define a new column. It is also possible to set that at a system level, but doing so is discouraged by the NLS team -- apparently, not all the scripts Oracle provides have been thoroughly tested against databases where NLS_LENGTH_SEMANTICS has been changed, and probably very few third-party scripts have been.
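As a quick sanity check (a sketch; the table name is hypothetical), you can confirm which semantics a column ended up with by querying USER_TAB_COLUMNS:
ALTER SESSION SET nls_length_semantics = CHAR;
CREATE TABLE semantics_demo (name VARCHAR2(20));   -- now 20 characters, not 20 bytes
SELECT column_name, char_used   -- 'C' = character semantics, 'B' = byte semantics
FROM user_tab_columns
WHERE table_name = 'SEMANTICS_DEMO';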
Related
We are using MS-SQL and Oracle as our databases.
We have used Hibernate annotations to create tables; in the annotation class file we have declared the column definition as
@Column(name="UCAALSNO", nullable=false, columnDefinition="nvarchar(20)")
and this works fine for MS-SQL.
But when it comes to Oracle, nvarchar throws an exception, as Oracle supports only NVARCHAR2.
How do we write the annotation so that the nvarchar datatype works for both databases?
You could use NCHAR:
In MSSQL:
nchar [ ( n ) ]
Fixed-length Unicode string data. n defines the string length and must
be a value from 1 through 4,000. The storage size is two times n
bytes. When the collation code page uses double-byte characters, the
storage size is still n bytes. Depending on the string, the storage
size of n bytes can be less than the value specified for n. The ISO
synonyms for nchar are national char and national character.
while in Oracle:
NCHAR
The maximum length of an NCHAR column is 2000 bytes. It can hold up to
2000 characters. The actual data is subject to the maximum byte limit
of 2000. The two size constraints must be satisfied simultaneously at
run time.
NCHAR occupies a fixed amount of space, so for a very large table there will be a considerable space difference between an nchar and an nvarchar column; you should take this into consideration.
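To see what the fixed length costs, here is a small SQL Server sketch (variable names are arbitrary) comparing the storage of the same three characters in both types:
DECLARE @fixed NCHAR(10) = N'abc';
DECLARE @variable NVARCHAR(10) = N'abc';
SELECT DATALENGTH(@fixed)    AS nchar_bytes,     -- 20: padded out to 10 characters * 2 bytes
       DATALENGTH(@variable) AS nvarchar_bytes;  -- 6: only the 3 characters actually stored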
I usually have incremental DB schema migration scripts for my production DBs and I only rely on Hibernate DDL generation for my integration testing in-memory databases (e.g. HSQLDB or H2). This way I choose the production schema types first and the "columnDefinition" only applies to the testing schema, so there is no conflict.
You might want to read this too, which sidesteps the additional N(VAR)CHAR(2) complexity, so you might consider setting a default character encoding instead:
Given that, I'd much rather go with the approach that maximizes
flexibility going forward, and that's converting the entire database
to Unicode (AL32UTF8 presumably) and just using that.
Although you might be recommended to use VARCHAR2, VARCHAR has been a synonym for VARCHAR2 for a long time now.
So, quoting a DBA's opinion:
The Oracle 9.2 and 8.1.7 documentation say essentially the same thing,
so even though Oracle continually discourages the use of VARCHAR, so
far they haven't done anything to change its parity with VARCHAR2.
I'd say give it a try for VARCHAR too, as it's supported on most DBs.
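If you want to verify the synonym behaviour yourself, a quick Oracle sketch (table name hypothetical):
CREATE TABLE varchar_demo (col1 VARCHAR(10));
SELECT column_name, data_type
FROM user_tab_columns
WHERE table_name = 'VARCHAR_DEMO';
-- DATA_TYPE comes back as VARCHAR2: Oracle silently maps VARCHAR to VARCHAR2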
Currently, I am in the process of updating all of our Delphi 2007 code base to Delphi XE2. The biggest consideration is the ANSI to Unicode conversion, which we've dealt with by re-defining all base types (char/string) to ANSI types (ansichar/ansistring). This has worked in many of our programs, until I started working with the database.
The problem started when I converted a program that stores information read from a file into an SQL Server 2008 database. Suddenly simple queries that used a string to locate data would fail, such as:
SELECT id FROM table WHERE name = 'something'
The name field is a varchar. I found that I was able to complete the query successfully by prefixing the string name with an N. I was under the impression that varchar could only store ANSI characters, but it appears to be storing Unicode?
Some more information: the name field in Delphi is string[13], but I've tried dropping the [13]. The database collation is SQL_Latin1_General_CP1_CI_AS. We use ADO to interface with the database. The connection information is stored in the ODBC Administrator.
NOTE: I've solved my actual problem thanks to a bit of direction from Panagiotis. The name we read from our map file is an array[1..24] of AnsiChar. This value was being implicitly converted to string[13], which included null characters. So a name with 5 characters was really being stored as the 5 characters plus 8 null characters in the database.
varchar fields do NOT store Unicode characters. They store characters in the code page specified by the field's collation. SQL Server will try to convert characters to the correct code page when you try to store Unicode data or data from a different code page. You can disable this behaviour, but the best option is to avoid the whole mess by using nvarchar fields and UnicodeString in your application.
You mention that you changed all character types to ANSI, not Unicode, types in your application. If you want to use Unicode you should be using a Unicode type like UnicodeString. Otherwise your values will be converted to ANSI when they are sent to your server. This conversion is done by your code when you create the AnsiString that is sent to the server.
BTW, your SELECT statement stores an ASCII value in the field. You have to prepend the value with N if you want to store it as a Unicode value, e.g.
SELECT id FROM table WHERE name = N'something'
Even this will not guarantee that your data will reach the server in a Unicode form. If you store the statement in an AnsiString the entire statement is converted to ANSI before it is sent to the server. If your app makes a wrong conversion, you will end up with mangled data on the server.
The solution is very simple, just use parameterized statements to pass unicode values as unicode parameters and store them in NVarchar fields. It is much faster, avoids all conversion errors and prevents SQL injection attacks.
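On the server side, the parameterized approach boils down to something like this T-SQL sketch (table and column names are hypothetical; from Delphi/ADO you would supply the value as a wide-string parameter rather than building the SQL text yourself):
DECLARE @name NVARCHAR(50) = N'something';
EXEC sp_executesql
     N'SELECT id FROM dbo.MyTable WHERE name = @name',
     N'@name NVARCHAR(50)',
     @name = @name;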
By default - what is the character encoding set for a database in Microsoft SQL Server?
How can I see the current character encoding in SQL Server?
Encodings
In most cases, SQL Server stores Unicode data (i.e. that which is found in the XML and N-prefixed types) in UCS-2 / UTF-16 (storage is the same; UTF-16 merely handles Supplementary Characters correctly). This is not configurable: there is no option to use either UTF-8 or UTF-32 (see the UPDATE section at the bottom re: UTF-8 starting in SQL Server 2019).
Whether or not the built-in functions can properly handle Supplementary Characters, and whether or not those are sorted and compared properly, depends on the Collation being used. The older Collations, those with names starting with SQL_ (e.g. SQL_Latin1_General_CP1_CI_AS) or with no version number in the name (e.g. Latin1_General_CI_AS), equate all Supplementary Characters with each other (due to having no sort weight). Starting in SQL Server 2005, the 90 series Collations (those with _90_ in the name) were introduced; they could at least do a binary comparison on Supplementary Characters so that you could differentiate between them, even if they didn't sort in the desired order. That also holds true for the 100 series Collations introduced in SQL Server 2008. SQL Server 2012 introduced Collations with names ending in _SC that not only sort Supplementary Characters properly, but also allow the built-in functions to interpret them as expected (i.e. treating the surrogate pair as a single entity). Starting in SQL Server 2017, all new Collations (the 140 series) implicitly support Supplementary Characters, hence there are no new Collations with names ending in _SC.
Starting in SQL Server 2019, UTF-8 became a supported encoding for CHAR and VARCHAR data (columns, variables, and literals), but not TEXT (see UPDATE section at the bottom re: UTF-8 starting in SQL Server 2019).
Non-Unicode data (i.e. that which is found in the CHAR, VARCHAR, and TEXT types — but don't use TEXT, use VARCHAR(MAX) instead) uses an 8-bit encoding (Extended ASCII, DBCS, or EBCDIC). The specific character set / encoding is based on the Code Page, which in turn is based on the Collation of a column, or the Collation of the current database for literals and variables, or the Collation of the Instance for variable / cursor names and GOTO labels, or what is specified in a COLLATE clause if one is being used.
To see how locales match up to collations, check out:
Windows Collation Name
SQL Server Collation Name
To see the Code Page associated with a particular Collation (this is the character set and only affects CHAR / VARCHAR / TEXT data), run the following:
SELECT COLLATIONPROPERTY( 'Latin1_General_100_CI_AS' , 'CodePage' ) AS [CodePage];
To see the LCID (i.e. locale) associated with a particular Collation (this affects the sorting & comparison rules), run the following:
SELECT COLLATIONPROPERTY( 'Latin1_General_100_CI_AS' , 'LCID' ) AS [LCID];
To view the list of available Collations, along with their associated LCIDs and Code Pages, run:
SELECT [name],
COLLATIONPROPERTY( [name], 'LCID' ) AS [LCID],
COLLATIONPROPERTY( [name], 'CodePage' ) AS [CodePage]
FROM sys.fn_helpcollations()
ORDER BY [name];
Defaults
Before looking at the Server and Database default Collations, one should understand the relative importance of those defaults.
The Server (Instance, really) default Collation is used as the default for newly created Databases (including the system Databases: master, model, msdb, and tempdb). But this does not mean that any Database (other than the 4 system DBs) is using that Collation. The Database default Collation can be changed at any time (though there are dependencies that might prevent a Database from having its Collation changed). The Server default Collation, however, is not so easy to change. For details on changing all collations, please see: Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?
The server/Instance Collation controls:
local variable names
CURSOR names
GOTO labels
Instance-level meta-data
The Database default Collation is used in three ways:
as the default for newly created string columns. But this does not mean that any string column is using that Collation. The Collation of a column can be changed at any time. Here knowing the Database default is important as an indication of what the string columns are most likely set to.
as the Collation for operations involving string literals, variables, and built-in functions that do not take string inputs but produce a string output (i.e. IF (@InputParam = 'something')). Here knowing the Database default is definitely important as it governs how these operations will behave.
Database-level meta-data
The column Collation is either specified in the COLLATE clause at the time of the CREATE TABLE or an ALTER TABLE {table_name} ALTER COLUMN, or if not specified, taken from the Database default.
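For example, a brief sketch of both forms (table, column, and collation choices are hypothetical):
CREATE TABLE dbo.CollationDemo
(
    Notes NVARCHAR(100) COLLATE Latin1_General_100_CI_AS
);
ALTER TABLE dbo.CollationDemo
    ALTER COLUMN Notes NVARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC;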
Since there are several layers here where a Collation can be specified (Database default / columns / literals & variables), the resulting Collation is determined by Collation Precedence.
All of that being said, the following query shows the default / current settings for the OS, SQL Server Instance, and specified Database:
SELECT os_language_version,
---
SERVERPROPERTY('LCID') AS 'Instance-LCID',
SERVERPROPERTY('Collation') AS 'Instance-Collation',
SERVERPROPERTY('ComparisonStyle') AS 'Instance-ComparisonStyle',
SERVERPROPERTY('SqlSortOrder') AS 'Instance-SqlSortOrder',
SERVERPROPERTY('SqlSortOrderName') AS 'Instance-SqlSortOrderName',
SERVERPROPERTY('SqlCharSet') AS 'Instance-SqlCharSet',
SERVERPROPERTY('SqlCharSetName') AS 'Instance-SqlCharSetName',
---
DATABASEPROPERTYEX(N'{database_name}', 'LCID') AS 'Database-LCID',
DATABASEPROPERTYEX(N'{database_name}', 'Collation') AS 'Database-Collation',
DATABASEPROPERTYEX(N'{database_name}', 'ComparisonStyle') AS 'Database-ComparisonStyle',
DATABASEPROPERTYEX(N'{database_name}', 'SQLSortOrder') AS 'Database-SQLSortOrder'
FROM sys.dm_os_windows_info;
Installation Default
Another interpretation of "default" could mean what default Collation is selected for the Instance-level collation when installing. That varies based on the OS language, but the (horrible, horrible) default for systems using "US English" is SQL_Latin1_General_CP1_CI_AS. In that case, the "default" encoding is Windows Code Page 1252 for VARCHAR data, and as always, UTF-16 for NVARCHAR data. You can find the list of OS language to default SQL Server collation here: Collation and Unicode support: Server-level collations. Keep in mind that these defaults can be overridden; this list is merely what the Instance will use if not overridden during install.
UPDATE 2018-10-02
SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes (not TEXT!). This is accomplished via a set of new collations, the names of which all end with _UTF8. This is an interesting capability that will definitely help some folks, but there are some "quirks" with it, especially when UTF-8 isn't being used for all columns and the Database's default Collation, so don't use it just because you have heard that UTF-8 is magically better. UTF-8 was designed solely for ASCII compatibility: to enable ASCII-only systems (i.e. UNIX back in the day) to support Unicode without changing any existing code or files. That it saves space for data using mostly (or only) US English characters (and some punctuation) is a side-effect. When not using mostly (or only) US English characters, data can be the same size as UTF-16, or even larger, depending on which characters are being used. And, in cases where space is being saved, performance might improve, but it might also get worse.
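As a small illustration (table name and collation choice are hypothetical), a VARCHAR column can be given one of the new _UTF8 collations directly, and Unicode literals inserted into it are stored as UTF-8:
CREATE TABLE dbo.Utf8Demo
(
    Notes VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT INTO dbo.Utf8Demo (Notes) VALUES (N'日本語もOK');
SELECT Notes, DATALENGTH(Notes) AS bytes FROM dbo.Utf8Demo;   -- byte count reflects UTF-8, not UTF-16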
For a detailed analysis of this new feature, please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?".
If you need to know the default collation for a newly created database use:
SELECT SERVERPROPERTY('Collation')
This is the server collation for the SQL Server instance that you are running.
The default character encoding for a SQL Server database is iso_1, which is ISO 8859-1. Note that the character encoding depends on the data type of a column. You can get an idea of what character encodings are used for the columns in a database as well as the collations using this SQL:
SELECT data_type, character_set_catalog, character_set_schema, character_set_name,
       collation_catalog, collation_schema, collation_name, COUNT(*) AS [count]
FROM information_schema.columns
GROUP BY data_type, character_set_catalog, character_set_schema, character_set_name,
         collation_catalog, collation_schema, collation_name;
If it's using the default, the character_set_name should be iso_1 for the char and varchar data types. Since nchar and nvarchar store Unicode data in UCS-2 format, the character_set_name for those data types is UNICODE.
SELECT DATABASEPROPERTYEX('DBName', 'Collation') SQLCollation;
Where DBName is your database name.
I think this is worthy of a separate answer: although internally Unicode data is stored as UTF-16 in SQL Server, this is the little-endian flavour, so if you're calling the database from an external system, you probably need to specify UTF-16LE.
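A one-line way to confirm the byte order (a sketch; any NVARCHAR value will do):
SELECT CONVERT(VARBINARY(4), N'A') AS utf16le_bytes;   -- 0x4100: U+0041 stored low byte first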
You can see collation settings for each table like the following code:
SELECT t.name TableName, c.name ColumnName, collation_name
FROM sys.columns c
INNER JOIN sys.tables t on c.object_id = t.object_id where t.name = 'name of table';
We have a WinForms app which stores data in SQL Server (2000; we are working on porting it to 2008) through ADO.NET (1.1, working on porting to 4.0). Everything works fine if I read data previously written in a Western-European locale (e.g. "test", "test ù"), but now we have to be able to mix Western and non-Western alphabets as well (e.g. "test - ۓےۑ" - these are just random Arabic chars).
On the SQL Server side, the database has been set up with the Latin1_General collation, and the field is an nvarchar(80). If I run a SQL SELECT statement (e.g. "SELECT * FROM MyTable WHERE field = 'test - ۓےۑ'"; don't mind the "*" or the actual names) from Query Analyzer, I get no results; the same happens if I pass the SQL statement to an ADO.NET DataAdapter to fill a DataTable. My guess is that it has something to do with collation, but I don't know how to correct this: do I have to change the collation (SQL Server) to a different one? Or do I have to set the locale on the DataAdapter/DataTable (ADO.NET)?
Thanks in advance to anyone who will help
Shouldn't you use N when comparing nvarchar with an extended character set?
SELECT * From TestTable WHERE GreekColCaseInsensitive = N'test - ۓےۑ'
Yes, the problem is most likely the collation. The Latin1_General collation does not include the rules to sort and compare non-Latin characters.
MSDN claims:
If you must store character data that reflects multiple languages, you can minimize collation compatibility issues by always using the Unicode nchar, nvarchar, and ntext data types instead of the char, varchar, text data types. Using the Unicode data types eliminates code page conversion issues.
Since you have already complied with this, you should read further on the info about Mixed Collation Environments here.
Additionally, I want to add that just changing a collation is not something done easily; check the MSDN documentation for SQL 2000:
When you set up SQL Server 2000, it is important to use the correct collation settings. You can change collation settings after running Setup, but you must rebuild the databases and reload the data. It is recommended that you develop a standard within your organization for these options. Many server-to-server activities can fail if the collation settings are not consistent across servers.
You can specify a collation on a per-column basis, however:
CREATE TABLE TestTable (
id int,
GreekColCaseInsensitive nvarchar(10) collate greek_ci_as,
LatinColCaseSensitive nvarchar(10) collate latin1_general_cs_as
)
Have a look at the different binary multilingual collations here. Depending on the charset you use, you should find one that fits your purpose.
If you are not able or willing to change the collation of a column you can also just specify the collation to be used in the query like:
SELECT * From TestTable
WHERE GreekColCaseInsensitive = N'test - ۓےۑ'
COLLATE latin1_general_cs_as
As jfrobishow pointed out, the use of N in front of the string you want to compare is essential. What does it do?
It denotes that the subsequent string is in Unicode (the N actually stands for National language character set). Which means that you are passing an NCHAR, NVARCHAR or NTEXT value, as opposed to CHAR, VARCHAR or TEXT. See Article #2354 for a comparison of these data types.
You can find a quick rundown here.
I'm trying to store Japanese characters in nvarchar fields in my SQL Server 2000 database.
When I run an update statement like:
update blah
set address = N'スタンダードチャ'
where key_ID = 1
from SQL Server Management Studio, and then run a select statement, I see only question marks returned in the results window. I'm seeing the same question marks in the webpage which looks at the database.
It seems this is an issue with storing the proper data right? Can anyone tell me what I need to do differently?
This cannot be the correct answer given your example (which already has the prefix), but the most common reason I've seen is that string literals are missing the Unicode N prefix.
So, instead of
set address = N'スタンダードチャ'
one would try to write to an nvarchar field without the Unicode prefix:
set address = 'スタンダードチャ'
See also:
N prefix before string in Transact-SQL query
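A small self-contained sketch (using a table variable, so nothing in your schema is touched) shows the difference between the two forms:
DECLARE @demo TABLE (address NVARCHAR(80));
INSERT INTO @demo VALUES ('スタンダードチャ');    -- no N prefix: the varchar literal is squeezed through the database code page first, typically arriving as ????????
INSERT INTO @demo VALUES (N'スタンダードチャ');   -- N prefix: the literal stays Unicode all the way into the column
SELECT address FROM @demo;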
I was facing this same issue when using Indian-language characters and storing them in DB nvarchar fields. Then I went through this Microsoft article:
http://support.microsoft.com/kb/239530
I followed it and my Unicode issue got resolved. In this article they say: you must precede all Unicode strings with the prefix N when you deal with Unicode string constants in SQL Server.
SQL Server Unicode Support
SQL Server Unicode data types support UCS-2 encoding. Unicode data types store character data using two bytes for each character rather than one byte. There are 65,536 different bit patterns in two bytes, so Unicode can use one standard set of bit patterns to encode each character in all languages, including languages such as Chinese that have large numbers of characters.
In SQL Server, data types that support Unicode data are:
nchar
nvarchar
nvarchar(max) – new in SQL Server 2005
ntext
Use of nchar, nvarchar, nvarchar(max), and ntext is the same as char, varchar, varchar(max), and text, respectively, except:
- Unicode supports a wider range of characters.
- More space is needed to store Unicode characters (see the sketch after this list).
- The maximum size of nchar and nvarchar columns is 4,000 characters, not 8,000 characters like char and varchar.
- Unicode constants are specified with a leading N, for example, N'A Unicode string'
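The extra storage mentioned in the list above is easy to see with DATALENGTH (a quick sketch):
SELECT DATALENGTH('abc')  AS varchar_bytes,    -- 3: one byte per character
       DATALENGTH(N'abc') AS nvarchar_bytes;   -- 6: two bytes per character (UCS-2)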
APPLIES TO
Microsoft SQL Server 7.0 Standard Edition
Microsoft SQL Server 2000 Standard Edition
Microsoft SQL Server 2005 Standard Edition
Microsoft SQL Server 2005 Express Edition
Microsoft SQL Server 2005 Developer Edition
Microsoft SQL Server 2005 Enterprise Edition
Microsoft SQL Server 2005 Workgroup Edition
The code is absolutely fine. You can insert a Unicode string with the N prefix into a field declared as NVARCHAR. So can you check whether Address is an NVARCHAR column? I tested the code below in SQL Server 2008 R2 and it worked.
update blah
set address = N'スタンダードチャ'
where key_ID = 1
You need to write N before the string value, e.g.:
INSERT INTO LabelManagement (KeyValue) VALUES (N'変更命令');
Here I am storing a value in the Japanese language and I have added N before the string.
I am using SQL Server 2014.
Hope you find the solution.
Enjoy.
You need to check the globalisation settings of all the code that deals with this data, from your database, data access and presentation layers. This includes SSMS.
(You also need to work out which version you are using, 2003 doesn't exist...)
SSMS will not display that correctly; you might see question marks or boxes.
Paste the results into Word and they should appear in Japanese.
In the webpage you need to set the Content-Type, the code below will display Chinese Big5
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=big5">
To verify the data you can't use ascii(), since ascii() can only see the ASCII character set.
run this
select unicode(address),ascii(address)
from blah where key_ID = 1
Output should be the following (it only looks at the first character)
12473 63
I can almost guarantee that the data type is not Unicode. If you want to learn more you can check Wikipedia for information on Unicode, ASCII, and ANSI. Unicode can store more unique characters, but takes more space to store, transfer, and process. Also, some programs and other things don't support Unicode. The Unicode data types for MS SQL are "nchar", "nvarchar", and "ntext".
We are using Microsoft SQL Server 2008 R2 (SP3). Our table collation is specified as SQL_Latin1_General_CP1_CI_AS. I have my types specified as the N variety:
nvarchar(MAX)
nchar(2)
..etc
To insert Japanese characters I prefix the string with a capital N
N'素晴らしい一日を'
Works like a charm.