SQL Server multi-language data support

How do you set up a SQL Server 2005 DBMS so that you can store data in different languages?
My exact problem is this: in SQL Server Management Studio I'm writing an INSERT statement that contains German umlauts. The text is saved successfully, but reading the same value back returns the text without the umlauts.
Consider that I have to support four languages: English, German, Greek, and Russian (I don't want to think about what I will face with the Russian text).
The DBMS is currently set up with a Greek collation (to support Greek).
Does this cause any problems?
Any hints?

You need to use the nvarchar data type for strings ( http://msdn.microsoft.com/en-us/library/ms186939.aspx ) and you also need to precede all Unicode string literals with N ( http://support.microsoft.com/kb/239530 ).
When dealing with Unicode string constants in SQL Server you must precede all Unicode strings with a capital letter N, as documented in the SQL Server Books Online topic "Using Unicode Data". The "N" prefix stands for National Language in the SQL-92 standard, and must be uppercase. If you do not prefix a Unicode string constant with N, SQL Server will convert it to the non-Unicode code page of the current database before it uses the string.
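A minimal sketch of the difference (the temp table is made up; on a database whose code page lacks these characters, e.g. a Greek collation, the unprefixed literal loses its umlauts):
CREATE TABLE #demo (txt nvarchar(50));
INSERT INTO #demo VALUES ('Grüße');   -- converted through the database code page before storage
INSERT INTO #demo VALUES (N'Grüße');  -- stays Unicode end to end
SELECT txt FROM #demo;
DROP TABLE #demo;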

Are you using the nvarchar type (rather than varchar)? That would be recommended if you have multiple languages in the same column; it will let you store and retrieve the text properly.
Of course, SQL Server can only maintain one collation on a particular column, even if the column is being used to store strings in multiple languages, so that is something to consider...
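For illustration, a hypothetical table that combines Unicode storage with an explicit per-column collation might look like this (names are made up):
CREATE TABLE Messages (
    id int IDENTITY PRIMARY KEY,
    body nvarchar(200) COLLATE Greek_CI_AS  -- Unicode storage; Greek sorting/comparison rules
);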

Related

Unicode conversion, database woes (Delphi 2007 to XE2)

Currently, I am in the process of updating all of our Delphi 2007 code base to Delphi XE2. The biggest consideration is the ANSI to Unicode conversion, which we've dealt with by re-defining all base types (char/string) to ANSI types (ansichar/ansistring). This has worked in many of our programs, until I started working with the database.
The problem started when I converted a program that stores information read from a file into an SQL Server 2008 database. Suddenly simple queries that used a string to locate data would fail, such as:
SELECT id FROM table WHERE name = 'something'
The name field is a varchar. I found that I was able to complete the query successfully by prefixing the string name with an N. I was under the impression that varchar could only store ANSI characters, but it appears to be storing Unicode?
Some more information: the name field in Delphi is string[13], but I've tried dropping the [13]. The database collation is SQL_Latin1_General_CP1_CI_AS. We use ADO to interface with the database. The connection information is stored in the ODBC Administrator.
NOTE: I've solved my actual problem thanks to a bit of direction from Panagiotis. The name we read from our map file is an array[1..24] of AnsiChar. This value was being implicitly converted to string[13], which was including null characters. So a name with 5 characters was really being stored as the 5 characters + 8 null characters in the database.
varchar fields do NOT store Unicode characters. They store characters from the code page specified by the field's collation. SQL Server will try to convert characters to that code page when you store Unicode data, or data from a different code page. You can disable this behavior, but the best option is to avoid the whole mess by using nvarchar fields and UnicodeString in your application.
You mention that you changed all character types to ANSI, not Unicode, types in your application. If you want to use Unicode you should be using a Unicode type like UnicodeString. Otherwise your values will be converted to ANSI when they are sent to the server; this conversion is done by your code when it creates the AnsiString that is sent to the server.
BTW, your SELECT statement compares against an ANSI value. You have to prefix the value with N if you want it treated as a Unicode value, e.g.
SELECT id FROM table WHERE name = N'something'
Even this will not guarantee that your data will reach the server in a Unicode form. If you store the statement in an AnsiString the entire statement is converted to ANSI before it is sent to the server. If your app makes a wrong conversion, you will end up with mangled data on the server.
The solution is very simple: use parameterized statements to pass Unicode values as Unicode parameters and store them in nvarchar fields. It is much faster, avoids conversion errors, and prevents SQL injection attacks.
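A minimal sketch of that advice using sp_executesql with an nvarchar parameter (table and column names come from the question; [table] is bracketed because it is a reserved word):
EXEC sp_executesql
    N'SELECT id FROM [table] WHERE name = @name',
    N'@name nvarchar(50)',
    @name = N'something';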

What is the equivalent Collation type in Oracle for the Latin1_General_BIN collation type in SQL Server?

I am in the process of migrating a SQL Server database to Oracle and would like to know the collation equivalent in Oracle for Latin1_General_BIN.
It would be a great help if someone could help me with the syntax to set collations in Oracle.
Thanks!
Collation refers to how the database stores and sorts data.
SQL Server
Latin1_General = U.S. English character set (code page 1252).
_BIN = Sorts/compares data based on the bit patterns of each character. The sort order is case-sensitive (lowercase precedes uppercase) and accent-sensitive. This is the fastest sort order.
Oracle
Set NLS_LANG to specify the Oracle character set WE8MSWIN1252, which maps to the Windows ANSI code page 1252, and set NLS_SORT to BINARY.
You do not want to choose options with the _CI (case-insensitive) or _AI (accent-insensitive) suffixes.
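A hedged side-by-side sketch, with a hypothetical employees table:
-- SQL Server: force a binary sort
SELECT name FROM employees ORDER BY name COLLATE Latin1_General_BIN;
-- Oracle: the equivalent binary linguistic sort
SELECT name FROM employees ORDER BY NLSSORT(name, 'NLS_SORT = BINARY');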
NLS_LANG is just a client-side parameter.
On the database side there are many NLS_ parameters for:
- Language support
- Territory support
- Linguistic sorting and searching
- Character sets and semantics
You also have two independent NLS_ parameters for the character set in every database: the database character set and the national character set.

T-SQL Unicode "word" definition

I'm new to Unicode in Microsoft SQL Server 2005/2008. I converted my DB to use NVarChar() instead of VarChar(), and found to my surprise that the sorting is different than with VarChar(). I found another reference here on StackOverflow, on SQL sorting and hyphens, which explained that Unicode sorting is done on a "word" basis. After more research, I found the Unicode Consortium site (www.unicode.org), in particular the Unicode Text Segmentation report (www.unicode.org/reports/tr29) that discusses this, and it does mention the hyphen as a special case. (Sorry, as a new user, I couldn't post hyperlinks for these.)
But what I'm trying to define is exactly what the rules are for the different collations, in particular for US English collations. What other special cases are there? For example, is hyphen the only character that's ignored? Or what about other punctuation, like apostrophes?
Any links or pointers will be greatly appreciated.
Don't use a SQL collation; use a Windows one. This is mentioned in the KB article.
From "Windows Collation Sorting Styles":
For Windows collations, the nchar, nvarchar, and ntext Unicode data types have the same sorting behavior as char, varchar, and text non-Unicode data types.
However, you should also consider why you have Unicode at all. In addition to your sorting issues, it's slower (see "varchar vs nvarchar performance"), and even MS agrees.
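To see the word-sort difference yourself, a small sketch (requires SQL Server 2008 or later for the VALUES constructor; the word list is made up):
SELECT w FROM (VALUES (N'co-op'), (N'cook'), (N'coop')) AS t(w)
ORDER BY w COLLATE SQL_Latin1_General_CP1_CI_AS;  -- SQL collation: hyphen sorts as an ordinary character
SELECT w FROM (VALUES (N'co-op'), (N'cook'), (N'coop')) AS t(w)
ORDER BY w COLLATE Latin1_General_CI_AS;          -- Windows collation: word sort, hyphen gets minimal weight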

Why is sql server storing question mark characters instead of Japanese characters in NVarchar fields?

I'm trying to store Japanese characters in nvarchar fields in my SQL Server 2000 database.
When I run an update statement like:
update blah
set address = N'スタンダードチャ'
where key_ID = 1
from SQL Server Management Studio, and then run a SELECT statement, I see only question marks in the results window. I'm seeing the same question marks in the webpage that reads from the database.
It seems this is an issue with storing the data properly, right? Can anyone tell me what I need to do differently?
This may not be the answer given your example, but the most common cause I've seen is a string literal missing the Unicode N prefix.
So, instead of
set address = N'スタンダードチャ'
one would try to write to an nvarchar field without the Unicode prefix:
set address = 'スタンダードチャ'
See also:
N prefix before string in Transact-SQL query
I was facing this same issue when storing Indian-language characters in nvarchar fields in the DB. Then I went through this Microsoft article:
http://support.microsoft.com/kb/239530
I followed it and my Unicode issue was resolved. In the article they say: you must precede all Unicode strings with the prefix N when you deal with Unicode string constants in SQL Server.
SQL Server Unicode Support
SQL Server Unicode data types support UCS-2 encoding. Unicode data types store character data using two bytes for each character rather than one byte. There are 65,536 different bit patterns in two bytes, so Unicode can use one standard set of bit patterns to encode each character in all languages, including languages such as Chinese that have large numbers of characters.
In SQL Server, data types that support Unicode data are:
nchar
nvarchar
nvarchar(max) – new in SQL Server 2005
ntext
Use of nchar, nvarchar, nvarchar(max), and ntext is the same as char, varchar, varchar(max), and text, respectively, except:
- Unicode supports a wider range of characters.
- More space is needed to store Unicode characters.
- The maximum size of nchar and nvarchar columns is 4,000 characters, not 8,000 characters like char and varchar.
- Unicode constants are specified with a leading N, for example, N'A Unicode string'
APPLIES TO
Microsoft SQL Server 7.0 Standard Edition
Microsoft SQL Server 2000 Standard Edition
Microsoft SQL Server 2005 Standard Edition
Microsoft SQL Server 2005 Express Edition
Microsoft SQL Server 2005 Developer Edition
Microsoft SQL Server 2005 Enterprise Edition
Microsoft SQL Server 2005 Workgroup Edition
The code is absolutely fine. You can insert a Unicode string with the N prefix into a field declared as NVARCHAR. So check whether address is an NVARCHAR column. I tested the code below on SQL Server 2008 R2 and it worked.
update blah
set address = N'スタンダードチャ'
where key_ID = 1
You need to write N before the string value, e.g.:
INSERT INTO LabelManagement (KeyValue) VALUES
(N'変更命令');
Here I am storing a value in the Japanese language, and I have added N before the string literal.
I am using SQL Server 2014.
Hope you find the solution.
Enjoy.
You need to check the globalisation settings of all the code that deals with this data, from your database, data access and presentation layers. This includes SSMS.
SSMS may not display the characters correctly; you might see question marks or boxes. Paste the results into Word and they should appear in Japanese.
In the webpage you need to set the Content-Type; the line below will display Chinese Big5:
<META HTTP-EQUIV="content-type" CONTENT="text/html; charset=big5">
To verify the data you can't use the ASCII() function, since it only covers the ASCII character set. Run this:
select unicode(address),ascii(address)
from blah where key_ID = 1
Output should be the following (it only looks at the first character)
12473 63
I can almost guarantee that the data type is not Unicode. If you want to learn more, you can check Wikipedia for information on Unicode, ASCII, and ANSI. Unicode can store more unique characters, but takes more space to store, transfer, and process, and some programs don't support it. The Unicode data types in MS SQL are nchar, nvarchar, and ntext.
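To confirm what the column actually is, something like this sketch should do (table and column names from the question):
SELECT DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'blah' AND COLUMN_NAME = 'address';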
We are using Microsoft SQL Server 2008 R2 (SP3). Our table collation is specified as SQL_Latin1_General_CP1_CI_AS. I have my types specified as the N variety:
nvarchar(MAX)
nchar(2)
..etc
To insert Japanese characters I prefix the string with a capital N
N'素晴らしい一日を'
Works like a charm.

When must we use NVARCHAR/NCHAR instead of VARCHAR/CHAR in SQL Server?

Is there a rule when we must use the Unicode types?
I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns.
I am looking for something like:
If you have Chinese --> use NVARCHAR
If you have German and Arabic --> use NVARCHAR
What about the collation of the server/database?
I don't want to always use NVARCHAR, as suggested here:
What are the main performance differences between varchar and nvarchar SQL Server data types?
The real reasons to use NVARCHAR are when you have different languages in the same column, you need to address the column in T-SQL without decoding, you want to be able to see the data "natively" in SSMS, or you want to standardize on Unicode.
If you treat the database as dumb storage, it is perfectly possible to store wide strings and different (even variable-length) encodings in VARCHAR (for instance UTF-8). The problem comes when you are attempting to encode and decode, especially if the code page is different for different rows. It also means that the SQL Server will not be able to deal with the data easily for purposes of querying within T-SQL on (potentially variably) encoded columns.
Using NVARCHAR avoids all this.
I would recommend NVARCHAR for any column which will have user-entered data in it which is relatively unconstrained.
I would recommend VARCHAR for any column which is a natural key (like a vehicle license plate, SSN, serial number, service tag, order number, airport callsign, etc) which is typically defined and constrained by a standard or legislation or convention. Also VARCHAR for user-entered, and very constrained (like a phone number) or a code (ACTIVE/CLOSED, Y/N, M/F, M/S/D/W, etc). There is absolutely no reason to use NVARCHAR for those.
So for a simple rule:
VARCHAR when guaranteed to be constrained
NVARCHAR otherwise
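As a hypothetical illustration of that rule:
CREATE TABLE Vehicles (
    license_plate varchar(10),    -- constrained by legislation: varchar is fine
    serial_number varchar(30),    -- constrained by a standard: varchar is fine
    owner_name    nvarchar(100),  -- unconstrained user input: nvarchar
    notes         nvarchar(max)   -- unconstrained user input: nvarchar
);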
Both of the two most upvoted answers are wrong. It should have nothing to do with "storing different/multiple languages". You can support Spanish characters like ñ alongside English with just a common varchar field and the Latin1_General_CI_AS COLLATION, for example.
Short Version
You should use NVARCHAR/NCHAR whenever the ENCODING, which is determined by the COLLATION of the field, doesn't support the characters you need.
Also, depending on the SQL Server version, you can use specific COLLATIONs, like Latin1_General_100_CI_AS_SC_UTF8, which is available since SQL Server 2019. Setting this collation on a VARCHAR field (or an entire table/database) will use the UTF-8 ENCODING for storing and handling the data in that field, allowing full support for UNICODE characters, and hence any language covered by it.
To FULLY UNDERSTAND:
To fully understand what I'm about to explain, it's mandatory to have the concepts of UNICODE, ENCODING and COLLATION all extremely clear in your head. If you don't, then first take a look at my humble and simplified explanation in the "What is UNICODE, ENCODING, COLLATION and UTF-8, and how they are related" section below, and at the supplied documentation links. Also, everything I say here is specific to Microsoft SQL Server, and how it stores and handles data in char/nchar and varchar/nvarchar fields.
Let's say we want to store a particular piece of text in our MSSQL Server database. It could be an Instagram comment such as "I love stackoverflow! 😍".
The plain-English part would be perfectly supported even by ASCII, but since there is also an emoji, which is a character specified in the UNICODE standard, we need an ENCODING that supports this Unicode character.
MSSQL Server uses the COLLATION to determine what ENCODING is used on char/nchar/varchar/nvarchar fields. So, contrary to what many think, COLLATION is not only about sorting and comparing data, but also about ENCODING, and consequently about how our data is stored!
SO, HOW DO WE KNOW WHICH ENCODING OUR COLLATION USES? With this:
SELECT COLLATIONPROPERTY( 'Latin1_General_CI_AI' , 'CodePage' ) AS [CodePage]
--returns 1252
This simple SQL returns the Windows code page for a COLLATION. A Windows code page is nothing more than another mapping to ENCODINGs. For the Latin1_General_CI_AI COLLATION it returns Windows code page 1252, which maps to the Windows-1252 ENCODING.
So, for a varchar column with the Latin1_General_CI_AI COLLATION, the field will handle its data using the Windows-1252 ENCODING, and will only correctly store characters supported by that encoding.
If we check the character list for the Windows-1252 ENCODING, we will find that this encoding does not support our emoji character. If we try it anyway, the emoji degrades to question marks.
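The original answer demonstrated this with a screenshot; a hedged reconstruction of the experiment:
DECLARE @comment varchar(100) = 'I love stackoverflow! 😍';
SELECT @comment;  -- under a Windows-1252 collation the emoji comes back as question marks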
OK, SO HOW CAN WE SOLVE THIS?? Actually, it depends, and that is GOOD!
NCHAR/NVARCHAR
Before SQL Server 2019, all we had were NCHAR and NVARCHAR fields. Some say they are UNICODE fields. THAT IS WRONG! Again, it depends on the field's COLLATION and also on the SQL Server version.
Microsoft's "nchar and nvarchar (Transact-SQL)" documentation specifies perfectly:
Starting with SQL Server 2012 (11.x), when a Supplementary Character (SC) enabled collation is used, these data types store the full range of Unicode character data and use the UTF-16 character encoding. If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.
In other words, if we use a SQL Server older than 2012, like SQL Server 2008 R2 for example, those fields will use the UCS-2 ENCODING, which supports only a subset of UNICODE. But if we use SQL Server 2012 or newer, and define a COLLATION that has Supplementary Characters enabled, then our field will use the UTF-16 ENCODING, which fully supports UNICODE.
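A small sketch of the SC behavior (assumes SQL Server 2012 or newer; the variable is made up):
DECLARE @s nvarchar(10) = N'😍';
SELECT LEN(@s) AS len_default,  -- typically 2 under a non-SC collation: the surrogate pair counts as two
       LEN(@s COLLATE Latin1_General_100_CI_AS_SC) AS len_sc;  -- 1: the pair counts as one character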
BUT WAIT, THERE IS MORE! WE CAN USE UTF-8 NOW!!
CHAR/VARCHAR
Starting with SQL Server 2019, WE CAN USE CHAR/VARCHAR fields and still fully support UNICODE using UTF-8 ENCODING!!!
From Microsoft's "char and varchar (Transact-SQL)" documentation:
Starting with SQL Server 2019 (15.x), when a UTF-8 enabled collation is used, these data types store the full range of Unicode character data and use the UTF-8 character encoding. If a non-UTF-8 collation is specified, then these data types store only a subset of characters supported by the corresponding code page of that collation.
Again, in other words: if we use a SQL Server older than 2019, like SQL Server 2008 R2 for example, we need to check the ENCODING using the method explained before. But if we use SQL Server 2019 or newer, and define a COLLATION like Latin1_General_100_CI_AS_SC_UTF8, then our field will use the UTF-8 ENCODING, which is by far the most widely used and most efficient encoding that supports all UNICODE characters.
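A hypothetical sketch on SQL Server 2019 or newer:
CREATE TABLE #utf8_demo (
    c varchar(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT INTO #utf8_demo VALUES (N'I ❤ Ω');  -- fits in varchar because the collation makes it UTF-8
SELECT c, DATALENGTH(c) AS bytes FROM #utf8_demo;  -- characters take 1 to 3 bytes each here
DROP TABLE #utf8_demo;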
Bonus Information:
Regarding the OP's observation on "I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns", I think it's nice to know why it is:
For the most common COLLATIONs, like the default ones such as Latin1_General_CI_AI or SQL_Latin1_General_CP1_CI_AS, the ENCODING will be Windows-1252 for varchar fields. If we take a look at its documentation, we can see that it supports: English, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish; plus German, Finnish and French; and Dutch except the IJ character.
But as I said before, it's not about language, it's about which characters you expect to support/store, as shown in the emoji example, or in a sentence like "The electric resistance of a lithium battery is 0.5Ω", where we again have plain English plus the Greek letter "omega" (the symbol for resistance in ohms), which won't be handled correctly by the Windows-1252 ENCODING.
Conclusion:
So, there it is! When to use char/nchar and varchar/nvarchar depends on the characters that you want to support, and also on the version of your SQL Server, which determines the COLLATIONs and hence the ENCODINGs you have available.
What is UNICODE, ENCODING, COLLATION and UTF-8, and how they are related
Note: all the explanations below are simplifications. Please, refer to the supplied documentation links to know all the details about those concepts.
UNICODE - A standard, a convention, that aims to catalog all characters in one unified and organized table. In this table, every character has a unique number, commonly called the character's code point. UNICODE IS NOT AN ENCODING!
ENCODING - A mapping between a character and a byte sequence. An encoding is used to "transform" a character into bytes and, the other way around, bytes back into a character. Among the most popular ones are UTF-8, ISO-8859-1, Windows-1252 and ASCII. You can think of it as a "conversion table" (really simplified here).
COLLATION - This one is important. Even Microsoft's documentation doesn't make this as clear as it should be. A collation specifies how your data is sorted, compared, AND STORED! Yeah, I bet you weren't expecting that last one, right!? Collations in SQL Server also determine the ENCODING used on a particular char/nchar/varchar/nvarchar field.
ASCII ENCODING - One of the first encodings. It is both a character table (like its own tiny version of UNICODE) and a byte mapping. So it doesn't map bytes to UNICODE, but to its own character table. It uses only 7 bits and supports 128 different characters, which was enough for all English letters in upper and lower case, digits, punctuation and a limited number of other characters. The problem with ASCII is that since it used only 7 bits and almost every computer was 8-bit at the time, there were another 128 character possibilities to be "explored", and everybody started mapping these "available" byte values to their own character tables, creating a lot of different ENCODINGs.
UTF-8 ENCODING - Another ENCODING, and one of the most used (if not the most used) around. It uses a variable byte width (a character can be 1 to 4 bytes long under the current specification) and fully supports all UNICODE characters.
Windows-1252 ENCODING - Also one of the most used ENCODINGs, and widely used in SQL Server. It's fixed-size, so every character is always exactly 1 byte. It supports many accented characters from various languages, but not all that exist, and it does not cover UNICODE. That's why your varchar field with a common collation like Latin1_General_CI_AS supports á, é, ñ characters, even though it isn't using a Unicode-capable ENCODING.
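A small hedged demonstration of the storage difference (assumes a Latin1 database collation for the varchar cast):
SELECT DATALENGTH(CAST('é' AS varchar(10)))   AS win1252_bytes,  -- 1: single-byte code page
       DATALENGTH(CAST(N'é' AS nvarchar(10))) AS utf16_bytes;    -- 2: one UTF-16 code unit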
Resources:
https://blog.greglow.com/2019/07/25/sql-think-that-varchar-characters-if-so-think-again/
https://medium.com/@apiltamang/unicode-utf-8-and-ascii-encodings-made-easy-5bfbe3a1c45a
https://www.johndcook.com/blog/2019/09/09/how-utf-8-works/
https://www.w3.org/International/questions/qa-what-is-encoding
https://en.wikipedia.org/wiki/List_of_Unicode_characters
https://www.fileformat.info/info/charset/windows-1252/list.htm
https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/data-types/nchar-and-nvarchar-transact-sql?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/windows-collation-name-transact-sql?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/sql-server-collation-name-transact-sql?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver15#SQL-collations
SQL Server default character encoding
https://en.wikipedia.org/wiki/Windows_code_page
You should use NVARCHAR any time you have to store multiple languages. I believe you have to use it for the Asian languages, but don't quote me on it.
Here's the problem: if you take Russian, for example, and store it in a varchar, you will be fine as long as you define the correct code page. But say you're using a default English SQL install; then the Russian characters will not be handled correctly. If you were using NVARCHAR(), they would be handled properly.
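A hypothetical demonstration of that point:
CREATE TABLE #ru (
    good varchar(20) COLLATE Cyrillic_General_CI_AS,       -- code page 1251: contains Cyrillic
    bad  varchar(20) COLLATE SQL_Latin1_General_CP1_CI_AS  -- code page 1252: no Cyrillic
);
INSERT INTO #ru VALUES (N'Привет', N'Привет');
SELECT good, bad FROM #ru;  -- good keeps the text; bad degrades to '??????'
DROP TABLE #ru;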
Edit
OK, let me quote MSDN. Maybe I was too specific, but the point stands: you don't want to store more than one code page in a varchar column; while you can, you shouldn't:
When you deal with text data that is stored in the char, varchar, varchar(max), or text data type, the most important limitation to consider is that only information from a single code page can be validated by the system. (You can store data from multiple code pages, but this is not recommended.) The exact code page used to validate and store the data depends on the collation of the column. If a column-level collation has not been defined, the collation of the database is used. To determine the code page that is used for a given column, you can use the COLLATIONPROPERTY function, as shown in the following code examples:
Here's some more:
This example illustrates the fact that many locales, such as Georgian and Hindi, do not have code pages, as they are Unicode-only collations. Those collations are not appropriate for columns that use the char, varchar, or text data type.
So Georgian or Hindi really need to be stored as nvarchar. Arabic is also a problem:
Another problem you might encounter is the inability to store data when not all of the characters you wish to support are contained in the code page. In many cases, Windows considers a particular code page to be a "best fit" code page, which means there is no guarantee that you can rely on the code page to handle all text; it is merely the best one available. An example of this is the Arabic script: it supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language as defined in Windows code page 1256. If you attempt to store these extra characters in a non-Unicode column that has the Arabic collation, the characters are converted into question marks.
Something to keep in mind when you are using Unicode: although you can store different languages in a single column, you can only sort using a single collation. There are some languages that use Latin characters but do not sort like other Latin languages. Accents are a good example of this; I can't remember the exact case, but there was an Eastern European language whose Y didn't sort like the English Y. Then there is the Spanish "ch", which Spanish users expect to be sorted after "h".
All in all, given all the issues you have to deal with in internationalization, it is my opinion that it is easier to just use Unicode characters from the start, avoid the extra conversions, and take the space hit. Hence my statement earlier.
Greek would need Unicode, i.e. the N column types: αβγ ;)
Josh says:
"....Something to keep in mind when you are using Unicode although you can store different languages in a single column you can only sort using a single collation. There are some languages that use latin characters but do not sort like other latin languages. Accents is a good example of this, I can't remeber the example but there was a eastern european language whose Y didn't sort like the English Y. Then there is the spanish ch which spanish users expet to be sorted after h."
I'm a native Spanish speaker, and "ch" is not a letter but two, "c" and "h". The Spanish alphabet is:
abcdefghijklmn ñ opqrstuvwxyz
We don't expect "ch" after "h"; "i" comes after "h".
The alphabet is the same as in English except for the ñ, or "&ntilde;" in HTML.
Alex
TL;DR;
Unicode - (nchar, nvarchar, and ntext)
Non-unicode - (char, varchar, and text).
From MSDN:
Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collations that are used with character data types such as char and varchar dictate the code page and corresponding characters that can be represented for that data type.
Assuming you are using the default SQL collation SQL_Latin1_General_CP1_CI_AS, the following script prints out all the symbols you can fit in VARCHAR, since it uses one byte to store one character (256 total). If a character you need is not on the printed list, you need NVARCHAR.
declare @i int = 0;
while (@i < 256)
begin
print cast(@i as varchar(3)) + ' ' + char(@i) collate SQL_Latin1_General_CP1_CI_AS
print cast(@i as varchar(3)) + ' ' + char(@i) collate Japanese_90_CI_AS
set @i = @i + 1;
end
If you change the collation to, let's say, Japanese, you will notice that all the "weird" European letters turn into normal ones and some symbols turn into ? marks.
Unicode is a standard for mapping code points to characters. Because it is designed to cover all the characters of all the languages of the world, there is no need for different code pages to handle different sets of characters. If you store character data that reflects multiple languages, always use Unicode data types (nchar, nvarchar, and ntext) instead of the non-Unicode data types (char, varchar, and text).
Otherwise your sorting will go weird.
If anyone is facing this issue in MySQL: there is no need to change varchar to nvarchar; you can just change the collation of the column to utf8.
