Text encodings in .NET, SQL Server processing

I have an application that reads terms from a DB and runs them as a list of string terms. The column in the DB table was set up as nvarchar so that it can hold all foreign characters. In some cases this works: characters like ä come through intact when I read the terms from the DB, and they display correctly in the table too.
But when I import Japanese or Arabic characters, all I see is ????????.
I have tried converting the text in different ways: first by converting it to UTF-8 encoding and back, and second by using HttpUtility.HtmlEncode. The latter works perfectly for these characters, but it also converts quotes and other things that I don't need converted.
I told the DB designer that he needs to fix something on his side. Am I wrong to expect that the DB should display all these characters, so that I can simply query it and add the results to my search list? If not, is there a consistent way of getting all international characters to display correctly in SQL and VB.NET?
When I read from text files, I just use the Microsoft.VisualBasic.FileIO.TextFieldParser reader with the encoding set to UTF-8, and this is not an issue.

If the database field is nvarchar, then it will store the data correctly, as you have seen.
Somewhere before it gets to the database, the data is being lost or converted to varchar: a stored procedure, parameters, file encoding, ODBC translation, etc.
DECLARE @foo nvarchar(100), @foo2 varchar(100)
--with arabic and japanese and proper N literal
SELECT @foo = N'العربي 日本語', @foo2 = N'العربي 日本語'
SELECT @foo, @foo2 --gives العربي 日本語 for @foo, ?????? ??? for @foo2 (varchar)
--now a varchar literal
SELECT @foo = 'العربي 日本語', @foo2 = 'العربي 日本語'
SELECT @foo, @foo2 --gives ?????? ??? for both
--from my Swiss German keyboard. These are part of my code page.
SELECT @foo = 'öéäàüè', @foo2 = 'öéäàüè'
SELECT @foo, @foo2 --gives öéäàüè for both
So, apologise to the nice DB monkey... :-)
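The question marks in the comments above can be reproduced outside SQL Server. The following is a minimal Python sketch, assuming that a conversion to varchar behaves like encoding to Windows code page 1252 with unmappable characters replaced by `?`. The helper name `to_varchar` is hypothetical, and SQL Server's own conversion may differ in details such as best-fit mappings.

```python
def to_varchar(text: str, codepage: str = "cp1252") -> str:
    """Model a Unicode -> code page -> Unicode round trip.

    Characters missing from the code page become '?', which is the
    symptom described in the question. (Assumption: cp1252 stands in
    for the database's default code page.)
    """
    return text.encode(codepage, errors="replace").decode(codepage)


# Escapes are used so the exact code points are unambiguous.
arabic_japanese = "\u0627\u0644\u0639\u0631\u0628\u064a \u65e5\u672c\u8a9e"  # العربي 日本語
swiss_german = "öéäàüè"

print(to_varchar(arabic_japanese))  # ?????? ???
print(to_varchar(swiss_german))     # öéäàüè
```

The point of the sketch is that the loss is permanent: once a character has been replaced by `?`, converting back to UTF-8 or any other encoding in the application cannot recover it.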

Always use NVARCHAR or NTEXT to store foreign characters.
You cannot store arbitrary Unicode text in the varchar or text data types; only characters in the column's code page survive.
Also put an N before the string value,
like
UPDATE [USER]
SET Name = N'日本語'
WHERE ID = XXXX;

Related

UTF-8 characters get saved as ?? on insert, but gets saved correctly on update

I have a table on MS SQL Server with an nvarchar column. I am saving a UTF-8 character using an insert statement. It gets saved as ???. If I update the same column with the same value via an update statement, it gets saved correctly.
Any hint on what the issue could be? The collation used is: SQL_Latin1_General_CP1_CI_AS
Show your insert statement. Quite probably there is an N missing:
DECLARE @v NVARCHAR(100)='Some Hindi from Wikipedia मानक हिन्दी';
SELECT @v;
Result: Some Hindi from Wikipedia ???? ??????
SET @v=N'Some Hindi from Wikipedia मानक हिन्दी';
SELECT @v;
Result: Some Hindi from Wikipedia मानक हिन्दी
The N in front of the string literal tells SQL Server to interpret the content as Unicode (to be exact: as UCS-2). Otherwise it is treated as 1-byte-encoded extended ASCII, which cannot represent all characters.
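To make the 1-byte versus 2-byte distinction concrete, here is a small Python sketch. It assumes that an N'...' literal corresponds to UTF-16LE (two bytes per character for the text used here) and that a literal without N is forced through code page 1252. Both choices are stand-ins for illustration, not a trace of what SQL Server does internally.

```python
text = "मानक हिन्दी"

# N'...' literal: modelled as UTF-16LE, 2 bytes per code unit.
as_unicode = text.encode("utf-16-le")

# Literal without N: modelled as a single-byte code page (cp1252 assumed).
# Characters the code page lacks are replaced by '?'.
as_codepage = text.encode("cp1252", errors="replace")

print(len(text))                     # number of code points in the text
print(len(as_unicode))               # twice as many bytes
print(len(as_codepage))              # one byte per code point
print(as_codepage.decode("cp1252"))  # each Devanagari code point became '?'
```

Each code point, including combining vowel signs, turns into its own `?`, which is why the damaged result has the same length as the original text.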

Central european characters in SQL

I have an issue. I have data stored on SQL Server with Central European characters like "č", "ř", "ž" etc. The database uses the "Czech_CI_AS" collation, which should accept these characters. But when I try to select, for example, the name of a street containing these characters like this:
SELECT *
FROM Street where Name = 'Čáslavská'
It returns nothing.
When I remove the "Č" and use a wildcard instead, it returns what I need:
SELECT *
FROM Street where Name like '%áslavská'
The column is of type nvarchar. But I cannot put the N prefix before my string, because external applications read this table and their selects are generated automatically.
Is there any solution? Or have I got something wrong?
Thanks for any help.
@YuriyTsarkov really deserves the credit here. To elaborate on his answer:
From MSDN:
Prefix Unicode character string constants with the letter N. Without the N prefix, the string is converted to the default code page of the database. This default code page may not recognize certain characters.
Example
-- Storing Čáslavská in two vars, with and without N prefix.
DECLARE @Test_001 NVARCHAR(255) = 'Čáslavská' COLLATE Czech_CI_AS;
DECLARE @Test_002 NVARCHAR(255) = N'Čáslavská' COLLATE Czech_CI_AS;
-- Test output.
SELECT
@Test_001 AS T1,
@Test_002 AS T2
;
Returns
T1 T2
Cáslavská Čáslavská
You need to update all your external applications' code to use selects with the N prefix, or you need to change the collation of your column to the same one used by the external applications. The latter may cause some data loss.
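Which characters are at risk depends on the code page the literal is converted to. The Python sketch below is a rough diagnostic under two assumptions: that a Latin1 default corresponds to code page 1252 and that Czech collations correspond to code page 1250. The helper `unmappable` is a hypothetical name. Note that Python simply refuses to encode a missing character, whereas the SQL Server output above shows a best-fit substitution (C for Č).

```python
def unmappable(text: str, codepage: str) -> list:
    """Return the characters of `text` that `codepage` cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


street = "\u010c\u00e1slavsk\u00e1"  # Čáslavská, precomposed characters

print(unmappable(street, "cp1252"))  # ['Č']
print(unmappable(street, "cp1250"))  # []
```

This matches the symptom in the question: only the "Č" is affected, so a search that avoids it still finds the row.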

Why can I store a Ukrainian string in a varchar column?

I was a little surprised that I was able to store a Ukrainian string in a varchar column.
My table is:
create table delete_collation
(
text1 varchar(100) collate SQL_Ukrainian_CP1251_CI_AS
)
and using this query I am able to insert:
insert into delete_collation
values(N'використовується для вирішення квитки')
but when I remove the 'N', the select statement shows ??????.
Is this okay, or am I missing something in my understanding of Unicode and non-Unicode with collate?
From MSDN:
Prefix Unicode character string constants with the letter N. Without
the N prefix, the string is converted to the default code page of the
database. This default code page may not recognize certain characters.
UPDATE:
Please see these similar questions:
What is the meaning of the prefix N in T-SQL statements?
Cyrillic symbols in SQL code are not correctly after insert
sql server 2012 express do not understand Russian letters
To expand on MegaTron's answer:
Using collate SQL_Ukrainian_CP1251_CI_AS, SQL Server is able to store Ukrainian characters in a varchar column by using code page 1251.
However, when you specify a string without the N prefix, that string is converted to the default non-Unicode code page before it is sent to the database, and that is why you see ??????.
So it is completely fine to use varchar and collate as you do, but you must always include the N prefix when sending strings to the database, to avoid the intermediate conversion to the default (non-Ukrainian) code page.
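The two paths described above can be sketched in Python. This is an illustration only: it assumes the default code page is 1252 (a non-Cyrillic stand-in) and models the column as code page 1251, which is what the collation name indicates.

```python
text = "використовується для вирішення квитки"

# With the N prefix: the literal stays Unicode until it reaches the
# column, where it is converted to the column's code page (1251).
stored_with_n = text.encode("cp1251")

# Without N: modelled as a first pass through a non-Cyrillic default
# code page (cp1252 is an assumption), which destroys the letters
# before the column's code page is ever involved.
intermediate = text.encode("cp1252", errors="replace").decode("cp1252")
stored_without_n = intermediate.encode("cp1251")

print(stored_with_n.decode("cp1251"))     # original text, intact
print(stored_without_n.decode("cp1251"))  # Cyrillic letters replaced by '?'
```

The varchar column itself is not the problem here. The damage happens in the intermediate step, which is why the N prefix matters even for a non-Unicode column.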

How to Show Eastern Letter(Chinese Character) on SQL Server/SQL Reporting Services?

I need to insert Chinese characters into my database, but they always show as ????.
Example:
Insert this record.
微波室外单元-Apple
Then it became ???
Result:
??????-Apple
I really need help. Thanks in advance.
I am using MSSQL Server 2008
Make sure you specify a Unicode string with a capital N when you insert, like:
INSERT INTO Table1 (Col1) SELECT N'微波室外单元-Apple' AS [Col1]
and that Table1 (Col1) is an NVARCHAR data type.
Make sure the column you're inserting to is nchar, nvarchar, or ntext. If you insert a Unicode string into an ANSI column, you really will get question marks in the data.
Also, be careful to check that when you pull the data back out you're not just seeing a client display problem but are actually getting the question marks back:
SELECT Unicode(YourColumn), YourColumn FROM YourTable
Note that the Unicode function returns the code of only the first character in the string.
Once you've determined whether the column is really storing the data correctly, post back and we'll help you more.
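The idea behind that check can be illustrated in Python, where `ord` plays roughly the role of T-SQL's UNICODE() for a non-empty string. The sketch assumes the damage came from a conversion through code page 1252; `first_code` is a hypothetical helper name.

```python
def first_code(value: str) -> int:
    """Rough analogue of T-SQL UNICODE(): code point of the first character."""
    return ord(value[0])


original = "微波室外单元-Apple"
# Simulate a value that lost its characters on the way into the table.
damaged = original.encode("cp1252", errors="replace").decode("cp1252")

print(first_code(original))  # a large number: the real character is stored
print(first_code(damaged))   # 63, the code of a literal '?'
```

If the query returns 63 for rows that should start with a Chinese character, the question marks are really in the data. If it returns a large code point, the data is fine and the problem is on the display side.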
Try adding the appropriate languages to your Windows locale settings. You'll have to make sure your development machine is set to display non-Unicode characters in the appropriate language.
And of course you need to use nvarchar for foreign-language fields.
Make sure that you have set the encoding for the database to one that supports these characters. UTF-8 is the de facto standard encoding, as it is ASCII-compatible but supports all Unicode code points up to U+10FFFF (decimal 1,114,111).
-- Oracle syntax (UNISTR, ASCIISTR, ||): generates an UPDATE statement with Unicode escapes
SELECT 'UPDATE table SET msg=UNISTR('''||ASCIISTR(msg)||''') WHERE id='''||id||'''' FROM table WHERE id = '123344556';

SQL Server Text Datatype Maxlength = 65,535?

Software I'm working with uses a text field to store XML. From my searches online, the text datatype is supposed to hold 2^31 - 1 characters. Currently SQL Server is truncating the XML at 65,535 characters every time. I know this is caused by SQL Server, because if I add a 65,536th character to the column directly in Management Studio, it states that it will not update because characters will be truncated.
Is the max length really 65,535 or could this be because the database was designed in an earlier version of SQL Server (2000) and it's using the legacy text datatype instead of 2005's?
If this is the case, will altering the datatype to Text in SQL Server 2005 fix this issue?
That is a limitation of SSMS, not of the text field, but you should use varchar(max) since text is deprecated.
Here is a quick test:
create table TestLen (bla text)
insert TestLen values (replicate(convert(varchar(max),'a'), 100000))
select datalength(bla)
from TestLen
Returns 100000 for me
MSSQL 2000 should allow up to 2^31 - 1 characters (non-Unicode) in a text field, which is over 2 billion. I don't know what's causing this limitation, but you might want to try varchar(max) or nvarchar(max). These store as many characters but also allow the regular string T-SQL functions (like LEN, SUBSTRING, REPLACE, RTRIM, ...).
If you're able to convert the column, you might as well, since the text data type will be removed in a future version of SQL Server. See here.
The recommendation is to use varchar(MAX) or nvarchar(MAX). In your case, you could also use the XML data type, but that may tie you to certain database engines (if that's a consideration).
You should have a look at:
XML Support in Microsoft SQL Server 2005
Beginning SQL Server 2005 XML Programming
So I would rather use the data type appropriate for the use than make a data type from a previous version fit your use.
Here's a little script I wrote for getting all the data out in chunks:
-- Return a long NVARCHAR(MAX) value in chunks of 65,535 characters.
DECLARE @data NVARCHAR(MAX) = N'huge data';
DECLARE @dataLength INT = LEN(@data);
DECLARE @currIndex INT = 1; -- SUBSTRING positions are 1-based
WHILE @currIndex <= @dataLength
BEGIN
    -- Each iteration returns the next chunk as its own result set.
    SELECT SUBSTRING(@data, @currIndex, 65535) AS Chunk;
    SET @currIndex += 65535;
END;

Resources