SQL: Update Unicode Data in column to Accented characters - sql-server

I have a column which was set to varchar, in a database with the SQL_Latin1_General_CP1_CI_AS collation.
When a user entered their name into our web front end and saved the data, accented characters were not being saved correctly.
The web user was entering "Béala", but this was being saved in the database as "BÃ©ala".
I believe that changing the column from varchar to nvarchar should prevent this from happening going forward(?); however, I have two questions.
1) How do I perform a select on the existing data in the column and display it correctly?
select CONVERT(NVARCHAR(100),strAddress1) from [dbo].[tblCustomer]
This still shows the data incorrectly.
2) How do I update the data in the column once converted to NVarchar to save the accented characters correctly?
Many thanks,
Ray.

The only idea that comes to my mind is that you have to prepare an UPDATE that un-mangles this badly loaded data: a character such as 'é' will always have been stored as exactly one fixed sequence (in this case 'Ã©'), so you have to catch all the special characters that have been changed and map them back (just a simple UPDATE with CASE and REPLACE). Of course, the column must first be of the nvarchar type.
This solves both problems 1 and 2 (the data will be correct in the table and will display correctly; the update is described above).
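This kind of mangling (UTF-8 bytes read back through the Windows-1252 code page) is deterministic, which is what makes a REPLACE-based repair workable. A minimal Python sketch, outside the database, showing that each accented character always surfaces as the same fixed sequence and that the damage is reversible:

```python
# Mojibake: UTF-8 bytes stored in a varchar column are read back
# through the Windows-1252 code page, so 'é' (UTF-8 0xC3 0xA9)
# always surfaces as the same two-character sequence 'Ã©'.

def mangle(text: str) -> str:
    """Simulate the faulty round trip: UTF-8 bytes decoded as cp1252."""
    return text.encode("utf-8").decode("cp1252")

def repair(text: str) -> str:
    """Reverse it: re-encode as cp1252, then decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")

broken = mangle("Béala")
print(broken)          # 'BÃ©ala' -- what the varchar column holds
print(repair(broken))  # 'Béala'  -- the original restored
```

The same pairs (e.g. 'Ã©' → 'é', 'Ã¨' → 'è') are what you would feed into the REPLACE calls in the UPDATE. Note that a handful of byte values are undefined in cp1252, so the reversal is not guaranteed for every possible input.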

Here is a way to get it into a normal character scheme:
select 'Réunion', cast('Réunion' as varchar(100)) COLLATE SQL_Latin1_General_CP1253_CI_AI
Moreover, to check all possible collations in SQL Server, you can try this query:
SELECT name, description
FROM sys.fn_helpcollations();

Related

How to see Arabic characters entered into SQL Server instead of ?????

I have a client who inserted some Arabic text into SQL Server database column. Now the text displays as ????? only.
I know that this is related to collation and that he should have modified the collation of SQL Server before entering his data. Also I know that he should have used nVarchar, instead of Varchar in his column.
How can I retrieve the data entered in his column or convert it into Arabic from ???? since the data is already entered and we need to convert it.
Thanks in advance.
Here are some basic rules which you may want to keep in mind. While storing Unicode data, the column must be of Unicode data type (nchar, nvarchar, ntext). Another rule is that the value must be prefixed with N while insertion like the below.
INSERT INTO TBL_LANG VALUES ('English',N'I am American')
INSERT INTO TBL_LANG VALUES ('Tamil',N'நான் தமிழன்')
If the data is inserted in the correct manner, then you will get the correct response.
As well, have a look at this link: Convert "???? ??????" into Arabic language
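Why the text ends up as ????? can be sketched outside SQL Server. This Python example assumes the database code page is cp1252 (the Latin1 default); any character the code page cannot represent is substituted with '?', and the substitution is one-way:

```python
# Without the N prefix, the literal travels as an 8-bit string in the
# database's code page. Characters outside that code page (all Arabic
# letters, for cp1252) are replaced with '?', and nothing of the
# original remains to convert back.

arabic = "القاهرة"  # "Cairo"
stored = arabic.encode("cp1252", errors="replace").decode("cp1252")
print(stored)  # one '?' per original character
assert stored == "?" * len(arabic)
```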

Can sql server give me a warning if I try and insert unicode without the N prefix

OK, so the support team have once again updated a value in the database, forgot the N prefix, and so replaced it with ?s.
Is there something that can be done on either the database (SQL Server 2012) or SQL Server Management Studio 2012 that can stop or warn people?
And why does the database automagically change the update to ?s? If it's an nvarchar column and I'm passing in Unicode without the N, why not have it error?
This is not an issue with the driver being used to connect to SQL Server. It is simply an implicit conversion happening due to using the wrong datatype in the string literal.

Everything has a type. A number 2 by itself is, by default, an INT, and not a DECIMAL or FLOAT or anything else. A number 2.0 is, by default, a NUMERIC (same as DECIMAL), and not a FLOAT, etc. Strings are no different. A string expressed as 'something' is an 8-bit string, using the code page of the database that the query is running in. If you had used '随机字符中国' in a database set to one of the collations that supports those characters in an 8-bit encoding (it would be a Double-Byte Character Set (DBCS)), then it would not have translated to ?, since the characters would have been in its code page. To see this, create a test database:
CREATE DATABASE [ChineseSimplifiedPinyin] COLLATE Chinese_Simplified_Pinyin_100_CI_AS;
Then, run this:
USE [ChineseSimplifiedPinyin];
SELECT '随机字符中国';
and it will return those characters and not ??????.
And why does the database automagically change the update to ?s, if it's a nvarchar column and I'm passing in Unicode without N why not have it error?
The UPDATE is not being changed. An implicit conversion is happening because you are using the wrong datatype for string literals when not prefixing with the N. This is no different than doing the following:
DECLARE @Test INT;
SET @Test = 2.123;
SELECT @Test;
which returns simply a 2.
Now, it might be possible to set up a Policy to trap implicit conversions, but that would be too far reaching and would likely break lots of stuff. Even if you could narrow it down to implicit conversions going from VARCHAR to NVARCHAR that would still break code that would otherwise work in the current situation: inserting 'bob' into an NVARCHAR field would be an implicit conversion yet there would be no data loss. And you can't trap any of this in a Trigger because that is after-the-fact of receiving the implicitly converted data.
The best way to ensure nobody forgets to insert or update without the N prefix is to create a web app or console app that would be an interface for this (which is probably a good idea anyway since that will also prevent someone from using the wrong WHERE clause or forgetting to use one altogether, both of which do happen). Creating a small .NET web or console app is pretty easy and .NET strings are all Unicode (UTF-16 Little Endian). Then the app takes the data and submits the INSERT or UPDATE statement. Be sure to use a parameter and not dynamic SQL.
Given that the ? character is valid in this field, if it can be determined that multiple ?s would never naturally occur, then you can probably prevent this issue on cases involving more than a single character getting converted by creating an INSERT, UPDATE Trigger that cancels the operation if multiple ?s in a row are present. Using a Trigger as opposed to a Check Constraint allows for a little more control, especially over the error message:
CREATE TRIGGER tr_PreventLosingUnicodeCharacters
ON SchemaName.TableName
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    IF (EXISTS (SELECT *
                FROM INSERTED ins
                WHERE ins.column1 LIKE N'%??%')
       )
    BEGIN
        ROLLBACK; -- cancel the INSERT or UPDATE operation
        DECLARE @Message NVARCHAR(1000);
        SET @Message =
              N'INSERT or UPDATE of [column1] without "N" prefix results in data loss. '
            + NCHAR(13) + NCHAR(10)
            + N'Please try again using N''string'' instead of just ''string''.';
        RAISERROR(@Message, 16, 1);
        RETURN;
    END;
END;
And if 2 ?s can naturally happen, then do the search for ??? and then it is only 1 or 2 character items that might slip by. In either case, this should catch enough erroneous entries so that you only need to fix things on rare occasions (hopefully :).
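The trigger's heuristic (a run of consecutive ?s signals mangled Unicode; a lone ? is probably legitimate) can be sketched outside SQL; `looks_mangled` is a hypothetical name for illustration:

```python
import re

def looks_mangled(value: str, min_run: int = 2) -> bool:
    """Flag a run of min_run or more consecutive '?' characters,
    which suggests Unicode text lost to an implicit conversion."""
    return re.search(r"\?{%d,}" % min_run, value) is not None

print(looks_mangled("Is this ok?"))  # False -- a lone '?' is legitimate
print(looks_mangled("??????"))       # True  -- likely mangled Unicode
print(looks_mangled("?"))            # False -- single '?' slips through
```

As the answer says, single-character losses still slip through; raising min_run to 3 trades fewer false positives for more misses.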
Is there something that can be done on either the database (sqlserver 2012) or sqlserver management studio 2012 that can stop or warn people?
Not to my knowledge. About the only thing I can think of would be:
ALTER TABLE some_table ADD CONSTRAINT stop_messing_it_up CHECK (NOT column1 LIKE '%?%');
but you can't tell the difference between a question mark that came from prior content-mangling and a real question mark, so that would only be workable if it were also invalid to put a question mark in the database.
why does the database automagically change the update to ?s, if it's a nvarchar column
It doesn't matter what the column is, it's the type of the string literal in the query expression. In SQL Server (only), non-NATIONAL string literals can only contain characters in the locale-specific (“ANSI”) code page, so the data loss occurs before the content gets anywhere near your table:
SELECT '随机字符中国';
-- returns: ??????
SELECT N'随机字符中国';
-- returns: 随机字符中国

Prevent bad data being selected with SQL Server COLLATE

I have two databases and I am moving data from one database to another. On some columns I need to resolve a collation conflict between "Latin1_General_CI_AS" and "Latin1_General_CI_AI", so I currently do:
SELECT [column] COLLATE Latin1_General_CI_AI
FROM xxx
This works in the main, but in some fields I see characters like:
�
in the data that has been copied across. It seems some characters from the original DB are lost when the collation is changed on the new DB. Now I know I can use REPLACE to get rid of them, but rather than search the entire record set afterwards, I wondered: is there a way to filter out or prevent the special characters coming across in the first place?
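If the copy passes through an application layer, one way to filter up front is to test each value against the destination's code page before it is written. A sketch in Python, assuming the destination behaves like cp1252 (`survives_codepage` is a name invented here):

```python
def survives_codepage(value: str, codepage: str = "cp1252") -> bool:
    """True only if every character is representable in the target
    code page, i.e. nothing will be silently replaced with '?' or U+FFFD."""
    try:
        value.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

rows = ["Dublin", "Béala", "القاهرة"]
clean = [r for r in rows if survives_codepage(r)]
print(clean)  # only the rows that copy across without loss
```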

How to get matched data into database?

I took a flat file, looked up a field in a database, and added another field as a new column to the flat file.
But when I directed the matched output to another database, the matched field is NULL upon inspection with a SELECT statement.
What did I do wrong?
I would check for any of the following on either the flat file or lookup data, which may cause a non-match:
- text data with trailing blanks
- text data with upper case vs lower case
- numeric data of varying datatypes, even just precisions
- probably other issues I haven't listed above - it's just ridiculously fussy
To avoid these issues I always explicitly use SQL CAST or Derived Column transforms to make sure the key fields are all text, all upper case and all exactly the same, byte by byte.
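As a sketch of that normalization (the function name is illustrative, not part of SSIS):

```python
def normalize_key(value) -> str:
    """Mirror the advice above: cast to text, trim trailing blanks,
    and upper-case so lookup keys compare exactly, byte for byte."""
    return str(value).rstrip().upper()

# Two keys that would fail a strict lookup match, normalized:
print(normalize_key("abc123   "))  # 'ABC123'
print(normalize_key("Abc123"))     # 'ABC123'
assert normalize_key("abc123   ") == normalize_key("Abc123")
```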

How can I recover Unicode data which displays in SQL Server as?

I have a database in SQL Server containing a column which needs to contain Unicode data (it contains user's addresses from all over the world e.g. القاهرة‎ for Cairo)
This column is an nvarchar column with the database default collation (Latin1_General_CI_AS), but I've noticed that data containing non-English characters, inserted into it via SQL statements, displays as ?????.
The cause seems to be that I wasn't using the N prefix, e.g.:
INSERT INTO table (address) VALUES ('القاهرة')
Instead of:
INSERT INTO table (address) VALUES (N'القاهرة')
I was under the impression that Unicode would automatically be converted for nvarchar columns and I didn't need this prefix, but this appears to be incorrect.
The problem is I still have some data in this column which appears as ????? in SQL Server Management Studio and I don't know what it is!
Is the data still there but in an incorrect character encoding preventing it from displaying but still salvageable (and if so how can I recover it?), or is it gone for good?
Thanks,
Tom
To find out what SQL Server really stores, use
SELECT CONVERT(VARBINARY(MAX), 'some text')
I just tried this with umlauted characters and Arabic (copied from Wikipedia; I have no idea what it says), both as plain strings and as N'' Unicode strings.
The results are that Arabic non-Unicode strings really end up as question marks (0x3F) in the conversion to VARCHAR.
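The same check can be mirrored outside the database to confirm the data is unrecoverable; a small Python sketch, under the assumption the original column used cp1252:

```python
# Mirroring the VARBINARY check: Arabic forced through an 8-bit code
# page is stored as literal 0x3F ('?') bytes, so the original text is
# gone for good -- there is nothing left to decode.

arabic = "القاهرة"
as_varchar = arabic.encode("cp1252", errors="replace")
print(as_varchar.hex())  # every byte is 3f
assert set(as_varchar) == {0x3F}
```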
SSMS sometimes won't display all characters. I just tried what you had and it worked for me; copy and paste it into Word and it might display correctly.
Usually, if SSMS can't display a character, it shows boxes, not ?.
Try to write a small client that will retrieve these data to a file or web page. Check ALL your code to make sure there are no other inserts or updates that might convert the data to varchar before storing it in the tables.
