SQL Server replace function has odd behavior with unicode modifier character - sql-server

I have data with some half-width unicode characters in it (specifically U+FF9F) which seem to be interfering with the replace function:
select replace(convert(nvarchar(max), N'"'), N'"', N'""')
returns "" as expected, but
select replace(convert(nvarchar(max), N'"゚'), N'"', N'""')
returns the input string unaffected: "゚. I believe this is due to SQL Server interpreting the two-character sequence as being distinct from " by itself.
Is there any way to get the replace function to replace that double-quotation character with two?
EDIT: This question is slightly different from the one posed in Replacing a specific Unicode Character in MS SQL Server. That question asks about removing a half-width character, while this one is about modifying a character next to a half-width character. In both cases the fix was to set the collation to Latin1_General_BIN.

Use COLLATE Latin1_General_BIN so that REPLACE compares the strings by their binary values instead of by the linguistic rules of the default collation:
SELECT REPLACE(CONVERT(nvarchar(max), N'"゚') COLLATE Latin1_General_BIN, N'"', N'""')
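To see the effect side by side, here is a minimal sketch that runs the same REPLACE three ways. The variable name @s and the column aliases are made up for illustration, and Latin1_General_100_BIN2 is included only as an alternative binary collation. What the first column returns depends on the default collation of your database.

```sql
-- Build the test string: a double quote followed by U+FF9F
DECLARE @s nvarchar(max) = N'"' + NCHAR(0xFF9F);

SELECT
    -- uses whatever collation the database defaults to
    REPLACE(@s, N'"', N'""') AS default_collation,
    -- binary collations compare the stored values directly
    REPLACE(@s COLLATE Latin1_General_BIN, N'"', N'""') AS bin_collation,
    REPLACE(@s COLLATE Latin1_General_100_BIN2, N'"', N'""') AS bin2_collation;
```

If the first column comes back unchanged while the other two have the quote doubled, the default collation is what is getting in the way.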

Related

How to remove weird Excel character in SQL Server?

There is a weird whitespace character I can't seem to get rid of that occasionally shows up in my data when importing from Excel. Visibly, it comes across as a whitespace character BUT SQL Server sees it as a question mark (ASCII 63).
declare @temp nvarchar(255); set @temp = 'carolg@c?am.com'
select @temp
returns:
?carolg@c?am.com
How can I get rid of the whitespace without getting rid of real question marks? If I look at the ASCII code for each of those "?" characters I get 63 when, in fact, only one of them is a real question mark.
Have a look at this answer for someone with a similar issue. Sorry if this is a bit long winded:
SQL Server seems to flatten Unicode to ASCII by mapping unrepresentable characters (for which there is no suitable substitution) to a question mark. To replicate this, try opening the Character Map Windows program (should be installed on most machines), select Arial as the font and find U+034F "Combining Grapheme Joiner". Select this character, copy it to the clipboard and paste it between the single quotes below:
declare @t nvarchar(10)
set @t = '͏'
select rtrim(ltrim(@t)) -- we can try and trim it, but by this stage it's already a '?'
You'll get a question mark out, because it doesn't know how to represent this non-ASCII character when it casts it to varchar. To force it to accept it as a double-byte character (nvarchar) you need to use N'' instead, as has already been mentioned. Add an N before the quotes above and the question mark disappears (but the original invisible character is preserved in the output - and ltrim and rtrim won't remove it as demonstrated below):
declare @t nvarchar(10),
@s varchar(10) -- note: single-byte string
set @t = rtrim(ltrim(N'͏')) -- trimming doesn't work here either
set @s = @t
select @s -- still outputs a question mark
Imported data can definitely do this, I've seen it before, and characters like the one I've shown above are particularly hard to diagnose because you can't see them! You will need to create some sort of scrubbing process to remove these unprintables (and any other junk characters, for that matter), and make sure that you use nvarchar everywhere, or you'll end up with this issue. Worse, those phantom question marks will become real question marks that you won't be able to distinguish from legitimate ones.
To see what character code you're dealing with, you can cast as varbinary as follows:
declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- returns 0x4F0374006500730074003F00
-- Returns:
-- 0x4F03   7400  6500  7300  7400  3F00
-- badchar  t     e     s     t     ?
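If reading the raw hex is awkward, a per-character listing can be easier to scan. The following is only a sketch: the loop variable @i and the column aliases are made up for illustration, and it assumes every character fits in a single UTF-16 code unit (no surrogate pairs).

```sql
DECLARE @t nvarchar(10) = NCHAR(0x034F) + N'test?';
DECLARE @i int = 1;

-- DATALENGTH returns bytes; nvarchar uses 2 bytes per UTF-16 code unit
WHILE @i <= DATALENGTH(@t) / 2
BEGIN
    -- UNICODE() returns the integer code of the first character of its argument
    SELECT @i AS position,
           UNICODE(SUBSTRING(@t, @i, 1)) AS code_point;
    SET @i = @i + 1;
END
```

This produces one small result set per character. The invisible character shows up as a code point outside the usual printable range.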
Now to get rid of it:
declare @t nvarchar(10)
set @t = N'͏test?'
select cast(@t as varbinary) -- bad char
set @t = replace(@t COLLATE Latin1_General_100_BIN2, nchar(0x034f), N'');
select cast(@t as varbinary) -- gone!
Note that I had to swap the byte order from 0x4f03 to 0x034f: nvarchar data is stored as little-endian UTF-16, so the low byte comes first (the same reason "t" appears in the output as 0x7400, not 0x0074). For some notes on why we're using binary collation, see this answer.
This is kind of messy, because you don't know what the dirty characters are, and they could be one of thousands of possibilities. One option is to iterate over strings using like or even the unicode() function and discard characters in strings that aren't in a list of acceptable characters, but this could be slow. It may be that most of your bad characters are either at the start or end of the string, which might speed this process up if that's an assumption you think you can make.
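As a rough sketch of the whitelist idea, the loop below keeps only characters whose code falls in an accepted range and drops everything else. The variable names and the choice of printable ASCII (codes 32 to 126) as the acceptable list are assumptions for illustration; a real import would need a list that matches its own data.

```sql
DECLARE @dirty nvarchar(100) = NCHAR(0x034F) + N'test?';
DECLARE @clean nvarchar(100) = N'';
DECLARE @i int = 1;
DECLARE @n int = DATALENGTH(@dirty) / 2;  -- number of UTF-16 code units
DECLARE @c nchar(1);

WHILE @i <= @n
BEGIN
    SET @c = SUBSTRING(@dirty, @i, 1);
    -- assumed whitelist: printable ASCII only
    IF UNICODE(@c) BETWEEN 32 AND 126
        SET @clean = @clean + @c;
    SET @i = @i + 1;
END

SELECT @clean AS cleaned;
```

Character-by-character loops like this are slow on large tables, which is the trade-off mentioned above.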
You may need to build additional processes, either external to SQL Server or as part of an SSIS import, based on what I've shown you above to strip this out quickly if you have a lot of data to import. If you aren't sure of the best way to do this, that's probably best answered in a new question.
I hope that helps.

Why can I store a Ukrainian string in a varchar column?

I was a little surprised that I was able to store a Ukrainian string in a varchar column.
My table is:
create table delete_collation
(
text1 varchar(100) collate SQL_Ukrainian_CP1251_CI_AS
)
and using this query I am able to insert:
insert into delete_collation
values(N'використовується для вирішення квитки')
but when I remove the N prefix, the select statement shows ??????.
Is it okay or am I missing something in understanding unicode and non-unicode with collate?
From MSDN:
Prefix Unicode character string constants with the letter N. Without
the N prefix, the string is converted to the default code page of the
database. This default code page may not recognize certain characters.
UPDATE:
Please see these similar questions:
What is the meaning of the prefix N in T-SQL statements?
Cyrillic symbols in SQL code are not correctly after insert
sql server 2012 express do not understand Russian letters
To expand on MegaTron's answer:
With collate SQL_Ukrainian_CP1251_CI_AS, SQL Server can store Ukrainian characters in a varchar column by using code page 1251.
However, when you specify a string without the N prefix, that string is first converted to the database's default non-Unicode code page, and any character that code page cannot represent becomes a question mark. That is why you see ??????.
So it is completely fine to use varchar and collate as you do, but you must always include the N prefix when sending strings to the database, to avoid the intermediate conversion to the default (non-Ukrainian) code page.
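A small sketch can make the difference visible. The temporary table name #demo_collation and the alias stored_bytes are made up for illustration. What the second row contains depends on the default code page of the database you run it in, so treat the comments as expectations rather than guarantees.

```sql
CREATE TABLE #demo_collation
(
    text1 varchar(100) COLLATE SQL_Ukrainian_CP1251_CI_AS
);

-- Unicode literal: converted straight into the column's code page (1251)
INSERT INTO #demo_collation VALUES (N'квитки');
-- Non-Unicode literal: passes through the database's default code page first
INSERT INTO #demo_collation VALUES ('квитки');

-- Compare the stored bytes of the two rows
SELECT text1, CAST(text1 AS varbinary(100)) AS stored_bytes
FROM #demo_collation;

DROP TABLE #demo_collation;
```

On a database whose default code page has no Cyrillic characters, the second row would be expected to hold only 0x3F bytes (question marks).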

SQL Server: Replace Arabic Character in Select Statement

I would like to replace Arabic character in select statement with another Arabic character in SQL server, example query:
select replace(ArabicName, 'أ', 'ا') as ArabicName from dataIndividual;
But it does not replace anything, probably because SQL Server reads the character as ASCII rather than as Unicode (I guess).
Is there a way to pass Unicode characters to the replace function?
Note: I already tried collate after the Arabic character.
Prefix your string literals with N:
select replace(ArabicName, N'أ' , N'ا') as ArabicName
from dataIndividual;

SQL Server character set and N prefix

[THIS IS NOT A QUESTION ABOUT NVARCHAR OR HOW TO STORE CHINESE CHARACTER]
SQL Server 2008 Express
Database collation is SQL_Latin1_General_CP1_CI_AS
create table sample1(val varchar(2))
insert into sample1 values(N'中文')
I know these Chinese characters would become junk characters.
I know I can use nvarchar to overcome all problem.
What I don't know is: why there isn't "string too long" error when I run the insert statement?
N prefix means that client will encode the string using UNICODE.
2 Chinese characters will become 4 bytes.
varchar(2) can only contain 2 bytes.
Why people down vote this question? really?
An implied cast takes place. This would work if "val" was created as nvarchar(2).
More explanation of @marc_s's answer.
The string N'中文' will be converted to varchar using the collation SQL_Latin1_General_CP1_CI_AS. Since the code page has no such characters, each one is converted to an undefined character, which ends up as 0x3f, so the result is 0x3f3f. 0x3f is the question mark, so there will be two question marks in this case, and they do not exceed the column length.
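The conversion described above can be checked without creating a table. This is only a sketch with made-up column aliases; the explicit COLLATE is there so the result does not depend on the default collation of whichever database it runs in.

```sql
SELECT
    -- the converted text, as it would land in a varchar(2) column
    CAST(N'中文' COLLATE SQL_Latin1_General_CP1_CI_AS AS varchar(2)) AS converted,
    -- the bytes of the converted text
    CAST(CAST(N'中文' COLLATE SQL_Latin1_General_CP1_CI_AS AS varchar(2)) AS varbinary(2)) AS converted_bytes,
    -- the size of the original Unicode literal, in bytes
    DATALENGTH(N'中文') AS unicode_bytes;
```

The point of the comparison is that the length check applies to the converted value, not to the 4-byte Unicode literal.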
Try to use NVARCHAR(...), NCHAR(...) datatypes -
CREATE TABLE dbo.sample1
(
val NVARCHAR(4)
)
INSERT INTO dbo.sample1
SELECT N'中文'

WHERE clause on VARCHAR column seems to operate as a LIKE

I've stumbled across a situation I've never seen before. I hope that someone can explain the following.
I ran the following query, hoping to get only the rows where the column's value is exactly equal to 1101:
select '--' + MyColumn + '--' SeeSpaces, Len(MyColumn) as LengthOfColumn
from MyTable
where MyColumn = '1101'
However, I also see values where 1101 is followed by (what I believe are) spaces.
So SeeSpaces returns
--1101 --
And LengthOfColumn returns 4
MyColumn is a VARCHAR(8), NOT NULL column. Its values (including the spaces) are inserted through a separate workflow.
Why does this select not return only the exact results?
Thanks in advance
The reason has to do with the way SQL Server compares strings with trailing spaces: it follows the ANSI standard, which pads the shorter string before comparing, so the strings '1101' and '1101 ' are equivalent.
See the following for more details:
INF: How SQL Server Compares Strings with Trailing Spaces
I think you have to use the LTRIM() and RTRIM() functions when comparing, like this:
LTRIM(RTRIM(MYCOLUMN))='1101'
Also, the LEN function does not count trailing spaces; it only counts the characters before them. Please refer to: http://msdn.microsoft.com/en-us/library/ms190329%28SQL.90%29.aspx
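If the goal is to match only values with no trailing spaces, one possible approach (a sketch, not something proposed in the answers above) is to compare byte lengths with DATALENGTH in addition to the equality test. The variable @v and the column aliases are made up for illustration.

```sql
DECLARE @v varchar(8) = '1101  ';  -- value with two trailing spaces

SELECT
    -- = ignores trailing spaces
    CASE WHEN @v = '1101' THEN 'equal' ELSE 'not equal' END AS equals_comparison,
    -- LEN ignores trailing spaces as well
    LEN(@v) AS len_result,
    -- DATALENGTH counts every stored byte, spaces included
    DATALENGTH(@v) AS datalength_result,
    CASE WHEN @v = '1101' AND DATALENGTH(@v) = DATALENGTH('1101')
         THEN 'exact' ELSE 'not exact' END AS exact_match;
```

In a WHERE clause the same idea would read: MyColumn = '1101' AND DATALENGTH(MyColumn) = DATALENGTH('1101').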
