Replacing a specific Unicode character in MS SQL Server

I'm using MS SQL Server Express 2012.
I'm having trouble removing the Unicode character U+02CC (decimal 716) from a column value. The original text, as shown in the grid results, is 'λeˌβár'.
I tried the following, but it doesn't work:
SELECT ColumnTextWithUnicode, REPLACE(ColumnTextWithUnicode , 'ˌ','')
FROM TableName
The column has the Latin1_General_CI_AS collation and its data type is nvarchar. I tried switching to a binary collation, with no success either:
SELECT ColumnTextWithUnicode, REPLACE(ColumnTextWithUnicode collate Latin1_General_BIN, 'ˌ' collate Latin1_General_BIN,'')
FROM TableName
Or even using the NChar() function like:
SELECT ColumnTextWithUnicode, REPLACE(ColumnTextWithUnicode , NCHAR(716),'')
FROM TableName
The results are 'λeˌβár' for all three.
But if I cast the column to varchar like:
SELECT ColumnTextWithUnicode, REPLACE(CAST(ColumnTextWithUnicode as varchar(100)), 'ˌ','')
FROM TableName
the result becomes 'eßár', removing both the first character and 'ˌ'.
Any ideas to remove just the 'ˌ'?

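You just need to put the N prefix before the search pattern too, so that it is treated as a Unicode (nvarchar) literal instead of being converted to the database's code page, and keep the binary collation: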
SELECT REPLACE (N'λeˌβár' COLLATE Latin1_General_BIN, N'ˌ', '')
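As a minimal sketch of the same fix applied to a variable rather than a literal (the variable name @s is made up for illustration), NCHAR(716) can stand in for the typed character so the script does not depend on how the file is encoded. The binary collation is kept because, as the question shows, NCHAR(716) alone did not match under Latin1_General_CI_AS:

```sql
-- Hypothetical variable holding the sample text from the question.
DECLARE @s nvarchar(100) = N'λeˌβár';

SELECT
    -- Code point of the third character, to confirm it really is U+02CC (716).
    UNICODE(SUBSTRING(@s, 3, 1)) AS ThirdCodePoint,
    -- Binary collation compares by code point, so the modifier letter is matched.
    REPLACE(@s COLLATE Latin1_General_BIN, NCHAR(716), N'') AS Cleaned;
```

Checking the code point first is a cheap way to make sure the character in the data is the one being searched for, since several look-alike characters exist.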

The following query works for us. We were getting U+FFFD (REPLACEMENT CHARACTER, �) when bulk inserting address data from a text file into SQL Server.
select Address, REPLACE(Address COLLATE Latin1_General_BIN,N'�',' ') from #Temp

Related

How to validate that UTF-8 columns actually save space?

SQL Server 2019 introduces support for the widely used UTF-8 character encoding.
I have a large table that stores sent emails. So I'd like to give this feature a try.
ALTER TABLE dbo.EmailMessages
ALTER COLUMN Body NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AI_SC_UTF8;
ALTER TABLE dbo.EmailMessages REBUILD;
My concern is that I don't know how to verify size gains. It seems that popular scripts for size estimation do not properly report size in this case.
Basically, the column type must be converted to VARCHAR(MAX); the data is then stored in a more compact manner:
To limit the amount of changes required for the above scenarios, UTF-8
is enabled in existing the data types CHAR and VARCHAR. String data is
automatically encoded to UTF-8 when creating or changing an object’s
collation to a collation with the “_UTF8” suffix, for example from
LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.
Size can be inspected using sp_spaceused:
sp_spaceused N'EmailMessages';
If unused space is high then you might need to reorganize:
ALTER INDEX ALL ON dbo.EmailMessages REORGANIZE WITH (LOB_COMPACTION = ON);
In my case size was reduced by a factor of ~2 (mostly English text).
As others have already mentioned, you should use VARCHAR instead of NVARCHAR to store UTF-8 encoded text.
You can use a query like the following to compare string lengths. It assumes a table named #Data with an NVARCHAR column called String.
SELECT *
FROM #Data
CROSS APPLY (
SELECT
CONVERT(VARCHAR(MAX), String COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8) AS Utf8String
) U
CROSS APPLY (
SELECT
LEN(String) AS Length,
--LEN(Utf8String) AS Utf8Length,
DATALENGTH(String) AS NVarcharBytes,
DATALENGTH(Utf8String) AS Utf8Bytes
) L
CROSS APPLY (
SELECT
CASE WHEN Utf8Bytes < NVarcharBytes THEN 'Yes' ELSE '' END AS IsShorter,
CASE WHEN Utf8Bytes > NVarcharBytes THEN 'Yes' ELSE '' END AS IsLonger
) C
CROSS APPLY (
SELECT
CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), String), 1) AS NVarcharHex,
CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), Utf8String), 1) AS Utf8Hex
) H
You can replace FROM #Data with something like FROM (SELECT Email AS String FROM YourTable) D to query your specific data. Replace SELECT * with SELECT SUM(NVarcharBytes) AS NVarcharBytes, SUM(Utf8Bytes) AS Utf8Bytes to get totals.
See this db<>fiddle.
See also: Storage differences between UTF-8 and UTF-16.
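Putting the two substitutions described above together, a sketch of the totals query against the dbo.EmailMessages table and Body column from the question might look like this. It assumes SQL Server 2019 or later (for the _UTF8 collation) and that Body is still NVARCHAR(MAX); it only measures raw string bytes, not row or page overhead:

```sql
SELECT
    -- Bytes used by the current UTF-16 (nvarchar) representation.
    SUM(DATALENGTH(Body)) AS NVarcharBytes,
    -- Bytes the same text would need as UTF-8 in a varchar column.
    SUM(DATALENGTH(CONVERT(VARCHAR(MAX), Body COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8))) AS Utf8Bytes
FROM dbo.EmailMessages;
```

Running this before altering the column gives an estimate of the saving without touching the stored data.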

Kurdish Sorani Letters sql server

I am trying to create a database containing Kurdish Sorani letters.
My database fields have to be varchar because the project was started that way.
First I created the database with the Arabic_CI_AS collation.
I can store all Arabic letters in varchar fields, but Kurdish letters such as
ڕۆ are shown as ?? in the table after entering data. I think my collation is wrong. Does anybody have an idea which collation to use?
With that collation, no, you need to use nvarchar and always prefix such strings with the N prefix:
CREATE TABLE dbo.floo
(
UseNPrefix bit,
a varchar(32) collate Arabic_CI_AS,
b nvarchar(32) collate Arabic_CI_AS
);
INSERT dbo.floo(UseNPrefix,a,b) VALUES(0,'ڕۆ','ڕۆ');
INSERT dbo.floo(UseNPrefix,a,b) VALUES(1,N'ڕۆ',N'ڕۆ');
SELECT * FROM dbo.floo;
Output:
UseNPrefix | a  | b
False      | ?? | ??
True       | ?? | ڕۆ
Example db<>fiddle
In SQL Server 2019, you can use a different SC + UTF-8 collation with varchar, but you will still need to prefix string literals with N to prevent data from being lost:
CREATE TABLE dbo.floo
(
UseNPrefix bit,
a varchar(32) collate Arabic_100_CI_AS_KS_SC_UTF8,
b nvarchar(32) collate Arabic_100_CI_AS_KS_SC_UTF8
);
INSERT dbo.floo(UseNPrefix,a,b) VALUES(0,'ڕۆ','ڕۆ');
INSERT dbo.floo(UseNPrefix,a,b) VALUES(1,N'ڕۆ',N'ڕۆ');
SELECT * FROM dbo.floo;
Output:
UseNPrefix | a  | b
False      | ?? | ??
True       | ڕۆ | ڕۆ
Example db<>fiddle
Basically, even if you are on SQL Server 2019, your requirements of "I need to store Sorani" and "I can't change the table" are incompatible. You will need to either change the data type of the column or at least change the collation, and you will need to adjust any code that expects to pass this data to SQL Server without an N prefix on strings.
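As a sketch of the first option (changing the data type), using the dbo.floo table as created in the first example: the varchar column is switched to nvarchar and new data is written with N-prefixed literals. Values already stored as ?? are not recovered by this change and would have to be entered again:

```sql
-- Switch the varchar column to nvarchar so it can hold Sorani letters.
ALTER TABLE dbo.floo ALTER COLUMN a nvarchar(32) COLLATE Arabic_CI_AS;

-- New data must still arrive as Unicode literals (or nvarchar parameters).
INSERT dbo.floo(UseNPrefix, a, b) VALUES (1, N'ڕۆ', N'ڕۆ');

SELECT UseNPrefix, a, b FROM dbo.floo;
```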

Is there a SQL Server collation option that will allow matching different apostrophes?

I'm currently using SQL Server 2016 with SQL_Latin1_General_CP1_CI_AI collation. As expected, queries with the letter e will match values with the letters e, è, é, ê, ë, etc because of the accent insensitive option of the collation. However, queries with a ' (U+0027) do not match values containing a ’ (U+2019). I would like to know if such a collation exists where this case would match, since it's easier to type ' than it is to know that ’ is keystroke Alt-0146.
I'm confident in saying no. The main thing, here, is that the two characters are different (although similar). With accents, e and ê are still both an e (just one has an accent). This enables you (for example) to do searches for things like SELECT * FROM Games WHERE [Name] LIKE 'Pokémon%'; and still have rows containing Pokemon return (because people haven't used the accent :P).
The best thing I could suggest would be to use REPLACE (at least in your WHERE clause) so that both rows are returned. That is, however, likely going to get expensive.
If you know which columns are going to be a problem, you could add a PERSISTED computed column to that table. Then you could use that column in your WHERE clause, but display the original one. Something like:
USE Sandbox;
--Create Sample table and data
CREATE TABLE Sample (String varchar(500));
INSERT INTO Sample
VALUES ('This is a string that does not contain either apostrophe'),
('Where as this string, isn''t without at least one'),
('’I have one of them as well’'),
('’Well, I''m going to use both’');
GO
--First attempt (without the column)
SELECT String
FROM Sample
WHERE String LIKE '%''%'; --Only returns 2 of the rows
GO
--Create a PERSISTED Column
ALTER TABLE Sample ADD StringRplc AS REPLACE(String,'’','''') PERSISTED;
GO
--Second attempt
SELECT String
FROM Sample
WHERE StringRplc LIKE '%''%'; --Returns 3 rows
GO
--Clean up
DROP TABLE Sample;
GO
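For comparison, the REPLACE-in-the-WHERE-clause approach mentioned earlier can be sketched as follows. The table variable @Sample and its rows are made up for illustration; NCHAR(8217) is U+2019. No schema change is needed, but the expression has to be evaluated for every row, which is why it is likely to get expensive on large tables:

```sql
-- Hypothetical sample data: one row per apostrophe style, plus one without any.
DECLARE @Sample TABLE (String nvarchar(500));
INSERT INTO @Sample (String)
VALUES (N'isn''t'), (N'isn’t'), (N'no apostrophe here');

-- Normalise U+2019 to U+0027 on the fly, then search for the plain apostrophe.
-- This should match the first two rows.
SELECT String
FROM @Sample
WHERE REPLACE(String, NCHAR(8217), N'''') LIKE N'%''%';
```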
The other answer is correct. There is no such collation. You can easily verify this with the below.
DECLARE @dynSql NVARCHAR(MAX) =
'SELECT * FROM (' +
(
SELECT SUBSTRING(
(
SELECT ' UNION ALL SELECT ''' + name + ''' AS name, IIF( NCHAR(0x0027) = NCHAR(0x2019) COLLATE ' + name + ', 1,0) AS Equal'
FROM sys.fn_helpcollations()
FOR XML PATH('')
), 12, 0+ 0x7fffffff)
)
+ ') t
ORDER BY Equal, name';
PRINT @dynSql;
EXEC (@dynSql);

SQL Server unable to trim data of COLLATION type SQL_Latin1_General_CP1_CI_AS

We have two databases: the old one has the collation SQL_Latin1_General_CP1_CI_AS and the new one has Latin1_General_CI_AI (probably the default).
There's a simple Table1 (ID (int), code (nvarchar(50))) in both databases. What I'm supposed to do is compare the data in the two tables and find the missing or extra records.
Sample data in old table has code like : 'Code1 '
Sample data in new table has code like : 'Code1 '
What I need to be able to do is compare the data (from the 'Name' column). I'm unable to trim the data from the old table:
EXAMPLE:
SELECT LTRIM(RTRIM([Name])) from [OLDDB].dbo.Table1
returns 'Code1 ' -- NOT as expected (probably due to a mismatch in charset)
SELECT LTRIM(RTRIM([Name])) from [NEWDB].dbo.Table1
'Code1' -- as expected
I hope it makes sense. Besides, even if I change the collation at column level, I am still not able to get LTRIM / RTRIM to work!
Thanks.
If CHAR(160) (a non-breaking space) is the problem, then you don't have to change the collation. Just replace those CHAR(160) characters with regular spaces and then RTRIM will work.
SELECT LTRIM(RTRIM(REPLACE([Name], CHAR(160), ' '))) from [OLDDB].dbo.Table1
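To confirm that the trailing character really is CHAR(160) before replacing anything, a quick diagnostic along these lines can help. It is only a sketch, reusing the table and column names from the question; the column aliases are made up:

```sql
SELECT [Name],
       -- Code point of the last stored character: 32 = space, 160 = no-break space.
       UNICODE(RIGHT([Name], 1)) AS LastCodePoint,
       -- LEN ignores trailing regular spaces; DATALENGTH counts every stored byte.
       LEN([Name]) AS LenWithoutTrailingSpaces,
       DATALENGTH([Name]) AS BytesStored
FROM [OLDDB].dbo.Table1;
```

If LastCodePoint is something other than 32 or 160, the same REPLACE approach applies with that code point instead.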
Try the update below; hopefully it will fix the issue.
update [OLDDB].dbo.Table1 set Name = RTRIM(replace(NAME, char(160), char(32)))

SQL Server string getting truncated

I have a column (col1) with nvarchar(max).
I am trying to do
DECLARE @my_string NVARCHAR(max)
set @my_string = N'test'
UPDATE dbo.tab1
SET col1 = @my_string + ISNULL(col1, N'')
No luck; I have no idea why it is happening. @marc_s
The string value in col1 is getting truncated after 250 characters. This is happening in both SQL Server 2005 and 2008.
First of all - I'm not seeing this behavior you're reporting. How and why do you think your column gets truncated at 250 characters?? Are you using a tool to inspect the data that might be truncating the output??
You could check the length of the column:
SELECT col1, LEN(col1) FROM dbo.tab1
Is it really only 250 characters long???
Also: you're mixing VARCHAR (in @my_string) and NVARCHAR (col1) which can lead to messy results. Avoid this!
Next: if you want NVARCHAR(MAX), you need to cast your other strings to that format.
Try this:
DECLARE @my_string NVARCHAR(200)
set @my_string = N'test'
UPDATE dbo.tab1
SET col1 = CAST(@my_string AS NVARCHAR(MAX)) + ISNULL(col1, N'')
As I said - in my tests, I didn't need to do this - but maybe it works in your case?
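To illustrate why the cast can matter in general (a sketch, not a diagnosis of the question's data): when neither operand is a MAX type, the result of a concatenation is limited to 8,000 bytes, which is 4,000 nvarchar characters, and anything beyond that is silently truncated. Casting one operand to NVARCHAR(MAX) lifts the limit. Note that this limit is 4,000 characters, so it would not by itself explain a cutoff at 250:

```sql
SELECT
    -- Neither operand is a MAX type, so the result is cut off at 4,000 characters.
    LEN(REPLICATE(N'x', 3000) + REPLICATE(N'y', 3000)) AS WithoutCast,
    -- One MAX operand makes the whole expression nvarchar(max); all 6,000 characters survive.
    LEN(CAST(REPLICATE(N'x', 3000) AS NVARCHAR(MAX)) + REPLICATE(N'y', 3000)) AS WithCast;
```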
Go to the menu: Query --> Query Options --> Results --> Text
There is an option, Maximum number of characters displayed in each column, and mine defaulted to 256.
Once I set this to 1000 the problem was fixed.
Do you test in SSMS? If so, check in options Query Results > SQL Server > Results to Grid - Maximum characters retrieved > Non XML data. Is there a value 250 or similar?
