Number of bytes used for Unicode characters in varchar - sql-server

A common misconception is to think that with CHAR(n) and VARCHAR(n), the n defines the number of characters. However, in CHAR(n) and VARCHAR(n), the n defines the string length in bytes (0-8,000); n never defines the number of characters that can be stored.
According to this statement from Microsoft, I assume n is the data length of a string, so when we store Unicode characters in varchar, a single character should take 2 bytes. But when I try the sample below, I see the varchar data taking 1 byte instead of 2 bytes.
declare @varchar varchar(6), @nvarchar nvarchar(6)
set @varchar = 'Ø'
select @varchar as VarcharString, len(@varchar) as VarcharStringLength, DATALENGTH(@varchar) as VarcharStringDataLength
Could someone explain the reason behind it?

I found time to test the assumptions of my first answer:
Create UTF8-enabled database
CREATE DATABASE [test-sc] COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8
Create table with all kinds of N/VARCHAR columns
CREATE TABLE [dbo].[UTF8Test](
[Id] [int] IDENTITY(1,1) NOT NULL,
[VarcharText] [varchar](50) COLLATE Latin1_General_100_CI_AI NULL,
[VarcharTextSC] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
[VarcharUTF8] [varchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL,
[NVarcharText] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS NULL,
[NVarcharTextSC] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC NULL,
[NVarcharUTF8] [nvarchar](50) COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 NULL)
Insert test data from various Unicode ranges
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
VALUES ('a','a','a','a','a','a')
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
VALUES ('ö','ö','ö',N'ö',N'ö',N'ö')
-- U+56D7
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
VALUES (N'囗',N'囗',N'囗',N'囗',N'囗',N'囗')
-- U+2000B
INSERT INTO [dbo].[UTF8Test] ([VarcharText],[VarcharTextSC],[VarcharUTF8],[NVarcharText],[NVarcharTextSC],[NVarcharUTF8])
VALUES (N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋',N'𠀋')
Select the lengths
SELECT TOP (1000) [Id]
,[VarcharText]
,[VarcharTextSC]
,[VarcharUTF8]
,[NVarcharText]
,[NVarcharTextSC]
,[NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
,LEN([VarcharText]) VT
,LEN([VarcharTextSC]) VTSC
,LEN([VarcharUTF8]) VU
,LEN([NVarcharText]) NVT
,LEN([NVarcharTextSC]) NVTSC
,LEN([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
SELECT TOP (1000) [Id]
,DATALENGTH([VarcharText]) VT
,DATALENGTH([VarcharTextSC]) VTSC
,DATALENGTH([VarcharUTF8]) VU
,DATALENGTH([NVarcharText]) NVT
,DATALENGTH([NVarcharTextSC]) NVTSC
,DATALENGTH([NVarcharUTF8]) NVU
FROM [test-sc].[dbo].[UTF8Test]
I was surprised to find that the old mantra "a VARCHAR only stores single-byte characters" needs to be revised when using UTF8 collations.
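The same effect can be shown without a table by forcing a collation on a literal inside a CAST (a standalone sketch, SQL Server 2019+ for the UTF8 collation; the byte counts in the comments reflect my understanding of the encodings, not output copied from the test above):
SELECT DATALENGTH(CAST(N'ö' COLLATE Latin1_General_100_CI_AI AS VARCHAR(10))); --1 byte (code page 1252)
SELECT DATALENGTH(CAST(N'ö' COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 AS VARCHAR(10))); --2 bytes (UTF-8)
SELECT DATALENGTH(CAST(N'𠀋' COLLATE Latin1_General_100_CI_AI_KS_SC_UTF8 AS VARCHAR(10))); --4 bytes (UTF-8, supplementary plane)
SELECT DATALENGTH(CAST(N'𠀋' AS NVARCHAR(10))); --4 bytes (UTF-16 surrogate pair)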
Note that collations are associated only with table columns, not with T-SQL variables:
DECLARE @VarcharText VARCHAR(50), @NVarcharText NVARCHAR(50);
SELECT @VarcharText = [VarcharText],
@NVarcharText = [NVarcharText]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)
SELECT @VarcharText = [VarcharTextSC],
@NVarcharText = [NVarcharTextSC]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)
SELECT @VarcharText = [VarcharUTF8],
@NVarcharText = [NVarcharUTF8]
FROM [test-sc].[dbo].[UTF8Test]
WHERE [Id] = 4
SELECT @VarcharText, LEN(@VarcharText), DATALENGTH(@VarcharText), @NVarcharText, LEN(@NVarcharText), DATALENGTH(@NVarcharText)
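Variables instead pick up the default collation of the database the batch runs in, which you can verify with:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation');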

I thought the original quote was a bit confusing, as it continues
The misconception happens because when using single-byte encoding, the
storage size of CHAR and VARCHAR is n bytes and the number of
characters is also n.
but since it mentions encodings, my guess is that the statement refers to the UTF-8 encodings supported in SQL Server 2019 and higher, which seem to allow storing Unicode in VARCHAR columns (I haven't tried it yet).

declare @char varchar(4)
declare @nvarchar nvarchar(4)
set @char = '#'
set @nvarchar = '#'
select @char as charString,
LEN(@char) as charStringLength,
DATALENGTH(@char) as charStringDataLength
select @nvarchar as nvarcharString,
LEN(@nvarchar) as nvarcharStringLength,
DATALENGTH(@nvarchar) as nvarcharStringDataLength

You can store Unicode in varchar (if you want to); however, every byte is interpreted as a single character, while Unicode in SQL Server (UTF-16, formerly UCS-2) uses 2 bytes per character, and you have to account for that when displaying Unicode stored in varchar.
declare @nv nvarchar(10) = N'❤'
select @nv;
declare @v varchar(10) = cast(cast(@nv as varbinary(10)) as varchar(10))
select @v, len(@v); --two chars
select cast(@nv as varbinary(10)), cast(@v as varbinary(10)); --same bytes in both n/var char
--display nchar from char
select cast(cast(@v as varbinary(10)) as nvarchar(10));

Related

Overcome 255 char limitations in Sybase ASE system while Concat Multiple Column

I have a table with 39 columns and 30 rows in Sybase. I am trying to concatenate all 39 columns into a single column with 30 rows.
Tools used:
WinSQL Professional 4.5 connected to the Sybase DB
table1 has the actual data
Created a temp table2 with data type text: Create table #temp2 (Line text)
Inserted and formatted using trim for spaces and null values, and tried concatenating from table1 into temp table2 using the + symbol
Result: data gets truncated at 256 characters
Findings: Sybase ASE text data type supports only 255 characters
Can someone suggest how to overcome this issue?
Sybase ASE's text data type is not limited to 256 characters, but there are some tricks to using it successfully, such as specifying textsize and being aware that these settings may be session- and stored-procedure-specific.
Consider the following example Sybase ASE 16.0 GA on Linux:
Create the 39-column table.
create table table_1 (
col_1 Varchar(255) null,
col_2 Varchar(255) null,
col_3 Varchar(255) null,
col_4 Varchar(255) null,
col_5 Varchar(255) null,
col_6 Varchar(255) null,
col_7 Varchar(255) null,
col_8 Varchar(255) null,
col_9 Varchar(255) null,
col_10 Varchar(255) null,
col_11 Varchar(255) null,
col_12 Varchar(255) null,
col_13 Varchar(255) null,
col_14 Varchar(255) null,
col_15 Varchar(255) null,
col_16 Varchar(255) null,
col_17 Varchar(255) null,
col_18 Varchar(255) null,
col_19 Varchar(255) null,
col_20 Varchar(255) null,
col_21 Varchar(255) null,
col_22 Varchar(255) null,
col_23 Varchar(255) null,
col_24 Varchar(255) null,
col_25 Varchar(255) null,
col_26 Varchar(255) null,
col_27 Varchar(255) null,
col_28 Varchar(255) null,
col_29 Varchar(255) null,
col_30 Varchar(255) null,
col_31 Varchar(255) null,
col_32 Varchar(255) null,
col_33 Varchar(255) null,
col_34 Varchar(255) null,
col_35 Varchar(255) null,
col_36 Varchar(255) null,
col_37 Varchar(255) null,
col_38 Varchar(255) null,
col_39 Varchar(255) null)
go
I receive a warning about the potential row sizes not fitting on the page. My Sybase ASE instance is configured for 2K pages. 16K page size instances will not receive this warning. Truncation will only occur should the row size become larger than the page size:
Warning: Row size (10028 bytes) could exceed row size limit, which is 1962
bytes.
Insert the rows into table_1. Ideally, with 16K pages and columns with 255 characters, this insert statement could be used:
insert into table_1 values (
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL01',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL02',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL03',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL04',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL05',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL06',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL07',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL08',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL09',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL10',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL11',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL12',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL13',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL14',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL15',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL16',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL17',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL18',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL19',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL20',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL21',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL22',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL23',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL24',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL25',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL26',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL27',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL28',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL29',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL30',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL31',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL32',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL33',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL34',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL35',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL36',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL37',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL38',
'0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789COL39')
go 30
The "go 30" at the end, submits the SQL batch 30 times, inserting 30 rows.
Since my own instance only has 2K pages, I'm limiting myself to 45 characters per column which is 1755 characters.
insert into table_1 values (
'0123456789012345678901234567890123456789COL01',
'0123456789012345678901234567890123456789COL02',
'0123456789012345678901234567890123456789COL03',
'0123456789012345678901234567890123456789COL04',
'0123456789012345678901234567890123456789COL05',
'0123456789012345678901234567890123456789COL06',
'0123456789012345678901234567890123456789COL07',
'0123456789012345678901234567890123456789COL08',
'0123456789012345678901234567890123456789COL09',
'0123456789012345678901234567890123456789COL10',
'0123456789012345678901234567890123456789COL11',
'0123456789012345678901234567890123456789COL12',
'0123456789012345678901234567890123456789COL13',
'0123456789012345678901234567890123456789COL14',
'0123456789012345678901234567890123456789COL15',
'0123456789012345678901234567890123456789COL16',
'0123456789012345678901234567890123456789COL17',
'0123456789012345678901234567890123456789COL18',
'0123456789012345678901234567890123456789COL19',
'0123456789012345678901234567890123456789COL20',
'0123456789012345678901234567890123456789COL21',
'0123456789012345678901234567890123456789COL22',
'0123456789012345678901234567890123456789COL23',
'0123456789012345678901234567890123456789COL24',
'0123456789012345678901234567890123456789COL25',
'0123456789012345678901234567890123456789COL26',
'0123456789012345678901234567890123456789COL27',
'0123456789012345678901234567890123456789COL28',
'0123456789012345678901234567890123456789COL29',
'0123456789012345678901234567890123456789COL30',
'0123456789012345678901234567890123456789COL31',
'0123456789012345678901234567890123456789COL32',
'0123456789012345678901234567890123456789COL33',
'0123456789012345678901234567890123456789COL34',
'0123456789012345678901234567890123456789COL35',
'0123456789012345678901234567890123456789COL36',
'0123456789012345678901234567890123456789COL37',
'0123456789012345678901234567890123456789COL38',
'0123456789012345678901234567890123456789COL39')
go 30
Check that the correct number of characters has been entered; this should be the length of the string in each column. In my case, 45.
select
char_length(col_1),
char_length(col_2),
char_length(col_3),
char_length(col_4),
char_length(col_5),
char_length(col_6),
char_length(col_7),
char_length(col_8),
char_length(col_9),
char_length(col_10),
char_length(col_11),
char_length(col_12),
char_length(col_13),
char_length(col_14),
char_length(col_15),
char_length(col_16),
char_length(col_17),
char_length(col_18),
char_length(col_19),
char_length(col_20),
char_length(col_21),
char_length(col_22),
char_length(col_23),
char_length(col_24),
char_length(col_25),
char_length(col_26),
char_length(col_27),
char_length(col_28),
char_length(col_29),
char_length(col_30),
char_length(col_31),
char_length(col_32),
char_length(col_33),
char_length(col_34),
char_length(col_35),
char_length(col_36),
char_length(col_37),
char_length(col_38),
char_length(col_39)
from table_1
go
Now create table_2 with the text column.
create table table_2 (col_1 text null)
go
Insert the rows into table_2 from the concatenated values of the columns in table_1. There will be one row in table_2 for each row in table_1.
insert into table_2 select
col_1 +
col_2 +
col_3 +
col_4 +
col_5 +
col_6 +
col_7 +
col_8 +
col_9 +
col_10 +
col_11 +
col_12 +
col_13 +
col_14 +
col_15 +
col_16 +
col_17 +
col_18 +
col_19 +
col_20 +
col_21 +
col_22 +
col_23 +
col_24 +
col_25 +
col_26 +
col_27 +
col_28 +
col_29 +
col_30 +
col_31 +
col_32 +
col_33 +
col_34 +
col_35 +
col_36 +
col_37 +
col_38 +
col_39 as result
from table_1
go
Check the length of the column in table_2. If using 45 characters it should be 1755; if using 255 characters it should be 9945.
select char_length(col_1) from table_2
go
Confirm the last value of the last row ends in "COL39".
select col_1 from table_2
go
Something like...
0123456789012345678901234567890123456789COL39
Given that the above test case was conducted using Sybase's isql utility, we can show that Sybase ASE does correctly concatenate the values and store them in a text column. You are using WinSQL, a tool that I am not familiar with and do not have access to. I imagine it may be imposing some limits on what is being displayed. I suspect somewhere it may be running:
set textsize 255
or simply truncating the data. The above test case should be able to confirm this. The values returned by char_length() will not be subject to truncation unless the input data itself is truncated.
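You can check and raise the limit in your own session (a quick sketch; @@textsize reports the current session setting, and the size to set is up to you):
select @@textsize
go
set textsize 1000000
go
select col_1 from table_2
go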
After installing Sybase ASE 12.5.1 on Windows 2000 and configuring it for 16K pages, the commands above were executed in a database called "rwc". The commands worked as advertised.

How to store different collation text in SQL Server sql_variant type?

SQL Server stores its own collation with each sql_variant text value, so for test purposes I tried to store strings ranging from German to French in sql_variant columns.
CREATE TABLE [dbo].[VarCollation]
(
[uid] [INT] IDENTITY (1, 1) NOT NULL,
[comment] NVARCHAR(100),
[variant_ger] [sql_variant] NULL,
[variant_rus] [sql_variant] NULL,
[variant_jap] [sql_variant] NULL,
[variant_ser] [sql_variant] NULL,
[variant_kor] [sql_variant] NULL,
[variant_fre] [sql_variant] NULL
) ON [PRIMARY]
GO
INSERT INTO VarCollation(comment, variant_ger, variant_rus, variant_jap, variant_ser, variant_kor, variant_fre)
VALUES('NVarChar',
CONVERT(NVARCHAR, N'Öl fließt') COLLATE SQL_Latin1_General_CP1_CI_AS,
CONVERT(NVARCHAR, N'Москва') COLLATE Cyrillic_General_CI_AS,
CONVERT(NVARCHAR, N' ♪リンゴ可愛いや可愛いやリンゴ。半世紀も前に流行した「リンゴの') COLLATE Japanese_CI_AS,
CONVERT(NVARCHAR, N'ŠšĐđČčĆ掞') COLLATE Serbian_Latin_100_CI_AS,
CONVERT(NVARCHAR, N'향찰/鄕札 구결/口訣 이두/吏讀') COLLATE Korean_100_CI_AS,
CONVERT(NVARCHAR, N'le caractère') COLLATE French_CS_AS);
GO
INSERT INTO VarCollation (comment, variant_ger, variant_rus, variant_jap, variant_ser, variant_kor, variant_fre)
VALUES('VarChar',
CONVERT(VARCHAR, N'Öl fließt') COLLATE SQL_Latin1_General_CP1_CI_AS,
CONVERT(VARCHAR, N'Москва') COLLATE Cyrillic_General_CI_AS,
CONVERT(VARCHAR, N' ♪リンゴ可愛いや可愛いやリンゴ。半世紀も前に流行した「リンゴの') COLLATE Japanese_CI_AS,
CONVERT(VARCHAR, N'ŠšĐđČčĆ掞') COLLATE Serbian_Latin_100_CI_AS,
CONVERT(VARCHAR, N'향찰/鄕札 구결/口訣 이두/吏讀') COLLATE Korean_100_CI_AS,
CONVERT(VARCHAR, N'le caractère') COLLATE French_CS_AS);
GO
By analyzing the data of each sql_variant I can see that each value is stored with the exact collation assigned, for both NVARCHAR and VARCHAR:
German: collationId 0x3400d008, codepage 0x000004e4
Russian: collationId 0x0000d015, codepage 0x000004e3
Japanese: collationId 0x0000d010, codepage 0x000003a4
Serbian: collationId 0x0004d04c, codepage 0x000004e2
Korean: collationId 0x0004d040, codepage 0x000003b5
French: collationId 0x0000c00b, codepage 0x000004e4
but SSMS shows proper values for NVARCHAR and garbage for VARCHAR:
uid comment variant_ger variant_rus variant_jap variant_ser variant_kor variant_fre
1 NVarChar Öl fließt Москва  ♪リンゴ可愛いや可愛いやリンゴ。半世紀も前に流行した「リン ŠšĐđČčĆ掞 향찰/鄕札 구결/口訣 이두/吏讀 le caractère
2 VarChar Ol flie?t Москва ?d???????????????????????????? SsDdCcCcZz ??/?? ??/?? ??/?? le caractere
From what I see in the sql_variant data for VARCHAR, the Japanese text is stored with some characters already replaced by 0x3f ('?'). I tried to INSERT without CONVERT and without the N prefix, but the result is the same. Is it possible to insert such text into sql_variant, and if so, how?
To answer your question: yes, you can store different collations in a sql_variant; however, your COLLATE clause is in the wrong place. You are changing the collation of the value after the nvarchar has been converted to a varchar, so the characters have already been lost. Converting a varchar back to an nvarchar, or changing its collation afterwards, doesn't restore "lost" data; it has already been lost.
Even if you fix that, you'll notice that you still don't get the results you want:
USE Sandbox;
GO
CREATE TABLE TestT (TheVarchar sql_variant)
INSERT INTO dbo.TestT (TheVarchar)
SELECT CONVERT(varchar, N'향찰/鄕札 구결/口訣 이두/吏讀' COLLATE Korean_100_CI_AS)
INSERT INTO dbo.TestT (TheVarchar)
SELECT CONVERT(varchar, N' ♪リンゴ可愛いや可愛いやリンゴ。半世紀も前に流行した「リンゴの' COLLATE Japanese_CI_AS);
SELECT *
FROM dbo.TestT;
GO
DROP TABLE dbo.TestT;
Notice that the second string has the value ' ♪リンゴ可愛いや可愛いやリン' (it has been truncated). That's because you haven't declared a length for the varchar. Always declare your lengths, precisions, scales, etc. You know your data better than I do, so you will know an appropriate value for it.
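With both fixes applied (COLLATE on the nvarchar value, and an explicit length), the conversion survives, assuming all of the characters exist in the Korean code page (949):
SELECT CONVERT(varchar(100), N'향찰/鄕札 구결/口訣 이두/吏讀' COLLATE Korean_100_CI_AS);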

Convert utf-8 encoded varbinary(max) data to nvarchar(max) string

Is there a simple way to convert a UTF-8 encoded varbinary(max) column to varchar(max) in T-SQL? Something like CONVERT(varchar(max), [MyDataColumn]). Best would be a solution that does not need custom functions.
Currently, I convert the data on the client side, but this has the downside that correct filtering and sorting is not as efficient as when done server-side.
XML trick
The following solution should work for any encoding.
There is a tricky way of doing exactly what the OP asks. Edit: I found the same method discussed on SO (SQL - UTF-8 to varchar/nvarchar Encoding issue).
The process goes like this:
SELECT
CAST(
'<?xml version=''1.0'' encoding=''utf-8''?><![CDATA[' --start CDATA
+ REPLACE(
LB.LongBinary,
']]>', --we only need to escape ]]>, which ends a CDATA section
']]]]><![CDATA[>' --we simply split it into two CDATA sections
) + ']]>' AS XML --finish CDATA
).value('.', 'nvarchar(max)')
FROM dbo.MySourceTable AS LB --hypothetical table holding the utf-8 varbinary column LongBinary
Why it works: varbinary and varchar are the same string of bits - only the interpretation differs. The resulting XML truly is a UTF-8 encoded bit stream, and the XML parser is then able to reconstruct the correct UTF-8 encoded characters.
BEWARE of the 'nvarchar(max)' in the value function. If you used varchar, it would destroy multi-byte characters (depending on your collation).
BEWARE 2: XML cannot handle some characters, e.g. 0x2. When your string contains such characters, this trick will fail.
Database trick (SQL Server 2019 and newer)
This is simple: create another database with a UTF8 collation as the default one, then create a function that converts VARBINARY to VARCHAR. The returned VARCHAR will have that database's UTF8 collation.
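A minimal sketch of such a function, as I understand the trick (the name matches the FUNCTIONS_ONLY.dbo.VarBinaryToUTF8 call in the benchmark below; create it inside the UTF8-collated database):
CREATE FUNCTION dbo.VarBinaryToUTF8 (@Input VARBINARY(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
--the CAST picks up the database default collation, which here is a UTF8 one,
--so the bytes are reinterpreted as UTF-8 instead of a single-byte code page
RETURN CAST(@Input AS VARCHAR(MAX));
END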
Insert trick (SQL Server 2019 and newer)
This is another simple trick: create a table with one column VARCHAR COLLATE ...UTF8, then insert the VARBINARY data into this table. It will be saved correctly as UTF8 VARCHAR. It is sad that memory-optimized tables cannot use UTF8 collations...
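A sketch of the insert trick (the source table and column are illustrative; the benchmark below does the same thing with #InsertConverts):
create table #utf8import (Txt varchar(max) collate Czech_100_CI_AI_SC_UTF8);
insert into #utf8import (Txt)
select MyUtf8Binary from dbo.MySourceTable; --the varbinary bytes are reinterpreted as UTF-8
select Txt from #utf8import;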
Alter table trick (SQL Server 2019 and newer)
(don't use this, it is unnecessary; see the Insert trick above)
I was trying to come up with an approach using SQL Server 2019's UTF8 collations and I have found one possible method so far that should be faster than the XML trick (see below).
Create temporary table with varbinary column.
Insert varbinary values into the table
Alter table alter column to varchar with utf8 collation
drop table if exists
#bin,
#utf8;
create table #utf8 (UTF8 VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8);
create table #bin (BIN VARBINARY(MAX));
insert into #utf8 (UTF8) values ('Žluťoučký kůň říčně pěl ďábelské ódy za svitu měsíce.');
insert into #bin (BIN) select CAST(UTF8 AS varbinary(max)) from #utf8;
select * from #utf8; --here you can see the utf8 string is stored correctly
select BIN, CAST(BIN AS VARCHAR(MAX)) from #bin; --the utf8 binary is converted into gibberish
alter table #bin alter column BIN varchar(max) collate Czech_100_CI_AI_SC_UTF8;
select * from #bin; --voilà, correctly converted varchar
alter table #bin alter column BIN nvarchar(max);
select * from #bin; --finally, correctly converted nvarchar
Speed difference
The Database trick together with the Insert trick are the fastest ones.
The XML trick is slower.
The Alter table trick is stupid, don't do it. It loses out when you change lots of short texts at once (the altered table is large).
The test:
the first string contains one replace for the XML trick
the second string is plain ASCII with no replaces for the XML trick
@TextLengthMultiplier determines the length of the converted text
@TextAmount determines how many of them will be converted at once
------------------
--TEST SETUP
--uncomment one of the two test strings:
--DECLARE @LongText NVARCHAR(MAX) = N'český jazyk, Tiếng Việt, русский язык, 漢語, ]]>';
--DECLARE @LongText NVARCHAR(MAX) = N'JUST ASCII, for LOLZ------------------------------------------------------';
DECLARE
@TextLengthMultiplier INTEGER = 100000,
@TextAmount INTEGER = 10;
---------------------
-- TECHNICALITIES
DECLARE
@StartCDATA DATETIME2(7), @EndCDATA DATETIME2(7),
@StartTable DATETIME2(7), @EndTable DATETIME2(7),
@StartDB DATETIME2(7), @EndDB DATETIME2(7),
@StartInsert DATETIME2(7), @EndInsert DATETIME2(7);
drop table if exists
#longTexts,
#longBinaries,
#CDATAConverts,
#DBConverts,
#INsertConverts;
CREATE TABLE #longTexts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #longBinaries (LongBinary VARBINARY(MAX) NOT NULL);
CREATE TABLE #CDATAConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #DBConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #InsertConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
insert into #longTexts --make the long text longer
(LongText)
select
REPLICATE(@LongText, @TextLengthMultiplier)
from
TESTES.dbo.Numbers --use a while loop if you don't have a numbers table
WHERE
Number BETWEEN 1 AND @TextAmount; --make more of them
insert into #longBinaries (LongBinary) select CAST(LongText AS varbinary(max)) from #longTexts;
--sanity check...
SELECT TOP(1) * FROM #longTexts;
------------------------------
--MEASURE CDATA--
SET @StartCDATA = SYSDATETIME();
INSERT INTO #CDATAConverts
(
LongText
)
SELECT
CAST(
'<?xml version=''1.0'' encoding=''utf-8''?><![CDATA['
+ REPLACE(
LB.LongBinary,
']]>',
']]]]><![CDATA[>'
) + ']]>' AS XML
).value('.', 'Nvarchar(max)')
FROM
#longBinaries AS LB;
SET @EndCDATA = SYSDATETIME();
--------------------------------------------
--MEASURE ALTER TABLE--
SET @StartTable = SYSDATETIME();
DROP TABLE IF EXISTS #AlterConverts;
CREATE TABLE #AlterConverts (UTF8 VARBINARY(MAX));
INSERT INTO #AlterConverts
(
UTF8
)
SELECT
LB.LongBinary
FROM
#longBinaries AS LB;
ALTER TABLE #AlterConverts ALTER COLUMN UTF8 VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8;
--ALTER TABLE #AlterConverts ALTER COLUMN UTF8 NVARCHAR(MAX);
SET @EndTable = SYSDATETIME();
--------------------------------------------
--MEASURE DB--
SET @StartDB = SYSDATETIME();
INSERT INTO #DBConverts
(
LongText
)
SELECT
FUNCTIONS_ONLY.dbo.VarBinaryToUTF8(LB.LongBinary)
FROM
#longBinaries AS LB;
SET @EndDB = SYSDATETIME();
--------------------------------------------
--MEASURE Insert--
SET @StartInsert = SYSDATETIME();
INSERT INTO #INsertConverts
(
LongText
)
SELECT
LB.LongBinary
FROM
#longBinaries AS LB;
SET @EndInsert = SYSDATETIME();
--------------------------------------------
-- RESULTS
SELECT
DATEDIFF(MILLISECOND, @StartCDATA, @EndCDATA) AS CDATA_MS,
DATEDIFF(MILLISECOND, @StartTable, @EndTable) AS ALTER_MS,
DATEDIFF(MILLISECOND, @StartDB, @EndDB) AS DB_MS,
DATEDIFF(MILLISECOND, @StartInsert, @EndInsert) AS Insert_MS;
SELECT TOP(1) '#CDATAConverts ', * FROM #CDATAConverts ;
SELECT TOP(1) '#DBConverts ', * FROM #DBConverts ;
SELECT TOP(1) '#INsertConverts', * FROM #INsertConverts;
SELECT TOP(1) '#AlterConverts ', * FROM #AlterConverts ;
SQL Server does not know UTF-8 (at least not in any version you can use productively). There is limited support starting with v2014 SP2 for reading a UTF-8 encoded file from disk via BCP (the same goes for writing content to disk).
Important to know:
VARCHAR(x) is not utf-8. It is 1-byte-encoded extended ASCII, using a codepage (living in the collation) as character map.
NVARCHAR(x) is not utf-16 (but very close to it, it's ucs-2). This is a 2-byte-encoded string covering almost any known characters (but exceptions exist).
utf-8 will use 1 byte for plain Latin characters, but 2 or even more bytes to encode other character sets.
A VARBINARY(x) will hold the utf-8 as a meaningless chain of bytes.
A simple CAST or CONVERT will not work: VARCHAR will take each single byte as a character. For sure this is not the result you would expect. NVARCHAR would take each chunk of 2 bytes as one character. Again not the thing you need.
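A small illustration of both failure modes (0xC3B6 is the UTF-8 encoding of 'ö'):
SELECT CAST(0xC3B6 AS VARCHAR(10)); --two 1-byte characters: 'ö' under a code page 1252 collation
SELECT CAST(0xC3B6 AS NVARCHAR(10)); --one 2-byte code unit (U+B6C3), also not 'ö'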
You might try to write this out to a file and read it back with BCP (v2014 SP2 or higher). But the better chance I see for you is a CLR function.
You can use the following to write a string into a varbinary field:
Encoding.Unicode.GetBytes(Item.VALUE)
Then use the following to retrieve the data as a string:
public string ReadCString(byte[] cString)
{
var nullIndex = Array.IndexOf(cString, (byte)0);
nullIndex = (nullIndex == -1) ? cString.Length : nullIndex;
// decode only the bytes up to the first null terminator
return System.Text.Encoding.Unicode.GetString(cString, 0, nullIndex);
}

Detect UNICODE characters that are not ASCII in table

I have the following table:
Select
name,
address,
description
from dbo.users
I would like to search this entire table for any characters that are Unicode but not ASCII. Is this possible?
You can find non-ASCII characters quite simply:
SELECT NAME, ADDRESS, DESCRIPTION
FROM DBO.USERS
WHERE NAME != CAST(NAME AS VARCHAR(4000))
OR ADDRESS != CAST(ADDRESS AS VARCHAR(4000))
OR DESCRIPTION != CAST(DESCRIPTION AS VARCHAR(4000))
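For instance, under a Latin1_General collation (where Cyrillic has no mapping), a Cyrillic string round-trips as question marks and is therefore flagged (a quick sketch):
SELECT CASE WHEN N'Москва' <> CAST(N'Москва' AS VARCHAR(4000))
THEN 'has non-convertible characters'
ELSE 'converts cleanly' END;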
If you want to determine if there are any characters in an NVARCHAR / NCHAR / NTEXT column that cannot be converted to VARCHAR, you need to convert to VARCHAR using the _BIN2 variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS, then you would specify Albanian_100_BIN2 for the test.

The reason for using a _BIN2 collation is that non-binary collations will only find instances where there is at least one character that does not have any mapping at all in the code page and is thus converted into ?. But non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping.

For example, the superscript 2 character, ², has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular 2. The problem with the non-binary collation is that 2 will equate to ² and so it won't register as a row that can't convert to VARCHAR. For example:
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2
Ideally you would convert back to NVARCHAR explicitly for the code to be clear on what it's doing, though not doing this will still implicitly convert back to NVARCHAR, so the behavior is the same either way.
Please note that only MAX types are used. Do not use NVARCHAR(4000) or VARCHAR(4000) else you might get false positives due to truncation of data in NVARCHAR(MAX) columns.
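A sketch of the false positive that the MAX types avoid: a pure-ASCII string longer than 4,000 characters compares unequal only because the VARCHAR(4000) cast truncates it:
DECLARE @s NVARCHAR(MAX) = REPLICATE(CONVERT(NVARCHAR(MAX), N'a'), 5000);
SELECT CASE WHEN @s <> CONVERT(NVARCHAR(MAX), CONVERT(VARCHAR(4000), @s))
THEN 'false positive' ELSE 'ok' END;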
So, in terms of the example code in the question, the query would be (assuming that a Latin1_General collation is being used):
SELECT usr.*
FROM dbo.[users] usr
WHERE usr.[name] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[name] COLLATE Latin1_General_100_BIN2))
OR usr.[address] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[address] COLLATE Latin1_General_100_BIN2))
OR usr.[description] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[description] COLLATE Latin1_General_100_BIN2));
There doesn't seem to be a built-in function for this, as far as I can tell. A brute-force approach is to pass each character to ASCII() and then pass the result to CHAR() and check whether it returns '?', which would mean the character is out of range. You can write a UDF with the code below as a reference, but I should think it is a very inefficient solution:
declare @i int = 1
declare @x nvarchar(10) = N'vsdǣf'
declare @result nvarchar(100) = N''
while (@i <= len(@x))
begin
if char(ascii(substring(@x,@i,1))) = '?'
begin
set @result = @result + substring(@x,@i,1)
end
set @i = @i+1
end
select @result

Select statement returns nothing when column collation SQL_Latin1_General_CP1_CI_AS in T-sql

I have a select statement as below:
SELECT Veri from tblTest
where CAST(Veri COLLATE SQL_Latin1_General_CP1_CI_AS as varchar(10))=
CAST('БHО' COLLATE SQL_Latin1_General_CP1_CI_AS as varchar(10))
Column Veri has collation of type SQL_Latin1_General_CP1_CI_AS.
There is a row where Veri equals БHО. However, the select statement returns nothing.
Table tblTest's collation is also SQL_Latin1_General_CP1_CI_AS.
What am I doing wrong?
Edit: The column definition for column Veri is as follows:
CONDENSED_TYPE: nvarchar(50)
TABLE_SCHEMA: dbo
TABLE_NAME: tblTest
COLUMN_NAME: Veri
ORDINAL_POSITION: 2
COLUMN_DEFAULT: NULL
IS_NULLABLE: NO
DATA_TYPE: nvarchar
CHARACTER_MAXIMUM_LENGTH: 50
CHARACTER_OCTET_LENGTH: 100
NUMERIC_PRECISION:NULL
NUMERIC_PRECISION_RADIX: NULL
NUMERIC_SCALE: NULL
DATETIME_PRECISION: NULL
CHARACTER_SET_CATALOG: NULL
CHARACTER_SET_SCHEMA: NULL
COLLATION_NAME: SQL_Latin1_General_CP1_CI_AS
CHARACTER_SET_NAME: UNICODE
COLLATION_CATALOG: NULL
DOMAIN_SCHEMA: NULL
DOMAIN_NAME: NULL
In T-SQL the string constant 'БHО' is an ANSI string, and 'Б' is not available in the code page, so you'll get the question marks that @EduardUta queried. You need to work with Unicode strings, using the N prefix for string constants, and nvarchar. Try this;
SELECT Veri from tblTest
where CAST(Veri COLLATE SQL_Latin1_General_CP1_CI_AS as nvarchar(10)) =
CAST(N'БHО' COLLATE SQL_Latin1_General_CP1_CI_AS as nvarchar(10))
You may be able to remove the COLLATE directives - depends on your schema.
Another thing you can do is to examine a string character by character to see what each character actually is. For example, in your string 'БHО' it might look like the Cyrillic capital letter Be followed by the English letters H and O, but it's not, that's why you are not getting a match.
declare @s nvarchar(100) = N'БНО'
declare @i int = 1
while (@i <= len(@s))
begin
print substring(@s, @i, 1) + N' - 0x' + convert(varchar(8), convert(varbinary(4), unicode(substring(@s, @i, 1))), 2)
set @i = @i + 1
end
Try retyping the Н and О in the string N'БНО' above as the Latin letters H and O and running it again - you'll see 0x48 and 0x4F respectively.
Hope this helps,
Rhys
