How to validate that UTF-8 columns actually save space? - sql-server

SQL Server 2019 introduces support for the widely used UTF-8 character encoding.
I have a large table that stores sent emails. So I'd like to give this feature a try.
ALTER TABLE dbo.EmailMessages
ALTER COLUMN Body NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AI_SC_UTF8;
ALTER TABLE dbo.EmailMessages REBUILD;
My concern is that I don't know how to verify size gains. It seems that popular scripts for size estimation do not properly report size in this case.

Basically, the column type must be converted to VARCHAR(MAX) (an NVARCHAR column stays UTF-16 regardless of its collation); the data is then stored more compactly:
To limit the amount of changes required for the above scenarios, UTF-8
is enabled in existing the data types CHAR and VARCHAR. String data is
automatically encoded to UTF-8 when creating or changing an object’s
collation to a collation with the “_UTF8” suffix, for example from
LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.
Size can be inspected using sp_spaceused:
sp_spaceused N'EmailMessages';
If unused space is high then you might need to reorganize:
ALTER INDEX ALL ON dbo.EmailMessages REORGANIZE WITH (LOB_COMPACTION = ON);
In my case size was reduced by a factor of ~2 (mostly English text).
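The steps above can be put together as a rough before/after check. This is only a sketch: it assumes the dbo.EmailMessages table and Body column from the question, and that the column is nullable (state NULL or NOT NULL to match your actual definition). sys.dm_db_partition_stats is used in addition to sp_spaceused because it reports LOB pages separately.

```sql
-- Baseline: pages used before the change (each page is 8 KB)
SELECT index_id,
       in_row_used_page_count,
       lob_used_page_count,
       used_page_count * 8 AS used_kb
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'dbo.EmailMessages');

-- Switch to VARCHAR with a UTF-8 collation (SQL Server 2019 or later).
-- NULL is an assumption here; keep the column's existing nullability.
ALTER TABLE dbo.EmailMessages
    ALTER COLUMN Body VARCHAR(MAX) COLLATE Latin1_General_100_CI_AI_SC_UTF8 NULL;

-- Compact LOB pages so freed space is not reported as still in use
ALTER INDEX ALL ON dbo.EmailMessages REORGANIZE WITH (LOB_COMPACTION = ON);

-- Re-run the same measurement and compare it with the baseline
SELECT index_id,
       in_row_used_page_count,
       lob_used_page_count,
       used_page_count * 8 AS used_kb
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'dbo.EmailMessages');
```

Try this on a copy of the table first, since converting a large column rewrites every row.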

As others have already mentioned, you should use VARCHAR instead of NVARCHAR to store UTF-8 encoded text.
You can use a query like the following to compare string lengths. It assumes a table named #Data with an NVARCHAR column called String.
SELECT *
FROM #Data
CROSS APPLY (
SELECT
CONVERT(VARCHAR(MAX), String COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8) AS Utf8String
) U
CROSS APPLY (
SELECT
LEN(String) AS Length,
--LEN(Utf8String) AS Utf8Length,
DATALENGTH(String) AS NVarcharBytes,
DATALENGTH(Utf8String) AS Utf8Bytes
) L
CROSS APPLY (
SELECT
CASE WHEN Utf8Bytes < NVarcharBytes THEN 'Yes' ELSE '' END AS IsShorter,
CASE WHEN Utf8Bytes > NVarcharBytes THEN 'Yes' ELSE '' END AS IsLonger
) C
CROSS APPLY (
SELECT
CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), String), 1) AS NVarcharHex,
CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), Utf8String), 1) AS Utf8Hex
) H
You can replace FROM #Data with something like FROM (SELECT Email AS String FROM YourTable) D to query your specific data. Replace SELECT * with SELECT SUM(NVarcharBytes) AS NVarcharBytes, SUM(Utf8Bytes) AS Utf8Bytes to get totals.
See this db<>fiddle.
See also: Storage differences between UTF-8 and UTF-16.
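To try the comparison without touching real data, a small #Data table can be filled with a few hypothetical sample strings. The sketch below uses a shortened form of the query above, assumes SQL Server 2019 or later, and shows that UTF-8 is not always smaller:

```sql
-- Hypothetical sample rows; #Data and String match the names the query above assumes
CREATE TABLE #Data (String NVARCHAR(100));

INSERT INTO #Data (String)
VALUES (N'Hello'),     -- ASCII only
       (N'Pokémon'),   -- one accented Latin letter
       (N'日本語');     -- three CJK characters

SELECT String,
       DATALENGTH(String) AS NVarcharBytes,   -- UTF-16: 2 bytes per character here
       DATALENGTH(CONVERT(VARCHAR(400),
           String COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS Utf8Bytes
FROM #Data;

DROP TABLE #Data;
```

For these rows the byte counts should be 10 vs 5 for Hello, 14 vs 8 for Pokémon, and 6 vs 9 for 日本語. ASCII text halves in size, while CJK text grows by half.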

Related

Is there a SQL Server collation option that will allow matching different apostrophes?

I'm currently using SQL Server 2016 with SQL_Latin1_General_CP1_CI_AI collation. As expected, queries with the letter e will match values with the letters e, è, é, ê, ë, etc because of the accent insensitive option of the collation. However, queries with a ' (U+0027) do not match values containing a ’ (U+2019). I would like to know if such a collation exists where this case would match, since it's easier to type ' than it is to know that ’ is keystroke Alt-0146.
I'm confident in saying no. The main thing, here, is that the two characters are different (although similar). With accents, e and ê are still both an e (just one has an accent). This enables you (for example) to do searches for things like SELECT * FROM Games WHERE [Name] LIKE 'Pokémon%'; and still have rows containing Pokemon return (because people haven't used the accent :P).
The best thing I could suggest would be to use REPLACE (at least in your WHERE clause) so that both rows are returned. That is, however, likely going to get expensive.
If you know which columns are going to be a problem, you could add a PERSISTED computed column to that table. Then you could use that column in your WHERE clause, but display the original one. Something like:
USE Sandbox;
--Create Sample table and data
CREATE TABLE Sample (String varchar(500));
INSERT INTO Sample
VALUES ('This is a string that does not contain either apostrophe'),
('Where as this string, isn''t without at least one'),
('’I have one of them as well’'),
('’Well, I''m going to use both’');
GO
--First attempt (without the column)
SELECT String
FROM Sample
WHERE String LIKE '%''%'; --Only returns 2 of the rows
GO
--Create a PERSISTED Column
ALTER TABLE Sample ADD StringRplc AS REPLACE(String,'’','''') PERSISTED;
GO
--Second attempt
SELECT String
FROM Sample
WHERE StringRplc LIKE '%''%'; --Returns 3 rows
GO
--Clean up
DROP TABLE Sample;
GO
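For comparison, the plain REPLACE-in-the-WHERE-clause approach mentioned earlier could look like the sketch below. The table variable and search term are made up for illustration. Both sides of the comparison are normalized, so it does not matter which apostrophe the user typed:

```sql
-- Hypothetical sample data containing both apostrophe variants
DECLARE @Sample TABLE (String NVARCHAR(500));
INSERT INTO @Sample (String)
VALUES (N'isn''t'),    -- U+0027 apostrophe
       (N'isn’t'),     -- U+2019 right single quotation mark
       (N'plain');

DECLARE @Search NVARCHAR(100) = N'isn''t';

-- NCHAR(8217) is U+2019; map it to U+0027 on both sides before comparing
SELECT String
FROM @Sample
WHERE REPLACE(String, NCHAR(8217), N'''') = REPLACE(@Search, NCHAR(8217), N'''');
-- Returns both "isn't" rows
```

Because REPLACE is applied to the column, this form cannot use an index on String, which is why the computed column is the better option for large tables.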
The other answer is correct. There is no such collation. You can easily verify this with the below.
DECLARE @dynSql NVARCHAR(MAX) =
'SELECT * FROM (' +
(
SELECT SUBSTRING(
(
SELECT ' UNION ALL SELECT ''' + name + ''' AS name, IIF( NCHAR(0x0027) = NCHAR(0x2019) COLLATE ' + name + ', 1,0) AS Equal'
FROM sys.fn_helpcollations()
FOR XML PATH('')
), 12, 0+ 0x7fffffff)
)
+ ') t
ORDER BY Equal, name';
PRINT @dynSql;
EXEC (@dynSql);

Query to search an alphanumeric string in a non-alphanumeric column

Here is the issue: I have a database column that holds product serial numbers entered by users without any kind of filtering. For example, a user may enter DC-538, DC 538 or DC538, depending on their own interpretation, since the serial number is usually stamped on a metal part of the product and it can be hard to tell whether there is a blank space, for example.
I can't normalize the current column values, because there are so many brands and we can't know for sure whether removing a non-alphanumeric character would cause problems, that is, whether a manufacturer treats these characters as part of the official number. For example, "DC-538-XXX" and "DC538-XXX" could refer to 2 different products. Very unlikely, but we cannot assume it doesn't happen.
Now I need to offer a search by serial number on my website, but if the user searches for "DC538" instead of "DC 538" he won't find it. What's the best approach?
I believe the perfect solution would be a select that searches for the exact string and also strips the non-alphanumeric characters from the search term and compares it to a stripped string in the database (which I don't have). But I don't know if there's a way to do that with SQL only.
Any ideas?
Cheers
Use the function below, which was offered as an answer here, modified to also keep numeric characters:
CREATE FUNCTION [dbo].[RemoveNonAlphaCharacters] (@Temp VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
-- Pattern matching any single character that is not a letter or a digit
DECLARE @KeepValues AS VARCHAR(MAX)
SET @KeepValues = '%[^a-z0-9]%'
-- Remove one offending character per iteration until none remain
WHILE PatIndex(@KeepValues, @Temp) > 0
SET @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')
RETURN @Temp
END
You can do the following:
DECLARE @Input NVARCHAR(MAX)
SET @Input = '%' + dbo.RemoveNonAlphaCharacters('text inputted by user') + '%'
SELECT *
FROM Table
WHERE dbo.RemoveNonAlphaCharacters(ColumnCode) LIKE @Input
Here is a sample working SQLFiddle
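The "exact match plus stripped match" idea from the question could be sketched as follows. To stay self-contained, this version strips only hyphens and spaces with nested REPLACE calls instead of calling the function above, and the table variable and search term are hypothetical:

```sql
-- Hypothetical serial numbers as users might have typed them
DECLARE @Products TABLE (Serial VARCHAR(50));
INSERT INTO @Products (Serial)
VALUES ('DC-538'), ('DC 538'), ('DC538'), ('DC-539');

DECLARE @Term VARCHAR(50) = 'DC538';

SELECT Serial,
       CASE WHEN Serial = @Term THEN 1 ELSE 0 END AS IsExact
FROM @Products
-- Compare stripped forms of both the stored value and the search term
WHERE REPLACE(REPLACE(Serial, '-', ''), ' ', '')
    = REPLACE(REPLACE(@Term,  '-', ''), ' ', '')
ORDER BY IsExact DESC;   -- exact matches first
-- Returns DC538 first, then DC-538 and DC 538; DC-539 is excluded
```

The stored values are left untouched, so serial numbers that differ only by punctuation remain distinguishable, and ranking exact matches first keeps the most likely hit at the top.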

What is the use of writing N' ' in query sql server

I am using sql-server 2012 and I have this query
create table t
(
id int not null,
name varchar(10)
);
select OBJECT_NAME(object_id) as table_name,type,name as index_name,type_desc
from sys.indexes
where object_id=OBJECT_ID(N'dbo.t',N'U')
What's the difference between object_id and OBJECT_ID,
and what is the use of writing N''?
The query returns the same result with or without N.
In SQL Server, the prefix N' marks a string literal as nvarchar; the N stands for national character.
From the doc :
Prefix Unicode character string constants with the letter N. Without
the N prefix, the string is converted to the default code page of the
database. This default code page may not recognize certain characters.
In other words, it is a Unicode string.
The N in N'xxx' means "national language", denoting a Unicode string.
If you use it to store data into a VARCHAR as opposed to an NVARCHAR column, it has little use.
You can read more about it under the "Unicode strings" sub-heading on this page: Constants (Transact-SQL).
Q1: Object_id and OBJECT_ID are one and the same.
Q2 is already answered here.
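A quick way to see what the prefix changes is to inspect the type and byte length of the literal itself. This is a small illustrative sketch, not part of the original answers:

```sql
SELECT SQL_VARIANT_PROPERTY('abc',  'BaseType') AS WithoutN,       -- varchar
       SQL_VARIANT_PROPERTY(N'abc', 'BaseType') AS WithN,          -- nvarchar
       DATALENGTH('abc')  AS BytesWithoutN,                        -- 3
       DATALENGTH(N'abc') AS BytesWithN;                           -- 6
```

For a plain ASCII name such as dbo.t both forms identify the same object, which is why the query in the question returns the same result either way.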

text encodings in .net, sql server processing

I have an application that gets terms from a DB to run as a list of string terms. The DB table was set up with nvarchar for that column to include all foreign characters. In some cases, characters like ä come through correctly when I fetch the terms from the DB, and they also display that way in the table.
When importing Japanese or Arabic characters, all I see are ????????.
I have tried converting the text using different methods: first converting it to UTF-8 encoding and back, and secondly using HttpUtility.HtmlEncode, which works perfectly for these characters but also converts quotes and other things that I don't need converted.
Now I accused the DB designer that he needs to do something on his part, but am I wrong in thinking that the DB should display all these characters and make it easy to just query them and add them to my search list? If not, is there a consistent way of getting all international characters to display correctly in SQL and VB.NET?
I know that when I read from text files I just used the Microsoft.VisualBasic.FileIO.TextFieldParser reader with the encoding set to UTF-8, and this was not an issue.
If the database field is nvarchar, then it will store data correctly. As you have seen.
Somewhere before it gets to the database, the data is being lost or changed to varchar: stored procedure, parameters, file encoding, ODBC translation etc.
DECLARE @foo nvarchar(100), @foo2 varchar(100)
--with arabic and japanese and proper N literal
SELECT @foo = N'العربي 日本語', @foo2 = N'العربي 日本語'
SELECT @foo, @foo2 --gives العربي 日本語 for @foo; the varchar @foo2 gives ?????? ???
--now a varchar literal
SELECT @foo = 'العربي 日本語', @foo2 = 'العربي 日本語'
SELECT @foo, @foo2 --gives ?????? ??? for both
--from my Swiss German keyboard. These are part of my code page.
SELECT @foo = 'öéäàüè', @foo2 = 'öéäàüè'
SELECT @foo, @foo2 --gives öéäàüè for both
So, apologise to the nice DB monkey... :-)
Always try to use NVARCHAR or NTEXT to store foreign characters.
Without a UTF-8 collation (SQL Server 2019 and later), you cannot store Unicode in the varchar or text data types.
Also put an N before the string value, like:
UPDATE [USER]
SET Name = N'日本語'
WHERE ID = XXXX;

SQL Server Text Datatype Maxlength = 65,535?

Software I'm working with uses a text field to store XML. From my searches online, the text datatype is supposed to hold 2^31 - 1 characters. Currently SQL Server is truncating the XML at 65,535 characters every time. I know this is caused by SQL Server, because if I add a 65,536th character to the column directly in Management Studio, it states that it will not update because characters will be truncated.
Is the max length really 65,535 or could this be because the database was designed in an earlier version of SQL Server (2000) and it's using the legacy text datatype instead of 2005's?
If this is the case, will altering the datatype to Text in SQL Server 2005 fix this issue?
That is a limitation of SSMS, not of the text field, but you should use varchar(max), since text is deprecated.
Here is also a quick test
create table TestLen (bla text)
insert TestLen values (replicate(convert(varchar(max),'a'), 100000))
select datalength(bla)
from TestLen
Returns 100000 for me
MSSQL 2000 should allow up to 2^31 - 1 characters (non unicode) in a text field, which is over 2 billion. Don't know what's causing this limitation but you might wanna try using varchar(max) or nvarchar(max). These store as many characters but allow also the regular string T-SQL functions (like LEN, SUBSTRING, REPLACE, RTRIM,...).
If you're able to convert the column, you might as well, since the text data type will be removed in a future version of SQL Server. See here.
The recommendation is to use varchar(MAX) or nvarchar(MAX). In your case, you could also use the XML data type, but that may tie you to certain database engines (if that's a consideration).
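A minimal sketch of such a conversion is shown below, using a throwaway temp table with hypothetical names (#TextDemo, Doc). It assumes the column is nullable; adjust NULL or NOT NULL to match the real definition.

```sql
-- Legacy text column holding more than 65,535 characters
CREATE TABLE #TextDemo (Doc TEXT NULL);
INSERT INTO #TextDemo (Doc)
VALUES (REPLICATE(CONVERT(VARCHAR(MAX), 'a'), 100000));
GO
-- Convert the deprecated text type to varchar(max)
ALTER TABLE #TextDemo ALTER COLUMN Doc VARCHAR(MAX) NULL;
GO
-- Regular string functions such as LEN now work on the column
SELECT DATALENGTH(Doc) AS Bytes, LEN(Doc) AS Chars
FROM #TextDemo;   -- both 100000
GO
DROP TABLE #TextDemo;
GO
```

The data survives the type change intact, so the 65,535-character truncation seen in Management Studio is not something the conversion has to work around.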
You should have a look at:
XML Support in Microsoft SQL Server 2005
Beginning SQL Server 2005 XML Programming
So I would rather try to use the data type appropriate for the use. Not make a datatype fit your use from a previous version.
Here's a little script I wrote for getting out all the data in chunks of 65,535 characters:
-- Sample value; replace with the data you want to read out
DECLARE @data NVARCHAR (MAX) = N'huge data';
DECLARE @readSentence NVARCHAR (MAX) = N'';
DECLARE @currIndex INT = 1; -- SUBSTRING positions are 1-based
DECLARE @temp NVARCHAR (MAX);
-- Emit one chunk per iteration until the whole value has been read
WHILE @data <> @readSentence
BEGIN
SET @temp = SUBSTRING (@data, @currIndex, 65535);
SELECT @temp;
SET @readSentence += @temp;
SET @currIndex += 65535;
END;

Resources