t-sql unicode function returns 0


I have a string stored in an nvarchar(100) column named address.
SELECT UNICODE(address) FROM clients
Here is the result set
0
What character is this? How can I find the index of this character at other positions in my string?

As stated in BOL
Returns the integer value, as defined by the Unicode standard, for the
first character of the input expression.
You can also display the first character as binary to examine it:
SELECT UNICODE(address), CAST(SUBSTRING(address, 1, 1) as varbinary(2)) FROM clients
The following also return 0:
SELECT UNICODE(CAST(0x00 AS NVARCHAR))
SELECT UNICODE(NCHAR(0))
So the simplest answer is that it is the NCHAR(0) character.
To find NCHAR(0) in a string, see the following example:
DECLARE @str NVARCHAR(100) = N'John'+NCHAR(0)+N'Smith'
SELECT CHARINDEX(NCHAR(0) COLLATE Latin1_General_BIN, @str)
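The question also asks about other positions in the string. A minimal sketch, assuming the same binary-collation trick as above and a made-up sample value, is to call CHARINDEX repeatedly with its optional start position:

```sql
-- Sketch: list every position of NCHAR(0) in a string.
-- The sample value is hypothetical; in practice @str would come from the address column.
DECLARE @str NVARCHAR(100) = N'John' + NCHAR(0) + N'Smith' + NCHAR(0) + N'Jr';
DECLARE @pos INT = CHARINDEX(NCHAR(0) COLLATE Latin1_General_BIN, @str);

WHILE @pos > 0
BEGIN
    PRINT @pos;  -- for this sample: 5, then 11
    -- Continue searching one character past the previous hit
    SET @pos = CHARINDEX(NCHAR(0) COLLATE Latin1_General_BIN, @str, @pos + 1);
END;
```

The third argument of CHARINDEX is the position at which the search starts, so each iteration resumes just after the previous match.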

Related

REPLACE function does not replace desired character string

I want to replace all occurrences of a particular single-character string (e.g. '^' or ',') when creating a view that is based on a single table. But it does not replace the desired character in all of the data rows, as I can see when I query the newly created view. All fields have the varchar datatype.
This is a specific example where the desired character does not get replaced: MAINTENANCEÃ‚Â¿ENHANCED
I tried the following and none worked.
SELECT REPLACE('MAINTENANCEÃ‚Â¿ENHANCED',',','')
SELECT REPLACE('MAINTENANCEÃ‚Â¿ENHANCED',char(33),'')
SELECT REPLACE(N'MAINTENANCEÃ‚Â¿ENHANCED',',','')
SELECT REPLACE('MAINTENANCEÃ‚Â¿ENHANCED',N',','')
SELECT REPLACE(CAST('MAINTENANCEÃ‚Â¿ENHANCED' as NVARCHAR(50)),N',','')
SELECT REPLACE(CAST('MAINTENANCEÃ‚Â¿ENHANCED' as VARCHAR(50)),N',','')
SELECT REPLACE(TRY_CAST('MAINTENANCEÃ‚Â¿ENHANCED' as VARCHAR(50)),N',','')
SELECT REPLACE(CONVERT(VARCHAR(50),'MAINTENANCEÃ‚Â¿ENHANCED'), N',','')
I also ran a simple test: I copied the comma from the string where it needs to be replaced (see my code below).
if ',' = '‚' print 1 -- Does NOT print 1. The first comma is the one I typed; the second is the one copied from the string above.
if ',' = ',' print 1 -- Prints 1. Both commas were typed by hand.
Apparently the comma in my data source is not treated as equal to a typed comma, even though the queries below show that both are varchar. ( https://blog.sqlauthority.com/2013/12/15/sql-server-how-to-identify-datatypes-and-properties-of-variable )
-- comma from the string
DECLARE @myVar VARCHAR(100)
SET @myVar = '‚'
SELECT SQL_VARIANT_PROPERTY(@myVar,'BaseType') BaseType,
SQL_VARIANT_PROPERTY(@myVar,'Precision') Precisions,
SQL_VARIANT_PROPERTY(@myVar,'Scale') Scale,
SQL_VARIANT_PROPERTY(@myVar,'TotalBytes') TotalBytes,
SQL_VARIANT_PROPERTY(@myVar,'Collation') Collation,
SQL_VARIANT_PROPERTY(@myVar,'MaxLength') MaxLengths
-- regular comma
SET @myVar = ','
SELECT SQL_VARIANT_PROPERTY(@myVar,'BaseType') BaseType,
SQL_VARIANT_PROPERTY(@myVar,'Precision') Precisions,
SQL_VARIANT_PROPERTY(@myVar,'Scale') Scale,
SQL_VARIANT_PROPERTY(@myVar,'TotalBytes') TotalBytes,
SQL_VARIANT_PROPERTY(@myVar,'Collation') Collation,
SQL_VARIANT_PROPERTY(@myVar,'MaxLength') MaxLengths
This can be partially resolved using the code below:
select Stuff('MAINTENANCEÃ‚Â¿ENHANCED', PatIndex('%[^a-z0-9]%', 'MAINTENANCEÃ‚Â¿ENHANCED'), 1, '')
OUTPUT
-- the comma is replaced. That is what is expected.
MAINTENANCEÃÂ¿ENHANCED
But it does not work if I have more than one comma, regardless of whether I copy it from the data source or type it in myself.
('‚MAINTENANCEÃ‚Â¿ENHANCED')
select Stuff('‚MAINTENANCEÃ‚Â¿ENHANCED', PatIndex('%[^a-z0-9]%', '‚MAINTENANCEÃ‚Â¿ENHANCED'), 1, '')
select REPLACE(Stuff('‚MAINTENANCEÃ‚Â¿ENHANCED', PatIndex('%[^a-z0-9]%', '‚MAINTENANCEÃ‚Â¿ENHANCED'), 1, ''),',','')
OUTPUT
-- the comma is back again. This is the issue: only the first comma is replaced.
AINTENANCEÃ‚Â¿ENHANCED
P.S.
Please refer to my answer below, where I resolved all of the issues described above, except that I still need to figure out how to preserve special characters such as question marks, parentheses, etc.

This can be PARTIALLY resolved using the code below, which I got from the question linked after it:
use MyDB;
go
drop function if exists [dbo].[RemoveNonAlphaCharacters]
go
Create Function [dbo].[RemoveNonAlphaCharacters](@Temp VarChar(1000))
Returns VarChar(1000)
AS
Begin
-- Pattern matches any character that is not a space, a letter, or a digit
Declare @KeepValues as varchar(50)
Set @KeepValues = '%[^ a-z0-9]%'
-- Delete one offending character per iteration until none remain
While PatIndex(@KeepValues, @Temp) > 0
Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')
Return @Temp
End
go
SELECT MyDB.dbo.RemoveNonAlphaCharacters(', (/!:\£&^?-:;|\)?%$"éè§°àçò*MAIN,2TENANCEÃ‚Â¿ENHANCED 123 asds %[ ..')
I got this from
How to strip all non-alphabetic characters from string in SQL Server?
OUTPUT
éèàçòMAIN2TENANCEÃÂENHANCED 123 asds
The issue here is that it removes all non-alphanumeric characters, such as &^?-:;|)? ]% ;:_|!"
I could not figure out how to write a pattern that preserves all characters in the printable section of the ASCII table (except for the comma, which needs to be replaced); see the example above and the links below.
https://www.rexegg.com/regex-quickstart.html
http://www.asciitable.com/
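One possible way to keep the whole printable ASCII range is a character range from space to tilde evaluated under a binary collation, so that the range follows code point order rather than dictionary order. The following is a minimal sketch under that assumption; the sample value is made up, NCHAR(8218) stands in for the comma look-alike from the data, and the typed comma is removed in a separate step:

```sql
-- Sketch: remove everything outside printable ASCII (space through tilde),
-- then remove ordinary commas. The sample string is hypothetical.
DECLARE @s NVARCHAR(1000) = N'MAIN,2TENANCE' + NCHAR(8218) + N'ENHANCED (ok?) 50%';
DECLARE @pattern NVARCHAR(20) = N'%[^ -~]%';

-- The binary collation makes the range [ -~] mean code points 32 to 126
WHILE PATINDEX(@pattern COLLATE Latin1_General_BIN2, @s) > 0
    SET @s = STUFF(@s, PATINDEX(@pattern COLLATE Latin1_General_BIN2, @s), 1, N'');

SET @s = REPLACE(@s, N',', N'');

SELECT @s AS cleaned;  -- MAIN2TENANCEENHANCED (ok?) 50%
```

Parentheses, question marks, and the percent sign survive because they fall inside the space-to-tilde range.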

Function to remove all Non-alpha-numeric characters, superscripts, and subscripts, except a dash '-'

I need to create a T-SQL function that only keeps a hyphen (dash '-') and removes all non-alphanumeric characters (plus all spaces, superscripts and subscripts) from a given string.
You can test Superscript/Subscripts in SSMS:
select 'Hello® World™ '
Example:
input string
output string:
HelloWorld-ThisIsATest123
Any solutions or thoughts will be appreciated.

Check this link. It removes all non-alphabetic characters. You can also add '-' to the list of characters to keep.
How to strip all non-alphabetic characters from string in SQL Server?
In the answer from @George Mastros, use '%[^a-zA-Z0-9-]%' as the pattern instead of '%[^a-z]%'
Here is the reformatted function, which also keeps '-' and numeric characters:
-- Reformatted function
Create Function [dbo].[RemoveNonAlphaCharacters](@Temp VarChar(1000))
Returns VarChar(1000)
AS
Begin
Declare @KeepValues as varchar(50)
-- The hyphen goes last in the set so it is literal; a backslash is not an escape in T-SQL patterns
Set @KeepValues = '%[^a-zA-Z0-9-]%'
-- Delete one offending character per iteration until none remain
While PatIndex(@KeepValues, @Temp) > 0
Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')
Return @Temp
End
--Call function
Select dbo.RemoveNonAlphaCharacters('Hello® World™ -123 !@#$%^')
OUTPUT: HelloWorld-123

I identified the issue in my code. I previously had exactly the same function, which was NOT removing superscripts/subscripts, and I was wondering why. The issue was that the input/output datatype should NOT be NVARCHAR but plain VARCHAR; otherwise the returned string still contains the superscripts:
Problem code:
Create Function [dbo].[RemoveNonAlphaCharacters](@Temp NVarChar(1000))
Returns NVarChar(1000)
AS
...

"Create sql function , select english characters?"

I am looking for a function that selects English numbers and letters only:
Example:
TEKA תנור ביל דין in HLB-840 P-WH לבן
I want to run a function and get the following result:
TEKA HLB-840 P-WH
I'm using MS SQL Server 2012

What you really need here is regex replacement, which SQL Server does not support. Broadly speaking, you would want to find [^A-Za-z0-9 -]+\s* and then replace with empty string. Here is a demo showing that this works as expected:
Demo
This would output TEKA in HLB-840 P-WH for the input you provided. You might be able to do this in SQL Server using a regex package or UDF. Or, you could do this replacement outside of SQL using any number of tools which support regex (e.g. C#).

SQL-Server is not the right tool for this.
The following might work for you, but there is no guarantee:
declare @yourString NVARCHAR(MAX)=N'TEKA תנור ביל דין in HLB-840 P-WH לבן';
SELECT REPLACE(REPLACE(REPLACE(REPLACE(CAST(@yourString AS VARCHAR(MAX)),'?',''),' ','|~'),'~|',''),'|~',' ');
The idea in short:
A cast from NVARCHAR to VARCHAR returns every character that is not known in the given collation as a question mark. The rest is replacement of question marks and of runs of multiple blanks.
If your string can include a question mark, you can first replace it with an unused character, which you replace back at the end.
If your string might include either | or ~, you should use other characters for the multi-blank replacement.
You can influence this approach by specifying a specific collation, if some characters pass by...
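The question-mark caveat and the collation hint can be sketched together as follows. This is only an illustration: the token '<Q>' is an arbitrary placeholder assumed not to occur in the data, the sample string is made up, and Latin1_General_100_CI_AS is chosen here because its code page (1252) contains no Hebrew letters.

```sql
-- Sketch: protect genuine question marks before the lossy cast, then restore them.
-- '<Q>' is a hypothetical placeholder; pick any token that cannot occur in your data.
DECLARE @yourString NVARCHAR(MAX) = N'TEKA תנור HLB-840 ok?';
DECLARE @token VARCHAR(10) = '<Q>';

SELECT REPLACE(
         REPLACE(
           -- Explicit collation fixes the target code page regardless of the database default
           CAST(REPLACE(@yourString, N'?', @token) COLLATE Latin1_General_100_CI_AS AS VARCHAR(MAX)),
           '?', ''),        -- drop the question marks produced by the cast
         @token, '?');      -- put the original question marks back
```

Runs of blanks left behind by the removed words can then be collapsed with the |~ replacement chain shown above.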

There is no built-in function for this purpose, but you can create your own. It could look something like this:
-- Create the function (split the string into characters, then concatenate the ones to keep)
CREATE FUNCTION dbo.CleanStringZZZ ( @string VARCHAR(100))
RETURNS VARCHAR(100)
BEGIN
DECLARE @B VARCHAR(100) = '';
WITH t -- recursive part generates the sequence 1,2,3...; an existing indexed numbers table would be better
AS
(
SELECT n = 1
UNION ALL
SELECT n = n+1
FROM t
WHERE n < LEN(@string) -- stop at the last character position
)
SELECT @B = @B+SUBSTRING(@string, t.n, 1)
FROM t
WHERE SUBSTRING(@string, t.n, 1) != '?' --this is just an example...
--WHERE ASCII(SUBSTRING(@string, t.n, 1)) BETWEEN 32 AND 127 --you can use something like this
ORDER BY t.n;
RETURN @B;
END;
You can then use this function in your select statement:
SELECT dbo.CleanStringZZZ('TEKA תנור ביל דין in HLB-840 P-WH לבן');

create function dbo.AlphaNumericOnly(@string varchar(max))
returns varchar(max)
begin
-- Delete one non-alphanumeric character per iteration until none remain
While PatIndex('%[^a-z0-9]%', @string) > 0
Set @string = Stuff(@string, PatIndex('%[^a-z0-9]%', @string), 1, '')
return @string
end

Detect UNICODE characters that are not ASCII in table

I have the following table:
Select
name,
address,
description
from dbo.users
I would like to search all this table for any characters that are UNICODE but not ASCII. Is this possible?

You can find non-ASCII characters quite simply:
SELECT NAME, ADDRESS, DESCRIPTION
FROM DBO.USERS
WHERE NAME != CAST(NAME AS VARCHAR(4000))
OR ADDRESS != CAST(ADDRESS AS VARCHAR(4000))
OR DESCRIPTION != CAST(DESCRIPTION AS VARCHAR(4000))

If you want to determine if there are any characters in an NVARCHAR / NCHAR / NTEXT column that cannot be converted to VARCHAR, you need to convert to VARCHAR using the _BIN2 variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS, then you would specify Albanian_100_BIN2 for the test.
The reason for using a _BIN2 collation is that non-binary collations will only find instances where there is at least one character that does not have any mapping at all in the code page and is thus converted into ?. But, non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping.
For example, the superscript 2 character, ², has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular 2. The problem with the non-binary collation is that 2 will equate to ² and so it won't register as a row that can't convert to VARCHAR. For example:
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2
Ideally you would convert back to NVARCHAR explicitly for the code to be clear on what it's doing, though not doing this will still implicitly convert back to NVARCHAR, so the behavior is the same either way.
Please note that only MAX types are used. Do not use NVARCHAR(4000) or VARCHAR(4000) else you might get false positives due to truncation of data in NVARCHAR(MAX) columns.
So, in terms of the example code in the question, the query would be (assuming that a Latin1_General collation is being used):
SELECT usr.*
FROM dbo.[users] usr
WHERE usr.[name] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[name] COLLATE Latin1_General_100_BIN2))
OR usr.[address] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[address] COLLATE Latin1_General_100_BIN2))
OR usr.[description] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[description] COLLATE Latin1_General_100_BIN2));
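Note that the round-trip test finds characters outside the collation's code page, which is broader than ASCII. A small self-contained sketch of the difference, using a table variable with made-up rows and assuming code page 1252 via Latin1_General_100_BIN2; the second query uses a space-to-tilde range under a binary collation as one possible strict ASCII check:

```sql
-- Sketch with hypothetical sample data
DECLARE @users TABLE (name NVARCHAR(MAX));
INSERT INTO @users (name) VALUES (N'Alice'), (N'Zoë'), (N'x²'), (N'Łukasz');

-- Rows that do not survive a round trip through code page 1252.
-- Of the sample rows, only the one containing Ł should be returned,
-- because ë and ² both exist in code page 1252.
SELECT name
FROM @users
WHERE name <> CONVERT(NVARCHAR(MAX),
              CONVERT(VARCHAR(MAX), name COLLATE Latin1_General_100_BIN2));

-- Rows containing any character outside printable ASCII (code points 32 to 126).
-- This flags the three rows other than Alice, and would also flag tabs and line breaks.
SELECT name
FROM @users
WHERE name LIKE N'%[^ -~]%' COLLATE Latin1_General_BIN2;
```

Which of the two checks is appropriate depends on whether "not ASCII" is meant literally or as "cannot be stored in the VARCHAR column".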

There doesn't seem to be an inbuilt function for this as far as I can tell. A brute force approach is to pass each character to ascii and then pass the result to char and check if it returns '?', which would mean the character is out of range. You can write a UDF with the below code as reference, but I should think that it is a very inefficient solution:
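declare @i int = 1
declare @x nvarchar(10) = N'vsdǣf'
declare @result nvarchar(100) = N''
-- use <= so that the last character is checked as well
while (@i <= len(@x))
begin
-- a character that does not survive ascii()/char() comes back as '?'
if char(ascii(substring(@x,@i,1))) = '?'
begin
set @result = @result + substring(@x,@i,1)
end
set @i = @i+1
end
select @result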

In SQL Server, can we change the Unicode replacement character from "?" to something else?

I understand the vagaries of Unicode in SQL Server - varchar vs nvarchar, etc. I don't have a problem storing and retrieving Unicode data. However, there are some fields we have chosen to keep as varchar, since a non-ASCII character in those fields is considered anomalous.
When a Unicode character makes it into one of those non-Unicode fields, SQL Server converts it to a question mark: "?". BUT, sometimes it's hard to tell when a substitution has occurred because a question mark is a valid character in those fields.
My question: Can I get SQL Server to use a different substitution character, rather than a question mark? For instance, an underscore or even an empty string ('')?

The straight answer to your question is that you cannot set that character. As others have suggested, and as you probably already know, you need to validate the data going into your 'special' varchar fields.
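Since the replacement character itself cannot be configured, one workaround is to substitute the offending characters yourself before the value reaches the varchar field. A minimal sketch, assuming that "anomalous" means anything outside printable ASCII and that an underscore is the desired stand-in (both are illustrative choices, as is the sample value):

```sql
-- Sketch: replace every character outside printable ASCII with '_' before storing as varchar.
DECLARE @input NVARCHAR(100) = N'H' + NCHAR(630) + N'ppy?';  -- NCHAR(630) is the letter U+0276

-- The binary collation makes the range [ -~] mean code points 32 to 126
WHILE PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN2, @input) > 0
    SET @input = STUFF(@input, PATINDEX(N'%[^ -~]%' COLLATE Latin1_General_BIN2, @input), 1, N'_');

DECLARE @stored VARCHAR(100) = @input;  -- the conversion is now lossless
SELECT @stored AS stored_value;         -- H_ppy?
```

The genuine question mark is kept, so a substitution can be told apart from real data. A supplementary character stored as a surrogate pair would yield two underscores with this approach.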

Because I was bored. I'm nearly positive this won't be useful in an application, but it does do what you asked for. You could create a function with this if you really wanted to...
Declare @Nvarchar Nvarchar(25) = N'Hɶppy',
@NVbinary Varchar(128),
@parse Int = 3,
@NVunit Varchar(4),
@result Varchar(64) = '0x',
@SQL Nvarchar(Max);
Select @NVbinary = master.sys.fn_varbintohexstr(Convert(Varbinary(128),@Nvarchar))
While (@parse < Len(@NVbinary))
Begin
Select @NVunit = Substring(@NVbinary,@parse,4),
@parse = @parse + 4
If Substring(@NVunit,3,2) = '00'
Begin
Set @result = @result + Substring(@NVunit,1,2)
End
Else
Begin
Set @result = @result + '22' -- 22 is the hex value for quotation mark; Use Select Convert(Varbinary(8),'"') to get the value for whatever non-unicode character you want.
End
End
Set @SQL = 'Select Convert(Varchar(128),' + @result + '), ''' + @result + ''''
Select @Nvarchar, @NVbinary
Exec sp_executeSQL @SQL

You are right: any Unicode character that has no equivalent in the varchar code page is lost when you put it into a varchar, and a question mark is left behind:
select ascii(cast(nchar(1000) as varchar));
I agree with R. Martinho Fernandes, you need to solve this at the application layer. You could have the app replace any 2-byte unicode pair that has value over 255 with whatever you want. Maybe you can change your application-layer encoding to accept ASCII and Extended ASCII data only. But trying to fault the data layer in this case is like saying, "Our data field only accepts 'M' or 'F.' Why is the database complaining when the user sends us a 'Z'?"