Defeat these dashed dashes in SQL server - sql-server

I have a table that contains the names of various recording artists. One of them has a dash in their name. If I run the following:
Select artist
, substring(artist,8,1) as substring_artist
, ascii(substring(artist,8,1)) as ascii_table
, ascii('-') as ascii_dash_key /*The dash key next to zero */
, len(artist) as len_artist
From [dbo].[mytable] where artist like 'Sleater%'
Then the following is returned. This seems to indicate that a dash (ascii 45) is being stored in the artist column
However, if I change the where clause to:
From [dbo].[mytable] where artist like 'Sleater' + char(45) + '%'
I get no results returned. If I copy and paste the output from the artist column into a hex editor, I can see that the dash is actually stored as E2 80 90, the Unicode byte sequence for the multi-byte hyphen character.
So, I'd like to find and replace such occurrences with a standard ascii hyphen, but I'm am at a loss as to what criteria to use to find these E2 80 90 hyphens?

Your char is the hyphen, information on it here :
https://www.charbase.com/2010-unicode-hyphen
You can see that the UTF16 code is 2010 so in T-SQL you can build it with
SELECT NCHAR(2010)
From there you can use any SQL command with that car, for example in a select like :
Select artist
From [dbo].[mytable] where artist like N'Sleater' + NCHAR(2010) + '%'
or as you want in a
REPLACE( artist, NCHAR(2010), '-' )
with a "real" dash
EDIT:
If the collation of your DB give you some trouble with the NCHAR(2010) you can also try to use the car N'‐' that you'll copy/paste from the charbase link I gave you so :
REPLACE( artist , N'‐' , '-' )
that you can even take from the string here (made with the special car) so all made for you :
update mytable set artist=REPLACE( artist, N'‐' , '-' )

I don't know your table definition and COLLATION but I'm almost sure that you are mixing NCHAR and CHAR types and convert unicode, multibyte characters to sinle byte representations. Take a look at this demo:
WITH Demo AS
(
SELECT N'ABC'+NCHAR(0x2010)+N'DEF' T
)
SELECT
T,
CASE WHEN T LIKE 'ABC'+CHAR(45)+'%' THEN 1 ELSE 0 END [Char],
CASE WHEN T LIKE 'ABC-%' THEN 1 ELSE 0 END [Hyphen],
CASE WHEN T LIKE N'ABC‐%' THEN 1 ELSE 0 END [Unicode-Hyphen],--unicode hyphen us used here
CASE WHEN T LIKE N'ABC'+NCHAR(45)+N'%' THEN 1 ELSE 0 END [NChar],
CASE WHEN CAST(T AS varchar(MAX)) LIKE 'ABC-%' THEN 1 ELSE 0 END [ConvertedToAscii],
ASCII(NCHAR(0x2010)) ConvertedToAscii,
CAST(SUBSTRING(T, 4, 1) AS varbinary) VarbinaryRepresentation
FROM Demo
My results:
T Char Hyphen Unicode-Hyphen NChar ConvertedToAscii ConvertedToAscii VarbinaryRepresentation
------- ----------- ----------- -------------- ----------- ---------------- ---------------- --------------------------------------------------------------
ABC‐DEF 0 0 1 0 1 45 0x1020
UTF-8 (3 bytes) representation is the same as 2010 in unicode.

Related

SQL Server - How to get last numeric value in the given string

I am trying to get last numeric part in the given string.
For Example, below are the given strings and the result should be last numeric part only
SB124197 --> 124197
287276ACBX92 --> 92
R009321743-16 --> 16
How to achieve this functionality. Please help.
Try this:
select right(#str, patindex('%[^0-9]%',reverse(#str)) - 1)
Explanation:
Using PATINDEX with '%[^0-9]%' as a search pattern you get the starting position of the first occurrence of a character that is not a number.
Using REVERSE you get the position of the first non numeric character starting from the back of the string.
Edit:
To handle the case of strings not containing non numeric characters you can use:
select case
when patindex(#str, '%[^0-9]%') = 0 then #str
else right(#str, patindex('%[^0-9]%',reverse(#str)) - 1)
end
If your data always contains at least one non-numeric character then you can use the first query, otherwise use the second one.
Actual query:
So, if your table is something like this:
mycol
--------------
SB124197
287276ACBX92
R009321743-16
123456
then you can use the following query (works in SQL Server 2012+):
select iif(x.i = 0, mycol, right(mycol, x.i - 1))
from mytable
cross apply (select patindex('%[^0-9]%', reverse(mycol) )) as x(i)
Output:
mynum
------
124197
92
16
123456
Demo here
Here is one way using Patindex
SELECT RIGHT(strg, COALESCE(NULLIF(Patindex('%[^0-9]%', Reverse(strg)), 0) - 1, Len(strg)))
FROM (VALUES ('SB124197'),
('287276ACBX92'),
('R009321743-16')) tc (strg)
After reversing the string, we are finding the position of first non numeric character and extracting the data from that position till the end..
Result :
-----
124197
92
16

Changing character in a string of characters

I was wondering regarding how to edit the following column that exists in oracle DB
PPPPFPPPPPPPPPPPPPPPPPPPPPPPPFPPPPPPPP
I want to only set the 5th F with P without affecting other structure.
I've around 700 records and I want to change that position (5th) on all users to P
I was thinking of PLSQL instead of a query, so could you please advice.
Thanks
Use REGEXP_REPLACE:
> SELECT REGEXP_REPLACE('PPPPFPPPPPPPPPPPPPPPPPPPPPPPPFPPPPPPPP', '^(\w{4}).(.*)', '\1P\2') AS COL_REGX FROM dual
COL_REGX
--------------------------------------
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPFPPPPPPPP
Klashxx answer is a good one - REGEXP_REPLACE is the way to go. Old fashioned way built up bit by bit so you can see what's going on :
WITH
test_data (text)
AS (SELECT '1234F1234F1234F1234F1234F1234F1234' FROM DUAL
)
SELECT
text
,INSTR(text,'F',1,5) --fifth occurence
,SUBSTR(text,1,INSTR(text,'F',1,5)-1) --substr up to that point
,SUBSTR(text,1,INSTR(text,'F',1,5)-1)||'P' --add P
,SUBSTR(text,1,INSTR(text,'F',1,5)-1)||'P'||SUBSTR(text,INSTR(text,'F',1,5)+1) --add remainder of string
FROM
test_data
;
So what you're trying to do would be something like
UPDATE <your table>
SET <your column> = SUBSTR(<your column>,1,INSTR(<your column>,'F',1,5)-1)||'P'||SUBSTR(<your column>,INSTR(<your column>,'F',1,5)+1)
..assuming you want to update all rows
The solution below looks for the first five characters at the beginning of the input string. If found, it keeps the first four unchanged and it replaces the fifth with the letter P. Note that if the input string is four characters or less, it is left unchanged. (This includes NULL as the input string, shown in the WITH clause which creates sample strings and also in the output - note that the output has FIVE rows, even though there is nothing visible in the last one.)
with
test_data ( str ) as (
select 'ABCDEFGH' from dual union all
select 'PPPPF' from dual union all
select 'PPPPP' from dual union all
select '1234' from dual union all
select null from dual
)
select str, regexp_replace(str, '^(.{4}).', '\1P') as replaced
from test_data
;
STR REPLACED
-------- --------
ABCDEFGH ABCDPFGH
PPPPF PPPPP
PPPPP PPPPP
1234 1234
5 rows selected.
Flip the 5th 'bit' to a 'P' where it's currently an 'F'.
update table
set column = regexp_replace(column , '^(.{4}).', '\1P')
where regexp_like(column , '^.{4}F');

Dutch zipcodes regex in SQL Server 2014 always returns 0

I might be missing something. I have to check Dutch zipcodes, but I got some user entered data in my database. I want to check if the zipcode can be an actual zipcode. Format for Dutch zipcodes: 1000-9999AA-ZZ
So any integer between 1000 and 9999 in combination with 2 lettres can be a valid zipcode (there are some additional parameters, but I am not worrying about them for now).
I didn't get my regex to work with this code:
iif(ZipCode like '^[1-9][0-9]{3}\s[a-zA-Z]{2}$',1,0) as MatchIndicator
Yet it always returns zero.
I even tried it with a simpler regex
iif(ZipCode like '^[1-9]',1,0) as MatchIndicator
Returns 0 everytime as well.
I found myself an alternative, but I think the regex code is better to use in the long run for more complicated text.
Alternative
case when LEFT(ZipCode,1) between '1' and '9'
and substring(ZipCode,2,1) between '0' and '9'
and substring(ZipCode,3,1) between '0'and '9'
and substring(ZipCode,4,1) between '0' and '9'
and substring(ZipCode,5,1) between 'A' and 'Z'
and substring(ZipCode,6,1) between 'A' and 'Z' then 1 else 0 end as MatchIndicator
And
patindex('[1-9][0-9][0-9][0-9][a-zA-z][a-zA-z]',ZipCode)
Any thoughts?
SQL Server doesn't support 'proper' regex. So how about:
CREATE TABLE #Test (Postcode VARCHAR(6))
INSERT INTO #Test
VALUES
('1234AZ'),
('9876ZQ'),
('1900Sz'),
('ABCDe1'),
('XwYx1A'),
('5000A1')
SELECT
PostCode,
CASE WHEN
TRY_CAST(SUBSTRING(PostCode, 1, 4) AS INT) BETWEEN 1000 AND 9999
AND
PATINDEX('%[A-Z][A-Z]%' COLLATE Latin1_General_Bin, SUBSTRING(PostCode, 5, 2)) > 0 THEN 1 ELSE 0
END IsValid
FROM #Test
PostCode IsValid
-------- -----------
1234AZ 1
9876ZQ 1
1900Sz 0
ABCDe1 0
XwYx1A 0
5000A1 0

Sql Server's regex LIKE - behaviour clarification?

Someone asked here how to get only values which are a number :
So , if the table is :
DECLARE #Table TABLE(
Col nVARCHAR(50)
)
INSERT INTO #Table SELECT 'ABC'
INSERT INTO #Table SELECT '234.62'
INSERT INTO #Table SELECT '10:10:10:10'
INSERT INTO #Table SELECT 'France'
INSERT INTO #Table SELECT '2'
then - the desired results are :
234.62
2
But when I tested this query :
SELECT * FROM #Table WHERE Col LIKE '%[0-9.]%' --expected to see only 234.62
it showed :
234.62
10:10:10:10
2
Question #1
How come 10:10:10:10 , 2 satisfies the condition ?
Question #2
I saw this answer here which does work
SELECT * FROM #Table WHERE Col NOT LIKE '%[^0-9.]%'
But I don't understand why this works. AFAIU - it selects all values which are not like (not(has number) and not( has dot)) which is ===>(de morgan)===> not like ( has number or has dot)
Can someone please shed light ?
nb I already know that isnumeric can be used also , but it's unsafe (+). also valid wildcards are %,_,[],[^]
Any particular use of [set] within a LIKE expression is a check against one character in the target string.
So, LIKE '%[0-9.]%' says - % - match 0-to-many arbitrary characters, then [0-9.] match one character in the set 0-9., and then % match 0-to-many arbitrary characters. Paraphrased, it says "match any string that contains at least one character in the set 0-9.". So, 10:10:10:10 can be matched as 0 arbitrary characters, then 1 matches [0-9.], and then 0:10:10:10 matches the final %.
LIKE '%[^0-9.]%' says - % - match 0-to-many arbitrary characters, then [^0-9.] match one character not in the set 0-9., and then % match 0-to-many arbitrary characters. Paraphrased, it says "match any string that contains at least one character outside of the set 0-9.. So when we apply the NOT to the front of that, we are saying "match any string that doesn't contain at least one character outside of the set 0-9." or "match strings that only contain characters in the set 0-9..
Essentially, the double-negative is a way to make an assertion about all characters in the string.

T-SQL field parsing

There is a field in a 3rd party database that I need to group on for a report I'm writing. The field can contain a few different types of data. First it could contain a 3 digit number. I need to break these out into groups such as 101 to 200 and 201 to 300. In addition to this the field could also be prefaced with a particular letter such a M or K then a few numbers. It is defined as VARCHAR(8) and any help in how I could handle both cases where it may start with a particular letter or fall within a numeric range would be appreciated. If I could write it as a case statement and return a department based either on the numeric value or the first letter that would be the best so I can group in my report.
Thanks,
Steven
If I could write it as a case statement and return a department based either on the numeric value or the first letter that would be the best so I can group in my report.
case when substring( field, 1, 1 ) = 'M' then ...
when substring( field, 1, 1 ) = 'K" then ...
else floor( (cast( field as int) - 1 ) / 100) end
select ....
group by
case when substring( field, 1, 1 ) = 'M' then ...
when substring( field, 1, 1 ) = 'K" then ...
else floor( (cast( field as int) - 1 ) / 100) end
Matt Hamilton asks,
Any reason why you've opted to use substring(field, 1, 1) rather than simply left(field, 1)? I notice that #jms did it too, in the other answer.
I know substring is specified in ANSI-92; I don't know that left is. And anyway, left isn't a primitive, as it can be written in terms of substring, so using substring seems a little cleaner.
select
CASE (CASE WHEN substring(field,1,1) between 0 and 9 then 'N' Else 'C' END)
WHEN 'N' THEN
CASE field
WHEN ... THEN ...
WHEN ... THEN ...
END
WHEN 'C' THEN
CASE field
WHEN ... THEN ...
WHEN ... THEN ...
END
END

Resources