Change wrong french characters on table - sql-server

For some reason we found on the database characters like this: Ã
I can assume this character represent the character: é
Now I need to revise the whole table but checking all other characters to make sure there are no others.
Where I can find the relation of characters for example between this à and é? or probably find an SQL function that is already done to make those replacement.
I'm using SQL Server 2014

As mentioned by Daniel E, your dirty data might have been caused by the use of incorrect code pages (UTF-8 that was interpreted as ISO 8859-1).
One way to find entries with dirty data, is to use a "not exists" ("^") like expression with the list of valid characters in that expression. See example below.
declare #t table (name varchar(20))
insert into #t values ('touché')
insert into #t values ('encore touché')
insert into #t values ('reçu')
insert into #t values ('hello world')
select * from #t where name like '%[^a-zA-Z., -]%'
select * from #t where name like '%[^a-zA-Z.,èêé -]%' COLLATE Latin1_General_BIN

Related

Why two string do not match although they are exactly the same?

I have a lookup table with a list of values. Lookup Table I need to filter on a value from the LUT table in a simple where condition. It works with all table values except one and I don't know why. I have tried using trim function and lower function to change the string but nothing helped. Does anyone have the same experience? Why it does work for all table values except one? My code:
SELECT * FROM "PossibleNewGMCIssues" WHERE "gmcIssue" = 'Suspended account for policy violation'
Thank you in advance.
It's too hard for anybody to say without seeing the strings themselves. They probably look similar but have different unicode values. You can convert to the hex values to see where they are different though by using hex_encode:
Below I create a table that uses two strings that look the same but aren't. One contains an m-dash and the other an en-dash.
-- Create a table with two columns with strings in them that look the same but arent
create or replace transient table test_table as (
select 'a-string'::string col1, 'a—string'::string col2
);
-- This returns 0 results
select * from test_table where col1=col2;
-- You can tell that the strings are different by checking the hex representation of them
select hex_encode(col1), hex_encode(col2)
from test_table
;
-- The above returns:
-- +----------------+--------------------+
-- |HEX_ENCODE(COL1)|HEX_ENCODE(COL2) |
-- +----------------+--------------------+
-- |612D737472696E67|61E28094737472696E67|
-- +----------------+--------------------+

not able to identify difference between same value

I have data inside a table's column. I SELECT DISTINCT of that column, i also put LTRIM(RTRIM(col_name)) as well while writing SELECT. But still I am getting duplicate column record.
How can we identify why it is happening and how we can avoid it?
I tried RTRIM, LTRIM, UPPER function. Still no help.
Query:
select distinct LTRIM(RTRIM(serverstatus))
from SQLInventory
Output:
Development
Staging
Test
Pre-Production
UNKNOWN
NULL
Need to be decommissioned
Production
Pre-Produc​tion
Decommissioned
Non-Production
Unsupported Edition
Looks like there's a unicode character in there, somewhere. I copied and pasted the values out initially as a varchar, and did the following:
SELECT DISTINCT serverstatus
FROM (VALUES('Development'),
('Staging'),
('Test'),
('Pre-Production'),
('UNKNOWN'),
('NULL'),
('Need to be decommissioned'),
('Production'),
(''),
('Pre-Produc​tion'),
('Decommissioned'),
('Non-Production'),
('Unsupported Edition'))V(serverstatus);
This, interestingly, returned the values below:
Development
Staging
Test
Pre-Production
UNKNOWN
NULL
Need to be decommissioned
Production
Pre-Produc?tion
Decommissioned
Non-Production
Unsupported Edition
Note that one of the values is Pre-Produc?tion, meaning that there is a unicode character between the c and t.
So, let's find out what it is:
SELECT 'Pre-Produc​tion', N'Pre-Produc​tion',
UNICODE(SUBSTRING(N'Pre-Produc​tion',11,1));
The UNICODE function returns back 8203, which is a zero-width space. I assume you want to remove these, so you can update your data by doing:
UPDATE SQLInventory
SET serverstatus = REPLACE(serverstatus, NCHAR(8203), N'');
Now your first query should work as you expect.
(I also suggest you might therefore want a lookup table for your status' with a foreign key, so that this can't happen again).
DB<>fiddle
I deal with this type of thing all the time. For stuff like this NGrams8K and PatReplace8k and PATINDEX are your best friends.
Putting what you posted in a table variable we can analyze the problem:
DECLARE #table TABLE (txtID INT IDENTITY, txt NVARCHAR(100));
INSERT #table (txt)
VALUES ('Development'),('Staging'),('Test'),('Pre-Production'),('UNKNOWN'),(NULL),
('Need to be decommissioned'),('Production'),(''),('Pre-Produc​tion'),('Decommissioned'),
('Non-Production'),('Unsupported Edition');
This query will identify items with characters other than A-Z, spaces and hyphens:
SELECT t.txtID, t.txt
FROM #table AS t
WHERE PATINDEX('%[^a-zA-Z -]%',t.txt) > 0;
This returns:
txtID txt
----------- -------------------------------------------
10 Pre-Produc​tion
To identify the bad character we can use NGrams8k like this:
SELECT t.txtID, t.txt, ng.position, ng.token -- ,UNICODE(ng.token)
FROM #table AS t
CROSS APPLY dbo.NGrams8K(t.txt,1) AS ng
WHERE PATINDEX('%[^a-zA-Z -]%',ng.token)>0;
Which returns:
txtID txt position token
------ ----------------- -------------------- ---------
10 Pre-Produc​tion 11 ?
PatReplace8K makes cleaning up stuff like this quite easily and quickly. First note this query:
SELECT OldString = t.txt, p.NewString
FROM #table AS t
CROSS APPLY dbo.patReplace8K(t.txt,'%[^a-zA-Z -]%','') AS p
WHERE PATINDEX('%[^a-zA-Z -]%',t.txt) > 0;
Which returns this on my system:
OldString NewString
------------------ ----------------
Pre-Produc?tion Pre-Production
To fix the problem you can use patreplace8K like this:
UPDATE t
SET txt = p.newString
FROM #table AS t
CROSS APPLY dbo.patReplace8K(t.txt,'%[^a-zA-Z -]%','') AS p
WHERE PATINDEX('%[^a-zA-Z -]%',t.txt) > 0;

Separating Numeric Values from string

I'm having an issue, whereby I need to separate the following BPO007 to show BPO and 007. In some cases another example would be GFE0035, whereby I still need to split the numeric value from the characters.
When I do the following select isnumeric('BPO007') the result is 0, which is correct, however, I'm not sure how split these from each other.
I've had a look at this link, but it does not really answer my question.
I need the above split for a separate validation purpose in my trigger.
How would I develop something like this?
Thank you in advance.
As told in a comment before:
About your question How would I develop something like this?:
Do not store two pieces of information within the same column (read about 1.NF). You should keep separate columns in your database for BPO and for 007 (or rather an integer 7).
Then use some string methods to compute the BPO007 when you need it in your output.
Just not to let you alone in the rain.
DECLARE #tbl TABLE(YourColumn VARCHAR(100));
INSERT INTO #tbl VALUES('BPO007'),('GFE0035');
SELECT YourColumn,pos
,LEFT(YourColumn,pos) AS CharPart
,CAST(SUBSTRING(YourColumn,pos+1,1000) AS INT) AS NumPart
FROM #tbl
CROSS APPLY(SELECT PATINDEX('%[0-9]%',YourColumn)-1) AS A(pos);
Will return
YourColumn pos CharPart NumPart
BPO007 3 BPO 7
GFE0035 3 GFE 35
Hint: I use a CROSS APPLY here to compute the position of the first numeric character and then use pos in the actual query like you'd use a variable. Otherwise the PATINDEX would have to be repeated...
Since the number and text varies, you can use the following codes
DECLARE #NUMERIC TABLE (Col VARCHAR(50))
INSERT INTO #NUMERIC VALUES('BPO007'),('GFE0035'),('GFEGVT003509'),('GFEMTS10035')
SELECT
Col,
LEFT(Col,LEN(Col)-LEN(SUBSTRING(Col,PATINDEX('%[0-9]%',Col),DATALENGTH(Col)))) AS TEXTs,
RIGHT(Col,LEN(Col)-LEN(LEFT(Col,LEN(Col)-LEN(SUBSTRING(Col,PATINDEX('%[0-9]%',Col),DATALENGTH(Col)))))) AS NUMERICs
FROM #NUMERIC

Ms Sql Convert and insert between 2 databases

I have two databases.I insert data from DATABASE_2 TO DATABASE_1 however i need to convert some column.
I must convert Customer_Telephone_Number from varchar to bigint after that insert it.
So,
My Question is in below.
SET IDENTITY_INSERT DATABASE_1.dbo.CUSTOMER_TABLE ON
INSERT INTO DATABASE_1.dbo.CUSTOMER_TABLE
(
Customer_Id,
Customer_Telephone_Number
)
Select
Customer_Id,
Customer_Telephone_Number // This is varchar so i need to convert it to Big int.
from
DATABASE_2.DBO.CUSTOMER_TABLE
Any help will be appreciated.
Thanks.
If the data stored is without spaces or other non numeric symbols:
Select
Customer_Id,
CONVERT(BIGINT,Customer_Telephone_Number)
from
DATABASE_2.DBO.CUSTOMER_TABLE
For instance, if there is value with (222)-3333-333 it would fail. If the value was 2223333333 it would succeed

How to convert varchar to ASCII7

I want to replace any latin / accented characters with their basic alphabet letters and strip out everything that cant be converted
examples:
'ë' to be replaced with 'e'
'ß' to be replaced with 's' , 'ss' if possible, if neither then strip it
i am able to do this in c# code but im just not well experienced in MSSQL to solve this without taking many days
UPDATE: the data in the varchar column is populated from a trigger on another table which should have normal UNICODE text. i want to convert the text to ascii7 in a function to use for further processing.
UPDATE: i prefer a solution where this can be done in SQL only and avoiding custom character mapping. can this be done, or is it currently just not possible?
As Aaron said, I don't think you can dispose of mapping tables entirely in SQL, but mapping characters to ASCII-7 should involve some fairly simple tables, used in conjunction with AI collations. Here there are two tables, one to map characters in the column, and one for the letter of the alphabet (which could be expanded if necessary).
By using the AI collations, I get around a lot of explicit mapping definitions.
-----------------------------------------------
-- One time mapping table setup
CREATE TABLE t4000(i INT PRIMARY KEY);
GO
INSERT INTO t4000 --Just a simple list of integers from 1 to 4000
SELECT ROW_NUMBER()OVER(ORDER BY a.x)
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) a(x)
CROSS APPLY (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) b(x)
CROSS APPLY (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) c(x)
CROSS APPLY (VALUES(1),(2),(3),(4)) d(x)
GO
CREATE TABLE TargetChars(ch NVARCHAR(2) COLLATE Latin1_General_CS_AI PRIMARY KEY);
GO
INSERT TargetChars -- A-Z, a-z, ss
SELECT TOP(128) CHAR(i)
FROM t4000
WHERE i BETWEEN 65 AND 90
OR i BETWEEN 97 AND 122
UNION ALL
SELECT 'ss'
-- plus any other special targets here
GO
-----------------------------------------------
-- function
CREATE FUNCTION dbo.TrToA7(#str NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS
BEGIN
DECLARE #mapped NVARCHAR(4000) = '';
SELECT TOP(LEN(#str))
#mapped += ISNULL(tc.ch, SUBSTRING(#str, i, 1))
FROM t4000
LEFT JOIN TargetChars tc ON tc.ch = SUBSTRING(#str, i, 1)
COLLATE Latin1_General_CS_AI;
RETURN #mapped;
END
GO
Usage example:
SELECT dbo.TrToA7('It was not á tötal löß.');
Result:
--------------------------
It was not a total loss.

Resources