Separating firstname surname - sql-server

If the original field looks like paul#yates then this syntax picks out the surname correctly
substring(surname,CHARINDEX('#',surname+'#')+1,LEN(name3))
However, if the field is paul#b#yates then the surname looks like #b#yates. I want the middle letter to be dropped so it picks only the surname out.
Any ideas?

You can:
;with T(name) as (
select 'paul#yates' union
select 'paul#b#yates'
)
select
right(name, charindex('#', reverse(name) + '#') - 1)
from T
>>
yates
yates

You could reverse the string (the character array), split it at the first "#", take that part, and reverse it again.
If this is Java, there should be a built-in reverse function; otherwise you may need to write one yourself.
You could also keep cutting the string at each "#" until there are no more "#" signs left (the index search should then return -1 or something similar) and take the last piece, but I like my first idea better.
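Since the question is tagged sql-server rather than Java, here is a minimal T-SQL sketch of the same reverse-and-search idea, broken into its intermediate steps. The variable @name is only a stand-in for the column and is not part of the original question.

```sql
-- Illustrative only: @name is a hypothetical variable standing in for the column.
DECLARE @name varchar(50) = 'paul#b#yates';

SELECT
    REVERSE(@name)                                          AS reversed,            -- 'setay#b#luap'
    CHARINDEX('#', REVERSE(@name) + '#')                    AS first_hash_from_end, -- 6
    RIGHT(@name, CHARINDEX('#', REVERSE(@name) + '#') - 1)  AS surname;             -- 'yates'
```

Appending '#' to the reversed string guarantees that CHARINDEX finds something, so a value without any '#' comes back whole instead of raising an error.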

Here's an example for you
declare @t table (name varchar(max));
insert @t select
'john' union all select
'john#t#bill' union all select
'joe#public';
select firstname=left(name,-1+charindex('#',name+'#')),
surname=case when name like '%#%' then
stuff(name,1,len(name)+1-charindex('#',reverse(name)+'#'),'')
end
from @t;
-- results
FIRSTNAME SURNAME
john (null)
john bill
joe public

Related

Get percentage of matching strings

I have two strings to compare, and I want to get the percentage by which they match.
Given:
String 1: John Smith Makde
String 2: Makde John Smith
Used the following user defined scalar function.
CREATE FUNCTION [dbo].[udf_GetPercentageOfTwoStringMatching]
(
@string1 NVARCHAR(1000)
,@string2 NVARCHAR(1000)
)
RETURNS INT
--WITH ENCRYPTION
AS
BEGIN
DECLARE @levenShteinNumber INT
DECLARE @string1Length INT = LEN(@string1), @string2Length INT = LEN(@string2)
DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END
SELECT @levenShteinNumber = [dbo].[f_ALGORITHM_LEVENSHTEIN] (@string1 ,@string2)
DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber
DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters
-- Return the result of the function
RETURN @percentageOfGoodCharacters
END
Function calling:
SELECT dbo.udf_GetPercentageOfTwoStringMatching('John Smith Makde','Makde John Smith')
Output:
7
But when I pass two identical strings, with the words in the same positions:
SELECT dbo.udf_GetPercentageOfTwoStringMatching('John Smith Makde','John Smith Makde')
Output:
100
Expected result: since both strings contain the same words, only in a different order, I want a matching percentage of 100% for the first call as well.
100
+1 for the question. It appears you are trying to determine how similar two names are. It's hard to determine how you are doing that. I'm very familiar with the Levenshtein Distance for example but don't understand how you are trying to use it. To get you started I put together two ways you might approach this. This won't be a complete answer but rather the tools you will need to do whatever you're trying.
To compare the number of matching "name parts" you could use DelimitedSplit8K like this:
DECLARE
@String1 VARCHAR(100) = 'John Smith Makde Sr.',
@String2 VARCHAR(100) = 'Makde John Smith Jr.';
SELECT COUNT(*)/(1.*LEN(@String1)-LEN(REPLACE(@string1,' ',''))+1)
FROM
(
SELECT s1.item
FROM dbo.delimitedSplit8K(@String1,' ') AS s1
INTERSECT
SELECT s2.item
FROM dbo.delimitedSplit8K(@String2,' ') AS s2
) AS a
Here I'm splitting the names into atomic values and counting which ones match. Then we divide that number by the number of values in the first string. 3/4 = .75 for 75%; 3 of the four names match.
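DelimitedSplit8K is a community function that has to be installed first. As a rough sketch of the same name-part comparison, and assuming SQL Server 2016 or later with database compatibility level 130 or higher, the built-in STRING_SPLIT can stand in for it here, because this comparison does not need the ordinal position of each part. The aliases match_ratio and shared_parts are made up for the example.

```sql
DECLARE
    @String1 varchar(100) = 'John Smith Makde Sr.',
    @String2 varchar(100) = 'Makde John Smith Jr.';

-- Shared name parts divided by the number of parts in the first string.
SELECT 1.0 * COUNT(*) / (SELECT COUNT(*) FROM STRING_SPLIT(@String1, ' ')) AS match_ratio
FROM
(
    SELECT value FROM STRING_SPLIT(@String1, ' ')
    INTERSECT
    SELECT value FROM STRING_SPLIT(@String2, ' ')
) AS shared_parts;
```

For the sample values this evaluates to 0.75: three shared parts out of four. Because INTERSECT ignores order, 'John Smith Makde' and 'Makde John Smith' would score 1.0 with this approach.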
Another method would be to use NGrams8K like so:
DECLARE
@String1 VARCHAR(100) = 'John Smith Makde Sr.',
@String2 VARCHAR(100) = 'Makde John Smith Jr.';
SELECT (1.*f.L-f.MM)/f.L
FROM
(
SELECT
MM = SUM(ABS(s1.C-s2.C)),
L = CASE WHEN LEN(@String1)>LEN(@string2) THEN LEN(@String1) ELSE LEN(@string2) END
FROM
(
SELECT s1.token, COUNT(*)
FROM samd.NGrams8k(@String1,1) AS s1
GROUP BY s1.token
) AS s1(T,C)
JOIN
(
SELECT s1.token, COUNT(*)
FROM samd.NGrams8k(@String2,1) AS s1
GROUP BY s1.token
) AS s2(T,C)
ON s1.T=s2.T -- Letters that are equal
AND s1.C<>s2.C -- ... but the QTY is different
) AS f;
Here we're counting the characters and subtracting the mismatches. There are two (one extra J and one extra S). The longer of the two strings is 20 characters, and there are 18 characters where the letter and quantity are equal. 18/20 = .9, or 90%.
Again, what you are doing is not complicated, I would just need more detail for a better answer.
Doing this for millions of rows again and again will be a nightmare... I'd add another column (or a 1:1 related side table) to permanently store a normalized string. Try this:
--Create a mockup table and fill it with some dummy data
CREATE TABLE #MockUpYourTable(ID INT IDENTITY, SomeName VARCHAR(1000));
INSERT INTO #MockUpYourTable VALUES('Makde John Smith')
,('Smith John Makde')
,('Some other string')
,('string with with duplicates with');
GO
--Add a column to store the normalized strings
ALTER TABLE #MockupYourTable ADD NormalizedName VARCHAR(1000);
GO
--Use this script to split your string in fragments and re-concatenate them as canonically ordered, duplicate-free string.
UPDATE #MockUpYourTable SET NormalizedName=CAST('<x>' + REPLACE((SELECT LOWER(SomeName) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
--Check the result
SELECT * FROM #MockUpYourTable
ID  SomeName                          NormalizedName
--  --------------------------------  ----------------------
1   Makde John Smith                  john makde smith
2   Smith John Makde                  john makde smith
3   Some other string                 other some string
4   string with with duplicates with  duplicates string with
--Clean-Up
GO
DROP TABLE #MockUpYourTable
Hint: use a trigger for INSERT and UPDATE to keep these values in sync.
Now you can apply the same transformation to the strings you want to compare against and then use your former approach. Because the fragments are re-sorted, strings made of the same fragments will return 100% similarity.
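The trigger hint could look roughly like the sketch below. It is untested and rests on assumptions: dbo.YourTable and trg_YourTable_Normalize are hypothetical names, the table is a permanent one (triggers cannot be created on temp tables such as the mockup), and ID is its key.

```sql
-- Sketch only: dbo.YourTable and trg_YourTable_Normalize are hypothetical names.
-- Assumes dbo.YourTable(ID, SomeName, NormalizedName) with ID as the key.
CREATE TRIGGER dbo.trg_YourTable_Normalize
ON dbo.YourTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Nothing to do when SomeName was not part of the statement.
    IF NOT UPDATE(SomeName) RETURN;

    -- Recompute the normalized string only for the rows that were touched.
    UPDATE t
    SET NormalizedName =
        CAST('<x>' + REPLACE((SELECT LOWER(t.SomeName) AS [*] FOR XML PATH('')), ' ', '</x><x>') + '</x>' AS XML)
        .query(N'
            for $fragment in distinct-values(/x/text())
            order by $fragment
            return $fragment
        ').value('.', 'nvarchar(1000)')
    FROM dbo.YourTable AS t
    INNER JOIN inserted AS i ON i.ID = t.ID;
END;
```

The UPDATE(SomeName) check also keeps the trigger from doing any work if its own update of NormalizedName were ever to fire it again.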

Expression to find multiple spaces in string

We handle a lot of sensitive data, and I would like to mask passenger names by keeping only the first and last letter of each name part and joining these with three asterisks (***).
For example: the name 'John Doe' will become 'J***n D***e'
For a name that consists of two parts this is doable by finding the space using the expression:
LEFT(CardHolderNameFromPurchase, 1) +
'***' +
CASE WHEN CHARINDEX(' ', PassengerName) = 0
THEN RIGHT(PassengerName, 1)
ELSE SUBSTRING(PassengerName, CHARINDEX(' ', PassengerName) -1, 1) +
' ' +
SUBSTRING(PassengerName, CHARINDEX(' ', PassengerName) +1, 1) +
'***' +
RIGHT(PassengerName, 1)
END
However, the passenger name can have more than two parts; there is no real limit to it. How can I find the indices of all spaces within an expression? Or should I maybe tackle this problem in a different way?
Any help or pointer is much appreciated!
This solution does what you want it to, but is really the wrong approach to use when trying to hide personally identifiable data, as per Gordon's explanation in his answer.
SQL:
declare @t table(n nvarchar(20));
insert into @t values('John Doe')
,('JohnDoe')
,('John Doe Two')
,('John Doe Two Three')
,('John O''Neill');
select n
,stuff((select ' ' + left(s.item,1) + '***' + right(s.item,1)
from dbo.fn_StringSplit4k(t.n,' ',null) as s
for xml path('')
),1,1,''
) as mask
from @t as t;
Output:
+--------------------+-------------------------+
| n | mask |
+--------------------+-------------------------+
| John Doe | J***n D***e |
| JohnDoe | J***e |
| John Doe Two | J***n D***e T***o |
| John Doe Two Three | J***n D***e T***o T***e |
| John O'Neill | J***n O***l |
+--------------------+-------------------------+
String splitting function based on Jeff Moden's Tally Table approach:
create function [dbo].[fn_StringSplit4k]
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return, null returns all.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
GO
If you consider PassengerName as sensitive information, then you should not be storing it in clear text in generally accessible tables. Period.
There are several different options.
One is to have reference tables for sensitive information. Any table that references this would have an id rather than the name. Voilà. No sensitive information is available without access to the reference table, and that access would be severely restricted.
A second method is a reversible encryption algorithm. This would allow the value to be gibberish, but with the right knowledge, it could be transformed back into a meaningful value. Typical methods for this are the public key encryption algorithms devised by Rivest, Shamir, and Adleman (RSA encryption).
If you want to do first and last letters of names, I would be really careful about Asian names. Many of them consist of two or three letters when written in Latin script. That isn't much hiding. SQL Server does not have simple mechanisms to do this. You can write a user-defined function with a loop to manage the process. However, I view this as the least secure and least desirable approach.
This uses Jeff Moden's DelimitedSplit8K, as well as the new functionality in SQL Server 2017 STRING_AGG. As I don't know what version you're using, I've just gone "whole hog" and assumed you're using the latest version.
Jeff's function is invaluable here, as it returns the ordinal position, something which Microsoft have foolishly omitted from their own function, STRING_SPLIT (and didn't add in 2017 either). Ordinal position is key here, so we can't make use of the built in function.
WITH VTE AS(
SELECT *
FROM (VALUES ('John Doe'),('Jane Bloggs'),('Edgar Allan Poe'),('Mr George W. Bush'),('Homer J Simpson')) V(FullName)),
Masking AS (
SELECT *,
ISNULL(STUFF(Item, 2, LEN(item) -2,'***'), Item) AS MaskedPart
FROM VTE V
CROSS APPLY dbo.delimitedSplit8K(V.Fullname, ' '))
SELECT STRING_AGG(MaskedPart,' ') AS MaskedFullName
FROM Masking
GROUP BY Fullname;
Edit: Never mind, OP has commented that they are using 2008, so STRING_AGG is out of the question. @iamdave, however, has posted an answer which is very similar to my own, just done the "old fashioned XML way".
Depending on your version of SQL Server, you may be able to use the built-in string split to rows on spaces in the name, do your string formatting, and then roll back up to name level using an XML path.
create table dataset (id int identity(1,1), name varchar(50));
insert into dataset (name) values
('John Smith'),
('Edgar Allen Poe'),
('One Two Three Four');
with split as (
select id, cs.Value as Name
from dataset
cross apply STRING_SPLIT (name, ' ') cs
),
formatted as (
select
id,
name,
left(name, 1) + '***' + right(name, 1) as out
from split
)
SELECT
id,
(SELECT ' ' + out
FROM formatted b
WHERE a.id = b.id
FOR XML PATH('')) [out_name]
FROM formatted a
GROUP BY id
Result:
id out_name
1 J***n S***h
2 E***r A***n P***e
3 O***e T***o T***e F***r
You can do that using this function. Note that it writes one asterisk per hidden character rather than a fixed '***'.
create function [dbo].[fnMaskName] (@var_name varchar(100))
RETURNS varchar(100)
WITH EXECUTE AS CALLER
AS
BEGIN
declare @var_part varchar(100)
declare @var_return varchar(100)
declare @n_position smallint
set @var_return = ''
set @n_position = 1
WHILE @n_position<>0
BEGIN
SET @n_position = CHARINDEX(' ', @var_name)
IF @n_position = 0
SET @n_position = LEN(@var_name)
SET @var_part = SUBSTRING(@var_name, 1, @n_position)
SET @var_name = SUBSTRING(@var_name, @n_position+1, LEN(@var_name))
if @var_part<>''
SET @var_return = @var_return + stuff(@var_part, 2, len(@var_part)-2, replicate('*',len(@var_part)-2)) + ' '
END
RETURN(@var_return)
END

Updating a column based on first occurrence of a text in another column

I have a requirement to take a 150-character free-form text and map it to one of two values, Sibling or Spouse, in the member_type field in a SQL Server database.
I came up with the below update statement that does the update:
Update my_table
set member_type = CASE
WHEN (relationship_description like 'brother%' OR
relationship_description like 'sister%' OR
relationship_description like 'sibling%'
) THEN 'Sibling'
WHEN ( relationship_description like 'spouse%' OR
relationship_description like 'husband%' OR
relationship_description like 'wife%'
) THEN 'Spouse'
ELSE '' END;
But there is an additional requirement: if relationship_description contains multiple keywords, convert using the first keyword.
For example:
Case 1: relationship_description = "Mark's brother": contains "brother", so it will be treated as Sibling.
Case 2: relationship_description = "Walter is brother of Mark and Greg Howard. Mark is the husband of Julie": the first occurrence is "brother", so it will be treated as Sibling.
Case 3: relationship_description = "John's Wife": contains the keyword "wife", so it should be treated as Spouse.
Case 4: relationship_description = "Johns's wife and Peter's sister": contains two keywords, "wife" and "sister"; "wife" comes first, so it should be treated as Spouse.
I came to know there is a SQL function STUFF that might work. Can someone help? I need to do it via an SQL script and not via Java.
Try the PATINDEX function:
declare @t table(relationship_description varchar(max), member_type varchar(10))
insert into @t values
('Mark''s brother in', null),
('Walter is brother of Mark and Greg Howard. Mark is the husband of Julie', null),
('John''s Wife', null),
('Johns''s wife and Peter''s sister', null)
update t
set member_type = case when m = 0
then null else
case when ca.t = 1
then 'Sibling'
else 'Spouse'
end
end
from @t t
cross apply (select top 1 *
from (values
(PATINDEX('%brother%', relationship_description), 1),
(PATINDEX('%sister%', relationship_description), 1),
(PATINDEX('%sibling%', relationship_description), 1),
(PATINDEX('%spouse%', relationship_description), 2),
(PATINDEX('%husband%', relationship_description), 2),
(PATINDEX('%wife%', relationship_description), 2)) t(m, t)
order by ROW_NUMBER() over(order by case when m = 0 then 1000000 else m end))ca
select * from @t
The CROSS APPLY finds the index of the first occurrence of each keyword and then picks the smallest index other than 0 (0 means the keyword does not occur, so it is replaced with 1000000 for sorting). The outer CASE returns NULL when no keyword occurs at all; otherwise it looks at the type of the first keyword and selects the matching description.
Output:
relationship_description member_type
Mark's brother Sibling
Walter is brother of Mark and Greg Howard. Mark is the husband of Julie Sibling
John's Wife Spouse
Johns's wife and Peter's sister Spouse
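To see the "earliest keyword wins" rule in isolation, here is a small sketch against a single value. The variable @description and the aliases k, keyword and member_type are invented for the illustration. It uses CHARINDEX rather than PATINDEX because no wildcards are needed when searching for a plain substring.

```sql
-- @description is a hypothetical sample value; the keyword list mirrors the question.
DECLARE @description varchar(200) = 'Johns''s wife and Peter''s sister';

SELECT TOP (1) k.member_type
FROM (VALUES
        ('brother', 'Sibling'), ('sister',  'Sibling'), ('sibling', 'Sibling'),
        ('spouse',  'Spouse'),  ('husband', 'Spouse'),  ('wife',    'Spouse')
     ) AS k(keyword, member_type)
WHERE CHARINDEX(k.keyword, @description) > 0     -- keep only keywords that occur
ORDER BY CHARINDEX(k.keyword, @description);     -- earliest occurrence first
```

For this sample the query returns Spouse, because "wife" appears before "sister". When no keyword occurs it returns no row at all, which is why the UPDATE above needs its explicit m = 0 branch.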
If you want to know which one occurs first you'll need patindex.
I would recommend creating a CTE, based on the key of your source table, that gets the PATINDEX value for each string you're searching for. Something like the following...
WITH Relations as
( SELECT 'brother' as searchstring, 'Sibling' as relationship
UNION
SELECT 'sister' as searchstring, 'Sibling' as relationship
UNION
SELECT 'sibling' as searchstring, 'Sibling' as relationship
UNION
SELECT 'spouse' as searchstring, 'Spouse' as relationship
UNION
SELECT 'husband' as searchstring, 'Spouse' as relationship
UNION
SELECT 'wife' as searchstring, 'Spouse' as relationship
)
,
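FoundPat as
( SELECT my.keyfield, r.relationship,
RANK() OVER (PARTITION BY my.keyfield ORDER BY PATINDEX('%' + r.searchstring + '%', my.relationship_description)) as positionrank
FROM my_table my
cross join Relations r
-- PATINDEX needs the % wildcards and returns 0 when the pattern is absent;
-- drop those rows so a missing keyword can never rank first
WHERE PATINDEX('%' + r.searchstring + '%', my.relationship_description) > 0
)
Update my
set member_type = ISNULL(fp.relationship,'')
from my_table my
left join FoundPat fp on fp.keyfield = my.keyfield and fp.positionrank = 1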
I haven't tested the above code, so some syntax may be slightly off

Replace function in a column where comma is replaced by apostrophe and comma?

I have a table Employees
EmpGid Employees
1 john
2 john,kevin
3 Tom,Peter,harry
4 Peter,Mike,Frank
I need the output to look like the following when I select employees:
Employees
john
john','kevin
Tom','Peter','harry
Peter','Mike','Frank
I am using REPLACE, but I am unable to replace each , with ','.
Can someone help me with how to do this replacement?
this will replace , with ',' :
SELECT *, replace(Employees, ',',''',''') FROM Employees
It should be like
select REPLACE('John,Kevin',',',''',''')
Which will result in John','Kevin
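The doubled quotes are the part that tends to confuse, so here is a short sketch of how the literal is read: inside a T-SQL string literal, two single quotes stand for one. The column aliases are invented for the example.

```sql
SELECT
    ''','''                                                  AS replacement_text,   -- the three characters ','
    LEN(''',''')                                             AS replacement_length, -- 3
    REPLACE('Tom,Peter,harry', ',', ''',''')                 AS replaced,           -- Tom','Peter','harry
    '''' + REPLACE('Tom,Peter,harry', ',', ''',''') + ''''   AS quoted_list;        -- 'Tom','Peter','harry'
```

The last column adds the outer quotes as well, in case the goal is a fully quoted list.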

Make unique column in SQL

I have a table that contains duplicate data.
This is my current table:
Id Name
1 shahin Zen
2 shahin Zen & Aaron Henley
3 Fred Sayz feat. Antonia Lucas
4 Fred Sayz feat. Lawrence Alexander
5 Fred Sayz feat. Sibel
Note: I cannot use DISTINCT because the names do not match fully.
I want to make a table from this table like:
ID Name
1 shahin
2 Fred
Has anyone solved this kind of problem?
Thanks in advance.
If you just want to get the distinct first words of the rows:
select distinct substring(Name, 0, charindex(' ', Name, 0))
from myTable
You can also restrict this to rows that contain a space character by adding a WHERE clause:
where charindex(' ', Name, 0) > 0
If you just need the first names, try this:
SELECT
LEFT(name, CHARINDEX(' ', name))
FROM Table1
GROUP BY LEFT(name, CHARINDEX(' ', name))
You need to account for those records that don't have a space...
Select Distinct Left(name,CharIndex(' ',name+' '))
From myTable
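Putting the pieces together, here is a sketch that also produces the renumbered ID column the question asks for. The table variable @names and the alias FirstWord are hypothetical stand-ins. Appending ' ' before the search is what lets single-word names come through whole.

```sql
-- @names is a hypothetical stand-in for the table in the question.
DECLARE @names table (Id int, Name varchar(100));
INSERT INTO @names (Id, Name) VALUES
    (1, 'shahin Zen'),
    (2, 'shahin Zen & Aaron Henley'),
    (3, 'Fred Sayz feat. Antonia Lucas'),
    (4, 'Fred Sayz feat. Lawrence Alexander'),
    (5, 'Fred Sayz feat. Sibel');

SELECT
    ROW_NUMBER() OVER (ORDER BY MIN(n.Id)) AS ID,   -- renumber by first appearance
    f.FirstWord                            AS Name
FROM @names AS n
CROSS APPLY (SELECT RTRIM(LEFT(n.Name, CHARINDEX(' ', n.Name + ' ')))) AS f(FirstWord)
GROUP BY f.FirstWord
ORDER BY ID;
```

With the sample rows this yields 1 shahin and 2 Fred. Names that share a first word but belong to different people would be merged, so this only fits data where the first word really identifies the row.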
