We handle a lot of sensitive data, and I would like to mask passenger names by keeping only the first and last letter of each name part and joining them with three asterisks (***).
For example, the name 'John Doe' becomes 'J***n D***e'.
For a name that consists of two parts this is doable by finding the space using the expression:
LEFT(PassengerName, 1) +
'***' +
CASE WHEN CHARINDEX(' ', PassengerName) = 0
THEN RIGHT(PassengerName, 1)
ELSE SUBSTRING(PassengerName, CHARINDEX(' ', PassengerName) -1, 1) +
' ' +
SUBSTRING(PassengerName, CHARINDEX(' ', PassengerName) +1, 1) +
'***' +
RIGHT(PassengerName, 1)
END
However, a passenger name can have more than two parts; there is no real limit. How can I find the indices of all spaces within an expression? Or should I tackle this problem in a different way?
Any help or pointer is much appreciated!
This solution does what you want it to, but is really the wrong approach to use when trying to hide personally identifiable data, as per Gordon's explanation in his answer.
SQL:
declare @t table(n nvarchar(20));
insert into @t values('John Doe')
,('JohnDoe')
,('John Doe Two')
,('John Doe Two Three')
,('John O''Neill');
select n
,stuff((select ' ' + left(s.item,1) + '***' + right(s.item,1)
from dbo.fn_StringSplit4k(t.n,' ',null) as s
for xml path('')
),1,1,''
) as mask
from @t as t;
Output:
+--------------------+-------------------------+
| n | mask |
+--------------------+-------------------------+
| John Doe | J***n D***e |
| JohnDoe | J***e |
| John Doe Two | J***n D***e T***o |
| John Doe Two Three | J***n D***e T***o T***e |
| John O'Neill | J***n O***l |
+--------------------+-------------------------+
String splitting function based on Jeff Moden's Tally Table approach:
create function [dbo].[fn_StringSplit4k]
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return, null returns all.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
GO
If you consider PassengerName as sensitive information, then you should not be storing it in clear text in generally accessible tables. Period.
There are several different options.
One is to have reference tables for sensitive information. Any table that references this would have an id rather than the name. Voilà. No sensitive information is available without access to the reference table, and that access would be severely restricted.
A second method is a reversible encryption algorithm. This would allow the value to be gibberish, but with the right knowledge, it could be transformed back into a meaningful value. Typical methods for this are the public key encryption algorithms devised by Rivest, Shamir, and Adleman (RSA encryption).
If you want to do first and last letters of names, I would be really careful about Asian names. Many of them consist of two or three letters when written in Latin script. That isn't much hiding. SQL Server does not have simple mechanisms to do this. You can write a user-defined function with a loop to manage the process. However, I view this as the least secure and least desirable approach.
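As a rough sketch of the first option (a restricted reference table), the separation could look something like the following. The table names `dbo.PassengerIdentity` and `dbo.Booking` and the role `ReportingUsers` are hypothetical and do not come from the question; adapt them to your own schema and security model.

```sql
-- Hypothetical reference table: the only place the clear-text name is stored.
CREATE TABLE dbo.PassengerIdentity
(
    PassengerId   int IDENTITY(1,1) PRIMARY KEY,
    PassengerName nvarchar(200) NOT NULL
);

-- Hypothetical generally accessible table: it carries the id, never the name.
CREATE TABLE dbo.Booking
(
    BookingId   int IDENTITY(1,1) PRIMARY KEY,
    PassengerId int NOT NULL REFERENCES dbo.PassengerIdentity (PassengerId),
    BookedAt    datetime NOT NULL DEFAULT (GETDATE())
);

-- Hypothetical role for ordinary users: bookings are readable, names are not.
CREATE ROLE ReportingUsers;
GRANT SELECT ON dbo.Booking TO ReportingUsers;
DENY SELECT ON dbo.PassengerIdentity TO ReportingUsers;
```

With this layout, a report written against `dbo.Booking` can still group and join by `PassengerId` without ever exposing a name.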
This uses Jeff Moden's DelimitedSplit8K, as well as the new functionality in SQL Server 2017 STRING_AGG. As I don't know what version you're using, I've just gone "whole hog" and assumed you're using the latest version.
Jeff's function is invaluable here, as it returns the ordinal position, something which Microsoft have foolishly omitted from their own function, STRING_SPLIT (and didn't add in 2017 either). Ordinal position is key here, so we can't make use of the built in function.
WITH VTE AS(
SELECT *
FROM (VALUES ('John Doe'),('Jane Bloggs'),('Edgar Allan Poe'),('Mr George W. Bush'),('Homer J Simpson')) V(FullName)),
Masking AS (
SELECT *,
ISNULL(STUFF(Item, 2, LEN(item) -2,'***'), Item) AS MaskedPart
FROM VTE V
CROSS APPLY dbo.delimitedSplit8K(V.Fullname, ' '))
SELECT STRING_AGG(MaskedPart,' ') WITHIN GROUP (ORDER BY ItemNumber) AS MaskedFullName
FROM Masking
GROUP BY Fullname;
Edit: Never mind, the OP has commented that they are using 2008, so STRING_AGG is out of the question. @iamdave, however, has posted an answer which is very similar to my own, just done the "old fashioned XML way".
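For readers on a much newer version, the same idea can be sketched with built-in functions only. This assumes SQL Server 2022 or Azure SQL Database, where STRING_SPLIT accepts a third `enable_ordinal` argument and returns an `ordinal` column; it will not run on 2008 or 2017. The table variable and the choice to leave single-letter parts unmasked are my own assumptions.

```sql
-- Assumes SQL Server 2022+ / Azure SQL Database (STRING_SPLIT with enable_ordinal).
DECLARE @names table (FullName varchar(100));
INSERT INTO @names (FullName)
VALUES ('John Doe'), ('Edgar Allan Poe'), ('Homer J Simpson');

SELECT n.FullName,
       STRING_AGG(
           CASE WHEN LEN(s.value) < 2
                THEN s.value                                    -- nothing to hide in a 0/1-letter part
                ELSE LEFT(s.value, 1) + '***' + RIGHT(s.value, 1)
           END, ' ')
       WITHIN GROUP (ORDER BY s.ordinal) AS MaskedFullName      -- keep the original part order
FROM @names AS n
CROSS APPLY STRING_SPLIT(n.FullName, ' ', 1) AS s               -- 1 = return the ordinal column
GROUP BY n.FullName;
```

The `WITHIN GROUP (ORDER BY s.ordinal)` clause is what keeps the name parts in their original order; without it the concatenation order is not guaranteed.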
Depending on your version of SQL Server, you may be able to use the built-in string split to rows on spaces in the name, do your string formatting, and then roll back up to name level using an XML path.
create table dataset (id int identity(1,1), name varchar(50));
insert into dataset (name) values
('John Smith'),
('Edgar Allen Poe'),
('One Two Three Four');
with split as (
select id, cs.Value as Name
from dataset
cross apply STRING_SPLIT (name, ' ') cs
),
formatted as (
select
id,
name,
left(name, 1) + '***' + right(name, 1) as out
from split
)
SELECT
id,
(SELECT ' ' + out
FROM formatted b
WHERE a.id = b.id
FOR XML PATH('')) [out_name]
FROM formatted a
GROUP BY id
Result:
id out_name
1 J***n S***h
2 E***r A***n P***e
3 O***e T***o T***e F***r
You can do that using this function.
create function [dbo].[fnMaskName] (@var_name varchar(100))
RETURNS varchar(100)
WITH EXECUTE AS CALLER
AS
BEGIN
declare @var_part varchar(100)
declare @var_return varchar(100)
declare @n_position smallint
set @var_return = ''
set @n_position = 1
WHILE @n_position<>0
BEGIN
SET @n_position = CHARINDEX(' ', @var_name)
IF @n_position = 0
SET @n_position = LEN(@var_name)
SET @var_part = SUBSTRING(@var_name, 1, @n_position)
SET @var_name = SUBSTRING(@var_name, @n_position+1, LEN(@var_name))
if @var_part<>''
SET @var_return = @var_return + stuff(@var_part, 2, len(@var_part)-2, replicate('*',len(@var_part)-2)) + ' '
END
RETURN(@var_return)
END
I have two strings to compare, and I want the percentage by which they match.
Given:
String 1: John Smith Makde
String 2: Makde John Smith
I used the following user-defined scalar function:
CREATE FUNCTION [dbo].[udf_GetPercentageOfTwoStringMatching]
(
@string1 NVARCHAR(1000)
,@string2 NVARCHAR(1000)
)
RETURNS INT
--WITH ENCRYPTION
AS
BEGIN
DECLARE @levenShteinNumber INT
DECLARE @string1Length INT = LEN(@string1), @string2Length INT = LEN(@string2)
DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END
SELECT @levenShteinNumber = [dbo].[f_ALGORITHM_LEVENSHTEIN] (@string1 ,@string2)
DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber
DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters
-- Return the result of the function
RETURN @percentageOfGoodCharacters
END
Function calling:
SELECT dbo.f_GetPercentageOfTwoStringMatching('John Smith Makde','Makde John Smith')
Output:
7
But when I pass the same string twice, with the words in the same positions:
SELECT dbo.f_GetPercentageOfTwoStringMatching('John Smith Makde','John Smith Makde')
Output:
100
Expected result: since both strings contain the same words, just in a different sequence, I want a 100% match.
100
+1 for the question. It appears you are trying to determine how similar two names are. It's hard to determine how you are doing that. I'm very familiar with the Levenshtein Distance for example but don't understand how you are trying to use it. To get you started I put together two ways you might approach this. This won't be a complete answer but rather the tools you will need to do whatever you're trying.
To compare the number of matching "name parts" you could use DelimitedSplit8K like this:
DECLARE
@String1 VARCHAR(100) = 'John Smith Makde Sr.',
@String2 VARCHAR(100) = 'Makde John Smith Jr.';
SELECT COUNT(*)/(1.*LEN(@String1)-LEN(REPLACE(@String1,' ',''))+1)
FROM
(
SELECT s1.item
FROM dbo.delimitedSplit8K(@String1,' ') AS s1
INTERSECT
SELECT s2.item
FROM dbo.delimitedSplit8K(@String2,' ') AS s2
) AS a
Here I'm splitting the names into atomic values and counting which ones match. Then we divide that number by the number of values. 3/4 = .75, or 75%; 3 of the 4 names match.
Another method would be to use NGrams8K like so:
DECLARE
@String1 VARCHAR(100) = 'John Smith Makde Sr.',
@String2 VARCHAR(100) = 'Makde John Smith Jr.';
SELECT (1.*f.L-f.MM)/f.L
FROM
(
SELECT
MM = SUM(ABS(s1.C-s2.C)),
L = CASE WHEN LEN(@String1)>LEN(@String2) THEN LEN(@String1) ELSE LEN(@String2) END
FROM
(
SELECT s1.token, COUNT(*)
FROM samd.NGrams8k(@String1,1) AS s1
GROUP BY s1.token
) AS s1(T,C)
JOIN
(
SELECT s1.token, COUNT(*)
FROM samd.NGrams8k(@String2,1) AS s1
GROUP BY s1.token
) AS s2(T,C)
ON s1.T=s2.T -- Letters that are equal
AND s1.C<>s2.C -- ... but the QTY is different
) AS f;
Here we're counting the characters and subtracting the mismatches. There are two (one extra J and one extra S). The longer of the two strings is 20 characters, and there are 18 characters where the letter and quantity are equal. 18/20 = .9, or 90%.
Again, what you are doing is not complicated, I would just need more detail for a better answer.
Doing this for millions of rows again and again will be a nightmare... I'd add another column (or a 1:1 related side table) to permanently store a normalized string. Try this:
--Create a mockup table and fill it with some dummy data
CREATE TABLE #MockUpYourTable(ID INT IDENTITY, SomeName VARCHAR(1000));
INSERT INTO #MockUpYourTable VALUES('Makde John Smith')
,('Smith John Makde')
,('Some other string')
,('string with with duplicates with');
GO
--Add a column to store the normalized strings
ALTER TABLE #MockupYourTable ADD NormalizedName VARCHAR(1000);
GO
--Use this script to split your string in fragments and re-concatenate them as canonically ordered, duplicate-free string.
UPDATE #MockUpYourTable SET NormalizedName=CAST('<x>' + REPLACE((SELECT LOWER(SomeName) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
.query(N'
for $fragment in distinct-values(/x/text())
order by $fragment
return $fragment
').value('.','nvarchar(1000)');
GO
--Check the result
SELECT * FROM #MockUpYourTable
ID SomeName NormalizedName
----------------------------------------------------------
1 Makde John Smith john makde smith
2 Smith John Makde john makde smith
3 Some other string other some string
4 string with with duplicates with duplicates string with
--Clean-Up
GO
DROP TABLE #MockUpYourTable
Hint: Use a trigger on INSERT and UPDATE to keep these values in sync.
Now you can apply the same transformation to the strings you want to compare against and then use your former approach. Because of the re-sorting, strings made of identical fragments will return 100% similarity.
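The trigger mentioned in the hint might look roughly like the sketch below. Triggers cannot be created on temporary tables, so this assumes a hypothetical permanent table `dbo.YourTable` with a unique `ID` column; the trigger name is made up as well. The normalization expression is the one from the UPDATE above, unchanged.

```sql
-- Hypothetical permanent table standing in for #MockUpYourTable.
CREATE TABLE dbo.YourTable
(
    ID             int IDENTITY(1,1) PRIMARY KEY,
    SomeName       varchar(1000),
    NormalizedName varchar(1000)
);
GO
CREATE TRIGGER dbo.trg_YourTable_Normalize
ON dbo.YourTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Do nothing when SomeName was not part of the statement. This also stops
    -- the trigger from redoing the work if RECURSIVE_TRIGGERS is switched on.
    IF NOT UPDATE(SomeName) RETURN;

    -- Re-normalize only the rows touched by the triggering statement.
    UPDATE t
    SET NormalizedName =
        CAST('<x>' + REPLACE((SELECT LOWER(i.SomeName) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
        .query(N'
            for $fragment in distinct-values(/x/text())
            order by $fragment
            return $fragment
        ').value('.','nvarchar(1000)')
    FROM dbo.YourTable AS t
    INNER JOIN inserted AS i ON i.ID = t.ID;
END;
GO
-- Usage: the normalized column is filled in automatically.
INSERT INTO dbo.YourTable (SomeName) VALUES ('Makde John Smith');
SELECT ID, SomeName, NormalizedName FROM dbo.YourTable;
```

Whether a trigger or a periodic batch update fits better depends on your write volume; the trigger keeps the column current at the cost of extra work on every insert and update.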
create table tbl1
(
name varchar(50)
);
insert into tbl1 values ('Mircrosoft SQL Server'),
('Office Microsoft');
create table tbl2
(
name varchar(50)
);
insert into tbl2 values ('SQL Server Microsoft'),
('Microsoft Office');
I want to get the percentage by which the strings in the name columns of the two tables match.
I tried the Levenshtein algorithm. However, the given data is the same in both tables, just in a different sequence, so I want to see the output as a 100% match.
Tried: LEVENSHTEIN
SELECT [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) MatchedPercentage,a.name as tbl1_name,b.name as tbl2_name
FROM tbl1 a
CROSS JOIN tbl2 b
WHERE [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) >= 0;
Result:
MatchedPercentage tbl1_name tbl2_name
-----------------------------------------------------------------
5 Mircrosoft SQL Server SQL Server Microsoft
10 Office Microsoft SQL Server Microsoft
15 Mircrosoft SQL Server Microsoft Office
13 Office Microsoft Microsoft Office
As mentioned in the comments this can be achieved through the use of a string split table valued function. Personally I use one based on the very performant set-based tally table approach put together by Jeff Moden which is at the end of my answer.
Using this function allows you to compare the individual words as delimited by a space character and count up the number of matches compared to the total number of words in the two values.
Do note however that this solution falls over on any values with leading spaces. If this will be a problem, clean your data before running this script or adjust to handle them:
declare @t1 table(v nvarchar(50));
declare @t2 table(v nvarchar(50));
insert into @t1 values('Microsoft SQL Server'),('Office Microsoft'),('Other values'); -- Add in some extra values, with the same number of words and some with the same number of characters
insert into @t2 values('SQL Server Microsoft'),('Microsoft Office'),('that matched'),('that didn''t'),('Other valuee');
with c as
(
select t1.v as v1
,t2.v as v2
,len(t1.v) - len(replace(t1.v,' ','')) + 1 as NumWords -- String Length - String Length without spaces = Number of words - 1
from @t1 as t1
cross join @t2 as t2 -- Cross join the two tables to get all comparisons
where len(replace(t1.v,' ','')) = len(replace(t2.v,' ','')) -- Where the length without spaces is the same. Can't have the same words in a different order if the number of non space characters in the whole string is different
)
select c.v1
,c.v2
,c.NumWords
,sum(case when s1.item = s2.item then 1 else 0 end) as MatchedWords
from c
cross apply dbo.fn_StringSplit4k(c.v1,' ',null) as s1
cross apply dbo.fn_StringSplit4k(c.v2,' ',null) as s2
group by c.v1
,c.v2
,c.NumWords
having c.NumWords = sum(case when s1.item = s2.item then 1 else 0 end);
Output
+----------------------+----------------------+----------+--------------+
| v1 | v2 | NumWords | MatchedWords |
+----------------------+----------------------+----------+--------------+
| Microsoft SQL Server | SQL Server Microsoft | 3 | 3 |
| Office Microsoft | Microsoft Office | 2 | 2 |
+----------------------+----------------------+----------+--------------+
Function
create function dbo.fn_StringSplit4k
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value to return.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
I am querying for a date value that follows a specific phrase in the same text column value. There are a number of date values in this text column but I'm needing only the DATE following the phrase I've found via a patindex value. Any help/direction would be appreciated. Thanks.
Here is my SQL code:
SELECT
NotesSysID,
PATINDEX('%Demand Due Date:%', NoteText) AS [Index of DemandDueDate text],
PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', NoteText) AS [Index of DemandDueDate date],
SUBSTRING(NoteText, PATINDEX('%Demand Due Date:%', NoteText), (PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', NoteText)))
FROM
#temp_ExtractedNotes;
Here is a pic of my data, please note that the 2nd index is LESS than the Index of DemandDueDate text found and I'm needing the subsequent date AFTER the Index of DemandDueDate text column. Hopefully this makes sense.
I think your code will be easier for you to work with if you use a CTE and some stepwise refinement. That'll free you from trying to do everything with one hugely-nested SELECT statement.
;
WITH FirstCut as (
SELECT
NotesSysID,
LocationOfText = PATINDEX('%Demand Due Date:%', NoteText),
NoteText
FROM #temp_ExtractedNotes
),
SecondCut as (
SELECT
NotesSysID,
NoteText,
-- making assumption date will be within first 250 chars of text
DemandDueDateSection = SUBSTRING( NoteText, [LocationOfText], 250)
FROM FirstCut
),
ThirdCut as (
SELECT
NotesSysID,
NoteText,
DemandDueDateSection,
LocationOfDate = PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', DemandDueDateSection)
FROM SecondCut
),
FourthCut as (
SELECT
NotesSysID,
NoteText,
DateAsText = SUBSTRING( DemandDueDateSection, LocationOfDate, 10 )
FROM ThirdCut
)
SELECT
NotesSysID,
NoteText,
DemandDueDate = CONVERT( DateTime, DateAsText)
FROM FourthCut
The issue you're having is because your second PATINDEX isn't finding the date you want; it's finding the first date in the string, which, as you noted, appears before the PATINDEX of the phrase you're searching for %Demand Due Date:%. The third parameter of SUBSTRING is LENGTH, which is to say how many characters after the second parameter you want to pull. By using that second PATINDEX value as the third parameter in your SUBSTRING you're returning a sub-string that starts where you want it to, and is of LENGTH equal to the number of characters into the string where that first date appears.
Which, of course, isn't what you want. To #ZLK's point in the comments, first, you need to do a nested PATINDEX. That's going to be pretty slow, so I'm hoping there aren't a bazillion records in your temp table.
Based on the sample image, it looks like the date you're interested in can appear a variable number of characters after %Demand Due Date:%. We'll start by adding 16 to the PATINDEX of %Demand Due Date:% (because that's how many characters are in %Demand Due Date:%, so we'll just start right after that). Then we'll pick up the next 100 characters. You can tweak that later if you need more or not that many.
So your first SUBSTRING will look like this:
SUBSTRING(NoteText, PATINDEX('%Demand Due Date:%', NoteText) + 16, 100)
Now we have to search that sub-string for the second pattern, the one that should yield a date for you.
PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', SUBSTRING(NoteText, PATINDEX('%Demand Due Date:%', NoteText) + 16, 100))
The number returned there is the point where your date value starts within the 100 characters following %Demand Due Date:%. Armed with that number, you just need to SUBSTRING out the next ten characters, and, just for fun, CAST it as a DATE. That big ugly formula will look like this:
DECLARE @test VARCHAR(200) = 'foo bar Demand Due Date: 12/21/2018 bar foo foo bar';
SELECT
CAST(
SUBSTRING(
SUBSTRING
(@test, PATINDEX('%Demand Due Date:%', @test) + 16, 100),
PATINDEX
( '%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%',
SUBSTRING
(@test, PATINDEX('%Demand Due Date:%', @test) + 16, 100)
)
,10)
AS DATE);
Result:
2018-12-21
Rextester: https://rextester.com/KCY79989
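If the nesting gets hard to read, the same steps can be spelled out one per CROSS APPLY. This is only a sketch under a few assumptions of mine: SQL Server 2012 or later (for TRY_CONVERT), dates in mm/dd/yyyy form (style 101), and a sample string that deliberately contains an earlier date before the phrase. It yields NULL rather than an error when the phrase or a following date is missing.

```sql
DECLARE @test varchar(200) = 'foo 01/02/2003 bar Demand Due Date: 12/21/2018 bar foo';
DECLARE @DueDate date;

SELECT @DueDate =
       CASE WHEN p.PhrasePos > 0 AND d.DatePos > 0
            -- 10 characters = mm/dd/yyyy; style 101 is the US date format
            THEN TRY_CONVERT(date, SUBSTRING(t.Tail, d.DatePos, 10), 101)
       END
FROM (VALUES (PATINDEX('%Demand Due Date:%', @test))) AS p(PhrasePos)
-- Step 1: keep the 100 characters that follow the 16-character phrase
CROSS APPLY (VALUES (SUBSTRING(@test, p.PhrasePos + 16, 100))) AS t(Tail)
-- Step 2: find the first date pattern inside that tail only
CROSS APPLY (VALUES (PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', t.Tail))) AS d(DatePos);

SELECT @DueDate AS DueDate;   -- 2018-12-21; the earlier 01/02/2003 is ignored
```

Against a table, replace the first VALUES row with a FROM on the table and use the NoteText column in place of @test.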
Grab a copy of PatternSplitCM.
CREATE FUNCTION dbo.PatternSplitCM
(
@List VARCHAR(8000) = NULL
,@Pattern VARCHAR(50)
) RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
WITH numbers AS (
SELECT TOP(ISNULL(DATALENGTH(@List), 0))
n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) g (n))
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY MIN(n)),
Item = SUBSTRING(@List,MIN(n),1+MAX(n)-MIN(n)),
Matched
FROM (
SELECT n, y.Matched, Grouper = n - ROW_NUMBER() OVER(ORDER BY y.Matched,n)
FROM numbers
CROSS APPLY (
SELECT Matched = CASE WHEN SUBSTRING(@List,n,1) LIKE @Pattern THEN 1 ELSE 0 END
) y
) d
GROUP BY Matched, Grouper;
Then it's pretty easy:
-- Sample Data
DECLARE @temp_ExtractedNotes TABLE (someID INT IDENTITY, NoteText VARCHAR(8000));
INSERT @temp_ExtractedNotes (NoteText) VALUES
('blah blah blah 2/4/2016... Demand Due Date: 1/05/2011; blah blah 12/1/2017...'),
('Yada yad..... Demand Due Date: 11/21/2016;...'),
('adas dasd asd a sd asdas Demand Due Date: 09/09/2019... 5/05/2005, 10/10/2010....'),
('nothing to see here - moving on... 01/02/2003!');
-- Solution
SELECT TOP (1) WITH TIES t.someID, ns.s, DueDate = CASE SIGN(nt.s) WHEN 1 THEN f.item END
FROM @temp_ExtractedNotes AS t
CROSS APPLY (VALUES(CHARINDEX('Demand Due Date:',t.NoteText))) AS nt(s)
CROSS APPLY (VALUES(SUBSTRING(t.noteText, nt.s+16, 8000))) AS ns(s)
CROSS APPLY dbo.patternsplitCM(ns.s,'[0-9/]') AS f
WHERE f.matched = 1
ORDER BY ROW_NUMBER() OVER (PARTITION BY t.someID ORDER BY f.itemNumber);
Results:
someID s DueDate
----------- ---------------------------------------- ------------
1 1/05/2011; blah blah 12/1/2017... 1/05/2011
2 11/21/2016;... 11/21/2016
3 09/09/2019... 5/05/2005, 10/10/2010.... 09/09/2019
4 here - moving on... 01/02/2003! NULL
I have two tables and the values like this
`create table InputLocationTable(SKUID int,InputLocations varchar(100),Flag varchar(100))
create table Location(SKUID int,Locations varchar(100))
insert into InputLocationTable(SKUID,InputLocations) values(11,'Loc1, Loc2, Loc3, Loc4, Loc5, Loc6')
insert into InputLocationTable(SKUID,InputLocations) values(12,'Loc1, Loc2')
insert into InputLocationTable(SKUID,InputLocations) values(13,'Loc4,Loc5')
insert into Location(SKUID,Locations) values(11,'Loc3')
insert into Location(SKUID,Locations) values(11,'Loc4')
insert into Location(SKUID,Locations) values(11,'Loc5')
insert into Location(SKUID,Locations) values(11,'Loc7')
insert into Location(SKUID,Locations) values(12,'Loc10')
insert into Location(SKUID,Locations) values(12,'Loc1')
insert into Location(SKUID,Locations) values(12,'Loc5')
insert into Location(SKUID,Locations) values(13,'Loc4')
insert into Location(SKUID,Locations) values(13,'Loc2')
insert into Location(SKUID,Locations) values(13,'Loc2')`
I need to get the output by matching the SKUIDs from each table and update the value in the Flag column as shown in the screenshot. I have tried something like this:
`SELECT STUFF((select ','+ Data.C1
FROM
(select
n.r.value('.', 'varchar(50)') AS C1
from InputLocation as T
cross apply (select cast('<r>'+replace(replace(Location,'&','&amp;'), ',', '</r><r>')+'</r>' as xml)) as S(XMLCol)
cross apply S.XMLCol.nodes('r') as n(r)) DATA
WHERE data.C1 NOT IN (SELECT Location
FROM Location) for xml path('')),1,1,'') As Output`
But I am not convinced by the output, and I am also trying to avoid the XML PATH code because its performance is not great. I need the output to look like the screenshot below. Any help would be greatly appreciated.
I think you need to first look at why you think the XML approach is not performing well enough for your needs, as it has actually been shown to perform very well for larger input strings.
If you only need to handle input strings of up to either 4000 or 8000 characters (non max nvarchar and varchar types respectively), you can utilise a tally table contained within an inline table valued function which will also perform very well. The version I use can be found at the end of this post.
Utilising this function we can split out the values in your InputLocations column, though we still need to use for xml to concatenate them back together for your desired format:
-- Define data
declare @InputLocationTable table (SKUID int,InputLocations varchar(100),Flag varchar(100));
declare @Location table (SKUID int,Locations varchar(100));
insert into @InputLocationTable(SKUID,InputLocations) values (11,'Loc1, Loc2, Loc3, Loc4, Loc5, Loc6'),(12,'Loc1, Loc2'),(13,'Loc4,Loc5'),(14,'Loc1');
insert into @Location(SKUID,Locations) values (11,'Loc3'),(11,'Loc4'),(11,'Loc5'),(11,'Loc7'),(12,'Loc10'),(12,'Loc1'),(12,'Loc5'),(13,'Loc4'),(13,'Loc2'),(13,'Loc2'),(14,'Loc1');
--Query
-- Derived table splits out the values held within the InputLocations column
with i as
(
select i.SKUID
,i.InputLocations
,s.item as Loc
from @InputLocationTable as i
cross apply dbo.fn_StringSplit4k(replace(i.InputLocations,' ',''),',',null) as s
)
select il.SKUID
,il.InputLocations
,isnull('Add ' -- The split Locations are then matched to those already in @Location and those not present are concatenated together.
+ stuff((select ', ' + i.Loc
from i
left join @Location as l
on i.SKUID = l.SKUID
and i.Loc = l.Locations
where il.SKUID = i.SKUID
and l.SKUID is null
for xml path('')
)
,1,2,''
)
,'No Flag') as Flag
from @InputLocationTable as il
order by il.SKUID;
Output:
+-------+------------------------------------+----------------------+
| SKUID | InputLocations | Flag |
+-------+------------------------------------+----------------------+
| 11 | Loc1, Loc2, Loc3, Loc4, Loc5, Loc6 | Add Loc1, Loc2, Loc6 |
| 12 | Loc1, Loc2 | Add Loc2 |
| 13 | Loc4,Loc5 | Add Loc5 |
| 14 | Loc1 | No Flag |
+-------+------------------------------------+----------------------+
For nvarchar input (I have different functions for varchar and max type input) this is my version of the string splitting function linked above:
create function [dbo].[fn_StringSplit4k]
(
@str nvarchar(4000) = ' ' -- String to split.
,@delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,@num as int = null -- Which value in the list to return. NULL returns all.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(@str,s,l) as item
from l
) a
where rn = @num
or @num is null;
go
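Since the question asks for the Flag column to be updated rather than just selected, here is one way that could be sketched without FOR XML. It assumes SQL Server 2017 or later (STRING_SPLIT needs compatibility level 130, STRING_AGG needs 2017), which the question does not confirm, and it uses table variables with a subset of the sample data. Note that STRING_AGG without an ordering clause does not guarantee the order of the concatenated locations.

```sql
-- Assumes SQL Server 2017+ (STRING_SPLIT and STRING_AGG).
DECLARE @InputLocationTable table (SKUID int, InputLocations varchar(100), Flag varchar(100));
DECLARE @Location table (SKUID int, Locations varchar(100));

INSERT INTO @InputLocationTable (SKUID, InputLocations)
VALUES (12, 'Loc1, Loc2'), (13, 'Loc4,Loc5'), (14, 'Loc1');
INSERT INTO @Location (SKUID, Locations)
VALUES (12, 'Loc10'), (12, 'Loc1'), (12, 'Loc5'), (13, 'Loc4'), (13, 'Loc2'), (14, 'Loc1');

UPDATE il
SET Flag = ISNULL('Add ' + m.Missing, 'No Flag')     -- NULL means nothing was missing
FROM @InputLocationTable AS il
OUTER APPLY
(
    -- Input locations for this SKUID that are not yet present in @Location
    SELECT STRING_AGG(LTRIM(s.value), ', ') AS Missing
    FROM STRING_SPLIT(il.InputLocations, ',') AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM @Location AS l
                      WHERE l.SKUID = il.SKUID
                        AND l.Locations = LTRIM(s.value))
) AS m;

SELECT SKUID, InputLocations, Flag
FROM @InputLocationTable
ORDER BY SKUID;   -- 12: Add Loc2, 13: Add Loc5, 14: No Flag
```

On older versions, the same UPDATE shape works if the OUTER APPLY body is swapped for the split function and the FOR XML concatenation shown above.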
I'm doing some reporting against a silly database and I have to do
SELECT [DESC] as 'Description'
FROM dbo.tbl_custom_code_10 a
INNER JOIN dbo.Respondent b ON CHARINDEX(',' + a.code + ',', ',' + b.CC10) > 0
WHERE recordid = 116
Which Returns Multiple Rows
Palm
Compaq
Blackberry
Edit *
Schema is
Respondent Table (At a Glance) ...
*recordid lname fname address CC10 CC11 CC12 CC13*
116 Smith John Street 1,4,5, 1,3,4, 1,2,3, NULL
Tbl_Custom_Code10
*code desc*
0 None
1 Palm
10 Samsung
11 Treo
12 HTC
13 Nokia
14 LG
15 HP
16 Dash
Result set will always be 1 row, so John Smith: | 646-465-4566 | Has a Blackberry, Palm, Compaq | Likes: Walks on the beach, Rainbows, Saxophone
However I need to be able to use this within another query ... like
Select b.Name, c.Number, d.MulitLineCrap FROM Tables
How can I go about this, Thanks in advance ...
BTW I could also do it in LINQ if any body had any ideas ...
Here is one way to make a comma-separated list based on a query (just replace the query inside the first WITH block). Now, how that joins up with your query against b and c, I have no idea. You'll need to supply a more complete question - including specifics on how many rows come back from the second query and whether "MultilineCrap" is the same for each of those rows or if it depends on data in b/c.
;WITH x([DESC]) AS
(
SELECT d FROM (VALUES('Palm'),('Compaq'),('Blackberry')) AS x(d)
)
SELECT STUFF((SELECT ',' + [DESC]
FROM x
FOR XML PATH(''), TYPE).value(N'./text()[1]', N'varchar(max)'),1,1,'');
EDIT
Given the new requirements, perhaps this is the best way:
CREATE FUNCTION dbo.GetMultiLineCrap
(
@s VARCHAR(MAX)
)
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @x VARCHAR(MAX) = '';
SELECT @x += ',' + [desc]
FROM dbo.tbl_custom_code_10
WHERE ',' + @s LIKE '%,' + RTRIM(code) + ',%';
RETURN (SELECT STUFF(@x, 1, 1, ''));
END
GO
SELECT r.LName, r.FName, MultilineCrap = dbo.GetMultiLineCrap(r.CC10)
FROM dbo.Respondent AS r
WHERE recordid = 116;
Please use aliases that make a little bit of sense, instead of just serially applying a, b, c, etc. Your queries will be easier to read, I promise.