Splitting multiple fields by delimiter - sql-server

I have to write a stored procedure that can perform partial updates on our databases; the changes are stored in records of the PU table. A Values field contains all values, delimited by a fixed delimiter. A Table field refers to a Schemes table, which holds the column names for each table in a similarly delimited Columns field.
Now, for my SP, I need to split the Values field and the Columns field into a temp table of Column/Value pairs; this happens for each record in the PU table.
An example:
Our PU table looks something like this:
CREATE TABLE [dbo].[PU](
    [Table] [nvarchar](50) NOT NULL,
    [Values] [nvarchar](max) NOT NULL
)
Insert SQL for this example:
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','John Doe;26');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Jane Doe;22');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Mike Johnson;20');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Person','Mary Jane;24');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','Mathematics');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','English');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Course','Geography');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus A;Schools Road 1;Educationville');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus B;Schools Road 31;Educationville');
INSERT INTO [dbo].[PU]([Table],[Values]) VALUES ('Campus','Campus C;Schools Road 22;Educationville');
And we have a Schemes table similar to this:
CREATE TABLE [dbo].[Schemes](
    [Table] [nvarchar](50) NOT NULL,
    [Columns] [nvarchar](max) NOT NULL
)
Insert SQL for this example:
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Person','[Name];[Age]');
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Course','[Name]');
INSERT INTO [dbo].[Schemes]([Table],[Columns]) VALUES ('Campus','[Name];[Address];[City]');
As a result, the first record of the PU table should result in a temp table like:

Column | Value
Name   | John Doe
Age    | 26

The 5th will have:

Column | Value
Name   | Mathematics

Finally, the 8th PU record should result in:

Column  | Value
Name    | Campus A
Address | Schools Road 1
City    | Educationville
You get the idea.
I tried using the following query to create the temp tables, but alas it fails when there's more than one value in the PU record:
DECLARE @Fields TABLE
(
    [Column] INT,
    [Value] VARCHAR(MAX)
)

INSERT INTO @Fields
SELECT TOP 1
    (SELECT Value FROM STRING_SPLIT([dbo].[Schemes].[Columns], ';')),
    (SELECT Value FROM STRING_SPLIT([dbo].[PU].[Values], ';'))
FROM [dbo].[PU]
INNER JOIN [dbo].[Schemes] ON [dbo].[PU].[Table] = [dbo].[Schemes].[Table]
TOP 1 correctly gets the first PU record as each PU record is removed once processed.
The error is:
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
In the case of a Person record, the splits are indeed returning 2 values/columns at a time; I just want to store the values in 2 records instead of getting an error.
Any help on rewriting the above query?
Also, do note that the data is just generic nonsense. The point of the question is having 2 fields that both contain delimited values, always equal in number (e.g. a 'Person' record in the PU table will always have 2 delimited values in the field), and breaking them up into several column/value rows.
UPDATE: Working implementation
Based on the (accepted) answer of Sean Lange, I was able to work out the following implementation to overcome the issue:
As I need to reuse it, the column/value combining functionality is performed by a new function, declared as follows:
CREATE FUNCTION [dbo].[JoinDelimitedColumnValue]
(@splitValues VARCHAR(8000), @splitColumns VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH MyValues AS
(
    SELECT ColumnPosition = x.ItemNumber,
           ColumnValue = x.Item
    FROM dbo.DelimitedSplit8K(@splitValues, @pDelimiter) x
)
, ColumnData AS
(
    SELECT ColumnPosition = x.ItemNumber,
           ColumnName = x.Item
    FROM dbo.DelimitedSplit8K(@splitColumns, @pDelimiter) x
)
SELECT cd.ColumnName,
       v.ColumnValue
FROM MyValues v
JOIN ColumnData cd ON cd.ColumnPosition = v.ColumnPosition
;
In the case of the above sample data, I'd call this function with the following SQL:
DECLARE @FieldValues VARCHAR(8000), @FieldColumns VARCHAR(8000), @Delimiter CHAR(1) = ';'

SELECT TOP 1
    @FieldValues = [dbo].[PU].[Values],
    @FieldColumns = [dbo].[Schemes].[Columns]
FROM [dbo].[PU]
INNER JOIN [dbo].[Schemes] ON [dbo].[PU].[Table] = [dbo].[Schemes].[Table]

INSERT INTO @Fields
SELECT [Column] = x.[ColumnName], [Value] = x.[ColumnValue]
FROM [dbo].[JoinDelimitedColumnValue](@FieldValues, @FieldColumns, @Delimiter) x

This data structure makes this way more complicated than it should be. You can leverage the splitter from Jeff Moden here: http://www.sqlservercentral.com/articles/Tally+Table/72993/. The main difference between that splitter and most others is that it returns the ordinal position of each element. Why the other splitters don't do this is beyond me; for things like this it is needed. You have two sets of delimited data, and you must ensure that they are both reassembled in the correct order.
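If you are on SQL Server 2022 or Azure SQL Database, the built-in STRING_SPLIT supports this too via its optional enable_ordinal argument; a quick sketch against the sample data:

SELECT s.value, s.ordinal
FROM STRING_SPLIT('John Doe;26', ';', 1) AS s;
-- value     ordinal
-- John Doe  1
-- 26        2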
The biggest issue I see is that you don't have anything in your main table to function as an anchor for ordering the results correctly. You need something, even an identity, to ensure the output rows stay "together". To accomplish this, I just added an identity to the PU table.
alter table PU add RowOrder int identity not null
Now that we have an anchor, this is still a little cumbersome for what should be a simple query, but it is achievable.
Something like this will now work.
with MyValues as
(
    select p.[Table]
         , ColumnPosition = x.ItemNumber
         , ColumnValue = x.Item
         , RowOrder
    from PU p
    cross apply dbo.DelimitedSplit8K(p.[Values], ';') x
)
, ColumnData as
(
    select ColumnName = replace(replace(x.Item, ']', ''), '[', '')
         , ColumnPosition = x.ItemNumber
         , s.[Table]
    from Schemes s
    cross apply dbo.DelimitedSplit8K(s.Columns, ';') x
)
select cd.[Table]
     , v.ColumnValue
     , cd.ColumnName
from MyValues v
join ColumnData cd on cd.[Table] = v.[Table]
    and cd.ColumnPosition = v.ColumnPosition
order by v.RowOrder
       , v.ColumnPosition

I recommend not storing values like this in the first place. I recommend having a key value in the tables, preferably not using Table and Columns as a composite key, and avoiding reserved words. I also don't know which version of SQL you are using; I am going to assume a fairly recent version of Microsoft SQL Server that will support my provided stored procedure.
Here is an overview of the solution:
1) You need to convert both the PU and the Schema table into a form where each "column" value in the list of columns is isolated in its own row. If you can store the data in this format rather than the provided format, you will be a little better off.
What I mean is
Table|Columns
Person|Jane Doe;22
needs to be converted to
Table|Column|OrderInList
Person|Jane Doe|1
Person|22|2
There are multiple ways to do this, but I prefer an XML trick that I picked up. You can find multiple split-string examples online, so I will not focus on that; use whatever gives you the best performance. Unfortunately, you might not be able to get away from needing a table-valued function.
Update:
Thanks to Shnugo's performance enhancement comment, I have updated my XML splitter to return the row number, which reduces some of my code. I do the exact same thing to the Schema list.
2) Since the new Schema table and the new PU table now record the order in which each column appears, the two can be joined on "Table" and OrderInList:
CREATE FUNCTION [dbo].[fnSplitStrings_XML]
(
    @List NVARCHAR(MAX),
    @Delimiter VARCHAR(255)
)
RETURNS TABLE
AS
RETURN
(
    SELECT y.i.value('(./text())[1]', 'nvarchar(4000)') AS Item,
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNumber
    FROM
    (
        SELECT CONVERT(XML, '<i>'
               + REPLACE(@List, @Delimiter, '</i><i>')
               + '</i>').query('.') AS x
    ) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO
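For reference, calling the splitter on one of the sample PU values returns each item with its position (node order is typically document order here):

SELECT Item, RowNumber
FROM dbo.fnSplitStrings_XML(N'John Doe;26', ';');
-- Item      RowNumber
-- John Doe  1
-- 26        2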
CREATE PROCEDURE uspGetColumnValues
AS
BEGIN
    --Split each value in PU
    select p.[Table], p.[Values], a.[Item], CHARINDEX(a.Item, p.[Values]) as LocationInStringForSorting, a.RowNumber
    into #PuWithOrder
    from PU p
    cross apply [fnSplitStrings_XML](p.[Values], ';') a --use whatever string split function works best for you (performance wise)

    --Split each value in Schema
    select s.[Table], s.[Columns], a.[Item], CHARINDEX(a.Item, s.[Columns]) as LocationInStringForSorting, a.RowNumber
    into #SchemaWithOrder
    from Schemes s
    cross apply [fnSplitStrings_XML](s.[Columns], ';') a --use whatever string split function works best for you (performance wise)

    DECLARE @Fields TABLE --If this is an ETL process, maybe make this a permanent table with an auto-incrementing Id and reference this table in all steps after this.
    (
        [Table] NVARCHAR(50),
        [Columns] NVARCHAR(MAX),
        [Column] VARCHAR(MAX),
        [Value] VARCHAR(MAX),
        OrderInList int
    )

    INSERT INTO @Fields([Table], [Columns], [Column], [Value], OrderInList)
    select pu.[Table], pu.[Values] as [Columns], s.Item as [Column], pu.Item as [Value], pu.RowNumber
    from #PuWithOrder pu
    join #SchemaWithOrder s on pu.[Table] = s.[Table] and pu.RowNumber = s.RowNumber

    select [Table], [Columns], [Column], [Value], OrderInList
    from @Fields
    order by [Table], [Columns], OrderInList
END
GO
EXEC uspGetColumnValues
GO
Update:
Since your working implementation is a table-valued function, I have another recommendation. The problem I see is that you're using a table-valued function, which ultimately handles one record at a time. You are going to get better performance with set-based operations, batching as needed; with a table-valued function, you are likely going to be looping through each row. If this is some sort of ETL process, your team will be better off with a stored procedure that processes the rows in bulk. It might make sense to stage the results into a better table that your team can work with downstream, rather than have them use a potentially slow table-valued function. A sketch of such a staging table follows.
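A sketch of that staging idea (hypothetical table name and columns; adjust to your data):

CREATE TABLE dbo.FieldStaging
(
    Id INT IDENTITY(1, 1) PRIMARY KEY, -- anchor for ordering downstream
    [Table] NVARCHAR(50) NOT NULL,
    [Column] NVARCHAR(128) NOT NULL,
    [Value] NVARCHAR(MAX) NULL,
    OrderInList INT NOT NULL
);

-- The stored procedure above could then INSERT its final SELECT into this
-- table in one set-based statement, instead of callers invoking a
-- table-valued function row by row.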

Related

Query tuning required for expensive query

Can someone help me optimize this code? Another way to optimize it would be a computed column, but we cannot change the schema on prod, as we are not sure how many APIs are used to push data into this table. The table has millions of rows, and adding a non-clustered index is not helping; due to the query cost it still goes for a scan.
create table testcts(
    name varchar(100)
)
go

insert into testcts(name)
select 'VK.cts.com'
union
select 'GK.ms.com'
go

DECLARE @list varchar(100) = 'VK,GK'

select *
from testcts
where replace(replace(name, '.cts.com', ''), '.ms.com', '') in (select value from string_split(@list, ','))

drop table testcts
One possibility might be to strip off the .cts.com and .ms.com subdomain/domain endings before you insert or store the name data in your table. Then, use the following query instead:
SELECT *
FROM testcts
WHERE name IN (SELECT value FROM STRING_SPLIT(@list, ','));
Now SQL Server should be able to use an index on the name column.
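For example, a plain nonclustered index (hypothetical name) would support that lookup:

CREATE NONCLUSTERED INDEX IX_testcts_name ON dbo.testcts (name);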
If your values are always suffixed by cts.com or ms.com, you could add that to the search pattern:
SELECT {YourColumns} --Don't use *
FROM dbo.testcts t
JOIN (SELECT CONCAT(SS.[value], V.Suffix) AS [value]
      FROM STRING_SPLIT(@list, ',') SS
      CROSS APPLY (VALUES ('.cts.com'),
                          ('.ms.com')) V (Suffix)) L ON t.[name] = L.[value];

How to create a high-performance SQL select query if I need a condition on a referred table's records?

For example, I have 2 tables that I need for my query: Property, and Move (a history of property moves).
I must create a query which will return all properties plus 1 additional boolean column, IsInService, which will be true whenever the Move table has a record for the property with DateTo = null and MoveTypeID = 1 ("In service").
I have created this query:
SELECT
    [ID], [Name],
    (SELECT COUNT(*)
     FROM [Move]
     WHERE PropertyID = p.ID
       AND DateTo IS NULL
       AND MoveTypeID = 1) AS IsInService
FROM
    [Property] as p
ORDER BY
    [Name] ASC
OFFSET 100500 ROWS FETCH NEXT 50 ROWS ONLY;
I'm not so strong in SQL, but as far as I know, subqueries are evil :)
How do I create a high-performance SQL query in my case, given that these tables are expected to contain millions of records?
I've updated the code based on your comment. If you need something else, please provide the expected input and output data. This is about all I can do based on inference from the existing comments. Further, this isn't intended to give you an exact working solution; my intention was to give you a prototype from which you can build your solution.
That said:
The code below is the basic join that you need. However, keep in mind that indexing is probably going to play as big a part in performance as the structure of the table and the query. It doesn't matter how you query the tables if the indexes aren't there to support the queries once you reach a certain size. There are a lot of resources online for indexing, but viewing query plans should be at the top of your list.
As a note, your column [dbo].[Property] ([Name]) should probably be NVARCHAR to allow SQL to minimize data storage. Indexes on that column will then be smaller and searches/updates faster.
DECLARE @Property AS TABLE
(
    [ID] INT
    , [Name] NVARCHAR(100)
);

INSERT INTO @Property
    ([ID]
    , [Name])
VALUES (1, N'A'),
       (2, N'B'),
       (3, N'C');

DECLARE @Move AS TABLE
(
    [ID] INT
    , [DateTo] DATE
    , [MoveTypeID] INT
    , [PropertyID] INT
);

INSERT INTO @Move
    ([ID]
    , [DateTo]
    , [MoveTypeID]
    , [PropertyID])
VALUES (1, NULL, 1, 1),
       (2, NULL, 1, 2),
       (3, N'2017-12-07', 1, 2);
SELECT [Property].[ID] AS [property_id]
     , [Property].[Name] AS [property_name]
     , CASE
           WHEN [Move].[PropertyID] IS NOT NULL THEN
               N'true'
           ELSE
               N'false'
       END AS [in_service]
FROM @Property AS [Property]
LEFT JOIN @Move AS [Move]
    ON [Move].[PropertyID] = [Property].[ID]
    -- these predicates belong in the ON clause; in a WHERE clause they would
    -- turn the LEFT JOIN into an INNER JOIN and drop the 'false' rows
    AND [Move].[DateTo] IS NULL
    AND [Move].[MoveTypeID] = 1;
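On indexing, a sketch with hypothetical names (assuming real Property and Move tables rather than the table variables above): an index covering the join predicates lets this query seek rather than scan.

CREATE NONCLUSTERED INDEX IX_Move_Property_InService
ON dbo.Move (PropertyID, MoveTypeID, DateTo);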

Downsides to MERGE with dummy USING table?

I'm creating some stored procedures specifically for use from C# to update various tables in our database. A large number of items require a predictable function that will:
1) Check if a matching row already exists
2) If it doesn't exist, insert data
3) Gather ID of row and return to user
I know this can be done in a number of ways, but the most elegant way I can imagine seems to be using a MERGE with a dummy table and using the procedure params for the ON clause, such as:
CREATE PROCEDURE dbo.UpdatePerson(@PersonID INT, @FirstName VARCHAR(50)) AS
MERGE dbo.Person p
USING (SELECT 1 One) One
ON p.Person_ID = @PersonID
WHEN MATCHED THEN
    UPDATE SET First_Name = @FirstName
WHEN NOT MATCHED THEN
    INSERT (Person_ID, First_Name) VALUES (@PersonID, @FirstName);
This wraps it all together in one nice bundle, even though I'm not working with an actual table to merge in. I know the same basic idea could be accomplished with:
...
USING (SELECT @PersonID Person_ID, @FirstName First_Name) NewPerson
ON p.Person_ID = NewPerson.Person_ID
...
and maybe this would offer some kind of performance benefit?
Can anyone offer any solid reasons for/against this kind of usage of MERGE?
Instead of using MERGE, you can use an IF condition.
Say you have a temp table:
CREATE TABLE #Table(PersonID INT, First_Name VARCHAR(100))
-- BEFORE THAT, INSERT INTO THE TEMP TABLE

IF EXISTS(SELECT 1 FROM YOURTABLE WHERE PERSONID IN (SELECT PERSONID FROM #TABLE))
BEGIN
    -------YOUR UPDATE QUERY
END
ELSE
BEGIN
    -------INSERT QUERY
END

DROP TABLE #Table
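Applied to the Person example from the question, a minimal single-row sketch of the same pattern would be (note it is not atomic; wrap it in a transaction with appropriate locking if concurrent callers are possible):

IF EXISTS (SELECT 1 FROM dbo.Person WHERE Person_ID = @PersonID)
BEGIN
    UPDATE dbo.Person SET First_Name = @FirstName WHERE Person_ID = @PersonID;
END
ELSE
BEGIN
    INSERT INTO dbo.Person (Person_ID, First_Name) VALUES (@PersonID, @FirstName);
END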

Using LIKE (or any other method) to link records that differ by only a few letters in SQL Server

I have two tables that both contain names of villages. They are not in English, and since the sources of the records are different, they were filled in differently. For instance, a village's name is 'ABCDEFGHIJK' in Table A and 'ABCDEFGH-IJK' in Table B. The difference is one or two letters; in addition, the state and zone of the villages are included in both tables, so the probability of two similar village names in the same zone is quite low. However, the names do not match 100%. What would you suggest to link those records?
I have a main table whose data is correct, and I'm using it as the index file; there is also Table 2, which includes the data of each village. BUT!!! The names of the villages in Table 2 are not filled in correctly.
So, I need to fill in the correct data.
If your suggestion includes a SQL query, that would be much appreciated.
Thanks. :)
First, I think you should remove the special characters in the Village column in Table2.
Then, compare the two tables based on the Village column (and others if available):
CREATE FUNCTION [dbo].[RemoveNonAlphaCharacters](@Temp VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
    DECLARE @KeepValues AS VARCHAR(100)
    SET @KeepValues = '%[^a-z]%' -- or '%[^a-z0-9]%' if it includes numerics
    WHILE PATINDEX(@KeepValues, @Temp) > 0
        SET @Temp = STUFF(@Temp, PATINDEX(@KeepValues, @Temp), 1, '')
    RETURN @Temp
END
SELECT T1.*, T2.Population
FROM Table1 T1
JOIN Table2 T2
    ON T1.State = T2.State -- if present
    AND T1.Zone = T2.Zone -- if present
    AND T1.Name = dbo.RemoveNonAlphaCharacters(T2.Name)
You can create a noise table that will hold all those noisy characters (like *, /, etc.).
The table schema would be somewhat like:
DECLARE @tblNoise TABLE (Noise VARCHAR(4), ReplaceValues VARCHAR(4))
Then clean the Village values in Table 2 using the noise table.
Once that is done, you can perform the match between the two tables.
For cleaning the noise, you can take the help of a UDF, as in the sketch below.
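A minimal sketch of that cleanup (assuming @tblNoise above is populated; the SELECT-assignment applies each replacement in turn, a common shortcut even though SQL Server does not formally guarantee its row order):

DECLARE @Village VARCHAR(1000) = 'ABCDEFGH-IJK*';
-- hypothetical noise entries
INSERT INTO @tblNoise (Noise, ReplaceValues) VALUES ('-', ''), ('*', ''), ('/', '');

-- apply every replacement from the noise table to the value
SELECT @Village = REPLACE(@Village, Noise, ReplaceValues) FROM @tblNoise;

SELECT @Village; -- ABCDEFGHIJK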
Also, you can use SSIS for your operation (if you have such an option at your disposal).
Hope this helps.

If the contents of a database table cell is a list, how do I check if the list contains a specific value in T-SQL?

I have a SQL Server database table with a column called resources. Each cell of this column contains a comma-delimited list of integers. So the table data might look like this:
Product_ID | Resources   | Condition
1          | 12,4,253    | New
2          | 4,98,102,99 | New
3          | 245,88      | Used
etc....
I want to return the rows where a resource ID number is contained in the resources column. Something like this (this doesn't work, but it shows the intent):
SELECT *
FROM product_table
WHERE resources CONTAINS 4
If this was working, it would return the rows for product_id 1 and 2 because both of the resources cells in those rows contain the value 4. It would not return product_id 3, even though the resources cell for that row has the number 4 in it, because it's not the full comma-delimited value.
What is the correct way to do this?
Use the Split function as outlined in this resource:
CREATE FUNCTION Split
(
    @delimited nvarchar(max),
    @delimiter nvarchar(100)
) RETURNS @t TABLE
(
    -- Id column can be commented out, not required for splitting the string
    id int identity(1,1), -- I use this column for numbering the split parts
    val nvarchar(max)
)
AS
BEGIN
    declare @xml xml
    set @xml = N'<root><r>' + replace(@delimited, @delimiter, '</r><r>') + '</r></root>'

    insert into @t(val)
    select r.value('.', 'nvarchar(max)') as item -- nvarchar(max), so longer items are not truncated
    from @xml.nodes('//root/r') as records(r)

    RETURN
END
GO
-- Create the test table and insert the test data
create table #test
(
    product_id int,
    resources nvarchar(max),
    condition nvarchar(10)
);

insert into #test (product_id, resources, condition)
select 1, '12,4,253', 'new'
union
select 2, '4,98,102,99', 'new'
union
select 3, '245,88', 'used';

-- Use the Split function and cross apply to grab the data you need
select product_id, val, condition
from #test
cross apply dbo.split(#test.resources, ',') split
where val = '4' -- split returns a string so use single quotes
You could do it like this...
select *
from product_table
where ',' + resources + ',' like '%,4,%'
but this probably won't use an index so it'll be slow if the table is large.
A better solution, if possible, is to normalize: add an extra table with product_id and resource_id, holding values like (1,12), (1,4), (1,253), (2,4), etc. It would be much faster because it would use indexes. A sketch of that design follows below.
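A sketch of that normalized design (hypothetical table and column names):

CREATE TABLE dbo.product_resource
(
    product_id INT NOT NULL,
    resource_id INT NOT NULL,
    PRIMARY KEY (product_id, resource_id) -- also gives you the index for lookups
);

-- The lookup then becomes a simple indexed join:
SELECT p.*
FROM product_table p
JOIN dbo.product_resource pr ON pr.product_id = p.product_id
WHERE pr.resource_id = 4;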
First of all, if you have any control over the schema, the actually correct way to do this is to have a many-to-many table of resources, so that you don't need to have a comma-separated list.
Barring that, you'd need a set of LIKE cases joined with OR, to deal with the different cases where the item you want is the first item, the last item, one of the middle items, or the only item, as in the sketch below.
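Spelled out, those OR'd LIKE cases would look like this sketch:

SELECT *
FROM product_table
WHERE resources LIKE '4,%'   -- first item
   OR resources LIKE '%,4,%' -- middle item
   OR resources LIKE '%,4'   -- last item
   OR resources = '4';       -- only item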
Or use the split function as described here
and call it like this:
select * from Products where exists (select * from dbo.Split(',', Resources) where s = '4')
