T-SQL : Cleaning up data, merging rows into columns - sql-server

I'm trying to clean up some Active Directory data in SQL Server. I have managed to read the raw LFD file into a table. Now I need to clean up some attributes where values are spread out over multiple rows. I can identify records that need to be appended to the prior record by the fact the have a leading space.
Example:
ID Record IsPartOfPrior
3114 memberOf: 0
3115 CN=Sharepoint-Members-home 1
3116 memberOf: 0
3117 This is 1
3118 part of the 1
3119 next line. 1
Ultimately, I would like to have the following table generated:
ID Record
3114 memberOf:CN=Sharepoint-Members-home
3116 memberOf:This is part of the next line
I could write it through a cursor, setting variables, working with temp tables and populating a table. But there has to be a set based (maybe recursive?) approach to this?
I could use the STUFF method to combine various rows together, but how am I about to group the various sets together? I'm thinking that I first have to define groupID's per record, and then stuff them together per groupID?
Thanks for any help.

Batch with comments below. Should work starting with SQL Server 2008.
--Here I emulate your table
DECLARE #yourtable TABLE (ID int, Record nvarchar(max), IsPartOfPrior bit)
INSERT INTO #yourtable VALUES
(3114,'memberOf:',0),(3115,'CN=Sharepoint-Members-home',1),(3116,'memberOf:',0),(3117,'This is',1),(3118,'part of the',1),(3119,'next line.',1)
--Getting max ID
DECLARE #max_id int
SELECT #max_id = MAX(ID)+1
FROM #yourtable
--We get next prior for each prior record
--And use STUFF and FOR XML PATH to build new Record
SELECT y.ID,
y.Record + b.Record as Record
FROM #yourtable y
OUTER APPLY (
SELECT TOP 1 ID as NextPrior
FROM #yourtable
WHERE IsPartOfPrior = 0 and y.ID < ID
ORDER BY ID ASC
) as t
OUTER APPLY (
SELECT STUFF((
SELECT ' '+Record
FROM #yourtable
WHERE ID > y.ID and ID < ISNULL(t.NextPrior,#max_id)
ORDER BY id ASC
FOR XML PATH('')
),1,1,'') as Record
) as b
WHERE y.IsPartOfPrior = 0
The output:
ID Record
----------- -----------------------------------------
3114 memberOf:CN=Sharepoint-Members-home
3116 memberOf:This is part of the next line.
This will work if ID are numeric and ascending.

Yet another option if 2012+
Example
Declare #YourTable Table ([ID] int,[Record] varchar(50),[IsPartOfPrior] int)
Insert Into #YourTable Values
(3114,'memberOf:',0)
,(3115,'CN=Sharepoint-Members-home',1)
,(3116,'memberOf:',0)
,(3117,'This is',1)
,(3118,'part of the',1)
,(3119,'next line.',1)
;with cte as (
Select *,Grp = sum(IIF([IsPartOfPrior]=0,1,0)) over (Order By ID)
From #YourTable
)
Select ID
,Record = Stuff((Select ' ' +Record From cte Where Grp=A.Grp Order by ID For XML Path ('')),1,1,'')
From (Select Grp,ID=min(ID)from cte Group By Grp ) A
Returns
ID Record
3114 memberOf: CN=Sharepoint-Members-home
3116 memberOf: This is part of the next line.
If it Helps with the Visualization, the cte Produces:
ID Record IsPartOfPrior Grp << Notice Grp Values
3114 memberOf: 0 1
3115 CN=Sharepoint-Members-home 1 1
3116 memberOf: 0 2
3117 This is 1 2
3118 part of the 1 2
3119 next line. 1 2

Related

Compare the two tables and update the value in a Flag column

I have two tables and the values like this
`create table InputLocationTable(SKUID int,InputLocations varchar(100),Flag varchar(100))
create table Location(SKUID int,Locations varchar(100))
insert into InputLocationTable(SKUID,InputLocations) values(11,'Loc1, Loc2, Loc3, Loc4, Loc5, Loc6')
insert into InputLocationTable(SKUID,InputLocations) values(12,'Loc1, Loc2')
insert into InputLocationTable(SKUID,InputLocations) values(13,'Loc4,Loc5')
insert into Location(SKUID,Locations) values(11,'Loc3')
insert into Location(SKUID,Locations) values(11,'Loc4')
insert into Location(SKUID,Locations) values(11,'Loc5')
insert into Location(SKUID,Locations) values(11,'Loc7')
insert into Location(SKUID,Locations) values(12,'Loc10')
insert into Location(SKUID,Locations) values(12,'Loc1')
insert into Location(SKUID,Locations) values(12,'Loc5')
insert into Location(SKUID,Locations) values(13,'Loc4')
insert into Location(SKUID,Locations) values(13,'Loc2')
insert into Location(SKUID,Locations) values(13,'Loc2')`
I need to get the output by matching SKUID's from Each tables and Update the value in Flag column as shown in the screenshot, I have tried something like this code
`SELECT STUFF((select ','+ Data.C1
FROM
(select
n.r.value('.', 'varchar(50)') AS C1
from InputLocation as T
cross apply (select cast('<r>'+replace(replace(Location,'&','&'), ',', '</r><r>')+'</r>' as xml)) as S(XMLCol)
cross apply S.XMLCol.nodes('r') as n(r)) DATA
WHERE data.C1 NOT IN (SELECT Location
FROM Location) for xml path('')),1,1,'') As Output`
But not convinced with output and also i am trying to avoid xml path code, because performance is not first place for this code, I need the output like the below screenshot. Any help would be greatly appreciated.
I think you need to first look at why you think the XML approach is not performing well enough for your needs, as it has actually been shown to perform very well for larger input strings.
If you only need to handle input strings of up to either 4000 or 8000 characters (non max nvarchar and varchar types respectively), you can utilise a tally table contained within an inline table valued function which will also perform very well. The version I use can be found at the end of this post.
Utilising this function we can split out the values in your InputLocations column, though we still need to use for xml to concatenate them back together for your desired format:
-- Define data
declare #InputLocationTable table (SKUID int,InputLocations varchar(100),Flag varchar(100));
declare #Location table (SKUID int,Locations varchar(100));
insert into #InputLocationTable(SKUID,InputLocations) values (11,'Loc1, Loc2, Loc3, Loc4, Loc5, Loc6'),(12,'Loc1, Loc2'),(13,'Loc4,Loc5'),(14,'Loc1');
insert into #Location(SKUID,Locations) values (11,'Loc3'),(11,'Loc4'),(11,'Loc5'),(11,'Loc7'),(12,'Loc10'),(12,'Loc1'),(12,'Loc5'),(13,'Loc4'),(13,'Loc2'),(13,'Loc2'),(14,'Loc1');
--Query
-- Derived table splits out the values held within the InputLocations column
with i as
(
select i.SKUID
,i.InputLocations
,s.item as Loc
from #InputLocationTable as i
cross apply dbo.fn_StringSplit4k(replace(i.InputLocations,' ',''),',',null) as s
)
select il.SKUID
,il.InputLocations
,isnull('Add ' -- The split Locations are then matched to those already in #Location and those not present are concatenated together.
+ stuff((select ', ' + i.Loc
from i
left join #Location as l
on i.SKUID = l.SKUID
and i.Loc = l.Locations
where il.SKUID = i.SKUID
and l.SKUID is null
for xml path('')
)
,1,2,''
)
,'No Flag') as Flag
from #InputLocationTable as il
order by il.SKUID;
Output:
+-------+------------------------------------+----------------------+
| SKUID | InputLocations | Flag |
+-------+------------------------------------+----------------------+
| 11 | Loc1, Loc2, Loc3, Loc4, Loc5, Loc6 | Add Loc1, Loc2, Loc6 |
| 12 | Loc1, Loc2 | Add Loc2 |
| 13 | Loc4,Loc5 | Add Loc5 |
| 14 | Loc1 | No Flag |
+-------+------------------------------------+----------------------+
For nvarchar input (I have different functions for varchar and max type input) this is my version of the string splitting function linked above:
create function [dbo].[fn_StringSplit4k]
(
#str nvarchar(4000) = ' ' -- String to split.
,#delimiter as nvarchar(1) = ',' -- Delimiting value to split on.
,#num as int = null -- Which value in the list to return. NULL returns all.
)
returns table
as
return
-- Start tally table with 10 rows.
with n(n) as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)
-- Select the same number of rows as characters in #str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest #str length.
,t(t) as (select top (select len(isnull(#str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that follows the specified delimiter.
,s(s) as (select 1 union all select t+1 from t where substring(isnull(#str,''),t,1) = #delimiter)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
,l(s,l) as (select s,isnull(nullif(charindex(#delimiter,isnull(#str,''),s),0)-s,4000) from s)
select rn
,item
from(select row_number() over(order by s) as rn
,substring(#str,s,l) as item
from l
) a
where rn = #num
or #num is null;
go

SQL stored procedure for picking a random sample based on multiple criteria

I am new to SQL. I looked for all over the internet for a solution that matches the problem I have but I couldn't find any. I have a table named 'tblItemReviewItems' in an SQL server 2012.
tblItemReviewItems
Information:
1. ItemReviewId column is the PK.
2. Deleted column will have only "Yes" and "No" value.
3. Audited column will have only "Yes" and "No" value.
I want to create a stored procedure to do the followings:
Pick a random sample of 10% of all ItemReviewId for distinct 'UserId' and distinct 'ReviewDate' in a given date range. 10% sample should include- 5% of the total population from Deleted (No) and 5% of the total population from Deleted (Yes). Audited ="Yes" will be excluded from the sample.
For example – A user has 118 records. Out of the 118 records, 17 records have Deleted column value "No" and 101 records have Deleted column value "Yes". We need to pick a random sample of 12 records. Out of those 12 records, 6 should have Deleted column value "No" and 6 should have Deleted column value "Yes".
Update Audited column value to "Check" for the picked sample.
How can I achieve this?
This is the stored procedure I used to pick a sample of 5% of Deleted column value "No" and 5% of Deleted column value "Yes". Now the situation is different.
ALTER PROC [dbo].[spItemReviewQcPickSample]
(
#StartDate Datetime
,#EndDate Datetime
)
AS
BEGIN
WITH CTE
AS (SELECT ItemReviewId
,100.0
*row_number() OVER(PARTITION BY UserId
,ReviewDate
,Deleted
order by newid()
)
/count(*) OVER(PARTITION BY UserId
,Reviewdate
,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE ReviewDate BETWEEN #StartDate AND #EndDate
AND Deleted in ('Yes','No')
AND Audited='No'
)
SELECT a.*
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=6
;
WITH CTE
AS (SELECT ItemReviewId
,100.00
*row_number() OVER(PARTITION BY UserId
,ReviewDate
,Deleted
ORDER BY newid()
)
/COUNT(*) OVER(PARTITION BY UserId
,Reviewdate
,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE ReviewDate BETWEEN #StartDate AND #EndDate
AND deleted IN ('Yes','No')
AND audited='No'
)
UPDATE a
SET Audited='Check'
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=6
;
END
Any help would be highly appreciated. Thanks in advance.
This may assist you in getting started. My idea is, you create the temp tables you need, and load the specific data into the (deleted, not deleted etc.). You then run something along the lines of:
IF OBJECT_ID('tempdb..#tmpTest') IS NOT NULL DROP TABLE #tmpTest
GO
CREATE TABLE #tmpTest
(
ID INT ,
Random_Order INT
)
INSERT INTO #tmpTest
(
ID
)
SELECT 1 UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5 UNION ALL
SELECT 6 UNION ALL
SELECT 7 UNION ALL
SELECT 8 UNION ALL
SELECT 9 UNION ALL
SELECT 10 UNION ALL
SELECT 11 UNION ALL
SELECT 12 UNION ALL
SELECT 13 UNION ALL
SELECT 14 UNION ALL
SELECT 15 UNION ALL
SELECT 16;
DECLARE #intMinID INT ,
#intMaxID INT;
SELECT #intMinID = MIN(ID)
FROM #tmpTest;
SELECT #intMaxID = MAX(ID)
FROM #tmpTest;
WHILE #intMinID <= #intMaxID
BEGIN
UPDATE #tmpTest
SET Random_Order = 10 + CONVERT(INT, (30-10+1)*RAND())
WHERE ID = #intMinID;
SELECT #intMinID = #intMinID + 1;
END
SELECT TOP 5 *
FROM #tmpTest
ORDER BY Random_Order;
This assigns a random number to a column, that you then use in conjunction with a TOP 5 clause, to get a random top 5 selection.
Appreciate a loop may not be efficient, but you may be able to update to a random number without it, and the same principle could be implemented. Hope that gives you some ideas.

How to check for a specific condition by looping through every record in SQL Server?

I do have following table
ID Name
1 Jagan Mohan Reddy868
2 Jagan Mohan Reddy869
3 Jagan Mohan Reddy
Name column size is VARCHAR(55).
Now for some other task we need to take only 10 varchar length i.e. VARCHAR(10).
My requirement is to check that after taking the only 10 bits length of Name column value for eg if i take Name value of ID 1 i.e. Jagan Mohan Reddy868 by SUBSTRING(Name, 0,11) if it equals with another row value. here in this case the final value of SUBSTRING(Jagan Mohan Reddy868, 0,11) is equal to Name value of ID 3 row whose Name is 'Jagan Mohan Reddy'. I need to make a list of those kind rows. Can somebody help me out on how can i achieve in SQL Server.
My main check is that the truncated values of my Name column should not match with any non truncated values of Name column. If so i need to get those records.
Assuming I understand the question, I think you are looking for something like this:
Create and populate sample data (Please save us this step in your future questions)
DECLARE #T as TABLE
(
Id int identity(1,1),
Name varchar(15)
)
INSERT INTO #T VALUES
('Hi, I am Zohar.'),
('Hi, I am Peled.'),
('Hi, I am Z'),
('I''m Zohar peled')
Use a cte with a self inner join to get the list of ids that match the first 10 chars:
;WITH cte as
(
SELECT T2.Id As Id1, T1.Id As Id2
FROM #T T1
INNER JOIN #T T2 ON LEFT(T1.Name, 10) = t2.Name AND T1.Id <> T2.Id
)
Select the records from the original table, inner joined with a union of the Id1 and Id2 from the cte:
SELECT T.Id, Name
FROM #T T
INNER JOIN
(
SELECT Id1 As Id
FROM CTE
UNION
SELECT Id2
FROM CTE
) U ON T.Id = U.Id
Results:
Id Name
----------- ---------------
1 Hi, I am Zohar.
3 Hi, I am Z
Try this
SELECT Id,Name
FROM(
SELECT *,ROW_NUMBER() OVER(PARTITION BY Name, LEFT(Name,11) ORDER BY ID) RN
FROM Tbale1 T
) Tmp
WHERE Tmp.RN = 1
loop over your column for all the values and put your substring() function inside this loop and I think in Sql index of string starts from 1 instead of 0. If you pass your string to charindex() like this
CHARINDEX('Y', 'Your String')
thus you will come to know whether it is starting from 0 or 1
and you can save your substring value as value of other column with length 10
I hope it will help you..
I think this should cover all the cases you are looking for.
-- Create Table
DECLARE #T as TABLE
(
Id int identity(1,1),
Name varchar(55)
)
-- Create Data
INSERT INTO #T VALUES
('Jagan Mohan Reddy868'),
('Jagan Mohan Reddy869'),
('Jagan Mohan Reddy'),
('Mohan Reddy'),
('Mohan Reddy123551'),
('Mohan R')
-- Get Matching Items
select *, SUBSTRING(name, 0, 11) as ShorterName
from #T
where SUBSTRING(name, 0, 11) in
(
-- get all shortnames with a count > 1
select SUBSTRING(name, 0, 11) as ShortName
from #T
group by SUBSTRING(name, 0, 11)
having COUNT(*) > 1
)
order by Name, LEN(Name)

Change value of column in whole group

Given Example
A group is defined via the GRP_ID and the GRP_MAIN set. Orange and green are examples of what it should get. Blue is what we have.
All with the same NAME should be grouped to have the same UNIQUE GRP_ID (unique is important, because there are already some data grouped together) and the MAX or TOP or AVG record should be Marked as GRP_MAIN.
So my first question is, how can I set a value to a whole group
column, so I can after group by NAME do GRP_ID = 1
Secound is, how can I check all existing GRP_ID against a number?
Third, how to pick the next free number of all existing in GRP_ID?
Just copy this into an empty query window and execute... adapt to your needs...
You should use this query once to shuffle your data into a new table with columns für GRP_ID and GRP_MAIN. GRP_ID should be IDENTITY...
DECLARE #tbl TABLE(VALUE INT, NAME VARCHAR(10));
INSERT INTO #tbl VALUES
(12,'ab')
,(1,'ab')
,(2,'ab')
,(34,'ab')
,(5,'ab')
,(6,'ab')
,(3,'fg')
,(45,'fg')
,(65,'fg')
,(2,'ht')
,(3,'ht')
,(44,'hh')
,(5,'hh')
,(6,'hh')
,(7,'hh');
WITH DistinctNames AS
(
SELECT ROW_NUMBER() OVER(ORDER BY Name) AS Inx, x.NAME
FROM
(
SELECT DISTINCT NAME
FROM #tbl
) AS x
)
SELECT dn.Inx AS GRP_ID
,CASE WHEN tbl.Value=MainRec.VALUE THEN 1 ELSE 0 END AS GRP_MAIN
,tbl.VALUE
,tbl.NAME
FROM #tbl AS tbl
INNER JOIN DistinctNames AS dn ON tbl.NAME=dn.NAME
CROSS APPLY(SELECT TOP 1 x.* FROM #tbl AS x WHERE x.NAME=dn.NAME ORDER BY VALUE DESC) AS MainRec

How do I get the "Next available number" from an SQL Server? (Not an Identity column)

Technologies: SQL Server 2008
So I've tried a few options that I've found on SO, but nothing really provided me with a definitive answer.
I have a table with two columns, (Transaction ID, GroupID) where neither has unique values. For example:
TransID | GroupID
-----------------
23 | 4001
99 | 4001
63 | 4001
123 | 4001
77 | 2113
2645 | 2113
123 | 2113
99 | 2113
Originally, the groupID was just chosen at random by the user, but now we're automating it. Thing is, we're keeping the existing DB without any changes to the existing data(too much work, for too little gain)
Is there a way to query "GroupID" on table "GroupTransactions" for the next available value of GroupID > 2000?
I think from the question you're after the next available, although that may not be the same as max+1 right? - In that case:
Start with a list of integers, and look for those that aren't there in the groupid column, for example:
;WITH CTE_Numbers AS (
SELECT n = 2001
UNION ALL
SELECT n + 1 FROM CTE_Numbers WHERE n < 4000
)
SELECT top 1 n
FROM CTE_Numbers num
WHERE NOT EXISTS (SELECT 1 FROM MyTable tab WHERE num.n = tab.groupid)
ORDER BY n
Note: you need to tweak the 2001/4000 values int the CTE to allow for the range you want. I assumed the name of your table to by MyTable
select max(groupid) + 1 from GroupTransactions
The following will find the next gap above 2000:
SELECT MIN(t.GroupID)+1 AS NextID
FROM GroupTransactions t (updlock)
WHERE NOT EXISTS
(SELECT NULL FROM GroupTransactions n WHERE n.GroupID=t.GroupID+1 AND n.GroupID>2000)
AND t.GroupID>2000
There are always many ways to do everything. I resolved this problem by doing like this:
declare #i int = null
declare #t table (i int)
insert into #t values (1)
insert into #t values (2)
--insert into #t values (3)
--insert into #t values (4)
insert into #t values (5)
--insert into #t values (6)
--get the first missing number
select #i = min(RowNumber)
from (
select ROW_NUMBER() OVER(ORDER BY i) AS RowNumber, i
from (
--select distinct in case a number is in there multiple times
select distinct i
from #t
--start after 0 in case there are negative or 0 number
where i > 0
) as a
) as b
where RowNumber <> i
--if there are no missing numbers or no records, get the max record
if #i is null
begin
select #i = isnull(max(i),0) + 1 from #t
end
select #i
In my situation I have a system to generate message numbers or a file/case/reservation number sequentially from 1 every year. But in some situations a number does not get use (user was testing/practicing or whatever reason) and the number was deleted.
You can use a where clause to filter by year if all entries are in the same table, and make it dynamic (my example is hardcoded). if you archive your yearly data then not needed. The sub-query part for mID and mID2 must be identical.
The "union 0 as seq " for mID is there in case your table is empty; this is the base seed number. It can be anything ex: 3000000 or {prefix}0000. The field is an integer. If you omit " Union 0 as seq " it will not work on an empty table or when you have a table missing ID 1 it will given you the next ID ( if the first number is 4 the value returned will be 5).
This query is very quick - hint: the field must be indexed; it was tested on a table of 100,000+ rows. I found that using a domain aggregate get slower as the table increases in size.
If you remove the "top 1" you will get a list of 'next numbers' but not all the missing numbers in a sequence; ie if you have 1 2 4 7 the result will be 3 5 8.
set #newID = select top 1 mID.seq + 1 as seq from
(select a.[msg_number] as seq from [tblMSG] a --where a.[msg_date] between '2023-01-01' and '2023-12-31'
union select 0 as seq ) as mID
left outer join
(Select b.[msg_number] as seq from [tblMSG] b --where b.[msg_date] between '2023-01-01' and '2023-12-31'
) as mID2 on mID.seq + 1 = mID2.seq where mID2.seq is null order by mID.seq
-- Next: a statement to insert a row with #newID immediately in tblMSG (in a transaction block).
-- Then the row can be updated by your app.

Resources