Remove Partial Duplicate Rows in SQL Server 2016 - sql-server

I have a data set where some column values match but the rest do not. I need to delete the duplicates where a lower-level SubCategory (Level 2, Level 3 or Level 4) IS NOT NULL but the corresponding "duplicate partner" (grouped by [SubCategory Level 1 ID], [Product Category] and [Product Name]) has that same lower-level SubCategory IS NULL. Per the table below, I need to remove RowIDs 2, 4, 6 and 9.
I've tried the DENSE_RANK, RANK and ROW_NUMBER functions with PARTITION BY, but that did not give me the desired output. Maybe I need to use a combination of them...
E.g.: RowIDs 1 and 2 are duplicates by [Product Category], [Product Name] and [Category Level 1] ("Category Level 1" is just an ID of "Product Category"). I need to remove RowID 2 because its duplicate partner, RowID 1, has no "Category Level 3" assigned while RowID 2 does. The same logic applies to RowIDs 9 and 10, except that this time RowID 9 has a "Category Level 2" where RowID 10 does not. If both duplicates (RowIDs 1 and 2) had a "Category Level 3" assigned, we would not need to delete either of them.
IF OBJECT_ID('tempdb..#Category', 'U') IS NOT NULL
DROP TABLE #Category;
GO
CREATE TABLE #Category
(
RowID INT NOT NULL,
CategoryID INT NOT NULL,
ProductCategory VARCHAR(100) NOT NULL,
ProductName VARCHAR(100) NOT NULL,
[SubCategory Level 1 ID] INT NOT NULL,
[SubCategory Level 2 ID] INT NULL,
[SubCategory Level 3 ID] INT NULL,
[SubCategory Level 4 ID] INT NULL
);
INSERT INTO #Category (RowID, CategoryID, ProductCategory, ProductName, [SubCategory Level 1 ID], [SubCategory Level 2 ID], [SubCategory Level 3 ID], [SubCategory Level 4 ID])
VALUES
(1, 111, 'Furniture', 'Table', 200, 111, NULL, NULL),
(2, 234, 'Furniture', 'Table', 200, 234, 123, NULL),
(3, 122, 'Furniture', 'Chair', 200, 122, NULL, NULL),
(4, 122, 'Furniture', 'Chair', 200, 122, 32, NULL),
(5, 12, 'Auto', 'Trucks', 300, 766, 12, NULL),
(6, 3434, 'Auto', 'Trucks', 300, 322, 3434, 333),
(7, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(8, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(9, 300, 'Auto', 'Sedans', 300, 231, NULL, NULL),
(10, 300, 'Auto', 'Sedans', 300, NULL, NULL, NULL),
(11, 300, 'Auto', 'Cabriolet', 300, 456, 688, NULL),
(12, 300, 'Auto', 'Cabriolet', 300, 456, 976, NULL),
(13, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334),
(14, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334);
SELECT * FROM #Category;
-- ADD YOUR CODE HERE TO RETURN the following RowIDs: 2, 4, 6, 9

If I understand this right, your logic is the following:
For each unique SubCategory Level 1, Product Category, and Product Name combination, you want to keep the row that has the least SubCategory level data filled in and return any duplicate that has more.
Using a quick DENSE_RANK partitioned on the relevant fields, you can make the rows with fewer SubCategory levels rank as 1. Rows 2, 4, 6, and 9 should then be the only rows returned.
;WITH DataToSelect
AS
(
    SELECT *,
           DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
                             ORDER BY
                                 CASE
                                     WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                                     WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                                     WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                                 END) AS [ToInclude]
    FROM #Category
)
SELECT *
FROM DataToSelect
WHERE ToInclude != 1
ORDER BY RowID;
Keep in mind that if two rows have the same SubCategory level within a SubCategory Level 1, Product Category, and Product Name combination, they'll both get the same rank and be treated identically. If you do not want this, swap the DENSE_RANK for ROW_NUMBER and add a tie-breaking criterion to decide which row should be kept, as sketched below.
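For reference, a minimal sketch of that ROW_NUMBER variant, assuming RowID is an acceptable tie-breaker (it flags every row except one "keeper" per group):
;WITH DataToSelect
AS
(
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
                             ORDER BY
                                 CASE
                                     WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                                     WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                                     WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                                 END,
                                 RowID   -- tie-breaker: keep the lowest RowID of each tie group
                            ) AS [ToInclude]
    FROM #Category
)
SELECT *
FROM DataToSelect
WHERE ToInclude > 1   -- everything except the single keeper per group
ORDER BY RowID;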

This thread helped me a lot with understanding a different method for removing duplicate data. I want to thank the original contributors. I did, however, notice that the final solution is incomplete. The original poster wanted the results to return RowIDs 2, 4, 6 and 9, but the ToInclude != 1 filter doesn't allow that. I am adding the code to complete the query with a WHERE ToInclude > 1 filter, which produces the intended result. See the code below:
;WITH DataToSelect
AS
(
    SELECT *,
           DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
                             ORDER BY
                                 CASE
                                     WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                                     WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                                     WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                                 END) AS [ToInclude]
    FROM #Category
)
SELECT *
FROM DataToSelect
WHERE ToInclude > 1
ORDER BY RowID;
This returns RowIDs 2, 4, 6 and 9.
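Since the original goal was to delete these duplicates rather than just select them, the same CTE can drive a DELETE directly. A minimal sketch (deleting through the CTE removes the underlying #Category rows; test against a copy of the data first):
;WITH DataToSelect
AS
(
    SELECT *,
           DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
                             ORDER BY
                                 CASE
                                     WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                                     WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                                     WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                                 END) AS [ToInclude]
    FROM #Category
)
DELETE FROM DataToSelect
WHERE ToInclude > 1;   -- removes RowIDs 2, 4, 6 and 9 from #Category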

Related

how to use "Stuff and 'For Xml Path'" to unite rows in table

Please help me combine rows and get a comma-separated list of accounts per invoice. I don't quite understand how to use STUFF and FOR XML PATH for this.
This is my query:
CREATE TABLE invoices
(
invoice VARCHAR(20) NOT NULL,
quantity INT NOT NULL,
price INT NOT NULL,
summ INT NOT NULL,
account INT NOT NULL,
);
INSERT invoices(invoice, quantity, price, summ, account)
VALUES ('ty20210110', 2, 100, 200, 1001);
INSERT invoices(invoice, quantity, price, summ, account)
VALUES ('ty20210110', 3, 100, 300, 1002);
INSERT invoices(invoice, quantity, price, summ, account)
VALUES ('ty20210110', 1, 250, 250, 1001);
INSERT invoices(invoice, quantity, price, summ, account)
VALUES ('ty20210110', 2, 120, 240, 1002);
INSERT invoices(invoice, quantity, price, summ, account)
VALUES ('ty20210110', 4, 100, 400, 1002);
INSERT invoices(invoice, quantity, price, summ, account)
VALUES ('ty20210114', 3, 100, 300, 1001);
INSERT invoices(invoice, quantity, price, summ, account)
VALUES ('ty20210114', 5, 80, 400, 1003);
INSERT invoices(invoice, quantity, price, summ, account)
VALUES ('ty20210114', 5, 100, 500, 1004);
SELECT invoices.invoice, invoices.summ, accounts = STUFF(
(SELECT DISTINCT ',' + Convert(varchar, invoices.account, 60)
FROM invoices
FOR XML PATH (''))
, 1, 1, '')
FROM invoices
GROUP BY invoices.invoice, invoices.summ
This is what I get as a result:
+------------+------+---------------------+
| invoice    | summ | accounts            |
+------------+------+---------------------+
| ty20210110 | 200  | 1001,1002,1003,1004 |
| ty20210110 | 240  | 1001,1002,1003,1004 |
| ty20210110 | 250  | 1001,1002,1003,1004 |
| ty20210110 | 300  | 1001,1002,1003,1004 |
| ty20210110 | 400  | 1001,1002,1003,1004 |
| ty20210114 | 300  | 1001,1002,1003,1004 |
| ty20210114 | 400  | 1001,1002,1003,1004 |
| ty20210114 | 500  | 1001,1002,1003,1004 |
+------------+------+---------------------+
This is what I need to get as a result:
+------------+------+-----------+
| invoice    | summ | accounts  |
+------------+------+-----------+
| ty20210110 | 1390 | 1001,1002 |
| ty20210114 | 1200 | 1003,1004 |
+------------+------+-----------+
So actually I need to get the total summ for each of the two invoices, together with a comma-separated list of the accounts involved in each invoice.
I also have this set up on db<>fiddle here: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=7a5de9e680693b5e70ea68cecebef6cc
Thank you in advance, guys.
Don't group by summ if you want to sum it. Use sum() on it. And correlate the subquery. Otherwise you'll just get all accounts.
SELECT i1.invoice,
sum(i1.summ) summ,
stuff((SELECT DISTINCT
concat(',', i2.account)
FROM invoices i2
WHERE i2.invoice = i1.invoice
FOR XML PATH ('')),
1,
1,
'') accounts
FROM invoices i1
GROUP BY i1.invoice;
For SQL Server 2017 onwards, you can leverage the STRING_AGG() function. The only nuance is that we need to select DISTINCT account values in the sub-query.
-- DDL and sample data population, start
DECLARE @invoices TABLE
(
invoice VARCHAR(20) NOT NULL,
quantity INT NOT NULL,
price INT NOT NULL,
summ INT NOT NULL,
account INT NOT NULL
);
INSERT @invoices(invoice, quantity, price, summ, account) VALUES
('ty20210110', 2, 100, 200, 1001),
('ty20210110', 3, 100, 300, 1002),
('ty20210110', 1, 250, 250, 1001),
('ty20210110', 2, 120, 240, 1002),
('ty20210110', 4, 100, 400, 1002),
('ty20210114', 3, 100, 300, 1001),
('ty20210114', 5, 80, 400, 1003),
('ty20210114', 5, 100, 500, 1004);
-- DDL and sample data population, end
SELECT i1.invoice
, SUM(i1.summ) AS summ
, (
SELECT STRING_AGG(account,',') FROM
(
(SELECT DISTINCT account FROM @invoices AS i2 WHERE i2.invoice = i1.invoice)
) AS x) AS accounts
FROM @invoices AS i1
GROUP BY i1.invoice;
Output
+------------+------+----------------+
| invoice | summ | accounts |
+------------+------+----------------+
| ty20210110 | 1390 | 1001,1002 |
| ty20210114 | 1200 | 1001,1003,1004 |
+------------+------+----------------+
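If a deterministic ordering of the account list matters, STRING_AGG also accepts a WITHIN GROUP clause. A small variation of the query above (same table variable and sample data) could look like this:
SELECT i1.invoice
     , SUM(i1.summ) AS summ
     , ( SELECT STRING_AGG(x.account, ',') WITHIN GROUP (ORDER BY x.account)
         FROM (SELECT DISTINCT account FROM @invoices AS i2 WHERE i2.invoice = i1.invoice) AS x
       ) AS accounts
FROM @invoices AS i1
GROUP BY i1.invoice;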

For SQL Server, I want to select all of the records that have changed

For example, my table has a record for each date, and each date's record could be the same as the previous date's record or it could be different. In my case, from date 1 to date 3 all of the records are the same; then on date 4 the record changes, and on date 5 it changes again, but back to the same values as date 3. Now I want a way to query the table and get the records of date 1, date 4 and date 5. Any idea how to do it? Thanks.
My reading of the issue above is that a) you take daily logs of all rows, and b) you want to report on any row that is different from the previous day's.
SQL Server has a great operator for dealing with differences across a large number of columns - EXCEPT. It also has the advantage of handling NULLs sensibly - a change from something to NULL, or vice versa, counts as a change. This is not true for most equality/inequality checks.
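A quick way to see that behaviour in isolation (nothing here beyond a scratch query):
-- An inequality involving NULL evaluates to UNKNOWN, so this returns no rows:
SELECT 'changed' AS Result WHERE CAST(NULL AS nvarchar(100)) <> N'something';
-- EXCEPT compares NULLs like values, so the difference is still reported (one NULL row comes back):
SELECT CAST(NULL AS nvarchar(100)) AS UserEmail
EXCEPT
SELECT N'something';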
Here is a version where I create a daily snapshot of some fields from a 'users' table.
The SELECT query finds all rows from the log, except where the previous entry in the log is the same.
CREATE TABLE #UserLog (LogDate date, UserID int, UserName nvarchar(100), UserEmail nvarchar(100), LastLogonDate datetime, PRIMARY KEY (LogDate, UserID));
INSERT INTO #UserLog (LogDate, UserID, UserName, UserEmail, LastLogonDate) VALUES
('20201011', 1, 'Bob', NULL, '20201009 15:38'),
('20201012', 1, 'Bob', NULL, '20201009 15:38'),
('20201013', 1, 'Bob', 'Bob@gm.com', '20201012 09:15'),
('20201014', 1, 'Bob', 'Bob@gm.com', '20201013 19:02'),
('20201015', 1, 'Bob', 'Bob@gm.com', '20201013 19:02'),
('20201017', 1, 'Bob', 'Bob@gm.com', '20201013 19:02'),
('20201013', 2, 'Pat', 'Pat@hm.com', NULL),
('20201014', 2, 'Pat', 'Pat@hm.com', NULL),
('20201015', 2, 'Pat', 'Pat@hm.com', '20201014 20:55'),
('20201017', 2, 'Pat', 'Pat@hm.com', '20201016 13:22');
SELECT LogDate, UserID, UserName, UserEmail, LastLogonDate
FROM #UserLog
EXCEPT
SELECT LEAD(LogDate) OVER (PARTITION BY UserID ORDER BY LogDate), UserID, UserName, UserEmail, LastLogonDate
FROM #UserLog
ORDER BY UserID, LogDate;
In the EXCEPT segment, it basically takes the data for each given row and changes the date to the next date in sequence for that user, e.g. it turns
('20201012', 1, 'Bob', NULL, '20201009 15:38'),
into
('20201013', 1, 'Bob', NULL, '20201009 15:38'),
As this is not the same as the actual row for Bob on the 13th, the row in the top part of the statement is returned.
My initial test run of this simply had a DATEADD(day, 1, Logdate) in the EXCEPT portion, and that would show all rows that were different from yesterday's. However, the updated version above allows for breaks in the sequence (e.g., in the above, the logging failed on the 16th).
Here's a DB<>fiddle with the code above.
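For completeness, a sketch of that earlier DATEADD variant mentioned above (only appropriate when the snapshots really are daily with no gaps):
SELECT LogDate, UserID, UserName, UserEmail, LastLogonDate
FROM #UserLog
EXCEPT
SELECT DATEADD(day, 1, LogDate), UserID, UserName, UserEmail, LastLogonDate
FROM #UserLog
ORDER BY UserID, LogDate;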
UPDATE - data posted in comment in another answer.
Here's a version with that data.
CREATE TABLE #tLog (LogDate date, v_1 int, v_2 varchar(100), v_3 int, v_4 varchar(10), v_5 int, v_6 varchar(10));
INSERT INTO #tLog (Logdate, v_1, v_2, v_3, v_4, v_5, v_6) VALUES
('20200101', 100, 'test_1', 0, '123', 120, 'JJ'),
('20200102', 100, 'test_1', 0, '123', 120, 'JJ'),
('20200103', 100, 'test_1', 0, '123', 120, 'JJ'),
('20200104', 101, 'test_1', 1, '123', 120, 'JJ'),
('20200105', 100, 'test_1', 0, '123', 120, 'JJ'),
('20200106', 101, 'test_1', 1, '12345', 120, 'JJ'),
('20200107', 101, 'test_1', 1, '12345', 120, 'JJ'),
('20200108', 101, 'test_2', 2, '12345', 200, 'JJ'),
('20200109', 101, 'test_1', 1, '12345', 120, 'TT'),
('20200110', 100, 'test_1', 0, '123', 120, 'JJ');
SELECT LogDate, v_1, v_2, v_3, v_4, v_5, v_6
FROM #tLog
EXCEPT
SELECT LEAD(LogDate) OVER (ORDER BY LogDate), v_1, v_2, v_3, v_4, v_5, v_6
FROM #tLog
ORDER BY LogDate;
And here's a copy of the results of the above. Note that only on the 2nd, 3rd and 7th did the data not change from the previous day.
LogDate     v_1  v_2     v_3  v_4    v_5  v_6
----------  ---  ------  ---  -----  ---  ---
2020-01-01  100  test_1  0    123    120  JJ
2020-01-04  101  test_1  1    123    120  JJ
2020-01-05  100  test_1  0    123    120  JJ
2020-01-06  101  test_1  1    12345  120  JJ
2020-01-08  101  test_2  2    12345  200  JJ
2020-01-09  101  test_1  1    12345  120  TT
2020-01-10  100  test_1  0    123    120  JJ
Note that I have removed the 'PARTITION BY' in the LEAD as there are no real partitions - it's just one row after the next. However there's a distinct chance you may need this when it comes to actual data.
Here's a DB<>fiddle with both the original and this cut-down one with the OP's data.

How do you validate that range doesn't overlap in a list of data?

I have a list of data :
Id StartAge EndAge Amount
1 0 2 50
2 2 5 100
3 5 10 150
4 6 9 160
I have to set the Amount for various age groups.
The age group >0 and <=2 needs to pay 50
The age group >2 and <=5 needs to pay 100
The age group >5 and <=10 needs to pay 150
But
The age group >6 and <=9 needs to pay 160, which is invalid input because the >6 and <=9 range already falls inside the 150 amount range.
I have to validate this kind of invalid input before inserting my data as a bulk. Once the 5-10 range has been inserted, anything within that range should not be accepted by the system. For example, in the above list the user should be allowed to insert a 10-15 age group, but any of the following should be flagged as invalid:
6-9
6-11
3-5
5-7
If invalid input exists in my list, I don't need to insert the list at all.
You could try inserting your data into a temporary table first.
DECLARE @TempData TABLE
(
[Id] TINYINT
,[StartAge] TINYINT
,[EndAge] TINYINT
,[Amount] TINYINT
);
INSERT INTO @TempData ([Id], [StartAge], [EndAge], [Amount])
VALUES (1, 0, 2, 50)
,(2, 2, 5, 100)
,(3, 5, 10, 150)
,(4, 6, 9, 160);
Then this data can be transferred to your target table using an INSERT INTO ... SELECT ... statement.
INSERT INTO <your target table>
SELECT * FROM @TempData s
WHERE
NOT EXISTS (
SELECT 1
FROM @TempData t
WHERE
t.[Id] < s.[Id]
AND s.[StartAge] < t.[EndAge]
AND s.[EndAge] > t.[StartAge]
);
I've created a demo here
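Note that the question actually asks to reject the whole batch when any overlap exists. A minimal all-or-nothing sketch, reusing the same overlap predicate (the target table name is a placeholder, as above):
IF NOT EXISTS (
    SELECT 1
    FROM @TempData s
    JOIN @TempData t
        ON t.[Id] < s.[Id]
       AND s.[StartAge] < t.[EndAge]
       AND s.[EndAge] > t.[StartAge]
)
BEGIN
    INSERT INTO <your target table>
    SELECT * FROM @TempData;
END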
We can use a recursive CTE to find how records are chained by their end age and start age pairs:
DECLARE @DataSource TABLE
(
[Id] TINYINT
,[StartAge] TINYINT
,[EndAge] TINYINT
,[Amount] TINYINT
);
INSERT INTO @DataSource ([Id], [StartAge], [EndAge], [Amount])
VALUES (1, 0, 2, 50)
,(2, 2, 5, 100)
,(3, 5, 10, 150)
,(4, 6, 9, 160)
,(5, 6, 11, 160)
,(6, 3, 5, 160)
,(7, 5, 7, 160)
,(9, 10, 15, 20)
,(8, 7, 15, 20);
WITH PreDataSource AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY [StartAge] ORDER BY [id]) as [pos]
FROM @DataSource
), DataSource AS
(
SELECT [Id], [StartAge], [EndAge], [Amount], [pos]
FROM PreDataSource
WHERE [id] = 1
UNION ALL
SELECT R.[Id], R.[StartAge], R.[EndAge], R.[Amount], R.[pos]
FROM DataSource A
INNER JOIN PreDataSource R
ON A.[Id] < R.[Id]
AND A.[EndAge] = R.[StartAge]
AND R.[pos] =1
)
SELECT [Id], [StartAge], [EndAge], [Amount]
FROM DataSource;
This gives us the following output: records 1, 2, 3 and 9, i.e. the chain of non-overlapping ranges.
Note that before this, we are using the following statement to prepare the data:
SELECT *, ROW_NUMBER() OVER (PARTITION BY [StartAge] ORDER BY [id]) as [pos]
FROM @DataSource;
The idea is to find records with the same start age and to calculate which one was inserted first. Then, in the CTE, we take only the first of them.
Assuming you are bulk inserting the mentioned data into a temp table (#tmp) or table variable (@tmp).
If you are working on SQL Server 2012, try the below.
SELECT *
FROM (
    SELECT *, LAG(endage, 1, 0) OVER (ORDER BY endage) AS [col1]
    FROM #tmp
) tmp
WHERE startage >= col1 AND endage > col1
The result of this query should be inserted into your main table.

How could I replace a T-SQL cursor?

I would like to ask how I could replace a cursor that I've put into my stored procedure.
Actually, we found that a cursor was the only way to handle my scenario, but from what I've read this is not a best practice.
This is my scenario: I have to calculate the stock recursively, row by row, and set the season according to what has been calculated in the previous rows.
I can set the season when the transfer type is "Purchase". The other transfers should have the correct season set by a T-SQL query.
The table where I should calculate the season has the following template and fake data, which reflect the real situation.
The rows that have FlagSeason set to NULL are calculated as follows: in ascending order, the cursor starts from row 3, goes back over the previous rows, calculates the amount of stock for each season, and then updates the season column with the minimum season that still has stock.
Here's the code I used:
CREATE TABLE [dbo].[transfers]
(
[rowId] [int] NULL,
[area] [int] NULL,
[store] [int] NULL,
[item] [int] NULL,
[date] [date] NULL,
[type] [nvarchar](50) NULL,
[qty] [int] NULL,
[season] [nvarchar](50) NULL,
[FlagSeason] [int] NULL
) ON [PRIMARY]
INSERT INTO [dbo].[transfers]
([rowId]
,[area]
,[store]
,[item]
,[date]
,[type]
,[qty]
,[season]
,[FlagSeason])
VALUES (1,1,20,300,'2015-01-01','Purchase',3,'2015-FallWinter',1)
, (2,1,20,300,'2015-01-01','Purchase',4,'2016-SpringSummer',1)
, (3,1,20,300,'2015-01-01','Sales',-1,null,null)
, (4,1,20,300,'2015-01-01','Sales',-2,null,null)
, (5,1,20,300,'2015-01-01','Sales',-1,null,null)
, (6,1,20,300,'2015-01-01','Sales',-1,null,null)
, (7,1,20,300,'2015-01-01','Purchase',4,'2016-FallWinter',1)
, (8,1,20,300,'2015-01-01','Sales',-1,null,null)
DECLARE @RowId AS INT;

DECLARE db_cursor CURSOR FOR
    SELECT RowID
    FROM Transfers
    WHERE [FlagSeason] IS NULL
    ORDER BY RowID;

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @RowId;

WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE Transfers
    SET Season = (SELECT MIN(Season)
                  FROM (
                      SELECT Season, SUM(QTY) AS Qty
                      FROM Transfers
                      WHERE RowID < @RowId
                        AND [FlagSeason] = 1
                      GROUP BY Season
                      HAVING SUM(QTY) > 0
                  ) s
                  WHERE s.Qty >= 0
                 ),
        [FlagSeason] = 1
    WHERE RowId = @RowId;

    FETCH NEXT FROM db_cursor INTO @RowId;
END

CLOSE db_cursor;
DEALLOCATE db_cursor;
In this case the query would extract:
3 qty for season 2015 FW
4 for 2016 SS.
Then the UPDATE statement will set 2015-FW (the minimum of the two seasons with stock).
Then the cursor moves forward to row 4 and runs the query again to extract the stock, updated to account for the calculation at row 3. So the result should be:
QTY 2 for 2015 FW
QTY 4 for 2016 SS
and then the update would set 2015 FW.
And so on.
The final output should be something like this:
[output screenshot]
Actually, the only way out was to implement a cursor, and it now takes about 30-40 minutes to scan and update roughly 2.5 million rows. Does anybody know a solution that doesn't resort to a cursor?
Thanks in advance!
Updated to run on 2008
IF OBJECT_ID('tempdb..#transfer') IS NOT NULL
DROP TABLE #transfer;
GO
CREATE TABLE #transfer (
RowID INT IDENTITY(1, 1) PRIMARY KEY NOT NULL,
Area INT,
Store INT,
Item INT,
Date DATE,
Type VARCHAR(50),
Qty INT,
Season VARCHAR(50),
FlagSeason INT
);
INSERT INTO #transfer ( Area,
Store,
Item,
Date,
Type,
Qty,
Season,
FlagSeason
)
VALUES (1, 20, 300, '20150101', 'Purchase', 3, '2015-SpringSummer', 1),
(1, 20, 300, '20150601', 'Purchase', 4, '2016-SpringSummer', 1),
(1, 20, 300, '20150701', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20150721', 'Sales', -2, NULL, NULL),
(1, 20, 300, '20150901', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20160101', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170101', 'Purchase', 4, '2017-SpringSummer', 1),
(1, 20, 300, '20170125', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170201', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170225', 'Sales', -1, NULL, NULL),
(1, 21, 301, '20150801', 'Purchase', 4, '2017-SpringSummer', 1),
(1, 21, 301, '20150901', 'Sales', -1, NULL, NULL),
(1, 21, 301, '20151221', 'Sales', -2, NULL, NULL),
(1, 21, 302, '20150801', 'Purchase', 1, '2016-SpringSummer', 1),
(1, 21, 302, '20150901', 'Purchase', 1, '2017-SpringSummer', 1),
(1, 21, 302, '20151101', 'Sales', -1, NULL, NULL),
(1, 21, 302, '20151221', 'Sales', -1, NULL, NULL),
(1, 20, 302, '20150801', 'Purchase', 1, '2016-SpringSummer', 1),
(1, 20, 302, '20150901', 'Purchase', 1, '2017-SpringSummer', 1),
(1, 20, 302, '20151101', 'Sales', -1, NULL, NULL),
(1, 20, 302, '20151221', 'Sales', -1, NULL, NULL);
WITH Purchases
AS (SELECT t1.RowID,
t1.Area,
t1.Store,
t1.Item,
t1.Date,
t1.Type,
t1.Qty,
t1.Season,
RunningInventory = ( SELECT SUM(t2.Qty)
FROM #transfer AS t2
WHERE t1.Type = t2.Type
AND t1.Area = t2.Area
AND t1.Store = t2.Store
AND t1.Item = t2.Item
AND t2.Date <= t1.Date
)
FROM #transfer AS t1
WHERE t1.Type = 'Purchase'
),
Sales
AS (SELECT t1.RowID,
t1.Area,
t1.Store,
t1.Item,
t1.Date,
t1.Type,
t1.Qty,
t1.Season,
RunningSales = ( SELECT SUM(ABS(t2.Qty))
FROM #transfer AS t2
WHERE t1.Type = t2.Type
AND t1.Area = t2.Area
AND t1.Store = t2.Store
AND t1.Item = t2.Item
AND t2.Date <= t1.Date
)
FROM #transfer AS t1
WHERE t1.Type = 'Sales'
)
SELECT Sales.RowID,
Sales.Area,
Sales.Store,
Sales.Item,
Sales.Date,
Sales.Type,
Sales.Qty,
Season = ( SELECT TOP 1
Purchases.Season
FROM Purchases
WHERE Purchases.Area = Sales.Area
AND Purchases.Store = Sales.Store
AND Purchases.Item = Sales.Item
AND Purchases.RunningInventory >= Sales.RunningSales
ORDER BY Purchases.Date, Purchases.Season
)
FROM Sales
UNION ALL
SELECT Purchases.RowID ,
Purchases.Area ,
Purchases.Store ,
Purchases.Item ,
Purchases.Date ,
Purchases.Type ,
Purchases.Qty ,
Purchases.Season
FROM Purchases
ORDER BY Sales.Area, Sales.Store, item, Sales.Date
Original answer below:
I don't understand the purpose of the flagseason column so I didn't include that. Essentially, this calculates a running sum for purchases and sales and then finds the season that has a purchase_to_date inventory of at least the sales_to_date outflow for each sales transaction.
IF OBJECT_ID('tempdb..#transfer') IS NOT NULL
DROP TABLE #transfer;
GO
CREATE TABLE #transfer (
RowID INT IDENTITY(1, 1) PRIMARY KEY NOT NULL,
Area INT,
Store INT,
Item INT,
Date DATE,
Type VARCHAR(50),
Qty INT,
Season VARCHAR(50),
FlagSeason INT
);
INSERT INTO #transfer ( Area,
Store,
Item,
Date,
Type,
Qty,
Season,
FlagSeason
)
VALUES (1, 20, 300, '20150101', 'Purchase', 3, '2015-FallWinter', 1),
(1, 20, 300, '20150601', 'Purchase', 4, '2016-SpringSummer', 1),
(1, 20, 300, '20150701', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20150721', 'Sales', -2, NULL, NULL),
(1, 20, 300, '20150901', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20160101', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170101', 'Purchase', 4, '2016-FallWinter', 1),
(1, 20, 300, '20170201', 'Sales', -1, NULL, NULL);
WITH Inventory
AS (SELECT *,
PurchaseToDate = SUM(CASE WHEN Type = 'Purchase' THEN Qty ELSE 0 END) OVER (ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
SalesToDate = ABS(SUM(CASE WHEN Type = 'Sales' THEN Qty ELSE 0 END) OVER (ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM #transfer
)
SELECT Inventory.RowID,
Inventory.Area,
Inventory.Store,
Inventory.Item,
Inventory.Date,
Inventory.Type,
Inventory.Qty,
Season = CASE
WHEN Inventory.Season IS NULL
THEN ( SELECT TOP 1
PurchaseToSales.Season
FROM Inventory AS PurchaseToSales
WHERE PurchaseToSales.PurchaseToDate >= Inventory.SalesToDate
ORDER BY Inventory.Date
)
ELSE
Inventory.Season
END,
Inventory.PurchaseToDate,
Inventory.SalesToDate
FROM Inventory;
UPDATED
You'll need an index on your data to help with the sorting in order to make this perform.
Possibly:
CREATE NONCLUSTERED INDEX IX_Transfer ON #transfer(Store, Item, Date) INCLUDE(Area,Qty,Season,Type)
You should see an index scan on the named index. It will not be a seek, because the sample query does not filter any data and all of the data is included.
In addition, you need to remove Season from the Partition By clause of the SalesToDate. Resetting the sales for each season will throw your comparisons off because the rolling sales need to be compared to the rolling inventory in order for you to determine the source of sales inventory.
Two other tips for the partition clause:
Don't duplicate the fields between PARTITION BY and ORDER BY. The order of the partition fields doesn't matter, since the aggregate is reset for each partition. At best, the ordered partition field will be ignored; at worst, it may cause the optimizer to aggregate the fields in a particular order. This does not have any effect on the results, but can add unnecessary overhead.
Make sure your index matches the definition of the partition by/order by clause.
The index should be [partitioning fields, sequence doesn't matter] + [ordering fields, sequence needs to match order by clause].
In your scenario, the indexed columns should be on store, item, and then date. If date were before store or item, the index would not be used because the optimizer will need to first handle partitioning by store & item before sorting by date.
If you may have multiple areas in your data, the index and partition clauses would need to be (as sketched after this list):
index: area, store, item, date
partition by: area, store, item order by date
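In code, that adjustment might look roughly like this (a sketch against the #transfer table above; the index name is arbitrary):
-- Index keyed to the window's partition columns first, then the ORDER BY column,
-- with the remaining referenced columns as included columns.
CREATE NONCLUSTERED INDEX IX_Transfer_Area
    ON #transfer (Area, Store, Item, Date)
    INCLUDE (Qty, Season, Type);
-- ...and inside the Inventory CTE, partition by Area as well, e.g.:
-- SUM(CASE WHEN Type = 'Purchase' THEN Qty ELSE 0 END)
--     OVER (PARTITION BY Area, Store, Item ORDER BY Date
--           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS PurchaseToDate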
Referring to Wes's answer, the proposed solution is almost fine. It works well, but I've noticed that the assignment of the season doesn't work properly because, in my scenario, the stock should be calculated and updated per store and item. I've updated the script with some adjustments. Moreover, I've added some new "fake" data to better illustrate my scenario and how it should work.
IF OBJECT_ID('tempdb..#transfer') IS NOT NULL
DROP TABLE #transfer;
GO
CREATE TABLE #transfer (
RowID INT IDENTITY(1, 1) PRIMARY KEY NOT NULL,
Area INT,
Store INT,
Item INT,
Date DATE,
Type VARCHAR(50),
Qty INT,
Season VARCHAR(50),
FlagSeason INT
);
INSERT INTO #transfer ( Area,
Store,
Item,
Date,
Type,
Qty,
Season,
FlagSeason
)
VALUES (1, 20, 300, '20150101', 'Purchase', 3, '2015-SpringSummer', 1),
(1, 20, 300, '20150601', 'Purchase', 4, '2016-SpringSummer', 1),
(1, 20, 300, '20150701', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20150721', 'Sales', -2, NULL, NULL),
(1, 20, 300, '20150901', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20160101', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170101', 'Purchase', 4, '2017-SpringSummer', 1),
(1, 20, 300, '20170125', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170201', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170225', 'Sales', -1, NULL, NULL),
(1, 21, 301, '20150801', 'Purchase', 4, '2017-SpringSummer', 1),
(1, 21, 301, '20150901', 'Sales', -1, NULL, NULL),
(1, 21, 301, '20151221', 'Sales', -2, NULL, NULL),
(1, 21, 302, '20150801', 'Purchase', 1, '2016-SpringSummer', 1),
(1, 21, 302, '20150901', 'Purchase', 1, '2017-SpringSummer', 1),
(1, 21, 302, '20151101', 'Sales', -1, NULL, NULL),
(1, 21, 302, '20151221', 'Sales', -1, NULL, NULL),
(1, 20, 302, '20150801', 'Purchase', 1, '2016-SpringSummer', 1),
(1, 20, 302, '20150901', 'Purchase', 1, '2017-SpringSummer', 1),
(1, 20, 302, '20151101', 'Sales', -1, NULL, NULL),
(1, 20, 302, '20151221', 'Sales', -1, NULL, NULL)
;
WITH Inventory
AS (SELECT *,
PurchaseToDate = SUM(CASE WHEN Type = 'Purchase' THEN Qty ELSE 0 END) OVER (partition by store, item ORDER BY store, item,Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
SalesToDate = ABS(SUM(CASE WHEN Type = 'Sales' THEN Qty ELSE 0 END) OVER (partition by store, item,season ORDER BY store, item, Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM #transfer
)
SELECT Inventory.RowID,
Inventory.Area,
Inventory.Store,
Inventory.Item,
Inventory.Date,
Inventory.Type,
Inventory.Qty,
Season = CASE
WHEN Inventory.Season IS NULL
THEN ( SELECT TOP 1
PurchaseToSales.Season
FROM Inventory AS PurchaseToSales
WHERE PurchaseToSales.PurchaseToDate >= Inventory.SalesToDate
and PurchaseToSales.Item = inventory.item --//Added
and PurchaseToSales.store = inventory.store --//Added
and PurchaseToSales.Area = Inventory.area --//Added
ORDER BY Inventory.Date
)
ELSE
Inventory.Season
END,
Inventory.PurchaseToDate,
Inventory.SalesToDate
FROM Inventory
Here is the output:
[output screenshot]
After these adjustments it works fine, but if I swap the fake data for the real data, which live in a table of about 6 million rows, the query becomes very slow (~400 rows extracted per minute) because of these checks added inside the WHERE clause of the subquery:
WHERE PurchaseToSales.PurchaseToDate >= Inventory.SalesToDate
and PurchaseToSales.Item = inventory.item --//Added
and PurchaseToSales.store = inventory.store --//Added
and PurchaseToSales.Area = Inventory.area --//Added
I've tried to replace the subquery with CROSS APPLY, but nothing has changed. Am I missing something?
Thanks in advance.

How to Hide Duplicates in a Query for Just One Field in SQL?

I'm pretty new to this topic and need help...
How do you hide duplicates in a query for just one field in SQL? For example, I have this table:
1, 122, 123, 6
2, 122, 156, 7
3, 122, 188, 6
4, 101, 186, 8
and I want to get this table:
122, 123, 6
empty,156, 7
empty,188, 6
101, 186, 8
"Empty" means that this cell should be empty. Thanks for any help.
Here is one possible way:
DECLARE @t TABLE (col1 INT, col2 INT, col3 INT, col4 INT)
INSERT @t VALUES
(1, 122, 123, 69),
(2, 122, 156, 7),
(3, 122, 188, 6),
(4, 101, 186, 8)
SELECT CASE WHEN ROW_NUMBER() OVER (PARTITION BY col2 ORDER BY col1) = 1 THEN col2 END col2_New
, col3
, col4
FROM @t
ORDER BY
col1
Only the first occurrence of a col2 value gets written to the result set (values are ordered by col1 ascending); otherwise the value is NULL (which in database terms means no value defined). It uses the T-SQL ROW_NUMBER function, which you can read about here. The call to ROW_NUMBER is enclosed in a CASE ... WHEN ... THEN ... END conditional expression, which makes it possible to only write out the value of col2 for the first occurrence of a given value.
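If a literal empty cell is preferred over NULL, as in the example output, the column can be converted to a string and defaulted to an empty string (a small sketch under that assumption):
SELECT CASE WHEN ROW_NUMBER() OVER (PARTITION BY col2 ORDER BY col1) = 1
            THEN CAST(col2 AS VARCHAR(10))
            ELSE ''                         -- show an empty cell instead of NULL
       END AS col2_New
     , col3
     , col4
FROM @t
ORDER BY col1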
