Joining a column spread across multiple rows based on a condition - sql-server

I have a table for employees with comments spread across multiple rows. Those need to be joined into a single row. To identify which rows can be joined, we need to use the date field: if a row has a date and the subsequent rows have no date, then that row denotes the start of a comment for the employee. If, however, there is a single row with no date and no prior dated row either, then it is considered a new comment on its own. The order in which comments are entered (an identity column) is also provided, so I was trying to use the LEAD function.
The table below is what we have:
EmployeeId  Date        Comment                          Order
1001        2021-01-08  This is only first part          1
1001        NULL        this is the second               2
1001        NULL        and this is third part           3
1001        2021-01-15  This is a new comment for same   4
1002        2021-01-16  This one has subsequent comment  5
1002        2021-01-16  The second comment               6
1003        NULL        This is single comment           7
1003        2021-01-12  This is also single comment      8
The result we expect is:

EmployeeId  Date        Comment                                                             Order
1001        2021-01-08  This is only first part this is the second and this is third part  1
1001        NULL        This is a new comment for same                                     4
1002        2021-01-15  This one has subsequent comment The second comment                 5
1003        2021-01-16  This is single comment                                             7
1003        2021-01-16  This is also single comment                                        8
I am trying the LEAD function but cannot work out how to join n rows based on the condition. Any help?
SQL:
CREATE TABLE Comments(
[EmployeeID] [int] NOT NULL,
[Date] [date] null,
[Comment] [varchar](100) NULL,
[Order] [int] NULL
)
INSERT INTO Comments VALUES('1001','1/8/2021', 'This is only first part', 1)
INSERT INTO Comments VALUES('1001',NULL, 'this is the second', 2)
INSERT INTO Comments VALUES('1001',NULL, 'and this is third part', 3)
INSERT INTO Comments VALUES('1001','1/15/2021', 'This is a new comment for same', 4)
INSERT INTO Comments VALUES('1002','1/16/2021', 'This one has subsequent comment', 5)
INSERT INTO Comments VALUES('1002','1/16/2021', 'The second comment', 6)
INSERT INTO Comments VALUES('1003',NULL, 'This is single comment', 7)
INSERT INTO Comments VALUES('1003','1/12/2021', 'This is also single comment', 8)

I left a bunch of comments about details in your "expected results" that did not make sense given your requirements. If I go by your stated requirements then here is a solution:
First, normalize the table so that every row has a date and we can use GROUP BY:
SELECT EmployeeID,
       COALESCE([Date],
                LAG([Date]) OVER (ORDER BY [Order] ASC),               -- Get the prior one if null
                MIN([Date]) OVER (PARTITION BY EmployeeID)) AS [Date], -- Get the employee's smallest date if the first two are null
       Comment,
       [Order]
FROM sometableyoudidnotname
Now that we have this table we can use GROUP BY and STRING_AGG:
SELECT EmployeeID,
       MIN([Date]) AS [Date],
       STRING_AGG(Comment, ' ') WITHIN GROUP (ORDER BY [Order] ASC) AS Comment
FROM (
    SELECT EmployeeID,
           COALESCE([Date],
                    LAG([Date]) OVER (ORDER BY [Order] ASC),               -- Get the prior one if null
                    MIN([Date]) OVER (PARTITION BY EmployeeID)) AS [Date], -- Get the employee's smallest date if the first two are null
           Comment,
           [Order]
    FROM sometableyoudidnotname) X
GROUP BY EmployeeID
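If each comment group should stay its own row (following the stated rule that every dated row starts a new comment and dateless rows attach to the most recent dated row), a gaps-and-islands variant may be closer. This is only a sketch, assuming the Comments table from the question and SQL Server 2017+ for STRING_AGG:

-- Sketch only: every non-NULL date starts a new group; NULL-date rows inherit
-- the running group number, so runs of any length collapse into one row.
WITH Grouped AS (
    SELECT EmployeeID, [Date], Comment, [Order],
           COUNT([Date]) OVER (PARTITION BY EmployeeID
                               ORDER BY [Order]
                               ROWS UNBOUNDED PRECEDING) AS Grp
    FROM Comments
)
SELECT EmployeeID,
       MIN([Date]) AS [Date],
       STRING_AGG(Comment, ' ') WITHIN GROUP (ORDER BY [Order]) AS Comment,
       MIN([Order]) AS [Order]
FROM Grouped
GROUP BY EmployeeID, Grp
ORDER BY MIN([Order]);

A leading dateless row (like Order 7 for employee 1003) gets group number 0, so it stays a comment of its own, as the question requires.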

Related

Finding a difference then the largest value over time

How do you get the row that gained the most value over a period of time out of a large set of groups?
I've seen some overly-complicated variations on this question, and none with a good answer. I've tried to put together the simplest possible example:
Given a table like the one below, with row#, ID, year, and value columns, how would you find an ID that gained the most value and display the difference as a new column in the output?
Column A  ID   Year  Value
row 1     322  2012  150,000
row 2     322  2013  165,000
row 3     344  2012  220,000
row 4     344  2013  290,000

Desired output:

ID   Value    Value_Gained
344  290,000  70,000
SELECT id, year, value
FROM table
WHERE value = (SELECT MAX(value) FROM table);
The FIRST_VALUE window function will help you get the values for the last and first year for each of your IDs. Then it's enough to order by your biggest values and take one row using TOP(N).
SELECT TOP(1)
ID,
FIRST_VALUE([Value]) OVER(PARTITION BY [ID] ORDER BY [Year] DESC) AS [Value],
FIRST_VALUE([Value]) OVER(PARTITION BY [ID] ORDER BY [Year] DESC)
- FIRST_VALUE([Value]) OVER(PARTITION BY [ID] ORDER BY [Year]) AS [ValueGained]
FROM tab
ORDER BY [Value] DESC
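If "gained the most" should be judged by the gain itself rather than by the latest value, a small variation of the same idea orders by the computed difference. This is only a sketch, assuming the same table name tab:

-- Sketch: same FIRST_VALUE idea, but ranked by the gain itself.
WITH Diffs AS (
    SELECT ID,
           FIRST_VALUE([Value]) OVER (PARTITION BY [ID] ORDER BY [Year] DESC) AS LastValue,
           FIRST_VALUE([Value]) OVER (PARTITION BY [ID] ORDER BY [Year] ASC)  AS FirstValue
    FROM tab
)
SELECT TOP(1)
       ID,
       LastValue AS [Value],
       LastValue - FirstValue AS Value_Gained
FROM Diffs
ORDER BY Value_Gained DESC;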

How does updating rows from a subquery work in SQL Server?

How does SQL Server know which rows to update when updating from a subquery rather than a table?
Say I have a table with three columns defined like below:
CREATE TABLE A (
AId int IDENTITY (1,1) PRIMARY KEY,
AExternalId int NULL,
ASequence int NULL
)
I want to update the column ASequence by sequential numbers within groups of AExternalId where ASequence is NULL.
For example, having inserted four different AExternalId's (or groups),
INSERT INTO A ([AExternalId]) VALUES (1001)
INSERT INTO A ([AExternalId]) VALUES (1002)
INSERT INTO A ([AExternalId]) VALUES (1002)
INSERT INTO A ([AExternalId]) VALUES (1003)
INSERT INTO A ([AExternalId]) VALUES (1003)
INSERT INTO A ([AExternalId]) VALUES (1003)
INSERT INTO A ([AExternalId], [ASequence]) VALUES (1004, 10)
INSERT INTO A ([AExternalId], [ASequence]) VALUES (1004, 20)
INSERT INTO A ([AExternalId], [ASequence]) VALUES (1004, 30)
the table looks like this:
AId  AExternalId  ASequence
1    1001         NULL
2    1002         NULL
3    1002         NULL
4    1003         NULL
5    1003         NULL
6    1003         NULL
7    1004         10
8    1004         20
9    1004         30
After the update, the table should look like this:
AId  AExternalId  ASequence
1    1001         1
2    1002         1
3    1002         2
4    1003         1
5    1003         2
6    1003         3
7    1004         10
8    1004         20
9    1004         30
AIds within every group of AExternalId now have a sequential number (except the ones that already had a sequence).
I can achieve this by running the following query:
UPDATE t1
SET t1.[ASequence] = t1.[CalcSequence]
FROM (
SELECT AId, AExternalId, ASequence, ROW_NUMBER() OVER (PARTITION BY [AExternalId] ORDER BY [AExternalId], [AId] ASC) AS [CalcSequence]
FROM [A]
WHERE (ASequence IS NULL) AND (AExternalId IS NOT NULL)
) t1
The question is, why (or rather how) does this work as there is no table specified and no condition for the update?
I have been taught that an update without condition will update all rows in a table but in this case there is no table specified (only in the subquery).
Does this work because I am updating the resulting rows from the inner select? If so, how are rows "matched" so that the update is made on the correct row?
Is this an example of a Correlated Subquery?
I've tried to read up on those but failed to understand whether it applies here. Many texts on correlated subqueries talk about performance issues and say that the correlated subquery requires values from its outer query, which does not seem to fit this example.
An alternative way of achieving the same result is by using an INNER JOIN:
UPDATE t1
SET t1.[ASequence] = t2.[CalcSequence]
FROM [A] t1
INNER JOIN (
SELECT AId, AExternalId, ASequence, ROW_NUMBER() OVER (PARTITION BY [AExternalId] ORDER BY [AExternalId], [AId] ASC) AS [CalcSequence]
FROM [A]
WHERE (ASequence IS NULL) AND (AExternalId IS NOT NULL)
) t2 ON t2.AId = t1.AId
I have compared the results of both queries and they are identical. Performance-wise, the first query seems to be a bit faster and consumes fewer resources.
The second query (with inner join) feels more "familiar", more "correct" but I would really like to understand how the first one works.
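For what it's worth, the same derived-table update can also be written against an updatable CTE, which makes it explicit that you are updating the rows of the inner result and that SQL Server maps each of them back to exactly one base row. This is just a sketch of the same logic, not a third benchmarked variant:

-- Sketch: the same logic expressed through an updatable CTE.
WITH t1 AS (
    SELECT AId, ASequence,
           ROW_NUMBER() OVER (PARTITION BY [AExternalId] ORDER BY [AId] ASC) AS CalcSequence
    FROM [A]
    WHERE (ASequence IS NULL) AND (AExternalId IS NOT NULL)
)
UPDATE t1
SET ASequence = CalcSequence;  -- each CTE row maps back to one row of [A]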

Find duplicate rows and show only the earliest

I have the following table:
respid, uploadtime
I need a query that will show all the records where respid is duplicated, except the latest one (by upload time).
Example:
4 2014-01-01
4 2014-06-01
4 2015-01-01
4 2015-06-01
4 2016-01-01
In this case the query should return four records (the latest is 4 2016-01-01).
Thank you very much.
Use ROW_NUMBER:
WITH cte AS (
SELECT respid, uploadtime,
ROW_NUMBER() OVER (PARTITION BY respid ORDER BY uploadtime DESC) rn
FROM yourTable
)
SELECT respid, uploadtime
FROM cte
WHERE rn > 1
ORDER BY respid, uploadtime;
The logic here is to show all records except those having the first row number value, which would be the latest record for each respid group.
If I interpreted your question correctly, then you want to see all records where respid occurs multiple times, but exclude the last duplicate.
Translating this to SQL could sound like "show all records that have a later record for the same respid". That is exactly what the solution below does. It says that for every row in the result, a later record with the same respid must exist.
Sample data
declare #MyTable table
(
respid int,
uploadtime date
);
insert into #MyTable (respid, uploadtime) values
(4, '2014-01-01'),
(4, '2014-06-01'),
(4, '2015-01-01'),
(4, '2015-06-01'),
(4, '2016-01-01'), --> last duplicate of respid=4, not part of result
(5, '2020-01-01'); --> has no duplicate, not part of result
Solution
select mt.respid, mt.uploadtime
from #MyTable mt
where exists ( select top 1 'x'
from #MyTable mt2
where mt2.respid = mt.respid
and mt2.uploadtime > mt.uploadtime );
Result
respid uploadtime
----------- ----------
4 2014-01-01
4 2014-06-01
4 2015-01-01
4 2015-06-01

Checking next row in table is incremented by 1 minute in datetime column

I need to check a lot of data in a table to make sure my feed has not skipped anything.
Basically the table has the following columns
ID Datetime Price
The data in the DateTime column is incremented by 1 minute in each successive row. I need to check the next row against the current one to see if it is 1 minute ahead of the row being queried in that specific context.
The query will probably need some sort of loop, grabbing a copy of the next row and comparing its datetime to the current row's to make sure it is incremented by 1 minute.
I created a test-table to match your description, and inserted 100 rows with 1 minute between each row like this:
CREATE TABLE [Test] ([Id] int IDENTITY(1,1), [Date] datetime, [Price] int);
WITH [Tally] AS (
SELECT GETDATE() AS [Date]
UNION ALL
SELECT DATEADD(minute, -1, [Date]) FROM [Tally] WHERE [Date] > DATEADD(minute, -99, GETDATE())
)
INSERT INTO [Test] ([Date], [Price])
SELECT [Date], 123 AS [Price]
FROM [Tally]
Then I deleted a record in the middle to simulate a missing minute:
DELETE FROM [Test]
WHERE Id = 50
Now we can use this query to find missing records:
SELECT
a.*
,CASE WHEN b.[Id] IS NULL THEN 'Next record is Missing!' ELSE CAST(b.[Id] as varchar) END AS NextId
FROM
[Test] AS a
LEFT JOIN [Test] AS b ON a.[Date] = DATEADD(minute,1,b.[Date])
WHERE
b.[Id] IS NULL
The result will look like this:
Id Date Price NextId
----------- ----------------------- ----------- ------------------------------
49 2013-05-11 22:42:56.440 123 Next record is Missing!
100 2013-05-11 21:51:56.440 123 Next record is Missing!
(2 row(s) affected)
The key to the solution is to join the table with itself, using DATEADD to look for the record that should exist exactly one minute away. The last record of the table will of course report that the next row is missing, since it hasn't been inserted yet.
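On SQL Server 2012 or later, LEAD offers another way to spot the gaps without a self-join; this is just a sketch, assuming the same [Test] table as above:

-- Sketch: flag rows whose chronological successor is not exactly one minute later.
WITH Ordered AS (
    SELECT [Id], [Date],
           LEAD([Date]) OVER (ORDER BY [Date]) AS NextDate
    FROM [Test]
)
SELECT [Id], [Date], NextDate
FROM Ordered
WHERE NextDate IS NOT NULL               -- the newest row has no successor yet
  AND NextDate <> DATEADD(minute, 1, [Date]);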
Borrowing TheQ's sample data, you can use a gaps-and-islands approach: the minute offset minus DENSE_RANK stays constant within an unbroken run of minutes, so grouping by that difference returns the start and end of each contiguous island:
WITH T
AS (SELECT *,
DATEDIFF(MINUTE, '20000101', [Date]) -
DENSE_RANK() OVER (ORDER BY [Date]) AS G
FROM Test)
SELECT MIN([Date]) AS StartIsland,
MAX([Date]) AS EndIsland
FROM T
GROUP BY G

SQL Pivot question

I'm having a hard time getting my head around a query I'm trying to build with SQL Server 2005.
I have a table, let's call it sales:
SaleId (int) (pk) EmployeeId (int) SaleDate(datetime)
I want to produce a report listing the total number of sales by an employee for each day in a given date range.
So, for example, I want to see all sales from December 1st 2009 to December 31st 2009, with output like:
EmployeeId  Dec1  Dec2  Dec3  Dec4
1           10    10    1     20
2           25    10    2     2
...etc. However, the dates need to be flexible.
I've messed around with using PIVOT but can't quite seem to get it; any ideas welcome!
Here's a complete example. You can change the date range to fit your needs.
use sandbox;
create table sales (SaleId int primary key, EmployeeId int, SaleAmt float, SaleDate date);
insert into sales values (1,1,10,'2009-12-1');
insert into sales values (2,1,10,'2009-12-2');
insert into sales values (3,1,1,'2009-12-3');
insert into sales values (4,1,20,'2009-12-4');
insert into sales values (5,2,25,'2009-12-1');
insert into sales values (6,2,10,'2009-12-2');
insert into sales values (7,2,2,'2009-12-3');
insert into sales values (8,2,2,'2009-12-4');
SELECT * FROM
(SELECT EmployeeID, DATEPART(d, SaleDate) SaleDay, SaleAmt
FROM sales
WHERE SaleDate between '20091201' and '20091204'
) src
PIVOT (SUM(SaleAmt) FOR SaleDay
IN ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31])) AS pvt;
Results (actually 31 columns, one for each possible day of the month, will be listed, but I'm just showing the first 4):
EmployeeID  1   2   3   4
1           10  10  1   20
2           25  10  2   2
I tinkered a bit, and I think this is how you can do it with PIVOT:
select employeeid
, [2009/12/01] as Dec1
, [2009/12/02] as Dec2
, [2009/12/03] as Dec3
, [2009/12/04] as Dec4
from sales pivot (
count(saleid)
for saledate
in ([2009/12/01],[2009/12/02],[2009/12/03],[2009/12/04])
) as pvt
(this is my table:
CREATE TABLE [dbo].[sales](
[saleid] [int] NULL,
[employeeid] [int] NULL,
[saledate] [date] NULL
)
data is: 10 rows for '2009/12/01' for emp1, 25 rows for '2009/12/01' for emp2, 10 rows for '2009/12/02' for emp1, etc.)
Now, I must say, this is the first time I have used PIVOT and perhaps I am not grasping it, but this seems pretty useless to me. I mean, what good is it to have a crosstab if you cannot specify the columns dynamically?
EDIT: OK, dcp's answer does it. The trick is that you don't have to explicitly name the columns in the SELECT list: * will correctly expand to a column for the first 'unpivoted' column, plus a column for each value that appears in the FOR..IN clause of the PIVOT construct.
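When the column list really does have to be dynamic, the usual workaround is to build the IN list with dynamic SQL. The following is only a sketch, assuming the sales table from the first answer; FOR XML PATH is used for the string concatenation since the question targets SQL Server 2005:

-- Sketch: build the pivot column list at runtime, then execute the pivot dynamically.
DECLARE @cols nvarchar(max), @sql nvarchar(max);

-- Collect the distinct day numbers in the range as a bracketed, comma-separated list.
SELECT @cols = STUFF((
    SELECT DISTINCT ',' + QUOTENAME(CAST(DATEPART(d, SaleDate) AS varchar(2)))
    FROM sales
    WHERE SaleDate BETWEEN '20091201' AND '20091204'
    FOR XML PATH('')), 1, 1, '');

SET @sql = N'SELECT EmployeeID, ' + @cols + N'
FROM (SELECT EmployeeID, DATEPART(d, SaleDate) AS SaleDay, SaleAmt
      FROM sales
      WHERE SaleDate BETWEEN ''20091201'' AND ''20091204'') src
PIVOT (SUM(SaleAmt) FOR SaleDay IN (' + @cols + N')) AS pvt;';

EXEC sp_executesql @sql;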
