Related
I have a table which I want to get the latest entry for each group. Here's the table:
DocumentStatusLogs Table
|ID| DocumentID | Status | DateCreated |
| 2| 1 | S1 | 7/29/2011 |
| 3| 1 | S2 | 7/30/2011 |
| 6| 1 | S1 | 8/02/2011 |
| 1| 2 | S1 | 7/28/2011 |
| 4| 2 | S2 | 7/30/2011 |
| 5| 2 | S3 | 8/01/2011 |
| 6| 3 | S1 | 8/02/2011 |
The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.
My preferred output:
| DocumentID | Status | DateCreated |
| 1 | S1 | 8/02/2011 |
| 2 | S3 | 8/01/2011 |
| 3 | S1 | 8/02/2011 |
Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:
SELECT
DocumentID,
GetOnlyTheTop(Status),
GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC
If such function doesn't exist, is there any way I can achieve the output I want?
Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?
Please see the parent table for more information:
Current Documents Table
| DocumentID | Title | Content | DateCreated |
| 1 | TitleA | ... | ... |
| 2 | TitleB | ... | ... |
| 3 | TitleC | ... | ... |
Should the parent table be like this so that I can easily access its status?
| DocumentID | Title | Content | DateCreated | CurrentStatus |
| 1 | TitleA | ... | ... | s1 |
| 2 | TitleB | ... | ... | s3 |
| 3 | TitleC | ... | ... | s1 |
UPDATE
I just learned how to use "apply" which makes it easier to address such problems.
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead
As for normalised or not, it depends if you want to:
maintain status in 2 places
preserve status history
...
As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.
I just learned how to use cross apply. Here's how to use it in this scenario:
select d.DocumentID, ds.Status, ds.DateCreated
from Documents as d
cross apply
(select top 1 Status, DateCreated
from DocumentStatusLogs
where DocumentID = d.DocumentId
order by DateCreated desc) as ds
I know this is an old thread but the TOP 1 WITH TIES solutions is quite nice and might be helpful to some reading through the solutions.
select top 1 with ties
DocumentID
,Status
,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)
The select top 1 with ties clause tells SQL Server that you want to return the first row per group. But how does SQL Server know how to group up the data? This is where the order by row_number() over (partition by DocumentID order by DateCreated desc comes in. The column/columns after partition by defines how SQL Server groups up the data. Within each group, the rows will be sorted based on the order by columns. Once sorted, the top row in each group will be returned in the query.
More about the TOP clause can be found here.
I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using the CROSS APPLY These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.
Query Time for 6500 (ms) Time for 137M(ms)
CROSS APPLY 17.9 17.9
SELECT WHERE col = (SELECT MAX(COL)…) 6.6 854.4
DENSE_RANK() OVER PARTITION 6.6 907.1
I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.
If you're worried about performance, you can also do this with MAX():
SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE ID = D.ID)
ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. Should drastically speed up your query.
SELECT * FROM
DocumentStatusLogs JOIN (
SELECT DocumentID, MAX(DateCreated) DateCreated
FROM DocumentStatusLogs
GROUP BY DocumentID
) max_date USING (DocumentID, DateCreated)
What database server? This code doesn't work on all of them.
Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.
BTW, if you already have the DateCreated column in the Documents table you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).
Edit: MsSQL does not support USING, so change it to:
ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
This is one of the most easily found question on the topic, so I wanted to give a modern answer to the it (both for my reference and to help others out). By using first_value and over you can make short work of the above query:
Select distinct DocumentID
, first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
, first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs
This should work in Sql Server 2008 and up. First_value can be thought of as a way to accomplish Select Top 1 when using an over clause. Over allows grouping in the select list so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.
Here are 3 separate approaches to the problem in hand along with the best choices of indexing for each of those queries (please try out the indexes yourselves and see the logical read, elapsed time, execution plan. I have provided the suggestions from my experience on such queries without executing for this specific problem).
Approach 1: Using ROW_NUMBER(). If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM DocumentStatusLogs
)
SELECT ID
,DocumentID
,Status
,DateCreated
FROM CTE
WHERE RN = 1;
Approach 2: Using FIRST_VALUE. If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.
SELECT DISTINCT
ID = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
,DocumentID
,Status = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
,DateCreated = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM DocumentStatusLogs;
Approach 3: Using CROSS APPLY. Creating rowstore index on DocumentStatusLogs table covering the columns used in the query should be enough to cover the query without need of a columnstore index.
SELECT DISTINCT
ID = CA.ID
,DocumentID = D.DocumentID
,Status = CA.Status
,DateCreated = CA.DateCreated
FROM DocumentStatusLogs D
CROSS APPLY (
SELECT TOP 1 I.*
FROM DocumentStatusLogs I
WHERE I.DocumentID = D.DocumentID
ORDER BY I.DateCreated DESC
) CA;
This is quite an old thread, but I thought I'd throw my two cents in just the same as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan it's obvious that the issue is that it requires a SORT operation which slows things down significantly.
Here's an alternative that I lifted from the entity framework that needs no SORT operation and does a NON-Clustered Index search. This reduces the execution time down to < 2 seconds on the aforementioned record set.
SELECT
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM [dbo].[DocumentStatusLogs] AS [Extent2]
WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
) AS [Project2]
ORDER BY [Project2].[ID] DESC) AS [Limit1]
Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and the DateCreated is set to the current date with each insert, then even without running with my query above you could actually get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated as this will provide an identical sort order and it's a faster sort.
My code to select top 1 from each group
select a.* from #DocumentStatusLogs a where
datecreated in( select top 1 datecreated from #DocumentStatusLogs b
where
a.documentid = b.documentid
order by datecreated desc
)
This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE statement and partition is doc_id):
SELECT T.doc_id, T.status, T.date_created FROM
(
SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) T
WHERE T.rnk = 1;
CROSS APPLY was the method I used for my solution, as it worked for me, and for my clients needs. And from what I've read, should provide the best overall performance should their database grow substantially.
Verifying Clint's awesome and correct answer from above:
The performance between the two queries below is interesting. 52% being the top one. And 48% being the second one. A 4% improvement in performance using DISTINCT instead of ORDER BY. But ORDER BY has the advantage to sort by multiple columns.
IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END
CREATE TABLE #DocumentStatusLogs (
[ID] int NOT NULL,
[DocumentID] int NOT NULL,
[Status] varchar(20),
[DateCreated] datetime
)
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
Option 1:
SELECT
[Extent1].[ID],
[Extent1].[DocumentID],
[Extent1].[Status],
[Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
OUTER APPLY (
SELECT TOP 1
[Extent2].[ID],
[Extent2].[DocumentID],
[Extent2].[Status],
[Extent2].[DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])
Option 2:
SELECT
[Limit1].[DocumentID] AS [ID],
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (
SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
OUTER APPLY (
SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (
SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
) AS [Project2]
ORDER BY [Project2].[ID] DESC
) AS [Limit1]
In Microsoft SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.
Option 1 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Option 2 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Note:
I tend to use APPLY when I want a join to be 1-to-(1 of many).
I use a JOIN if I want the join to be 1-to-many, or many-to-many.
I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.
I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!
SELECT o.*
FROM `DocumentStatusLogs` o
LEFT JOIN `DocumentStatusLogs` b
ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
WHERE b.DocumentID is NULL ;
If you want to return only recent document order by DateCreated, it will return only top 1 document by DocumentID
I believe this can be done just like this. This might need some tweaking but you can just select the max from the group.
These answers are overkill..
SELECT
d.DocumentID,
MAX(d.Status),
MAX(d1.DateCreated)
FROM DocumentStatusLogs d, DocumentStatusLogs d1
USING DocumentID
GROUP BY 1
ORDER BY 3 DESC
In scenarios where you want to avoid using row_count(), you can also use a left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
left join DocumentStatusLogs filter
ON ds.DocumentID = filter.DocumentID
-- Match any row that has another row that was created after it.
AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched
where filter.DocumentID is null
For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
SELECT filter.ID
FROM DocumentStatusLogs filter
WHERE ds.DocumentID = filter.DocumentID
AND ds.DateCreated < filter.DateCreated)
Note, the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".
Both of these queries tend to be more "expensive" than the row_count() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.
SELECT documentid,
status,
datecreated
FROM documentstatuslogs dlogs
WHERE status = (SELECT status
FROM documentstatuslogs
WHERE documentid = dlogs.documentid
ORDER BY datecreated DESC
LIMIT 1)
Some database engines* are starting to support the QUALIFY clause that allows to filter the result of window functions (which the accepted answer uses).
So the accepted answer can become
SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1
See this article for an in depth explanation: https://jrandrews.net/the-joy-of-qualify
You can use this tool to see which database support this clause: https://www.jooq.org/translate/
There is an option to transform the qualify clause when the target dialect does not support it.
*Teradata, BigQuery, H2, Snowflake...
Try this:
SELECT [DocumentID]
,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
FROM (
SELECT [DocumentID]
,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
FROM DocumentStatusLogs
GROUP BY DocumentID
) AS [tmpQry]
I have a table which I want to get the latest entry for each group. Here's the table:
DocumentStatusLogs Table
|ID| DocumentID | Status | DateCreated |
| 2| 1 | S1 | 7/29/2011 |
| 3| 1 | S2 | 7/30/2011 |
| 6| 1 | S1 | 8/02/2011 |
| 1| 2 | S1 | 7/28/2011 |
| 4| 2 | S2 | 7/30/2011 |
| 5| 2 | S3 | 8/01/2011 |
| 6| 3 | S1 | 8/02/2011 |
The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.
My preferred output:
| DocumentID | Status | DateCreated |
| 1 | S1 | 8/02/2011 |
| 2 | S3 | 8/01/2011 |
| 3 | S1 | 8/02/2011 |
Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:
SELECT
DocumentID,
GetOnlyTheTop(Status),
GetOnlyTheTop(DateCreated)
FROM DocumentStatusLogs
GROUP BY DocumentID
ORDER BY DateCreated DESC
If such function doesn't exist, is there any way I can achieve the output I want?
Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?
Please see the parent table for more information:
Current Documents Table
| DocumentID | Title | Content | DateCreated |
| 1 | TitleA | ... | ... |
| 2 | TitleB | ... | ... |
| 3 | TitleC | ... | ... |
Should the parent table be like this so that I can easily access its status?
| DocumentID | Title | Content | DateCreated | CurrentStatus |
| 1 | TitleA | ... | ... | s1 |
| 2 | TitleB | ... | ... | s3 |
| 3 | TitleC | ... | ... | s1 |
UPDATE
I just learned how to use "apply" which makes it easier to address such problems.
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead
As for normalised or not, it depends if you want to:
maintain status in 2 places
preserve status history
...
As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you'd need a trigger to maintain "status" in the parent. or drop this status history table.
I just learned how to use cross apply. Here's how to use it in this scenario:
select d.DocumentID, ds.Status, ds.DateCreated
from Documents as d
cross apply
(select top 1 Status, DateCreated
from DocumentStatusLogs
where DocumentID = d.DocumentId
order by DateCreated desc) as ds
I know this is an old thread but the TOP 1 WITH TIES solutions is quite nice and might be helpful to some reading through the solutions.
select top 1 with ties
DocumentID
,Status
,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)
The select top 1 with ties clause tells SQL Server that you want to return the first row per group. But how does SQL Server know how to group up the data? This is where the order by row_number() over (partition by DocumentID order by DateCreated desc comes in. The column/columns after partition by defines how SQL Server groups up the data. Within each group, the rows will be sorted based on the order by columns. Once sorted, the top row in each group will be returned in the query.
More about the TOP clause can be found here.
I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using the CROSS APPLY These tests were run against SQL Server 2008-R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.
Query Time for 6500 (ms) Time for 137M(ms)
CROSS APPLY 17.9 17.9
SELECT WHERE col = (SELECT MAX(COL)…) 6.6 854.4
DENSE_RANK() OVER PARTITION 6.6 907.1
I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.
If you're worried about performance, you can also do this with MAX():
SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE ID = D.ID)
ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. Should drastically speed up your query.
SELECT * FROM
DocumentStatusLogs JOIN (
SELECT DocumentID, MAX(DateCreated) DateCreated
FROM DocumentStatusLogs
GROUP BY DocumentID
) max_date USING (DocumentID, DateCreated)
What database server? This code doesn't work on all of them.
Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.
BTW, if you already have the DateCreated column in the Documents table you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).
Edit: MsSQL does not support USING, so change it to:
ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
This is one of the most easily found question on the topic, so I wanted to give a modern answer to the it (both for my reference and to help others out). By using first_value and over you can make short work of the above query:
Select distinct DocumentID
, first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
, first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs
This should work in Sql Server 2008 and up. First_value can be thought of as a way to accomplish Select Top 1 when using an over clause. Over allows grouping in the select list so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.
Here are 3 separate approaches to the problem in hand along with the best choices of indexing for each of those queries (please try out the indexes yourselves and see the logical read, elapsed time, execution plan. I have provided the suggestions from my experience on such queries without executing for this specific problem).
Approach 1: Using ROW_NUMBER(). If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM DocumentStatusLogs
)
SELECT ID
,DocumentID
,Status
,DateCreated
FROM CTE
WHERE RN = 1;
Approach 2: Using FIRST_VALUE. If rowstore index is not being able to enhance the performance, you can try out nonclustered/clustered columnstore index as for queries with aggregation and grouping and for tables which are ordered by in different columns all the times, columnstore index usually is the best choice.
SELECT DISTINCT
ID = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
,DocumentID
,Status = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
,DateCreated = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM DocumentStatusLogs;
Approach 3: Using CROSS APPLY. Creating rowstore index on DocumentStatusLogs table covering the columns used in the query should be enough to cover the query without need of a columnstore index.
SELECT DISTINCT
ID = CA.ID
,DocumentID = D.DocumentID
,Status = CA.Status
,DateCreated = CA.DateCreated
FROM DocumentStatusLogs D
CROSS APPLY (
SELECT TOP 1 I.*
FROM DocumentStatusLogs I
WHERE I.DocumentID = D.DocumentID
ORDER BY I.DateCreated DESC
) CA;
This is quite an old thread, but I thought I'd throw my two cents in just the same as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan it's obvious that the issue is that it requires a SORT operation which slows things down significantly.
Here's an alternative that I lifted from the entity framework that needs no SORT operation and does a NON-Clustered Index search. This reduces the execution time down to < 2 seconds on the aforementioned record set.
SELECT
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM [dbo].[DocumentStatusLogs] AS [Extent2]
WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
) AS [Project2]
ORDER BY [Project2].[ID] DESC) AS [Limit1]
Now I'm assuming something that isn't entirely specified in the original question, but if your table design is such that your ID column is an auto-increment ID, and the DateCreated is set to the current date with each insert, then even without running with my query above you could actually get a sizable performance boost to gbn's solution (about half the execution time) just from ordering on ID instead of ordering on DateCreated as this will provide an identical sort order and it's a faster sort.
My code to select top 1 from each group
select a.* from #DocumentStatusLogs a where
datecreated in( select top 1 datecreated from #DocumentStatusLogs b
where
a.documentid = b.documentid
order by datecreated desc
)
This solution can be used to get the TOP N most recent rows for each partition (in the example, N is 1 in the WHERE statement and partition is doc_id):
SELECT T.doc_id, T.status, T.date_created FROM
(
SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) T
WHERE T.rnk = 1;
CROSS APPLY was the method I used for my solution, as it worked for me, and for my clients needs. And from what I've read, should provide the best overall performance should their database grow substantially.
Verifying Clint's awesome and correct answer from above:
The performance between the two queries below is interesting. 52% being the top one. And 48% being the second one. A 4% improvement in performance using DISTINCT instead of ORDER BY. But ORDER BY has the advantage to sort by multiple columns.
IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END
CREATE TABLE #DocumentStatusLogs (
[ID] int NOT NULL,
[DocumentID] int NOT NULL,
[Status] varchar(20),
[DateCreated] datetime
)
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
Option 1:
SELECT
[Extent1].[ID],
[Extent1].[DocumentID],
[Extent1].[Status],
[Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
OUTER APPLY (
SELECT TOP 1
[Extent2].[ID],
[Extent2].[DocumentID],
[Extent2].[Status],
[Extent2].[DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])
Option 2:
SELECT
[Limit1].[DocumentID] AS [ID],
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (
SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
OUTER APPLY (
SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (
SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
) AS [Project2]
ORDER BY [Project2].[ID] DESC
) AS [Limit1]
In Microsoft SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, right click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.
Option 1 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Option 2 Results:
ID DocumentID Status DateCreated
6 1 S1 8/2/11 3:00
5 2 S3 8/1/11 6:00
6 3 S1 8/2/11 7:00
Note:
I tend to use APPLY when I want a join to be 1-to-(1 of many).
I use a JOIN if I want the join to be 1-to-many, or many-to-many.
I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am ok with the windowing performance penalty.
I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!
SELECT o.*
FROM `DocumentStatusLogs` o
LEFT JOIN `DocumentStatusLogs` b
ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
WHERE b.DocumentID is NULL ;
If you want to return only recent document order by DateCreated, it will return only top 1 document by DocumentID
I believe this can be done just like this. This might need some tweaking but you can just select the max from the group.
These answers are overkill..
SELECT
d.DocumentID,
MAX(d.Status),
MAX(d1.DateCreated)
FROM DocumentStatusLogs d, DocumentStatusLogs d1
USING DocumentID
GROUP BY 1
ORDER BY 3 DESC
In scenarios where you want to avoid using row_count(), you can also use a left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
left join DocumentStatusLogs filter
ON ds.DocumentID = filter.DocumentID
-- Match any row that has another row that was created after it.
AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched
where filter.DocumentID is null
For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
SELECT filter.ID
FROM DocumentStatusLogs filter
WHERE ds.DocumentID = filter.DocumentID
AND ds.DateCreated < filter.DateCreated)
Note, the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".
Both of these queries tend to be more "expensive" than the row_count() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.
SELECT documentid,
status,
datecreated
FROM documentstatuslogs dlogs
WHERE status = (SELECT status
FROM documentstatuslogs
WHERE documentid = dlogs.documentid
ORDER BY datecreated DESC
LIMIT 1)
Some database engines* are starting to support the QUALIFY clause that allows to filter the result of window functions (which the accepted answer uses).
So the accepted answer can become
SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1
See this article for an in depth explanation: https://jrandrews.net/the-joy-of-qualify
You can use this tool to see which database support this clause: https://www.jooq.org/translate/
There is an option to transform the qualify clause when the target dialect does not support it.
*Teradata, BigQuery, H2, Snowflake...
Try this:
SELECT [DocumentID]
,[tmpRez].value('/x[2]', 'varchar(20)') AS [Status]
,[tmpRez].value('/x[3]', 'datetime') AS [DateCreated]
FROM (
SELECT [DocumentID]
,cast('<x>' + max(cast([ID] AS VARCHAR(10)) + '</x><x>' + [Status] + '</x><x>' + cast([DateCreated] AS VARCHAR(20))) + '</x>' AS XML) AS [tmpRez]
FROM DocumentStatusLogs
GROUP BY DocumentID
) AS [tmpQry]
I have a table called body_scan that looks like this:
body_no tag
--------------------
1 noscan
2 noscan
3 missing
4 noscan
5 missing
I also have a list that I can load into a temp table like so
tag_no
------
aaa
bbb
ccc
What I need to be able to do is to update the body_scan table with the tag numbers in the temporary table.
You will notice that there are only 3 tags in the temp table but 5 in the body_scan table. I need to update the tag value "noscan" with values from the temp table and leave the missing as they are..
The order of the tags in the temporary table is the same as the order of body_no from the body_scan table.
So yes, I did consider the row_number() function. But I'm just not 100% sure how to define the join correctly..
How do I achieve this please?
The desired result is :
body_no tag
-------------------
1 aaa
2 bbb
3 missing
4 ccc
5 missing
Firstly, you need to preserve the input file order of data by adding an identity field to the temp_table (note that some ETL tools insert data in parallel and that messes things up so you might even need to add this column to the file)
Once you've done that, you need to generate a key in body_scan that you can join to. This is simply ROW_NUMBER() over the existing table, excluding the missing rows
This returns the row and what it should be matched to in temp_table
SELECT
body_no,
ROW_NUMBER() OVER (ORDER BY body_no) RN
FROM body_scan
WHERE tag<> 'missing';
This joins in the temp table (assumes your ordinal column is called RowID)
SELECT T1.body_no, T1.tag, T1.RN, T2.tag_no
FROM
(
SELECT
body_no,tag,
ROW_NUMBER() OVER (ORDER BY body_no) RN
FROM body_scan
WHERE tag<> 'missing'
) T1
INNER JOIN
temp_table T2
ON T1.RN=T2.RowID;
This updates it back to the table:
UPDATE TGT
SET tag=SRC.tag_no
FROM body_scan TGT
INNER JOIN
(
SELECT T1.body_no, T2.tag_no
FROM
(
SELECT
body_no,tag,
ROW_NUMBER() OVER (ORDER BY body_no) RN
FROM body_scan
WHERE tag<> 'missing'
) T1
INNER JOIN
temp_table T2
ON T1.RN=T2.RowID
) SRC
ON SRC.body_no=TGT.body_no;
(There's half a dozen ways to write that final statement but I prefer this way as you can see the dataset you're updating from in the subselect)
I cant understand your explanation and command discussion. I workout(in SQL 2012) to achieve your OUTPUT table. As,
update a
set a.tag = t.tag
from (
select m.*, ROW_NUMBER() over(partition by m.tag order by m.rn)trn from(
select *, row_number() over(partition by (select null) order by (select null)) rn from body_scan --set order what the order of actual table's order
) m --set row number for noscan rows
) a
join(
select *, ROW_NUMBER() over(order by (select null)) rn from #temp --set order what the order of actual table order
) t
on a.trn = t.rn and a.tag <> 'missing' -- join to noscan rows using row numbers
OUTPUT:
body_no tag
--------------
1 aaa
2 bbb
3 missing
4 ccc
5 missing
declare #t table
(
id int,
SomeNumt int
)
insert into #t
select 1,10
union
select 2,12
union
select 3,3
union
select 4,15
union
select 5,23
select * from #t
the above select returns me the following.
id SomeNumt
1 10
2 12
3 3
4 15
5 23
How do I get the following:
id srome CumSrome
1 10 10
2 12 22
3 3 25
4 15 40
5 23 63
select t1.id, t1.SomeNumt, SUM(t2.SomeNumt) as sum
from #t t1
inner join #t t2 on t1.id >= t2.id
group by t1.id, t1.SomeNumt
order by t1.id
SQL Fiddle example
Output
| ID | SOMENUMT | SUM |
-----------------------
| 1 | 10 | 10 |
| 2 | 12 | 22 |
| 3 | 3 | 25 |
| 4 | 15 | 40 |
| 5 | 23 | 63 |
Edit: this is a generalized solution that will work across most db platforms. When there is a better solution available for your specific platform (e.g., gareth's), use it!
The latest version of SQL Server (2012) permits the following.
SELECT
RowID,
Col1,
SUM(Col1) OVER(ORDER BY RowId ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Col2
FROM tablehh
ORDER BY RowId
or
SELECT
GroupID,
RowID,
Col1,
SUM(Col1) OVER(PARTITION BY GroupID ORDER BY RowId ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Col2
FROM tablehh
ORDER BY RowId
This is even faster. Partitioned version completes in 34 seconds over 5 million rows for me.
Thanks to Peso, who commented on the SQL Team thread referred to in another answer.
For SQL Server 2012 onwards it could be easy:
SELECT id, SomeNumt, sum(SomeNumt) OVER (ORDER BY id) as CumSrome FROM #t
because ORDER BY clause for SUM by default means RANGE UNBOUNDED PRECEDING AND CURRENT ROW for window frame ("General Remarks" at https://msdn.microsoft.com/en-us/library/ms189461.aspx)
Let's first create a table with dummy data:
Create Table CUMULATIVESUM (id tinyint , SomeValue tinyint)
Now let's insert some data into the table;
Insert Into CUMULATIVESUM
Select 1, 10 union
Select 2, 2 union
Select 3, 6 union
Select 4, 10
Here I am joining same table (self joining)
Select c1.ID, c1.SomeValue, c2.SomeValue
From CumulativeSum c1, CumulativeSum c2
Where c1.id >= c2.ID
Order By c1.id Asc
Result:
ID SomeValue SomeValue
-------------------------
1 10 10
2 2 10
2 2 2
3 6 10
3 6 2
3 6 6
4 10 10
4 10 2
4 10 6
4 10 10
Here we go now just sum the Somevalue of t2 and we`ll get the answer:
Select c1.ID, c1.SomeValue, Sum(c2.SomeValue) CumulativeSumValue
From CumulativeSum c1, CumulativeSum c2
Where c1.id >= c2.ID
Group By c1.ID, c1.SomeValue
Order By c1.id Asc
For SQL Server 2012 and above (much better performance):
Select
c1.ID, c1.SomeValue,
Sum (SomeValue) Over (Order By c1.ID )
From CumulativeSum c1
Order By c1.id Asc
Desired result:
ID SomeValue CumlativeSumValue
---------------------------------
1 10 10
2 2 12
3 6 18
4 10 28
Drop Table CumulativeSum
A CTE version, just for fun:
;
WITH abcd
AS ( SELECT id
,SomeNumt
,SomeNumt AS MySum
FROM #t
WHERE id = 1
UNION ALL
SELECT t.id
,t.SomeNumt
,t.SomeNumt + a.MySum AS MySum
FROM #t AS t
JOIN abcd AS a ON a.id = t.id - 1
)
SELECT * FROM abcd
OPTION ( MAXRECURSION 1000 ) -- limit recursion here, or 0 for no limit.
Returns:
id SomeNumt MySum
----------- ----------- -----------
1 10 10
2 12 22
3 3 25
4 15 40
5 23 63
Late answer but showing one more possibility...
Cumulative Sum generation can be more optimized with the CROSS APPLY logic.
Works better than the INNER JOIN & OVER Clause when analyzed the actual query plan ...
/* Create table & populate data */
IF OBJECT_ID('tempdb..#TMP') IS NOT NULL
DROP TABLE #TMP
SELECT * INTO #TMP
FROM (
SELECT 1 AS id
UNION
SELECT 2 AS id
UNION
SELECT 3 AS id
UNION
SELECT 4 AS id
UNION
SELECT 5 AS id
) Tab
/* Using CROSS APPLY
Query cost relative to the batch 17%
*/
SELECT T1.id,
T2.CumSum
FROM #TMP T1
CROSS APPLY (
SELECT SUM(T2.id) AS CumSum
FROM #TMP T2
WHERE T1.id >= T2.id
) T2
/* Using INNER JOIN
Query cost relative to the batch 46%
*/
SELECT T1.id,
SUM(T2.id) CumSum
FROM #TMP T1
INNER JOIN #TMP T2
ON T1.id > = T2.id
GROUP BY T1.id
/* Using OVER clause
Query cost relative to the batch 37%
*/
SELECT T1.id,
SUM(T1.id) OVER( PARTITION BY id)
FROM #TMP T1
Output:-
id CumSum
------- -------
1 1
2 3
3 6
4 10
5 15
Select
*,
(Select Sum(SOMENUMT)
From #t S
Where S.id <= M.id)
From #t M
You can use this simple query for progressive calculation :
select
id
,SomeNumt
,sum(SomeNumt) over(order by id ROWS between UNBOUNDED PRECEDING and CURRENT ROW) as CumSrome
from #t
There is a much faster CTE implementation available in this excellent post:
http://weblogs.sqlteam.com/mladenp/archive/2009/07/28/SQL-Server-2005-Fast-Running-Totals.aspx
The problem in this thread can be expressed like this:
DECLARE #RT INT
SELECT #RT = 0
;
WITH abcd
AS ( SELECT TOP 100 percent
id
,SomeNumt
,MySum
order by id
)
update abcd
set #RT = MySum = #RT + SomeNumt
output inserted.*
For Ex: IF you have a table with two columns one is ID and second is number and wants to find out the cumulative sum.
SELECT ID,Number,SUM(Number)OVER(ORDER BY ID) FROM T
Once the table is created -
select
A.id, A.SomeNumt, SUM(B.SomeNumt) as sum
from #t A, #t B where A.id >= B.id
group by A.id, A.SomeNumt
order by A.id
The SQL solution wich combines "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" and "SUM" did exactly what i wanted to achieve.
Thank you so much!
If it can help anyone, here was my case. I wanted to cumulate +1 in a column whenever a maker is found as "Some Maker" (example). If not, no increment but show previous increment result.
So this piece of SQL:
SUM( CASE [rmaker] WHEN 'Some Maker' THEN 1 ELSE 0 END)
OVER
(PARTITION BY UserID ORDER BY UserID,[rrank] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Cumul_CNT
Allowed me to get something like this:
User 1 Rank1 MakerA 0
User 1 Rank2 MakerB 0
User 1 Rank3 Some Maker 1
User 1 Rank4 Some Maker 2
User 1 Rank5 MakerC 2
User 1 Rank6 Some Maker 3
User 2 Rank1 MakerA 0
User 2 Rank2 SomeMaker 1
Explanation of above: It starts the count of "some maker" with 0, Some Maker is found and we do +1. For User 1, MakerC is found so we dont do +1 but instead vertical count of Some Maker is stuck to 2 until next row.
Partitioning is by User so when we change user, cumulative count is back to zero.
I am at work, I dont want any merit on this answer, just say thank you and show my example in case someone is in the same situation. I was trying to combine SUM and PARTITION but the amazing syntax "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW" completed the task.
Thanks!
Groaker
Above (Pre-SQL12) we see examples like this:-
SELECT
T1.id, SUM(T2.id) AS CumSum
FROM
#TMP T1
JOIN #TMP T2 ON T2.id < = T1.id
GROUP BY
T1.id
More efficient...
SELECT
T1.id, SUM(T2.id) + T1.id AS CumSum
FROM
#TMP T1
JOIN #TMP T2 ON T2.id < T1.id
GROUP BY
T1.id
Try this
select
t.id,
t.SomeNumt,
sum(t.SomeNumt) Over (Order by t.id asc Rows Between Unbounded Preceding and Current Row) as cum
from
#t t
group by
t.id,
t.SomeNumt
order by
t.id asc;
Try this:
CREATE TABLE #t(
[name] varchar NULL,
[val] [int] NULL,
[ID] [int] NULL
) ON [PRIMARY]
insert into #t (id,name,val) values
(1,'A',10), (2,'B',20), (3,'C',30)
select t1.id, t1.val, SUM(t2.val) as cumSum
from #t t1 inner join #t t2 on t1.id >= t2.id
group by t1.id, t1.val order by t1.id
Without using any type of JOIN cumulative salary for a person fetch by using follow query:
SELECT * , (
SELECT SUM( salary )
FROM `abc` AS table1
WHERE table1.ID <= `abc`.ID
AND table1.name = `abc`.Name
) AS cum
FROM `abc`
ORDER BY Name
This question already has answers here:
Fetch the rows which have the Max value for a column for each distinct value of another column
(35 answers)
Closed 6 years ago.
I am trying to write a query to pick one entry for each item for each month but the latest in the month from the following table:
Name | Date | Value
a |2015-01-01 | 1
a |2015-01-02 | 2
b |2015-01-03 | 1
b |2015-01-04 | 1
b |2015-01-03 | 3
c |2015-01-02 | 2
c |2015-01-29 | 10
a |2015-02-10 | 2
a |2015-02-20 | 1
c |2015-02-10 | 2
c |2015-02-22 | 23
b |2015-02-25 | 1
b |2015-02-19 | 2
return should be:
a |2015-01-02 | 2
b |2015-01-04 | 1
c |2015-01-29 | 10
a |2015-02-20 | 1
b |2015-02-25 | 1
c |2015-02-22 | 23
I wonder how would this be achieved instead of sending multiple queries to SQL server for each month I would like to load all the values with one query then filter the collection on the memory. Otherwise I would end up writing a query as below:
SELECT Name,Date, Value FROM MyTable mt
INNER JOIN (
select max(Date) as MaxDate
FROM [MyTable] m WHERE YEAR(Date) =YEAR(#date)
AND MONTH(Date)=MONTH(#date)) mx ON t.Date = mx.MaxDate)
And this query needs to be run for each month.
Any better idea to return all entries with a single query?
Thanks,
Try grouping by year and month in the derived table:
SELECT t1.Name, t1.[Date], t1.Value
FROM MyTable t1
INNER JOIN (
SELECT Name, YEAR(Date) AS y, MONTH([Date]) AS m, MAX([Date]) as MaxDate
FROM MyTable
GROUP BY Name, YEAR(Date), MONTH([Date])
) t2 ON t1.Name = t2.Name AND
YEAR(t1.[Date]) = t2.y AND MONTH(t1.[Date]) = t2.m AND
t1.[Date] = t2.MaxDate
SELECT *
FROM (
SELECT NAME, DATE, VALUE,
ROW_NUMBER() OVER (PARTITION BY NAME, YEAR(Date), MONTH(Date)
ORDER BY Date DESC) rn
FROM MyTable) AS t
WHERE t.rn = 1
Assuming that you are using a SQL Server version that supports it, you can use the ROW_NUMBER() windowing function to return a sequence number for each row, then you can subsequently use that to restrict to only the rows that you require.
SELECT [Name],[Date],[Value]
,ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY [Date] DESC) AS [Seq]
FROM myTable
Things to consider:
What happens when there is a tie? ROW_NUMBER will always return a sequence number, but if your data has > 1 row at the same Date value, the order will be arbritrary. To solve this add additional tie-break ORDER BY entries
How do I filter this? Put it into a Common Table Expression, Inline View or Real View
I think you need a correlated query once you have a set of distinct (Name, Month). There are various ways of doing this, one is to use cross apply:
select *
from (select distinct Name, Month(Date) as Month
from theTable) itemMonths
cross apply (select Max(value)
from theTable t
where Month(t.Date) = itemMonths.Month
and t.Name = itemMonths.Name)
You could try the following:
WITH MyTable AS
(SELECT 'a' AS name, GETDATE() AS date, 1 AS value
UNION ALL
SELECT 'a', GETDATE()+1, 2
)
, res AS (
SELECT Name,date,MAX(Date) OVER(PARTITION BY Name, DATEPART(yyyy,date), DATEPART(mm, date)) AS max_date , Value FROM MyTable
)
SELECT name,date,res.value FROM res WHERE date=max_date
You still need a filter though as the Max Over will return all rows.
If you were using Teradata I'd suggest using the Qualify Clause but Itzik hasn't had any luck getting this ported to SQL server!
https://connect.microsoft.com/SQLServer/feedback/details/532474
Use Cross apply
SELECT b.*
FROM mytable mt
CROSS apply (SELECT TOP 1 NAME, date, value
FROM [mytable] m
WHERE m.NAME = mt.NAME
AND Month(m.date) = Month(mt.date)
AND Year(m.date) = Year(mt.date)
ORDER BY m.date DESC) b