Grouping by or iterating through partitions in SQL - sql-server

Two-part question regarding partitioning in SQL.
In T-SQL when you use PARTITION BY is there a way to assign a unique number to each partition, in addition to something like row_number()?
E.g. row_number() would yield,
Action Timestamp RowNum
A '2013-1-10' 1
A '2013-1-11' 2
B '2013-1-12' 1
B '2013-1-13' 2
Whereas, in addition, uniquely identifying each partition could yield,
Action Timestamp RowNum PartitionNum
A '2013-1-10' 1 1
A '2013-1-11' 2 1
B '2013-1-12' 1 2
B '2013-1-13' 2 2
Then one could GROUP BY partition number.
Second part of my question is, how can you break out each partition and iterate through it, e.g.,
for each partition p
    for each row r in p
        do F(r)
Any way in T-SQL?

You could use dense_rank():
select *
, row_number() over (partition by Action order by Timestamp) as RowNum
, dense_rank() over (order by Action) as PartitionNum
from YourTable
Example at SQL Fiddle.
T-SQL is not good at iterating, but if you really have to, check out cursors.
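If iteration really is required, the nested loop from the question can be approximated with a single cursor ordered by partition and then by row. The following is only a sketch: the table variable @YourTable, its sample rows, and the PRINT statement standing in for F(r) are assumptions for illustration, not part of the original answer.

```sql
-- Hypothetical sample data matching the question's example.
DECLARE @YourTable TABLE ([Action] CHAR(1), [Timestamp] DATE);
INSERT INTO @YourTable VALUES
    ('A', '20130110'), ('A', '20130111'),
    ('B', '20130112'), ('B', '20130113');

DECLARE @Action CHAR(1), @Ts DATE, @RowNum BIGINT, @PartitionNum BIGINT;

-- One cursor, ordered so that all rows of a partition arrive together.
DECLARE row_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT [Action], [Timestamp],
           ROW_NUMBER() OVER (PARTITION BY [Action] ORDER BY [Timestamp]),
           DENSE_RANK() OVER (ORDER BY [Action])
    FROM @YourTable
    ORDER BY [Action], [Timestamp];

OPEN row_cursor;
FETCH NEXT FROM row_cursor INTO @Action, @Ts, @RowNum, @PartitionNum;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Stand-in for F(r): replace with the real per-row work.
    PRINT 'partition ' + CAST(@PartitionNum AS VARCHAR(20))
        + ', row ' + CAST(@RowNum AS VARCHAR(20))
        + ': ' + @Action + ' ' + CONVERT(CHAR(10), @Ts, 120);

    FETCH NEXT FROM row_cursor INTO @Action, @Ts, @RowNum, @PartitionNum;
END

CLOSE row_cursor;
DEALLOCATE row_cursor;
```

A change in @PartitionNum between two fetches marks the start of a new partition, which is where any per-partition setup would go. A set-based query is usually preferable when F(r) can be expressed in SQL.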

Related

Create 2 table ids, the second restarts when the first increments

I have the following table:
CREATE TABLE [dbo].[CASES]
(
[CASE_ID] INT NOT NULL,
[CASE_SECTION] INT NOT NULL,
[CASE_DATA] NVARCHAR(MAX) NOT NULL,
)
I want CASE_SECTION to increment based on whether the CASE_ID has changed.
Example:
CASE_ID CASE_SECTION CASE_DATA
---------------------------------------------
1 1 'FROG ATE THE FLY'
1 2 'FROG SAT ON LOG'
2 1 'CHEETAH CHEATAXED'
3 1 'BLUE CHEESE STINKS'
Basically, I want to do something similar to using ROW_NUMBER() OVER(PARTITION BY CASE_ID) as the CASE_DATA is inserted into the table.
Is there a way I can set the table up so that CASE_SECTION increments like this by default when data is inserted?
You can use ROW_NUMBER():
Select row_number() over(partition by case_id order by case_data) as case_section, * from yourtable
If you can add either an IDENTITY or InsertedDateTime column to the table, you can order by that column and derive Case_Section with row_number() partitioned by case_id. Note that SQL Server does not allow window functions in a computed column definition, so the expression has to live in a query or a view rather than in the table itself.
Another poster suggested ordering by case_data, but if you order by case_data or any other column that is not guaranteed to increase with each insert, the numbers can move around: a newly inserted row may sort ahead of existing rows and change their positions.
If your computation does row_number() over(partition by case_id order by [ColumnThatIncreasesWithEachInsert]) then the values will be stable as rows are added.
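A minimal sketch of that idea, written as a view because SQL Server rejects window functions in computed-column definitions. The CASE_ROW_ID column and the view name CASES_NUMBERED are hypothetical names introduced here for illustration.

```sql
CREATE TABLE dbo.CASES
(
    CASE_ROW_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- hypothetical insert-order column
    CASE_ID     INT NOT NULL,
    CASE_DATA   NVARCHAR(MAX) NOT NULL
);
GO

-- CASE_SECTION is derived at query time instead of being stored.
CREATE VIEW dbo.CASES_NUMBERED
AS
SELECT CASE_ID,
       ROW_NUMBER() OVER (PARTITION BY CASE_ID ORDER BY CASE_ROW_ID) AS CASE_SECTION,
       CASE_DATA
FROM dbo.CASES;
GO

-- Single-row inserts so the identity order is unambiguous.
INSERT INTO dbo.CASES (CASE_ID, CASE_DATA) VALUES (1, N'FROG ATE THE FLY');
INSERT INTO dbo.CASES (CASE_ID, CASE_DATA) VALUES (1, N'FROG SAT ON LOG');
INSERT INTO dbo.CASES (CASE_ID, CASE_DATA) VALUES (2, N'CHEETAH CHEATAXED');
INSERT INTO dbo.CASES (CASE_ID, CASE_DATA) VALUES (3, N'BLUE CHEESE STINKS');

SELECT CASE_ID, CASE_SECTION, CASE_DATA
FROM dbo.CASES_NUMBERED
ORDER BY CASE_ID, CASE_SECTION;
```

GO is a batch separator understood by tools such as SSMS and sqlcmd; it is needed here because CREATE VIEW must be the first statement in its batch. Keep in mind that deleting a row renumbers the later sections of the same CASE_ID, since the numbers are computed rather than stored.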

min(count(*)) over... behavior?

I'm trying to understand the behavior of
select ..... ,MIN(count(*)) over (partition by hotelid)
VS
select ..... ,count(*) over (partition by hotelid)
Ok.
I have a list of hotels (1,2,3)
Each hotel has departments.
Each department has workers.
My data looks like this:
select * from data
Ok. Looking at this query :
select hotelid,departmentid , cnt= count(*) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
I can perfectly understand what's going on here. On that result set, partitioning by hotelId , we are counting visible rows.
But look what happens with this query :
select hotelid,departmentid , min_cnt = min(count(*)) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
Question:
Where did those numbers come from? I don't understand how adding min caused that result. Min of what?
Can someone please explain how the calculation is made?
The two statements are very different. The first query counts the rows that remain after the grouping and then applies the PARTITION. For example, hotel 1 returns one row (all of its rows belong to the same department, A), so COUNT(*) OVER (PARTITION BY hotelid) returns 1. Hotel 2, however, has two departments, B and C, and so returns 2.
In your second query, the COUNT(*) is not the window function: it is an ordinary aggregate nested inside MIN(...) OVER, so it counts the rows within each group of the GROUP BY in your query: GROUP BY hotelid, departmentid. For hotel 1, there are 4 rows for department A, hence 4. The window MIN then takes the minimum of those per-group counts across the hotel; with a single group, the minimum of 4 is unsurprisingly 4. Each of the other hotels has at least one department with only 1 row, so the minimum returned is 1.
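To make the two-step evaluation concrete, here is a small sketch that shows the plain aggregate and both window expressions side by side. The question's data is not shown, so the rows in @data are assumptions chosen to match the description of hotels 1 and 2; the workerid column is likewise hypothetical.

```sql
DECLARE @data TABLE (hotelid INT, departmentid CHAR(1), workerid INT);
INSERT INTO @data VALUES
    (1, 'A', 1), (1, 'A', 2), (1, 'A', 3), (1, 'A', 4), -- hotel 1: one department, 4 workers
    (2, 'B', 5), (2, 'B', 6),                           -- hotel 2: department B, 2 workers
    (2, 'C', 7);                                        -- hotel 2: department C, 1 worker

SELECT hotelid,
       departmentid,
       COUNT(*)                                   AS rows_in_group,   -- aggregate per (hotel, department)
       COUNT(*) OVER (PARTITION BY hotelid)       AS groups_in_hotel, -- counts the grouped rows per hotel
       MIN(COUNT(*)) OVER (PARTITION BY hotelid)  AS min_rows_in_group
FROM @data
GROUP BY hotelid, departmentid
ORDER BY hotelid, departmentid;

-- hotelid departmentid rows_in_group groups_in_hotel min_rows_in_group
-- 1       A            4             1               4
-- 2       B            2             2               1
-- 2       C            1             2               1
```

The GROUP BY runs first and produces one row per hotel and department; the window functions then operate on those grouped rows, which is why COUNT(*) OVER sees departments while MIN(COUNT(*)) OVER sees the per-department worker counts.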

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example in the following picture:
Inside the UnitID column there are two separate records for 105. I only want the returned data set to contain the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, although I am not that experienced, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example (I can't read your first column name, so I'm calling it JobUnitK):
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use RANK function. Rank the rows OVER PARTITION BY UnitId and pick the rows with rank 2 .
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
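WITH DupeCalc AS (
    SELECT
        -- Number the rows within each UnitID, oldest key first
        DupID = Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID),
        *
    FROM JobUnits
    WHERE DispatchDate = '20151004'
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2 -- every row after the first for each UnitID
ORDER BY UnitID DESC -- ORDER BY belongs on the outer query, not inside the CTE
;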
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate per UnitID, in which case Max finds only the last one; you would instead have to compute Min(JobUnitKeyID) per UnitID and join back on UnitID where JobUnitKeyID <> MinJobUnitKeyID.
Either way, using Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.
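If only the second record of each duplicated UnitID is wanted, as the question literally asks, the same Row_Number technique works with an equality filter. This is a sketch on made-up rows: the table variable @JobUnits and its contents are assumptions, with JobUnitKeyID taken to be the ordering key.

```sql
DECLARE @JobUnits TABLE (JobUnitKeyID INT PRIMARY KEY, UnitID INT);
INSERT INTO @JobUnits VALUES
    (1, 101),                      -- no duplicate
    (2, 105), (3, 105),            -- one duplicate
    (4, 110), (5, 110), (6, 110);  -- two duplicates

WITH Numbered AS (
    SELECT JobUnitKeyID,
           UnitID,
           ROW_NUMBER() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID) AS DupID
    FROM @JobUnits
)
SELECT JobUnitKeyID, UnitID
FROM Numbered
WHERE DupID = 2   -- exactly the second row per UnitID
ORDER BY UnitID;

-- JobUnitKeyID UnitID
-- 3            105
-- 5            110
```

UnitID 101 drops out because it has no second row, and the third row for UnitID 110 is excluded by the equality test; switching back to DupID >= 2 would include it.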

MS SQL Server Algebraic Syntax

I have a table logging a floating point value from a scale (a weight). I'd like to evaluate the absolute value of the integral of this curve dynamically. I'm attempting to perform some simple algebra based on the trapezoidal approx. with a sampling rate (b-a=1) of one:
(b-a)((f(a)+f(b))/2 - f(a))
The values f(a) and f(b) represent the 2 most recent values logged in my SQL Server table. I've attempted the following, which fails with an evaluation error:
SELECT TOP 2
SUM(Scale_Weight) OVER(ORDER BY t_stamp DESC)/2.0
FROM table
This query evaluates, but simply divides the most recent value by 2:
SELECT
SUM(Scale_Weight) OVER(ORDER BY t_stamp DESC)/2.0
FROM table
As you can see, I haven't even attempted the absolute value or the subtraction of the "2nd most recent" value because I didn't know how to reference a specific row (cell?). As a noob, I feel the math is doable in a single query, I just can't find the proper syntax. Thanks in advance.
So to update more clearly:
Thanks for the input ps2goat, though for some reason I'm unable to implement "TOP" function, so I currently have this:
SELECT ABS(SUM(Scale_Weight) OVER(PARTITION BY quality_code
ORDER BY t_stamp
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)/2.0)
FROM table
Still need to subtract the preceding value, something like:
SELECT ABS(SUM(Scale_Weight) OVER(PARTITION BY quality_code
ORDER BY t_stamp
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)/2.0
- 1 PRECEDING)
FROM table
Any ideas to reference the preceding value for subtraction?
On SQL Server 2012 and later, you can use the LAG function to refer to the previous row's value in a given order. For example:
SELECT Scale_Weight AS CurrentWeight, LAG(Scale_Weight) OVER (ORDER BY t_stamp) AS PreviousWeight
FROM table -- "table" stands in for your table name
You can add your formula to this query.
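As a sketch of how the question's formula might be combined with LAG, the following computes ABS((f(a) + f(b)) / 2 - f(a)) for the most recent pair of readings, with b - a = 1. The table variable @x, its identity column used in place of t_stamp, and the sample weights are assumptions for illustration.

```sql
DECLARE @x TABLE (
    xId          INT IDENTITY(1,1) PRIMARY KEY, -- stands in for t_stamp ordering
    scale_weight DECIMAL(12,4)
);
INSERT INTO @x (scale_weight) VALUES (24.1234), (32.4455), (88.1234), (223.443);

WITH Paired AS (
    SELECT xId,
           scale_weight                         AS f_b, -- current reading
           LAG(scale_weight) OVER (ORDER BY xId) AS f_a -- previous reading (NULL on the first row)
    FROM @x
)
SELECT TOP (1)
       ABS((f_a + f_b) / 2.0 - f_a) AS Area
FROM Paired
ORDER BY xId DESC; -- keep only the most recent pair

-- Area is 67.6598 for these sample rows
-- (the number of trailing zeros depends on the decimal scale).
```

Dropping TOP (1) and the ORDER BY returns the value for every consecutive pair, which can then be summed if a running total over the whole curve is wanted.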
This is what I did. Instead of timestamps, I used an Identity field, as those are incremented and easier to enter manually (not sure if you had datetime values or actual timestamp values)
fiddle: http://sqlfiddle.com/#!6/77bcb/4/0
schema:
create table x(
xId int identity(1,1) not null primary key,
scale_weight decimal(12,4)
);
insert into x(scale_weight)
select 24.1234 union all
select 32.4455 union all
select 88.1234 union all
select 223.443;
The inner query (below) grabs the top two rows, ordered by id descending (use your t_stamp column). The outer query sums all the Scale_Weight values returned by the inner query and divides that value by two.
sql:
select SUM(Scale_Weight)/2.0 from
(
SELECT TOP 2 Scale_Weight
FROM x
ORDER BY xid DESC
) y

Finding max value from TOP selection grouped by key in SQL Server

Apologies for goofy title. I am not sure how to describe the problem.
I have a table in SQL Server with this structure;
ID varchar(15)
ProdDate datetime
Value double
For each ID there can be hundreds of rows, each with its own ProdDate. ID and ProdDate form the unique key for the table.
What I need to do is find the maximum Value for each ID based upon the first 12 samples, ordered by ProdDate ascending.
Said another way: for each ID I need to find the 12 earliest dates for that ID (the sampling for each ID will start at different dates) and then find the maximum Value for those 12 samples.
Any idea of how to do this without multiple queries and temporary tables?
You can use a common table expression and ROW_NUMBER to logically define the TOP 12 per Id then MAX ... GROUP BY on that.
;WITH T
AS (SELECT *,
ROW_NUMBER() OVER (PARTITION BY Id ORDER BY ProdDate) AS RN
FROM YourTable)
SELECT Id,
MAX(Value) AS Value
FROM T
WHERE RN <= 12
GROUP BY Id
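The same pattern can be tried on a small data set by shrinking the sample size. In this sketch the table variable @Samples, its rows, and the @N variable are assumptions; @N is set to 3 instead of 12 so the effect is visible with only a few rows.

```sql
DECLARE @Samples TABLE (
    ID       VARCHAR(15),
    ProdDate DATETIME,
    Value    FLOAT,
    PRIMARY KEY (ID, ProdDate)
);
INSERT INTO @Samples VALUES
    ('W1', '20130101', 5.0), ('W1', '20130201', 9.0),
    ('W1', '20130301', 7.0), ('W1', '20130401', 20.0), -- 4th sample, outside the first 3
    ('W2', '20140601', 3.0), ('W2', '20140701', 1.0);  -- fewer than 3 samples

DECLARE @N INT = 3; -- use 12 for the original question

WITH T AS (
    SELECT ID,
           Value,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ProdDate) AS RN
    FROM @Samples
)
SELECT ID,
       MAX(Value) AS MaxValue
FROM T
WHERE RN <= @N
GROUP BY ID
ORDER BY ID;

-- ID  MaxValue
-- W1  9
-- W2  3
```

W1's value of 20 is ignored because it is the fourth sample by date, and W2 is still reported even though it has fewer than @N samples.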
