How can I use the LAG function and return the same value if the subsequent value in a row is duplicated? - sql-server

I am using the LAG function to move my values one row down.
However, I need to reuse the previous value when the value in the source column is duplicated:
ID | SOURCE | LAG | DESIRED OUTCOME
1 | 4 | - | -
2 | 2 | 4 | 4
3 | 3 | 2 | 2
4 | 3 | 3 | 2
5 | 3 | 3 | 2
6 | 1 | 3 | 3
7 | 4 | 1 | 1
8 | 4 | 4 | 1
As you can see, for instance in ID range 3-5 the source data doesn't change and the desired outcome should be fed from the last row with different value (so in this case ID 2).

SQL Server's version of lag accepts an expression as the second argument, which determines how many rows back to look. You can use it to decide, per row, whether to look back at all, e.g.
select lagged = lag(data,iif(decider < 0,0,1)) over (order by id)
from (values(0,1,'dog')
,(1,2,'horse')
,(2,-1,'donkey')
,(3,2,'chicken')
,(4,23,'cow'))f(id,decider,data)
This returns the following list
null
dog
donkey
donkey
chicken
Because the decider value on the row with id of 2 was negative.
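To make the per-row offset easier to follow, here is a minimal Python sketch that emulates the query above. The helper name lag_with_offset is hypothetical, and the sketch assumes the rows are already ordered by id; it is an illustration of the idea, not how SQL Server evaluates lag internally.

```python
def lag_with_offset(values, offsets):
    """Emulate LAG(value, offset) OVER (ORDER BY position) with a per-row offset."""
    result = []
    for i, offset in enumerate(offsets):
        j = i - offset
        # Looking back past the first row yields NULL (None), as LAG does by default.
        result.append(values[j] if 0 <= j < len(values) else None)
    return result


# Same rows as the VALUES list in the query: (id, decider, data).
rows = [(0, 1, "dog"), (1, 2, "horse"), (2, -1, "donkey"), (3, 2, "chicken"), (4, 23, "cow")]
data = [d for _, _, d in rows]
# iif(decider < 0, 0, 1): an offset of 0 means "use the current row".
offsets = [0 if decider < 0 else 1 for _, decider, _ in rows]

lagged = lag_with_offset(data, offsets)
print(lagged)  # [None, 'dog', 'donkey', 'donkey', 'chicken']
```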

Well, first, lag may not be the tool for the job. This might be easier to solve with a recursive CTE. SQL and window functions work over sets. That said, our goal here is to come up with a way of describing what we want. We'd like a way to partition our data so that sequential islands of the same value are part of the same set.
One way we can do that is by using lag to help us discover if the previous row was different or not.
From there, we can take a running sum over these change events to create partitions. Once we have partitions, we can assign a row number to each element in the partition. Finally, we can use the row number to look back that many elements.
;with d as (
select * from (values
(1,4)
,(2,2)
,(3,3)
,(4,3)
,(5,3)
,(6,1)
,(7,4)
,(8,4)
)f(id,source))
select *,lag(source,rn) over (order by Id)
from (
select *,rn=row_number() over (partition by partition_id order by id)
from (
select *, partition_id = sum(change) over (order by id)
from (
select *,change = iif(lag(source) over (order by id) != source,1,0)
from d
) source_with_change
) partitioned
) row_counted
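The same steps (detect a change, group rows into islands, count the position inside the island, look back that many rows) can be sketched procedurally in Python. The function name lag_skipping_duplicates is hypothetical and the input is assumed to be already ordered by id; this is only meant to show why the row number works as a lag offset.

```python
def lag_skipping_duplicates(source):
    """For each row, return the last value that precedes the current run of equal values."""
    result = []
    run_length = 0  # position of the current row inside its island (the SQL's rn)
    previous = None
    for i, value in enumerate(source):
        if i > 0 and value != previous:
            run_length = 0  # a change event starts a new island
        run_length += 1
        j = i - run_length  # looking back rn rows lands just before the island
        result.append(source[j] if j >= 0 else None)
        previous = value
    return result


print(lag_skipping_duplicates([4, 2, 3, 3, 3, 1, 4, 4]))
# [None, 4, 2, 2, 2, 3, 1, 1]
```

This matches the DESIRED OUTCOME column in the question.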
As an aside, this is an absolutely cruel interview question I was once asked to do.

Related

Equivalent of Excel Pivoting in Stata

I have been working with country-level survey data in Stata that I needed to reshape. I ended up exporting the .dta to a .csv and making a pivot table in Excel, but I am curious to know how to do this in Stata, as I couldn't figure it out.
Suppose we have the following data:
country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
I would like the data to be reformatted as such:
country sum_1 sum_2
A 4 4
B 3 2
First I tried a simple reshape wide command but got the error that "values of variable response not unique within country" before realizing reshape without additional steps wouldn't work anyway.
Then I tried generating new variables conditional on the value of response and trying to use reshape following that... the whole thing turned into kind of a mess so I just used Excel.
Just curious if there is a more intuitive way of doing that transformation.
If you just want a table, then just ask for one:
clear
input str1 country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
end
tabulate country response
           |       response
   country |         1          2 |     Total
-----------+----------------------+----------
         A |         4          4 |         8
         B |         3          2 |         5
-----------+----------------------+----------
     Total |         7          6 |        13
If you want the data to be changed to this, reshape is part of the answer, but you should contract first. collapse is in several ways more versatile, but your "sum" is really a count or frequency, so contract is more direct.
contract country response, freq(sum_)
reshape wide sum_, i(country) j(response)
list
     +-------------------------+
     | country   sum_1   sum_2 |
     |-------------------------|
  1. | A             4       4 |
  2. | B             3       2 |
     +-------------------------+
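For readers who do not use Stata, the two steps can be sketched in Python as an illustration of what contract and reshape wide do conceptually. The variable names are hypothetical, and using None for a missing country/response combination is an assumption standing in for a Stata missing value.

```python
from collections import Counter

rows = [
    ("A", 1), ("A", 1), ("A", 2), ("A", 2), ("A", 1),
    ("B", 1), ("B", 2), ("B", 2), ("B", 1), ("B", 1),
    ("A", 2), ("A", 2), ("A", 1),
]

# Step 1 (like contract): one frequency per distinct (country, response) pair.
freq = Counter(rows)

# Step 2 (like reshape wide): one row per country, one sum_<response> column per response.
responses = sorted({response for _, response in rows})
wide = {}
for (country, response), n in freq.items():
    row = wide.setdefault(country, {f"sum_{r}": None for r in responses})
    row[f"sum_{response}"] = n

for country in sorted(wide):
    print(country, wide[country])
# A {'sum_1': 4, 'sum_2': 4}
# B {'sum_1': 3, 'sum_2': 2}
```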
In Stata 16 up, help frames introduces frames as a way to work with multiple datasets in the same session.

Simultaneously Aggregating Overlapping Ranges (A Rectangle Problem)

My Problem
Consider a set of data with two intervals. For instance, consider a student schedule of classes. Each record has a begin and end date, and each class has a period start time and a period end time. But this schedule is not 'normalized' in the sense that some records overlap. So if you search for records encompassing a given date and period for a student, you might get multiple matches.
Here's a contrived example. I represent the dates as integers to simplify the problem:
create table #schedule (
student char(3),
fromDate int,
toDate int,
fromPeriod int,
toPeriod int
)
insert #schedule values
('amy', 1, 7, 7, 9),
('amy', 3, 9, 5, 8),
('amy', 10, 12, 1, 3),
('ted', 1, 5, 11, 14),
('ted', 7, 11, 13, 16);
Amy's date and period ranges either overlap or are adjacent. If I queried for 'date 5 period 7', I would get two matches. I need these reworked so that they represent the same 'area' but no longer overlap.
Ted's periods overlap but his dates do not. This means there's no real overlap, so no need to re-work anything.
My Research
I've read many posts and some articles on working overlapping intervals. Namely:
Merge overlapping date intervals
Flattening intersecting timespans
Condense Time Periods with SQL
SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value
Flatten/merge overlapping time intervals
SQL Server separate overlapping dates
I've implemented one from Itzik from a blog entitled 'solutions-packing-date-and-time-intervals-puzzle' that has worked great for one particular project. I don't think it's a stable link, but I've found a copy of it here.
But I'm having difficulty extending the knowledge in those resources to my problem at hand. It might be my limitation. I have trouble following them. I've studied Itzik's solution and have come to understand a lot of it, but I do recall there's one piece I just couldn't understand. Or it might be that those solutions only work with singular ranges.
My Attempt
I resolved this question by treating the ranges as literal rectangle objects. It works. I've even made a version of it somewhat performant in my own application. So I'll post it as a solution in case it is of use to anyone with the same issue.
But it is so long and involved and there are enough quirks to it (e.g. buffering lines, looping shapes, working with float values, rounding issues) that I can't help but think that there's a much better way. Can the concepts of my listed resources be extended to dual ranges? Or do some SRIDs allow cutting rectangles with zero-length lines?
Expected Results:
There is no one answer to this problem, because you can aggregate ranges and deconstruct them in different ways. But to minimize the number of resulting rectangles, there are really only two acceptable answers. Visually, with dates on the X axis and periods on the Y axis, overlapping ranges can start out like this:
+------------+
|            |
|    +------------+
|    ||||||||     |  <- 2 overlapping rectangles
+----|            |
     |            |
     +------------+
We can rework it this way:
+---+ +-----+
|   | |     |
|   | |     | +---+  <- 3 non-overlapping
|   | |     | |   |     vertically cut rectangles
+---| |     | |   |
      |     | |   |
      +-----+ +---+
Or this way:
+-----------+
+-----------+
+-----------------+  <- 3 non-overlapping
+-----------------+     horizontally cut rectangles
      +-----------+
      +-----------+
Going with vertical cuts, the results would look like this:
+-------------------------------------------+
|student|fromDate|toDate|fromPeriod|toPeriod|
|-------------------------------------------|
|amy |1 |2 |7 |9 |
|amy |3 |7 |5 |9 |
|amy |8 |9 |5 |8 |
|amy |10 |12 |1 |3 |
|ted |1 |5 |11 |14 |
|ted |7 |11 |13 |16 |
+-------------------------------------------+
Going with horizontal cuts, the results would look like this:
+-------------------------------------------+
|student|fromDate|toDate|fromPeriod|toPeriod|
|-------------------------------------------|
|amy |1 |7 |9 |9 |
|amy |1 |9 |7 |8 |
|amy |3 |9 |5 |6 |
|amy |10 |12 |1 |3 |
|ted |1 |5 |11 |14 |
|ted |7 |11 |13 |16 |
+-------------------------------------------+
Either is acceptable. Though, to keep it deterministic and tractable, you would want to choose one strategy and stick with it.
Numbers table:
To address the problem geometrically as I indicate in my post, you have to work with the SQL Server geometry data type. Unfortunately, to get each individual shape or point inside a geometry value, you have to call for the shape by index. A numbers table helps with this. So I do that first (swap this out for your preferred implementation).
create table #numbers (i int);
declare @i int = 1;
while @i <= 100 begin
insert #numbers values (@i);
set @i += 1;
end;
Aggregate the Ranges:
The first required task is to convert the numeric ranges to geometric rectangles. Point creates the corner points. STUnion and STEnvelope serve to turn these into a rectangle. Also, since we want ranges to merge when they are integer-adjacent, we add 1 to the 'to' fields before the geometric conversion.
Then the rectangles must be unioned so that there are no overlaps. This is done by UnionAggregate. The result is a geometry object of rectilinear polygons (boxy shapes).
The geometry object can still contain multiple rectilinear polygons, so these are enumerated and output as individual shapes to #rectilinears.
with
aggregateRectangles as (
select student,
rectilinears = geometry::UnionAggregate(rectangle)
from #schedule s
cross apply (select
minPt = geometry::Point(s.fromDate, s.fromPeriod, 0),
maxPt = geometry::Point(s.toDate + 1, s.toPeriod + 1, 0)
) extremePoints
cross apply (select rectangle = minPt.STUnion(maxPt).STEnvelope()) enveloped
group by student
)
select ar.student,
r.rectilinear,
mm.minY,
mm.maxY
into #rectilinears
from aggregateRectangles ar
join #numbers n on n.i between 1 and ar.rectilinears.STNumGeometries()
cross apply (select rectilinear = ar.rectilinears.STGeometryN(n.i)) r
cross apply (select envelope = r.rectilinear.STEnvelope()) e
cross apply (select
minY = e.envelope.STPointN(1).STY,
maxY = e.envelope.STPointN(3).STY
) mm;
SideNote - Performance Option:
I'm not implementing it here. But if you're working with big-data, and your 'rectilinears' (plural) field above is shared among many groupings (such as many students having the same schedule), then save the Well-known-text version of the rectilinear object (Just do ToString()). After this, create a second dataset with distinct rectilinears and perform the remaining geometric operations on that condensed dataset. Join it back in to the student-level later on. This has significantly improved performance in my real case.
Decompose the Ranges:
Next, those rectilinears have to be decomposed back into rectangles. Splitters are created by creating vertical lines at the x coordinates of each point. The y axis could just as easily have been chosen; I just chose x for my own semantics. Both axes could also have been used, but this would result in more records than necessary.
Unfortunately, SQL Server does not split a shape if the splitter has zero-width (set-theoretically, that's inappropriate, but I imagine that you can't represent the result properly in WKT format). So we need to give the splitters a buffer so that they have an area. There's STBuffer, though I've had trouble with it so I just create one manually.
With this, the rectangles are split. After the split they still all reside in the same geometry object, so they are enumerated and then inserted individually into the #rectangles table.
with
createSplitters as (
select r.student,
rectilinear = geometry::STGeomFromText(r.rectilinear.ToString(), 0),
splitters = geometry::UnionAggregate(sp.splitter)
from #rectilinears r
join #numbers n on n.i between 1 and r.rectilinear.STNumPoints()
cross apply (select
x = r.rectilinear.STPointN(n.i).STX,
buffer = 0.001
) px
cross apply (select splitter =
geometry::Point(x - buffer, minY - buffer, 0).STUnion(
geometry::Point(x + buffer, maxY + buffer, 0)
).STEnvelope()
) sp
group by r.student,
r.rectilinear.ToString()
)
select student,
rectangle = rectangles.STGeometryN(n.i)
into #rectangles
from createSplitters sp
cross apply (select
rectangles = rectilinear.STDifference(sp.splitters)
) r
join #numbers n on n.i between 1 and r.rectangles.STNumGeometries();
Parse the Ranges:
That's the crux of it. What remains is simply to extract the proper values from the rectangles to give the ranges.
To do this, we first invoke STEnvelope to ensure the rectangles are only represented by their corner points. Then we round the corner points to undo the effects of our buffer, and any issues with float representation. We also subtract 1 from the 'to' fields to undo what we did before converting to geometric points.
select student,
fromDate = round(minPt.STX,0),
toDate = round(maxPt.STX,0) - 1,
fromPeriod = round(minPt.STY,0),
toPeriod = round(maxPt.STY,0) - 1
into #normalized
from #rectangles r
cross apply (select
minPt = r.rectangle.STPointN(1),
maxPt = r.rectangle.STPointN(3)
) corners
order by student, fromDate, fromPeriod;
Optional - Visualize Before and After:
I've made it this far, so I might as well give a visual representation of the before and after results. Press the 'Spatial Results' tab in SSMS, choose 'student' as the label column, and toggle between 'unnormalized' and 'normalized' as the spatial column.
The gaps between Amy's rectangles seem like an error at first, but remember that our 'to' fields represent not just the number recorded in them, but the entire fractional part up-to but excluding the next integer number. So, for instance, a toDate of 2 is really a to date of 2.99999 etc.
select student,
unnormalized =
geometry::Point(fromDate, fromPeriod, 0).STUnion(
geometry::Point(toDate, toPeriod, 0)
).STEnvelope(),
normalized = null
from #schedule s
union all
select student,
unnormalized = null,
normalized =
geometry::Point(fromDate, fromPeriod, 0).STUnion(
geometry::Point(toDate, toPeriod, 0)
).STEnvelope()
from #normalized;
That is a very creative solution and an interesting read!
A rather simplistic approach:
with
a as (
select student, fromdate from #schedule union
select student, todate+1 from #schedule
),
b as (
select *,
todate = (
select min(aa.fromdate)
from a as aa
where aa.student = a.student
and aa.fromdate > a.fromdate
) - 1
from a
)
select *
from b
where exists (
select *
from #schedule as s
where s.student = b.student
and s.fromdate <= b.todate
and s.todate >= b.fromdate
);
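The query above only slices the date axis. As a rough illustration of how the same slicing idea could be carried over to the periods as well, here is a Python sketch. It is not the geometry-based method from the other answer: the function names normalize and merge_ranges are hypothetical, ranges are assumed to be inclusive integers, and neighbouring slices that end up with identical period ranges are not merged back together, so other inputs may yield more rows than strictly necessary.

```python
from collections import defaultdict


def merge_ranges(ranges):
    """Merge inclusive integer ranges that overlap or are integer-adjacent."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(r) for r in merged]


def normalize(schedule):
    """Cut each student's rectangles vertically (by date) into non-overlapping rows."""
    by_student = defaultdict(list)
    for student, *rect in schedule:
        by_student[student].append(tuple(rect))
    result = []
    for student in sorted(by_student):
        rects = by_student[student]
        # Every fromDate and every toDate + 1 starts a new vertical slice.
        cuts = sorted({d for f, t, _, _ in rects for d in (f, t + 1)})
        for start, nxt in zip(cuts, cuts[1:]):
            end = nxt - 1
            # Period ranges of all records whose dates cover this slice.
            active = [(fp, tp) for f, t, fp, tp in rects if f <= end and t >= start]
            for fp, tp in merge_ranges(active):
                result.append((student, start, end, fp, tp))
    return result


schedule = [
    ("amy", 1, 7, 7, 9),
    ("amy", 3, 9, 5, 8),
    ("amy", 10, 12, 1, 3),
    ("ted", 1, 5, 11, 14),
    ("ted", 7, 11, 13, 16),
]
for row in normalize(schedule):
    print(row)
```

For the sample data this prints the same six rows as the vertical-cut table in the question.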

How to force Postgresql "cartesian product" behavior when unnest'ing multiple arrays in select?

PostgreSQL behaves strangely when unnesting multiple arrays in the select list:
select unnest('{1,2}'::int[]), unnest('{3,4}'::int[]);
unnest | unnest
--------+--------
1 | 3
2 | 4
vs when arrays are of different lengths:
select unnest('{1,2}'::int[]), unnest('{3,4,5}'::int[]);
unnest | unnest
--------+--------
1 | 3
2 | 4
1 | 5
2 | 3
1 | 4
2 | 5
Is there any way to force the latter behaviour without moving stuff to the from clause?
The SQL is generated by a mapping layer and it will be very much easier for me to implement the new feature I am adding if I can keep everything in the select.
https://www.postgresql.org/docs/10/static/release-10.html
Set-returning functions are now evaluated before evaluation of scalar
expressions in the SELECT list, much as though they had been placed in
a LATERAL FROM-clause item. This allows saner semantics for cases
where multiple set-returning functions are present. If they return
different numbers of rows, the shorter results are extended to match
the longest result by adding nulls. Previously the results were cycled
until they all terminated at the same time, producing a number of rows
equal to the least common multiple of the functions' periods.
(emphasis mine)
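The three behaviours involved here (cycling to the least common multiple before version 10, null padding from version 10 on, and the cartesian product the question asks for) can be contrasted with a small Python sketch. The function names are hypothetical, the lists are assumed to be non-empty, and the sketch only models the row pairing described in the release note, not PostgreSQL's executor.

```python
from itertools import product, zip_longest
from math import gcd


def srf_cycled(a, b):
    """Pre-10 pairing: cycle both lists until they end together (LCM of the lengths)."""
    n = len(a) * len(b) // gcd(len(a), len(b))
    return [(a[i % len(a)], b[i % len(b)]) for i in range(n)]


def srf_padded(a, b):
    """Pairing from version 10 on: the shorter list is padded with NULL (None)."""
    return list(zip_longest(a, b))


def cartesian(a, b):
    """What unnest() in the FROM clause gives: every combination."""
    return list(product(a, b))


print(srf_cycled([1, 2], [3, 4]))     # [(1, 3), (2, 4)]
print(srf_cycled([1, 2], [3, 4, 5]))  # [(1, 3), (2, 4), (1, 5), (2, 3), (1, 4), (2, 5)]
print(srf_padded([1, 2], [3, 4, 5]))  # [(1, 3), (2, 4), (None, 5)]
print(cartesian([1, 2], [3, 4]))      # [(1, 3), (1, 4), (2, 3), (2, 4)]
```

The old behaviour only looked like a cartesian product when the array lengths were coprime, which is why it cannot be relied on.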
I tested with version 12.12 of Postgres and as mentioned by OP in a comment, having two unnest() works as expected when you move those to the FROM clause like so:
SELECT a, b
FROM unnest('{1,2}'::int[]) AS a,
unnest('{3,4}'::int[]) AS b;
Then you get the expected table:
a | b
--+--
1 | 3
1 | 4
2 | 3
2 | 4
As we can see, in this case the arrays have the same size.
In my case, the arrays come from a table. You can name the table first and then reference its columns inside the unnest() calls like so:
SELECT a, b
FROM my_table,
unnest(col1) AS a,
unnest(col2) AS b;
You can, of course, select other columns as required. This is a normal cartesian product.

Is it possible to use json_populate_recordset in a subquery?

I have a json array with a couple of records, all of which have 3 fields lat, lon, v.
I would like to create a select subquery from this array to join with another query. The problem is that I cannot make the example in the PostgreSQL documentation work.
select * from json_populate_recordset(null::x, '[{"a":1,"b":2},{"a":3,"b":4}]')
Should result in:
a | b
---+---
1 | 2
3 | 4
But I only get ERROR: type "x" does not exist Position: 45
It is necessary to pass a composite type to json_populate_recordset whereas a column list is passed to json_to_recordset:
select *
from json_to_recordset('[{"a":1,"b":2},{"a":3,"b":4}]') x (a int, b int)
;
a | b
---+---
1 | 2
3 | 4
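As a loose analogy for the column definition list, the following Python sketch turns the same JSON array into rows using an explicit list of (name, cast) pairs. The helper name to_recordset is hypothetical and this is not how PostgreSQL implements json_to_recordset; it only illustrates that the caller supplies the column names and types rather than a pre-existing composite type.

```python
import json


def to_recordset(text, columns):
    """Project each JSON object onto the given (name, cast) column list."""
    rows = []
    for obj in json.loads(text):
        # A key that is absent or null becomes NULL (None) in the output row.
        rows.append(tuple(
            None if obj.get(name) is None else cast(obj[name])
            for name, cast in columns
        ))
    return rows


rows = to_recordset('[{"a":1,"b":2},{"a":3,"b":4}]', [("a", int), ("b", int)])
print(rows)  # [(1, 2), (3, 4)]
```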

Postgres, find the min value of array column and get its position

Let us assume the following table: test
 ids |  l_ids   |    distance
-----+----------+----------------
  53 | {150,40} | {1.235, 2.265}
  22 | {20,520} | {0.158, 0.568}
The positions of the two arrays (l_ids, distance) are dependent, meaning that
150 corresponds to 1.235,
40 corresponds to 2.265 and so on.
I want to get the min distance and the respective l_id. Therefore the result should be like this:
 ids | l_ids | distance
-----+-------+----------
  53 |   150 |    1.235
  22 |    20 |    0.158
By running this:
select l_ids, min(dist) as min_distance
from test, unnest(test.distance) dist
group by 1;
the result is:
  l_ids   | min_distance
----------+--------------
 {150,40} |        1.235
 {20,520} |        0.158
I want to get the position of the minimum value in the array of distance in order to get the respective id from the array of l_ids.
Any guideline? Thank you in advance
SELECT DISTINCT ON (ctid) l_ids, distance
FROM (SELECT ctid,
unnest(l_ids) AS l_ids,
unnest(distance) AS distance
FROM test
) q
ORDER BY ctid, distance;
l_ids | distance
-------+----------
150 | 1.235
20 | 0.158
(2 rows)
If ids are unique identifiers and you don't have repeated values in the distance array, this should work:
SELECT a.* from
(SELECT ids, UNNEST(l_ids) AS l_ids, UNNEST (distance) AS distance FROM test) a
INNER JOIN (SELECT ids, MIN(distance) as mind FROM (
SELECT ids, UNNEST (distance) AS distance FROM test
) t
GROUP BY ids ) t
ON
a.ids = t.ids and a.distance = t.mind
From 9.4, you can use UNNEST()'s special ROWS FROM()-like syntax:
select distinct on (ids) ids, l_id, dist min_distance
from test, unnest(l_ids, test.distance) d(l_id, dist)
order by ids, dist
Otherwise, this is a typical greatest-n-per-group query.
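The underlying idea in all of these answers is to pair the two arrays position by position and keep the pair with the smallest distance. A minimal Python sketch of that idea follows; the function name nearest is hypothetical and the arrays are assumed to be non-empty and of equal length.

```python
test = [
    (53, [150, 40], [1.235, 2.265]),
    (22, [20, 520], [0.158, 0.568]),
]


def nearest(l_ids, distance):
    """Return (l_id, distance) at the position of the smallest distance."""
    # Find the position of the minimum first, then read both arrays at that position.
    position = min(range(len(distance)), key=distance.__getitem__)
    return l_ids[position], distance[position]


result = [(ids, *nearest(l_ids, distance)) for ids, l_ids, distance in test]
print(result)  # [(53, 150, 1.235), (22, 20, 0.158)]
```

If the minimum distance occurs more than once, this sketch keeps the first position.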
