Postgres, find the min value of array column and get its position - arrays

Let us assume the following table: test
ids | l_ids | distance
-----------------+-----------
53 | {150,40} | {1.235, 2.265}
22 | {20,520} | {0.158, 0.568}
The positions of the two arrays (l_ids, distance) are dependent, meaning that
150 corresponds to 1.235,
40 corresponds to 2.265 and so on.
I want to get the min distance and the respective l_id. Therefore the result should be like this:
ids | l_ids | distance
----------------+-----------
53 | 150 | 1.235
22 | 20 | 0.158
By running this:
select l_ids, min(dist) as min_distance
from test, unnest(test.distance) dist
group by 1;
the result is:
l_ids | dist
------------+-----------
{150,40} | 1.235
{20,520} | 0.158
I want to get the position of the minimum value in the array of distance in order to get the respective id from the array of l_ids.
Any guideline? Thank you in advance

SELECT DISTINCT ON (ctid) l_ids, distance
FROM (SELECT ctid,
unnest(l_ids) AS l_ids,
unnest(distance) AS distance
FROM test
) q
ORDER BY ctid, distance;
l_ids | distance
-------+----------
150 | 1.235
20 | 0.158
(2 rows)

If ids is a unique identifier and the distance array has no repeated values, this should work:
SELECT a.* from
(SELECT ids, UNNEST(l_ids) AS l_ids, UNNEST (distance) AS distance FROM test) a
INNER JOIN (SELECT ids, MIN(distance) as mind FROM (
SELECT ids, UNNEST (distance) AS distance FROM test
) t
GROUP BY ids ) t
ON
a.ids = t.ids and a.distance = t.mind

From 9.4, you can use UNNEST()'s special ROWS FROM()-like syntax:
select distinct on (ids) ids, l_id, dist min_distance
from test, unnest(l_ids, test.distance) d(l_id, dist)
order by ids, dist
Otherwise, this is a typical greatest-n-per-group query.
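For readers who want to see the position-matching logic outside SQL, here is a minimal Python sketch of the same idea: find the position of the smallest distance, then read l_ids at that position. The `rows` list and the function name `min_distance_per_row` are hypothetical stand-ins for the test table, not part of any answer above.

```python
# Hypothetical in-memory copy of the `test` table from the question.
rows = [
    (53, [150, 40], [1.235, 2.265]),
    (22, [20, 520], [0.158, 0.568]),
]

def min_distance_per_row(rows):
    """Return (ids, l_id, distance) taken at the position of the smallest distance."""
    result = []
    for ids, l_ids, distances in rows:
        # Position of the minimum; on ties the first position wins.
        pos = min(range(len(distances)), key=distances.__getitem__)
        result.append((ids, l_ids[pos], distances[pos]))
    return result

print(min_distance_per_row(rows))
```

Unlike the join on MIN(distance), this picks exactly one position per row even when the minimum is repeated.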

Related

Running Delta Issue in Google Data Studio

Data
Datapull | product | value
8/30 | X | 2
8/30 | Y | 3
8/30 | Y | 4
9/30 | X | 5
9/30 | Y | 6
Report
date range dimension: datapull
dimensions: data pull & product
metric: running delta of record count
Chart
For 8/30, the starting totals for product Y are right, but product X shows nothing even though the first row of data has an entry for product X on 8/30.
The variances in 9/30 are wrong too.
Can someone please let me know if there is a way to do running deltas with 2 dimensions? This is not calculating correctly.
Combining a Breakdown dimension with the Running delta calculation breaks the chart!
If you don't mind creating a field for each value of the Breakdown dimension (product), this will work:
case when product="X" then 1 else 0 end
Then put each of these fields into 'Metric'.

Finding the Difference By Row Between 2 Columns that are Both Arrays in SnowSql

I have a dataset that is comprised of a date and two other columns that are in array format. I am trying to find all the values in array_1 that are not in array_2.
Date | Array_1 | Array_2
-------------------------
1/20 | [1,2,3] | [1,2]
2/20 | [4,5,6] | [1,2,4]
Desired Output:
Date | Array_1
--------------
1/20 | [3]
2/20 | [5,6]
The idea is:
Unnest ("flatten") the values into two tables.
Use set functions for the operation you want.
Re-aggregate to an array.
I don't have Snowflake on hand, but I think this is how it works:
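select t.*, x.array_3
from t left join lateral
-- re-aggregate the surviving elements into an array
(select array_agg(el) as array_3
-- flatten exposes each array element in its "value" column
from (select a1.value as el
from table(flatten(input => t.array_1)) a1
except
select a2.value
from table(flatten(input => t.array_2)) a2
) d
) x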
As a reminder, if you were to do this in application code instead, it might be as simple as this (using PHP for the example):
$array1 = array(4,5,6);
$array2 = array(5,6,7);
print_r(array_diff($array1, $array2));
Outputs: Array ( [0] => 4 )
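The flatten / except / re-aggregate steps can also be sketched per row in Python. This is only an illustration using the sample data from the question; `rows` and `array_except` are hypothetical names. Note one difference from SQL EXCEPT: this version keeps the original order and any duplicates in Array_1.

```python
# Sample rows from the question: (date, array_1, array_2).
rows = [
    ("1/20", [1, 2, 3], [1, 2]),
    ("2/20", [4, 5, 6], [1, 2, 4]),
]

def array_except(rows):
    """For each row, keep the values of array_1 that do not appear in array_2."""
    result = []
    for date, array_1, array_2 in rows:
        exclude = set(array_2)  # set lookup instead of scanning array_2 each time
        result.append((date, [v for v in array_1 if v not in exclude]))
    return result

print(array_except(rows))
```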

Simultaneously Aggregating Overlapping Ranges (A Rectangle Problem)

My Problem
Consider a set of data with two intervals. For instance, consider a student schedule of classes. Each record has a begin and end date, and each class has a period start time and a period end time. But this schedule is not 'normalized' in the sense that some records overlap. So if you search for records encompassing a given date and period for a student, you might get multiple matches.
Here's a contrived example. I represent the dates as integers to simplify the problem:
create table #schedule (
student char(3),
fromDate int,
toDate int,
fromPeriod int,
toPeriod int
)
insert #schedule values
('amy', 1, 7, 7, 9),
('amy', 3, 9, 5, 8),
('amy', 10, 12, 1, 3),
('ted', 1, 5, 11, 14),
('ted', 7, 11, 13, 16);
Amy's date and period ranges either overlap or are adjacent. If I queried for 'date 5 period 7', I would get two matches. I need these reworked so that they represent the same 'area' but no longer overlap.
Ted's periods overlap but his dates do not. This means there's no real overlap, so no need to re-work anything.
My Research
I've read many posts and some articles on working with overlapping intervals. Namely:
Merge overlapping date intervals
Flattening intersecting timespans
Condense Time Periods with SQL
SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value
Flatten/merge overlapping time intervals
SQL Server separate overlapping dates
I've implemented one from Itzik from a blog entitled 'solutions-packing-date-and-time-intervals-puzzle' that has worked great for one particular project. I don't think it's a stable link, but I've found a copy of it here.
But I'm having difficulty extending the knowledge in those resources to my problem at hand. It might be my limitation. I have trouble following them. I've studied Itzik's solution and have come to understand a lot of it, but I do recall there's one piece I just couldn't understand. Or it might be that those solutions only work with singular ranges.
My Attempt
I resolved this question by treating the ranges as literal rectangle objects. It works. I've even made a version of it somewhat performant in my own application. So I'll post it as a solution in case it is of use to anyone with the same issue.
But it is so long and involved and there are enough quirks to it (e.g. buffering lines, looping shapes, working with float values, rounding issues) that I can't help but think that there's a much better way. Can the concepts of my listed resources be extended to dual ranges? Or do some SRIDs allow cutting rectangles with zero-length lines?
Expected Results:
There is no one answer to this problem, because you can aggregate ranges and deconstruct them in different ways. But to minimize the number of resulting rectangles, there are really only two acceptable answers. Visually, with dates on the X axis and periods on the Y axis, overlapping ranges can start out like this:
+------------+
|            |
|    +------------+
|    ||||||||     |   <- 2 overlapping rectangles
+----|            |
     |            |
     +------------+
We can rework it this way:
+---+ +-----+
|   | |     |
|   | |     | +---+   <- 3 non-overlapping
|   | |     | |   |      vertically cut rectangles
+---| |     | |   |
      |     | |   |
      +-----+ +---+
Or this way:
+-----------+
+-----------+
+-----------------+   <- 3 non-overlapping
+-----------------+      horizontally cut rectangles
      +-----------+
      +-----------+
Going with vertical cuts, the results would look like this:
+-------------------------------------------+
|student|fromDate|toDate|fromPeriod|toPeriod|
|-------------------------------------------|
|amy |1 |2 |7 |9 |
|amy |3 |7 |5 |9 |
|amy |8 |9 |5 |8 |
|amy |10 |12 |1 |3 |
|ted |1 |5 |11 |14 |
|ted |7 |11 |13 |16 |
+-------------------------------------------+
Going with horizontal cuts, the results would look like this:
+-------------------------------------------+
|student|fromDate|toDate|fromPeriod|toPeriod|
|-------------------------------------------|
|amy |1 |7 |9 |9 |
|amy |1 |9 |7 |8 |
|amy |3 |9 |5 |6 |
|amy |10 |12 |1 |3 |
|ted |1 |5 |11 |14 |
|ted |7 |11 |13 |16 |
+-------------------------------------------+
Either is acceptable. Though, to keep it deterministic and tractable, you would want to choose one strategy and stick with it.
Numbers table:
To address the problem geometrically as I indicate in my post, you have to work with the SQL Server geometry data type. Unfortunately, to get each individual shape or point inside a geometry value, you have to call for the shape by index. A numbers table helps with this. So I do that first (swap this out for your preferred implementation).
create table #numbers (i int);
declare @i int = 1;
while @i <= 100 begin
insert #numbers values (@i);
set @i += 1;
end;
Aggregate the Ranges:
The first required task is to convert the numeric ranges to geometric rectangles. Point creates the corner points. STUnion and STEnvelope serve to turn these into a rectangle. Also, since we want ranges to merge when they are integer-adjacent, we add 1 to the 'to' fields before geometric conversion.
Then the rectangles must be unioned so that there are no overlaps. This is done by UnionAggregate. The result is a geometry object of rectilinear polygons (boxy shapes).
The geometry object can still contain multiple rectilinear polygons, so these are listed and output as individual shapes to #rectilinears.
with
aggregateRectangles as (
select student,
rectilinears = geometry::UnionAggregate(rectangle)
from #schedule s
cross apply (select
minPt = geometry::Point(s.fromDate, s.fromPeriod, 0),
maxPt = geometry::Point(s.toDate + 1, s.toPeriod + 1, 0)
) extremePoints
cross apply (select rectangle = minPt.STUnion(maxPt).STEnvelope()) enveloped
group by student
)
select ar.student,
r.rectilinear,
mm.minY,
mm.maxY
into #rectilinears
from aggregateRectangles ar
join #numbers n on n.i between 1 and ar.rectilinears.STNumGeometries()
cross apply (select rectilinear = ar.rectilinears.STGeometryN(n.i)) r
cross apply (select envelope = r.rectilinear.STEnvelope()) e
cross apply (select
minY = e.envelope.STPointN(1).STY,
maxY = e.envelope.STPointN(3).STY
) mm;
SideNote - Performance Option:
I'm not implementing it here. But if you're working with big-data, and your 'rectilinears' (plural) field above is shared among many groupings (such as many students having the same schedule), then save the Well-known-text version of the rectilinear object (Just do ToString()). After this, create a second dataset with distinct rectilinears and perform the remaining geometric operations on that condensed dataset. Join it back in to the student-level later on. This has significantly improved performance in my real case.
Decompose the Ranges:
Next, those rectilinears have to be decomposed back into rectangles. Splitters are created by creating vertical lines at the x coordinates of each point. The y axis could just as easily have been chosen; I just chose x for my own semantics. Both axes could also have been chosen, but this would result in more records than necessary.
Unfortunately, SQL Server does not split a shape if the splitter has zero-width (set-theoretically, that's inappropriate, but I imagine that you can't represent the result properly in WKT format). So we need to give the splitters a buffer so that they have an area. There's STBuffer, though I've had trouble with it so I just create one manually.
With this, the rectangles are split. When they're split, they still all reside in the same geometry object, so they are enumerated and then inserted individually into the #rectangles table.
with
createSplitters as (
select r.student,
rectilinear = geometry::STGeomFromText(r.rectilinear.ToString(), 0),
splitters = geometry::UnionAggregate(sp.splitter)
from #rectilinears r
join #numbers n on n.i between 1 and r.rectilinear.STNumPoints()
cross apply (select
x = r.rectilinear.STPointN(n.i).STX,
buffer = 0.001
) px
cross apply (select splitter =
geometry::Point(x - buffer, minY - buffer, 0).STUnion(
geometry::Point(x + buffer, maxY + buffer, 0)
).STEnvelope()
) sp
group by r.student,
r.rectilinear.ToString()
)
select student,
rectangle = rectangles.STGeometryN(n.i)
into #rectangles
from createSplitters sp
cross apply (select
rectangles = rectilinear.STDifference(sp.splitters)
) r
join #numbers n on n.i between 1 and r.rectangles.STNumGeometries();
Parse the Ranges:
That's the crux of it. What remains is simply to extract the proper values from the rectangles to give the ranges.
To do this, we first invoke STEnvelope to ensure the rectangles are only represented by their corner points. Then we round the corner points to undo the effects of our buffer, and any issues with float representation. We also subtract 1 from the 'to' fields to undo what we did before converting to geometric points.
select student,
fromDate = round(minPt.STX,0),
toDate = round(maxPt.STX,0) - 1,
fromPeriod = round(minPt.STY,0),
toPeriod = round(maxPt.STY,0) - 1
into #normalized
from #rectangles r
cross apply (select
minPt = r.rectangle.STPointN(1),
maxPt = r.rectangle.STPointN(3)
) corners
order by student, fromDate, fromPeriod;
Optional - Visualize Before and After:
I've made it this far, so I might as well give a visual representation of the before and after results. Press the 'Spatial Results' tab in SSMS, choose 'student' as the label column, and toggle between 'unnormalized' and 'normalized' as the spatial column.
The gaps between Amy's rectangles seem like an error at first, but remember that our 'to' fields represent not just the number recorded in them, but the entire fractional part up-to but excluding the next integer number. So, for instance, a toDate of 2 is really a to date of 2.99999 etc.
select student,
unnormalized =
geometry::Point(fromDate, fromPeriod, 0).STUnion(
geometry::Point(toDate, toPeriod, 0)
).STEnvelope(),
normalized = null
from #schedule s
union all
select student,
unnormalized = null,
normalized =
geometry::Point(fromDate, fromPeriod, 0).STUnion(
geometry::Point(toDate, toPeriod, 0)
).STEnvelope()
from #normalized;
that is a very creative solution and an interesting read!!
A rather simplistic approach:
with
a as (
select student, fromdate from #schedule union
select student, todate+1 from #schedule
),
b as (
select *,
todate = (
select min(aa.fromdate)
from a as aa
where aa.student = a.student
and aa.fromdate > a.fromdate
) - 1
from a
)
select *
from b
where exists (
select *
from #schedule as s
where s.student = b.student
and s.fromdate <= b.todate
and s.todate >= b.fromdate
);
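The query above only slices the date axis. As a rough illustration of how the same breakpoint idea could be carried over to both axes, here is a Python sketch of the vertical-cut strategy: cut the dates at every fromDate and toDate + 1, then merge the period ranges of the records covering each slab. The names `schedule`, `merge_ranges` and `vertical_cuts` are hypothetical, and this is not the geometry-based method from the post. It also does not merge neighbouring slabs that end up with identical periods, so it is not guaranteed to return the minimum number of rectangles.

```python
from collections import defaultdict

# Sample data from the question: (student, fromDate, toDate, fromPeriod, toPeriod).
schedule = [
    ("amy", 1, 7, 7, 9),
    ("amy", 3, 9, 5, 8),
    ("amy", 10, 12, 1, 3),
    ("ted", 1, 5, 11, 14),
    ("ted", 7, 11, 13, 16),
]

def merge_ranges(ranges):
    """Merge inclusive integer ranges that overlap or are adjacent."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(r) for r in merged]

def vertical_cuts(schedule):
    """Split each student's rectangles into non-overlapping, vertically cut ones."""
    by_student = defaultdict(list)
    for student, *rect in schedule:
        by_student[student].append(rect)

    result = []
    for student in sorted(by_student):
        rects = by_student[student]
        # Breakpoints on the date axis: every fromDate and every toDate + 1.
        cuts = sorted({fd for fd, _, _, _ in rects} | {td + 1 for _, td, _, _ in rects})
        for start, nxt in zip(cuts, cuts[1:]):
            end = nxt - 1
            # A record that touches a slab always covers it entirely.
            periods = [(fp, tp) for fd, td, fp, tp in rects if fd <= start and td >= end]
            for fp, tp in merge_ranges(periods):
                result.append((student, start, end, fp, tp))
    return result

for row in vertical_cuts(schedule):
    print(row)
```

On the sample data this reproduces the 'vertical cuts' table from the question.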

How can I use the LAG function and return the same value if the subsequent value in a row is duplicated?

I am using the LAG function to move my values one row down.
However, I need to reuse the previous value when the value in the source column is duplicated:
ID | SOURCE | LAG | DESIRED OUTCOME
1 | 4 | - | -
2 | 2 | 4 | 4
3 | 3 | 2 | 2
4 | 3 | 3 | 2
5 | 3 | 3 | 2
6 | 1 | 3 | 3
7 | 4 | 1 | 1
8 | 4 | 4 | 1
As you can see, for instance in ID range 3-5 the source data doesn't change and the desired outcome should be fed from the last row with different value (so in this case ID 2).
SQL Server's version of lag supports an expression in the second argument to determine how many rows back to look. You can replace this with some sort of check to not look back, e.g.
select lagged = lag(data,iif(decider < 0,0,1)) over (order by id)
from (values(0,1,'dog')
,(1,2,'horse')
,(2,-1,'donkey')
,(3,2,'chicken')
,(4,23,'cow'))f(id,decider,data)
This returns the following list
null
dog
donkey
donkey
chicken
Because the decider value on the row with id of 2 was negative.
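To make the per-row offset explicit, here is a small Python sketch of what that query computes. It assumes the rows are already ordered by id; `rows` and `lag_with_offset` are hypothetical names used only for this illustration.

```python
# Same sample rows as the query: (id, decider, data), already ordered by id.
rows = [
    (0, 1, "dog"),
    (1, 2, "horse"),
    (2, -1, "donkey"),
    (3, 2, "chicken"),
    (4, 23, "cow"),
]

def lag_with_offset(rows):
    """Look back 0 rows when decider is negative, otherwise 1 row."""
    data = [d for _, _, d in rows]
    lagged = []
    for i, (_, decider, _) in enumerate(rows):
        offset = 0 if decider < 0 else 1
        j = i - offset
        # Looking back past the first row yields None, like NULL in SQL.
        lagged.append(data[j] if j >= 0 else None)
    return lagged

print(lag_with_offset(rows))
```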
Well, first, lag may not be the tool for the job. This might be easier to solve with a recursive CTE. SQL and window functions work over sets. That said, our goal here is to come up with a way of describing what we want. We'd like a way to partition our data so that sequential islands of the same value are part of the same set.
One way we can do that is by using lag to help us discover if the previous row was different or not.
From there, we can take a running sum over these change events to create partitions. Once we have partitions, we can assign a row number to each element in the partition. Finally, we can use that row number to look back that many elements.
;with d as (
select * from (values
(1,4)
,(2,2)
,(3,3)
,(4,3)
,(5,3)
,(6,1)
,(7,4)
,(8,4)
)f(id,source))
select *,lag(source,rn) over (order by Id)
from (
select *,rn=row_number() over (partition by partition_id order by id)
from (
select *, partition_id = sum(change) over (order by id)
from (
select *,change = iif(lag(source) over (order by id) != source,1,0)
from d
) source_with_change
) partitioned
) row_counted
As an aside, this is an absolutely cruel interview question I was asked to do once.
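For comparison, the desired outcome is easy to state procedurally: carry forward the last value that differed from the current run. A minimal Python sketch, assuming the rows are ordered by id (the function name `lag_skip_duplicates` is hypothetical):

```python
def lag_skip_duplicates(values):
    """For each position, return the most recent earlier value that differs from the current run."""
    result = []
    prev_distinct = None  # None plays the role of NULL for the first run
    for i, value in enumerate(values):
        # A new run starts whenever the value changes; remember what came just before it.
        if i > 0 and values[i - 1] != value:
            prev_distinct = values[i - 1]
        result.append(prev_distinct)
    return result

# SOURCE column from the question, ordered by ID.
source = [4, 2, 3, 3, 3, 1, 4, 4]
print(lag_skip_duplicates(source))
```

This matches the DESIRED OUTCOME column in the question, with None standing in for the dash on the first row.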

Is it possible to use json_populate_recordset in a subquery?

I have a json array with a couple of records, all of which have 3 fields lat, lon, v.
I would like to create a select subquery from this array to join with another query. The problem is that I cannot make the example in the PostgreSQL documentation work.
select * from json_populate_recordset(null::x, '[{"a":1,"b":2},{"a":3,"b":4}]')
Should result in:
a | b
---+---
1 | 2
3 | 4
But I only get ERROR: type "x" does not exist Position: 45
It is necessary to pass a composite type to json_populate_recordset whereas a column list is passed to json_to_recordset:
select *
from json_to_recordset('[{"a":1,"b":2},{"a":3,"b":4}]') x (a int, b int)
;
a | b
---+---
1 | 2
3 | 4
