How to force PostgreSQL "cartesian product" behavior when unnest'ing multiple arrays in select?

PostgreSQL behaves strangely when unnesting multiple arrays in the select list. When the arrays are of the same length:
select unnest('{1,2}'::int[]), unnest('{3,4}'::int[]);
unnest | unnest
--------+--------
1 | 3
2 | 4
vs when arrays are of different lengths:
select unnest('{1,2}'::int[]), unnest('{3,4,5}'::int[]);
unnest | unnest
--------+--------
1 | 3
2 | 4
1 | 5
2 | 3
1 | 4
2 | 5
Is there any way to force the latter behaviour without moving stuff to the from clause?
The SQL is generated by a mapping layer and it will be very much easier for me to implement the new feature I am adding if I can keep everything in the select.

https://www.postgresql.org/docs/10/static/release-10.html
Set-returning functions are now evaluated before evaluation of scalar
expressions in the SELECT list, much as though they had been placed in
a LATERAL FROM-clause item. This allows saner semantics for cases
where multiple set-returning functions are present. If they return
different numbers of rows, the shorter results are extended to match
the longest result by adding nulls. Previously the results were cycled
until they all terminated at the same time, producing a number of rows
equal to the least common multiple of the functions' periods.
(emphasis mine)
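The two behaviours described in the release note can be sketched outside the database. This is only an illustrative model in Python, not how PostgreSQL implements it; the function names srf_pre_10 and srf_10_plus are made up for this example, and math.lcm assumes Python 3.9 or later.

```python
from itertools import zip_longest
from math import lcm

def srf_pre_10(*arrays):
    """Model of the old behaviour: cycle each array until all end together."""
    n = lcm(*(len(a) for a in arrays))  # row count = least common multiple
    return [tuple(a[i % len(a)] for a in arrays) for i in range(n)]

def srf_10_plus(*arrays):
    """Model of the new behaviour: pad shorter arrays with None (SQL NULL)."""
    return list(zip_longest(*arrays))

old_rows = srf_pre_10([1, 2], [3, 4, 5])   # 6 rows, as in the question
new_rows = srf_10_plus([1, 2], [3, 4, 5])  # 3 rows, last one padded
```

Under this model, arrays of equal length give the same rows either way, which is why the first query in the question shows no cartesian product.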

I tested with version 12.12 of Postgres and as mentioned by OP in a comment, having two unnest() works as expected when you move those to the FROM clause like so:
SELECT a, b
FROM unnest('{1,2}'::int[]) AS a,
unnest('{3,4}'::int[]) AS b;
Then you get the expected table:
a | b
--+--
1 | 3
1 | 4
2 | 3
2 | 4
As we can see, this returns the full cartesian product even though the arrays have the same size.
In my case, the arrays come from a table. You can name the table first and then name the columns inside the unnest() calls like so:
SELECT a, b
FROM my_table,
unnest(col1) AS a,
unnest(col2) AS b;
You can, of course, select other columns as required. This is a normal cartesian product.
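To make the per-row cartesian product concrete, here is a rough Python model of that FROM clause. The contents of my_table are invented for illustration, and the row order shown is just the loop order; SQL guarantees no order without ORDER BY.

```python
from itertools import product

# Hypothetical stand-in for my_table: each row holds two array columns.
my_table = [
    {"col1": [1, 2], "col2": [3, 4]},
    {"col1": [5], "col2": [6, 7]},
]

def cross_unnest(rows):
    """Model of: FROM my_table, unnest(col1) AS a, unnest(col2) AS b."""
    return [
        (a, b)
        for row in rows
        for a, b in product(row["col1"], row["col2"])
    ]

pairs = cross_unnest(my_table)
```

A row whose col1 or col2 is empty contributes no pairs in this model, which mirrors what a cross join against an empty set does.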


Excel: Problems w. INDIRECT, Arrays, and Aggregate Functions (SUM, MAX, etc.)

Objective
I have a Microsoft Excel spreadsheet containing a price list that may change over time (B2:B6 in the example). Separately, I have a budget that may also change over time (D2). I am attempting to construct a formula for E2 that outputs the number of items that can be purchased with the budget in D2. Thereafter, I'll attempt to construct formulas to output any change that would be left over (F2) and a comma-delimited list of purchasable items (G2).
Note: It unfortunately isn't possible to add an intermediate calculation column to the list, such as a running total. As such, I'm trying for formulas for single cells (i.e., E2, F2, and G2).
Note: I'm using Excel for Mac 2019.
A B C D E F G
+---------+---------+-----+---------+-------+---------+---------------------------+
1 | Label | Price | | Budget | Items | Change | Item(s) |
+---------+---------+-----+---------+-------+---------+---------------------------+
2 | Item #1 | $ 10.00 | | $ 40.00 | 3 | $ 4.50 | Item #1, Item #2, Item #3 |
+---------+---------+-----+---------+-------+---------+---------------------------+
3 | Item #2 | $ 20.00 | | | | | |
+---------+---------+-----+---------+-------+---------+---------------------------+
4 | Item #3 | $ 5.50 | | | | | |
+---------+---------+-----+---------+-------+---------+---------------------------+
5 | Item #4 | $ 25.00 | | | | | |
+---------+---------+-----+---------+-------+---------+---------------------------+
6 | Item #5 | $ 12.50 | | | | | |
+---------+---------+-----+---------+-------+---------+---------------------------+
For E2, I've attempted:
{=MAX(N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1)}
Though, the above values and this formula result in an output of -1.
Note: The formula for F2 and G2 seemingly easily follow E2; e.g. {=$D2-SUM(IF((ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1)<=$E2,$B$2:$B$6,0))} and {=TEXTJOIN(", ",TRUE,INDIRECT("$A$2:$A$"&(MIN(ROW($B$2:$B$6))+$E2-1)))} seem to work well, respectively.
Observations
{="$B$2:$B$"&ROW($B$2:$B$6)} evaluates to {"$B$2:$B$2";"$B$2:$B$3";...;"$B$2:$B$6"} (as desired);
{=INDIRECT("$B$2:$B$"&ROW($B$2:$B$6))} should evaluate to the equivalent of {{$B$2:$B$2},{$B$2:$B$3},...,{$B$2:$B$6}}; though, as a 1x5 multi-cell array formula, it evaluates to the equivalent of {#VALUE!,#VALUE!,#VALUE!,#VALUE!,#VALUE!} and, with F9, to {10;#N/A;#N/A;#N/A;12.5};
{=SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2}, as a 1x5 multi-cell array formula, evaluates to the equivalent of {TRUE;TRUE;TRUE;FALSE;FALSE} (as desired); though, with F9, it evaluates to #VALUE!;
{=N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)}, as a 1x5 multi-cell array formula, evaluates to the equivalent of {1;1;1;0;0} (as desired); though, with F9, it again evaluates to #VALUE!;
{=N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)}, as a 1x5 multi-cell array formula, evaluates to the equivalent of {2,3,4,0,0} (as desired); though, with F9, it evaluates to {#VALUE!,#VALUE!,#VALUE!,#VALUE!,#VALUE!};
{=N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1}, as a 1x5 multi-cell array formula, evaluates to the equivalent of {1,2,3,-1,-1} (as desired); though, with F9, it again evaluates to {#VALUE!,#VALUE!,#VALUE!,#VALUE!,#VALUE!}; and,
{=MAX(N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1)} evaluates to -1
Interestingly:
If {=N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1} is placed as the multi-cell array formula in, say, E10:E14, a =MAX($E$10:$E$14) results in 3 (as desired).
Speculation
At present, I'm speculating that, when entered as a single cell array formula, the INDIRECT is not being assessed to be array producing and/or the SUM, as part of a single cell array formula, is not producing an array result.
Please assist. And, thank you in advance.
Solutions (Thanks to Contributors Below)
For E2, {=IF($B$2<=$D2,MATCH(1,0/(MMULT(N(ROW($B$2:$B$6)>=TRANSPOSE(ROW($B$2:$B$6))),$B$2:$B$6)<=$D2)),0)} (thank you Jos Woolley);
For F2, =IF($E2=0,MAX(0,$D2),$D2-SUM($B$2:INDEX($B$2:$B$6,$E2))) (thank you P.b); and,
For G2, =IF($E2=0,"",TEXTJOIN(", ",TRUE,$A$2:INDEX($A$2:$A$6,$E2))) (thank you P.b).
The first point to make, as I mentioned in the comments, is that it must be understood that piecemeal evaluation of a formula - via highlighting subsections of that formula and committing with F9 within the formula bar - will not necessarily correspond to the actual evaluation.
Evaluation via F9 in the formula bar always forces that part to be evaluated as an array. This can be misleading, since the overall construction may not actually evaluate that part as an array.
The second point to make is that SUM cannot iterate over an array of ranges, whereas SUBTOTAL, for example, can, so replacing SUM( with SUBTOTAL(9, in your current formula should work.
However, you would still be left with a construction which is volatile, so I would recommend this non-volatile alternative:
=MATCH(1,0/(MMULT(N(ROW(B2:B6)>=TRANSPOSE(ROW(B2:B6))),B2:B6)<=D2))
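For anyone puzzled by how the MMULT construction works, the following Python sketch walks through the same steps with plain lists. It is only a model of the formula's logic, not of Excel's evaluation; the function name affordable_count is made up, and None stands in for the #N/A that MATCH returns when nothing fits the budget.

```python
def affordable_count(prices, budget):
    n = len(prices)
    # N(ROW(...) >= TRANSPOSE(ROW(...))): 1 where row index >= column index,
    # i.e. a lower-triangular matrix of ones.
    lower = [[1 if i >= j else 0 for j in range(n)] for i in range(n)]
    # MMULT(lower, prices): entry i is the running total prices[0] + ... + prices[i].
    running = [sum(l * p for l, p in zip(row, prices)) for row in lower]
    # MATCH(1, 0/(running <= budget)): position of the last total within budget.
    within = [i + 1 for i, total in enumerate(running) if total <= budget]
    return within[-1] if within else None

prices = [10.0, 20.0, 5.5, 25.0, 12.5]  # B2:B6 from the question
count = affordable_count(prices, 40.0)   # D2 = 40
```

The None case corresponds to a budget smaller than the first price, which is presumably why the version in the question wraps the formula in IF($B$2<=$D2, ..., 0).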
In E2 you can use:
=MATCH(TRUE,--SUBTOTAL(9,OFFSET(B2:B6,,,ROW(B2:B6)))>=D2,0)
In F2 you can use:
=D2-SUM(B2:INDEX(B2:B6,E2))
In G2 you can use:
=TEXTJOIN(", ",1,A2:INDEX(A2:A6,E2))
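As a cross-check of the F2 and G2 formulas, here is the same logic as a small Python sketch using the sample data from the question. The function name change_and_items is invented for illustration.

```python
def change_and_items(labels, prices, budget, count):
    """Return (change, comma-separated labels) for the first `count` items."""
    change = budget - sum(prices[:count])  # D2 - SUM(B2:INDEX(B2:B6, E2))
    items = ", ".join(labels[:count])      # TEXTJOIN(", ", 1, A2:INDEX(A2:A6, E2))
    return change, items

labels = ["Item #1", "Item #2", "Item #3", "Item #4", "Item #5"]
prices = [10.0, 20.0, 5.5, 25.0, 12.5]
change, items = change_and_items(labels, prices, 40.0, 3)
```

One difference worth noting: a Python slice with count 0 is simply empty, whereas INDEX(B2:B6, 0) in Excel refers to the whole column, so the Excel formulas need a separate guard for E2 = 0.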

Equivalent of Excel Pivoting in Stata

I have been working with country-level survey data in Stata that I needed to reshape. I ended up exporting the .dta to a .csv and making a pivot table in Excel, but I am curious to know how to do this in Stata, as I couldn't figure it out.
Suppose we have the following data:
country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
I would like the data to be reformatted as such:
country sum_1 sum_2
A 4 4
B 3 2
First I tried a simple reshape wide command but got the error that "values of variable response not unique within country" before realizing reshape without additional steps wouldn't work anyway.
Then I tried generating new variables conditional on the value of response and trying to use reshape following that... the whole thing turned into kind of a mess so I just used Excel.
Just curious if there is a more intuitive way of doing that transformation.
If you just want a table, then just ask for one:
clear
input str1 country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
end
tabulate country response
| response
country | 1 2 | Total
-----------+----------------------+----------
A | 4 4 | 8
B | 3 2 | 5
-----------+----------------------+----------
Total | 7 6 | 13
If you want the data to be changed to this, reshape is part of the answer, but you should contract first. collapse is in several ways more versatile, but your "sum" is really a count or frequency, so contract is more direct.
contract country response, freq(sum_)
reshape wide sum_, i(country) j(response)
list
+-------------------------+
| country sum_1 sum_2 |
|-------------------------|
1. | A 4 4 |
2. | B 3 2 |
+-------------------------+
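For readers who think in terms of other tools, the two steps can be mimicked in a few lines of Python. This is only an analogy for what contract and reshape wide do to the example data, not a description of Stata internals.

```python
from collections import Counter

# The example data as (country, response) pairs.
rows = [
    ("A", 1), ("A", 1), ("A", 2), ("A", 2), ("A", 1),
    ("B", 1), ("B", 2), ("B", 2), ("B", 1), ("B", 1),
    ("A", 2), ("A", 2), ("A", 1),
]

# contract: one frequency per distinct (country, response) combination.
freq = Counter(rows)

# reshape wide: one record per country, one sum_<response> column per value.
wide = {}
for (country, response), n in sorted(freq.items()):
    wide.setdefault(country, {})[f"sum_{response}"] = n
```

The intermediate freq is the long table that contract leaves behind; wide is what the final list shows.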
In Stata 16 and later, help frames introduces frames as a way to work with multiple datasets in the same session.

make x in a cell equal 8 and total

I need an Excel formula that looks at a cell and, if it contains an x, treats it as an 8 and adds it to the total at the bottom of the table. I have done this in the past, but I am so rusty that I cannot remember how I did it.
Generally, I try and break this sort of problem into steps. In this case, that'd be:
Determine if a cell is 'x' or not, and create new value accordingly.
Add up the new values.
If your values are in column A (for example), in column B, fill in:
=if(A1="x", 8, 0) (or in R1C1 mode, =if(RC[-1]="x", 8, 0)).
Then just sum those values (eg sum(B1:B3)) for your total.
A | B
+---------+---------+
| VALUES | TEMP |
+---------+---------+
| 0 | 0 <------ '=if(A1="x", 8, 0)'
| x | 8 |
| fish | 0 |
+---------+---------+
| TOTAL | 8 <------ '=sum(B1:B3)'
+---------+---------+
If you want to be tidy, you could also hide the column with your intermediate values in.
(I should add that the way your question is worded, it almost sounds like you want to 'push' a value into the total; as far as I've ever known, you can really only 'pull' values into a total.)
Try this one for the total sum (numbers are added as they are, and each x counts as 8):
=SUMIF(<range you want to sum>, "<>x", <range you want to sum>) + 8 * COUNTIF(<range you want to sum>, "x")
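The intent of the single-formula approach (add ordinary numbers as they are, count every x as 8, ignore other text) can be spelled out as a short Python sketch. The function name total_with_x is hypothetical, and the case-insensitive match is chosen to mimic how Excel compares text.

```python
def total_with_x(cells, x_value=8):
    """Sum numeric cells and count each 'x' cell as x_value."""
    total = 0
    for cell in cells:
        if isinstance(cell, str):
            if cell.lower() == "x":  # Excel text comparison ignores case
                total += x_value
            # any other text contributes nothing
        elif isinstance(cell, (int, float)) and not isinstance(cell, bool):
            total += cell
    return total

total = total_with_x([0, "x", "fish"])  # the values from the table above
```

Note that the helper-column version with =if(A1="x", 8, 0) differs slightly: it turns every non-x cell into 0, so plain numbers are not carried into the total.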

How can I use the LAG function and return the same value if the subsequent value in a row is duplicated?

I am using the LAG function to move my values one row down.
However, I need to reuse the previous row's result when the value in the source column is duplicated:
ID | SOURCE | LAG | DESIRED OUTCOME
1 | 4 | - | -
2 | 2 | 4 | 4
3 | 3 | 2 | 2
4 | 3 | 3 | 2
5 | 3 | 3 | 2
6 | 1 | 3 | 3
7 | 4 | 1 | 1
8 | 4 | 4 | 1
As you can see, for instance in ID range 3-5 the source data doesn't change and the desired outcome should be fed from the last row with different value (so in this case ID 2).
SQL Server's version of lag accepts an expression as its second argument (the offset), which determines how many rows back to look. You can replace the offset with a check that decides whether to look back at all, e.g.
select lagged = lag(data,iif(decider < 0,0,1)) over (order by id)
from (values(0,1,'dog')
,(1,2,'horse')
,(2,-1,'donkey')
,(3,2,'chicken')
,(4,23,'cow'))f(id,decider,data)
This returns the following list
null
dog
donkey
donkey
chicken
Because the decider value on the row with id of 2 was negative.
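To see why the third row returns its own value, it may help to model lag with a per-row offset in plain Python. This is a simplified sketch that assumes non-negative offsets and a single ordering; the function name lag_with_offsets is made up.

```python
def lag_with_offsets(values, offsets):
    """For each row i, return values[i - offsets[i]], or None if out of range."""
    out = []
    for i, k in enumerate(offsets):
        j = i - k
        out.append(values[j] if j >= 0 else None)
    return out

data = ["dog", "horse", "donkey", "chicken", "cow"]
decider = [1, 2, -1, 2, 23]

# iif(decider < 0, 0, 1): offset 0 means "return the current row".
lagged = lag_with_offsets(data, [0 if d < 0 else 1 for d in decider])
```

An offset of 0 makes the row look at itself, which is what produces the repeated donkey.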
Well, first, lag may not be the tool for the job; this might be easier to solve with a recursive CTE. SQL and window functions work over sets. That said, our goal here is to come up with a way of describing what we want: a way to partition our data so that sequential islands of the same value are part of the same set.
One way we can do that is by using lag to help us discover if the previous row was different or not.
From there, we can take a running sum over these change events to create partitions. Once we have partitions, we can assign a row number to each element in the partition. Finally, we can use that row number as the lag offset to look back that many elements.
;with d as (
select * from (values
(1,4)
,(2,2)
,(3,3)
,(4,3)
,(5,3)
,(6,1)
,(7,4)
,(8,4)
)f(id,source))
select *,lag(source,rn) over (order by Id)
from (
select *,rn=row_number() over (partition by partition_id order by id)
from (
select *, partition_id = sum(change) over (order by id)
from (
select *,change = iif(lag(source) over (order by id) != source,1,0)
from d
) source_with_change
) partitioned
) row_counted
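The same idea can be checked procedurally. The Python sketch below tracks the position of each row inside its run of equal values (the rn column of the query) and looks back that many rows; it is just a way to verify the logic against the desired outcome, and the function name is invented.

```python
def lag_skipping_duplicates(source):
    """Return, for each row, the value just before its run of equal values."""
    result = []
    run_length = 0  # plays the role of rn in the query
    for i, value in enumerate(source):
        if i > 0 and value == source[i - 1]:
            run_length += 1  # still inside the same island
        else:
            run_length = 1   # a new island starts here
        j = i - run_length   # lag(source, rn)
        result.append(source[j] if j >= 0 else None)
    return result

desired = lag_skipping_duplicates([4, 2, 3, 3, 3, 1, 4, 4])
```

With the sample data this gives None, 4, 2, 2, 2, 3, 1, 1, matching the DESIRED OUTCOME column in the question.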
As an aside, this is an absolutely cruel interview question I was asked to do once.

Is it possible to use json_populate_recordset in a subquery?

I have a json array with a couple of records, all of which have 3 fields lat, lon, v.
I would like to create a select subquery from this array to join with another query. The problem is that I cannot make the example in the PostgreSQL documentation work.
select * from json_populate_recordset(null::x, '[{"a":1,"b":2},{"a":3,"b":4}]')
Should result in:
a | b
---+---
1 | 2
3 | 4
But I only get ERROR: type "x" does not exist Position: 45
json_populate_recordset requires an existing composite type as its first argument (here x, which has not been created), whereas json_to_recordset takes a column definition list instead:
select *
from json_to_recordset('[{"a":1,"b":2},{"a":3,"b":4}]') x (a int, b int)
;
a | b
---+---
1 | 2
3 | 4
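Conceptually, the column definition list tells the function which keys to pull out of each JSON object and what type to give them. A loose Python analogy follows; the helper to_recordset and its converter mapping are invented for illustration and are not part of any PostgreSQL client API.

```python
import json

def to_recordset(text, columns):
    """columns maps column name -> converter, like x (a int, b int)."""
    rows = []
    for obj in json.loads(text):
        row = []
        for name, cast in columns.items():
            value = obj.get(name)  # a missing key becomes None (NULL)
            row.append(None if value is None else cast(value))
        rows.append(tuple(row))
    return rows

records = to_recordset('[{"a":1,"b":2},{"a":3,"b":4}]', {"a": int, "b": int})
```

For the array in the question, the column list would name lat, lon and v with suitable types, and the result can then be joined like any other subquery.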
