PostgreSQL slow array assignment

I am working on a stored procedure / function in postgresql. I have a table called sports, and I have an array of sports%ROWTYPE:
records = "sports"[10000];
I also have a loop (10 000 iterations), where I create a record and assign it to the records array:
for idx in ...
loop
    record.x = something;
    record.y = something_else;
    records[idx] = record;
end loop;
For some reason, the statement records[idx] = record; takes more than 20 seconds to execute (over the 10 000 iterations).
I don't know why this assignment is taking so long.
Edit
I have a stored procedure with 3 parameters:
x text;
y integer[][];
z integer[];
My goal is to store these data in a table like below:
x | y[0] | z[0]
x | y[1] | z[1]
x | y[..] | z[..]
x | y[n] | z[n]
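A side note on this edited goal: no array of records is needed if the aim is just one output row per array element. Below is a minimal sketch, assuming hypothetical parameter names p_x, p_y, p_z (renamed to avoid clashing with column names) and a hypothetical target table tbl(x text, y_row integer[], z_val integer), where the i-th row of y is paired with z[i]:

-- generate_subscripts(p_y, 1) yields one index i per row of p_y;
-- p_y[i:i] slices out the i-th row and p_z[i] picks the matching element.
INSERT INTO tbl (x, y_row, z_val)
SELECT p_x, p_y[i:i], p_z[i]
FROM generate_subscripts(p_y, 1) AS i;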

PostgreSQL doesn't generally modify arrays in-place. When you modify one it gets copied to make the new one.
You should avoid trying to iteratively build or modify arrays. Instead use sets. In this case I would define a function to build one record, and I'd call it with array_agg to create the array, like
select array_agg(r)
from generate_series(1,10000) i,
     my_function(whatever_input_you_need) r
into records;
Even better, if you can get away without defining the function, use a ROW(...) constructor, like:
select array_agg(ROW(col1, col2, col3))
from generate_series(1,10000) i,
my_function(whatever_input_you_need);
Daniel Vérité noted that this changed for arrays modified in PL/pgSQL in PostgreSQL 9.5, so it should be more efficient there. On 9.5 your code might perform OK, and it would be interesting to see the difference.

Related

Use the TQuery.Locate() function to find matches other than the first

Locate moves the cursor to the first row matching a specified set of search criteria.
Let's say that q is a TQuery component connected to a database table with two columns, TAG and TAGTEXT. With the code below I get the letter 'a', but I would like to use the Locate() function to get the letter 'd'.
if q.Locate('TAG', '1', [loPartialKey]) then
begin
  tag60 := q.FieldByName('TAGTEXT').AsString;
end;
For example, given a table like this:
+-----+---------+
| TAG | TAGTEXT |
+-----+---------+
|  1  |    a    |
|  2  |    b    |
|  3  |    c    |
|  1  |    d    |
|  4  |    e    |
|  1  |    f    |
+-----+---------+
Is it possible to locate the second occurrence of the number 1 in the table?
EDIT
My job is to find a given occurrence of the TAG value 1 (which occurrence I need depends on a parameter I receive). I need to iterate through the table and collect the values of all the TAGTEXT fields until the value in the TAG field is 1 again. The number 1 here marks the start of a new segment, and everything between two 1s belongs to one segment. Segments do not all have the same number of rows. Also, I am not allowed to make any changes to the table.
What I thought I could do is create a counter variable that is incremented every time a TAG value of 1 comes up. When the counter equals the parameter that identifies the occurrence, I know I am in the right segment, and I can iterate through that segment to get the values I need.
But this might be a slow solution, and I wanted to know if there is a faster one.
You need to be a bit wary of using Locate for a purpose like this, because some TDataSet descendants' implementations of Locate (or the underlying db-access layer) construct a temporary index on the dataset, which may be discarded immediately afterwards, so repeatedly calling Locate to iterate the rows of a given segment can be a lot less efficient than one might expect.
Also, TClientDataSet constructs, uses and then discards an expression parser for each invocation of Locate (in its internal call to LocateRecord), which is a lot of overhead for repeated calls, especially when they are entirely avoidable.
In any case, the best way to do this is to ensure that your table records which segment a given row belongs to, adding a column like the SegmentID below if your table does not already have one:
+-----+---------+-----------+
| TAG | TAGTEXT | SegmentID |
+-----+---------+-----------+
|  1  |    a    |     1     |
|  2  |    b    |     1     |
|  3  |    c    |     1     |
|  1  |    d    |     2     |
|  4  |    e    |     2     |
|  1  |    f    |     3     |
+-----+---------+-----------+
(By the way, what happened to the two missing rows after the '1 | d' row?)
Then, you could use code like this to iterate the rows of a segment:
procedure IterateSegment(Query : TSomeTypeOfQueryComponent; SegmentID : Integer);
var
  Sql : String;
begin
  Sql := Format('select * from mytable where SegmentID = %d order by Tag', [SegmentID]);
  if Query.Active then
    Query.Close;
  Query.Sql.Text := Sql;
  Query.Open;
  Query.DisableControls;
  try
    while not Query.Eof do begin
      // process row here
      Query.Next;
    end;
  finally
    Query.EnableControls;
  end;
end;
Once you have the SegmentID column in the table, if you don't want to open a new query to iterate a block, you can set up a local index (by SegmentID, then Tag), assuming your dataset type supports it, set a filter on the dataset to restrict it to a given SegmentID, and then iterate over it.
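If you are not allowed to change the table at all, a query can still derive the segment number on the fly with a window function. A sketch in SQL, assuming the backend supports window functions and that some column (here the hypothetical RowId) preserves the original row order, since the segmentation depends on it:

-- the running count of Tag = 1 rows seen so far is exactly the segment number
select Tag, TagText,
       sum(case when Tag = 1 then 1 else 0 end)
         over (order by RowId) as SegmentID
from mytable;

You could then restrict to the desired SegmentID in an outer query's WHERE clause instead of adding a physical column.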
You have several options for doing this.
If your component doesn't provide a LocateNext, you can write your own LocateNext function: compare the value and call Next until it is found again.
You can also add an ORDER BY to the SQL, use Locate for the first value, and then test whether the next row still matches the comparison.
If you use a TClientDataSet you can filter with the component's Filter property, or set IndexFieldNames to order the values instead of using ORDER BY in the SQL as in the previous suggestion.
You can also filter in the SQL WHERE clause.

SPSS: using IF function with REPEAT when each case has multiple linked instances

I have a dataset as such:
Case # | DateA  | Drug.1 | Drug.2 | Drug.3 | DateB.1  | DateB.2  | DateB.3  | IV.1 | IV.2 | IV.3
-------|--------|--------|--------|--------|----------|----------|----------|------|------|-----
   1   | DateA1 |   X    |   Y    |   X    | DateB1.1 | DateB1.2 | DateB1.3 |  1   |  0   |  1
   2   | DateA2 |   X    |   Y    |   X    | DateB2.1 | DateB2.2 | DateB2.3 |  1   |  0   |  1
   3   | DateA3 |   Y    |   Z    |   X    | DateB3.1 | DateB3.2 | DateB3.3 |  0   |  0   |  1
   4   | DateA4 |   Z    |   Z    |   Z    | DateB4.1 | DateB4.2 | DateB4.3 |  0   |  0   |  0
For each case, there are linked variables: Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable.1); Drug.2 with DateB.2 and IV.2, and so on.
The variable IV.1 equals 1 only if Drug.1 is the drug I want to analyze (in this example, each receipt of Drug "X"), and likewise for the other IV variables. Otherwise, IV = 0 when the drug in that slot is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. In the example above I want to calculate a new variable:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA2_B2.1 = DateA2 - DateB2.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA2_B2.3 = DateA2 - DateB2.3
DateDiffA3_B3.3 = DateA3 - DateB3.3
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT using the IV variables, but I am relatively new to SPSS and its syntax and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually there is no need to add the vector and the loop; all you need can be done within one DO REPEAT:
compute N2W=0.
do repeat DateB = DateB.1 to DateB.3 /IV=IV.1 to IV.3 .
if IV=1 and datediff(DateA, DateB, "days")<=14 N2W = N2W + 1.
end repeat.
execute.
This syntax first puts a zero in the count variable N2W. It then loops through all the dates; only when the matching IV is 1 does it compare the date to DateA, adding 1 to the count if the difference is <= 2 weeks.
If you prefer to keep the count variable missing when none of the IVs are 1, then instead of compute N2W=0. start the syntax with:
if any(1, IV.1 to IV.3) N2W=0.

How to get the dimensionality of an ARRAY column?

I'm working on a project that collects information about your schema from the database directly. I can get the data_type of the column using information_schema.columns, which will tell me if it's an ARRAY or not. I can also get the underlying type (integer, bytea etc) of the ARRAY by querying information_schema.element_types as described here:
https://www.postgresql.org/docs/9.1/static/infoschema-element-types.html
My problem is that I also need to know how many dimensions the array has, whether it is integer[], or integer[][] for example. Does anyone know of a way to do this? Google isn't being very helpful here, hopefully someone more familiar with the Postgres spec can lead me in the right direction.
For starters, the dimensionality of an array is not reflected in the data type in Postgres. The syntax integer[][] is tolerated, but it's really just integer[] internally.
Read the manual here.
This means that dimensions can vary within the same array type (the same table column).
To get actual dimensions of a particular array value:
SELECT array_dims(my_arr); -- [1:2][1:3]
Or to just get the number of dimensions:
SELECT array_ndims(my_arr); -- 2
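For instance, with a concrete 2x3 literal (results shown in comments):

SELECT array_dims(ARRAY[[1,2,3],[4,5,6]]);   -- [1:2][1:3]
SELECT array_ndims(ARRAY[[1,2,3],[4,5,6]]);  -- 2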
There are more array functions for similar needs. See table of array functions in the manual.
Related:
Use string[][] with Npgsql
If you need to enforce particular dimensions in a column, add a CHECK constraint. To enforce 2-dimensional arrays:
ALTER TABLE tbl ADD CONSTRAINT tbl_arr_col_must_have_2_dims
CHECK (array_ndims(arr_col) = 2);
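With that constraint in place, inserting a value of the wrong dimensionality fails. A quick sketch, assuming tbl(arr_col integer[]) as above (note that a NULL array still passes, because the check expression is then null):

INSERT INTO tbl (arr_col) VALUES (ARRAY[1,2,3]);
-- ERROR:  new row for relation "tbl" violates check constraint
--         "tbl_arr_col_must_have_2_dims"
INSERT INTO tbl (arr_col) VALUES (ARRAY[[1,2],[3,4]]);  -- OK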
Multidimensional arrays support in Postgres is very specific. Multidimensional array types do not exist. If you declare an array as multidimensional, Postgres casts it automatically to a simple array type:
create table test(a integer[][]);
\d test
     Table "public.test"
 Column |   Type    | Modifiers
--------+-----------+-----------
 a      | integer[] |
You can store arrays of different dimensions in a column of an array type:
insert into test values
(array[1,2]),
(array[array[1,2], array[3,4]]);
select a, a[1] a1, a[2] a2, a[1][1] a11, a[2][2] a22
from test;
       a       | a1 | a2 | a11 | a22
---------------+----+----+-----+-----
 {1,2}         |  1 |  2 |     |
 {{1,2},{3,4}} |    |    |   1 |   4
(2 rows)
This is a key difference between Postgres and programming languages like C, Python, etc. The feature has its advantages and disadvantages, but it usually causes various problems for novices.
You can find the number of dimensions in the system catalog pg_attribute:
select attname, typname, attndims
from pg_class c
join pg_attribute a on c.oid = attrelid
join pg_type t on t.oid = atttypid
where relname = 'test'
and attnum > 0;
 attname | typname | attndims
---------+---------+----------
 a       | _int4   |        2
(1 row)
It is not clear whether you can rely on this number; per the documentation:
attndims - Number of dimensions, if the column is an array type; otherwise 0. (Presently, the number of dimensions of an array is not enforced, so any nonzero value effectively means "it's an array".)
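A quick check, continuing the test table from above, confirms that the declared dimensionality is not enforced at all:

insert into test values (array[7,8,9]);  -- accepted despite integer[][]
select a, array_ndims(a) from test;
--        a        | array_ndims
-- ----------------+-------------
--  {1,2}          |           1
--  {{1,2},{3,4}}  |           2
--  {7,8,9}        |           1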

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this:
obs | foo | bar | more
----|-----|-----|-----
 1  | 111 | 11  |  9
 2  |   9 |  2  |  2
........
I need to throw out the 4 largest and 4 smallest values of foo (and later do the same with bar) before I can proceed, but I'm unsure of the most effective way to do this. I know there are SMALLEST and LARGEST functions, but I don't understand how to use them to get the smallest or largest 4 from an already-made dataset. Alternatively, I could just remove the min and max 4 times, but that sounds needlessly tedious and time-consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
set class_ranks nobs=nobs;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
Of course, if you're just removing the values, you can do it all in the data step and CALL MISSING the variable when the condition is met, rather than deleting the observation.
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use PROC SQL to place the number of distinct values of foo into a macro variable (counting null values as a distinct value).
In your data step you can then use first.foo and the macro variable to selectively output only those rows that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
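For comparison, the same distinct-value filter can be written as a single correlated-subquery step in PROC SQL. This is just a sketch using the have/want names from above; note that SAS SQL treats missing as lower than any value, so handle missing foo separately if it can occur:

proc sql;
  create table want as
  select *
  from have a
  /* keep a row only if at least 4 distinct values lie below it
     and at least 4 distinct values lie above it */
  where (select count(distinct b.foo) from have b where b.foo < a.foo) >= 4
    and (select count(distinct b.foo) from have b where b.foo > a.foo) >= 4;
quit;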
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things in different ways). I knew my total number of observations and had already sorted the dataset by the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1 2 3 4 51 52 53 54};
PURGE;
CLOSE data_set;
QUIT;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

PostgreSQL PL/pgSQL random value from array of values

How can I declare an array like variable with two or three values and get them randomly during execution?
a := [1, 2, 5] -- sample sake
select random(a) -- returns random value
Any suggestion where to start?
Try this one:
select (array['Yes', 'No', 'Maybe'])[floor(random() * 3 + 1)];
Updated 2023-01-10 to fix the broken array literal. Made it several times faster while being at it:
CREATE OR REPLACE FUNCTION random_pick()
RETURNS int
LANGUAGE sql VOLATILE PARALLEL SAFE AS
$func$
SELECT ('[0:2]={1,2,5}'::int[])[trunc(random() * 3)::int];
$func$;
random() returns a value x where 0.0 <= x < 1.0. Multiply by 3 and truncate it with trunc() (slightly faster than floor()) to get 0, 1, or 2 with exactly equal chance.
Postgres array subscripts are 1-based by default (as per the SQL standard), so this would be off by one. We could add 1 every time, but for efficiency I declare the array index to start at 0 instead. Slightly faster yet. See:
Normalize array subscripts so they start with 1
The manual on mathematical functions.
PARALLEL SAFE for Postgres 9.6 or later. See:
PARALLEL label for a function with SELECT and INSERT
When to mark functions as PARALLEL RESTRICTED vs PARALLEL SAFE?
You can use the plain SELECT statement if you don't want to create a function:
SELECT ('[0:2]={1,2,5}'::int[])[trunc(random() * 3)::int];
Erwin Brandstetter answered the OP's question well enough. However, for others who want to understand how to randomly pick elements from more complex arrays (like me some two months ago), I expanded his function:
CREATE OR REPLACE FUNCTION random_pick( a anyarray, OUT x anyelement )
RETURNS anyelement AS
$func$
BEGIN
IF a = '{}' THEN
x := NULL::TEXT;
ELSE
WHILE x IS NULL LOOP
x := a[floor(array_lower(a, 1) + (random()*( array_upper(a, 1) - array_lower(a, 1)+1) ) )::int];
END LOOP;
END IF;
END
$func$ LANGUAGE plpgsql VOLATILE RETURNS NULL ON NULL INPUT;
A few assumptions:
this is not only for integer arrays, but for arrays of any type
we ignore NULL elements; NULL is returned only if the array is empty or if the input itself is NULL (inputs of other, non-array types produce an error)
the array doesn't need to be formatted as usual: the array index may start and end anywhere, and may have gaps, etc.
this is for one-dimensional arrays
Other notes:
without the first IF statement, an empty array would lead to an endless loop
without the loop, gaps and NULLs would make the function return NULL
omit both array_lower calls if you know that your arrays start at zero
with gaps in the index you need array_upper instead of array_length; without gaps they are the same (not sure which is faster, but they shouldn't differ much)
the +1 after the second array_lower serves to give the last value in the array the same probability as any other; otherwise it would require random()'s output to be exactly 1, which never happens
this is considerably slower than Erwin's solution, and likely overkill for your needs; in practice, most people would mix an ideal cocktail of the two
Here is another way to do the same thing
WITH arr AS (
SELECT '{1, 2, 5}'::INT[] a
)
SELECT a[1 + floor((random() * array_length(a, 1)))::int] FROM arr;
You can change the array to any type you would like.
CREATE OR REPLACE FUNCTION pick_random( members anyarray )
RETURNS anyelement AS
$$
BEGIN
RETURN members[trunc(random() * array_length(members, 1) + 1)];
END
$$ LANGUAGE plpgsql VOLATILE;
or
CREATE OR REPLACE FUNCTION pick_random( members anyarray )
RETURNS anyelement AS
$$
SELECT (array_agg(m1 order by random()))[1]
FROM unnest(members) m1;
$$ LANGUAGE SQL VOLATILE;
For bigger datasets, see:
http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/
http://www.depesz.com/2007/09/16/my-thoughts-on-getting-random-row/
https://blog.2ndquadrant.com/tablesample-and-other-methods-for-getting-random-tuples/
https://www.postgresql.org/docs/current/static/functions-math.html
CREATE FUNCTION random_pick(p_items anyarray)
RETURNS anyelement AS
$$
SELECT unnest(p_items) ORDER BY RANDOM() LIMIT 1;
$$ LANGUAGE SQL;
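A quick usage sketch for the helpers above (names as defined; all of them are VOLATILE, so each call may return a different element):

SELECT random_pick();                           -- no-arg variant: 1, 2 or 5
SELECT pick_random(ARRAY['Yes','No','Maybe']);  -- polymorphic variant
SELECT pick_random(ARRAY[1, 2, 5]);             -- works for any element type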
