Union of two arrays in PostgreSQL without unnesting - arrays

I have two arrays in PostgreSQL that I need to union. For example:
{1,2,3} union {1,4,5} would return {1,2,3,4,5}
Using the concatenate (||) operator would not remove duplicate entries, i.e. it returns {1,2,3,1,4,5}
I found one solution on the web, however I do not like how it needs to unnest both arrays:
select ARRAY(select unnest(ARRAY[1,2,3]) as a UNION select unnest(ARRAY[2,3,4,5]) as a)
Is there an operator or built-in function that will cleanly union two arrays?

If your problem is to unnest twice this will unnest only once
select array_agg(a order by a)
from (
select distinct unnest(array[1,2,3] || array[2,3,4,5]) as a
) s;

There is a extension intarray (in contrib package) that contains some useful functions and operators:
postgres=# create extension intarray ;
CREATE EXTENSION
with single pipe operator:
postgres=# select array[1,2,3] | array[3,4,5];
?column?
─────────────
{1,2,3,4,5}
(1 row)
or with uniq function:
postgres=# select uniq(ARRAY[1,2,3] || ARRAY[3,4,5]);
uniq
─────────────
{1,2,3,4,5}
(1 row)
ANSI/SQL knows a multiset, but it is not supported by PostgreSQL yet.

Can be done like so...
select uniq(sort(array_remove(array_cat(ARRAY[1,2,3], ARRAY[1,4,5]), NULL)))
gives:
{1,2,3,4,5}
array_remove is needed because your can't sort arrays with NULLS.
Sort is needed because uniq de-duplicates only if adjacent elements are found.
A benefit of this approach over #Clodoaldo Neto's is that works entire within the select, and doesn't the unnest in the FROM clause. This makes it straightforward to operate on multiple arrays columns at the same time, and in a single table-scan. (Although see Ryan Guill version as a function in the comment).
Also, this pattern works for all array types (who's elements are sortable).
A downside is that, feasibly, its a little slower for longer arrays (due to the sort and the 3 intermediate array allocations).
I think both this and the accept answer fail if you want to keep NULL in the result.

The intarray-based answers don't work when you're trying to take the set union of an array-valued column from a group of rows. The accepted array_agg-based answer can be modified to work, e.g.
SELECT selector_column, array_agg(a ORDER BY a) AS array_valued_column
FROM (
SELECT DISTINCT selector_column, UNNEST(array_valued_column) AS a FROM table
) _ GROUP BY selector_column;
but, if this is buried deep in a complex query, the planner won't be able to push outer WHERE expressions past it, even when they would substantially reduce the number of rows that have to be processed. The right solution in that case is to define a custom aggregate:
CREATE FUNCTION array_union_step (s ANYARRAY, n ANYARRAY) RETURNS ANYARRAY
AS $$ SELECT s || n; $$
LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;
CREATE FUNCTION array_union_final (s ANYARRAY) RETURNS ANYARRAY
AS $$
SELECT array_agg(i ORDER BY i) FROM (
SELECT DISTINCT UNNEST(x) AS i FROM (VALUES(s)) AS v(x)
) AS w WHERE i IS NOT NULL;
$$
LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;
CREATE AGGREGATE array_union (ANYARRAY) (
SFUNC = array_union_step,
STYPE = ANYARRAY,
FINALFUNC = array_union_final,
INITCOND = '{}',
PARALLEL = SAFE
);
Usage is
SELECT selector_column, array_union(array_valued_column) AS array_valued_column
FROM table
GROUP BY selector_column;
It's doing the same thing "under the hood", but because it's packaged into an aggregate function, the planner can see through it.
It's possible that this could be made more efficient by having the step function do the UNNEST and append the rows to a temporary table, rather than a scratch array, but I don't know how to do that and this is good enough for my use case.

Related

Postgres: calling function with text[] param fails with array literal

I have a Postgres function that accepts a text[] as input. For example
create function temp1(player_ids text[])
returns void
language plpgsql
as
$$
begin
update players set player_xp = 0
where id in (player_ids);
-- the body is actually 20 lines long, updating a lot of tables
end;
$$;
and I'm trying to call it, but I keep getting
[42883] ERROR: operator does not exist: text = text[] Hint: No operator matches the given name and argument types. You might need to add explicit type casts. Where: PL/pgSQL function temp1(text[]) line 3 at SQL statement
I have tried these so far
select temp1('{F7AWLJWYQ5BMPKGXLMDNQKQ4NY,AQPBAFKQONGLBKIMCSOD747GY4}');
select temp1('{F7AWLJWYQ5BMPKGXLMDNQKQ4NY,AQPBAFKQONGLBKIMCSOD747GY4}'::text[]);
select temp1(array['F7AWLJWYQ5BMPKGXLMDNQKQ4NY,AQPBAFKQONGLBKIMCSOD747GY4']);
select temp1(array['F7AWLJWYQ5BMPKGXLMDNQKQ4NY,AQPBAFKQONGLBKIMCSOD747GY4']::text[]);
I have to be missing something obvious...how do I call this function with an array literal?
Use = any instead of in:
...
update players set player_xp = 0
where id = any(player_ids);
...
The IN operator acts on an explicit list of values.
expression IN (value [, ...])
When you want to compare a value to each element of an array, use ANY instead.
expression operator ANY (array expression)
Note that there are variants of both constructs for subqueries expression IN (subquery) and expression operator ANY (subquery). The first one was properly used in the other answer though a subquery seems excessive in this case.
You can use unnest function, this function is very easy and same time best performanced. Unnest using for converting array elements to rows. Example:
create function temp1(player_ids text[])
returns void
language plpgsql
as
$$
begin
update players set player_xp = 0
where id in (select pl.id from unnest(player_ids) as pl(id));
-- the body is actually 20 lines long, updating a lot of tables
end;
$$;
And you can easily cast array elements to another type for using unnest.
Example:
update players set player_xp = 0
where id in (select pl.id::integer from unnest(player_ids) as pl(id));

Determine how similar two arrays are in PostgreSQL

I'm aware you can compare two arrays in PostgreSQL to see if the elements in one are contained in the elements of another like so,
SELECT ARRAY[1,2] <# ARRAY[1,2,3] --> true
Is there any way to get # of matches or say "if matches 2 of 3" ??
SELECT ARRAY[1,2] ?? ARRAY[1,2,3] --> 2/3 or 66.6666%
I'm open to interesting solutions.. I want to take an array and ultimately say it must match 2 of 3 elements from another array in an inline query.. or >= 66% or something of that nature.
Ideally like this..
SELECT * FROM SOMETABLE WHERE ARRAY[1,2] ?? ARRAY[1,2,3] >= 66.66666666666667
Thanks in advance.
From here Array functions
with match as
(select count(a) as match_ct from unnest(ARRAY[1,2] ) as a
join
(select * from unnest(ARRAY[2,1,3]) as b) t on a=t.b)
select
match_ct/total_ct::numeric from match,
(select count(*) as total_ct from unnest(ARRAY[1,2], ARRAY[2,1,3]) as t(a, b)) as total ;
?column?
------------------------
0.66666666666666666667
You could create a function for that:
CREATE FUNCTION array_similarity(anyarray, anyarray)
RETURNS double precision
LANGUAGE sql
IMMUTABLE STRICT AS
$$SELECT 100.0 * count(*) / cardinality($2)
FROM unnest($1) AS a1(e1)
WHERE ARRAY[e1] <# $2$$;

Return rows matching elements of input array in plpgsql function

I would like to create a PostgreSQL function that does something like the following:
CREATE FUNCTION avg_purchases( IN last_names text[] DEFAULT '{}' )
RETURNS TABLE(last_name text[], avg_purchase_size double precision)
AS
$BODY$
DECLARE
qry text;
BEGIN
qry := 'SELECT last_name, AVG(purchase_size)
FROM purchases
WHERE last_name = ANY($1)
GROUP BY last_name'
RETURN QUERY EXECUTE qry USING last_names;
END;
$BODY$
But I see two problems here:
It is not clear to me that array type is the most useful type of input.
This is currently returning zero rows when I do:
SELECT avg_purchases($${'Brown','Smith','Jones'}$$);
What am I missing?
This works:
CREATE OR REPLACE FUNCTION avg_purchases(last_names text[] = '{}')
RETURNS TABLE(last_name text, avg_purchase_size float8)
LANGUAGE sql AS
$func$
SELECT last_name, avg(purchase_size)::float8
FROM purchases
WHERE last_name = ANY($1)
GROUP BY last_name
$func$;
Call:
SELECT * FROM avg_purchases('{foo,Bar,baz,"}weird_name''$$"}');
Or (example with dollar-quoting):
SELECT * FROM avg_purchases($x${foo,Bar,baz,"}weird_name'$$"}$x$);
How to quote string literals:
Insert text with single quotes in PostgreSQL
You don't need dynamic SQL here.
While you can wrap it into a plpgsql function (which may be useful), a simple SQL function is doing the basic job just fine.
You had type mismatches:
The result of avg() may be numeric to hold a precise result. A cast to float8 (alias for double precision) makes it work. For perfect precision, use numeric instead.
The OUT parameter last_name must be text instead of text[].
VARIADIC
An array is a useful type of input. If it's easier for your client you can also use a VARIADIC input parameter that allows to pass the array as a list of elements:
CREATE OR REPLACE FUNCTION avg_purchases(VARIADIC last_names text[] = '{}')
RETURNS TABLE(last_name text, avg_purchase_size float8)
LANGUAGE sql AS
$func$
SELECT last_name, avg(purchase_size)::float8
FROM purchases
JOIN (SELECT unnest($1)) t(last_name) USING (last_name)
GROUP BY 1
$func$;
Call:
SELECT * FROM avg_purchases('foo', 'Bar', 'baz', '"}weird_name''$$"}');
Or (with dollar-quoting):
SELECT * FROM avg_purchases('foo', 'Bar', 'baz', $y$'"}weird_name'$$"}$y$);
Stock Postgres only allows a maximum of 100 elements. This is determined at compile time by the preset option:
max_function_args (integer)
Reports the maximum number of function arguments. It is determined by the value of FUNC_MAX_ARGS when building the server. The default value is 100 arguments.
You can still call it with array notation when prefixed with the keyword VARIADIC:
SELECT * FROM avg_purchases(VARIADIC '{1,2,3, ... 99,100,101}');
For bigger arrays (100+), consider unnest() in a subquery and JOIN to it, tends to scale better:
Optimizing a Postgres query with a large IN

PostgreSQL PL/pgSQL random value from array of values

How can I declare an array like variable with two or three values and get them randomly during execution?
a := [1, 2, 5] -- sample sake
select random(a) -- returns random value
Any suggestion where to start?
Try this one:
select (array['Yes', 'No', 'Maybe'])[floor(random() * 3 + 1)];
Updated 2023-01-10 to fix the broken array literal. Made it several times faster while being at it:
CREATE OR REPLACE FUNCTION random_pick()
RETURNS int
LANGUAGE sql VOLATILE PARALLEL SAFE AS
$func$
SELECT ('[0:2]={1,2,5}'::int[])[trunc(random() * 3)::int];
$func$;
random() returns a value x where 0.0 <= x < 1.0. Multiply by 3 and truncate it with trunc() (slightly faster than floor()) to get 0, 1, or 2 with exactly equal chance.
Postgres indexes are 1-based by default (as per SQL standard). This would be off-by-1. We could increment by 1 every time, but for efficiency I declare the array index to start with 0 instead. Slightly faster, yet. See:
Normalize array subscripts so they start with 1
The manual on mathematical functions.
PARALLEL SAFE for Postgres 9.6 or later. See:
PARALLEL label for a function with SELECT and INSERT
When to mark functions as PARALLEL RESTRICTED vs PARALLEL SAFE?
You can use the plain SELECT statement if you don't want to create a function:
SELECT ('[0:2]={1,2,5}'::int[])[trunc(random() * 3)::int];
Erwin Brandstetter answered the OP's question well enough. However, for others looking for understanding how to randomly pick elements from more complex arrays (like me some two months ago), I expanded his function:
CREATE OR REPLACE FUNCTION random_pick( a anyarray, OUT x anyelement )
RETURNS anyelement AS
$func$
BEGIN
IF a = '{}' THEN
x := NULL::TEXT;
ELSE
WHILE x IS NULL LOOP
x := a[floor(array_lower(a, 1) + (random()*( array_upper(a, 1) - array_lower(a, 1)+1) ) )::int];
END LOOP;
END IF;
END
$func$ LANGUAGE plpgsql VOLATILE RETURNS NULL ON NULL INPUT;
Few assumptions:
this is not only for integer arrays, but for arrays of any type
we ignore NULL data; NULL is returned only if the array is empty or if NULL is inserted (values of other non-array types produce an error)
the array don't need to be formatted as usual - the array index may start and end anywhere, may have gaps etc.
this is for one-dimensional arrays
Other notes:
without the first IF statement, empty array would lead to an endless loop
without the loop, gaps and NULLs would make the function return NULL
omit both array_lower calls if you know that your arrays start at zero
with gaps in the index, you will need array_upper instead of array_length; without gaps, it's the same (not sure which is faster, but they shouldn't be much different)
the +1 after second array_lower serves to get the last value in the array with the same probability as any other; otherwise it would need the random()'s output to be exactly 1, which never happens
this is considerably slower than Erwin's solution, and likely to be an overkill for the your needs; in practice, most people would mix an ideal cocktail from the two
Here is another way to do the same thing
WITH arr AS (
SELECT '{1, 2, 5}'::INT[] a
)
SELECT a[1 + floor((random() * array_length(a, 1)))::int] FROM arr;
You can change the array to any type you would like.
CREATE OR REPLACE FUNCTION pick_random( members anyarray )
RETURNS anyelement AS
$$
BEGIN
RETURN members[trunc(random() * array_length(members, 1) + 1)];
END
$$ LANGUAGE plpgsql VOLATILE;
or
CREATE OR REPLACE FUNCTION pick_random( members anyarray )
RETURNS anyelement AS
$$
SELECT (array_agg(m1 order by random()))[1]
FROM unnest(members) m1;
$$ LANGUAGE SQL VOLATILE;
For bigger datasets, see:
http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/
http://www.depesz.com/2007/09/16/my-thoughts-on-getting-random-row/
https://blog.2ndquadrant.com/tablesample-and-other-methods-for-getting-random-tuples/
https://www.postgresql.org/docs/current/static/functions-math.html
CREATE FUNCTION random_pick(p_items anyarray)
RETURNS anyelement AS
$$
SELECT unnest(p_items) ORDER BY RANDOM() LIMIT 1;
$$ LANGUAGE SQL;

Compare arrays for equality, ignoring order of elements

I have a table with 4 array columns.. the results are like:
ids signed_ids new_ids new_ids_signed
{1,2,3} | {2,1,3} | {4,5,6} | {6,5,4}
Anyway to compare ids and signed_ids so that they come out equal, by ignoring the order of the elements?
You can use contained by operator:
(array1 <# array2 and array1 #> array2)
The additional module intarray provides operators for arrays of integer, which are typically (much) faster. Install once per database with (in Postgres 9.1 or later):
CREATE EXTENSION intarray;
Then you can:
SELECT uniq(sort(ids)) = uniq(sort(signed_ids));
Or:
SELECT ids #> signed_ids AND ids <# signed_ids;
Bold emphasis on functions and operators from intarray.
In the second example, operator resolution arrives at the specialized intarray operators if left and right argument are type integer[].
Both expressions will ignore order and duplicity of elements. Further reading in the helpful manual here.
intarray operators only work for arrays of integer (int4), not bigint (int8) or smallint (int2) or any other data type.
Unlike the default generic operators, intarray operators do not accept NULL values in arrays. NULL in any involved array raises an exception. If you need to work with NULL values, you can default to the standard, generic operators by schema-qualifying the operator with the OPERATOR construct:
SELECT ARRAY[1,4,null,3]::int[] OPERATOR(pg_catalog.#>) ARRAY[3,1]::int[]
The generic operators can't use indexes with an intarray operator class and vice versa.
Related:
GIN index on smallint[] column not used or error "operator is not unique"
The simplest thing to do is sort them and compare them sorted. See sorting arrays in PostgreSQL.
Given sample data:
CREATE TABLE aa(ids integer[], signed_ids integer[]);
INSERT INTO aa(ids, signed_ids) VALUES (ARRAY[1,2,3], ARRAY[2,1,3]);
the best thing to do is to if the array entries are always integers is to use the intarray extension, as Erwin explains in his answer. It's a lot faster than any pure-SQL formulation.
Otherwise, for a general version that works for any data type, define an array_sort(anyarray):
CREATE OR REPLACE FUNCTION array_sort(anyarray) RETURNS anyarray AS $$
SELECT array_agg(x order by x) FROM unnest($1) x;
$$ LANGUAGE 'SQL';
and use it sort and compare the sorted arrays:
SELECT array_sort(ids) = array_sort(signed_ids) FROM aa;
There's an important caveat:
SELECT array_sort( ARRAY[1,2,2,4,4] ) = array_sort( ARRAY[1,2,4] );
will be false. This may or may not be what you want, depending on your intentions.
Alternately, define a function array_compare_as_set:
CREATE OR REPLACE FUNCTION array_compare_as_set(anyarray,anyarray) RETURNS boolean AS $$
SELECT CASE
WHEN array_dims($1) <> array_dims($2) THEN
'f'
WHEN array_length($1,1) <> array_length($2,1) THEN
'f'
ELSE
NOT EXISTS (
SELECT 1
FROM unnest($1) a
FULL JOIN unnest($2) b ON (a=b)
WHERE a IS NULL or b IS NULL
)
END
$$ LANGUAGE 'SQL' IMMUTABLE;
and then:
SELECT array_compare_as_set(ids, signed_ids) FROM aa;
This is subtly different from comparing two array_sorted values. array_compare_as_set will eliminate duplicates, making array_compare_as_set(ARRAY[1,2,3,3],ARRAY[1,2,3]) true, whereas array_sort(ARRAY[1,2,3,3]) = array_sort(ARRAY[1,2,3]) will be false.
Both of these approaches will have pretty bad performance. Consider ensuring that you always store your arrays sorted in the first place.
If your arrays have no duplicates and are of the same dimension:
use array contains #>
AND array_length where the length must match the size you want on both sides
select (string_agg(a,',' order by a) = string_agg(b,',' order by b)) from
(select unnest(array[1,2,3,2])::text as a,unnest(array[2,2,3,1])::text as b) A

Resources