Determine how similar two arrays are in PostgreSQL - arrays

I'm aware you can compare two arrays in PostgreSQL to see if the elements in one are contained in the elements of another like so,
SELECT ARRAY[1,2] <# ARRAY[1,2,3] --> true
Is there any way to get # of matches or say "if matches 2 of 3" ??
SELECT ARRAY[1,2] ?? ARRAY[1,2,3] --> 2/3 or 66.6666%
I'm open to interesting solutions.. I want to take an array and ultimately say it must match 2 of 3 elements from another array in an inline query.. or >= 66% or something of that nature.
Ideally like this..
SELECT * FROM SOMETABLE WHERE ARRAY[1,2] ?? ARRAY[1,2,3] >= 66.66666666666667
Thanks in advance.

From here Array functions
with match as
(select count(a) as match_ct from unnest(ARRAY[1,2] ) as a
join
(select * from unnest(ARRAY[2,1,3]) as b) t on a=t.b)
select
match_ct/total_ct::numeric from match,
(select count(*) as total_ct from unnest(ARRAY[1,2], ARRAY[2,1,3]) as t(a, b)) as total ;
?column?
------------------------
0.66666666666666666667

You could create a function for that:
CREATE FUNCTION array_similarity(anyarray, anyarray)
RETURNS double precision
LANGUAGE sql
IMMUTABLE STRICT AS
$$SELECT 100.0 * count(*) / cardinality($2)
FROM unnest($1) AS a1(e1)
WHERE ARRAY[e1] <# $2$$;

Related

indexOf alike function in HIVEQL

I have the following query:
SELECT
col1,
case when array_contains(col1, "c") then "c exists" end as col2
FROM
(
SELECT
*
FROM
(
SELECT
array("a","b","c") AS col1
) q1
) q2;
I want to check if the element "c" appears just before the element "b" in the array. In JavaScript I could use indexOf(), so if there were something similar in HiveQL I would do something like case when col1.indexOf("b") = col1.indexOf("c") - 1.
I have read the documentation, and it seems that the functions dealing with arrays are minimal.
I wouldn't like to split the array and check with LAG or LEAD.
I have tried with field("c", concat_ws(',',col1)) but this seems not to work neither.
You can concatenate array and use like or rlike. Example:
SELECT concat_ws(',',col1) rlike 'c,b' as c_before_b_flag
FROM
(
SELECT
array("a","b","c") AS col1
) q1
Result:
false
rlike 'b,c' gives true

Postgres: Need to select keywords as separate array values

Datatype:
id: int4
keywords: text
objectivable_id: int4
Postgres version: PostgreSQL 9.5.3
Business_objectives table:
id keywords objectivable_id
1 keyword1a,keyword1b,keyword1c 6
2 keyword2a 6
3 testing 5
Currently the query I'm using is :
select array(select b.keywords from business_objectives b where b.objectivable_id = 6)
It selects the keywords of matched objectivable_id as:
{"keyword1a,keyword1b,keyword1c","keyword2a"}
Over here I wanted the result to be :
{"keyword1a","keyword1b","keyword1c","keyword2a"}
I tried using "string_agg(text, delimiter)", but it just combines all the keywords into one single pocket of an array.
You can simply (and cheaply!) use:
SELECT string_to_array(string_agg(keywords, ','), ',')
FROM business_objectives
WHERE objectivable_id = 6;
Concatenate your comma separate lists with string_agg(), and then convert the complete text to an array with string_to_array().
So something like this can give you expected result:
SELECT array_agg( j.keys )
FROM business_objectives b,
LATERAL ( SELECT k
FROM unnest ( string_to_array( b.keywords, ',' ) ) u( k )
) j( keys )
WHERE b.objectivable_id = 6;
array_agg
-------------------------------------------
{keyword1a,keyword1b,keyword1c,keyword2a}
(1 row)
With the LATERAL part, we look at the outer query to create a new view. Simply it does split of your keywords as set of rows which you can then feed into array_agg() function.
See more about LATERAL: https://www.postgresql.org/docs/9.6/static/queries-table-expressions.html#QUERIES-LATERAL

Union of two arrays in PostgreSQL without unnesting

I have two arrays in PostgreSQL that I need to union. For example:
{1,2,3} union {1,4,5} would return {1,2,3,4,5}
Using the concatenate (||) operator would not remove duplicate entries, i.e. it returns {1,2,3,1,4,5}
I found one solution on the web, however I do not like how it needs to unnest both arrays:
select ARRAY(select unnest(ARRAY[1,2,3]) as a UNION select unnest(ARRAY[2,3,4,5]) as a)
Is there an operator or built-in function that will cleanly union two arrays?
If your problem is to unnest twice this will unnest only once
select array_agg(a order by a)
from (
select distinct unnest(array[1,2,3] || array[2,3,4,5]) as a
) s;
There is a extension intarray (in contrib package) that contains some useful functions and operators:
postgres=# create extension intarray ;
CREATE EXTENSION
with single pipe operator:
postgres=# select array[1,2,3] | array[3,4,5];
?column?
─────────────
{1,2,3,4,5}
(1 row)
or with uniq function:
postgres=# select uniq(ARRAY[1,2,3] || ARRAY[3,4,5]);
uniq
─────────────
{1,2,3,4,5}
(1 row)
ANSI/SQL knows a multiset, but it is not supported by PostgreSQL yet.
Can be done like so...
select uniq(sort(array_remove(array_cat(ARRAY[1,2,3], ARRAY[1,4,5]), NULL)))
gives:
{1,2,3,4,5}
array_remove is needed because your can't sort arrays with NULLS.
Sort is needed because uniq de-duplicates only if adjacent elements are found.
A benefit of this approach over #Clodoaldo Neto's is that works entire within the select, and doesn't the unnest in the FROM clause. This makes it straightforward to operate on multiple arrays columns at the same time, and in a single table-scan. (Although see Ryan Guill version as a function in the comment).
Also, this pattern works for all array types (who's elements are sortable).
A downside is that, feasibly, its a little slower for longer arrays (due to the sort and the 3 intermediate array allocations).
I think both this and the accept answer fail if you want to keep NULL in the result.
The intarray-based answers don't work when you're trying to take the set union of an array-valued column from a group of rows. The accepted array_agg-based answer can be modified to work, e.g.
SELECT selector_column, array_agg(a ORDER BY a) AS array_valued_column
FROM (
SELECT DISTINCT selector_column, UNNEST(array_valued_column) AS a FROM table
) _ GROUP BY selector_column;
but, if this is buried deep in a complex query, the planner won't be able to push outer WHERE expressions past it, even when they would substantially reduce the number of rows that have to be processed. The right solution in that case is to define a custom aggregate:
CREATE FUNCTION array_union_step (s ANYARRAY, n ANYARRAY) RETURNS ANYARRAY
AS $$ SELECT s || n; $$
LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;
CREATE FUNCTION array_union_final (s ANYARRAY) RETURNS ANYARRAY
AS $$
SELECT array_agg(i ORDER BY i) FROM (
SELECT DISTINCT UNNEST(x) AS i FROM (VALUES(s)) AS v(x)
) AS w WHERE i IS NOT NULL;
$$
LANGUAGE SQL IMMUTABLE LEAKPROOF PARALLEL SAFE;
CREATE AGGREGATE array_union (ANYARRAY) (
SFUNC = array_union_step,
STYPE = ANYARRAY,
FINALFUNC = array_union_final,
INITCOND = '{}',
PARALLEL = SAFE
);
Usage is
SELECT selector_column, array_union(array_valued_column) AS array_valued_column
FROM table
GROUP BY selector_column;
It's doing the same thing "under the hood", but because it's packaged into an aggregate function, the planner can see through it.
It's possible that this could be made more efficient by having the step function do the UNNEST and append the rows to a temporary table, rather than a scratch array, but I don't know how to do that and this is good enough for my use case.

Compare arrays for equality, ignoring order of elements

I have a table with 4 array columns.. the results are like:
ids signed_ids new_ids new_ids_signed
{1,2,3} | {2,1,3} | {4,5,6} | {6,5,4}
Anyway to compare ids and signed_ids so that they come out equal, by ignoring the order of the elements?
You can use contained by operator:
(array1 <# array2 and array1 #> array2)
The additional module intarray provides operators for arrays of integer, which are typically (much) faster. Install once per database with (in Postgres 9.1 or later):
CREATE EXTENSION intarray;
Then you can:
SELECT uniq(sort(ids)) = uniq(sort(signed_ids));
Or:
SELECT ids #> signed_ids AND ids <# signed_ids;
Bold emphasis on functions and operators from intarray.
In the second example, operator resolution arrives at the specialized intarray operators if left and right argument are type integer[].
Both expressions will ignore order and duplicity of elements. Further reading in the helpful manual here.
intarray operators only work for arrays of integer (int4), not bigint (int8) or smallint (int2) or any other data type.
Unlike the default generic operators, intarray operators do not accept NULL values in arrays. NULL in any involved array raises an exception. If you need to work with NULL values, you can default to the standard, generic operators by schema-qualifying the operator with the OPERATOR construct:
SELECT ARRAY[1,4,null,3]::int[] OPERATOR(pg_catalog.#>) ARRAY[3,1]::int[]
The generic operators can't use indexes with an intarray operator class and vice versa.
Related:
GIN index on smallint[] column not used or error "operator is not unique"
The simplest thing to do is sort them and compare them sorted. See sorting arrays in PostgreSQL.
Given sample data:
CREATE TABLE aa(ids integer[], signed_ids integer[]);
INSERT INTO aa(ids, signed_ids) VALUES (ARRAY[1,2,3], ARRAY[2,1,3]);
the best thing to do is to if the array entries are always integers is to use the intarray extension, as Erwin explains in his answer. It's a lot faster than any pure-SQL formulation.
Otherwise, for a general version that works for any data type, define an array_sort(anyarray):
CREATE OR REPLACE FUNCTION array_sort(anyarray) RETURNS anyarray AS $$
SELECT array_agg(x order by x) FROM unnest($1) x;
$$ LANGUAGE 'SQL';
and use it sort and compare the sorted arrays:
SELECT array_sort(ids) = array_sort(signed_ids) FROM aa;
There's an important caveat:
SELECT array_sort( ARRAY[1,2,2,4,4] ) = array_sort( ARRAY[1,2,4] );
will be false. This may or may not be what you want, depending on your intentions.
Alternately, define a function array_compare_as_set:
CREATE OR REPLACE FUNCTION array_compare_as_set(anyarray,anyarray) RETURNS boolean AS $$
SELECT CASE
WHEN array_dims($1) <> array_dims($2) THEN
'f'
WHEN array_length($1,1) <> array_length($2,1) THEN
'f'
ELSE
NOT EXISTS (
SELECT 1
FROM unnest($1) a
FULL JOIN unnest($2) b ON (a=b)
WHERE a IS NULL or b IS NULL
)
END
$$ LANGUAGE 'SQL' IMMUTABLE;
and then:
SELECT array_compare_as_set(ids, signed_ids) FROM aa;
This is subtly different from comparing two array_sorted values. array_compare_as_set will eliminate duplicates, making array_compare_as_set(ARRAY[1,2,3,3],ARRAY[1,2,3]) true, whereas array_sort(ARRAY[1,2,3,3]) = array_sort(ARRAY[1,2,3]) will be false.
Both of these approaches will have pretty bad performance. Consider ensuring that you always store your arrays sorted in the first place.
If your arrays have no duplicates and are of the same dimension:
use array contains #>
AND array_length where the length must match the size you want on both sides
select (string_agg(a,',' order by a) = string_agg(b,',' order by b)) from
(select unnest(array[1,2,3,2])::text as a,unnest(array[2,2,3,1])::text as b) A

"if, then, else" in SQLite

Without using custom functions, is it possible in SQLite to do the following. I have two tables, which are linked via common id numbers. In the second table, there are two variables. What I would like to do is be able to return a list of results, consisting of: the row id, and NULL if all instances of those two variables (and there may be more than two) are NULL, 1 if they are all 0 and 2 if one or more is 1.
What I have right now is as follows:
SELECT
a.aid,
(SELECT count(*) from W3S19 b WHERE a.aid=b.aid) as num,
(SELECT count(*) FROM W3S19 c WHERE a.aid=c.aid AND H110 IS NULL AND H112 IS NULL) as num_null,
(SELECT count(*) FROM W3S19 d WHERE a.aid=d.aid AND (H110=1 or H112=1)) AS num_yes
FROM W3 a
So what this requires is to step through each result as follows (rough Python pseudocode):
if row['num_yes'] > 0:
out[aid] = 2
elif row['num_null'] == row['num']:
out[aid] = 'NULL'
else:
out[aid] = 1
Is there an easier way? Thanks!
Use CASE...WHEN, e.g.
CASE x WHEN w1 THEN r1 WHEN w2 THEN r2 ELSE r3 END
Read more from SQLite syntax manual (go to section "The CASE expression").
There's another way, for numeric values, which might be easier for certain specific cases.
It's based on the fact that boolean values is 1 or 0, "if condition" gives a boolean result:
(this will work only for "or" condition, depends on the usage)
SELECT (w1=TRUE)*r1 + (w2=TRUE)*r2 + ...
of course #evan's answer is the general-purpose, correct answer

Resources