PostGIS: optimised way to find intersection between polygon and circle - postgis

I am trying to find intersections between incidents (polygons) and watchzones (circles, stored as a point plus a radius) using PostGIS. The baseline data will be more than 10 000 polygons and 500 000 circles. I am also quite new to PostGIS.
I have tried a few things, but execution takes quite long. Can someone please suggest optimisations or a better way, using PostGIS only? Here is what I have tried:
1. Using Geometry datatype:
I stored the incidents and watchzones as type geometry, created GIST indexes on them, and used ST_DWithin to find the intersections.
The query with 1 incident and 500 000 watchzones took about 6.750 sec. The time taken is fine, but the problem is that my radius is in meters, and with the geometry type ST_DWithin expects the distance in the units of the SRID. I am unable to figure out this conversion.
CREATE TABLE incident (
incident_id SERIAL NOT NULL,
incident_name VARCHAR(20),
incident_span GEOMETRY(POLYGON, 4326),
CONSTRAINT incident_id PRIMARY KEY (incident_id)
);
CREATE TABLE watchzones (
id SERIAL NOT NULL,
date_created timestamp with time zone DEFAULT now(),
latitude NUMERIC(10, 7) DEFAULT NULL,
Longitude NUMERIC(10, 7) DEFAULT NULL,
radius integer,
position GEOMETRY(POINT, 4326),
CONSTRAINT id PRIMARY KEY (id)
);
CREATE INDEX ix_spatial_geom on watchzones using gist(position);
CREATE INDEX ix_spatial_geom_1 on incident using gist(incident_span);
Insert into incident values (
1,
'test',
ST_GeomFromText('POLYGON((152.945470916 -29.212227933,152.942130026 -29.213431145,152.939345911 -29.2125423759999,152.935144791 -29.21454003,152.933185494 -29.2135838469999,152.929481762 -29.216065516,152.929698621 -29.217402937,152.927245999
-29.219576,152.921539 -29.217676,152.918487996 -29.2113786959999,152.919254355 -29.206029929,152.919692387 -29.2027824419999,152.936020197 -29.207567346,152.944901258 -29.207729953,152.945470916
-29.212227933))',
4326
)
);
insert into watchzones
SELECT generate_series(1, 500000) AS id,
now(),
-29.21073,
152.93322,
'50',
ST_GeomFromText('POINT( 152.93322 -29.21073)', 4326);
explain analyze SELECT wz.id,
i.incident_id
FROM watchzones wz,
incident i
WHERE ST_DWithin(incident_span,position,wz.radius);
"Nested Loop (cost=0.14..227467.00 rows=42 width=8) (actual time=0.142..1506.476 rows=500000 loops=1)"
" -> Seq Scan on watchzones wz (cost=0.00..11173.00 rows=500000 width=40) (actual time=0.109..47.822 rows=500000 loops=1)"
" -> Index Scan using ix_spatial_geom_1 on incident i (cost=0.14..0.42 rows=1 width=284) (actual time=0.002..0.002 rows=1 loops=500000)"
" Index Cond: (incident_span && st_expand(wz."position", (wz.radius)::double precision))"
" Filter: ((wz."position" && st_expand(incident_span, (wz.radius)::double precision)) AND _st_dwithin(incident_span, wz."position", (wz.radius)::double precision))"
"Planning time: 0.150 ms"
"Execution time: 1523.312 ms"
2. Using Geography data type:
The query with 1 incident and 500 000 watchzones took about 29.987 sec here, which is quite slow. Please note that I have tried this with both GIST and BRIN indexes, and also ran VACUUM ANALYZE on the tables.
CREATE TABLE watchzones_geog
(
id SERIAL PRIMARY KEY,
date_created TIMESTAMP with time zone DEFAULT now(),
latitude NUMERIC(10, 7) DEFAULT NULL,
longitude NUMERIC(10, 7) DEFAULT NULL,
radius INTEGER,
position geography(point)
);
CREATE INDEX watchzones_geog_gix ON watchzones_geog USING GIST (position);
insert into watchzones_geog
SELECT generate_series(1,500000) AS id, now(),-29.21073,152.93322,'50',ST_GeogFromText('POINT(152.93322 -29.21073)');
CREATE TABLE incident_geog (
incident_id SERIAL PRIMARY KEY,
incident_name VARCHAR(20),
incident_span GEOGRAPHY(POLYGON)
);
CREATE INDEX incident_geog_gix ON incident_geog USING GIST (incident_span);
Insert into incident_geog values (1,'test', ST_GeogFromText
('POLYGON((152.945470916 -29.212227933,152.942130026 -29.213431145,152.939345911 -29.2125423759999,152.935144791 -29.21454003,152.933185494 -29.2135838469999,152.929481762 -29.216065516,152.929698621 -29.217402937,152.927245999
-29.219576,152.921539 -29.217676,152.918487996 -29.2113786959999,152.919254355 -29.206029929,152.919692387 -29.2027824419999,152.936020197 -29.207567346,152.944901258 -29.207729953,152.945470916
-29.212227933))'));
explain analyze SELECT i.incident_id,
wz.id
FROM watchzones_geog wz,
incident_geog i
WHERE St_dwithin(position, incident_span, radius);
"Nested Loop (cost=0.27..348717.00 rows=17 width=8) (actual time=0.277..18551.844 rows=500000 loops=1)"
" -> Seq Scan on watchzones_geog wz (cost=0.00..11173.00 rows=500000 width=40) (actual time=0.102..50.052 rows=500000 loops=1)"
" -> Index Scan using incident_geog_gix on incident_geog i (cost=0.27..0.67 rows=1 width=711) (actual time=0.036..0.036 rows=1 loops=500000)"
" Index Cond: (incident_span && _st_expand(wz."position", (wz.radius)::double precision))"
" Filter: ((wz."position" && _st_expand(incident_span, (wz.radius)::double precision)) AND _st_dwithin(wz."position", incident_span, (wz.radius)::double precision, true))"
"Planning time: 0.155 ms"
"Execution time: 18587.041 ms"
3. I have also tried creating a circle using ST_Buffer(position, radius,'quad_segs=8') and then using ST_Intersects. With this, the query takes more than a minute with both the geometry and geography data types.
It would be great if someone could suggest a better way or some optimisations to speed up the execution.
Thanks

The query is fine, but your sample is wrong. First, note that a query optimized for 1 polygon might not be the same as one optimized for several thousand.
The main issue is with the sample points. As it is, you have 500,000 points at the exact same location, so depending on the intersecting polygon, the query will return either 0 or 500,000 results. PostGIS starts by using the index to match points and polygons by bounding box, and then refines the results by computing the true distance. With your sample, it has to compute the distance 500,000 times, which is slow.
Using a point layer with random locations (within 1 degree), the query takes less than 1 second, as it has to compute the distance for only 20 locations.
INSERT INTO watchzones_geog
SELECT generate_series(1,500000) AS id, now(),0,0,'50',
ST_makePoint(152.93322+random(),-29.21073+random())::geography;
explain analyze SELECT i.incident_id,
wz.id
FROM watchzones_geog wz,
incident_geog i
WHERE St_dwithin(position, incident_span, radius);
Nested Loop (cost=0.00..272424.01 rows=1 width=8) (actual time=25.956..921.846 rows=20 loops=1)
--------------------------------------------
Join Filter: ((wz."position" && _st_expand(i.incident_span, (wz.radius)::double precision)) AND (i.incident_span && _st_expand(wz."position", (wz.radius)::double precision)) AND _st_dwithin(wz."position", i.incident_span, (wz.radius)::double precision, true))
Rows Removed by Join Filter: 499980
-> Seq Scan on incident_geog i (cost=0.00..1.01 rows=1 width=36) (actual time=0.009..0.009 rows=1 loops=1)
-> Seq Scan on watchzones_geog wz (cost=0.00..11173.00 rows=500000 width=40) (actual time=0.006..65.625 rows=500000 loops=1)
Planning time: 1.887 ms
Execution time: 921.895 ms
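On the unit problem raised under the geometry approach: with SRID 4326, ST_DWithin on geometry treats the distance as degrees, so a radius in meters needs either a cast to geography or a projected SRID. A minimal sketch of both options follows. The index name and the choice of EPSG:32756 (WGS 84 / UTM zone 56S, which covers the sample coordinates) are assumptions, not part of the original setup.

```sql
-- Option A: keep the geometry(…, 4326) columns and cast to geography in the
-- predicate, so wz.radius is interpreted in meters.
-- The expression index (hypothetical name) gives the planner a geography index
-- on the points, since the plain geometry index does not cover the cast.
CREATE INDEX ix_watchzones_position_geog
    ON watchzones USING gist (("position"::geography));

SELECT wz.id, i.incident_id
FROM watchzones wz
JOIN incident i
  ON ST_DWithin(i.incident_span::geography, wz."position"::geography, wz.radius);

-- Option B: transform both sides to a projected SRID whose unit is the meter.
-- 32756 is an assumption that suits data around 153°E, 29°S only.
SELECT wz.id, i.incident_id
FROM watchzones wz
JOIN incident i
  ON ST_DWithin(ST_Transform(i.incident_span, 32756),
                ST_Transform(wz."position", 32756),
                wz.radius);
```

Option A works anywhere on the globe at the cost of slower spheroidal distance computations. Option B is only as accurate as the chosen projection is for the area the data covers.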

Related

Postgresql index best practices / performance

I have a test PostgreSQL 10.13 instance and am looking for answers to several questions about indexes. Take this table as an example, and assume it has 40k entries:
CREATE TABLE "public"."ecom_input"
(
"id" serial PRIMARY KEY,
"ecom_id" INTEGER NOT NULL,
"sku_supplier" CHARACTER VARYING(100) NOT NULL
);
Does a PRIMARY KEY have its own INDEX automatically?
With this query :
EXPLAIN SELECT * FROM ecom_input WHERE id = 27846;
I have the same results whether using the PRIMARY KEY or the INDEX :
--> Index Scan using ecom_input_pkey on ecom_input (cost=0.29..8.31 rows=1 width=853)
--> Index Scan using "ecom_input_id_idx" on ecom_input (cost=0.29..8.31 rows=1 width=853)
After creating a multi column INDEX on the same table :
CREATE INDEX idx_ecom_input ON ecom_input (ecom_id, sku_supplier);
Try :
EXPLAIN SELECT * FROM ecom_input WHERE ecom_id = 22 AND sku_supplier = 'MATHILDEJAS';
I have good cost performance:
--> Index Scan using idx_ecom_input on ecom_input (cost=0.41..8.43 rows=1 width=853)
Losing a bit of performance when using only WHERE sku_supplier = 'MATHILDEJAS'; :
--> Index Scan using idx_ecom_input on ecom_input (cost=0.41..1044.64 rows=1 width=853)
But when using only WHERE ecom_id = 22; :
--> Bitmap Heap Scan on ecom_input (cost=5.58..547.25 rows=150 width=853)
Recheck Cond: (ecom_id = 22)
-> Bitmap Index Scan on idx_ecom_input (cost=0.00..5.54 rows=150 width=0)
Index Cond: (ecom_id = 22)
Why does the optimizer use a Bitmap Heap Scan when only ecom_id is in the WHERE clause, but not when only sku_supplier is? There are 150 rows where ecom_id = 22 but only 1 row where sku_supplier = 'MATHILDEJAS'. Is this the reason?
We usually JOIN on the 'id' column of our tables (which are PRIMARY KEYs). Is an INDEX used during JOIN operations? If so, is it good practice to always JOIN on a PRIMARY KEY?
At what point (approximately) can we see a real performance difference between querying with and without an INDEX? We are using cloud-based databases, and even if the network is good, the client-server round trip over the internet is what takes longest.
Example:
EXPLAIN ANALYZE SELECT * FROM ecom_input WHERE sku_supplier = 'MATHILDEJAS' AND ecom_id = 22;
With INDEX :
"Index Scan using idx_ecom_input on ecom_input (cost=0.41..8.43 rows=1 width=853) (actual time=0.017..0.018 rows=1 loops=1)"
"Index Cond: ((ecom_id = 22) AND ((sku_supplier)::text = 'MATHILDEJAS'::text))"
"Planning time: 0.086 ms"
"Execution time: 0.034 ms"
Without INDEX :
"Seq Scan on ecom_input (cost=0.00..10034.43 rows=1 width=853) (actual time=0.006..13.555 rows=1 loops=1)"
"Filter: (((sku_supplier)::text = 'MATHILDEJAS'::text) AND (ecom_id = 22))"
"Rows Removed by Filter: 40561"
"Planning time: 0.097 ms"
"Execution time: 13.572 ms"
We gain 13 ms in total. These requests take 250-700 ms to go to the server and back, so the index makes no real difference on a table of 40k entries. Do you know at roughly how many entries an INDEX becomes useful?
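On the first question: PostgreSQL automatically creates a unique B-tree index to enforce a PRIMARY KEY, which is why the plan shows ecom_input_pkey without any explicit CREATE INDEX. A quick way to check this, as a sketch against the table defined above:

```sql
-- List every index on the table; the primary key appears as ecom_input_pkey
-- even though no CREATE INDEX was ever issued for it.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename = 'ecom_input';
```

A separate index on id alone, such as ecom_input_id_idx, duplicates the primary key index. That matches the identical costs in the two plans above.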

postgresql 9.6.4: timestamp range query on large table takes forever

I need some help analyzing the bad performance of a query executed on a large table containing 83.660.142 million rows. Depending on the system load, the query takes from 25 minutes to more than one hour.
I've created the following table that consists of a composite key and 3 indexes:
CREATE TABLE IF NOT EXISTS ds1records(
userid INT DEFAULT 0,
clientid VARCHAR(255) DEFAULT '',
ts TIMESTAMP,
site VARCHAR(50) DEFAULT '',
code VARCHAR(400) DEFAULT '');
CREATE UNIQUE INDEX IF NOT EXISTS primary_idx ON records (userid, clientid, ts, site, code);
CREATE INDEX IF NOT EXISTS userid_idx ON records (userid);
CREATE INDEX IF NOT EXISTS ts_idx ON records (ts);
CREATE INDEX IF NOT EXISTS userid_ts_idx ON records (userid ASC,ts DESC);
In a spring batch application I'm executing a query that looks as follows:
SELECT *
FROM records
WHERE userid = ANY(VALUES (2), ..., (96158 more userids) )
AND ( ts < '2017-09-02' AND ts >= '2017-09-01'
OR ts < '2017-08-26' AND ts >= '2017-08-25'
OR ts < '2017-08-19' AND ts >= '2017-08-18'
OR ts < '2017-08-12' AND ts >= '2017-08-11')
The user IDs are determined at runtime (their number lies between 95.000 and 110.000). For each user I need to extract the page views of the current day and of the same weekday in the previous three weeks. The query always returns between 3M and 4M rows.
Executing the query with the EXPLAIN ANALYZE option returns the following execution plan.
Nested Loop (cost=1483.40..1246386.43 rows=3761735 width=70) (actual time=108.856..1465501.596 rows=3643240 loops=1)
-> HashAggregate (cost=1442.38..1444.38 rows=200 width=4) (actual time=33.277..201.819 rows=96159 loops=1)
Group Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..1201.99 rows=96159 width=4) (actual time=0.006..11.599 rows=96159 loops=1)
-> Bitmap Heap Scan on records (cost=41.02..6224.01 rows=70 width=70) (actual time=8.865..15.218 rows=38 loops=96159)
Recheck Cond: (userid = "*VALUES*".column1)
Filter: (((ts < '2017-09-02 00:00:00'::timestamp without time zone) AND (ts >= '2017-09-01 00:00:00'::timestamp without time zone)) OR ((ts < '2017-08-26 00:00:00'::timestamp without time zone) AND (ts >= '2017-08-25 00:00:00'::timestamp without time zone)) OR ((ts < '2017-08-19 00:00:00'::timestamp without time zone) AND (ts >= '2017-08-18 00:00:00'::timestamp without time zone)) OR ((ts < '2017-08-12 00:00:00'::timestamp without time zone) AND (ts >= '2017-08-11 00:00:00'::timestamp without time zone)))
Rows Removed by Filter: 792
Heap Blocks: exact=77251145
-> Bitmap Index Scan on userid_ts_idx (cost=0.00..41.00 rows=1660 width=0) (actual time=6.593..6.593 rows=830 loops=96159)
Index Cond: (userid = "*VALUES*".column1)
I've adjusted the values of some Postgres tuning parameters (unfortunately with no success):
effective_cache_size=15GB (probably useless as query is executed just once)
shared_buffers=15GB
work_mem=3GB
The application runs computationally expensive tasks (e.g. data fusion/data injection) and consumes roughly 100GB memory, so the system hardware is sufficiently dimensioned with 125GB RAM and 16 cores (OS: Debian).
I'm wondering why Postgres is not using the combined index userid_ts_idx in its execution plan. Since the timestamp column in the index is sorted in reverse order, I would expect Postgres to use it for the range part of the query: it could go sequentially through the index until the condition ts < '2017-09-02 00:00:00' holds true, and then return all values until the condition ts >= '2017-09-01 00:00:00' no longer holds. Instead, Postgres uses the expensive Bitmap Heap Scan, which does a linear table scan, if I understood correctly. Did I misconfigure the db settings, or do I have a conceptual misunderstanding?
Update
The CTE suggested in the comments unfortunately did not bring any improvement. The Bitmap Heap Scan has been replaced by a Sequential Scan, but the performance is still poor. The updated execution plan follows:
Merge Join (cost=20564929.37..20575876.60 rows=685277 width=106) (actual time=2218133.229..2222280.192 rows=3907472 loops=1)
Merge Cond: (ids.id = r.userid)
Buffers: shared hit=2408684 read=181785
CTE ids
-> Values Scan on "*VALUES*" (cost=0.00..1289.70 rows=103176 width=4) (actual time=0.002..28.670 rows=103176 loops=1)
CTE ts
-> Values Scan on "*VALUES*_1" (cost=0.00..0.05 rows=4 width=32) (actual time=0.002..0.004 rows=4 loops=1)
-> Sort (cost=10655.37..10913.31 rows=103176 width=4) (actual time=68.476..83.312 rows=103176 loops=1)
Sort Key: ids.id
Sort Method: quicksort Memory: 7909kB
-> CTE Scan on ids (cost=0.00..2063.52 rows=103176 width=4) (actual time=0.007..47.868 rows=103176 loops=1)
-> Sort (cost=20552984.25..20554773.54 rows=715717 width=102) (actual time=2218059.941..2221230.585 rows=8085760 loops=1)
Sort Key: r.userid
Sort Method: quicksort Memory: 1410084kB
Buffers: shared hit=2408684 read=181785
-> Nested Loop (cost=0.00..20483384.24 rows=715717 width=102) (actual time=885849.043..2214665.723 rows=8085767 loops=1)
Join Filter: (ts.r @> r.ts)
Rows Removed by Join Filter: 707630821
Buffers: shared hit=2408684 read=181785
-> Seq Scan on records r (cost=0.00..4379760.52 rows=178929152 width=70) (actual time=0.024..645616.135 rows=178929147 loops=1)
Buffers: shared hit=2408684 read=181785
-> CTE Scan on ts (cost=0.00..0.08 rows=4 width=32) (actual time=0.000..0.000 rows=4 loops=178929147)
Planning time: 126.110 ms
Execution time: 2222514.566 ms
You should get a different plan if you cast the timestamp to a date and filter by a list of values instead.
CREATE INDEX IF NOT EXISTS userid_ts_idx ON records (userid ASC,cast(ts AS date) DESC);
SELECT *
FROM records
WHERE userid = ANY(VALUES (2), ..., (96158 more userids) )
AND cast(ts AS date) IN('2017-09-01','2017-08-25','2017-08-18','2017-08-11');
Whether it performs better depends on your data and date range: in my case, I found that Postgres keeps using that index even when the date values cover the whole table (where a seq scan would be better).
Demo
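One caveat when trying this: the question already created an index named userid_ts_idx, and CREATE INDEX IF NOT EXISTS skips creation (with a notice) when that name is taken. A sketch that avoids the collision is shown here. The index name userid_tsdate_idx and the two-element id list are placeholders, not taken from the original post.

```sql
-- Hypothetical index name, chosen so it does not collide with the
-- existing userid_ts_idx from the question.
CREATE INDEX IF NOT EXISTS userid_tsdate_idx
    ON records (userid, (cast(ts AS date)));

-- Refresh statistics so the planner knows about the new expression.
ANALYZE records;

-- Short stand-in for the real list of roughly 100k user ids.
EXPLAIN
SELECT *
FROM records
WHERE userid = ANY (VALUES (2), (3))
  AND cast(ts AS date) IN ('2017-09-01', '2017-08-25', '2017-08-18', '2017-08-11');
```

The expression in the WHERE clause has to match the indexed expression, here cast(ts AS date), for the planner to consider the index.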

Postgres not Using different query plan for higher offsets

I have this postgres query
explain SELECT "facilities".* FROM "facilities" INNER JOIN
resource_indices ON resource_indices.resource_id = facilities.uuid WHERE
(client_id IS NULL OR (client_tag=NULL AND client_id=7))
AND (ARRAY['country:india']::varchar[] && resource_indices.tags)
AND "facilities"."is_listed" = 't'
ORDER BY resource_indices.name LIMIT 11 OFFSET 100;
Note the offset. When the offset is less than about 200, it uses the index and works fine.
The query plan for that is as follows:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=23416.57..24704.45 rows=11 width=1457) (actual time=41.951..43.035 rows=11 loops=1)
-> Nested Loop (cost=0.71..213202.15 rows=1821 width=1457) (actual time=2.107..43.007 rows=211 loops=1)
-> Index Scan using index_resource_indices_on_name on resource_indices (cost=0.42..190226.95 rows=12460 width=28) (actual time=2.096..40.790 rows=408 loops=1)
Filter: ('{country:india}'::character varying[] && tags)
Rows Removed by Filter: 4495
-> Index Scan using index_facilities_on_uuid on facilities (cost=0.29..1.83 rows=1 width=1445) (actual time=0.005..0.005 rows=1 loops=408)
Index Cond: (uuid = resource_indices.resource_id)
Filter: ((client_id IS NULL) AND is_listed)
Planning time: 1.259 ms
Execution time: 43.121 ms
(10 rows)
Increasing the offset to, say, 400 makes it switch to a hash join with much poorer performance, and performance keeps degrading as the offset grows.
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=34508.62..34508.65 rows=11 width=1457) (actual time=136.288..136.291 rows=11 loops=1)
-> Sort (cost=34507.62..34512.18 rows=1821 width=1457) (actual time=136.224..136.268 rows=411 loops=1)
Sort Key: resource_indices.name
Sort Method: top-N heapsort Memory: 638kB
-> Hash Join (cost=29104.96..34419.46 rows=1821 width=1457) (actual time=23.885..95.099 rows=6518 loops=1)
Hash Cond: (facilities.uuid = resource_indices.resource_id)
-> Seq Scan on facilities (cost=0.00..4958.39 rows=33790 width=1445) (actual time=0.010..48.732 rows=33711 loops=1)
Filter: ((client_id IS NULL) AND is_listed)
Rows Removed by Filter: 848
-> Hash (cost=28949.21..28949.21 rows=12460 width=28) (actual time=23.311..23.311 rows=12601 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 814kB
-> Bitmap Heap Scan on resource_indices (cost=1048.56..28949.21 rows=12460 width=28) (actual time=9.369..18.710 rows=12601 loops=1)
Recheck Cond: ('{country:india}'::character varying[] && tags)
Heap Blocks: exact=7334
-> Bitmap Index Scan on index_resource_indices_on_tags (cost=0.00..1045.45 rows=12460 width=0) (actual time=7.680..7.680 rows=13889 loops=1)
Index Cond: ('{country:india}'::character varying[] && tags)
Planning time: 1.408 ms
Execution time: 136.465 ms
(18 rows)
How do I resolve this? Thank you
That is unavoidable, because there is no other way to implement LIMIT 10 OFFSET 10000 but to fetch the first 10010 rows and throw away all but the last 10. This is bound to perform increasingly badly as the offset is raised.
PostgreSQL switches to a different plan because it has to retrieve more result rows, and “fast start” plans that are quick to retrieve the first few rows and usually involve nested loop joins will perform worse than other plans when more result rows are needed.
OFFSET is evil and you should avoid it in most cases. Read what Markus Winand has to say about this topic, particularly how to paginate result sets without OFFSET.
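A minimal sketch of pagination without OFFSET (often called keyset pagination) for the query above. It assumes resource_indices.resource_id is unique enough to act as a tiebreaker for equal names. The two literal values are hypothetical placeholders for the sort values of the last row on the previous page.

```sql
-- The client_id condition is written the way the plans above show it after
-- simplification: (client_id IS NULL), evaluated on facilities.
SELECT facilities.*, resource_indices.name, resource_indices.resource_id
FROM facilities
INNER JOIN resource_indices
        ON resource_indices.resource_id = facilities.uuid
WHERE facilities.client_id IS NULL
  AND ARRAY['country:india']::varchar[] && resource_indices.tags
  AND facilities.is_listed
  -- Resume strictly after the last row already shown (placeholder values).
  AND (resource_indices.name, resource_indices.resource_id)
      > ('last name seen', '00000000-0000-0000-0000-000000000000')
ORDER BY resource_indices.name, resource_indices.resource_id
LIMIT 11;
```

Because the row comparison lets the scan start where the previous page ended, the cost of fetching a page no longer grows with the page number. The trade-off is that jumping straight to an arbitrary page is no longer possible.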

postgres: why does this GIN index not used for this "object in array" query

I am trying to index a JSONB column that contains array of objects :
create table tmp_t (a INTEGER PRIMARY KEY,o jsonb);
insert into tmp_t (a,o) values(1, '[{"frame": 1, "accession": "NM_001184642.1"}]');
insert into tmp_t (a,o) values (2, '[{"frame": 3, "accession": "NM_001178208.1"}]');
CREATE INDEX idx_tmp_t ON tmp_t USING gin (o);
EXPLAIN tells me the following query does not use the index :
EXPLAIN
SELECT * from tmp_t v where v.o @> '[{"accession": "NM_001178208.1"}]';
explain result:
QUERY PLAN
Seq Scan on tmp_t v (cost=0.00..1.02 rows=1 width=36)
Filter: (o @> '[{""accession"": ""NM_001178208.1""}]'::jsonb)
My setup seems identical to the one given in answer to this question :
Using indexes in json array in PostgreSQL
I have created the example table in the question, and the index does get used :
"QUERY PLAN"
"Bitmap Heap Scan on tracks (cost=16.01..20.02 rows=1 width=36)"
" Recheck Cond: (artists @> '[{""z"": 2}]'::jsonb)"
" -> Bitmap Index Scan on tracks_artists_gin_idx (cost=0.00..16.01 rows=1 width=0)"
" Index Cond: (artists @> '[{""z"": 2}]'::jsonb)"
Actually the index is used, just use larger test data.
The planner can choose different plans depending on the data. It often happens that the planner doesn't use an index on a dataset with a small number of rows and starts using it when the amount of data grows.
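One way to see this for yourself, as a sketch: fill tmp_t with generated rows and re-run EXPLAIN using the jsonb containment operator @>. The generated accession values are made up purely for volume.

```sql
-- Add synthetic rows (ids 3..100000) so a sequential scan is no longer cheap.
INSERT INTO tmp_t (a, o)
SELECT g,
       jsonb_build_array(
           jsonb_build_object('frame', g % 3 + 1, 'accession', 'NM_' || g))
FROM generate_series(3, 100000) AS g;

-- Refresh planner statistics after the bulk load.
ANALYZE tmp_t;

EXPLAIN
SELECT * FROM tmp_t v WHERE v.o @> '[{"accession": "NM_500"}]';

-- On a tiny table, discouraging sequential scans for the session shows
-- whether the index is usable at all (for testing only, not production).
SET enable_seqscan = off;
EXPLAIN
SELECT * FROM tmp_t v WHERE v.o @> '[{"accession": "NM_001178208.1"}]';
RESET enable_seqscan;
```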

Index not used when doing join on two text arrays using overlap operator with Postgresql 9.4

I am trying to optimize the join below, but the cost still seems too high. Is there any way to force Postgres to use an index when joining on two text array fields?
-> Unique (cost=16508500.04..16510899.32 rows=319904 width=38) (actual time=580978.121..581078.948 rows=415229 loops=1)
-> Sort (cost=16508500.04..16509299.80 rows=319904 width=38) (actual time=580978.120..581013.446 rows=415229 loops=1)
Sort Key: t992_1.name, t294_1.name
Sort Method: quicksort Memory: 51186kB
-> Nested Loop (cost=0.00..16479249.17 rows=319904 width=38) (actual time=1.335..579142.184 rows=415229 loops=1)
Join Filter: (array_lowercase((t294_1.name)::character varying[]) && array_lowercase((t992_1.name)::character varying[]))
Rows Removed by Join Filter: 31577903
-> Seq Scan on c04 t992_1 (cost=0.00..4106.69 rows=69848 width=195) (actual time=0.003..40.408 rows=69854 loops=1)
Filter: __name_flag
Rows Removed by Filter: 15
-> Materialize (cost=0.00..95.87 rows=458 width=83) (actual time=0.000..0.031 rows=458 loops=69854)
-> Seq Scan on cat t294_1 (cost=0.00..93.58 rows=458 width=83) (actual time=0.003..0.381 rows=458 loops=1)
The problematic part of the query is array_lowercase(t294_1.name) && array_lowercase(t992_1.name). I have GIN index on both columns (with array_lowercase).
In the end I solved it by using a slightly different function for the index. With the original function, defined below, the index is not used:
CREATE OR REPLACE FUNCTION array_lowercase(character varying[])
RETURNS character varying[] AS
$BODY$
SELECT array_agg(q.tag) FROM (
SELECT btrim(lower(unnest($1)))::varchar AS tag
) AS q;
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;
But if array_lowercase is defined slightly differently, using text[] instead of character varying[], the index is picked up and used automatically:
CREATE OR REPLACE FUNCTION array_lowercase(text[])
RETURNS text[] AS
$BODY$
SELECT array_agg(q.tag) FROM (
SELECT btrim(lower(unnest($1))) AS tag
) AS q;
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;
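For context, a sketch of how the text[] version might be wired up. The original post does not show the index definitions or the query, so the index names, the assumption that both name columns are text[], and the query shape (reconstructed loosely from the plan above) are all guesses.

```sql
-- Hypothetical expression indexes built on the text[] version of the function.
CREATE INDEX cat_name_lower_gin ON cat USING gin (array_lowercase(name));
CREATE INDEX c04_name_lower_gin ON c04 USING gin (array_lowercase(name));

ANALYZE cat;
ANALYZE c04;

-- The join condition must use the same expression as the index definition
-- for the planner to consider an index scan on the inner side.
EXPLAIN
SELECT DISTINCT c04.name, cat.name
FROM c04
JOIN cat ON array_lowercase(cat.name) && array_lowercase(c04.name)
WHERE c04.__name_flag;
```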

Resources