Indexing a jsonb array in Postgres [duplicate]

This question already has an answer here:
Index for finding an element in a JSON array
(1 answer)
Closed 27 days ago.
I am trying to set up a GIN index, but I do not think it is used when I run the query, whether I use an operator or a function.
Environment
In our table we have a JSONB field (json_aip) containing a Json that looks like that:
{
  "properties": {
    "pdi": {
      "contextInformation": {
        "tags": ["SOME_TAG"]
      }
    }
  }
}
Table creation:
create table t_aip (
json_aip jsonb,
[...]
);
CREATE INDEX idx_aip_tags
ON t_aip
USING gin ((json_aip -> 'properties' -> 'pdi' -> 'contextInformation' -> 'tags'));
Operator query
We can't use the ?| operator directly because we go through JDBC, where a bare question mark is parsed as a bind parameter. Still, from what I have read, the index should be used when I run this type of query.
EXPLAIN ANALYZE SELECT count(*)
FROM storage.t_aip
WHERE json_aip#>'{properties,pdi,contextInformation,tags}' ?| array['SOME_TAG']
Result:
Aggregate (cost=27052.16..27052.17 rows=1 width=8) (actual time=488.085..488.087 rows=1 loops=1)
-> Seq Scan on t_aip (cost=0.00..27052.06 rows=42 width=0) (actual time=0.134..456.978 rows=16502 loops=1)
Filter: ((json_aip #> '{properties,pdi,contextInformation,tags}'::text[]) ?| '{SOME_TAG}'::text[])
Rows Removed by Filter: 17511
Planning time: 23.202 ms
Execution time: 488.449 ms
Functional query
EXPLAIN ANALYZE SELECT count(*)
FROM storage.t_aip
WHERE jsonb_exists_any(
json_aip#>'{properties,pdi,contextInformation,tags}',
array['SOME_TAG']
)
Result:
QUERY PLAN
Aggregate (cost=27087.00..27087.01 rows=1 width=8) (actual time=369.931..369.933 rows=1 loops=1)
-> Seq Scan on t_aip (cost=0.00..27052.06 rows=13979 width=0) (actual time=0.173..350.437 rows=16502 loops=1)
Filter: jsonb_exists_any((json_aip #> '{properties,pdi,contextInformation,tags}'::text[]), '{SOME_TAG}'::text[])
Rows Removed by Filter: 17511
Planning time: 56.021 ms
Execution time: 370.252 ms
There is nothing about the index in either plan. Any help would be much appreciated!
My guess is that the index is wrong: it seems to index the value at the end of the path json_aip -> 'properties' -> 'pdi' -> 'contextInformation' -> 'tags' as a string, whereas it is actually an array. But that is only my opinion.

As a general rule, a query must use exactly the same expression as the index definition for the index to be used. With this index:
CREATE INDEX idx_aip_tags
ON t_aip
USING gin ((json_aip#>'{properties,pdi,contextInformation,tags}'));
this query will use the index:
EXPLAIN ANALYZE
SELECT count(*)
FROM t_aip
WHERE json_aip#>'{properties,pdi,contextInformation,tags}' ?| array['SOME_TAG']
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=149.97..149.98 rows=1 width=0) (actual time=27.783..27.783 rows=1 loops=1)
-> Bitmap Heap Scan on t_aip (cost=20.31..149.87 rows=40 width=0) (actual time=1.504..25.726 rows=20000 loops=1)
Recheck Cond: ((json_aip #> '{properties,pdi,contextInformation,tags}'::text[]) ?| '{SOME_TAG}'::text[])
Heap Blocks: exact=345
-> Bitmap Index Scan on idx_aip_tags (cost=0.00..20.30 rows=40 width=0) (actual time=1.455..1.455 rows=20000 loops=1)
Index Cond: ((json_aip #> '{properties,pdi,contextInformation,tags}'::text[]) ?| '{SOME_TAG}'::text[])
Note that the GIN index also supports the @> (containment) operator:
SELECT count(*)
FROM t_aip
WHERE json_aip#>'{properties,pdi,contextInformation,tags}' @> '["SOME_TAG"]'
but be careful when searching for more than one tag:
SELECT count(*)
FROM t_aip
-- this gives objects containing both tags:
-- WHERE json_aip#>'{properties,pdi,contextInformation,tags}' @> '["SOME_TAG", "ANOTHER_TAG"]'
-- and this gives objects with any of the two tags:
WHERE json_aip#>'{properties,pdi,contextInformation,tags}' @> ANY(ARRAY['["SOME_TAG"]', '["ANOTHER_TAG"]']::jsonb[])

EDITED:
I thought the opposite, but in fact THERE IS a difference between operators (?|) and functions (jsonb_exists_any) regarding index usage: the index is never used when the query calls the jsonb functions.
You can get more about it here: https://dba.stackexchange.com/a/91007
That was the subject of the other question linked in this topic.
EDIT2:
You can create wrapper functions that still let the planner use the index and that can be called as ordinary functions from your code, like this:
-- Define functions that call the native Postgres operators, to work around the JDBC issue with question marks
CREATE OR REPLACE FUNCTION rs_jsonb_exists_all(jsonb, text[])
RETURNS bool AS
'SELECT $1 ?& $2' LANGUAGE sql IMMUTABLE;
CREATE OR REPLACE FUNCTION rs_jsonb_exists(jsonb, text)
RETURNS bool AS
'SELECT $1 ? $2' LANGUAGE sql IMMUTABLE;
CREATE OR REPLACE FUNCTION rs_jsonb_exists_any(jsonb, text[])
RETURNS bool AS
'SELECT $1 ?| $2' LANGUAGE sql IMMUTABLE;
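For example, assuming the wrappers above are installed, a JDBC-prepared query can use an ordinary bind parameter (the single ? below is a JDBC placeholder, not an operator) and still hit the GIN index:

```sql
-- prepared via JDBC; the '?' is bound as a text[] (java.sql.Array)
SELECT count(*)
FROM t_aip
WHERE rs_jsonb_exists_any(
    json_aip #> '{properties,pdi,contextInformation,tags}',
    ?
);
```

Because the wrappers are simple single-statement SQL functions, the planner can inline them back into the ?| operator form, which is what allows the index on the #> expression to match.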

Related

Postgres index-only scan taking too long

I have a table with the below structure and indexes:
Table "public.client_data"
Column | Type | Modifiers
-------------------------+---------+-----------
account_id | text | not_null
client_id | text | not null
client_type | text | not null
creation_time | bigint | not null
last_modified_time | bigint | not null
Indexes:
"client_data_pkey" PRIMARY KEY, btree (account_id, client_id)
"client_data_last_modified_time_index" btree (last_modified_time)
From this table I need to find the oldest record - for this I used the following query:
SELECT last_modified_time FROM client_data ORDER BY last_modified_time ASC LIMIT 1;
However, this query on a table with around 61 million rows runs very slowly (90-100 minutes) on a db.r4.2xlarge RDS instance running AWS Aurora Postgres 9.6, with no other concurrent queries running.
Changing the query to DESC, however, finishes instantly. What could be the problem? I was expecting that, since I have an index on last_modified_time, querying only that column, ordered by that column, with the limit applied would be an index-only scan that stops after the first entry in the index.
Here is the output of the explain analyze:
EXPLAIN ANALYZE SELECT last_modified_time FROM client_data ORDER BY last_modified_time ASC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..2.31 rows=1 width=8) (actual time=6297823.287..6297823.287 rows=1 loops=1)
-> Index Only Scan using client_data_last_modified_time_index on client_data (cost=0.57..1049731749.38 rows=606590292 width=8) (actual time=6297823.287..6297823.287 rows=1 loops=1)
Heap Fetches: 26575013
Planning time: 0.078 ms
Execution time: 6297823.306 ms
The same for the DESC version of the query results in the following
EXPLAIN ANALYZE SELECT last_modified_time FROM client_data ORDER BY last_modified_time DESC LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..2.32 rows=1 width=8) (actual time=2.265..2.266 rows=1 loops=1)
-> Index Only Scan Backward using client_data_last_modified_time_index on client_data (cost=0.57..1066049674.69 rows=611336085 width=8) (actual time=2.264..2.264 rows=1 loops=1)
Heap Fetches: 9
Planning time: 0.095 ms
Execution time: 2.278 ms
Any pointers?
The difference is this:
The slow plan has
Heap Fetches: 26575013
and the fast plan
Heap Fetches: 9
Heap fetches are what turn a fast index-only scan into a slow, ordinary index scan.
Did the table experience mass updates or deletions recently?
The reason for the slow scan is that it has to wade through a lot of invisible (deleted) tuples before it hits the first match.
Run VACUUM on the table, and both scans will be fast.
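A minimal sketch of that maintenance step (table name taken from the question); the options are not strictly required, but VERBOSE reports how many dead tuples were removed and ANALYZE refreshes the statistics at the same time:

```sql
VACUUM (VERBOSE, ANALYZE) client_data;
```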

Can I provide extra context somehow to allow postgres to efficiently sort/limit views without calculating their rows?

Given this somewhat contrived query
select id, pg_sleep(0.001)::text from administrative_areas;
When I add an order and limit to it directly, the sleep executes only once and results are returned quickly.
> explain analyze select id, pg_sleep(0.001)::text from administrative_areas order by id desc limit 1;
Limit (cost=0.28..0.39 rows=1 width=36) (actual time=4.227..4.228 rows=1 loops=1)
-> Index Only Scan Backward using administrative_areas_pkey on administrative_areas (cost=0.28..69.50 rows=604 width=36) (actual time=4.227..4.227 rows=1 loops=1)
Heap Fetches: 1
Planning time: 0.066 ms
Execution time: 4.243 ms
If I throw the same query in a view
CREATE OR REPLACE VIEW sleepy AS
select id, pg_sleep(0.001)::text from administrative_areas;
When querying with an ORDER BY and LIMIT, the sleep executes once for every row in the underlying administrative_areas table.
> explain analyze select * from sleepy order by id desc limit 1;
Limit (cost=30.63..30.63 rows=1 width=36) (actual time=3794.827..3794.829 rows=1 loops=1)
-> Sort (cost=30.63..32.14 rows=604 width=36) (actual time=3794.825..3794.825 rows=1 loops=1)
Sort Key: administrative_areas.id DESC
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on administrative_areas (cost=0.00..21.57 rows=604 width=36) (actual time=6.432..3792.566 rows=604 loops=1)
Planning time: 0.072 ms
Execution time: 3794.851 ms
Is there any additional context I can add to the view or provide at query time to allow the planner to optimize this?
I believe this is because pg_sleep is a volatile function. When you query the view, you are in effect doing this:
select id from (select id, pg_sleep(0.001)::text from administrative_areas) order by id desc limit 1;
Postgres sees that volatile function in the subquery and runs it for each row. Let's test this.
create table test as select id from generate_series(1, 1000) g(id);
create index on test(id);
analyze test;
create view sleepy as select id, pg_sleep(0.001)::text from test;
explain analyze select * from sleepy order by id desc limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Limit (cost=37.50..37.50 rows=1 width=36) (actual time=1640.368..1640.439 rows=1 loops=1)
-> Sort (cost=37.50..40.00 rows=1000 width=36) (actual time=1640.336..1640.358 rows=1 loops=1)
Sort Key: test.id DESC
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on test (cost=0.00..22.50 rows=1000 width=36) (actual time=1.511..1623.058 rows=1000 loops=1)
Planning Time: 0.175 ms
Execution Time: 1640.617 ms
(7 rows)
This ran pg_sleep for each row in test, as expected.
Now try a stable function:
create function not_so_sleepy()
returns void AS
$$
select pg_sleep(0.001)
$$ language sql
stable; -- NOTE: this is just to trick postgres
create view not_as_sleepy as
select id,
not_so_sleepy()::text
FROM test;
explain analyze select *
from not_as_sleepy
order by id desc limit 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..0.32 rows=1 width=36) (actual time=0.049..0.198 rows=1 loops=1)
-> Index Only Scan Backward using test_id_idx on test (cost=0.28..43.27 rows=1000 width=36) (actual time=0.024..0.048 rows=1 loops=1)
Heap Fetches: 1
Planning Time: 1.786 ms
Execution Time: 0.308 ms
(5 rows)
In the second case, we told Postgres that the function has no side effects, so it can safely avoid evaluating it for rows that are never returned. The function must be marked stable or immutable (and, of course, must actually be stable/immutable) for Postgres to skip running it per row.
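As a quick way to check how a function is classified, the volatility marker is visible in the pg_proc catalog ('i' = immutable, 's' = stable, 'v' = volatile):

```sql
SELECT proname, provolatile
FROM pg_proc
WHERE proname IN ('pg_sleep', 'not_so_sleepy');
-- pg_sleep is marked 'v' (volatile); the wrapper above is 's' (stable)
```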

Postgres using primary_key index in almost every query

We're upgrading our Postgres database from version 9.3.14 to 9.4.9 and are currently in the testing phase. While testing, we've encountered a problem that leads to high CPU usage on 9.4.9: there are queries where Postgres 9.4 uses the primary-key index even though cheaper options exist. For example, running EXPLAIN ANALYZE for the query below:
SELECT a.id as a_id, b.col_id as col_id
FROM a
INNER JOIN b ON b.id = a.b_id
WHERE (a.col_text = 'pqrs' AND a.col_int = 1)
ORDER BY a.id ASC LIMIT 1
gives this:
Limit (cost=0.87..4181.94 rows=1 width=8) (actual time=93014.991..93014.992 rows=1 loops=1)
-> Nested Loop (cost=0.87..1551177.78 rows=371 width=8) (actual time=93014.990..93014.990 rows=1 loops=1)
-> Index Scan using a_pkey on a (cost=0.43..1548042.20 rows=371 width=8) (actual time=93014.968..93014.968 rows=1 loops=1)
Filter: ((col_int = 1) AND ((col_text)::text = 'pqrs'::text))
Rows Removed by Filter: 16114217
-> Index Scan using b_pkey on b (cost=0.43..8.44 rows=1 width=8) (actual time=0.014..0.014 rows=1 loops=1)
Index Cond: (id = a.b_id)
Planning time: 0.291 ms
Execution time: 93015.041 ms
While the query plan for the same query in 9.3.14 gives this:
Limit (cost=17.06..17.06 rows=1 width=8) (actual time=5.066..5.067 rows=1 loops=1)
-> Sort (cost=17.06..17.06 rows=1 width=8) (actual time=5.065..5.065 rows=1 loops=1)
Sort Key: a.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=1.00..17.05 rows=1 width=8) (actual time=5.047..5.049 rows=1 loops=1)
-> Index Scan using index_a_on_col_text on a (cost=0.56..8.58 rows=1 width=8) (actual time=3.154..3.155 rows=1 loops=1)
Index Cond: ((col_text)::text = 'pqrs'::text)
Filter: (col_int = 1)
-> Index Scan using b_pkey on b (cost=0.43..8.46 rows=1 width=8) (actual time=1.888..1.889 rows=1 loops=1)
Index Cond: (id = a.b_id)
Total runtime: 5.112 ms
If I remove the ORDER BY clause, the query works fine using the proper index. I can understand that with ORDER BY the planner tries to use the primary-key index to scan rows in order and return the first match. But as the plans show, sorting explicitly is much cheaper.
I've explored Postgres parameters like enable_indexscan and enable_seqscan, which are on by default. We want to leave it to the database to decide between an index scan and a sequential scan. We've also tried tweaking effective_cache_size, random_page_cost and seq_page_cost. enable_sort is also on.
It's not happening only for this particular query: a few other queries also use the primary-key index rather than more effective alternatives.
P.S.:
After opening a case to AWS Support, this is what I got:
I understand that you want to know why you have degraded performance
on your recently upgraded instance. This is the expected and general
behavior of upgrade on a Postgres instance. Once upgrade is completed,
you need to run ANALYZE on each user database to update statistics of
the tables. This also makes SQLs performing better. A better way to do
that is using vacuumdb[1], like this:
vacuumdb -U [your user] -d [your database] -Ze -h [your rds endpoint]
It will optimize your database execution plans only, without freeing space,
but will take less time than a complete vacuum.
And this resolved the issue. Hope this helps others who stumble upon such an issue.
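Equivalently, from a psql session on the upgraded database, a plain ANALYZE refreshes the planner statistics for every table the current user can access:

```sql
ANALYZE VERBOSE;  -- rebuild planner statistics; VERBOSE lists each table as it is processed
```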

Postgres: the query sometimes takes too long

I need help or any hint. I have a Postgres 9.4 database and one query that is SOMETIMES processed very slowly.
SELECT COUNT(*) FROM "table_a" INNER JOIN "table_b" ON "table_b"."id" = "table_a"."table_b_id" AND "table_b"."deleted_at" IS NULL WHERE "table_a"."deleted_at" IS NULL AND "table_b"."company_id" = ? AND "table_a"."company_id" = ?
Query plan for this -
Aggregate (cost=308160.70..308160.71 rows=1 width=0)
-> Hash Join (cost=284954.16..308160.65 rows=20 width=0)
Hash Cond: ?
-> Bitmap Heap Scan on table_a (cost=276092.39..299260.96 rows=6035 width=4)
Recheck Cond: ?
Filter: ?
-> Bitmap Index Scan on index_table_a_on_created_at_and_company_id (cost=0.00..276090.89 rows=6751 width=0)
Index Cond: ?
-> Hash (cost=8821.52..8821.52 rows=3220 width=4)
-> Bitmap Heap Scan on table_b (cost=106.04..8821.52 rows=3220 width=4)
Recheck Cond: ?
Filter: ?
-> Bitmap Index Scan on index_table_b_on_company_id (cost=0.00..105.23 rows=3308 width=0)
Index Cond: ?
Usually this query executes fast enough (about 69.7 ms), and I don't understand why it is sometimes slow. In the performance logs for those periods I saw that my RDS instance consumes a lot of memory and that this query runs about 100 times per second. Any pointers on where to look next would be appreciated.
I am not sure if this will solve your problem or not :)
When this query returns very quickly, it is most likely because the data is already in cache, so little disk I/O is needed.
First of all, check whether too many queries are being executed against these tables, especially inserts/updates/deletes: heavy concurrent write activity bloats the tables and indexes and competes with your query for I/O.
The query can also be slow simply because of the comparison cost of the join and WHERE clauses between table_a and table_b.
You can reduce that cost by adding indexes on the columns "table_b"."id", "table_a"."table_b_id", "table_a"."deleted_at", "table_b"."company_id" and "table_a"."company_id".
A materialized view can also reduce the cost, since it returns precomputed results (a plain view caches nothing).
One last thing: you can reduce the cost by using temporary tables as well. I have given an example below.
QUERIES:
CREATE TEMPORARY TABLE table_a_temp as
SELECT "table_a"."table_b_id" FROM "table_a"
WHERE "table_a"."deleted_at" IS NULL AND "table_a"."company_id" = ? ;
CREATE TEMPORARY TABLE table_b_temp as
SELECT "table_b"."id" FROM "table_b"
WHERE "table_b"."deleted_at" IS NULL AND "table_b"."company_id" = ?;
SELECT COUNT(*) FROM "table_a_temp" INNER JOIN "table_b_temp"
ON "table_b_temp"."id" = "table_a_temp"."table_b_id" ;
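The indexing suggestion above could be sketched like this (column names from the question; the partial WHERE deleted_at IS NULL clauses are an assumption that fits the soft-delete pattern of the query, not something from the original post):

```sql
-- covering the company filter plus the join key, for live rows only
CREATE INDEX idx_table_a_company_live
    ON table_a (company_id, table_b_id)
    WHERE deleted_at IS NULL;

CREATE INDEX idx_table_b_company_live
    ON table_b (company_id, id)
    WHERE deleted_at IS NULL;
```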

Is there a way to search partial match in multivalue field in PostgreSQL?

I have a table quite like this:
CREATE TABLE myTable (
family text,
names text[]
)
I can search like this:
SELECT family
FROM myTable where names #> array['B0WP04'];
But I would like to do:
SELECT family
FROM myTable where names #> array['%P0%'];
Is this possible ?
In PostgreSQL 9.3+ you can:
select family
from myTable
join lateral unnest(mytable.names) as un(name) on true
where un.name like '%P0%';
But keep in mind that it can produce duplicates, so perhaps you'd like to add DISTINCT.
For earlier versions:
select family
from myTable where
exists (select 1 from unnest(names) as un(name) where un.name like '%P0%');
Adding a bit to Radek's answer, I tried
select family
from myTable where
exists (select 1 from unnest(names) as name where name like '%P0%');
and it also works. I searched the PostgreSQL docs for the un() function but couldn't find anything.
I'm not saying it doesn't do anything; I'm just curious what the un() function is supposed to do (and happy to have my problem solved).
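Assuming the pg_trgm extension is available, another option worth sketching is a trigram index over a flattened copy of the array. Note that array_to_string is only STABLE, so it needs a hypothetical IMMUTABLE wrapper (the name below is my own) before it can appear in an index expression:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- hypothetical IMMUTABLE wrapper (array_to_string itself is STABLE)
CREATE OR REPLACE FUNCTION immutable_array_to_string(text[], text)
RETURNS text AS 'SELECT array_to_string($1, $2)' LANGUAGE sql IMMUTABLE;

CREATE INDEX mytable_names_trgm_idx
    ON myTable
    USING gin (immutable_array_to_string(names, ' ') gin_trgm_ops);

-- the query must repeat the exact index expression
SELECT family
FROM myTable
WHERE immutable_array_to_string(names, ' ') LIKE '%P0%';
```

One caveat: because the elements are joined with spaces, a pattern containing a space could match across element boundaries, so results may need a recheck against the original array.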
You can use the parray_gin extension https://github.com/theirix/parray_gin
This extension is said to work only up to 9.2, but I just installed and tested it on 9.3 and it works well.
Here is how to install it on Ubuntu-like systems :)
# install postgresql extension network client and postgresql extension build tools
sudo apt-get install python-setuptools
easy_install pgxnclient
sudo apt-get install postgresql-server-dev-9.3
# get the extension
pgxn install parray_gin
And here is my test
-- as a superuser: add the extension to the current database
CREATE EXTENSION parray_gin;
-- as a normal user
CREATE TABLE test (
id SERIAL PRIMARY KEY,
names TEXT []
);
INSERT INTO test (names) VALUES
(ARRAY ['nam1', 'nam2']),
(ARRAY ['2nam1', '2nam2']),
(ARRAY ['Hello', 'Woooorld']),
(ARRAY ['Woooorld', 'Hello']),
(ARRAY [] :: TEXT []),
(NULL),
(ARRAY ['Hello', 'is', 'it', 'me', 'you''re', 'looking', 'for', '?']);
-- double up the rows in the test table (repeat until the table is large; with many rows, the index is used)
INSERT INTO test (names) (SELECT names FROM test);
SELECT count(*) from test; /*
count
--------
997376
(1 row)
*/
Now that we have some test data, it's magic time:
-- http://pgxn.org/dist/parray_gin/doc/parray_gin.html
CREATE INDEX names_idx ON test USING GIN (names parray_gin_ops);
--- now it's time for some tests
EXPLAIN ANALYZE SELECT * FROM test WHERE names #> ARRAY ['is']; /*
-- WITHOUT INDEX ON NAMES
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..25667.00 rows=1138 width=49) (actual time=0.021..508.599 rows=51200 loops=1)
Filter: (names #> '{is}'::text[])
Rows Removed by Filter: 946176
Total runtime: 653.879 ms
(4 rows)
-- WITH INDEX ON NAMES
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=455.73..3463.37 rows=997 width=49) (actual time=14.327..240.365 rows=51200 loops=1)
Recheck Cond: (names #> '{is}'::text[])
-> Bitmap Index Scan on names_idx (cost=0.00..455.48 rows=997 width=0) (actual time=12.241..12.241 rows=51200 loops=1)
Index Cond: (names #> '{is}'::text[])
Total runtime: 341.750 ms
(5 rows)
*/
EXPLAIN ANALYZE SELECT * FROM test WHERE names ##> ARRAY ['%nam%']; /*
-- WITHOUT INDEX ON NAMES
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..23914.20 rows=997 width=49) (actual time=0.023..590.093 rows=102400 loops=1)
Filter: (names ##> '{%nam%}'::text[])
Rows Removed by Filter: 894976
Total runtime: 796.636 ms
(4 rows)
-- WITH INDEX ON NAMES
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=159.73..3167.37 rows=997 width=49) (actual time=20.164..293.942 rows=102400 loops=1)
Recheck Cond: (names ##> '{%nam%}'::text[])
-> Bitmap Index Scan on names_idx (cost=0.00..159.48 rows=997 width=0) (actual time=18.539..18.539 rows=102400 loops=1)
Index Cond: (names ##> '{%nam%}'::text[])
Total runtime: 490.060 ms
(5 rows)
*/
The final performance depends entirely on your data and queries, but in my dummy example this extension is quite effective, cutting query time roughly in half.
