Indexing only an attribute on a json array - Postgres - arrays

I have a table with a jsonb field named "data" with the following content:
{
customerId: 1,
something: "...",
list: [{ nestedId: 1, attribute: "a" }, { nestedId: 2, attribute: "b" }]
}
I need to retrieve the whole row based on its 'nestedId' attribute. Note that the field is inside an array.
After checking the query plans I found out I could benefit from an index. So I added:
CREATE INDEX i1 ON mytable using gin ((data->'list') jsonb_path_ops);
From what I understood from the docs, this creates index items for the values in the "list". This solves my problem.
For the sake of completeness, here is the query I use to retrieve my data:
SELECT data FROM mytable where data->'list' @> '[{"nestedId": 1}]'
Still, I wonder whether a more optimal index is possible. Is it possible to create an index only for the "nestedId" field, for example?

You can index only the numeric values and not also the keys, using functional indexes. You probably need to make a helper function to do so.
create function jsonb_objarray_to_intarray(jsonb,text) returns int[] immutable language sql as
$$ select array_agg((x->>$2)::int) from jsonb_array_elements($1) f(x) $$;
create index on mytable using gin (jsonb_objarray_to_intarray(data->'list','nestedId'));
SELECT data FROM mytable where jsonb_objarray_to_intarray(data->'list','nestedId') @> ARRAY[3];
I wrote it this way so the function could be reused in other similar situations. If you don't care about it being re-used, you can make the code that uses it look prettier by hard coding the dereference and the key value into the function:
create function mytable_to_intarray(jsonb) returns int[] immutable language sql as
$$ select array_agg((x->>'nestedId')::int) from jsonb_array_elements($1->'list') f(x) $$;
create index on mytable using gin (mytable_to_intarray(data));
SELECT data FROM mytable where mytable_to_intarray(data) @> ARRAY[3];
Now those indexes do take longer to make than your original, but they are about half the size and are at least as fast to query. More importantly, the planner has better statistics about the selectivity, and so in more complicated queries is likely to come up with better query plans.
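One way to check the size claim on your own data is to list the on-disk size of every index on the table. This is only a minimal sketch: it assumes the table is named mytable as above and that mytable_to_intarray has been created, and the sample document in the second statement is made up for illustration.

```sql
-- List each index on mytable with its on-disk size, largest first.
SELECT indexrelid::regclass                         AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_index
WHERE  indrelid = 'mytable'::regclass
ORDER  BY pg_relation_size(indexrelid) DESC;

-- Sanity check of the helper on a literal value (hypothetical sample document).
-- The result should be an int[] holding the nestedId values 1 and 2.
SELECT mytable_to_intarray(
  '{"list": [{"nestedId": 1, "attribute": "a"}, {"nestedId": 2, "attribute": "b"}]}'::jsonb
);
```

Running EXPLAIN on the queries above should also confirm whether the planner actually picks the new expression index.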

Related

Postgres: Searching over jsonb array field is slow

We are changing the DB structure (PostgreSQL 10.11) for one of our projects. One of the changes is moving a field of type uuid[] (called “areasoflawid”) into the jsonb field (called “data”).
So, we have a table which looks like this:
CREATE TABLE public.documents
(
id serial,
areasoflawid uuid[], --the field to be moved into the ‘data’
data jsonb,
….
)
We are not changing the values of the array or its structure
(i.e. documents.data->'metadata'->'areaoflawids' contains the same items as documents.areasoflawid).
After data migration, the JSON stored in the “data” field has following structure:
{
...
"metadata": {
...
"areaoflawids": [
"e34e0ee5-78e0-4d92-9186-ac69c109408b",
"b3af9163-d910-4d19-8f40-0602b75c25b0",
"50dc7fd8-ebdf-4cd2-bcab-b8d755fe96e8",
"8955c062-363f-4a1a-ac3c-d1c2ffe96c9b",
"bdb79f9f-4539-45f5-ac82-92baaf915f6c"
],
....
},
...
}
So, after migrating the data we started benchmarking queries on the jsonb field and found that searching over the array field documents.data->'metadata'->'areaoflawids' takes MUCH longer than searching over the uuid[] field documents.areasoflawid.
Here are the queries:
--search over jsonb array field, takes 6.2 sec, returns 13615 rows
SELECT id FROM documents WHERE data->'metadata'->'areaoflawids' @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"'
--search over uuid[] field, takes 600ms, returns 13615 rows
SELECT id FROM documents WHERE areasoflawid @> ARRAY['e34e0ee5-78e0-4d92-9186-ac69c109408b']::uuid[]
Here is the index over jsonb field:
CREATE INDEX test_documents_aols_gin_idx
ON public.documents
USING gin
(((data -> 'metadata'::text) -> 'areaoflawids'::text) jsonb_path_ops);
And here is the execution plan:
EXPLAIN ANALYZE SELECT id FROM documents WHERE data->'metadata'->'areaoflawids' @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"'
"Bitmap Heap Scan on documents (cost=6.31..390.78 rows=201 width=4) (actual time=2.297..5859.886 rows=13614 loops=1)"
" Recheck Cond: (((data -> 'metadata'::text) -> 'areaoflawids'::text) @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"'::jsonb)"
" Heap Blocks: exact=4859"
" -> Bitmap Index Scan on test_documents_aols_gin_idx (cost=0.00..6.30 rows=201 width=0) (actual time=1.608..1.608 rows=13614 loops=1)"
" Index Cond: (((data -> 'metadata'::text) -> 'areaoflawids'::text) @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"'::jsonb)"
"Planning time: 0.133 ms"
"Execution time: 5862.807 ms"
Other queries over the jsonb field run at acceptable speed, but this particular search is about 10 times slower than the search over the separate field. We expected it to be a bit slower, but not this much. We are considering leaving “areasoflawid” as a separate field, but we would definitely prefer to move it inside the JSON. I’ve been experimenting with different indexes and operators (including ? and ?|), but the search is still slow. Any help is appreciated!
Finding the 13,614 candidate matches in the index is very fast (1.608 milliseconds). The slow part is reading all of those rows from the table itself. If you turn on track_io_timing, then do EXPLAIN (ANALYZE, BUFFERS), I'm sure you will find you are waiting on IO. If you run the query several times in a row, does it get faster?
I think you are doing an unequal benchmark here, where one table is already in cache and the alternative table is not. But it could also be that the new table is too large to actually fit in cache.
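The check suggested above could look roughly like this. It is a sketch only: changing track_io_timing normally requires superuser rights (or an explicitly granted privilege on newer versions), and the query is simply the one from the question.

```sql
-- Enable I/O timing for this session (usually needs superuser rights).
SET track_io_timing = on;

-- BUFFERS adds per-node buffer counts (shared hit vs. read);
-- with track_io_timing on, the plan also reports I/O timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM   documents
WHERE  data->'metadata'->'areaoflawids' @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"';
```

A large number of buffers read (rather than hit) on the Bitmap Heap Scan, together with a high I/O read time, would point to disk waits. Running the statement a second time shows whether a warm cache changes the picture.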
Thank you for your response! We came up with another solution taken from this post: https://www.postgresql.org/message-id/CAONrwUFOtnR909gs+7UOdQQB12+pXsGUYu5YHPtbQk5vaE9Gaw@mail.gmail.com . The query now takes about 600-800 ms to execute.
So, here is the solution:
CREATE OR REPLACE FUNCTION aol_uuids(data jsonb) RETURNS TEXT[] AS
$$
SELECT
array_agg(value::TEXT) as val
FROM
jsonb_array_elements(case jsonb_typeof(data) when 'array' then data else '[]' end)
$$ LANGUAGE SQL IMMUTABLE;
SELECT id FROM documents WHERE aol_uuids(data->'metadata'->'areaoflawids') @> ARRAY['"e34e0ee5-78e0-4d92-9186-ac69c109408b"']
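The post does not show the index that goes with this function. Presumably the query is backed by a GIN expression index along these lines; the definition and the index name below are assumptions, not taken from the post.

```sql
-- Assumed index definition (not shown in the post); the name is hypothetical.
-- The indexed expression must match the expression used in the WHERE clause.
CREATE INDEX documents_aol_uuids_gin_idx
ON public.documents
USING gin (aol_uuids(data->'metadata'->'areaoflawids'));
```

Note that value::TEXT on a jsonb string keeps the JSON double quotes, which is why the search value in the query is written as '"e34e0ee5-..."' with the quotes inside the literal.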

Single-column or multi-column index(-es)?

I have a table with 3 columns: "Id", "A", "B".
All of them are searchable. Id is an identity column and is only used to look up exact rows, so that one is clear. But I have doubts about "A" and "B". My application has three search cases: search by "A", search by "B", and search by "A" and "B" simultaneously. So I'm not sure which index type to choose. Should I use two single-column indexes or one multi-column index? Or maybe it's better to combine single-column indexes with a multi-column one (3 indexes in total)? I don't really care about INSERT/UPDATE/DELETE duration; my priority is to make SELECT as fast as possible.
I use SQL Server 2017.
Thank you.
I think two additional indexes will be enough:
CREATE INDEX IDX_YourTable_AB ON YourTable(A,B) -- put the column with more distinct values first
CREATE INDEX IDX_YourTable_B ON YourTable(B)INCLUDE(A)
If you have other columns in this table you can create included indexes:
CREATE INDEX IDX_YourTable_AB ON YourTable(A,B) INCLUDE(C,D,E,...)
CREATE INDEX IDX_YourTable_B ON YourTable(B) INCLUDE(A,C,D,E,...)
Index IDX_YourTable_AB might be used for conditions such as WHERE A='...' or WHERE A='...' AND B='...' or WHERE A LIKE '...%' AND B='...', that is, conditions on the A column alone or on both A and B.
Index IDX_YourTable_B might be used for conditions on the B column only (WHERE B='...' or WHERE B LIKE '...%').
Also try CREATE INDEX IDX_YourTable_BA ON YourTable(B,A) instead of CREATE INDEX IDX_YourTable_B ON YourTable(B)INCLUDE(A). It may perform better.
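To compare the variants, the three search cases can be run with I/O statistics switched on. This is a rough sketch under assumptions: the int column types and the parameter values are made up, and which index the optimizer actually picks depends on the data.

```sql
-- T-SQL sketch; YourTable and the index names come from the answer above.
SET STATISTICS IO ON;

DECLARE @a int = 1, @b int = 2;  -- column types and values are assumptions

SELECT Id, A, B FROM YourTable WHERE A = @a;             -- search by A
SELECT Id, A, B FROM YourTable WHERE B = @b;             -- search by B
SELECT Id, A, B FROM YourTable WHERE A = @a AND B = @b;  -- search by A and B
```

Comparing the logical reads reported for each statement, together with the actual execution plans, shows whether each search is served by an index seek and which index variant is cheaper.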

LIKE query on elements of flat jsonb array

I have a Postgres table posts with a column of type jsonb which is basically a flat array of tags.
What I need to do is somehow run a LIKE query on the elements of that tags column, so that I can find posts which have tags beginning with some partial string.
Is such thing possible in Postgres? I'm constantly finding super complex examples and no one is ever describing such basic and simple scenario.
My current code works fine for checking if there are posts having specific tags:
select * from posts where tags @> '"TAG"'
and I'm looking for a way of running something along the lines of
select * from posts where tags @> '"%TAG%"'
SELECT *
FROM posts p
WHERE EXISTS (
SELECT FROM jsonb_array_elements_text(p.tags) tag
WHERE tag LIKE '%TAG%'
);
Related, with explanation:
Search a JSON array for an object containing a value matching a pattern
Or simpler with the @? operator since Postgres 12 implemented SQL/JSON:
SELECT *
-- optional to show the matching item:
-- , jsonb_path_query_first(tags, '$[*] ? (@ like_regex "^ tag" flag "i")')
FROM posts
WHERE tags @? '$[*] ? (@ like_regex "TAG")';
The operator @? is just a wrapper around the function jsonb_path_exists(). So this is equivalent:
...
WHERE jsonb_path_exists(tags, '$[*] ? (@ like_regex "TAG")');
Neither has index support. (May be added for the @? operator later, but not there in pg 13, yet). So those queries are slow for big tables. A normalized design, like Laurenz already suggested would be superior - with a trigram index:
PostgreSQL LIKE query performance variations
For just prefix matching (LIKE 'TAG%', no leading wildcard), you could make it work with a full text index:
CREATE INDEX posts_tags_fts_gin_idx ON posts USING GIN (to_tsvector('simple', tags));
And a matching query:
SELECT *
FROM posts p
WHERE to_tsvector('simple', tags) @@ 'TAG:*'::tsquery
Or use the english dictionary instead of simple (or whatever fits your case) if you want stemming for natural English language.
to_tsvector(json(b)) requires Postgres 10 or later.
Related:
Get partial match from GIN indexed TSVECTOR column
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
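The normalized design with a trigram index mentioned above could be sketched as follows. The post_tags table, its columns, and the assumption that posts has an id column are all hypothetical; only the pg_trgm extension and its gin_trgm_ops operator class are standard.

```sql
-- pg_trgm ships with PostgreSQL as an additional module.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical normalized layout: one row per (post, tag).
CREATE TABLE post_tags (
  post_id int  NOT NULL,   -- assumed to reference posts.id
  tag     text NOT NULL,
  PRIMARY KEY (post_id, tag)
);

-- Trigram GIN index; supports LIKE patterns with a leading wildcard.
CREATE INDEX post_tags_tag_trgm_idx ON post_tags USING gin (tag gin_trgm_ops);

SELECT p.*
FROM   posts p
WHERE  EXISTS (
   SELECT FROM post_tags t
   WHERE  t.post_id = p.id
   AND    t.tag LIKE '%TAG%'
   );
```

Trigram indexes work best when the search pattern contains at least three consecutive characters; very short patterns tend to fall back to scanning large parts of the index or table.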

PostgreSQL count results within jsonb array across multiple rows

As stated in the title, I am in a situation where I need to return a count of occurrences within an array, that is within a jsonb column. A pseudo example is as follows:
CREATE TABLE users (id int primary key, j jsonb);
INSERT INTO users (id, j) VALUES
(1, '{"Friends": ["foo", "bar", "baz"]}'),
(2, '{"Friends": ["bar", "bar"]}');
Please note that the array under "Friends" can contain the same value more than once. This will be relevant later (in this case the second row contains the name "bar" twice in the jsonb column under the key "Friends").
Question:
For the example above, if I were to search for the value "bar" (given a query that I need help to solve), I want the number of times "bar" appears in the j (jsonb) column within the key "Friends"; in this case the end result I would be looking for is the integer 3. As the term "bar" appears 3 times across 2 rows.
Where I'm at:
Currently I have sql written, that returns a text array containing all of the friends values (from the multiple selected rows) in a single, 1 dimensional array. That sql is as follows
SELECT jsonb_array_elements_text(j->'Friends') FROM users;
The result is the following:
jsonb_array_elements_text
-------------------------
foo
bar
baz
bar
bar
Given that this is an array, is it possible to filter this by the term "bar" in some fashion in order to get the count of the number of times it appears? Or am I way off in my approach?
Other Details:
Version: psql (PostgreSQL) 9.5.2
The table in question has a GIN index on it.
Please let me know if any additional information is needed, thanks in advance.
You need to use the result of the function as a proper table, then you can easily count the number of times the value appears.
select count(x.val)
from users
cross join lateral jsonb_array_elements_text(j->'Friends') as x(val)
where x.val = 'bar'
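The same lateral join can also count every value at once. The sketch below is self-contained: it inlines the two sample rows from the question in a CTE named users (which shadows the real table for this statement only) and assumes the jsonb column is called j, as in the question's own query.

```sql
-- Sample rows from the question, inlined so the statement runs on its own.
WITH users(id, j) AS (
   VALUES (1, '{"Friends": ["foo", "bar", "baz"]}'::jsonb)
        , (2, '{"Friends": ["bar", "bar"]}'::jsonb)
)
SELECT x.val, count(*) AS occurrences
FROM   users
CROSS  JOIN LATERAL jsonb_array_elements_text(j->'Friends') AS x(val)
GROUP  BY x.val
ORDER  BY x.val;
-- bar | 3
-- baz | 1
-- foo | 1
```

Dropping the CTE and keeping only the SELECT runs the same count against the real table.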

Optimize "= any" operator using index [duplicate]

I can't find a definite answer to this question in the documentation. If a column is an array type, will all the entered values be individually indexed?
I created a simple table with one int[] column, and put a unique index on it. I noticed that I couldn't add the same array of ints, which leads me to believe the index is a composite of the array items, not an index of each item.
INSERT INTO "Test"."Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test"."Test" VALUES ('{10, 20, 30}');
SELECT * FROM "Test"."Test" WHERE 20 = ANY ("Column1");
Is the index helping this query?
Yes you can index an array, but you have to use the array operators and the GIN-index type.
Example:
CREATE TABLE "Test"("Column1" int[]);
INSERT INTO "Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test" VALUES ('{10, 20, 30}');
CREATE INDEX idx_test on "Test" USING GIN ("Column1");
-- To enforce index usage because we have only 2 records for this test...
SET enable_seqscan TO off;
EXPLAIN ANALYZE
SELECT * FROM "Test" WHERE "Column1" @> ARRAY[20];
Result:
Bitmap Heap Scan on "Test" (cost=4.26..8.27 rows=1 width=32) (actual time=0.014..0.015 rows=2 loops=1)
Recheck Cond: ("Column1" @> '{20}'::integer[])
-> Bitmap Index Scan on idx_test (cost=0.00..4.26 rows=1 width=0) (actual time=0.009..0.009 rows=2 loops=1)
Index Cond: ("Column1" @> '{20}'::integer[])
Total runtime: 0.062 ms
Note
It appears that in many cases the gin__int_ops operator class is required:
create index <index_name> on <table_name> using GIN (<column> gin__int_ops)
I have not yet seen a case where it would work with the && and @> operators without the gin__int_ops operator class.
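Filled in for the "Test" table from the example, the template above might look like this. It is a sketch under assumptions: gin__int_ops comes from the additional module intarray, which has to be installed first, and the index name is made up.

```sql
-- gin__int_ops is provided by the intarray module.
CREATE EXTENSION IF NOT EXISTS intarray;

-- Hypothetical index name; same table and column as in the example above.
CREATE INDEX idx_test_intarray ON "Test" USING gin ("Column1" gin__int_ops);

-- Check which index (if any) the planner chooses.
EXPLAIN
SELECT * FROM "Test" WHERE "Column1" @> ARRAY[20];
```

One possible explanation for the observation in the note: once intarray is installed, @> and && on int[] may resolve to the module's own operators, which the default GIN operator class for arrays does not cover.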
@Tregoreg raised a question in the comment to his offered bounty:
I didn't find the current answers working. Using GIN index on
array-typed column does not increase the performance of ANY()
operator. Is there really no solution?
@Frank's accepted answer tells you to use array operators, which is still correct for Postgres 11. The manual:
... the standard distribution of PostgreSQL includes a GIN operator
class for arrays, which supports indexed queries using these
operators:
<@
@>
=
&&
The complete list of built-in operator classes for GIN indexes in the standard distribution is here.
In Postgres indexes are bound to operators (which are implemented for certain types), not data types alone or functions or anything else. That's a heritage from the original Berkeley design of Postgres and very hard to change now. And it's generally working just fine. Here is a thread on pgsql-bugs with Tom Lane commenting on this.
Some PostGIS functions (like ST_DWithin()) seem to violate this principle, but that is not so. Those functions are rewritten internally to use respective operators.
The indexed expression must be to the left of the operator. For most operators (including all of the above) the query planner can achieve this by flipping operands if you place the indexed expression to the right - given that a COMMUTATOR has been defined. The ANY construct can be used in combination with various operators and is not an operator itself. When used as constant = ANY (array_expression) only indexes supporting the = operator on array elements would qualify and we would need a commutator for = ANY(). GIN indexes are out.
Postgres is not currently smart enough to derive a GIN-indexable expression from it. For starters, constant = ANY (array_expression) is not completely equivalent to array_expression @> ARRAY[constant]. Array operators return an error if any NULL elements are involved, while the ANY construct can deal with NULL on either side. And there are different results for data type mismatches.
Related answers:
Check if value exists in Postgres array
Index for finding an element in a JSON array
SQLAlchemy: how to filter on PgArray column types?
Can IS DISTINCT FROM be combined with ANY or ALL somehow?
Asides
While working with integer arrays (int4, not int2 or int8) without NULL values (like your example implies) consider the additional module intarray, that provides specialized, faster operators and index support. See:
How to create an index for elements of an array in PostgreSQL?
Compare arrays for equality, ignoring order of elements
As for the UNIQUE constraint in your question that went unanswered: That's implemented with a btree index on the whole array value (like you suspected) and does not help with the search for elements at all. Details:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
It's now possible to index the individual array elements. For example:
CREATE TABLE test (foo int[]);
INSERT INTO test VALUES ('{1,2,3}');
INSERT INTO test VALUES ('{4,5,6}');
CREATE INDEX test_index on test ((foo[1]));
SET enable_seqscan TO off;
EXPLAIN ANALYZE SELECT * from test WHERE foo[1]=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using test_index on test (cost=0.00..8.27 rows=1 width=32) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: (foo[1] = 1)
Total runtime: 0.112 ms
(3 rows)
This works on at least Postgres 9.2.1. Note that you need to build a separate index for each array position; in my example I only indexed the first element.
