PostgreSQL Reversed ILIKE on column name - database

I am trying to determine whether a username supplied by the frontend is a valid name.
I have a table that contains a lot of names.
For example, if I get Adam18, I need to give an answer in real time (< 500 ms).
My query:
SELECT * FROM names WHERE 'Adam18' ILIKE '%' || name || '%'
The query is correct, but it uses a sequential scan.
Explain result:
Seq Scan on names (cost=0.00..2341.20 rows=527 width=516) (actual time=1.452..24.774 rows=12 loops=1)
Filter: ('Adam18'::text ~~ (('%'::text || (name)::text) || '%'::text))
Rows Removed by Filter: 105314
Buffers: shared hit=498
Planning time: 0.062 ms
Execution time: 24.796 ms
Is there a way to create an index for this case?
My current index:
CREATE INDEX names_gin_idx ON names USING gin (name gin_trgm_ops)
The query cannot use it. Can you help me?

Related

Why in my case ST_DWithin is not using index

I'm using the PostGIS extension for Postgres and trying to optimize my query for finding points within a circle.
Suppose I have this table with an index:
create table position
(
id bigserial not null primary key,
date timestamp with time zone,
point GEOMETRY(Point, 4326),
alias varchar(50)
);
create index position_point_idx on position using gist (point);
Now when I use a query with a polygon, everything works as expected. In the explain plan I can see that the query uses the index.
SELECT distinct alias
FROM position
WHERE date > '2021-11-28T19:26:18.574Z'
AND date < '2021-11-28T20:26:18.574Z'
AND ST_contains(ST_GeomFromText(
'POLYGON ((13.970947489142418 49.59174558308953, 13.970947489142418 50.12515341892287, 15.208740681409838 50.12515341892287, 15.208740681409838 49.59174558308953, 13.970947489142418 49.59174558308953))',
4326), point);
-> Bitmap Index Scan on position_point_idx (cost=0.00..183.82 rows=5254 width=0) (actual time=5.981..5.981 rows=94462 loops=1)
OK, now I want to search for aliases within a circle, but for some reason it takes seconds and does not use the index at all.
SELECT distinct alias
FROM position
WHERE
date > '2021-11-28T19:26:18.574Z' AND date < '2021-11-28T20:26:18.574Z'
AND
ST_DWithin (point,ST_GeomFromText('POINT (14.32983409613371 49.91815471231952)',4326),62815.14152820495);
ST_DWithin is in the list here, so it should use the index, but the index is ignored.
What am I doing wrong here? Thanks for any hint.
Here is my query plan:
HashAggregate (cost=687537.59..687538.59 rows=100 width=9) (actual time=2874.991..2875.003 rows=100 loops=1)
Output: alias
Group Key: "position".alias
-> Gather (cost=1000.00..686702.70 rows=333955 width=9) (actual time=0.254..2041.354 rows=5008801 loops=1)
Output: alias
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on public."position" (cost=0.00..652307.20 rows=139148 width=9) (actual time=0.021..2117.644 rows=1669600 loops=3)
Output: alias
Filter: (("position".date > '2021-11-28 19:26:18.574+00'::timestamp with time zone) AND ("position".date < '2021-11-28 20:26:18.574+00'::timestamp with time zone) AND ("position".point && '0103000020E6100000010000000500000077EC65F919AAEEC0B42AE025A7A5EEC077EC65F919AAEEC03A26ECE821B2EE4077646615AFADEE403A26ECE821B2EE4077646615AFADEE40B42AE025A7A5EEC077EC65F919AAEEC0B42AE025A7A5EEC0'::geometry) AND ('0101000020E61000000100C003E0A82C40520AF71786F54840'::geometry && st_expand("position".point, '62815.1415282049493'::double precision)) AND _st_dwithin("position".point, '0101000020E61000000100C003E0A82C40520AF71786F54840'::geometry, '62815.1415282049493'::double precision))
Rows Removed by Filter: 86028
Worker 0: actual time=0.023..2492.854 rows=1922778 loops=1
Worker 1: actual time=0.025..2493.448 rows=2024544 loops=1
Planning Time: 0.211 ms
Execution Time: 2876.783 ms
PostgreSQL chooses a sequential scan because it thinks that that is the most efficient access strategy, and I would say it is right. After all, the WHERE condition removed only 250000 out of approximately 5 million rows.
I think you wanted to use geography, not geometry. In geometry 4326, the entire earth (and the rest of the universe, I suppose) is well within 62815.14152820495 degrees of every other point, so the index would be profoundly ineffective.
If you were using geography, that would be 39 miles, for which the index would be useful, and in my hands it would be used.
The stats on your table also seem to be way off.
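The geography suggestion could look roughly like the sketch below. It assumes the radius 62815.14152820495 is meant to be in meters; the index name position_point_geog_idx is made up for illustration. The index is built on the same point::geography expression that the query uses, so that the planner can match it.

```sql
-- Assumption: the search radius is in meters, which is how ST_DWithin
-- interprets distances when both arguments are geography.
-- Hypothetical index name; expression index on the geography cast.
CREATE INDEX position_point_geog_idx
    ON "position" USING gist ((point::geography));

SELECT DISTINCT alias
FROM "position"
WHERE date > '2021-11-28T19:26:18.574Z'
  AND date < '2021-11-28T20:26:18.574Z'
  AND ST_DWithin(
        point::geography,
        ST_SetSRID(ST_MakePoint(14.32983409613371, 49.91815471231952), 4326)::geography,
        62815.14152820495);

-- Refresh planner statistics, since the row estimates in the plan look stale.
ANALYZE "position";
```

Whether the planner then picks the index still depends on how selective the circle and the date range are, so it is worth re-running EXPLAIN (ANALYZE, BUFFERS) afterwards.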

Indexing a jsonb array in Postgres [duplicate]

I am trying to set up a GIN index, but I do not think the index is used when I run the query, whether I use an operator or a function.
Environment
In our table we have a JSONB field (json_aip) containing JSON that looks like this:
{
  "properties": {
    "pdi": {
      "contextInformation": {
        "tags": ["SOME_TAG"]
      }
    }
  }
}
Table creation :
create table t_aip (
json_aip jsonb,
[...]
);
CREATE INDEX idx_aip_tags
ON t_aip
USING gin ((json_aip -> 'properties' -> 'pdi' -> 'contextInformation' -> 'tags'));
Operator query
We can't use the ?| operator directly because we use JDBC, where the question mark is a parameter placeholder. Still, the index should reportedly show up in the plan for this type of query.
EXPLAIN ANALYZE SELECT count(*)
FROM storage.t_aip
WHERE json_aip#>'{properties,pdi,contextInformation,tags}' ?| array['SOME_TAG']
Result:
Aggregate (cost=27052.16..27052.17 rows=1 width=8) (actual time=488.085..488.087 rows=1 loops=1)
-> Seq Scan on t_aip (cost=0.00..27052.06 rows=42 width=0) (actual time=0.134..456.978 rows=16502 loops=1)
Filter: ((json_aip #> '{properties,pdi,contextInformation,tags}'::text[]) ?| '{SOME_TAG}'::text[])
Rows Removed by Filter: 17511
Planning time: 23.202 ms
Execution time: 488.449 ms
Functional query
EXPLAIN ANALYZE SELECT count(*)
FROM storage.t_aip
WHERE jsonb_exists_any(
json_aip#>'{properties,pdi,contextInformation,tags}',
array['SOME_TAG']
)
Result:
QUERY PLAN
Aggregate (cost=27087.00..27087.01 rows=1 width=8) (actual time=369.931..369.933 rows=1 loops=1)
-> Seq Scan on t_aip (cost=0.00..27052.06 rows=13979 width=0) (actual time=0.173..350.437 rows=16502 loops=1)
Filter: jsonb_exists_any((json_aip #> '{properties,pdi,contextInformation,tags}'::text[]), '{SOME_TAG}'::text[])
Rows Removed by Filter: 17511
Planning time: 56.021 ms
Execution time: 370.252 ms
The index does not appear in either plan. Any help would be much appreciated!
I suspect my index is wrong because it treats the value at the end of the path json_aip -> 'properties' -> 'pdi' -> 'contextInformation' -> 'tags' as a string, whereas it is actually an array. That is only my guess.
There is a general rule that you have to use exactly the same expression both in an index and a query to use the index. With this index:
CREATE INDEX idx_aip_tags
ON t_aip
USING gin ((json_aip#>'{properties,pdi,contextInformation,tags}'));
the query will use the index
EXPLAIN ANALYZE
SELECT count(*)
FROM t_aip
WHERE json_aip#>'{properties,pdi,contextInformation,tags}' ?| array['SOME_TAG']
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=149.97..149.98 rows=1 width=0) (actual time=27.783..27.783 rows=1 loops=1)
-> Bitmap Heap Scan on t_aip (cost=20.31..149.87 rows=40 width=0) (actual time=1.504..25.726 rows=20000 loops=1)
Recheck Cond: ((json_aip #> '{properties,pdi,contextInformation,tags}'::text[]) ?| '{SOME_TAG}'::text[])
Heap Blocks: exact=345
-> Bitmap Index Scan on idx_aip_tags (cost=0.00..20.30 rows=40 width=0) (actual time=1.455..1.455 rows=20000 loops=1)
Index Cond: ((json_aip #> '{properties,pdi,contextInformation,tags}'::text[]) ?| '{SOME_TAG}'::text[])
Note that the GIN index also supports the @> operator:
SELECT count(*)
FROM t_aip
WHERE json_aip#>'{properties,pdi,contextInformation,tags}' @> '["SOME_TAG"]'
but be careful when searching for more than one tag:
SELECT count(*)
FROM t_aip
-- this gives objects containing both tags:
-- WHERE json_aip#>'{properties,pdi,contextInformation,tags}' @> '["SOME_TAG", "ANOTHER_TAG"]'
-- and this gives objects with any of the two tags:
WHERE json_aip#>'{properties,pdi,contextInformation,tags}' @> ANY(ARRAY['["SOME_TAG"]', '["ANOTHER_TAG"]']::jsonb[])
EDITED:
I thought the opposite, but in fact there is a difference between operators (?|) and functions (jsonb_exists_any) with regard to index usage: the index is never used when the query calls the jsonb functions directly.
You can read more about it here: https://dba.stackexchange.com/a/91007
That was the other question in this topic.
EDIT2:
You can create wrapper functions that can use the index and be called as functions in your code, like this:
-- Define functions that call the native Postgres operators, to work around the JDBC issue with the question mark
CREATE OR REPLACE FUNCTION rs_jsonb_exists_all(jsonb, text[])
RETURNS bool AS
'SELECT $1 ?& $2' LANGUAGE sql IMMUTABLE;
CREATE OR REPLACE FUNCTION rs_jsonb_exists(jsonb, text)
RETURNS bool AS
'SELECT $1 ? $2' LANGUAGE sql IMMUTABLE;
CREATE OR REPLACE FUNCTION rs_jsonb_exists_any(jsonb, text[])
RETURNS bool AS
'SELECT $1 ?| $2' LANGUAGE sql IMMUTABLE;
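A possible way to check the wrappers is sketched below. It assumes the expression index on json_aip#>'{properties,pdi,contextInformation,tags}' from the answer above exists. The statement contains no question mark, so it can be sent through JDBC unchanged. Simple SQL-language functions like these can be inlined by the planner, which is what would let the index be considered, but the plan should be inspected rather than assumed.

```sql
-- Assumes idx_aip_tags is defined on
-- (json_aip#>'{properties,pdi,contextInformation,tags}').
EXPLAIN ANALYZE
SELECT count(*)
FROM t_aip
WHERE rs_jsonb_exists_any(
        json_aip#>'{properties,pdi,contextInformation,tags}',
        array['SOME_TAG']);
```

If the plan shows a Bitmap Index Scan on idx_aip_tags, the wrapper was inlined into the ?| operator; a Seq Scan with the function name in the Filter line means it was not.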

Why select result takes long time in partitioned table in postgreSql?

I have a daily partitioned table in PostgreSQL. It uses cdr_date for partitioning. When I run a simple query, it takes a long time and I don't know why.
This is the query:
EXPLAIN (ANALYZE , BUFFERS )
select * FROM cdr
WHERE cdr_date >= '2018-05-24 11:59:00.937000 +00:00'
AND cdr_date <= '2018-05-25 23:59:59.937000 +00:00'
and this is its result:
Append (cost=0.56..1036393.46 rows=14908437 width=295) (actual time=5019.283..335535.305 rows=15191628 loops=1)
Buffers: shared hit=252735 read=1443977 written=125
-> Index Scan using ind_cdr_cdr_date on cdr (cost=0.56..8.58 rows=1 width=286) (actual time=5019.190..5019.190 rows=0 loops=1)
Index Cond: ((cdr_date >= '2018-05-24 11:59:00.937+00'::timestamp with time zone) AND (cdr_date <= '2018-05-25 23:59:59.937+00'::timestamp with time zone))
Buffers: shared hit=178464 read=708130 written=125
-> Index Scan using ind_cdr_2018_05_24 on cdr_2018_05_24 (cost=0.43..567998.02 rows=7158579 width=295) (actual time=0.091..311773.252 rows=7846816 loops=1)
Index Cond: ((cdr_date >= '2018-05-24 11:59:00.937+00'::timestamp with time zone) AND (cdr_date <= '2018-05-25 23:59:59.937+00'::timestamp with time zone))
Buffers: shared hit=74264 read=383715
-> Seq Scan on cdr_2018_05_25 (cost=0.00..468386.85 rows=7749857 width=295) (actual time=5.192..16189.737 rows=7344812 loops=1)
Filter: ((cdr_date >= '2018-05-24 11:59:00.937+00'::timestamp with time zone) AND (cdr_date <= '2018-05-25 23:59:59.937+00'::timestamp with time zone))
Buffers: shared hit=7 read=352132
Planning time: 3.394 ms
Execution time: 336984.703 ms
Here is my root table:
CREATE TABLE cdr
(
id BIGSERIAL NOT NULL
CONSTRAINT cdr_pkey
PRIMARY KEY,
username VARCHAR(256) NOT NULL,
user_id BIGINT,
cdr_date TIMESTAMP WITH TIME ZONE NOT NULL,
created_at TIMESTAMP WITH TIME ZONE NOT NULL,
last_reset_time TIMESTAMP WITH TIME ZONE,
prev_cdr_date TIMESTAMP WITH TIME ZONE NOT NULL
);
CREATE INDEX ind_cdr_user_id
ON cdr (user_id);
CREATE INDEX ind_cdr_cdr_date
ON cdr (cdr_date);
and here is one of the child tables:
-- auto-generated definition
CREATE TABLE cdr_2018_05_25
(
CONSTRAINT cdr_2018_05_25_cdr_date_check
CHECK ((cdr_date >= '2018-05-25 00:00:00+00' :: TIMESTAMP WITH TIME ZONE) AND
(cdr_date <= '2018-05-26 00:23:29.064408+00' :: TIMESTAMP WITH TIME ZONE))
)
INHERITS (cdr);
CREATE INDEX ind_cdr_2018_05_25_user_id
ON cdr_2018_05_25 (user_id);
CREATE INDEX ind_cdr_2018_05_25
ON cdr_2018_05_25 (cdr_date);
Because your partition is big, and you're basically selecting most of the data in the partition.
The filter is not equal to the check, so after it determines which partition to use, it still scans the index.
There are 3 solutions that I can propose, and they can work together:
1. Don't partition on ranges with such a high resolution. Consider adding another field, which is just the DATE component, and have the check use an equality operator instead. This will also ensure that your partitions don't overlap, as they do in this case. It won't help much in this exact case, unless you really want to select all the data from a single partition.
2. Cluster the table on the cdr_date index, which will drastically speed up such queries:
CLUSTER cdr_2018_05_24 USING ind_cdr_2018_05_24
3. Consider partitioning the partitions by hour, in case you often select smaller time ranges. 7 million rows are quite a lot for such a query.
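As a rough sketch of the non-overlap and clustering points, the child table's CHECK constraint could be tightened to a half-open daily range. This is a simpler variant than the DATE-column idea, and it assumes no rows in cdr_2018_05_25 fall on or after midnight of the next day, which the first statement checks.

```sql
-- Rows at or after the next midnight would violate the tighter constraint.
SELECT count(*)
FROM ONLY cdr_2018_05_25
WHERE cdr_date >= '2018-05-26 00:00:00+00';

-- Assumption: the count above is 0 (or those rows were moved first).
BEGIN;
ALTER TABLE cdr_2018_05_25
    DROP CONSTRAINT cdr_2018_05_25_cdr_date_check;
ALTER TABLE cdr_2018_05_25
    ADD CONSTRAINT cdr_2018_05_25_cdr_date_check
    CHECK (cdr_date >= '2018-05-25 00:00:00+00'
       AND cdr_date <  '2018-05-26 00:00:00+00');
COMMIT;

-- CLUSTER rewrites the table under an exclusive lock;
-- refresh statistics afterwards.
CLUSTER cdr_2018_05_24 USING ind_cdr_2018_05_24;
ANALYZE cdr_2018_05_24;
```

Half-open ranges (>= start, < next start) keep adjacent partitions from overlapping, so constraint exclusion can rule out a neighbouring day cleanly.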
There is no way it should take 5 seconds to find 0 rows on an index scan of the root table. I would say your root table (or indexes, anyway) is massively bloated. And if that is the case, maybe your other ones are as well. Are you vacuuming these tables sufficiently, or even at all? Look in pg_stat_user_tables for the last time they were vacuumed, either manually or auto.
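The vacuum check mentioned above might look like the following sketch. The LIKE 'cdr%' filter is an assumption based on the table names shown in the question.

```sql
-- When were the cdr tables last vacuumed/analyzed, and how many dead rows do they hold?
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname LIKE 'cdr%'
ORDER BY relname;

-- If the root table turns out to be bloated, vacuum it and rebuild the date index.
-- Note: REINDEX blocks writes to the table while it runs.
VACUUM (VERBOSE, ANALYZE) cdr;
REINDEX INDEX ind_cdr_cdr_date;
```

A high n_dead_tup together with empty last_vacuum and last_autovacuum columns would support the bloat explanation.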

Postgres: the query sometimes takes too long

I need help or a hint. I have a Postgres 9.4 database, and one query is SOMETIMES processed very slowly.
SELECT COUNT(*) FROM "table_a" INNER JOIN "table_b" ON "table_b"."id" = "table_a"."table_b_id" AND "table_b"."deleted_at" IS NULL WHERE "table_a"."deleted_at" IS NULL AND "table_b"."company_id" = ? AND "table_a"."company_id" = ?
Query plan for this:
Aggregate (cost=308160.70..308160.71 rows=1 width=0)
-> Hash Join (cost=284954.16..308160.65 rows=20 width=0)
Hash Cond: ?
-> Bitmap Heap Scan on table_a (cost=276092.39..299260.96 rows=6035 width=4)
Recheck Cond: ?
Filter: ?
-> Bitmap Index Scan on index_table_a_on_created_at_and_company_id (cost=0.00..276090.89 rows=6751 width=0)
Index Cond: ?
-> Hash (cost=8821.52..8821.52 rows=3220 width=4)
-> Bitmap Heap Scan on table_b (cost=106.04..8821.52 rows=3220 width=4)
Recheck Cond: ?
Filter: ?
-> Bitmap Index Scan on index_table_b_on_company_id (cost=0.00..105.23 rows=3308 width=0)
Index Cond: ?
But usually this query executes fast enough (about 69.7 ms). I don't understand why it is sometimes slow. In the performance logs for the slow periods, I saw that my RDS instance consumes a lot of memory and that the number of these queries reaches about 100 per second. Any help on where to look to solve this problem would be appreciated.
I am not sure if this will solve your problem or not :)
When this query returns very fast, it is probably being served from cache: the data pages it needs are already in memory, so nothing has to be read from disk.
First of all, check whether too many queries are being executed on these tables, especially inserts, updates, and deletes. Heavy write activity competes for locks and I/O and can slow down concurrent SELECTs.
The query can also be slow because the join and WHERE comparisons between table_a and table_b are costly.
You can reduce the cost by indexing the columns "table_b"."id", "table_a"."table_b_id", "table_a"."deleted_at", "table_b"."company_id", and "table_a"."company_id".
You can also create a materialized view to reduce the cost; unlike a plain view, it stores precomputed results.
One last thing: you can reduce the cost by using temporary tables as well. I have given an example below.
QUERIES:
CREATE TEMPORARY TABLE table_a_temp as
SELECT "table_a"."table_b_id" FROM "table_a"
WHERE "table_a"."deleted_at" IS NULL AND "table_a"."company_id" = ? ;
CREATE TEMPORARY TABLE table_b_temp as
SELECT "table_b"."id" FROM "table_b"
WHERE "table_b"."deleted_at" IS NULL AND "table_b"."company_id" = ?;
SELECT COUNT(*) FROM "table_a_temp" INNER JOIN "table_b_temp"
ON "table_b_temp"."id" = "table_a_temp"."table_b_id" ;
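The indexing suggestion above could be sketched as partial indexes that cover only non-deleted rows. The index names are hypothetical, and whether the planner prefers them over the existing indexes depends on the data, so treat this as a starting point rather than a guaranteed fix.

```sql
-- Hypothetical index names; the columns come from the query in the question.
-- Partial indexes skip soft-deleted rows, matching the "deleted_at IS NULL" filters.
CREATE INDEX index_table_a_live_company_b
    ON table_a (company_id, table_b_id)
    WHERE deleted_at IS NULL;

CREATE INDEX index_table_b_live_company
    ON table_b (company_id, id)
    WHERE deleted_at IS NULL;
```

On a busy production database, CREATE INDEX CONCURRENTLY avoids blocking writes while the index is built, at the cost of a slower build.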

Is there a way to search partial match in multivalue field in PostgreSQL?

I have a table quite like this:
CREATE TABLE myTable (
family text,
names text[]
)
I can search like this:
SELECT family
FROM myTable where names @> array['B0WP04'];
But I would like to do:
SELECT family
FROM myTable where names @> array['%P0%'];
Is this possible?
In PostgreSQL 9.3 and later you can:
select family
from myTable
join lateral unnest(mytable.names) as un(name) on true
where un.name like '%P0%';
But keep in mind that it can produce duplicates, so perhaps you'd like to add DISTINCT.
For earlier versions:
select family
from myTable where
exists (select 1 from unnest(names) as un(name) where un.name like '%P0%');
Adding a bit on Radek's answer, I tried
select family
from myTable where
exists (select 1 from unnest(names) as name where name like '%P0%');
and it also works. I searched in the PostgreSQL docs for the un() function, but can't find anything.
I'm not saying it doesn't do anything, but I'm just curious about what the un() function should do (and happy to have my problem solved)
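For what it's worth, un(name) in the earlier query is not a function call. It is alias syntax: un names the row set returned by unnest() and name names its single column. The two forms below should therefore behave the same; this is a small sketch against the myTable definition from the question.

```sql
-- "un" is a table alias and "name" a column alias for the set returned by unnest().
SELECT family
FROM myTable
WHERE EXISTS (
    SELECT 1
    FROM unnest(names) AS un(name)   -- table_alias(column_alias)
    WHERE un.name LIKE '%P0%'
);

-- A bare alias on a function returning a scalar type names
-- both the table and its single column.
SELECT family
FROM myTable
WHERE EXISTS (
    SELECT 1
    FROM unnest(names) AS name
    WHERE name LIKE '%P0%'
);
```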
You can use the parray_gin extension https://github.com/theirix/parray_gin
This extension is said to work only up to 9.2 but I just installed and tested it on 9.3 and it works well.
Here is how to install it on ubuntu-like systems :)
# install postgresql extension network client and postgresql extension build tools
sudo apt-get install python-setuptools
easy_install pgxnclient
sudo apt-get install postgresql-server-dev-9.3
# get the extension
pgxn install parray_gin
And here is my test
-- as a superuser: add the extension to the current database
CREATE EXTENSION parray_gin;
-- as a normal user
CREATE TABLE test (
id SERIAL PRIMARY KEY,
names TEXT []
);
INSERT INTO test (names) VALUES
(ARRAY ['nam1', 'nam2']),
(ARRAY ['2nam1', '2nam2']),
(ARRAY ['Hello', 'Woooorld']),
(ARRAY ['Woooorld', 'Hello']),
(ARRAY [] :: TEXT []),
(NULL),
(ARRAY ['Hello', 'is', 'it', 'me', 'you''re', 'looking', 'for', '?']);
-- double up the rows in test table, with many rows, the index is used
INSERT INTO test (names) (SELECT names FROM test);
SELECT count(*) from test; /*
count
--------
997376
(1 row)
*/
Now that we have some test data, it's magic time:
-- http://pgxn.org/dist/parray_gin/doc/parray_gin.html
CREATE INDEX names_idx ON test USING GIN (names parray_gin_ops);
--- now it's time for some tests
EXPLAIN ANALYZE SELECT * FROM test WHERE names @> ARRAY ['is']; /*
-- WITHOUT INDEX ON NAMES
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..25667.00 rows=1138 width=49) (actual time=0.021..508.599 rows=51200 loops=1)
Filter: (names @> '{is}'::text[])
Rows Removed by Filter: 946176
Total runtime: 653.879 ms
(4 rows)
-- WITH INDEX ON NAMES
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=455.73..3463.37 rows=997 width=49) (actual time=14.327..240.365 rows=51200 loops=1)
Recheck Cond: (names @> '{is}'::text[])
-> Bitmap Index Scan on names_idx (cost=0.00..455.48 rows=997 width=0) (actual time=12.241..12.241 rows=51200 loops=1)
Index Cond: (names @> '{is}'::text[])
Total runtime: 341.750 ms
(5 rows)
*/
EXPLAIN ANALYZE SELECT * FROM test WHERE names @@> ARRAY ['%nam%']; /*
-- WITHOUT INDEX ON NAMES
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..23914.20 rows=997 width=49) (actual time=0.023..590.093 rows=102400 loops=1)
Filter: (names @@> '{%nam%}'::text[])
Rows Removed by Filter: 894976
Total runtime: 796.636 ms
(4 rows)
-- WITH INDEX ON NAMES
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=159.73..3167.37 rows=997 width=49) (actual time=20.164..293.942 rows=102400 loops=1)
Recheck Cond: (names @@> '{%nam%}'::text[])
-> Bitmap Index Scan on names_idx (cost=0.00..159.48 rows=997 width=0) (actual time=18.539..18.539 rows=102400 loops=1)
Index Cond: (names @@> '{%nam%}'::text[])
Total runtime: 490.060 ms
(5 rows)
*/
The final performance depends entirely on your data and queries, but in my dummy example this extension is very effective and cuts query time roughly in half.
