I have the following Query/View:
CREATE OR REPLACE VIEW "SumAndSalesPerCountryYear" AS
SELECT date_part('year'::text, "Invoice"."InvoiceDate") AS year,
"Invoice"."BillingCountry" AS country,
sum("Invoice"."Total") AS total
FROM "Invoice"
GROUP BY date_part('year'::text, "Invoice"."InvoiceDate"), "Invoice"."BillingCountry"
ORDER BY date_part('year'::text, "Invoice"."InvoiceDate") DESC, sum("Invoice"."Total") DESC;
My table structure is as follows:
CREATE TABLE "Invoice"
(
"InvoiceId" integer NOT NULL,
"CustomerId" integer NOT NULL,
"InvoiceDate" timestamp without time zone NOT NULL,
"BillingAddress" character varying(70),
"BillingCity" character varying(40),
"BillingState" character varying(40),
"BillingCountry" character varying(40),
"BillingPostalCode" character varying(10),
"Total" numeric(10,2) NOT NULL,
CONSTRAINT "PK_Invoice" PRIMARY KEY ("InvoiceId"),
CONSTRAINT "FK_InvoiceCustomerId" FOREIGN KEY ("CustomerId")
REFERENCES "Customer" ("CustomerId") MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
OIDS=FALSE
);
The current execution plan is
Sort (cost=33.65..34.54 rows=354 width=21) (actual time=0.691..0.698 rows=101 loops=1)
Sort Key: (date_part('year'::text, "Invoice"."InvoiceDate")), (sum("Invoice"."Total"))
Sort Method: quicksort Memory: 32kB
-> HashAggregate (cost=14.24..18.67 rows=354 width=21) (actual time=0.540..0.567 rows=101 loops=1)
-> Seq Scan on "Invoice" (cost=0.00..11.15 rows=412 width=21) (actual time=0.015..0.216 rows=412 loops=1)
Total runtime: 0.753 ms
My task is to optimize the query by using indices; however, I cannot think of a way to use indices for optimizing aggregate results.
You can try to penalize HashAggregate with "SET enable_hashagg TO off", but with this little data an index will probably not bring any benefit. In this use case HashAggregate is usually the fastest aggregation method, and a 32kB sort is very quick.
But you are trying to benchmark performance on a table with 412 rows, which makes no sense. Thinking about performance only makes sense on a data volume that corresponds to 2 to 3 years of production usage.
As noted by Pavel, Ramfjord, and horse, using an index is of little use with such a tiny amount of data. It's so small that it's faster for Postgres to read a disk page or two and process everything in memory.
Further, you have the best possible plan for your query already. You're asking Postgres to compute an aggregate over an entire table and returning it in a certain order. Postgres proceeds by computing the aggregate in memory without bothering to sort the data first, by assigning intermediary results using a hash; it then sorts the small number of results according to your criteria.
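If you want to verify this yourself, here is a minimal sketch. It assumes an expression index that matches the GROUP BY (the index name is made up), and on a table this small the planner may well keep ignoring it. Disabling hash aggregation for the session lets you compare the sorted plan against the HashAggregate plan shown above:

```sql
-- Hypothetical index name; the indexed expressions match the GROUP BY of the view.
CREATE INDEX "IX_Invoice_Year_Country"
    ON "Invoice" (date_part('year', "InvoiceDate"), "BillingCountry");
ANALYZE "Invoice";

-- Discourage HashAggregate for this session only, so that the planner
-- considers a sort-based GroupAggregate plan that could use the index.
SET enable_hashagg = off;

EXPLAIN ANALYZE
SELECT date_part('year', "InvoiceDate") AS year,
       "BillingCountry" AS country,
       sum("Total") AS total
FROM "Invoice"
GROUP BY date_part('year', "InvoiceDate"), "BillingCountry";

-- Restore the default planner behaviour.
RESET enable_hashagg;
```

Comparing the two "Total runtime" values on realistic data volumes is what tells you whether the index is worth keeping.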
I have a column foo_bar of type text indexed with text_pattern_ops:
"profiles_foo_bar_txt" btree (foo_bar text_pattern_ops)
My psql server is configured with UTF-8:
lc_collate
------------
en_US.utf8
(1 row)
But for some reason the index is not being used for pattern matching queries:
EXPLAIN SELECT "profiles".* FROM "profiles" WHERE foo_bar LIKE 'Mar%';
results in
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on profiles (cost=0.00..2288.12 rows=51370 width=142)
Filter: (foo_bar ~~ 'Mar%'::text)
(2 rows)
Am I missing something? Any ideas on what I might be doing wrong?
EDIT:
As requested in Frank Heiken's comment, here's the output when using EXPLAIN (ANALYZE, VERBOSE, BUFFERS):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Seq Scan on public.profiles (cost=0.00..2436.12 rows=51370 width=11) (actual time=0.013..14.224 rows=51370 loops=1)
Output: foo_bar
Filter: (profiles.foo_bar ~~ 'Mar%'::text)
Rows Removed by Filter: 1
Buffers: shared hit=1794
Planning Time: 0.077 ms
Execution Time: 16.932 ms
(7 rows)
So it seems that the index with text_pattern_ops is indeed not being used for that pattern matching query.
Any ideas?
So it seems that @jjanes' comment was right. It turns out that I had added that foo_bar column for test purposes and all the 50k rows had the same value in that column (e.g. Mary John). So the query with Mar% was probably so unselective (i.e. it would match everything) that using the index didn't make sense.
I then updated that column and populated with different names, ie, there were now names starting with Mas*, Mar* and other completely different names.
I then reran EXPLAIN and could confirm that the index was indeed being used now:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using profiles_foo_bar_txt on public.profiles (cost=0.29..597.03 rows=16923 width=6) (actual time=0.020..4.951 rows=16978 loops=1)
Output: foo_bar
Index Cond: ((profiles.foo_bar ~>=~ 'Mar'::text) AND (profiles.foo_bar ~<~ 'Mas'::text))
Filter: (profiles.foo_bar ~~ 'Mar%'::text)
Heap Fetches: 0
Buffers: shared hit=69
Planning Time: 0.104 ms
Execution Time: 5.910 ms
(8 rows)
Btw, without the index, that same query would take between 12ms - 18ms for 50k+ rows. With the index, it takes between 3ms - 7ms now.
There are 16,978 results which match that Mar% query.
In terms of whether the gain of performance makes sense for this scenario (or if it should've been more), I cannot say.
According to this article, for the same amount of rows 50k, the author had an improvement for a prefix match query from 8ms (without the index) to 1ms (with the index). So something looks off still.
I want to index an array column with either GIN or GiST. The fact that GIN is slower in insert/update operations, however, made me wonder if it would have any impact on performance - even though the indexed column itself will remain static.
So, assuming that, for instance, I have a table with columns (A, B, C) and that B is indexed, does the index get updated if I update only column C?
It depends :^)
Normally, PostgreSQL will have to modify the index, even if nothing changes in the indexed column, because an UPDATE in PostgreSQL creates a new row version, so you need a new index entry to point to the new location of the row in the table.
Since this is unfortunate, there is an optimization called “HOT update”: If none of the indexed columns are modified and there is enough free space in the block that contains the original row, PostgreSQL can create a “heap-only tuple” that is not referenced from the outside and therefore does not require a new index entry.
You can lower the fillfactor on the table to increase the likelihood for HOT updates.
For details, you may want to read my article on the topic.
Laurenz Albe's answer is great. The following part is my interpretation.
A GIN index (here with array_ops) cannot do an index-only scan. This means that even if you only query the array column, you can only use a bitmap index scan followed by a visit to the table. With a bitmap scan and a low fillfactor, you probably don't need to visit extra pages.
demo:
begin;
create table test_gin_update(cola int, colb int[]);
insert into test_gin_update values (1,array[1,2]);
insert into test_gin_update values (1,array[1,2,3]);
insert into test_gin_update(cola, colb) select g, array[g, g + 1] from generate_series(10, 10000) g;
commit;
For example, after creating a GIN index:
create index on test_gin_update using gin(colb array_ops);
the query
select colb from test_gin_update where colb = array[1,2];
can only use a bitmap index scan, because the GIN index alone cannot distinguish array[1,2] from array[1,2,3], so each candidate row has to be rechecked against the table. See the following query plan:
QUERY PLAN
-----------------------------------------------------------------------------
Bitmap Heap Scan on test_gin_update (actual rows=1 loops=1)
Recheck Cond: (colb = '{1,2}'::integer[])
Rows Removed by Index Recheck: 1
Heap Blocks: exact=1
-> Bitmap Index Scan on test_gin_update_colb_idx (actual rows=2 loops=1)
Index Cond: (colb = '{1,2}'::integer[])
(6 rows)
We are changing the DB (PostgreSQL 10.11) structure for one of our projects. One of the changes is moving a field of type uuid[] (called "areasoflawid") into the jsonb field (called "data").
So, we have a table which look like this:
CREATE TABLE public.documents
(
id serial,
areasoflawid uuid[], --the field to be moved into the ‘data’
data jsonb,
….
)
We are not changing the values of the array or its structure
(i.e. documents.data->'metadata'->'areaoflawids' contains the same items as documents.areasoflawid).
After data migration, the JSON stored in the “data” field has following structure:
{
...
"metadata": {
...
"areaoflawids": [
"e34e0ee5-78e0-4d92-9186-ac69c109408b",
"b3af9163-d910-4d19-8f40-0602b75c25b0",
"50dc7fd8-ebdf-4cd2-bcab-b8d755fe96e8",
"8955c062-363f-4a1a-ac3c-d1c2ffe96c9b",
"bdb79f9f-4539-45f5-ac82-92baaf915f6c"
],
....
},
...
}
So, after migrating the data we started benchmarking queries on the jsonb field and found that searching over the array field documents.data->'metadata'->'areaoflawids' takes MUCH longer than searching over the uuid[] field documents.areasoflawid.
Here are the queries:
--search over jsonb array field, takes 6.2 sec, returns 13615 rows
SELECT id FROM documents WHERE data->'metadata'->'areaoflawids' @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"'
--search over uuid[] field, takes 600ms, returns 13615 rows
SELECT id FROM documents WHERE areasoflawid @> ARRAY['e34e0ee5-78e0-4d92-9186-ac69c109408b']::uuid[]
Here is the index over jsonb field:
CREATE INDEX test_documents_aols_gin_idx
ON public.documents
USING gin
(((data -> 'metadata'::text) -> 'areaoflawids'::text) jsonb_path_ops);
And here is the execution plan:
EXPLAIN ANALYZE SELECT id FROM documents WHERE data->'metadata'->'areaoflawids' @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"'
"Bitmap Heap Scan on documents (cost=6.31..390.78 rows=201 width=4) (actual time=2.297..5859.886 rows=13614 loops=1)"
" Recheck Cond: (((data -> 'metadata'::text) -> 'areaoflawids'::text) @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"'::jsonb)"
" Heap Blocks: exact=4859"
" -> Bitmap Index Scan on test_documents_aols_gin_idx (cost=0.00..6.30 rows=201 width=0) (actual time=1.608..1.608 rows=13614 loops=1)"
" Index Cond: (((data -> 'metadata'::text) -> 'areaoflawids'::text) @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"'::jsonb)"
"Planning time: 0.133 ms"
"Execution time: 5862.807 ms"
Other queries over the jsonb field run at acceptable speed, but this particular search is about 10 times slower than the search over the separate field. We were expecting it to be a bit slower, but not that bad. We are considering leaving "areasoflawid" as a separate field, but we would definitely prefer to move it inside the JSON. I've been playing with different indexes and operators (I also tried ? and ?|) but the search is still slow. Any help is appreciated!
Finding the 13,614 candidate matches in the index is very fast (1.608 milliseconds). The slow part is reading all of those rows from the table itself. If you turn on track_io_timing, then do EXPLAIN (ANALYZE, BUFFERS), I'm sure you will find you are waiting on IO. If you run the query several times in a row, does it get faster?
I think you are doing an unequal benchmark here, where one table is already in cache and the alternative table is not. But it could also be that the new table is too large to actually fit in cache.
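A minimal sketch of that check, using the query from the question. It assumes your role is allowed to change track_io_timing (by default only superusers can set it for a session):

```sql
-- Record time spent in read/write calls; requires sufficient privileges.
SET track_io_timing = on;

-- BUFFERS shows how many blocks came from shared buffers (hit)
-- versus from the OS or disk (read); with track_io_timing enabled
-- the plan also reports I/O timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM documents
WHERE data->'metadata'->'areaoflawids' @> '"e34e0ee5-78e0-4d92-9186-ac69c109408b"';
```

Running it two or three times in a row and watching the "read" counts drop is a quick way to tell a cold-cache effect from a genuinely slow plan.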
Thank you for your response! We came up with another solution taken from this post: https://www.postgresql.org/message-id/CAONrwUFOtnR909gs+7UOdQQB12+pXsGUYu5YHPtbQk5vaE9Gaw@mail.gmail.com . The query now takes about 600-800ms to execute.
So, here is the solution:
CREATE OR REPLACE FUNCTION aol_uuids(data jsonb) RETURNS TEXT[] AS
$$
SELECT
array_agg(value::TEXT) as val
FROM
jsonb_array_elements(case jsonb_typeof(data) when 'array' then data else '[]' end)
$$ LANGUAGE SQL IMMUTABLE;
SELECT id FROM documents WHERE aol_uuids(data->'metadata'->'areaoflawids') @> ARRAY['"e34e0ee5-78e0-4d92-9186-ac69c109408b"']
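The post does not show the index that goes with this function, so the following is only a sketch of what such an index could look like (the index name is made up). Because aol_uuids is declared IMMUTABLE, it can be used in an expression index, and the query above can use it as long as its WHERE clause contains exactly the same expression:

```sql
-- Hypothetical index name. GIN over the text[] returned by aol_uuids();
-- the default array operator class supports the @> operator.
CREATE INDEX documents_aol_uuids_gin_idx
    ON public.documents
    USING gin (aol_uuids(data -> 'metadata' -> 'areaoflawids'));

-- Refresh statistics so the planner has estimates for the indexed expression.
ANALYZE public.documents;
```

Note that the array elements keep their JSON double quotes, since value::TEXT on a jsonb string includes them, which is why the search value is written as '"e34e0ee5-..."'.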
I am new to PostgreSQL and PostGIS but the question is not trivial. I am using PostgreSQL 9.5 with PostGIS 2.2.
I need to run some queries that take a horrible amount of time.
First, let me explain the problem in non-GIS terms :
Basically, I have a set of several hundred thousand points spread over a territory of about half a million square kilometres (a country).
Over this territory, I have about a dozen sets of areas coming from various databases. In each set, I have between a few hundreds and a few thousands of areas. I want to find which points are in any of these areas.
Now, how I am currently working out the problem in GIS terms :
Each set of areas is a Postgresql table with a geometry column of the type multipolygon and with, as explained before a few hundreds to a few thousand records.
All these tables are contained in a schema donnees but I am using a different schema for these operations, called traitements.
So the process is a/ merging all the geometries into a single geometry, and then b/ finding which points are contained in this geometry.
The problem is that, if step a/ took a reasonable amount of time (several minutes), step b/ takes forever.
I am currently working with only a sample of the points I must process (about 1% of them, i.e. about 7000) and it is not finished after several hours (the database connection eventually times out).
I am making tests running the query by limiting the number of return rows to 10 or 50 and it still takes about half an hour for that.
I am using a Linux Mint 18 machine with 4 CPU and 8 Gb of RAM if you wonder.
I have created indexes on the geometry columns. All geometry columns use the same SRID.
Creating the tables :
CREATE TABLE traitements.sites_candidats (
pkid serial PRIMARY KEY,
statut varchar(255) NOT NULL,
geom geometry(Point, 2154)
);
CREATE UNIQUE INDEX ON traitements.sites_candidats (origine, origine_id ) ;
CREATE INDEX ON traitements.sites_candidats (statut);
CREATE INDEX sites_candidats_geométrie ON traitements.sites_candidats USING GIST ( geom );
CREATE TABLE traitements.zones_traitements (
pkid serial PRIMARY KEY,
définition varchar(255) NOT NULL,
geom geometry (MultiPolygon, 2154)
);
CREATE UNIQUE INDEX ON traitements.zones_traitements (définition) ;
CREATE INDEX zones_traitements_geométrie ON traitements.zones_traitements USING GIST ( geom );
Please note that I specified the geometry type of the geom column in table traitements only because I wanted to specify the SRID but I was not sure what is the correct syntax for any type of Geometry. Maybe "geom geometry (Geometry, 2154)" ?
Merging all the geometries of the various sets of areas :
As said before, all the tables hold geometries of the type multipolygon.
This is the code I am using to merge all the geometries from one of the tables :
INSERT INTO traitements.zones_traitements
( définition, geom )
VALUES
(
'first-level merge',
(
SELECT ST_Multi(ST_Collect(dumpedGeometries)) AS singleMultiGeometry
FROM
(
SELECT ST_Force2D((ST_Dump(geom)).geom) AS dumpedGeometries
FROM donnees.one_table
) AS dumpingGeometries
)
) ;
I found that some of the geometries in some of the records are in 3D, so that's why I am using ST_Force2D.
I do this for all the tables and then merge the geometries again using :
INSERT INTO traitements.zones_traitements
( définition, geom )
VALUES
(
'second-level merge',
(
SELECT ST_Multi(ST_Collect(dumpedGeometries)) AS singleMultiGeometry
FROM
(
SELECT (ST_Dump(geom)).geom AS dumpedGeometries
FROM traitements.zones_traitements
WHERE définition != 'second-level merge'
) AS dumpingGeometries
)
) ;
As said before, these queries take several minutes but that's fine.
Now the query that takes forever :
SELECT pkid
FROM traitements.sites_candidats AS sites
JOIN (
SELECT geom FROM traitements.zones_traitements
WHERE définition = 'zones_rédhibitoires' ) AS zones
ON ST_Contains(zones.geom , sites.geom)
LIMIT 50;
Analysing the problem :
Obviously, it is the subquery selecting the points that takes a lot of time, not the update.
So I have run an EXPLAIN (ANALYZE, BUFFERS) on the query :
EXPLAIN (ANALYZE, BUFFERS)
SELECT pkid
FROM traitements.sites_candidats AS sites
JOIN (
SELECT geom FROM traitements.zones_traitements
WHERE définition = 'second_level_merge' ) AS zones
ON ST_Contains(zones.geom , sites.geom)
LIMIT 10;
---------------------------------
"Limit (cost=4.18..20.23 rows=1 width=22) (actual time=6052.069..4393634.244 rows=10 loops=1)"
" Buffers: shared hit=1 read=688784"
" -> Nested Loop (cost=4.18..20.23 rows=1 width=22) (actual time=6052.068..4391938.803 rows=10 loops=1)"
" Buffers: shared hit=1 read=688784"
" -> Seq Scan on zones_traitements (cost=0.00..1.23 rows=1 width=54939392) (actual time=0.016..0.016 rows=1 loops=1)"
" Filter: (("définition")::text = 'zones_rédhibitoires'::text)"
" Rows Removed by Filter: 17"
" Buffers: shared hit=1"
" -> Bitmap Heap Scan on sites_candidats sites (cost=4.18..19.00 rows=1 width=54) (actual time=6052.044..4391260.053 rows=10 loops=1)"
" Recheck Cond: (zones_traitements.geom ~ geom)"
" Filter: _st_contains(zones_traitements.geom, geom)"
" Heap Blocks: exact=1"
" Buffers: shared read=688784"
" -> Bitmap Index Scan on "sites_candidats_geométrie" (cost=0.00..4.18 rows=4 width=0) (actual time=23.284..23.284 rows=3720 loops=1)"
" Index Cond: (zones_traitements.geom ~ geom)"
" Buffers: shared read=51"
"Planning time: 91.967 ms"
"Execution time: 4399271.394 ms"
I am not sure how to read this output.
Nevertheless, I suspect that the query is so slow because of the geometry obtained by merging all these multipolygons into a single one.
Questions :
Would that work better using a different type of geometry to merge the others, like a GeometryCollection ?
How do the indexes work in this case ?
Is there something more efficient than ST_Contains() ?
Let's see. First off, you should ask GIS-specific questions over at GIS Stackexchange. But I'll try to help here:
Technically, your geometry column definition is correct, and using 'primitives' (e.g. POINT, LINE, POLYGON and their MULTIs) is preferable to GEOMETRYCOLLECTIONs.
However, it is almost always the better choice to run spatial relation functions on as small a geometry as possible; for most of those functions, PostGIS has to check each and every vertex of the input geometries against each other (so in this case, it has to traverse the polygon's millions of vertices once for each point to be checked in ST_Contains).
PostGIS will in fact run a bbox comparison prior to the relation checks (if an index is present on both geometries) to limit the possible matches, effectively speeding up the check by several orders of magnitude; this is rendered useless here, because the bbox of the single merged geometry covers practically all of the points. (I would almost recommend actually dumping the MULTIs into simple POLYGONs, but not without knowing your data.)
Why are you dumping the MULTI geometries just to collect them back into MULTIs? If your source tables' geometries are actually stored as MULTIPOLYGONs (and hopefully for good reason), simply copy them into the intermediate table, with ST_Force2D used on the MULTIs and ST_IsValid in the WHERE clause (you can try ST_MakeValid on the geometries, but there's no guarantee it will work). Once you have inserted all tables into the zones_traitements table, run VACUUM ANALYZE and REINDEX to actually make use of the index!
In your 'second merge' query... are you simply adding the 'merged' geometries to the existing ones in the table? Don't, that's just wrong. It messes up the table statistics and the index and is quite unnecessary overhead. You should do these things within your query, but it's not necessary here.
Keep in mind that geometries of different types or extents created or derived by or within queries can neither have an index nor use the initial one. This applies to your 'merging' queries!
Then run
SELECT pkid
FROM traitements.sites_candidats AS sites
JOIN traitements.zones_traitements AS zones
ON ST_Intersects(zones.geom, sites.geom)
to return one pkid for every intersection with a zone, so that if one point intersects two MULTIPOLYGONs, you'll get two rows for that point. Use SELECT DISTINCT pkid ... to get only one row per pkid that intersects any zone. (Note: I used ST_Intersects because that should imply one less check on the relation. If you absolutely need ST_Contains, just replace it.)
Hope this helps. If not, say a word.
Again, thanks.
I had come to the same conclusion as your advice : that, instead of merging all the thousands of multipolygons into a single huge one, whose bbox is too huge, it would be more efficient to decompose all the multipolygons into simple polygons using ST_Dump and insert these into a dedicated table with an appropriate index.
Nevertheless, to do this, I first had to correct the geometries: certain multipolygons did indeed have invalid geometries. ST_MakeValid turned 90% of them into valid multipolygons, but the rest were transformed into either GeometryCollections or MultiLineStrings. To correct these, I used ST_Buffer with a buffer of 0.01 meter, the result of which was a correct multipolygon.
Once this was done, all my multipolygons were valid and I could dump them into simple polygons.
Doing this, I reduced the search time by a factor of +/- 5000 !
:D
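For readers who want to reproduce this, here is a rough sketch of the approach described above. The table name traitements.zones_polygones is an assumption (it is not in the original posts), and the example simply skips invalid geometries, which in practice have to be repaired first as explained:

```sql
-- Hypothetical target table holding one simple polygon per row.
CREATE TABLE traitements.zones_polygones (
    pkid serial PRIMARY KEY,
    geom geometry(Polygon, 2154)
);

-- Decompose each multipolygon into its component polygons (repeat per source table).
INSERT INTO traitements.zones_polygones (geom)
SELECT (ST_Dump(ST_Force2D(geom))).geom
FROM donnees.one_table
WHERE ST_IsValid(geom);

-- Small polygons have small bounding boxes, so the GiST index is selective.
CREATE INDEX ON traitements.zones_polygones USING GIST (geom);
ANALYZE traitements.zones_polygones;

-- One row per candidate site that falls in at least one polygon.
SELECT sites.pkid
FROM traitements.sites_candidats AS sites
WHERE EXISTS (
    SELECT 1
    FROM traitements.zones_polygones AS zones
    WHERE ST_Intersects(zones.geom, sites.geom)
);
```

Using EXISTS instead of a join avoids duplicate pkid values when a point lies in several overlapping polygons.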
Below is the table structure with about 6 million records:
CREATE TABLE "ip_loc" (
"start_ip" inet,
"end_ip" inet,
"iso2" varchar(4),
"state" varchar(100),
"city" varchar(100)
);
CREATE INDEX "index_ip_loc" on ip_loc using gist(iprange(start_ip,end_ip));
It takes about 1 second to do the query.
EXPLAIN ANALYZE select * from ip_loc where iprange(start_ip,end_ip)@>'180.167.1.25'::inet;
Bitmap Heap Scan on ip_loc (cost=1080.76..49100.68 rows=28948 width=41) (actual time=1039.428..1039.429 rows=1 loops=1)
Recheck Cond: (iprange(start_ip, end_ip) @> '180.167.1.25'::inet)
Heap Blocks: exact=1
-> Bitmap Index Scan on index_ip_loc (cost=0.00..1073.53 rows=28948 width=0) (actual time=1039.411..1039.411 rows=1 loops=1)
Index Cond: (iprange(start_ip, end_ip) @> '180.167.1.25'::inet)
Planning time: 0.090 ms
Execution time: 1039.466 ms
iprange is a customized type:
CREATE TYPE iprange AS RANGE (
SUBTYPE = inet
);
Is there a way to do the query faster?
The inet type is a composite type and not the simple 32-bits needed to construct an IPv4 address; it includes a netmask for instance. That makes storage, indexing and retrieval needlessly complex if all you are interested in is actual IP addresses (i.e. the 32 bit of the actual address, as opposed to addresses with netmasks, such as you would get from a web server listing the clients of an app) and you do not manipulate the IP addresses inside the database. If that is the case, you could store your start_ip and end_ip as simple integers and operate on those using simple integer comparison. (The same can be done for IPv6 addresses using an integer[4] data type.)
A point to keep in mind is that the default range constructor behaviour is to include the lower bound and exclude the upper bound so in your index and query the actual end_ip is not included.
Lastly, if you stick with a range type, on your index you should add the range_ops operator class for maximum performance.
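The point about bounds can be illustrated with a small sketch, using the iprange type from the question and made-up addresses. The three-argument form of the range constructor takes the bounds explicitly; the index name below is hypothetical:

```sql
-- Default bounds are '[)': the upper bound itself is NOT contained.
SELECT iprange('10.0.0.1'::inet, '10.0.0.5'::inet) @> '10.0.0.5'::inet;        -- false

-- With explicit inclusive bounds the last address is contained.
SELECT iprange('10.0.0.1'::inet, '10.0.0.5'::inet, '[]') @> '10.0.0.5'::inet;  -- true

-- Hypothetical index using inclusive bounds; queries must use the
-- same expression, iprange(start_ip, end_ip, '[]'), to match it.
CREATE INDEX index_ip_loc_inclusive
    ON ip_loc USING gist (iprange(start_ip, end_ip, '[]'));
```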
These ranges are non-overlapping? I'd try to btree index end_ip and do:
with candidate as (
-- first range that ends at or after the address
select * from ip_loc
where end_ip>='38.167.1.53'::inet
order by end_ip
limit 1
)
select * from candidate
where start_ip<='38.167.1.53'::inet;
Works in 0.1ms on 4M rows on my computer.
Remember to analyze table after populating it with data.
Add a clustered index for end_ip only
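PostgreSQL has no automatically maintained clustered index, so the closest equivalent is a btree index plus a one-time CLUSTER. A minimal sketch, with a made-up index name:

```sql
-- Hypothetical index name; plain btree on end_ip only.
CREATE INDEX index_ip_loc_end_ip ON ip_loc (end_ip);

-- Physically reorder the table by end_ip. This takes an exclusive lock
-- and is not maintained for later inserts or updates.
CLUSTER ip_loc USING index_ip_loc_end_ip;

-- Refresh planner statistics after loading or reordering the data.
ANALYZE ip_loc;
```

For a static lookup table like this one, running CLUSTER once after the bulk load is usually enough.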