Bad optimization/planning on Postgres window-based queries (partition by, group by?) - 1000x speedup

We are running Postgres 9.3.5. (07/2014)
We have a fairly complex data warehouse/reporting setup in place (ETL, materialized views, indexing, aggregations, analytical functions, ...).
What I just discovered may be difficult to implement in the optimizer (?), but it makes a huge difference in performance. (The code below is only sample code that closely resembles our real query, to cut unnecessary complexity.)
create view foo as
select
sum(s.plan) over w_pyl as pyl_plan, -- money planned to spend in this pot/loc/year
sum(s.booked) over w_pyl as pyl_booked, -- money already booked in this pot/loc/year
-- money already booked in this pot/loc the years before (stored as sum already)
last_value(s.booked_prev_years) over w_pl as pl_booked_prev_years,
-- update 2014-10-08: maybe the following additional selected columns
-- may be implementation-/test-relevant since they could potentially be determined
-- by sorting within the partition:
min(s.id) over w_pyl,
max(s.id) over w_pyl,
-- ... anything could follow here ...
x.*,
s.*
from
pot_location_year x -- may be some materialized view or (cache/regular) table
left outer join spendings s
on (s.pot = x.pot and s.loc = x.loc and s.year = x.year)
window
w_pyl as (partition by x.pot, x.year, x.loc),
w_pl as (partition by x.pot, x.loc order by x.year)
We have these two relevant indexes in place:
pot_location_year_idx__p_y_l -- on pot, year, loc
pot_location_year_idx__p_l_y -- on pot, loc, year
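(For reference, these two indexes would have been created roughly like this; the exact DDL is not part of the original post:)
create index pot_location_year_idx__p_y_l on pot_location_year (pot, year, loc);
create index pot_location_year_idx__p_l_y on pot_location_year (pot, loc, year);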
Now we run an explain for some test query
explain select * from foo fetch first 100 rows only
This shows very bad performance, because the p_y_l index is used and the result set then has to be re-sorted unnecessarily :-( (the outermost WindowAgg/Sort step sorts by pot, loc, year because that order is necessary for our last_value(..) as pl_booked_prev_years):
Limit (cost=289687.87..289692.12 rows=100 width=512)
-> WindowAgg (cost=289687.87..292714.85 rows=93138 width=408)
-> Sort (cost=289687.87..289920.71 rows=93138 width=408)
Sort Key: x.pot, x.loc, x.year
-> WindowAgg (cost=1.25..282000.68 rows=93138 width=408)
-> Nested Loop Left Join (cost=1.25..278508.01 rows=93138 width=408)
Join Filter: ...
-> Nested Loop Left Join (cost=0.83..214569.60 rows=93138 width=392)
-> Index Scan using pot_location_year_idx__p_y_l on pot_location_year x (cost=0.42..11665.49 rows=93138 width=306)
-> Index Scan using ... (cost=0.41..2.17 rows=1 width=140)
Index Cond: ...
-> Index Scan using ... (cost=0.41..0.67 rows=1 width=126)
Index Cond: ...
So the obvious problem is that the planner should choose the existing p_l_y index instead, to avoid having to sort twice.

Luckily I found that I could give the planner an (implicit) hint to do this by making the column order of the view's other partitions/windows more homogeneous, although that is not semantically necessary.
The following change now returns what I had expected to get in the first place (use of the p_l_y index):
...
window
-- w_pyl as (partition by x.pot, x.year, x.loc) -- showstopper (from above)
w_pyl as (partition by x.pot, x.loc, x.year), -- speedy
w_pl as (partition by x.pot, x.loc order by x.year)
The result, performing 1000 times faster:
Limit (cost=1.25..308.02 rows=100 width=512)
-> WindowAgg (cost=1.25..284794.82 rows=93138 width=408)
-> WindowAgg (cost=1.25..282000.68 rows=93138 width=408)
-> Nested Loop Left Join (cost=1.25..278508.01 rows=93138 width=408)
Join Filter: ...
-> Nested Loop Left Join (cost=0.83..214569.60 rows=93138 width=392)
-> Index Scan using pot_location_year_idx__p_l_y on pot_location_year x (cost=0.42..11665.49 rows=93138 width=306)
-> Index Scan using ... (cost=0.41..2.17 rows=1 width=140)
Index Cond: ...
-> Index Scan using ... (cost=0.41..0.67 rows=1 width=126)
Index Cond: ...
Update 2014-10-09:
Tom Lane (one of the major Postgres developers) wrote the following in 2013-02, regarding PG 9.2.2, about another (likely related) window function problem I am also facing here:
... There's not nearly that amount of intelligence
in the system about window functions, as yet. So you'll have to write
out the query longhand and put the WHERE clause at the lower level, if
you want this optimization to happen.
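Applied to the case at hand, "longhand" would mean querying the view's SQL directly with the filter placed below the window functions instead of filtering on top of the view - a sketch only, assuming (as in the updates further down) that the filter column pot_id determines the pot partition key:
select *
from (
    select
        sum(s.plan) over w_pyl as pyl_plan,
        sum(s.booked) over w_pyl as pyl_booked,
        last_value(s.booked_prev_years) over w_pl as pl_booked_prev_years,
        x.*, s.*
    from pot_location_year x
    left outer join spendings s
      on (s.pot = x.pot and s.loc = x.loc and s.year = x.year)
    where x.pot_id = '12345' -- the WHERE sits at the lower level, below the window functions
    window
        w_pyl as (partition by x.pot, x.loc, x.year),
        w_pl as (partition by x.pot, x.loc order by x.year)
) sub
fetch first 100 rows only;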
So some more (debatable) general thoughts on the subject of window functions, data warehouse functionality etc. that could be considered here:
The above is a good statement supporting my assumption that, when an Oracle->Postgres migration is decided on for general projects and for a DWH environment, the risk of spending much more time and money than planned is quite high. (Even though the investigated functionality may seem sufficient.)
I like Postgres much more than Oracle in important areas, e.g. the syntax and clarity of the code and other things (I guess even the source code, and thus maintainability in all its aspects, is much better there), but Oracle is clearly the much more advanced player in the resource optimization, support and tooling areas when you are dealing with more complex DB functionality outside typical CRUD management.
I guess open source Postgres (as well as the EnterpriseDB top-ups) will catch up in those areas in the long run, but it will take at least 10 years, and maybe only if it is pushed heavily by big, altruistic¹ global players like Google etc.
¹ altruistic in the sense that, if the pushed areas stay "free", the benefit for those companies must surely lie somewhere else (maybe with some advertisement rows added randomly - I guess we could live with that here and there ;))
Update 2014-10-13:
As linked in my previous update above (2014-10-09), the optimization problems and their workaround solutions continue in quite a similar way (after the above fix) when you want to query the above view with constraints/filters (here on pot_id):
explain select * from foo where pot_id = '12345' fetch first 100 rows only
...
Limit (cost=1.25..121151.44 rows=100 width=211)
-> Subquery Scan on foo (cost=1.25..279858.20 rows=231 width=211)
Filter: ((foo.pot_id)::text = '12345'::text)
-> WindowAgg (cost=1.25..277320.53 rows=203013 width=107)
-> WindowAgg (cost=1.25..271230.14 rows=203013 width=107)
-> Nested Loop Left Join (cost=1.25..263617.16 rows=203013 width=107)
-> Merge Left Join (cost=0.83..35629.02 rows=203013 width=91)
Merge Cond: ...
-> Index Scan using pot_location_year_idx__p_l_y on pot_location_year x (cost=0.42..15493.80 rows=93138 width=65)
-> Materialize (cost=0.41..15459.42 rows=33198 width=46)
-> Index Scan using ... (cost=0.41..15376.43 rows=33198 width=46)
-> Index Scan using ... (cost=0.42..1.11 rows=1 width=46)
Index Cond: ...
And as suggested in the above link, if you want to "push down" the constraint/filter below the window aggregation, you have to do it explicitly in the view itself already. This is then efficient for this type of query, with another 1000x speedup for fetching the first 100 rows:
create view foo as
...
where pot_id='12345'
...
...
Limit (cost=1.25..943.47 rows=100 width=211)
-> WindowAgg (cost=1.25..9780.52 rows=1039 width=107)
-> WindowAgg (cost=1.25..9751.95 rows=1039 width=107)
-> Nested Loop Left Join (cost=1.25..9715.58 rows=1039 width=107)
-> Nested Loop Left Join (cost=0.83..1129.47 rows=1039 width=91)
-> Index Scan using pot_location_year_idx__p_l_y on pot_location_year x (cost=0.42..269.77 rows=106 width=65)
Index Cond: ((pot_id)::text = '12345'::text)
-> Index Scan using ... (cost=0.41..8.10 rows=1 width=46)
Index Cond: ...
-> Index Scan using ... (cost=0.42..8.25 rows=1 width=46)
Index Cond: ...
After some more view parameterization effort², this approach will help speed up certain queries constraining those columns, but it is still quite inflexible regarding more general foo-view usage and query optimization.
² You can "parameterize" such a view by putting its SQL in a (set-returning) table function (the equivalent of an Oracle pipelined table function). Further details on this may be found in the forum link above.
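A minimal sketch of such a set-returning "parameterized view" function (function name, parameter name and column types are assumptions; the body is just the view's SQL with the filter applied below the window functions):
create or replace function foo_params(p_pot_id text)
returns table (pyl_plan numeric, pyl_booked numeric, pl_booked_prev_years numeric,
               pot text, loc text, year integer)
language sql stable
as $$
    select
        sum(s.plan) over w_pyl,
        sum(s.booked) over w_pyl,
        last_value(s.booked_prev_years) over w_pl,
        x.pot, x.loc, x.year
    from pot_location_year x
    left outer join spendings s
      on (s.pot = x.pot and s.loc = x.loc and s.year = x.year)
    where x.pot_id = p_pot_id -- constraint applied below the window functions
    window
        w_pyl as (partition by x.pot, x.loc, x.year),
        w_pl as (partition by x.pot, x.loc order by x.year)
$$;
-- usage: select * from foo_params('12345') fetch first 100 rows only;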

Related

Does postgresql update index even when the updated columns aren't the indexed ones?

I want to index an array column with either GIN or GiST. The fact that GIN is slower in insert/update operations, however, made me wonder if it would have any impact on performance - even though the indexed column itself will remain static.
So, assuming that, for instance, I have a table with columns (A, B, C) and that B is indexed, does the index get updated if I update only column C?
It depends :^)
Normally, PostgreSQL will have to modify the index, even if nothing changes in the indexed column, because an UPDATE in PostgreSQL creates a new row version, so you need a new index entry to point to the new location of the row in the table.
Since this is unfortunate, there is an optimization called “HOT update”: If none of the indexed columns are modified and there is enough free space in the block that contains the original row, PostgreSQL can create a “heap-only tuple” that is not referenced from the outside and therefore does not require a new index entry.
You can lower the fillfactor on the table to increase the likelihood for HOT updates.
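For illustration, lowering the fillfactor and then checking how many updates were HOT could look like this (the table name is just a placeholder):
-- leave ~30% of each heap page free so updated row versions can stay on the same page
alter table mytable set (fillfactor = 70);
-- optionally rewrite the table so existing pages also respect the new setting
vacuum full mytable;
-- compare total vs. HOT updates for the table
select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables where relname = 'mytable';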
For details, you may want to read my article on the topic.
Laurenz Albe's answer is great. The following part is my interpretation.
Because GIN array_ops cannot do an index-only scan, even if you only query the array column you can still only get a bitmap index scan. For that bitmap heap scan, with a low fillfactor you probably don't need to visit extra pages.
demo:
begin;
create table test_gin_update(cola int, colb int[]);
insert into test_gin_update values (1,array[1,2]);
insert into test_gin_update values (1,array[1,2,3]);
insert into test_gin_update(cola, colb) select g, array[g, g + 1] from generate_series(10, 10000) g;
commit;
For example, for select colb from test_gin_update where colb = array[1,2]; see the following query plan: because GIN cannot distinguish array[1,2] from array[1,2,3], even with a GIN index created via create index on test_gin_update using gin (colb array_ops); we can only get a bitmap index scan.
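Put together, the index and the query that produce the plan below (the exact EXPLAIN options are an assumption based on the output format shown):
create index on test_gin_update using gin (colb array_ops);
explain (analyze, costs off)
select colb from test_gin_update where colb = array[1,2];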
QUERY PLAN
-----------------------------------------------------------------------------
Bitmap Heap Scan on test_gin_update (actual rows=1 loops=1)
Recheck Cond: (colb = '{1,2}'::integer[])
Rows Removed by Index Recheck: 1
Heap Blocks: exact=1
-> Bitmap Index Scan on test_gin_update_colb_idx (actual rows=2 loops=1)
Index Cond: (colb = '{1,2}'::integer[])
(6 rows)

PostGIS : improving the speed of query over complex geometry

I am new to PostgreSQL and PostGIS but the question is not trivial. I am using PostgreSQL 9.5 with PostGIS 2.2.
I need to run some queries that take a horrible amount of time.
First, let me explain the problem in non-GIS terms:
Basically, I have a set of several hundred thousand points spread over a territory of about half a million square kilometres (a country).
Over this territory, I have about a dozen sets of areas coming from various databases. In each set, I have between a few hundred and a few thousand areas. I want to find which points are in any of these areas.
Now, how I am currently working out the problem in GIS terms:
Each set of areas is a PostgreSQL table with a geometry column of type multipolygon and, as explained before, a few hundred to a few thousand records.
All these tables are contained in a schema donnees but I am using a different schema for these operations, called traitements.
So the process is a/ merging all the geometries into a single geometry, and then b/ finding which points are contained in this geometry.
The problem is that, if step a/ took a reasonable amount of time (several minutes), step b/ takes forever.
I am currently working with only a sample of the points I must process (about 1% of them, i.e. about 7000) and it is not finished after several hours (the database connection eventually times out).
I have been testing the query while limiting the number of returned rows to 10 or 50, and it still takes about half an hour for that.
I am using a Linux Mint 18 machine with 4 CPUs and 8 GB of RAM, if you wonder.
I have created indexes on the geometry columns. All geometry columns use the same SRID.
Creating the tables:
CREATE TABLE traitements.sites_candidats (
pkid serial PRIMARY KEY,
statut varchar(255) NOT NULL,
geom geometry(Point, 2154)
);
CREATE UNIQUE INDEX ON traitements.sites_candidats (origine, origine_id ) ;
CREATE INDEX ON traitements.sites_candidats (statut);
CREATE INDEX sites_candidats_geométrie ON traitements.sites_candidats USING GIST ( geom );
CREATE TABLE traitements.zones_traitements (
pkid serial PRIMARY KEY,
définition varchar(255) NOT NULL,
geom geometry (MultiPolygon, 2154)
);
CREATE UNIQUE INDEX ON traitements.zones_traitements (définition) ;
CREATE INDEX zones_traitements_geométrie ON traitements.zones_traitements USING GIST ( geom );
Please note that I specified the geometry type of the geom columns in the traitements tables only because I wanted to specify the SRID, but I was not sure of the correct syntax for allowing any type of geometry. Maybe "geom geometry(Geometry, 2154)"?
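(For what it's worth: yes, a geometry column constrained only by SRID but not by type can be declared like this sketch; the table name is made up.)
create table traitements.exemple_geom (
    pkid serial primary key,
    geom geometry(Geometry, 2154) -- any geometry type, SRID restricted to 2154
);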
Merging all the geometries of the various sets of areas:
As said before, all the tables hold geometries of the type multipolygon.
This is the code I am using to merge all the geometries from one of the tables:
INSERT INTO traitements.zones_traitements
( définition, geom )
VALUES
(
'first-level merge',
(
SELECT ST_Multi(ST_Collect(dumpedGeometries)) AS singleMultiGeometry
FROM
(
SELECT ST_Force2D((ST_Dump(geom)).geom) AS dumpedGeometries
FROM donnees.one_table
) AS dumpingGeometries
)
) ;
I found that some of the geometries in some of the records are in 3D, so that's why I am using ST_Force2D.
I do this for all the tables and then merge the geometries again using:
INSERT INTO traitements.zones_traitements
( définition, geom )
VALUES
(
'second-level merge',
(
SELECT ST_Multi(ST_Collect(dumpedGeometries)) AS singleMultiGeometry
FROM
(
SELECT (ST_Dump(geom)).geom AS dumpedGeometries
FROM traitements.zones_traitements
WHERE définition != 'second-level merge'
) AS dumpingGeometries
)
) ;
As said before, these queries take several minutes but that's fine.
Now, the query that takes forever:
SELECT pkid
FROM traitements.sites_candidats AS sites
JOIN (
SELECT geom FROM traitements.zones_traitements
WHERE définition = 'zones_rédhibitoires' ) AS zones
ON ST_Contains(zones.geom , sites.geom)
LIMIT 50;
Analysing the problem:
Obviously, it is the subquery selecting the points that takes a lot of time, not the update.
So I have run an EXPLAIN (ANALYZE, BUFFERS) on the query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT pkid
FROM traitements.sites_candidats AS sites
JOIN (
SELECT geom FROM traitements.zones_traitements
WHERE définition = 'second_level_merge' ) AS zones
ON ST_Contains(zones.geom , sites.geom)
LIMIT 10;
---------------------------------
"Limit (cost=4.18..20.23 rows=1 width=22) (actual time=6052.069..4393634.244 rows=10 loops=1)"
" Buffers: shared hit=1 read=688784"
" -> Nested Loop (cost=4.18..20.23 rows=1 width=22) (actual time=6052.068..4391938.803 rows=10 loops=1)"
" Buffers: shared hit=1 read=688784"
" -> Seq Scan on zones_traitements (cost=0.00..1.23 rows=1 width=54939392) (actual time=0.016..0.016 rows=1 loops=1)"
" Filter: (("définition")::text = 'zones_rédhibitoires'::text)"
" Rows Removed by Filter: 17"
" Buffers: shared hit=1"
" -> Bitmap Heap Scan on sites_candidats sites (cost=4.18..19.00 rows=1 width=54) (actual time=6052.044..4391260.053 rows=10 loops=1)"
" Recheck Cond: (zones_traitements.geom ~ geom)"
" Filter: _st_contains(zones_traitements.geom, geom)"
" Heap Blocks: exact=1"
" Buffers: shared read=688784"
" -> Bitmap Index Scan on "sites_candidats_geométrie" (cost=0.00..4.18 rows=4 width=0) (actual time=23.284..23.284 rows=3720 loops=1)"
" Index Cond: (zones_traitements.geom ~ geom)"
" Buffers: shared read=51"
"Planning time: 91.967 ms"
"Execution time: 4399271.394 ms"
I am not sure how to read this output.
Nevertheless, I suspect that the query is so slow because of the geometry obtained by merging all these multipolygons into a single one.
Questions:
Would that work better using a different type of geometry to merge the others, like a GeometryCollection?
How do the indexes work in this case?
Is there anything more efficient than ST_Contains()?
Let's see. First off, you should ask GIS-specific questions over at GIS StackExchange. But I'll try to help here:
Technically, your geometry column definition is correct, and using 'primitives' (e.g. POINT, LINE, POLYGON and their MULTIs) is favorable over GEOMETRYCOLLECTIONs. However, it is almost always the better choice to run spatial relation functions on as small a geometry as possible; for most of those functions, PostGIS has to check each and every vertex of the input geometries against each other (so in this case, it has to traverse the polygon's millions of vertices once for each point to be checked in ST_Contains). PostGIS will in fact fire up a bbox comparison prior to the relation checks (if an index is present on both geometries) to limit the possible matches, effectively speeding up the check by several orders of magnitude; this is rendered useless here. (I would almost recommend actually dumping the MULTIs into simple POLYGONs, but not without knowing your data.)
Why are you dumping the MULTI geometries just to collect them back into MULTIs? If your source tables' geometries are actually stored as MULTIPOLYGONs (and hopefully for good reason), simply copy them into the intermediate table, with ST_Force2D applied to the MULTIs and ST_IsValid in the WHERE clause (you can try ST_MakeValid on the geometries, but there is no guarantee it will work). Once you have inserted all tables into the zones_traitements table, run VACUUM ANALYZE and REINDEX to actually make use of the index!
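A sketch of that copy step, using the table names from the question (one such insert per source table; note that the unique index on (définition) from the question would have to be dropped, or the labels made distinct, for this to work):
insert into traitements.zones_traitements (définition, geom)
select 'donnees.one_table', st_force2d(geom)
from donnees.one_table
where st_isvalid(geom); -- or try st_makevalid(geom), with no guarantee it stays a MultiPolygon
-- after all source tables have been copied in:
vacuum analyze traitements.zones_traitements;
reindex table traitements.zones_traitements;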
In your 'second merge' query... are you simply adding the 'merged' geometries to the existing ones in the table? Don't, that's just wrong. It messes up table statistics and the index and is quite unnecessary overhead. You could do these things within your query, but it's not necessary here.
Keep in mind that geometries of different types or extents created or derived by or within queries can neither have an index nor use the initial one. This applies to your 'merging' queries!
Then run
SELECT pkid
FROM traitements.sites_candidats AS sites
JOIN traitements.zones_traitements AS zones
ON ST_Intersects(zones.geom, sites.geom)
to return one pkid for every intersection with a zone, so that if one point intersects two MULTIPOLYGONs, you'll get two rows for that point. Use SELECT DISTINCT pkid ... to get only one row per pkid that intersects any zone. (Note: I used ST_Intersects because that should imply one less check on the relation. If you absolutely need ST_Contains, just replace it.)
Hope this helps. If not, say a word.
Again, thanks.
I had come to the same conclusion as your advice: that, instead of merging all the thousands of multipolygons into a single huge one, whose bbox is far too large, it would be more efficient to decompose all the multipolygons into simple polygons using ST_Dump and insert these into a dedicated table with an appropriate index.
Nevertheless, to do this, I first had to correct the geometries: certain multipolygons indeed had invalid geometries. ST_MakeValid would make about 90% of them valid as multipolygons, but the rest were transformed into either GeometryCollections or MultiLineStrings. To correct these, I used ST_Buffer with a buffer of 0.01 metre, the result of which is a correct multipolygon.
Once this was done, all my multipolygons were valid and I could dump them into simple polygons.
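In code, the cleanup and dump described above could look roughly like this sketch (the table names and the CASE fallback are a reconstruction of the steps described, not the exact statements used):
-- 1) repair invalid multipolygons; fall back to the 0.01 m buffer where st_makevalid changes the type
create table traitements.zones_valides as
select pkid,
       case
           when st_isvalid(geom) then geom
           when geometrytype(st_makevalid(geom)) in ('POLYGON', 'MULTIPOLYGON') then st_makevalid(geom)
           else st_buffer(geom, 0.01)
       end as geom
from traitements.zones_traitements;
-- 2) dump into simple polygons in a dedicated, indexed table
create table traitements.zones_polygones as
select (st_dump(geom)).geom as geom
from traitements.zones_valides;
create index on traitements.zones_polygones using gist (geom);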
Doing this, I reduced the search time by a factor of +/- 5000 !
:D

Optimize "= any" operator using index [duplicate]

I can't find a definite answer to this question in the documentation. If a column is an array type, will all the entered values be individually indexed?
I created a simple table with one int[] column, and put a unique index on it. I noticed that I couldn't add the same array of ints, which leads me to believe the index is a composite of the array items, not an index of each item.
INSERT INTO "Test"."Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test"."Test" VALUES ('{10, 20, 30}');
SELECT * FROM "Test"."Test" WHERE 20 = ANY ("Column1");
Is the index helping this query?
Yes, you can index an array, but you have to use the array operators and the GIN index type.
Example:
CREATE TABLE "Test"("Column1" int[]);
INSERT INTO "Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test" VALUES ('{10, 20, 30}');
CREATE INDEX idx_test on "Test" USING GIN ("Column1");
-- To enforce index usage because we have only 2 records for this test...
SET enable_seqscan TO off;
EXPLAIN ANALYZE
SELECT * FROM "Test" WHERE "Column1" #> ARRAY[20];
Result:
Bitmap Heap Scan on "Test" (cost=4.26..8.27 rows=1 width=32) (actual time=0.014..0.015 rows=2 loops=1)
Recheck Cond: ("Column1" #> '{20}'::integer[])
-> Bitmap Index Scan on idx_test (cost=0.00..4.26 rows=1 width=0) (actual time=0.009..0.009 rows=2 loops=1)
Index Cond: ("Column1" #> '{20}'::integer[])
Total runtime: 0.062 ms
Note
it appears that in many cases the gin__int_ops option is required
create index <index_name> on <table_name> using GIN (<column> gin__int_ops)
I have not yet seen a case where it would work with the && and @> operators without the gin__int_ops option
@Tregoreg raised a question in the comment to his offered bounty:
I didn't find the current answers working. Using GIN index on
array-typed column does not increase the performance of ANY()
operator. Is there really no solution?
@Frank's accepted answer tells you to use array operators, which is still correct for Postgres 11. The manual:
... the standard distribution of PostgreSQL includes a GIN operator
class for arrays, which supports indexed queries using these
operators:
<@
@>
=
&&
The complete list of built-in operator classes for GIN indexes in the standard distribution is here.
In Postgres indexes are bound to operators (which are implemented for certain types), not data types alone or functions or anything else. That's a heritage from the original Berkeley design of Postgres and very hard to change now. And it's generally working just fine. Here is a thread on pgsql-bugs with Tom Lane commenting on this.
Some PostGIS functions (like ST_DWithin()) seem to violate this principle, but that is not so. Those functions are rewritten internally to use the respective operators.
The indexed expression must be to the left of the operator. For most operators (including all of the above) the query planner can achieve this by flipping operands if you place the indexed expression to the right - given that a COMMUTATOR has been defined. The ANY construct can be used in combination with various operators and is not an operator itself. When used as constant = ANY (array_expression) only indexes supporting the = operator on array elements would qualify and we would need a commutator for = ANY(). GIN indexes are out.
Postgres is not currently smart enough to derive a GIN-indexable expression from it. For starters, constant = ANY (array_expression) is not completely equivalent to array_expression #> ARRAY[constant]. Array operators return an error if any NULL elements are involved, while the ANY construct can deal with NULL on either side. And there are different results for data type mismatches.
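So, to actually hit the GIN index, the query from the question has to be rewritten with the array containment operator (keeping the NULL and data-type caveats above in mind):
-- instead of: SELECT * FROM "Test"."Test" WHERE 20 = ANY ("Column1");
SELECT * FROM "Test"."Test" WHERE "Column1" @> ARRAY[20];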
Related answers:
Check if value exists in Postgres array
Index for finding an element in a JSON array
SQLAlchemy: how to filter on PgArray column types?
Can IS DISTINCT FROM be combined with ANY or ALL somehow?
Asides
When working with integer arrays (int4, not int2 or int8) without NULL values (like your example implies), consider the additional module intarray, which provides specialized, faster operators and index support (a minimal sketch follows the links below). See:
How to create an index for elements of an array in PostgreSQL?
Compare arrays for equality, ignoring order of elements
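A minimal sketch of the intarray variant (assuming the extension is available; the index name is made up):
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE INDEX idx_test_intarray ON "Test" USING GIN ("Column1" gin__int_ops);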
As for the UNIQUE constraint in your question that went unanswered: That's implemented with a btree index on the whole array value (like you suspected) and does not help with the search for elements at all. Details:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
It's now possible to index the individual array elements. For example:
CREATE TABLE test (foo int[]);
INSERT INTO test VALUES ('{1,2,3}');
INSERT INTO test VALUES ('{4,5,6}');
CREATE INDEX test_index on test ((foo[1]));
SET enable_seqscan TO off;
EXPLAIN ANALYZE SELECT * from test WHERE foo[1]=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using test_index on test (cost=0.00..8.27 rows=1 width=32) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: (foo[1] = 1)
Total runtime: 0.112 ms
(3 rows)
This works on at least Postgres 9.2.1. Note that you need to build a separate index for each array position; in my example I only indexed the first element.

Postgres optimize query index based on aggregates

I have the following Query/View:
CREATE OR REPLACE VIEW "SumAndSalesPerCountryYear" AS
SELECT date_part('year'::text, "Invoice"."InvoiceDate") AS year,
"Invoice"."BillingCountry" AS country,
sum("Invoice"."Total") AS total
FROM "Invoice"
GROUP BY date_part('year'::text, "Invoice"."InvoiceDate"), "Invoice"."BillingCountry"
ORDER BY date_part('year'::text, "Invoice"."InvoiceDate") DESC, sum("Invoice"."Total") DESC;
My table structure is as follows:
CREATE TABLE "Invoice"
(
"InvoiceId" integer NOT NULL,
"CustomerId" integer NOT NULL,
"InvoiceDate" timestamp without time zone NOT NULL,
"BillingAddress" character varying(70),
"BillingCity" character varying(40),
"BillingState" character varying(40),
"BillingCountry" character varying(40),
"BillingPostalCode" character varying(10),
"Total" numeric(10,2) NOT NULL,
CONSTRAINT "PK_Invoice" PRIMARY KEY ("InvoiceId"),
CONSTRAINT "FK_InvoiceCustomerId" FOREIGN KEY ("CustomerId")
REFERENCES "Customer" ("CustomerId") MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
OIDS=FALSE
);
The current execution plan is
Sort (cost=33.65..34.54 rows=354 width=21) (actual time=0.691..0.698 rows=101 loops=1)
Sort Key: (date_part('year'::text, "Invoice"."InvoiceDate")), (sum("Invoice"."Total"))
Sort Method: quicksort Memory: 32kB
-> HashAggregate (cost=14.24..18.67 rows=354 width=21) (actual time=0.540..0.567 rows=101 loops=1)
-> Seq Scan on "Invoice" (cost=0.00..11.15 rows=412 width=21) (actual time=0.015..0.216 rows=412 loops=1)
Total runtime: 0.753 ms
My task is to optimize the query by using indexes; however, I cannot think of a way to use indexes to optimize aggregate results.
You can try to penalize the hash aggregate with "SET enable_hashagg TO off", but for small data there will probably not be any benefit from an index. In this use case a hash aggregate is usually the fastest method for aggregation, and sorting 32kB is pretty quick.
But .. you are trying to do a performance benchmark on a table with 412 rows. That makes little sense. Thinking about performance only makes sense on data sized for 2-3 years of production usage.
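If you want to experiment anyway, a sketch of such a test could look like this (the expression index simply mirrors the view's GROUP BY; the index name is made up, and as said above, don't expect a win at 412 rows):
-- expression index matching the GROUP BY of the view
CREATE INDEX "Invoice_year_country_idx"
    ON "Invoice" (date_part('year'::text, "InvoiceDate"), "BillingCountry");
-- discourage the hash aggregate for this session only, then compare plans
SET enable_hashagg TO off;
EXPLAIN ANALYZE SELECT * FROM "SumAndSalesPerCountryYear";
RESET enable_hashagg;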
As noted by Pavel, Ramfjord, and horse, using an index is of little use with such a tiny amount of data. It's so small that it's faster for Postgres to read a disk page or two and process everything in memory.
Further, you have the best possible plan for your query already. You're asking Postgres to compute an aggregate over an entire table and returning it in a certain order. Postgres proceeds by computing the aggregate in memory without bothering to sort the data first, by assigning intermediary results using a hash; it then sorts the small number of results according to your criteria.

PostgreSQL query optimization: no inner/outer join allowed

I am given this query to optimize on PostgreSQL 9.2:
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON(C.id = I.category)
INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name
As part of my school assignment.
I have created these indexes on the respective tables: items(category) --> secondary B+tree, bids(item_id) --> secondary B+tree, and categories(id) --> primary index.
The weird part is that PostgreSQL is doing a sequential scan on my Items, Categories, and Bids tables, and when I set enable_seqscan=off, the index search turns out to be even more horrendous than the result below.
When I run explain in PostgreSQL this is the result: PLEASE DON'T REMOVE THE INDENTATIONS AS THEY ARE IMPORTANT!
GroupAggregate (cost=119575.55..125576.11 rows=20 width=23) (actual time=6912.523..9459.431 rows=20 loops=1)
Buffers: shared hit=30 read=12306, temp read=6600 written=6598
-> Sort (cost=119575.55..121075.64 rows=600036 width=23) (actual time=6817.015..8031.285 rows=600036 loops=1)
Sort Key: c.name
Sort Method: external merge Disk: 20160kB
Buffers: shared hit=30 read=12306, temp read=6274 written=6272
-> Hash Join (cost=9416.95..37376.03 rows=600036 width=23) (actual time=407.974..3322.253 rows=600036 loops=1)
Hash Cond: (b.item_id = i.id)
Buffers: shared hit=30 read=12306, temp read=994 written=992
-> Seq Scan on bids b (cost=0.00..11001.36 rows=600036 width=8) (actual time=0.009..870.898 rows=600036 loops=1)
Buffers: shared hit=2 read=4999
-> Hash (cost=8522.95..8522.95 rows=50000 width=19) (actual time=407.784..407.784 rows=50000 loops=1)
Buckets: 4096 Batches: 2 Memory Usage: 989kB
Buffers: shared hit=28 read=7307, temp written=111
-> Hash Join (cost=1.45..8522.95 rows=50000 width=19) (actual time=0.082..313.211 rows=50000 loops=1)
Hash Cond: (i.category = c.id)
Buffers: shared hit=28 read=7307
-> Seq Scan on items i (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.004..144.554 rows=50000 loops=1)
Buffers: shared hit=27 read=7307
-> Hash (cost=1.20..1.20 rows=20 width=19) (actual time=0.062..0.062 rows=20 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=1
-> Seq Scan on categories c (cost=0.00..1.20 rows=20 width=19) (actual time=0.004..0.028 rows=20 loops=1)
Buffers: shared hit=1
Total runtime: 9473.257 ms
See this plan on explain.depesz.com.
I just want to know why this happens, i.e. why indexes make the query horrendously slow compared to sequential scan.
Edit:
I think I have managed to uncover a couple of things by going through the PostgreSQL documentation.
PostgreSQL decided to do a seq scan on some tables such as bids and items because it predicted it had to retrieve every single row in the table (compare the number of rows in the brackets before the actual time with the number of rows in the actual-time part). A sequential scan is better at retrieving all rows. Well, nothing can be done about that part.
I have created an extra index on categories(name), and the result below is what I get. It somehow improved, but now the hash join is replaced with a nested loop. Any clue why?
GroupAggregate (cost=0.00..119552.02 rows=20 width=23) (actual time=617.330..7725.314 rows=20 loops=1)
Buffers: shared hit=178582 read=37473 written=14, temp read=2435 written=436
-> Nested Loop (cost=0.00..115051.55 rows=600036 width=23) (actual time=0.120..6186.496 rows=600036 loops=1)
Buffers: shared hit=178582 read=37473 written=14, temp read=2109 written=110
-> Nested Loop (cost=0.00..26891.55 rows=50000 width=19) (actual time=0.066..2827.955 rows=50000 loops=1)
Join Filter: (c.id = i.category)
Rows Removed by Join Filter: 950000
Buffers: shared hit=2 read=7334 written=1, temp read=2109 written=110
-> Index Scan using categories_name_idx on categories c (cost=0.00..12.55 rows=20 width=19) (actual time=0.039..0.146 rows=20 loops=1)
Buffers: shared hit=1 read=1
-> Materialize (cost=0.00..8280.00 rows=50000 width=8) (actual time=0.014..76.908 rows=50000 loops=20)
Buffers: shared hit=1 read=7333 written=1, temp read=2109 written=110
-> Seq Scan on items i (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.007..170.464 rows=50000 loops=1)
Buffers: shared hit=1 read=7333 written=1
-> Index Scan using bid_itemid_idx on bids b (cost=0.00..1.60 rows=16 width=8) (actual time=0.016..0.036 rows=12 loops=50000)
Index Cond: (item_id = i.id)
Buffers: shared hit=178580 read=30139 written=13
Total runtime: 7726.392 ms
Have a look at the plan here to see if it is better.
I have managed to reduce the cost to 114062.92 by creating indexes on categories(id) and items(category). PostgreSQL used both of those indexes to get to the 114062.92 cost.
However, now PostgreSQL is playing games with me by not using the index! Why is it so buggy?
Thank you for posting EXPLAIN output without being asked, and for the EXPLAIN (BUFFERS, ANALYZE).
A significant part of your query's performance issue is likely to be the outer sort plan node, which is doing an on-disk merge sort with a temporary file:
Sort Method: external merge Disk: 20160kB
You could do this sort in memory by setting:
SET work_mem = '50MB';
before running your query. This setting can also be set per-user, per-database or globally in postgresql.conf.
I'm not convinced that adding indexes will be of much benefit as the query is currently structured. It needs to read and join all rows from all three tables, and hash joins are likely to be the fastest way to do so.
I suspect there are other ways to express that query that will use entirely different and more efficient execution strategies, but I'm having a brain-fade about what they might be and don't want to spend the time to make up dummy tables to play around. More work_mem should significantly improve the query as it stands.
From the query plans we can see that:
1. The result and the categories table have 20 records.
2. Items with a category are 5% of all items ("Rows Removed by Join Filter: 950000"; "rows=50000" in the sequential scan).
3. Bids matched: rows=600036 (could you give us the total number of bids?)
4. Does every category have a bid?
So we want to use indexes on items(category) and bids(item_id).
We also want the sort to fit in memory.
select
(select name from Categories where id = foo.category) as name,
count(foo.id),
sum(foo.bids_count)
from
(select
id,
category,
(select count(item_id) from Bids where item_id = i.id) as bids_count
from Items i
where category in (select id from Categories)
and exists (select 1 from Bids where item_id = i.id)
) as foo
group by foo.category
order by name
Of course, you have to remember that this strictly depends on the data in points 1 and 2.
If 4 is true, you can remove the exists part from the query.
Any advice or ideas?
Note that if the size of bids is systematically and significantly larger than items then it may actually be cheaper to traverse items twice (especially so if items fits in RAM) than to pick off those distinct item IDs from the join result (even if you sort in-memory). Moreover, depending on how the Postgres pipeline happens to pull data from the duplicate tables, there may be limited penalty even under adverse load or memory conditions (this would be a good question that you could ask on pgsql-general.) Take:
SELECT name, IC.cnt, BC.cnt FROM
Categories C,
( SELECT category, count(1) cnt from Items I GROUP BY category ) IC,
( SELECT category, count(1) cnt from Bids B INNER JOIN Items I ON (I.id = B.item_id) GROUP BY category ) BC
WHERE IC.category=C.id AND BC.category=C.id;
How much cheaper? At least 4x given sufficient caching, i.e. 610ms vs. 2500ms (in-memory sort) with 20 categories, 50k items and 600k bids, and still more than 2x faster after a filesystem cache flush on my machine.
Note that the above is not a direct substitute for your original query; for one, it assumes that there is a 1:1 mapping between category IDs and names (which could turn out to be a very reasonable assumption; if not, simply SUM(BC.cnt) and SUM(IC.cnt) as you GROUP BY name), but more importantly the per-category item count includes items that have no bids, unlike your original INNER JOIN. If you need counts of only the items that have bids, you can add WHERE EXISTS (SELECT 1 FROM Bids B WHERE B.item_id = I.id) to the IC subquery; this will traverse Bids a second time as well (in my case that added a ~200ms penalty to the existing ~600ms plan, still well under 2400ms).
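For concreteness, the whole query with that EXISTS filter added to the IC subquery would read:
SELECT name, IC.cnt, BC.cnt FROM
  Categories C,
  ( SELECT category, count(1) cnt
      FROM Items I
     WHERE EXISTS (SELECT 1 FROM Bids B WHERE B.item_id = I.id)
     GROUP BY category ) IC,
  ( SELECT category, count(1) cnt
      FROM Bids B INNER JOIN Items I ON (I.id = B.item_id)
     GROUP BY category ) BC
WHERE IC.category = C.id AND BC.category = C.id;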
