I'm currently researching how to create indexes for our tables.
I found out about multicolumn indexes, but I'm not sure about their impact.
Example:
We have queries for findById, findByIdAndStatus, and findByResult.
The guidance I read says that the column used most often in WHERE clauses should be listed first in the column list. But I was wondering whether it would have a huge impact if I instead created separate indexes for the different WHERE-clause combinations.
This: (creating one index for all)
CREATE INDEX CONCURRENTLY ON Students (id, status, result)
vs.
This: (creating different indexes on different queries)
CREATE INDEX CONCURRENTLY ON Students (id)
CREATE INDEX CONCURRENTLY ON Students (id, status)
CREATE INDEX CONCURRENTLY ON Students (result)
Thank you so much in advance!
Creating one index for all queries and creating different indexes per query will have a completely different impact on the queries.
You can use EXPLAIN to see whether the indexes are actually used by your queries.
This video is really good for learning about DB indexes.
The index CREATE INDEX CONCURRENTLY ON Students (id, status, result) will be used only if the query filters on id, on (id, status), or on (id, status, result) in the WHERE clause. A query with only status in the WHERE clause will not use this index at all.
B-tree indexes are balanced trees. A multicolumn index orders rows by id first; rows with equal id are then further ordered by status, and then by result, and so on.
You can see that in this index there is no ordering by status on its own. It only exists within rows that have already been ordered by id.
Do have a look at the video, it explains all this pretty well.
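To make that concrete, here is a minimal sketch you could run yourself (the literal values are made up, and on a tiny or unanalyzed table the planner may still prefer a sequential scan):

CREATE INDEX CONCURRENTLY ON Students (id, status, result);
ANALYZE Students;

-- Can use the combined index: the leading column id is in the WHERE clause.
EXPLAIN SELECT * FROM Students WHERE id = 42 AND status = 'ACTIVE';

-- Cannot use it efficiently: result is not a leading column of that index,
-- so a query like this needs its own index, e.g. ON Students (result).
EXPLAIN SELECT * FROM Students WHERE result = 'PASS';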
The rule of thumb you read is wrong.
A better rule is: create such an index only if it is useful and gets used often enough that it is worth the performance hit on data modification that comes with every index.
A multi-column B-tree index on (a, b, c) is useful in several cases:
if the query looks like this:
SELECT ... FROM tab
WHERE a = $1 AND b = $2 AND c <operator> $3
where <operator> is an operator supported by the index and $1, $2 and $3 are constants.
if the query looks like this:
SELECT ... FROM tab
WHERE a = $1 AND b = $2
ORDER BY c;
or like this
SELECT ... FROM tab
WHERE a = $1
ORDER BY b, c;
Any ASC/DESC decorations in the ORDER BY clause must be reflected in the CREATE INDEX statement. For example, for ORDER BY b, c DESC the index must be created on (a, b, c DESC) or on (a, b DESC, c), since indexes can be read in both directions.
if the query looks like this:
SELECT c
FROM tab
WHERE a = $1 AND b <operator> $2;
If the table has recently been VACUUMed, this can get you an index-only scan, because all the required information is in the index.
In recent PostgreSQL versions, such an index is better created as
CREATE INDEX ON tab (a, b) INCLUDE (c);
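For completeness, a small self-contained sketch (with made-up table contents) that should show an Index Only Scan in the EXPLAIN output once the VACUUM has run:

CREATE TABLE tab (a int, b int, c int);
INSERT INTO tab SELECT g, g % 100, g % 7 FROM generate_series(1, 100000) g;
CREATE INDEX ON tab (a, b) INCLUDE (c);
VACUUM ANALYZE tab;

-- All referenced columns (a, b, c) are in the index, so an index-only scan is possible:
EXPLAIN SELECT c FROM tab WHERE a = 42 AND b > 10;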
Related
I want to index an array column with either GIN or GiST. The fact that GIN is slower in insert/update operations, however, made me wonder whether it would have any impact on performance, even though the indexed column itself will remain static.
So, assuming that, for instance, I have a table with columns (A, B, C) and that B is indexed, does the index get updated if I update only column C?
It depends :^)
Normally, PostgreSQL will have to modify the index, even if nothing changes in the indexed column, because an UPDATE in PostgreSQL creates a new row version, so you need a new index entry to point to the new location of the row in the table.
Since this is unfortunate, there is an optimization called “HOT update”: If none of the indexed columns are modified and there is enough free space in the block that contains the original row, PostgreSQL can create a “heap-only tuple” that is not referenced from the outside and therefore does not require a new index entry.
You can lower the fillfactor on the table to increase the likelihood for HOT updates.
For details, you may want to read my article on the topic.
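As an illustration of the fillfactor suggestion (a sketch with a made-up table name; n_tup_hot_upd in pg_stat_user_tables counts HOT updates):

-- Leave 30% of each block free so updated rows can stay on the same page:
ALTER TABLE mytable SET (fillfactor = 70);
VACUUM FULL mytable;  -- rewrite the table so the new fillfactor takes effect (takes an exclusive lock)

-- Later, compare total updates with HOT updates:
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'mytable';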
Laurenz Albe's answer is great. The following part is my interpretation.
GIN's array_ops cannot support an index-only scan, which means that even if you only query the array column, you can only get a bitmap index scan. For the bitmap heap scan that follows, a low fillfactor means you probably don't need to visit extra pages.
demo:
begin;
create table test_gin_update(cola int, colb int[]);
insert into test_gin_update values (1,array[1,2]);
insert into test_gin_update values (1,array[1,2,3]);
insert into test_gin_update(cola, colb) select g, array[g, g + 1] from generate_series(10, 10000) g;
commit;
For example, select colb from test_gin_update where colb = array[1,2]; produces the following query plan.
Because GIN cannot distinguish array[1,2] from array[1,2,3], even after creating the GIN index (create index on test_gin_update using gin(colb array_ops);) we can only get a bitmap index scan, with the heap recheck removing the false match.
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan on test_gin_update (actual rows=1 loops=1)
   Recheck Cond: (colb = '{1,2}'::integer[])
   Rows Removed by Index Recheck: 1
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on test_gin_update_colb_idx (actual rows=2 loops=1)
         Index Cond: (colb = '{1,2}'::integer[])
(6 rows)
I have a Postgres 10 database in my Flask app. I'm trying to paginate filtered results on a table with millions of rows. The problem is that the paginate method counts the total number of query results in a very inefficient way.
Here's an example with a dummy filter:
paginate = Buildings.query.filter(Buildings.height > 10).paginate(1, 10)
Under the hood it performs two queries:
SELECT * FROM buildings where height > 10
SELECT count(*) FROM (
SELECT * FROM buildings where height > 10
)
The count returns about 200,000 rows.
The problem is that the count on the raw SELECT without the subquery is quite fast (~30 ms), but the paginate method wraps it in a subquery that takes ~30 s.
The query plan on a cold database:
Is there a way to use the default paginate method from flask-sqlalchemy in a performant way?
EDIT:
To give a better understanding of my problem, here are the real filter operations used in my case, but with dummy field names:
paginate = Buildings.query.filter_by(owner_id=None).filter(Buildings.address.like('%A%')).paginate(1, 10)
So the SQL the ORM produces is:
SELECT count(*) AS count_1
FROM (SELECT foo_column, [...]
FROM buildings
WHERE buildings.owner_id IS NULL AND buildings.address LIKE '%A%' ) AS anon_1
That query is already covered by indexes created with:
CREATE INDEX ix_trgm_buildings_address ON public.buildings USING gin (address gin_trgm_ops);
CREATE INDEX ix_buildings_owner_id ON public.buildings USING btree (owner_id)
The problem is just this count function, that's very slow.
So it looks like a disk-reading problem. The solutions would be to get faster disks, to get more RAM so it can all be cached, or, if you already have enough RAM, to use pg_prewarm to get all the data into the cache ahead of need. Or try increasing effective_io_concurrency, so that the bitmap heap scan can have more than one I/O request outstanding at a time.
Your actual query seems to be more complex than the one you show, based on the Filter: entry and on the Rows Removed by Index Recheck: entry in combination with the lack of lossy blocks. There might be some other things to try, but we would need to see the real query and the index definition (which apparently is not just an ordinary B-tree index on "height").
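If you want to try the prewarming and I/O-concurrency suggestions, a hedged sketch might look like this (names taken from the question; pick values that suit your hardware):

-- Load the table and the trigram index into the buffer cache ahead of need:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('buildings');
SELECT pg_prewarm('ix_trgm_buildings_address');

-- Allow the bitmap heap scan to keep several I/O requests in flight (session level here):
SET effective_io_concurrency = 32;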
I have a table with 3 columns: "Id", "A", "B".
All of them are searchable. Id is an identity column and is used only to look up exact rows, so that one is clear. But I have doubts about "A" and "B". I have 3 cases to search in my application: search by "A", search by "B", and search by "A" and "B" simultaneously. So I'm not sure which index type to choose. Should I use two single-column indexes or one multi-column index? Or maybe it's better to combine single-column indexes with a multi-column one (3 indexes in total)? I don't really care about INSERT/UPDATE/DELETE duration; my top priority is to make SELECT as fast as possible.
I use SQL Server 2017.
Thank you.
I think two additional indexes will be enough:
CREATE INDEX IDX_YourTable_AB ON YourTable(A, B) -- put the column with more distinct values first
CREATE INDEX IDX_YourTable_B ON YourTable(B) INCLUDE(A)
If you have other columns in this table that your queries need, you can add them as included columns:
CREATE INDEX IDX_YourTable_AB ON YourTable(A,B) INCLUDE(C,D,E,...)
CREATE INDEX IDX_YourTable_B ON YourTable(B) INCLUDE(A,C,D,E,...)
Index IDX_YourTable_AB might be used for conditions such as WHERE A='...', WHERE A='...' AND B='...', or WHERE A LIKE '...%' AND B='...', i.e. whenever column A alone or columns A and B are used.
Index IDX_YourTable_B might be used for conditions on column B only (WHERE B='...' or WHERE B LIKE '...%').
Also try testing CREATE INDEX IDX_YourTable_BA ON YourTable(B, A) instead of CREATE INDEX IDX_YourTable_B ON YourTable(B) INCLUDE(A). Maybe it will be better.
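For illustration, these are the kinds of queries each index is meant to serve (table name and values are placeholders):

-- Served by IDX_YourTable_AB (seek on A, or on A and B):
SELECT * FROM YourTable WHERE A = 'x';
SELECT * FROM YourTable WHERE A = 'x' AND B = 'y';

-- Served by IDX_YourTable_B (seek on B alone; A is available from the INCLUDE):
SELECT A, B FROM YourTable WHERE B = 'y';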
I'm using the lsqlite3 lua wrapper and I'm making queries into a database. My DB has ~5million rows and the code I'm using to retrieve rows is akin to:
db = lsqlite3.open('mydb')
local temp = {}
local sql = "SELECT A,B FROM tab where FOO=BAR ORDER BY A DESC LIMIT N"
for row in db:nrows(sql) do temp[row['A']] = row['B'] end
As you can see, I'm trying to get the top N rows sorted in descending order by A (I want to sort first and then apply the LIMIT, not the other way around). I indexed the column A but it doesn't seem to make much of a difference. How can I make this faster?
You need to index the column on which you filter (i.e. the one in the WHERE clause). The reason is that ORDER BY comes into play after filtering, not the other way around.
So you probably should create an index on FOO.
Can you post your table schema?
UPDATE
Also you can increase the sqlite cache, e.g.:
PRAGMA cache_size=100000
You can adjust this depending on the memory available and the size of your database.
UPDATE 2
If you want to have a better understanding of how your query is handled by sqlite, you can ask it to provide the query plan:
http://www.sqlite.org/eqp.html
UPDATE 3
I did not understand your context properly in my initial answer. If you ORDER BY over a large data set, you probably want the index on the ORDER BY column to be used, not the one on the filter column, and you can tell sqlite not to use the latter by prefixing the filter column with a unary +, this way:
SELECT a, b FROM foo WHERE +a > 30 ORDER BY b
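For the original query shape (filter on FOO, order by A descending), one hedged sketch is a composite index that matches both the filter and the sort, checked with EXPLAIN QUERY PLAN (this assumes FOO is compared to a plain value rather than another column):

CREATE INDEX IF NOT EXISTS idx_tab_foo_a ON tab (FOO, A DESC);

-- Ask SQLite how it plans to run the query; ideally it reports something like
-- "SEARCH tab USING INDEX idx_tab_foo_a (FOO=?)" instead of a full scan plus sort.
EXPLAIN QUERY PLAN
SELECT A, B FROM tab WHERE FOO = ? ORDER BY A DESC LIMIT 10;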
I have two tables.
In one table there are two columns, one has the ID and the other the abstracts of a document about 300-500 words long. There are about 500 rows.
The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract in table 1 and identify which of the acronyms from table 2 are present in it.
Finally, the program will create a separate table whose first column contains the ID from table 1 and whose second column contains the acronyms found in the document associated with that ID.
Can someone with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo-SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(documents.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass  # word not an acronym
This is a straightforward implementation; however, its running time is roughly cubic, because acronyms.index performs a linear search (over our largest list, no less) for every word of every abstract. We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass  # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
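If the data does end up in a database, the pseudo-SQL above could be approximated with a LIKE-based join; a rough sketch (table and column names are assumptions, and word boundaries are only handled crudely with spaces):

-- An acronym "matches" when it appears as a space-delimited word in the abstract;
-- punctuation directly next to the acronym is not handled here.
SELECT d.id, a.acronym
FROM documents AS d
JOIN acronyms AS a
  ON ' ' || d.abstract || ' ' LIKE '% ' || a.acronym || ' %';

Access and MySQL spell string concatenation differently, so the operators would need adjusting there.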
Thanks a lot for the quick response.
I assume the pseudo-SQL solution is for MySQL etc. However, it did not work in Microsoft Access.
The second and third ones are for Python, I assume. Can I feed the acronyms and documents as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])