Postgres UPDATE with JOIN: slow performance

I have the tables below and am trying to update the first table from the second one. It had been running for more than 15 minutes when I killed it.
Basically I am just trying to copy one field from one table into a field of the other. Both tables have around 2.5 million rows. How can we optimize this operation?
first table:
\d table1
Table "public.fa_market_urunu"
Column | Type | Collation | Nullable | Default
--------------+-----------------------------+-----------+----------+-----------------------
id | character varying | | not null |
ad | character varying | | |
url | character varying | | |
image_url | character varying | | |
satici_id | character varying | | not null |
satici | character varying | | not null |
category_id | character varying | | |
date_created | timestamp with time zone | | not null | now()
last_updated | timestamp(3) with time zone | | not null | now()
fiyat | double precision | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
"tbl1_satici" UNIQUE, btree (id, satici)
"tbl1_satici_id" UNIQUE, btree (satici, id)
"tbl1_satici_id_last_updated" UNIQUE, btree (satici, id, last_updated)
"tbl1_satici_id_satici_key" UNIQUE CONSTRAINT, btree (satici_id, satici)
"tbl1_satici_last_updated_id" UNIQUE, btree (satici, last_updated, id)
"tbl1_last_updated" btree (last_updated)
"tbl1_satici_category" btree (satici, category_id)
"tbl1_satici_category_last_updated" btree (satici, category_id, last_updated)
"tbl1_satici_last_updated" btree (satici, last_updated)
second table:
\d table2
Table "public.temp_son_fiyat"
Column | Type | Collation | Nullable | Default
---------+-------------------+-----------+----------+---------
urun_id | character varying | | |
satici | character varying | | |
fiyat | double precision | | |
Indexes:
"ind_u" UNIQUE, btree (urun_id, satici)
My operation:
UPDATE table1 mu
SET fiyat = fn.fiyat
FROM table2 AS fn
WHERE mu.satici_id = fn.urun_id AND mu.satici = fn.satici;

This happens because of the indexes. In PostgreSQL every UPDATE writes a new version of the row regardless of which column changed, so unless the update qualifies as a HOT update, every index on the table gets a new entry for each updated row; with ten indexes and roughly 2.5 million rows that is a lot of index maintenance. To make it faster, drop the indexes before the update and recreate them afterwards, or build a new table and swap it in, if either of those is possible in your situation.
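A minimal sketch of the drop-and-recreate approach, assuming the secondary indexes can be taken offline for the duration of the update. Index names and definitions are taken from the \d output above, the table name follows the UPDATE statement, and the primary key plus the tbl1_satici_id_satici_key constraint are kept; note that any uniqueness guaranteed only by a dropped index is not enforced during the window.
-- Drop the secondary indexes so the bulk update does not have to maintain them.
DROP INDEX tbl1_satici;
DROP INDEX tbl1_satici_id;
DROP INDEX tbl1_satici_id_last_updated;
DROP INDEX tbl1_satici_last_updated_id;
DROP INDEX tbl1_last_updated;
DROP INDEX tbl1_satici_category;
DROP INDEX tbl1_satici_category_last_updated;
DROP INDEX tbl1_satici_last_updated;

-- The bulk update now has to maintain at most the primary key
-- and the index backing the unique constraint.
UPDATE table1 mu
SET fiyat = fn.fiyat
FROM table2 AS fn
WHERE mu.satici_id = fn.urun_id AND mu.satici = fn.satici;

-- Recreate the dropped indexes afterwards (definitions from the \d output).
CREATE UNIQUE INDEX tbl1_satici ON table1 (id, satici);
CREATE UNIQUE INDEX tbl1_satici_id ON table1 (satici, id);
CREATE UNIQUE INDEX tbl1_satici_id_last_updated ON table1 (satici, id, last_updated);
CREATE UNIQUE INDEX tbl1_satici_last_updated_id ON table1 (satici, last_updated, id);
CREATE INDEX tbl1_last_updated ON table1 (last_updated);
CREATE INDEX tbl1_satici_category ON table1 (satici, category_id);
CREATE INDEX tbl1_satici_category_last_updated ON table1 (satici, category_id, last_updated);
CREATE INDEX tbl1_satici_last_updated ON table1 (satici, last_updated);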

Related

Postgresql 10: How to move 4 existing columns to an array column

I have the following table called client:
Table "public.client"
Column | Type | Collation | Nullable | Default
---------------------+---------+-----------+----------+------------------------------
clientid | integer | | not null | generated always as identity
account_name | text | | not null |
last_name | text | | |
first_name | text | | |
address | text | | not null |
suburbid | integer | | |
cityid | integer | | |
post_code | integer | | not null |
business_phone | text | | |
home_phone | text | | |
mobile_phone | text | | |
alternative_phone | text | | |
email | text | | |
quote_detailsid | integer | | |
invoice_typeid | integer | | |
payment_typeid | integer | | |
job_typeid | integer | | |
communicationid | integer | | |
accessid | integer | | |
difficulty_levelid | integer | | |
current_lawn_price | numeric | | |
square_meters | numeric | | |
note | text | | |
client_statusid | integer | | |
reason_for_statusid | integer | | |
Indexes:
"client_pkey" PRIMARY KEY, btree (clientid)
"account_name_check" UNIQUE CONSTRAINT, btree (account_name)
Foreign-key constraints:
"client_accessid_fkey" FOREIGN KEY (accessid) REFERENCES access(accessid)
"client_cityid_fkey" FOREIGN KEY (cityid) REFERENCES city(cityid)
"client_client_statusid_fkey" FOREIGN KEY (client_statusid) REFERENCES client_status(client_statusid)
"client_communicationid_fkey" FOREIGN KEY (communicationid) REFERENCES communication(communicationid)
"client_difficulty_levelid_fkey" FOREIGN KEY (difficulty_levelid) REFERENCES difficulty_level(difficulty_levelid)
"client_invoice_typeid_fkey" FOREIGN KEY (invoice_typeid) REFERENCES invoice_type(invoice_typeid)
"client_job_typeid_fkey" FOREIGN KEY (job_typeid) REFERENCES job_type(job_typeid)
"client_payment_typeid_fkey" FOREIGN KEY (payment_typeid) REFERENCES payment_type(payment_typeid)
"client_quote_detailsid_fkey" FOREIGN KEY (quote_detailsid) REFERENCES quote_details(quote_detailsid)
"client_reason_for_statusid_fkey" FOREIGN KEY (reason_for_statusid) REFERENCES reason_for_status(reason_for_statusid)
"client_suburbid_fkey" FOREIGN KEY (suburbid) REFERENCES suburb(suburbid)
Referenced by:
TABLE "work" CONSTRAINT "work_clientid_fkey" FOREIGN KEY (clientid) REFERENCES client(clientid)
I want to move all the phone columns (business_phone, home_phone, mobile_phone, alternative_phone) into one array column called phone_numbers and get rid of the four phone columns. Any idea how to do this safely without losing any records?
Add the array column.
ALTER TABLE client
ADD COLUMN phone_numbers text[];
Then use an UPDATE command to set the value of the array column based on the other four.
UPDATE client
SET phone_numbers = ARRAY[business_phone, home_phone, mobile_phone, alternative_phone]; -- test and modify if needed
You can repeat this UPDATE as many times as it takes to get it right. Then you can safely DROP the four old columns.
ALTER TABLE client
DROP COLUMN business_phone,
DROP COLUMN home_phone,
DROP COLUMN mobile_phone,
DROP COLUMN alternative_phone;
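If you would rather not keep NULL entries in the array for clients that are missing some of the phone numbers, one possible refinement (an assumption about the desired result, not part of the question) is to strip them with array_remove:
UPDATE client
SET phone_numbers = array_remove(
    ARRAY[business_phone, home_phone, mobile_phone, alternative_phone],
    NULL
); -- keeps only the phone numbers that are actually set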

Why does Postgres prefer a seq scan to a partial index with an explicit WHERE condition?

I have a simple query like select * from xxx where col is not null limit 10. I don't know why Postgres prefers a seq scan, which is much slower than the partial index (I have analyzed the table). How do I debug a problem like this?
The table has more than 4 million rows, and about 350,000 rows satisfy pid is not null.
I think there may be something wrong with the cost estimation: the cost of the seq scan is lower than that of the index scan. But how do I dig into this?
I have a guess, but I am not sure about it. The not-null rows make up about 10% of the total, so a seq scan might find 10 not-null rows after scanning roughly 100 rows, and the planner thinks that seq-scanning 100 rows is cheaper than index-scanning 10 entries and then randomly fetching the 10 full rows. Is that it?
> \d data_import
+--------------------+--------------------------+----------------------------------------------------------------------------+
| Column | Type | Modifiers |
|--------------------+--------------------------+----------------------------------------------------------------------------|
| id | integer | not null default nextval('data_import_id_seq'::regclass) |
| name | character varying(64) | |
| market_activity_id | integer | not null |
| hmsr_id | integer | not null default (-1) |
| site_id | integer | not null default (-1) |
| hmpl_id | integer | not null default (-1) |
| hmmd_id | integer | not null default (-1) |
| hmci_id | integer | not null default (-1) |
| hmkw_id | integer | not null default (-1) |
| creator_id | integer | |
| created_at | timestamp with time zone | |
| updated_at | timestamp with time zone | |
| bias | integer | |
| pid | character varying(128) | default NULL::character varying |
+--------------------+--------------------------+----------------------------------------------------------------------------+
Indexes:
"data_import_pkey" PRIMARY KEY, btree (id)
"unique_hmxx" UNIQUE, btree (site_id, hmsr_id, hmpl_id, hmmd_id, hmci_id, hmkw_id) WHERE pid IS NULL
"data_import_pid_idx" UNIQUE, btree (pid) WHERE pid IS NOT NULL
"data_import_created_at_idx" btree (created_at)
"data_import_hmsr_id" btree (hmsr_id)
"data_import_updated_at_idx" btree (updated_at)
> set enable_seqscan to false;
apollon> explain (analyse, verbose) select * from data_import where pid is not null limit 10
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
| QUERY PLAN
|-------------------------------------------------------------------------------------------------------------------------------------------------------------
| Limit (cost=0.42..5.68 rows=10 width=84) (actual time=0.059..0.142 rows=10 loops=1)
| Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid
| -> Index Scan using data_import_pid_idx on public.data_import (cost=0.42..184158.08 rows=350584 width=84) (actual time
| Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid
| Index Cond: (data_import.pid IS NOT NULL)
| Planning time: 0.126 ms
| Execution time: 0.177 ms
+-------------------------------------------------------------------------------------------------------------------------------------------------------------
EXPLAIN
Time: 0.054s
> set enable_seqscan to true;
> explain (analyse, verbose) select * from data_import where pid is not null limit 10
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Limit (cost=0.00..2.37 rows=10 width=84) (actual time=407.042..407.046 rows=10 loops=1) |
| Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid |
| -> Seq Scan on public.data_import (cost=0.00..83016.60 rows=350584 width=84) (actual time=407.041..407.045 rows=10 loops=1) |
| Output: id, name, market_activity_id, hmsr_id, site_id, hmpl_id, hmmd_id, hmci_id, hmkw_id, creator_id, created_at, updated_at, bias, pid |
| Filter: (data_import.pid IS NOT NULL) |
| Rows Removed by Filter: 3672502 |
| Planning time: 0.116 ms |
| Execution time: 407.078 ms |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
EXPLAIN
Time: 0.426s
Your problem is this line:
Rows Removed by Filter: 3672502
PostgreSQL knows the distribution of the values and how they are correlated with the physical table layout, but it does not know that the rows at the beginning of the table all have NULL for pid.
If the NULLs were evenly distributed, the sequential scan would quickly find 10 hits and stop, but as it is, it has to read 3672512 rows to find 10 matching ones.
If you add ORDER BY pid (even though you don't need it) before the LIMIT, the optimizer will do the right thing.
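Concretely, the rewritten query the answer suggests looks like this; the ORDER BY is redundant for correctness, but it makes the partial index data_import_pid_idx the obviously cheapest way to produce the first 10 rows:
SELECT *
FROM data_import
WHERE pid IS NOT NULL
ORDER BY pid   -- not needed for the result, but steers the planner to the partial index
LIMIT 10;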

How do I set 2 columns so each entry is unique against both columns?

I have a record that holds 2 license "keys" (actually GUIDs). When a request comes to our service it includes a key (GUID) in the request. I then do a query looking for a record that has this value in either the column Key1 or Key2.
The purpose of this is users will use Key1 for everything. Then they discover that Key1 has become public. So they switch to Key2 and then after 15 minutes, change the value of Key1. Now the old Key1 value is of no use.
By having the 2 keys, it allows the switch over with no downtime.
I need any key value to be unique: not that any pair of values is unique, and not just that a value in Key1 is unique across all rows' Key1, but that a new value is unique across all rows' Key1 and Key2 combined.
Is there a way to enforce this in SQL Server, or do I need to do it myself with a SELECT before doing an INSERT or UPDATE?
-------------------------------------------------------------------------------------------
| LicenseId | ApiKey1 | APiKey2 |
| 1 | af53d192-7fa3-4be0-b3d4-7efe17a397b5 | 1a87cc4a-1941-4af7-aeaa-bf9690f47eef |
| 2 | 5bbc2d06-ed6f-4444-aa22-73820dd6f3f6 | c2bdd9d9-fd47-4727-83f8-02ed0e7537e1 |
| 3 | 8acfa8b4-aa4b-41a7-9d3d-b6ba1eac838e | 30c18f2d-5d89-4e5d-8e8e-2d2b647d6ab6 |
-------------------------------------------------------------------------------------------
I need to ensure that if I create a record with LicenseId = 4 and ApiKey2 = 'af53d192-7fa3-4be0-b3d4-7efe17a397b5', the insert will fail, because that GUID is already ApiKey1 for LicenseId = 1.
The most natural way to enforce this in the database is to put all the keys in a single column. For example:
create table ApiKeys
(
LicenceId int,
KeyId int check (KeyId in (0,1)),
constraint pk_ApiKeys primary key (LicenceId,KeyId),
KeyGuid uniqueidentifier unique
)
Arguably, having both keys on the same row violates 1NF, and certainly your desire for uniqueness across the two columns strongly suggests that they belong to a single domain.
So instead of storing ApiKey1 and ApiKey2 on the same row, you store them as two separate rows. Instead of this:
-------------------------------------------------------------------------------------------
| LicenseId | ApiKey1 | APiKey2 |
| 1 | af53d192-7fa3-4be0-b3d4-7efe17a397b5 | 1a87cc4a-1941-4af7-aeaa-bf9690f47eef |
| 2 | 5bbc2d06-ed6f-4444-aa22-73820dd6f3f6 | c2bdd9d9-fd47-4727-83f8-02ed0e7537e1 |
| 3 | 8acfa8b4-aa4b-41a7-9d3d-b6ba1eac838e | 30c18f2d-5d89-4e5d-8e8e-2d2b647d6ab6 |
-------------------------------------------------------------------------------------------
You would have:
----------------------------------------------------------
| LicenseId | KeyId | ApiKey |
| 1 | 0 | af53d192-7fa3-4be0-b3d4-7efe17a397b5|
| 1 | 1 | 1a87cc4a-1941-4af7-aeaa-bf9690f47eef|
| 2 | 0 | 5bbc2d06-ed6f-4444-aa22-73820dd6f3f6|
| 2 | 1 | c2bdd9d9-fd47-4727-83f8-02ed0e7537e1|
| 3 | 0 | 8acfa8b4-aa4b-41a7-9d3d-b6ba1eac838e|
| 3 | 1 | 30c18f2d-5d89-4e5d-8e8e-2d2b647d6ab6|
----------------------------------------------------------
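A quick illustration of how the single-column design enforces the rule (a hypothetical test script, not from the original answer): the second INSERT is rejected by the UNIQUE constraint on KeyGuid, and a lookup by either key becomes a single indexed query.
-- Licence 1 gets its two keys as two rows.
INSERT INTO ApiKeys (LicenceId, KeyId, KeyGuid)
VALUES (1, 0, 'af53d192-7fa3-4be0-b3d4-7efe17a397b5'),
       (1, 1, '1a87cc4a-1941-4af7-aeaa-bf9690f47eef');

-- Trying to reuse licence 1's first key as a key for licence 4 fails
-- with a violation of the UNIQUE constraint on KeyGuid.
INSERT INTO ApiKeys (LicenceId, KeyId, KeyGuid)
VALUES (4, 1, 'af53d192-7fa3-4be0-b3d4-7efe17a397b5');

-- Look up a licence by whichever key the caller presents.
SELECT LicenceId
FROM ApiKeys
WHERE KeyGuid = 'af53d192-7fa3-4be0-b3d4-7efe17a397b5';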

Index is not being used on a partitioned table

I have tableA, which is list-partitioned almost evenly across 5 values. tableA contains 100 million rows and has a local (partitioned) index on customFunc(x). The following query does a RANGE SCAN using that index, takes about 5-10 seconds to execute, and returns about 5 million:
select count(*) from tableA where customFunc(x)='abc';
Unfortunately, when I try to execute the same query on a specific partition, it does a full table scan and takes forever:
select count(*) from tableA where customFunc(x)='abc' and partitioning_key='DT';
I completely don't understand why it works that way. Shouldn't it take advantage of partition pruning in the 2nd case?
EDIT: Adding the hint /*+ index(tableA mentionedIndex) */ solves the problem, but I still don't understand why the index is not used by default.
EDIT: XPLAN 1
Plan hash value: xxx
---------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 17 | 29335 (1)| 00:00:02 | | |
| 1 | SORT AGGREGATE | | 1 | 17 | | | | |
| 2 | PARTITION LIST ALL| | 5227K| 84M| 29335 (1)| 00:00:02 | 1 | 5 |
|* 3 | INDEX RANGE SCAN | CUSTOM_FUNC_INDEX | 5227K| 84M| 29335 (1)| 00:00:02 | 1 | 5 |
---------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access(customFunc(x)='abc')
XPLAN 2 (with partition key)
Plan hash value: yyy
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 679K (2)| 00:00:27 | | |
| 1 | SORT AGGREGATE | | 1 | 30 | | | | |
| 2 | PARTITION LIST SINGLE| | 4014K| 114M| 679K (2)| 00:00:27 | KEY | KEY |
|* 3 | TABLE ACCESS FULL | tableA | 4014K| 114M| 679K (2)| 00:00:27 | 1 | 1 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter(customFunc(x)='abc')
Shouldn't it take an advantage of partition pruning in the 2nd case?
The second query does apply partition pruning: that is what the PARTITION LIST SINGLE step means. The catch is that partition pruning only narrows the scan to one partition; the TABLE ACCESS FULL step in the second plan means all the rows of that partition are read without using an index. Consequently, the second query evaluates customFunc(x)='abc' for every row in the partition.
What is the point of creating a local index with partitioning key?
The difference is that a local index prefixed with the partitioning key will always use partition pruning, whereas with a local index that doesn't include the partitioning key the optimizer can choose whether to apply partition pruning. But if you want to run queries that don't use the partition key, then clearly you need the non-prefixed version.
Now you're right to be puzzled. Given the partition key as a predicate, the optimizer ought to have executed an INDEX RANGE SCAN against the indicated partition. Figuring out why it doesn't will require more effort on your part. It may be that your statistics are stale or that you need to gather histograms. Maybe the fact that it's a function-based index confuses the optimizer. If you have the access, or a co-operative DBA, you can use the 10053 event to look under the hood.
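As a starting point for the statistics angle mentioned above, a minimal sketch, assuming you have the privileges to regather statistics on tableA (parameter values are illustrative):
BEGIN
  -- Refresh table, partition and index statistics, letting Oracle decide
  -- where histograms are useful, so the optimizer costs the
  -- function-based index realistically.
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'TABLEA',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO',
    cascade    => TRUE
  );
END;
/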

MySQL: I want to optimize this further

So I started off with this query:
SELECT * FROM TABLE1 WHERE hash IN (SELECT id FROM temptable);
It took forever, so I ran an explain:
mysql> explain SELECT * FROM TABLE1 WHERE hash IN (SELECT id FROM temptable);
+----+--------------------+-----------------+------+---------------+------+---------+------+------------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-----------------+------+---------------+------+---------+------+------------+-------------+
| 1 | PRIMARY | TABLE1 | ALL | NULL | NULL | NULL | NULL | 2554388553 | Using where |
| 2 | DEPENDENT SUBQUERY | temptable | ALL | NULL | NULL | NULL | NULL | 1506 | Using where |
+----+--------------------+-----------------+------+---------------+------+---------+------+------------+-------------+
2 rows in set (0.01 sec)
It wasn't using an index. So, my second pass:
mysql> explain SELECT * FROM TABLE1 JOIN temptable ON TABLE1.hash=temptable.hash;
+----+-------------+-----------------+------+---------------+----------+---------+------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+------+---------------+----------+---------+------------------------+------+-------------+
| 1 | SIMPLE | temptable | ALL | hash | NULL | NULL | NULL | 1506 | |
| 1 | SIMPLE | TABLE1 | ref | hash | hash | 5 | testdb.temptable.hash | 527 | Using where |
+----+-------------+-----------------+------+---------------+----------+---------+------------------------+------+-------------+
2 rows in set (0.00 sec)
Can I do any other optimization?
You can gain some more speed by using a covering index, at the cost of extra space consumption. A covering index is one that can satisfy all of the requested columns in a query without performing a further lookup into the clustered index.
First of all, get rid of the SELECT * and explicitly select the fields that you require. Then add all the fields in the SELECT clause to the right-hand side of your composite index. For example, if your query looks like this:
SELECT first_name, last_name, age
FROM table1
JOIN temptable ON table1.hash = temptable.hash;
Then you can have a covering index that looks like this:
CREATE INDEX ix_index ON table1 (hash, first_name, last_name, age);
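To verify that the covering index is actually being used, re-run EXPLAIN after creating it; when MySQL can answer the query entirely from the index, the Extra column shows "Using index":
EXPLAIN
SELECT first_name, last_name, age
FROM table1
JOIN temptable ON table1.hash = temptable.hash;
-- Expect key = ix_index and Extra = "Using index" for the table1 row
-- once the covering index is in place.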
