ClickHouse seems quite slow

We are considering moving to ClickHouse as a time-series database for an IoT use case. We followed the best practices for creating a table designed for that, but in our tests a query is quite slow the first time it runs (cold cache) and quite fast on subsequent runs.
Our table schema is as follows:
CREATE TABLE IF NOT EXISTS feeds (
ts UInt64,
uuid UUID,
value Decimal(30, 5),
date Date MATERIALIZED toDate(round(ts/1000)) CODEC(DoubleDelta),
time DateTime MATERIALIZED toDateTime(round(ts/1000)) CODEC(DoubleDelta)
) ENGINE = ReplacingMergeTree() PARTITION BY toYYYYMM(date) PRIMARY KEY (uuid, ts) ORDER BY (uuid, ts) SETTINGS index_granularity=4096
Then to query, we execute this:
SELECT toStartOfInterval(time, INTERVAL '1 day') AS agreg,
avg(value) AS value
FROM feeds
WHERE uuid='391becbb-39fd-4205-aba1-5088c9529d42'
AND ts > 1629378660000
AND ts < 1632057060000
GROUP BY agreg
ORDER BY agreg ASC
This covers around a month of data, aggregated daily.
We get this result:
┌───────────────agreg─┬──────────────value─┐
│ 2021-08-19 00:00:00 │  43.54795698924731 │
│ 2021-08-20 00:00:00 │  33.68460122699386 │
│ 2021-08-23 00:00:00 │   26.8002874251497 │
│ 2021-08-24 00:00:00 │  45.30143348623853 │
│ 2021-08-25 00:00:00 │  43.78214139344263 │
│ 2021-08-26 00:00:00 │  35.66887477313974 │
│ 2021-08-27 00:00:00 │  87.03632541133456 │
│ 2021-08-28 00:00:00 │  87.00372191011236 │
│ 2021-08-29 00:00:00 │  87.16366666666666 │
│ 2021-08-30 00:00:00 │  87.62234215885947 │
│ 2021-08-31 00:00:00 │  87.62653798256538 │
│ 2021-09-01 00:00:00 │  87.08854251012146 │
│ 2021-09-02 00:00:00 │ 44.809177939646204 │
│ 2021-09-03 00:00:00 │  20.85095168374817 │
│ 2021-09-06 00:00:00 │ 21.086067551266584 │
│ 2021-09-07 00:00:00 │ 20.904265569917744 │
│ 2021-09-08 00:00:00 │ 20.988032407407406 │
│ 2021-09-09 00:00:00 │ 21.070418848167538 │
│ 2021-09-10 00:00:00 │ 20.900507674144038 │
│ 2021-09-14 00:00:00 │  34.17651296829971 │
│ 2021-09-15 00:00:00 │  33.79448634590377 │
│ 2021-09-16 00:00:00 │   33.7209536423841 │
│ 2021-09-17 00:00:00 │   33.5361985472155 │
└─────────────────────┴────────────────────┘
Ok. 23 rows in set. Elapsed: 6.891 sec. Processed: 0 rows, 0.0B (0 rows/s, 0.0B/s)
7 s seems quite slow for such a simple operation.
Is our approach wrong?

You don't use partitioning and the primary index.
You have 3 columns with the same temporal data. When you query both ts and time, your query processes twice as much data. Get rid of the time and date columns.
Your table is partitioned by the date column, but your WHERE clause filters on ts (AND ts > 1629378660000), so partition elimination does not work.
Use PARTITION BY toYYYYMM(toDateTime(intDiv(ts, 1000))).
Do not use round(ts/1000); use intDiv(ts, 1000).
You don't use the primary key index. The query optimizer will not work this out for you, so use GROUP BY uuid, agreg ORDER BY uuid, agreg ASC.
Also try SET optimize_aggregation_in_order=1, and make agreg a derivative of ts.
The (uuid, agreg (ts)) pair then matches the table's ORDER BY.
Even so, CH will still be slower than other time-series databases, because CH is not designed for such queries.
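To see why intDiv(ts, 1000) is preferable to round(ts/1000) for a partition key, here is a minimal Python sketch (not ClickHouse code). The sample timestamp and the helper name yyyymm are made up for illustration, and the sketch assumes UTC, whereas ClickHouse would use the server or column time zone.

```python
from datetime import datetime, timezone

def yyyymm(seconds: int) -> int:
    """Mimic toYYYYMM(toDateTime(seconds)), assuming UTC."""
    d = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return d.year * 100 + d.month

# Hypothetical sample: 2021-08-31 23:59:59.600 UTC, in milliseconds.
ts_ms = 1630454399600

rounded = round(ts_ms / 1000)  # like round(ts/1000): rounds up to the next second
truncated = ts_ms // 1000      # like intDiv(ts, 1000): drops the milliseconds

# Rounding moves the row into the September partition,
# truncation keeps it in August where it belongs.
print(yyyymm(rounded), yyyymm(truncated))  # 202109 202108
```

With rounding, a row whose ts falls in the last half second of a month lands in the next month's partition, so a filter on ts and the partition key can disagree at the boundary.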


Strange error in the select of the array of records

As I understand it, Postgres supports selecting arrays of composite types (records, i.e. table rows). I found this answer, so I tried code like this in my PL/pgSQL function:
...
DECLARE
deleting my_table[];
BEGIN
deleting := ARRAY(SELECT mt.* FROM table1 t1
INNER JOIN table2 t2 ON t2.id=t1.table1
INNER JOIN my_table mt ON mt.id=t2.my_table
WHERE t1.x=10);
...
END;
$$;
This is just a simple example to illustrate the error. The error is:
SQL Error [42601]: ERROR: subquery must return only one column
where: PL/pgSQL function xxx(...) line 6 at assignment
The error is in the assignment deleting := ARRAY(SELECT mt.* ...) (I verified it with a debug RAISE directly after this line). But what is wrong with it? I do want to select all columns, and the element type of my array is the row type of the mt table. It looks similar to the code in the Stack Overflow answer mentioned above. The only alternative I know that avoids querying an array of records is to use multiple arrays, one per column, which looks clumsy.
Could somebody point me to the right syntax? Any help would be very useful.
PS. I use Postgres 12.
You can use one trick (it is the same as @a_horse_with_no_name's comment; this is just a more detailed description). Every Postgres table has some invisible columns. One of them, which has the same name as the table, holds all columns of the row in composite format. I have no idea when this feature was introduced; it is pretty old (maybe from the pre-SQL age):
postgres=# select * from foo;
┌────┬────┐
│ a  │ b  │
╞════╪════╡
│ 10 │ 20 │
│ 30 │ 40 │
└────┴────┘
(2 rows)
postgres=# select foo from foo ;
┌─────────┐
│   foo   │
╞═════════╡
│ (10,20) │
│ (30,40) │
└─────────┘
(2 rows)
Sometimes this feature can be pretty useful:
postgres=# do $$
declare a foo[];
begin
a := array(select foo from foo);
raise notice '%', a;
end;
$$;
NOTICE: {"(10,20)","(30,40)"}
DO
If you don't know this trick, or if you prefer an exact column specification, you can use a row constructor:
postgres=# do $$
declare a foo[];
begin
a := array(select row(x.a,x.b) from foo x);
raise notice '%', a;
end;
$$;
NOTICE: {"(10,20)","(30,40)"}
DO
In both cases the result is an array of composite values. Be careful: the content is held only in memory, and you can exhaust a lot of memory (although the size is limited to 1GB). On the other hand, it is a very practical replacement for temporary tables or table variables. Creating and dropping a temporary table is relatively slow and expensive, so when the content is not too big and you don't need the full functionality of tables (indexes, statistics, session-related life cycle), composite arrays can be a very good replacement for temporary tables.

Is there any way to delete all data in ClickHouse tables with a Memory engine?

Most of the time, data left in the database causes my tests to fail, so I need to clear the database before running them. Do you know a way to clear ClickHouse before running the tests, like DatabaseMigrations in Laravel?
Tables with ENGINE = Memory support TRUNCATE TABLE:
create table xxxm(a Int64) engine = Memory;
insert into xxxm values (1), (2), (3);
select count() from xxxm;
┌─count()─┐
│       3 │
└─────────┘
truncate table xxxm;
select count() from xxxm;
┌─count()─┐
│       0 │
└─────────┘

Same SQL request: CockroachDB takes 4 min, SQL Server takes 35 ms. Am I missing something?

The database contains only 2 tables:
wallet (1 million rows)
transaction (15 million rows)
CockroachDB 19.2.6 runs on 3 Ubuntu machines
2vCPU each
8GB RAM each
Docker swarm container
vs
SQL Server 2019 runs on 1 machine Windows Server 2019
4vCPU
16GB RAM
Here is the request
select * from transaction t
join wallet s on t.sender_id=s.id
join wallet r on t.receiver_id=r.id
limit 10;
SQL Server takes only 35 ms to return the first 10 results.
CockroachDB takes 3.5-5 min for it.
1) I know that the infrastructure comparison is not entirely fair to CockroachDB, but still, the difference is really too big. Am I missing something, or is CockroachDB just very slow for this particular SQL request?
2) When I execute this request, the CPU of all 3 CockroachDB nodes goes up to 100%. Is that normal?
Update: here is the EXPLAIN output for the request. I'm not sure how to read it.
> explain select * from transaction t
-> join wallet s on t.sender_id=s.id
-> join wallet r on t.receiver_id=r.id
-> limit 10;
          tree        |       field        |     description
+---------------------+--------------------+----------------------+
                      | distributed        | true
                      | vectorized         | false
  limit               |                    |
   │                  | count              | 10
   └── hash-join      |                    |
        │             | type               | inner
        │             | equality           | (receiver_id) = (id)
        │             | right cols are key |
        ├── hash-join |                    |
        │    │        | type               | inner
        │    │        | equality           | (sender_id) = (id)
        │    │        | right cols are key |
        │    ├── scan |                    |
        │    │        | table              | transaction@primary
        │    │        | spans              | ALL
        │    └── scan |                    |
        │             | table              | wallet@primary
        │             | spans              | ALL
        └── scan      |                    |
                      | table              | wallet@primary
                      | spans              | ALL
It appears that this is actually due to a difference in query plans between SQL Server and CockroachDB, but one that can be worked around in a few ways.
The root problem is that the transaction table has two foreign key constraints pointed at the wallet table, but both foreign keys are nullable. This prevents CockroachDB from pushing the 10 row limit through the join, because the scan on the transaction table may need to produce more than 10 rows for the entire query to produce up to 10 rows.
We see this in the query plan:
> explain select * from transaction t
join wallet s on t.sender_id=s.id
join wallet r on t.receiver_id=r.id
limit 10;
                    info
---------------------------------------------
  distribution: full
  vectorized: true
  • limit
  │ count: 10
  │
  └── • lookup join
      │ table: wallet@primary
      │ equality: (receiver_id) = (id)
      │ equality cols are key
      │
      └── • lookup join
          │ table: wallet@primary
          │ equality: (sender_id) = (id)
          │ equality cols are key
          │
          └── • scan
                estimated row count: 10,000
                table: transaction@primary
                spans: FULL SCAN
Notice that the limit is applied after both joins.
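To illustrate why the limit cannot simply be moved below an inner join when the join column is nullable, here is a toy Python sketch. It is not CockroachDB code: the data, the single join (instead of two), and the function names are all made up for illustration.

```python
from itertools import islice

# Hypothetical toy data: wallet ids, and transactions as (id, sender_id).
# Every second transaction has a NULL (None) sender_id.
wallets = {1, 2, 3}
transactions = [(i, None if i % 2 == 0 else 1) for i in range(100)]

def inner_join(rows):
    """Keep only rows whose sender_id matches a wallet (NULL never matches)."""
    return (r for r in rows if r[1] in wallets)

def left_join(rows):
    """Keep every row; unmatched rows would just carry NULL wallet columns."""
    return (r for r in rows)

limit = 10
correct = list(islice(inner_join(transactions), limit))     # join, then limit
pushed = list(inner_join(islice(transactions, limit)))      # limit, then join
pushed_left = list(left_join(islice(transactions, limit)))  # limit, then left join

print(len(correct), len(pushed), len(pushed_left))  # 10 5 10
```

Limiting the scan to 10 rows before the inner join yields only 5 rows here, because half of the scanned rows are discarded by the join. With a left join (or with a column that can never be NULL and always references an existing wallet) no row is discarded, so limiting the scan first is safe.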
There are two relatively straightforward ways to fix this. First, we could replace the joins with left joins. This will allow the limit to be pushed down to the scan on the transaction table because the left join will never discard rows.
> explain select * from transaction t
left join wallet s on t.sender_id=s.id
left join wallet r on t.receiver_id=r.id
limit 10;
                  info
----------------------------------------
  distribution: full
  vectorized: true
  • lookup join (left outer)
  │ table: wallet@primary
  │ equality: (receiver_id) = (id)
  │ equality cols are key
  │
  └── • lookup join (left outer)
      │ table: wallet@primary
      │ equality: (sender_id) = (id)
      │ equality cols are key
      │
      └── • scan
            estimated row count: 10
            table: transaction@primary
            spans: LIMITED SCAN
            limit: 10
The other option is to make the referencing columns in the foreign key constraints non null. This will also allow the limit to be pushed down to the scan on the transaction table because even an inner join will never discard rows.
> alter table transaction alter column sender_id set not null;
ALTER TABLE
> alter table transaction alter column receiver_id set not null;
ALTER TABLE
> explain select * from transaction t
join wallet s on t.sender_id=s.id
join wallet r on t.receiver_id=r.id
limit 10;
                  info
----------------------------------------
  distribution: full
  vectorized: true
  • lookup join
  │ table: wallet@primary
  │ equality: (receiver_id) = (id)
  │ equality cols are key
  │
  └── • lookup join
      │ table: wallet@primary
      │ equality: (sender_id) = (id)
      │ equality cols are key
      │
      └── • scan
            estimated row count: 10
            table: transaction@primary
            spans: LIMITED SCAN
            limit: 10
I believe it's possible your table has outdated statistics. (Though hopefully by this point, the statistics have automatically been updated and you aren't having this problem anymore.) You can read about this in the CockroachDB docs, and there is also a link there that describes how you can manually create new statistics.

Data from multiple tables without any relationships

I have two models PrintEvent and SocialEvent with 2 tables as given below:-
print_events:-
╔════╤════════════════════════╤════════════╗
║ id │ name                   │ date       ║
╠════╪════════════════════════╪════════════╣
║ 1  │ Important print event1 │ 01/05/2015 ║
╟────┼────────────────────────┼────────────╢
║ 2  │ Important print event2 │ 02/05/2015 ║
╟────┼────────────────────────┼────────────╢
║ 3  │ Important print event3 │ 03/05/2015 ║
╚════╧════════════════════════╧════════════╝
social_events:-
╔════╤═════════════════════════╤════════════╗
║ id │ name                    │ date       ║
╠════╪═════════════════════════╪════════════╣
║ 1  │ Important social event1 │ 01/05/2015 ║
╟────┼─────────────────────────┼────────────╢
║ 2  │ Important social event2 │ 02/05/2015 ║
╟────┼─────────────────────────┼────────────╢
║ 3  │ Important social event3 │ 03/05/2015 ║
╚════╧═════════════════════════╧════════════╝
Now I have a search form with 'event date from' and 'event date to' fields, and I want to show the list of matching records from both tables. I would like to do this through one model or in some other way, but I don't want to run separate queries through both models.
Thanks.
Without a relation between the tables? Then you just have to do 2 queries instead.
Or just do:
class YourFirstModel extends AppModel {
    // one YourFirstModel record has many SecondModel records
    public $hasMany = array('SecondModel');
}
Then
class SecondModel extends AppModel {
    // the inverse side of the association
    public $belongsTo = array('YourFirstModel');
}
And when you query either of these models, you should be able to get data from both tables.
OK, here is another way, using a join, if you do not want to use my first approach.
Modify the query to meet your own requirements.
$this->PrintEvent->find('all', array(
    'joins' => array(
        array(
            'table' => 'social_events',
            'alias' => 'EventJoin',
            'type' => 'INNER',
            'conditions' => array(
                'EventJoin.date = PrintEvent.date'
            )
        )
    ),
    'conditions' => array(
        'PrintEvent.date' => $searchConditions
    ),
    'fields' => array('EventJoin.*', 'PrintEvent.*'),
    'order' => 'PrintEvent.date DESC'
));
Note that this has to be in your controller.
Let me know if this helps.
As far as I understand, CakePHP models can only ever represent one table each. You may need to resort to using query() to perform the UNION in SQL:
$this->PrintEvent->query('SELECT id, name, date FROM print_events UNION SELECT id, name, date FROM social_events');
I would normally advise against using this method, but I believe what you are trying to achieve is not supported by the framework directly.

Auto analyze not running for all tables in Postgres 9.2 database

I've noticed that a database I am tuning (Postgres 9.2) is not running autoanalyze for many of the tables I am interested in, and I don't quite understand why. My understanding is that, based on the current configuration, autoanalyze will run once the table grows or is modified by >= 10% of its rows. However, this is not what I see when querying the database.
Here's a set of results from running a query on pg_stat_all_tables on a database that's been running in prod for over a year (results truncated and real table names redacted):
┌────────────────────┬─────────────────┬──────────────────┬──────────────────┬───────────────────┐
│      relname       │ last_autovacuum │ autovacuum_count │ last_autoanalyze │ autoanalyze_count │
├────────────────────┼─────────────────┼──────────────────┼──────────────────┼───────────────────┤
│ a_large_table      │ ¤               │                0 │ ¤                │                 0 │
│ table_a            │ 2014-04-01      │                1 │ 2014-04-01       │                 1 │
│ table_b            │ 2014-04-01      │                1 │ 2014-04-01       │                 1 │
│ a_very_large_table │ ¤               │                0 │ ¤                │                 0 │
└────────────────────┴─────────────────┴──────────────────┴──────────────────┴───────────────────┘
Note, table_a and table_b are frequently cleaned of old data, so it makes sense that these would have had an autovacuum/autoanalyze recently. However, I'd also expect the other large tables to have been at least analyzed recently as well.
For good measure, here's the postgresql.conf...
#------------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#------------------------------------------------------------------------------
autovacuum = on
log_autovacuum_min_duration = 1000
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 100
autovacuum_analyze_threshold = 100
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1
autovacuum_freeze_max_age = 200000000
autovacuum_vacuum_cost_delay = 20ms
autovacuum_vacuum_cost_limit = -1
It does look like the tables should be analyzed when 10% of the tuples are changed. Based on the posted postgresql.conf settings,
autovacuum analyze threshold = 100 + 0.1 * table size before vacuum
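As a rough illustration of that formula, here is a small Python sketch. The function names and the sample table sizes are made up; the defaults are the autovacuum_analyze_threshold and autovacuum_analyze_scale_factor values from the posted postgresql.conf.

```python
def analyze_threshold(reltuples: float,
                      base_threshold: int = 100,
                      scale_factor: float = 0.1) -> float:
    """Number of changed rows that must be exceeded before autoanalyze runs."""
    return base_threshold + scale_factor * reltuples

def needs_autoanalyze(changed_rows: int, reltuples: float) -> bool:
    """True once the changes since the last ANALYZE exceed the threshold."""
    return changed_rows > analyze_threshold(reltuples)

# Hypothetical table sizes, not taken from the question.
for rows in (1_000, 1_000_000, 100_000_000):
    print(rows, analyze_threshold(rows))
```

For a table with 100 million rows, roughly 10 million rows would have to change before autoanalyze kicks in, which may be one reason why very large, slowly changing tables show an autoanalyze_count of 0.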
Other things to check:
Is autovacuum running? ps -ef | grep vacuum
Make sure you're not checking on a standby or slave server
See slides from Gabrielle Roth's autovacuum talk at Scale:
https://wiki.postgresql.org/images/b/b5/Groth_scale12x_autovacuum.pdf
