How to check max-sql-memory and cache settings for an already running instance of CockroachDB?

I have a CockroachDB instance running in production and would like to know the settings for --max-sql-memory and --cache that were specified when the database was started. I am trying to enhance performance by following this production checklist, but I am not able to infer the settings from either the dashboard or the SQL console.
Where can I check the values of max-sql-memory and cache?
Note: I am able to access the CockroachDB admin console and SQL tables.

You can find this information in the logs, shortly after node startup:
I190626 10:22:47.714002 1 cli/start.go:1082 CockroachDB CCL v19.1.2 (x86_64-unknown-linux-gnu, built 2019/06/07 17:32:15, go1.11.6)
I190626 10:22:47.815277 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 31 GiB, using system memory
I190626 10:22:47.815311 1 server/config.go:386 system total memory: 31 GiB
I190626 10:22:47.815411 1 server/config.go:388 server configuration:
max offset 500000000
cache size 7.8 GiB <====
SQL memory pool size 7.8 GiB <====
scan interval 10m0s
scan min idle time 10ms
scan max idle time 1s
event log enabled true
If the logs have since been rotated away, the values depend on the flags passed at startup.
The defaults for v19.1 are 128 MiB for each, with the recommended setting being 0.25 (a quarter of system memory).
The settings are not currently logged periodically or exported through metrics.
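In practice, that means grepping whatever log files remain for the configuration dump, or noting what will be passed on the next restart. A rough sketch, assuming the default log directory under the store path (paths and flags may need adjusting for your deployment):
grep -E "cache size|SQL memory pool size" /path/to/cockroach-data/logs/cockroach.log

# the checklist's quarter-of-memory recommendation can be passed as a fraction
# the next time the node is started:
cockroach start --cache=.25 --max-sql-memory=.25 [other startup flags]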

Related

What causes data on a read-replica to be an old_snapshot and cause conflict?

After encountering the following error several times (on an RDS Postgres instance):
ERROR: canceling statement due to conflict with recovery
Detail: User query might have needed to see row versions that must be removed
I ran (on the hot standby):
SELECT *
FROM pg_stat_database_conflicts;
And found that all the conflicts have to do with confl_snapshot
Which is explained in the documentation as:
confl_snapshot: Number of queries in this database that have been canceled due to old snapshots
What might be causing this conflict (being an old snapshot)?
If it helps, here are some of the relevant settings (from running SHOW ALL; on the standby):
hot_standby: on
hot_standby_feedback: off
max_standby_archive_delay: 30s
max_standby_streaming_delay: 1h
old_snapshot_threshold: -1
vacuum_defer_cleanup_age: 0
vacuum_freeze_min_age: 50000000
vacuum_freeze_table_age: 150000000
vacuum_multixact_freeze_min_age: 5000000
vacuum_multixact_freeze_table_age: 150000000
wal_level: replica
wal_receiver_status_interval: 10s
wal_receiver_timeout: 30s
wal_retrieve_retry_interval: 5s
wal_segment_size: 16MB
wal_sender_timeout: 30s
wal_writer_delay: 200ms
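For reference, the same subset of settings can be pulled with one query against pg_settings instead of scanning the full SHOW ALL output; this sketch simply mirrors the parameter names listed above:
SELECT name, setting
FROM pg_settings
WHERE name IN ('hot_standby', 'hot_standby_feedback',
               'max_standby_archive_delay', 'max_standby_streaming_delay',
               'old_snapshot_threshold', 'vacuum_defer_cleanup_age',
               'wal_level');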

Big SQL query duration increase after CPU + mainboard upgrade

After upgrading server hardware (CPU + mainboard), I'm seeing a big increase in query duration for really small and simple queries.
Software: Windows Server 2012 R2 + SQL Server 2014
Storage: Samsung SSD 850 EVO 2TB Disk
Old Hardware: i7-4790K 4.0 GHz 4-core CPU + Asus H97M-E mainboard + 32 GB DDR3
New Hardware: i9-7900X 3.60 GHz 10-core CPU + Asus Prime X299 mainboard + 32 GB DDR4
Query Sample:
UPDATE CLIE_PRECIOS_COMPRA SET [c_res_tr] = '0.0' WHERE eje ='18' AND mes =8 AND dia =27 AND hor =19 AND unipro='001'
SQL Profiler results:
Old Hardware - CPU: 0, Reads 4, Writes 0, Duration 123
New Hardware - CPU: 0, Reads 4, Writes 0, Duration 2852
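To double-check those numbers outside of Profiler, the timings can also be captured directly in Management Studio; a sketch using SET STATISTICS TIME around the sample query above:
SET STATISTICS TIME ON;
UPDATE CLIE_PRECIOS_COMPRA SET [c_res_tr] = '0.0'
WHERE eje = '18' AND mes = 8 AND dia = 27 AND hor = 19 AND unipro = '001';
SET STATISTICS TIME OFF;
-- CPU time and elapsed time are printed to the Messages tab, in milliseconds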
I've checked that the network speed of both servers is the same, but in any case I'm running the queries directly on the server through Microsoft SQL Server Management Studio to avoid application or network issues.
I've also checked storage speed; reads and writes are the same on the old and new hardware.
I've also played with parallelism and tried different scenarios, even disabling parallelism, with the same result.
Of course the data is the same, with the same copy of the SQL database on both machines.
I've set the duration to be shown in microseconds instead of milliseconds to better appreciate the difference.
The difference in duration for a single query is not really noticeable to the user, but the problem is that there are several thousand queries of this type, so the time increase matters.
Any hint or thing to investigate would be really appreciated.
Current Execution Plan New Server: https://www.brentozar.com/pastetheplan/?id=HJYDtQQD7
Current Execution Plan Old Server: https://www.brentozar.com/pastetheplan/?id=SynyW4mPQ
Thanks in advance.

Cannot repair specific tables on specific nodes in Cassandra

I'm running 5 nodes in one DC of Cassandra 3.10.
As I'm trying to maintain those nodes, I'm running daily on every node:
nodetool repair -pr
and weekly
nodetool repair -full
This is the only table I have difficulties with:
Table: user_tmp
SSTable count: 4
Space used (live): 366.71 MiB
Space used (total): 366.71 MiB
Space used by snapshots (total): 216.87 MiB
Off heap memory used (total): 5.28 MiB
SSTable Compression Ratio: 0.4690289976332873
Number of keys (estimate): 1968368
Memtable cell count: 2353
Memtable data size: 84.98 KiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 1108
Local read count: 62938927
Local read latency: 0.324 ms
Local write count: 62938945
Local write latency: 0.018 ms
Pending flushes: 0
Percent repaired: 76.94
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 4.51 MiB
Bloom filter off heap memory used: 4.51 MiB
Index summary off heap memory used: 717.62 KiB
Compression metadata off heap memory used: 76.96 KiB
Compacted partition minimum bytes: 51
Compacted partition maximum bytes: 654949
Compacted partition mean bytes: 194
Average live cells per slice (last five minutes): 2.503074492537404
Maximum live cells per slice (last five minutes): 179
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 19 bytes
Percent repaired never goes above 80% for this table on this node and one other, but on the other nodes it is above 85%. RF is 3, and the compaction strategy is SizeTieredCompactionStrategy.
gc_grace_seconds is set to 10 days, and somewhere in that period I get a write timeout on exactly this table, but the consumer that got the timeout is immediately replaced by another one and everything keeps going like nothing happened. It's like a one-time write timeout.
My question is: do you maybe have a suggestion for a better repair strategy? I'm kind of a noob and every suggestion is a big win for me, plus anything specific to this table.
Maybe repair -inc instead of repair -pr?
The nodetool repair command in Cassandra 3.10 defaults to running incremental repair. There have been some major issues with incremental repair, and the community currently does not recommend running it. Please see this article for some great insight into repair and the issues with incremental repair: http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
I would recommend, as do many others, running:
nodetool repair -full -pr
Please be aware that you need to run repair on every node in your cluster. This means that if you run repair on one node per day, you can have at most 7 nodes (since with the default gc_grace you should aim to finish repair within 7 days). You also have to rely on nothing going wrong during repair, since you would have to restart any failing jobs yourself.
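If you do stay with plain nodetool, the schedule can be as simple as a staggered cron entry per node; the keyspace name and log path below are placeholders, and a failed run still has to be noticed and re-run by hand:
# crontab sketch: full primary-range repair of one keyspace, weekly, at a node-specific time
0 2 * * 1 nodetool repair -full -pr my_keyspace >> /var/log/cassandra/repair.log 2>&1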
This is why tools like Reaper exist. It solves these issues with ease, automates repair, and makes life simpler. Reaper runs scheduled repairs and provides a web interface to make administration easier. I would highly recommend using Reaper for routine maintenance and nodetool repair for unplanned activities.
Edit: Link http://cassandra-reaper.io/

Cassandra sudden errors Mutation of x bytes is too large

After some weeks of stress tests with a Cassandra 3.10 cluster, errors "Mutation of x bytes is too large" have suddenly appeared, together with high CPU load (e.g. load average: 33.26, 32.47, 32.15) on all 6 nodes.
There has been no change in the size of the executed queries, so we have tried increasing commitlog_segment_size_in_mb. It doesn't help much. The output of nodetool info is as follows:
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 126.04 GiB
Generation No : 1498838154
Uptime (seconds) : 68795
Heap Memory (MB) : 3258.00 / 8004.00
Off Heap Memory (MB) : 597.23
Some blobs are inserted, with a predictable max size (unchanged at the time of the degradation) of ~18 KB (avg ~10 KB).
Data is mainly updated using single-partition atomic/logged batches (with a configurable and therefore predictable max batch size), as this provided the best throughput according to the stress tests.
Note that even reverting now to individual async updates doesn't fix the degradation.
What could have caused this sudden performance degradation?
The amount of data has increased (by roughly a factor of 1.1), but I would not expect that to cause such a disproportionate issue (there is no shortage of available disk storage). I would expect Cassandra to scale well in such a situation.
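For reference, the limit behind "Mutation of x bytes is too large" is tied to the commit log segment size: by default a mutation may be at most half a segment. A cassandra.yaml sketch with illustrative values (not a recommendation):
# cassandra.yaml - illustrative values only
commitlog_segment_size_in_mb: 64
# max_mutation_size_in_kb defaults to commitlog_segment_size_in_mb * 1024 / 2
# (32768 KB with the value above); mutations larger than that are rejected
# with "Mutation of x bytes is too large"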
Thanks for ideas

Apache2: server-status reported value for "requests/sec" is wrong. What am I doing wrong?

I am running Apache2 on Linux (Ubuntu 9.10).
I am trying to monitor the load on my server using mod_status.
There are 2 things that puzzle me (see cut-and-paste below):
1) The CPU load is reported as a ridiculously small number, whereas "uptime" reports a number between 0.05 and 0.15 at the same time.
2) The "requests/sec" figure is also ridiculously low (0.06), when I know there are at least 10 requests coming in per second right now.
(You can see there are close to a quarter million "accesses" - this sounds right.)
I am wondering whether this is a bug (if so, is there a fix/workaround),
or maybe a configuration error (but I can't imagine how).
Any insights would be appreciated.
-- David Jones
- - - - -
Current Time: Friday, 07-Jan-2011 13:48:09 PST
Restart Time: Thursday, 25-Nov-2010 14:50:59 PST
Parent Server Generation: 0
Server uptime: 42 days 22 hours 57 minutes 10 seconds
Total accesses: 238015 - Total Traffic: 91.5 MB
CPU Usage: u2.15 s1.54 cu0 cs0 - 9.94e-5% CPU load
.0641 requests/sec - 25 B/second - 402 B/request
11 requests currently being processed, 2 idle workers
- - - - -
After I restarted my Apache server, I realized what was going on. The "requests/sec" figure is calculated over the lifetime of the server. So if your Apache server has been running for 3 months, it tells you nothing at all about the current load on your server. Instead, it reports the total number of requests divided by the total number of seconds the server has been up.
It would be nice if there was a way to see the current load on your server. Any ideas?
Anyway, ... answered my own question.
-- David Jones
The Apache status value "Total Accesses" is the total access count since the server started; the per-second delta of that counter is what we actually mean by "requests per second".
Here is one way to get it:
1) Apache monitor script for zabbix
https://github.com/lorf/zapache/blob/master/zapache
2) Install & config zabbix agentd
UserParameter=apache.status[*],/bin/bash /path/apache_status.sh $1 $2
3) Zabbix - Create apache template - Create Monitor item
Key: apache.status[{$APACHE_STATUS_URL}, TotalAccesses]
Type: Numeric(float)
Update interval: 20
Store value: Delta (speed per second) --this is the key option
Zabbix will calculate the increment of the Apache request counter and store the delta value, which gives you "requests per second".
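If you just want a quick number without setting up Zabbix, the same delta can be computed by hand by sampling the machine-readable status page twice. A rough sketch, assuming mod_status answers at localhost with ExtendedStatus and ?auto output enabled:
# sample "Total Accesses" twice, 10 seconds apart, and divide the difference by the interval
URL="http://localhost/server-status?auto"
A=$(curl -s "$URL" | awk -F': ' '/^Total Accesses/ {print $2}')
sleep 10
B=$(curl -s "$URL" | awk -F': ' '/^Total Accesses/ {print $2}')
echo "scale=2; ($B - $A) / 10" | bc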
