Cassandra row cache not reducing query latency

I have a table with a partition key and a clustering column, and I enabled row caching on it (rows_per_partition = 2000). I gave the row cache 20 GB by changing the yaml file.
I can see that my hit rate is ~90% (which is good enough for me), but monitoring shows no latency reduction at all (it's even a bit higher with caching on). Note that the query I'm running fetches all rows for a given partition key, and the number of rows per partition varies from 10 to 10,000.
Any idea why Cassandra row caching is not effective in my case?
Host: EC2 r4.16xlarge, which has 64 vCPUs and 488 GB of memory.
C* version: 3.11.0
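For context, the setup described above boils down to roughly the following (a sketch; the keyspace/table name and the exact values are assumptions based on the description, not copied from the cluster):
# cassandra.yaml on each node: give the row cache 20 GB (off-heap by default)
#   row_cache_size_in_mb: 20480
# table level: cache up to 2000 rows per partition rather than ALL
cqlsh -e "ALTER TABLE my_ks.my_table WITH caching = {'keys': 'ALL', 'rows_per_partition': '2000'};"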
=========nodetool info=========:
ID : 1f05c846-ddd6-4409-8473-10f3c0490279
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 145.95 GiB
Generation No : 1506984130
Uptime (seconds) : 69561
Heap Memory (MB) : 4707.58 / 7987.25
Off Heap Memory (MB) : 680.54
Data Center : us-east
Rack : 1e
Exceptions : 0
Key Cache : entries 671231, size 100 MiB, capacity 100 MiB, 558160978 hits, 566196579 requests, 0.986 recent hit rate, 14400 save period in seconds
Row Cache : entries 1225796, size 17.8 GiB, capacity 19.53 GiB, 80015143 hits, 86502918 requests, 0.925 recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 122880, size 480 MiB, capacity 480 MiB, 334907479 misses, 6940124103 requests, 0.952 recent hit rate, 13.407 microseconds miss latency
Percent Repaired : 85.28384276963543%
Token : (invoke with -T/--tokens to see all 256 tokens)

Related

How to configure ScyllaDB for millions of rows: a simple count(*) query returns the error "timeout during read query at consistency"

I have a simple table:
create table if not exists keyspace_test.table_test
(
id int,
date text,
val float,
primary key (id, date)
)
with caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
and compaction = {'class': 'SizeTieredCompactionStrategy'}
and compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
and dclocal_read_repair_chance = 0
and speculative_retry = '99.0PERCENTILE'
and read_repair_chance = 1;
After that I imported 12 million rows. Then I want to run a simple calculation, counting the rows and summing the val column, with this query:
SELECT COUNT(*), SUM(val)
FROM keyspace_test.table_test
but it shows the error:
Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)
I already added USING TIMEOUT 180s; but it shows the error:
Timed out waiting for server response
The servers I use are spread across 2 datacenters. Each datacenter has 4 servers.
# docker exec -it scylla-120 nodetool status
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.3.192.25 79.04 GB 256 ? 5975a143fec6 Rack1
UN 10.3.192.24 74.2 GB 256 ? 61dc1cfd3e92 Rack1
UN 10.3.192.22 88.21 GB 256 ? 0d24d52d6b0a Rack1
UN 10.3.192.23 63.41 GB 256 ? 962s266518ee Rack1
Datacenter: dc3
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 34.77.78.21 83.5 GB 256 ? 5112f248dd38 Rack1
UN 34.77.78.20 59.87 GB 256 ? e8db897ca33b Rack1
UN 34.77.78.48 81.32 GB 256 ? cb88bd9326db Rack1
UN 34.77.78.47 79.8 GB 256 ? 562a721d4b77 Rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
And I created the keyspace with:
CREATE KEYSPACE keyspace_test WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc2' : 3, 'dc3' : 3};
How do I properly configure Scylla for millions of rows of data?
Not sure about SUM, but you could use DSBulk to count the rows in a table.
dsbulk count \
-k keyspace_test \
-t table_test \
-u username \
-p password \
-h 10.3.192.25
DSBulk takes token range ownership into account, so it's not as stressful on the cluster.
As explained in the Scylla documentation (https://docs.scylladb.com/stable/kb/count-all-rows.html), a COUNT requires scanning the entire table, which can take a long time, so using USING TIMEOUT as you did is indeed the right thing.
I don't know whether 180 seconds is a long enough timeout for scanning 12 million rows in your table. To be sure, you can try increasing it to 3600 seconds and see if the query ever finishes, or run a full-table scan (not just a count) to see how fast it progresses and estimate how long a count() might take (a count() should take less time than an actual scan returning data, but not much less - it still does all the same IO).
Also, it is important to note that until very recently COUNT was implemented inefficiently - it proceeded sequentially instead of utilizing all the shards in the system. This was fixed in https://github.com/scylladb/scylladb/commit/fe65122ccd40a2a3577121aebdb9a5b50deb4a90, but the fix only reached Scylla 5.1 (and the master branch) - are you using an older version of Scylla? The example in that commit suggests that the new implementation may be as much as 30 times faster than the old one.
So hopefully, on Scylla 5.1 a much lower timeout will be enough for your COUNT operation to finish. On older versions, you can emulate what Scylla 5.1 does manually: divide the token range into parts, invoke a partial COUNT on each of those token ranges in parallel, and then sum up the results from all the ranges.
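A minimal sketch of that approach with cqlsh (the contact point and the two-way split are illustrative assumptions; a real job would use many smaller ranges run in parallel):
# lower half of the murmur3 token ring
cqlsh 10.3.192.25 -e "SELECT COUNT(*) FROM keyspace_test.table_test WHERE token(id) >= -9223372036854775808 AND token(id) < 0;" &
# upper half, in parallel (add USING TIMEOUT per statement if a half-ring scan is still slow)
cqlsh 10.3.192.25 -e "SELECT COUNT(*) FROM keyspace_test.table_test WHERE token(id) >= 0;" &
wait   # then add the two counts together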

How to check max-sql-memory and cache settings for an already running instance of cockroach db?

I have a CockroachDB instance running in production and would like to know the settings for --max-sql-memory and --cache that were specified when the database was started. I am trying to enhance performance by following this production checklist, but I am not able to infer the settings from either the dashboard or the SQL console.
Where can I check the values of max-sql-memory and cache?
Note: I am able to access the cockroachdb admin console and sql tables.
You can find this information in the logs, shortly after node startup:
I190626 10:22:47.714002 1 cli/start.go:1082 CockroachDB CCL v19.1.2 (x86_64-unknown-linux-gnu, built 2019/06/07 17:32:15, go1.11.6)
I190626 10:22:47.815277 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 31 GiB, using system memory
I190626 10:22:47.815311 1 server/config.go:386 system total memory: 31 GiB
I190626 10:22:47.815411 1 server/config.go:388 server configuration:
max offset 500000000
cache size 7.8 GiB <====
SQL memory pool size 7.8 GiB <====
scan interval 10m0s
scan min idle time 10ms
scan max idle time 1s
event log enabled true
If the logs have been rotated away, the values depend on the flags the node was started with.
The defaults for v19.1 are 128 MiB for each, with the recommended production setting being .25 (a quarter of system memory).
The settings are not currently logged periodically or exported through metrics.
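If the startup log is still available, a quick way to pull out just those two lines (the log path here is an assumption; adjust it to wherever your node writes its logs):
grep -E "cache size|SQL memory pool size" cockroach-data/logs/cockroach.log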

Cannot repair specific tables on specific nodes in Cassandra

I'm running 5 nodes of Cassandra 3.10 in one DC.
To maintain those nodes, I run the following daily on every node:
nodetool repair -pr
and weekly
nodetool repair -full
This is the only table I have difficulties with:
Table: user_tmp
SSTable count: 4
Space used (live): 366.71 MiB
Space used (total): 366.71 MiB
Space used by snapshots (total): 216.87 MiB
Off heap memory used (total): 5.28 MiB
SSTable Compression Ratio: 0.4690289976332873
Number of keys (estimate): 1968368
Memtable cell count: 2353
Memtable data size: 84.98 KiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 1108
Local read count: 62938927
Local read latency: 0.324 ms
Local write count: 62938945
Local write latency: 0.018 ms
Pending flushes: 0
Percent repaired: 76.94
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 4.51 MiB
Bloom filter off heap memory used: 4.51 MiB
Index summary off heap memory used: 717.62 KiB
Compression metadata off heap memory used: 76.96 KiB
Compacted partition minimum bytes: 51
Compacted partition maximum bytes: 654949
Compacted partition mean bytes: 194
Average live cells per slice (last five minutes): 2.503074492537404
Maximum live cells per slice (last five minutes): 179
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 19 bytes
Percent repaired never goes above 80% for this table on this node and one other, but on the rest it is above 85%. RF is 3, and the compaction strategy is SizeTieredCompactionStrategy.
gc_grace_seconds is set to 10 days, and at some point in that period I get a write timeout on exactly this table, but the consumer that got the timeout is immediately replaced by another one and everything keeps going as if nothing happened. It's like a one-off write timeout.
My question is: do you have a suggestion for a better repair strategy? I'm kind of a noob, so every suggestion is a big win for me, plus anything else for this table.
Maybe repair -inc instead of repair -pr
The nodetool repair command in Cassandra 3.10 defaults to running incremental repair. There have been some major issues with incremental repair, and it is currently not recommended by the community. Please see this article for some great insight into repair and the issues with incremental repair: http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
I would recommend, as do many others, running:
nodetool repair -full -pr
Please be aware that you need to run repair on every node in your cluster. This means that if you run repair on one node per day, you can have at most 7 nodes (since with the default gc_grace you should aim to finish repairing all nodes within 7 days). You also have to rely on nothing going wrong during a repair, since you would have to restart any failing jobs yourself; a minimal sketch of such a rotation is shown below.
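A sketch of that per-node rotation (host names are placeholders, and in practice you would schedule each node's run from cron or a job scheduler rather than a single loop):
# one node at a time (ideally one per day), full repair of primary ranges only
for node in cass1 cass2 cass3 cass4 cass5; do
    ssh "$node" nodetool repair -full -pr
done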
This is why tools like Reaper exist: it solves these issues with ease, automates repair, and makes life simpler. Reaper runs scheduled repairs and provides a web interface to make administration easier. I would highly recommend using Reaper for routine maintenance and nodetool repair for unplanned activities.
Edit: Link http://cassandra-reaper.io/

Cassandra sudden errors Mutation of x bytes is too large

After some weeks of stress tests with a Cassandra 3.10 cluster, errors "Mutation of x bytes is too large" suddenly appeared, along with high CPU load (e.g. load average: 33.26, 32.47, 32.15) on all 6 nodes.
There has been no change in the size of the executed queries, so we tried increasing commitlog_segment_size_in_mb. It doesn't help much. The result of nodetool info is as below.
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 126.04 GiB
Generation No : 1498838154
Uptime (seconds) : 68795
Heap Memory (MB) : 3258.00 / 8004.00
Off Heap Memory (MB) : 597.23
Some blobs are inserted, with a predictable maximum size of ~18 KB (avg ~10 KB), unchanged around the time of the degradation.
Data is mainly updated using single-partition atomic/logged batches (with a configurable and therefore predictable maximum batch size), as this provided the best throughput according to the stress tests.
Note that even reverting now to individual async updates doesn't fix the degradation.
What could have caused this sudden performance degradation?
The amount of data has increased (by roughly 1.1x), but I would not expect that to cause such a disproportionate issue (there is no shortage of available disk space). I would expect Cassandra to scale well in this situation.
Thanks for ideas
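For reference, the rejection threshold for a single mutation is derived from the commit log segment size (half of it by default), so it is worth confirming what the nodes are actually running with - a sketch, assuming a default package install path:
# max_mutation_size_in_kb defaults to half of commitlog_segment_size_in_mb;
# anything larger is rejected with "Mutation of x bytes is too large"
grep -E "commitlog_segment_size_in_mb|max_mutation_size_in_kb" /etc/cassandra/cassandra.yaml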

Hazelcast vs. Ignite benchmark

I am using data grids as my primary "database". I noticed a drastic difference between Hazelcast and Ignite query performance. I optimized my data grid usage with proper custom serialization and indexes, but the difference is still noticeable IMO.
Since no one has asked this here, I am going to answer my own question for future reference. This is not an abstract (learning) exercise, but a real-world benchmark that models my data grid usage in large SaaS systems - primarily displaying sorted and filtered paginated lists. I primarily wanted to know how much overhead my universal JDBC-ish data grid access layer adds compared to raw, no-frameworks Hazelcast and Ignite usage. But since I am comparing apples to apples, here comes the benchmark.
I have reviewed the provided code on GitHub and have many comments:
Indexing and Joins
Probably the most important point is that Apache Ignite's indexing is a lot more sophisticated than Hazelcast's. Unlike Hazelcast, Ignite supports ANSI 99 SQL, so you can write your queries at will.
Most importantly, unlike Hazelcast, Ignite supports group-indexes and SQL JOINs across different caches or data types. Imagine that you have Person and Organization tables, and you need to select all Persons working for the same Organization. This would be impossible to do in 1 step in Hazelcast (correct me if I am wrong), but in Ignite it is a simple SQL JOIN query.
Given the above, Ignite indexes will take a bit longer to create, especially in your test, where you have 7 of them.
Fixes in TestEntity Class
In your code, the entity you store in cache, TestEntity, recalculates the value for idSort, createdAtSort, and modifiedAtSort every time the getter is called. Ignite calls these getters several times while the entity is being stored in the index tree. A simple fix to the TestEntity class provides 4x performance improvement: https://gist.github.com/dsetrakyan/6bfe089d53f888448503
Heap Measurement is Not Accurate
The way you measure heap is incorrect. You should at least call System.gc() before taking the heap measurement, and even that would not be accurate. For example, in the results below, I get negative heap size using your method.
Warmup
Every benchmark requires a warm-up. For example, when I apply the TestEntity fix, as suggested above, and do the cache population and queries 2 times, I get better results.
MySQL Comparison
I don't think comparing a single-node data grid test to MySQL is fair to either Ignite or Hazelcast. Databases have their own caching, and when working with such small memory sizes you are usually testing the database's in-memory cache against the data grid's in-memory cache.
The performance benefit usually comes in whenever doing a distributed test over a partitioned cache. This way a Data Grid will execute the query on each cluster node in parallel and the results should come back a lot faster.
Results
Here are the results I got for Apache Ignite. They look a lot better after I made the aforementioned fixes.
Note that the 2nd time we execute the cache population and cache queries, we get better results because the HotSpot JVM is warmed up.
It is worth mentioning that Ignite does not cache query results. Every time you run the query, you are executing it from scratch.
[00:45:15] Ignite node started OK (id=0960e091, grid=Benchmark)
[00:45:15] Topology snapshot [ver=1, servers=1, clients=0, CPUs=4, heap=8.0GB]
Starting - used heap: 225847216 bytes
Inserting 100000 records: ....................................................................................................
Inserted all records - used heap: 1001824120 bytes
Cache: 100000 entries, heap size: 775976904 bytes, inserts took 14819 ms
------------------------------------
Starting - used heap: 1139467848 bytes
Inserting 100000 records: ....................................................................................................
Inserted all records - used heap: 978473664 bytes
Cache: 100000 entries, heap size: **-160994184** bytes, inserts took 11082 ms
------------------------------------
Query 1 count: 100, time: 110 ms, heap size: 1037116472 bytes
Query 2 count: 100, time: 285 ms, heap size: 1037116472 bytes
Query 3 count: 100, time: 19 ms, heap size: 1037116472 bytes
Query 4 count: 100, time: 123 ms, heap size: 1037116472 bytes
------------------------------------
Query 1 count: 100, time: 10 ms, heap size: 1037116472 bytes
Query 2 count: 100, time: 116 ms, heap size: 1056692952 bytes
Query 3 count: 100, time: 6 ms, heap size: 1056692952 bytes
Query 4 count: 100, time: 119 ms, heap size: 1056692952 bytes
------------------------------------
[00:45:52] Ignite node stopped OK [uptime=00:00:36:515]
I will create another GitHub repo with the corrected code and post it here when I am more awake (coffee is not helping anymore).
Here is the benchmark source code: https://github.com/a-rog/px100data/tree/master/examples/HazelcastVsIgnite
It is part of the JDBC-ish NoSQL framework I mentioned earlier: Px100 Data
Building and running it:
cd <project-dir>
mvn clean package
cd target
java -cp "grid-benchmark.jar:lib/*" -Xms512m -Xmx3000m -Xss4m com.px100systems.platform.benchmark.HazelcastTest 100000
java -cp "grid-benchmark.jar:lib/*" -Xms512m -Xmx3000m -Xss4m com.px100systems.platform.benchmark.IgniteTest 100000
As you can see, I set the memory limits high to avoid garbage collection. You can also run my own framework test (see Px100DataTest.java) and compare to the two above, but let's concentrate on pure performance. Neither test uses Spring or anything else except for Hazelcast 3.5.1 and Ignite 1.3.3 - the latest at the moment.
The benchmark transactionally inserts the specified number of approximately 1 KB records (100,000 of them - you can increase it, but beware of memory) in batches (transactions) of 1000. Then it executes two queries, each with ascending and descending sort: four in total. All query fields and the ORDER BY field are indexed.
I am not going to post the entire class (download it from GitHub). The Hazelcast query looks like this:
PagingPredicate predicate = new PagingPredicate(
new Predicates.AndPredicate(new Predicates.LikePredicate("textField", "%Jane%"),
new Predicates.GreaterLessPredicate("id", first.getId(), false, false)),
(o1, o2) -> ((TestEntity)o1.getValue()).getId().compareTo(((TestEntity)o2.getValue()).getId()),
100);
The matching Ignite query:
SqlQuery<Object, TestEntity> query = new SqlQuery<>(TestEntity.class,
"FROM TestEntity WHERE textField LIKE '%Jane%' AND id > '" + first.getId() + "' ORDER BY id LIMIT 100");
query.setPageSize(100);
Here are the results executed on my 2012 8-core MBP with 8G of memory:
Hazelcast
Starting - used heap: 49791048 bytes
Inserting 100000 records: ....................................................................................................
Inserted all records - used heap: 580885264 bytes
Map: 100000 entries, used heap: 531094216 bytes, inserts took 5458 ms
Query 1 count: 100, time: 344 ms, heap size: 298844824 bytes
Query 2 count: 100, time: 115 ms, heap size: 454902648 bytes
Query 3 count: 100, time: 165 ms, heap size: 657153784 bytes
Query 4 count: 100, time: 106 ms, heap size: 811155544 bytes
Ignite
Starting - used heap: 100261632 bytes
Inserting 100000 records: ....................................................................................................
Inserted all records - used heap: 1241999968 bytes
Cache: 100000 entries, heap size: 1141738336 bytes, inserts took 14387 ms
Query 1 count: 100, time: 222 ms, heap size: 917907456 bytes
Query 2 count: 100, time: 128 ms, heap size: 926325264 bytes
Query 3 count: 100, time: 7 ms, heap size: 926325264 bytes
Query 4 count: 100, time: 103 ms, heap size: 934743064 bytes
One obvious difference is the insert performance - noticeable in real life. However, one very rarely inserts 1000 records at once. Typically it is a single insert or update (saving a user's entered data, etc.), so that doesn't bother me. The query performance, however, does: most data-centric business software is read-heavy.
Note the memory consumption: Ignite is much more RAM-hungry than Hazelcast, which may explain its better query performance. Then again, if I have decided to use an in-memory grid, should I really worry about memory?
You can clearly tell when the data grids hit indexes and when they don't, how they cache compiled queries (the 7 ms one), etc. I don't want to speculate and will let you play with it yourselves, and let the Hazelcast and Ignite developers provide some insight.
As far as general performance goes, it is comparable to, if not below, MySQL. IMO in-memory technology should do better. I am sure both companies will take note.
The results above are pretty close. However, when used within Px100 Data and the higher-level Px100 (which heavily relies on indexed "sort fields" for pagination), Ignite pulls ahead and is better suited for my framework. I care primarily about the query performance.
