What happens if raft-log-gc-size-limit is larger than region-split-size in TiKV

I have a TiKV setup where region-split-size is 96MB (the default) and raft-log-gc-size-limit is 144MB. How does this impact the TiKV cluster, given that the documentation says raft-log-gc-size-limit should be about 3/4 of region-split-size?

You may waste space storing Raft logs and waste network traffic restoring Raft state.
A Region is managed by a Raft group, and it splits when its data exceeds region-max-size. E.g., Region [a,e) may be split into Regions [a,b), [b,c), [c,d), [d,e), where the sizes of [a,b), [b,c), and [c,d) are each around region-split-size. So TiKV assumes the size of a snapshot is around region-split-size too.
An out-of-date peer restores state either from a snapshot or from Raft logs, and Raft logs are always preferred. There are two possible problems if we store too many Raft logs (more than region-split-size worth):
Wasted network traffic to restore state, since replaying that much log transfers more data than simply sending a snapshot would;
Wasted space storing Raft logs, since older Raft logs may never be fetched and sent.
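To make the sizing concrete, here is a hedged sketch of the relevant tikv.toml settings (the section names below match recent TiKV versions, but verify them against the docs for your version):

```toml
# Sketch only - confirm section and key names for your TiKV version.
[coprocessor]
region-split-size = "96MB"        # default split threshold

[raftstore]
# Keep this at roughly 3/4 of region-split-size (~72MB here, not 144MB),
# so logs are GC'd before replaying them costs more than a snapshot would.
raft-log-gc-size-limit = "72MB"
```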

Related

How is Raft linearizable?

I am pretty new to distributed systems and was wondering how the Raft consensus algorithm is linearizable. Raft commits log entries through a quorum. At the moment the Raft leader commits, more than half of the participants have the replicated log. But some participants may not have the latest logs, or may have them but not yet have received instructions to commit them.
Or does Raft's read linearizability require a read quorum?
Well, linearizability pertains to both reads and writes, and yes, both are accomplished with a quorum. To make reads linearizable, reads must be handled by the leader, and the leader must verify it has not been superseded by a newer leader after applying the read to the state machine but before responding to the client. In practice, though, many real-world implementations use relaxed consistency models for reads, e.g. allowing reads from followers. But note that while quorums guarantee linearizability for the Raft cluster, that doesn't mean client requests are linearizable. To extend linearizability to clients, sessions must be added to prevent dropped or duplicated client requests from producing multiple commits in the Raft log for a single logical operation, which would violate linearizability.
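To make the leader check concrete, here is a minimal Java sketch of that read path (essentially the ReadIndex approach from the Raft dissertation); all class, field, and method names are hypothetical, not taken from any particular library:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a Raft leader's linearizable read path ("ReadIndex").
// Quorum checking and log replication are stubbed out.
class LeaderReadSketch {
    private final AtomicLong commitIndex = new AtomicLong();
    private final AtomicLong appliedIndex = new AtomicLong();

    long linearizableRead() throws InterruptedException {
        long readIndex = commitIndex.get();    // capture the commit point

        // Hear from a quorum *after* capturing readIndex; otherwise a newer
        // leader may have been elected and our answer could be stale.
        if (!confirmLeadershipWithQuorum()) {
            throw new IllegalStateException("not leader; client should retry");
        }

        // Wait until the state machine has applied everything up to readIndex.
        while (appliedIndex.get() < readIndex) {
            Thread.sleep(1);
        }
        return queryStateMachine();            // now safe to answer the client
    }

    private boolean confirmLeadershipWithQuorum() { return true; }  // stub
    private long queryStateMachine() { return appliedIndex.get(); } // stub
}
```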
kuujo has explained what linearizability is. I'll answer the other doubt in your question.
But some participants may not have the latest logs, or may have them but not yet have received instructions to commit those logs.
This is possible: right after a leader commits a log entry, some peers may not have that entry yet. But the other peers will get it eventually; Raft does several things to guarantee that:
Even if the leader has committed a log entry (say, the entry at index=8, logEntry:8) and answered the client, background routines keep trying to sync logEntry:8 to the other peers.
If an RPC fails (which is possible): Raft sends heartbeats periodically (a kind of AppendEntries), and these heartbeat RPCs carry any logs the leader has that a follower is missing. Once a follower has logEntry:8, it compares its local commitIndex with the leaderCommit in the RPC arguments to decide whether it should commit logEntry:8.
If the leader fails immediately after committing the entry (rare but possible): by the election rule, only a candidate that has logEntry:8 can win the election. After a new leader is elected, it continues using heartbeats to sync logEntry:8 to the other peers.
What happens if a follower falls so far behind that it cannot get all the logs from the leader? (This happens a lot when you add a new node.) In this scenario, Raft uses a snapshot RPC mechanism to transfer the full state, covering the log entries the leader has already trimmed.
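To illustrate the commit rule in the heartbeat path, here is a hypothetical Java sketch of the follower side of AppendEntries (names are illustrative, not from any specific Raft implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical follower-side AppendEntries handling, showing how
// leaderCommit advances the follower's commitIndex.
class FollowerSketch {
    static class Entry { long index; byte[] data; }

    private final List<Entry> log = new ArrayList<>();
    private long commitIndex;

    void handleAppendEntries(List<Entry> entries, long leaderCommit) {
        appendMissingEntries(entries);        // e.g. logEntry:8 arrives here

        // Commit everything the leader has committed that we now hold locally.
        if (leaderCommit > commitIndex) {
            commitIndex = Math.min(leaderCommit, lastLogIndex());
            applyUpTo(commitIndex);           // apply newly committed entries
        }
    }

    private void appendMissingEntries(List<Entry> entries) { /* dedupe + append */ }
    private long lastLogIndex() { return log.isEmpty() ? 0 : log.get(log.size() - 1).index; }
    private void applyUpTo(long index) { /* apply to the state machine */ }
}
```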

How to make a Flink job with huge state finish

We are running a Flink cluster to process terabytes of historic streaming data. The computation keeps a huge state, for which we use keyed state (ValueState and MapState) with the RocksDB backend. At some point in the job the performance starts degrading, and input and output rates drop to almost 0. At this point exceptions like "Communication with TaskManager X timeout error" can be seen in the logs, although the job is compromised even before that.
I presume the problem we are facing has to do with RocksDB's disk backend. As the state of the job grows, it needs to access the disk more often, which drags the performance down to 0. We have played with some of the options and have set those which make sense for our particular setup:
We are using the SPINNING_DISK_OPTIMIZED_HIGH_MEM predefined profile, further tuned with optimizeFiltersForHits and some other options, which has somewhat improved performance. However, none of this provides a stable computation, and on a re-run against a bigger data set the job halts again.
What we are looking for is a way to modify the job so that it progresses at SOME speed even as the input and the state grow. We are running on AWS with limits set to around 15 GB per TaskManager and no limit on disk space.
Using SPINNING_DISK_OPTIMIZED_HIGH_MEM costs a lot of off-heap memory for RocksDB's memtables. Seeing as you are running the job with a memory limit of around 15GB, I think you will hit OOM issues; but if you choose the default predefined profile, you will face write stalls or CPU overhead from decompressing RocksDB's cached blocks. So I think you should increase the memory limit.
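For reference, a minimal sketch of how such a profile is typically wired in (this assumes the Flink 1.x RocksDBStateBackend API; the checkpoint URI is a placeholder):

```java
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbProfileSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Incremental checkpoints reduce the data written per checkpoint.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("hdfs:///flink/checkpoints", true);
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
        env.setStateBackend(backend);
    }
}
```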
And here are some posts about RocksDB, FYI:
https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
https://www.ververica.com/blog/manage-rocksdb-memory-size-apache-flink

Clarification for State Backend

I have been reading the Flink docs and need a few clarifications. Hopefully someone can help me out here.
State Backend - This basically refers to the location where the data for my operations will be stored; for example, if I'm doing an aggregation over a 2-hour window, where will the buffered data for that window be stored? As pointed out in the docs, for a large state we should use RocksDB.
The RocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager data directories
Does in-flight data here refer to the incoming data, say from a Kafka stream, that has not yet been checkpointed?
Upon checkpointing, the whole RocksDB database will be checkpointed into the configured file system and directory. Minimal metadata is stored in the JobManager’s memory
When using RocksDB, when a checkpoint is created, the entire buffered state is stored on disk. Then, say, when the window is triggered at the end of the 2 hours, will this state, which was stored on disk, be deserialised and used for the operation?
Note that the amount of state that you can keep is only limited by the amount of disk space available
Does this mean that I could run an analytical query on a potentially high-throughput stream with very limited resources? Suppose my Kafka stream has a rate of 50k messages/sec; could I then run the job on a single core of my EMR cluster, the tradeoff being that Flink won't be able to keep up with the incoming rate and will lag, but given enough disk space it won't hit an OOM error?
When a checkpoint is completed, I assume that the aggregated checkpoint metadata (like the HDFS or S3 path from each TM) from all the TMs will be sent to the JM? In case of TM failure, the JM will spin up a new JM and restore the state from the last checkpoint.
The default setting for JM in flink-conf.yaml - jobmanager.heap.size: 1024m.
My confusion here is why the JM needs 1GB of heap memory. What does a JM handle apart from synchronisation among the TMs? And how do I actually decide how much memory should be configured for the JM in production?
Can someone verify whether my understanding is correct and point me in the right direction? Thanks in advance!
Overall your understanding appears to be correct. One point: in the case of a TM failure, the JM will spin up a new TM and restore the state from the last checkpoint (rather than spinning up a new JM).
But to be a bit more precise, in the last few releases of Flink, what used to be a monolithic job manager has been refactored into separate components: a dispatcher that receives jobs from clients and starts new job managers as needed; a job manager that is only concerned with providing services to a single job; and a resource manager that starts up new TMs as needed. The resource manager is the only component that is cluster-framework specific -- e.g., there is a YARN resource manager.
The job manager has other roles as well -- it is the checkpoint coordinator and the API endpoint for the web UI and metrics.
How much heap the JM needs is somewhat variable. The defaults were chosen to try to cover more than a narrow set of situations, and to work out of the box. Also, by default, checkpoints go to the JM heap, so it needs some space for that. If you have a small cluster and are checkpointing to a distributed filesystem, you should be able to get by with less than 1GB.
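As a sketch of that last point (pre-Flink-1.13 API; the HDFS path is a placeholder), checkpointing to a distributed filesystem keeps snapshot data off the JM heap:

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToDfsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Snapshots go to HDFS; only minimal metadata stays with the JM.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
        env.enableCheckpointing(60_000); // checkpoint every 60 seconds
    }
}
```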

Are there databases that base durability on redundancy rather than on persistent storage?

Sorry that the title isn't exactly obvious, but I couldn't word it better.
We are right now using a conventional DB (oracle) as our job queue, and these "jobs" are consumed by some number of nodes (machines). So the DB server gets hit by these nodes, and we have to pay a lot for the software and hardware for this database server.
Now, it occurred to me the other day that,
1) There are already multiple nodes in the system
2) "Jobs" may not be lost because of node failures, but there is no reason they have to be sitting in a secondary storage (no reason why they couldn't reside in memory, as long as they are not lost)
Given this, couldn't one retain these jobs in-memory, making sure that at least n number of copies of this job is present in the entire cluster, thereby getting rid of the DB server?
Are such technologies available?
Did you take a look at Gigaspaces? At internet scale, you do not need to persist at all. You just have to know sufficient copies are around. If you have low-latency connections to places that are not on the same power grid (or that have battery power), pushing your transactions out to the duplicates is enough.
If you're only looking at storing up to a few terabytes of data, and you're looking for redundancy vs. disk recoverability, then take a look at Oracle Coherence. For example:
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
Datacenter continuous availability (CA). Kill a data center, no data loss.
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
It depends on how much you expect these technologies to do for you. There are loads of basic in-memory databases (SQLite, Redis, etc.), and you can use normal database replication techniques, with multiple slaves in multiple data centers, to pretty much ensure durability without persistence.
If you're storing in memory, you're likely going to run out of space and require horizontal partitioning (sharding); you may want to check out something like VoltDB if you want to stick with SQL.
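As one concrete shape of the "n copies in memory" idea, here is a hedged Java sketch using Redis with the WAIT command (WAIT is a real Redis command; the Jedis client, host name, and queue key here are placeholder assumptions):

```java
import redis.clients.jedis.Jedis;

public class ReplicatedEnqueueSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("redis-primary", 6379)) {
            jedis.lpush("job-queue", "job-42-payload");  // job lives only in RAM
            // Block until at least 2 replicas acknowledge the write (1s timeout).
            long acked = jedis.waitReplicas(2, 1000);
            if (acked < 2) {
                System.err.println("only " + acked + " replicas have the job; retry or fail over");
            }
        }
    }
}
```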

How do in-memory databases provide durability?

More specifically, are there any databases that don't require secondary storage (e.g. HDD) to provide durability?
Note: This is a follow-up to my earlier question.
If you want persistence of transactions, writing to persistent storage is the only real option (you probably do not want to build many clusters with independent power supplies in independent data centers and still pray that they never fail simultaneously). On the other hand, it depends on how valuable your data is. If it is dispensable, then a pure in-memory DB with sufficient replication may be appropriate. BTW, even an HDD may fail after you store your data on it, so there is no ideal solution. You may look at http://www.julianbrowne.com/article/viewer/brewers-cap-theorem to choose replication tradeoffs.
Prevayler http://prevayler.org/ is an example of an in-memory system backed by persistent storage (and the code is extremely simple, BTW). Durability is provided via transaction logs that are persisted on an appropriate device (e.g. HDD or SSD). Each transaction that modifies data is written to the log, and the log is used to restore DB state after a power failure or database/system restart. Aside from Prevayler, I have seen a similar scheme used to persist message queues.
This is indeed similar to how a "classic" RDBMS works, except that the logs are the only data written to the underlying storage. The logs can also be used for replication, so you may send one copy of the log to a live replica and another to the HDD. Various combinations are possible, of course.
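To show the scheme rather than Prevayler's actual API, here is an illustrative Java sketch of such a transaction log: a change counts as durable once it has been forced to the journal, and only then is the in-memory state mutated:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative journal, not Prevayler's API: length-prefixed records,
// fsync'd before the in-memory state is touched.
public class JournalSketch {
    private final FileOutputStream file;
    private final DataOutputStream out;

    public JournalSketch(String path) throws IOException {
        this.file = new FileOutputStream(path, true); // append mode
        this.out = new DataOutputStream(file);
    }

    public void execute(byte[] serializedTransaction, Runnable applyToMemory) throws IOException {
        out.writeInt(serializedTransaction.length);  // record header
        out.write(serializedTransaction);
        out.flush();
        file.getFD().sync();   // durability point: the record is on the device
        applyToMemory.run();   // now mutate the in-memory state
        // After a crash, replaying the journal in order rebuilds the state.
    }
}
```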
All databases require non-volatile storage to ensure durability. A memory image does not provide a durable storage medium: very shortly after you lose power, your memory image becomes invalid. Likewise, as soon as the database process terminates, the operating system releases the memory containing the in-memory image. In either case, you lose your database contents.
Until changes have been written to non-volatile memory, they are not truly durable. This may consist of either writing all the data changes to disk or writing a journal of the changes being made.
In space- or size-critical situations, non-volatile memory such as flash can be substituted for an HDD. However, flash is reported to have issues with the number of write cycles it can sustain.
Having reviewed your previous post: multi-server replication would work as long as you can keep the last server running. As soon as it goes down, you lose your queue. However, there are a number of alternatives to Oracle that could be considered.
PDAs often use battery-backed memory to store their databases. These databases become non-durable once the battery runs down, so backups are important.
In-memory means all the data is stored in memory so it can be accessed. When data is read, it can be read either from disk or from memory; in the case of in-memory databases, it is always retrieved from memory. However, if the server is turned off suddenly, the data is lost. Hence, in-memory databases are said to lack support for the durability part of ACID. Still, many databases implement different techniques to achieve durability. These techniques are listed below.
Snapshotting - Record the state of the database at a given moment in time. In the case of Redis, for example, the dataset is snapshotted to disk at configured intervals for durability.
Transaction Logging - Changes to the database are recorded in a journal file, which facilitates automatic recovery.
Use of NVRAM, usually in the form of static RAM backed up by battery power. In this case, data can be recovered after a reboot from its last consistent state.
A classic in-memory database can't provide classic durability, but depending on what your requirements are, you can:
use memcached (or similar) to store data in memory across enough nodes that it's unlikely the data is lost;
store your Oracle database on a SAN-based filesystem and give it enough RAM (say 3GB) that the whole database is in RAM, so disk seek latency never slows your application down; the SAN then takes care of delayed write-back of the cache contents to disk. This is a very expensive option, but it is common in places where high performance and high availability are needed and can be afforded;
if you can't afford a SAN, mount a RAM disk, install your database on it, and use DB-level replication (like log shipping) to provide failover.
Any reason why you don't want to use persistent storage?
