ClickHouse - How to remove node from cluster for reading?

Background
I'm beginning work to set up a ClickHouse cluster with 3 CH nodes. The first node (Node A) would be write-only, and the remaining 2 (Nodes B + C) would be read-only. By this I mean that writes for a given table to Node A would automatically replicate to Nodes B + C. When querying the cluster, reads would only be resolved against Nodes B + C.
The purpose for doing this is twofold:
This datastore serves both real-time and background jobs. Both are high volume, but only on the read side, so it makes sense to segment the traffic. Node A would be used for writing to the cluster and for all background reads; Nodes B + C would be used strictly for the UX.
The volume of writes is very low, perhaps 1 write per 10,000 reads. Data is entirely refreshed once per week. Background jobs need to be certain that the most current data is being read before they can be kicked off. Reading off of replicas introduces eventual consistency as a concern, so reading directly from Node A (rather than through the cluster) guarantees the data is strongly consistent.
Question
I'm not finding much specific information in the CH documentation, and am wondering whether this might be possible. If so, what would the cluster configuration look like?

Yes, it is possible to do so. But wouldn't the best solution be to spread reads and writes across every server using a Distributed table?
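For what it's worth, here is a minimal sketch of one way to get this traffic split at the application level. It assumes the third-party clickhouse-driver Python package and hypothetical hostnames; replication between the nodes would still be configured on the server side, and a Distributed table over Nodes B + C could replace the client-side read routing.

```python
# Application-level routing sketch: writes go to Node A, reads go to Nodes B/C.
import random
from clickhouse_driver import Client

write_client = Client(host="node-a")                           # writes + strongly consistent background reads
read_clients = [Client(host="node-b"), Client(host="node-c")]  # UX reads hit the replicas only

def write(query, data=None):
    # INSERTs and DDL are sent only to Node A; replication to B/C happens server-side.
    return write_client.execute(query, data)

def read(query):
    # UX reads are spread across the replica nodes.
    return random.choice(read_clients).execute(query)

# Hypothetical usage:
# write("INSERT INTO events (id) VALUES", [(1,), (2,)])
# rows = read("SELECT count() FROM events")
```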

Related

What to pick? Quorum OR Latest timestamp data in Cassandra

I was reading about Cassandra and learned about the quorum concept (i.e., if a particular key is stored on multiple nodes/replicas, then during a read operation the value held by the majority of those replicas is chosen and returned) for dealing with consistency on reads.
My doubt may be silly, but I cannot see how the quorum concept helps in a case where the majority value is different from the value with the latest timestamp.
How do we then decide which value to return?
Example:
for a particular key "key1"
timestamps: t1 > t2
5 replicas
replica0 (the main node) is down
replica - value - timestamp
replica1 - value1 - t1
replica2 - value2 - t2
replica3 - value2 - t2
replica4 - value2 - t2
So in the above case, what should we return: the majority value (value2) or the latest-timestamp value (value1)?
Can someone please help?
In Cassandra the last write always wins. That means that for (a1, t1) and (a2, t2) with t2 > t1, value a2 will be considered the right one.
Regarding your question, a QUORUM read on its own is not that useful. That is because in order to have full consistency, the following rule must be followed:
RC + WC > RF
(RC = read consistency level; WC = write consistency level; RF = replication factor)
In your case (when a majority of replicas have the old data), QUORUM will increase the chance of getting the right data, but it won't guarantee it.
The most common use case is using quorum for both reads and writes. That would mean that, for an RF of 5, 3 nodes would have the right value. Now, if we also read from 3 nodes, then at least one of those 3 must have the newer value (since at most 2 can hold the old value).
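To make the RC + WC > RF rule concrete, here are a few illustrative lines (plain Python, not Cassandra code):

```python
def read_sees_latest_write(rc, wc, rf):
    # Strong consistency requires the read set and write set to overlap in at least one replica.
    return rc + wc > rf

rf = 5
quorum = rf // 2 + 1                               # 3 when RF = 5

print(read_sees_latest_write(quorum, quorum, rf))  # True:  3 + 3 > 5 (quorum reads + quorum writes)
print(read_sees_latest_write(quorum, 1, rf))       # False: 3 + 1 <= 5 (quorum reads after writes at ONE)
```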
Regarding how reading works: when you ask for quorum on an RF of 5, the coordinator node will ask one node for the actual data and 2 nodes for a digest of that data. The coordinator then compares the digest from the first node (the actual data) with the other 2 digests. If they match, all is good and the data from the first node is returned. If they differ, a read repair is triggered, meaning the data will be updated across all available nodes.
So if you write with consistency ONE on an RF of 5, not only do you risk getting old data even with a quorum read, but if something happens to the node that had the good data, you could lose it altogether. Finding the balance depends on the particular use case. If in doubt, use quorum for both reads and writes.
Hope this made sense,
Cheers!
Quorum just means that a majority of the nodes should provide the answer. But the answers could have different timestamps, so the coordinator node will select the answer with the latest timestamp to send to the client, and at the same time will trigger a repair operation for the nodes that have the old data.
But in your situation you may still get the old answer, because with RF=5 the quorum is 3 nodes, and the coordinator can pick up results from replicas 2-4, which have the old data. You'll get the newest result only if the coordinator includes replica 1 in the list of queried nodes.
P.S. In Cassandra there is no main replica - all replicas are equal.
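To make the resolution rule concrete, here is a small illustrative sketch (plain Python, not Cassandra internals) using the replica values from the question:

```python
# replica1 holds the newer value (t1 > t2); replicas 2-4 hold the older one.
responses = {
    "replica1": ("value1", 2),   # timestamp t1 (newer)
    "replica2": ("value2", 1),   # timestamp t2 (older)
    "replica3": ("value2", 1),
    "replica4": ("value2", 1),
}

def quorum_read(contacted):
    # Last write wins, but only among the replicas the coordinator actually contacted.
    return max((responses[r] for r in contacted), key=lambda vt: vt[1])[0]

print(quorum_read(["replica2", "replica3", "replica4"]))  # 'value2' - quorum reached, yet stale
print(quorum_read(["replica1", "replica2", "replica3"]))  # 'value1' - replica1 included, newest wins
```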

What to do when nodes in a Cassandra cluster reach their limit?

I am studying up Cassandra and in the process of setting up a cluster for a project that I'm working on. Consider this example :
Say I set up a 5-node cluster with 200 GB of space on each node. That adds up to 1000 GB (roughly 1 TB) of space overall. Assuming that my partitions are equally split across the cluster, I can easily add nodes and achieve linear scalability. However, what if these 5 nodes start approaching the SSD limit of 200 GB? In that case, I can add 5 more nodes, and the partitions would then be split across 10 nodes. But the older nodes would still be accepting writes, as they are part of the cluster. Is there a way to make these 5 older nodes 'read-only'? I want to shoot off random read queries across the entire cluster, but don't want to write to the older nodes anymore (as they are capped by the 200 GB limit).
Help would be greatly appreciated. Thank you.
Note: I can say that 99% of the queries will be write queries, with 1% or less for reads. The app has to persist click events in Cassandra.
Usually when a cluster reaches its limit, we add a new node to the cluster. After adding the new node, the old Cassandra nodes distribute some of their data to it, and after that we run nodetool cleanup on every node to clean up the data that was moved to the new node. The entire scenario happens within a single DC.
For example:
Suppose you have 3 nodes (A, B, C) in DC1 and 1 node (D) in DC2. Your nodes are reaching their limit, so you decide to add a new node (E) to DC1. Nodes A, B, and C will distribute some of their data to node E, and we then run nodetool cleanup on A, B, and C to reclaim the space.
I had a bit of trouble understanding the question properly.
I am assuming you know that by adding 5 new nodes, some of the data load will be transferred to them, since some token ranges will be assigned to the new nodes.
Given that, if you are concerned that the old 5 nodes would not be able to accept writes because they have reached their limit, that is not going to happen: the new nodes share the data load, so the old nodes now have free space for further writes.
Isolating reads and writes to specific nodes is a different problem altogether. But if you want reads to go only to these 5 old nodes and writes only to the 5 new nodes, the best way to do this is to add the 5 new nodes as another datacenter under the same cluster and then use different consistency levels for reads and writes to effectively make the old datacenter read-only.
But the new datacenter will not lighten the data load of the first one; it will take on the same load itself. (So you would need more than 5 new nodes to solve both problems at once: some nodes to lighten the load, and others to isolate reads from writes by forming a new datacenter with them; that new datacenter should also have more than 5 nodes.) Best practice is to monitor the data load and fix it before such a problem happens, by adding new nodes or increasing the storage limit.
Having done that, you will also need to ensure that the nodes you use as contact points for reads and writes are in different datacenters.
Consider you have following situation :
dc1(n1, n2, n3, n4, n5)
dc2(n6, n7, n8, n9, n10)
Now, for reads you connect to node n1 and for writes you connect to node n6.
The read/write isolation can then be achieved by choosing the right consistency level from the options below:
LOCAL_QUORUM
or
LOCAL_ONE
These basically confine the search for replicas to the local datacenter only.
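As a hedged sketch of what this could look like with the DataStax Python driver (keyspace, table, and column names are hypothetical, and the legacy load_balancing_policy argument is used for brevity; newer driver versions prefer execution profiles):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.query import SimpleStatement

# One session pinned to each datacenter, using the node names from the example above.
read_cluster = Cluster(["n1"], load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc1"))
write_cluster = Cluster(["n6"], load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc2"))
read_session = read_cluster.connect("my_keyspace")
write_session = write_cluster.connect("my_keyspace")

# LOCAL_* consistency levels are satisfied by replicas in the session's local datacenter only.
read_stmt = SimpleStatement(
    "SELECT * FROM clicks WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
write_stmt = SimpleStatement(
    "INSERT INTO clicks (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_ONE)

some_id = 42
rows = read_session.execute(read_stmt, (some_id,))
write_session.execute(write_stmt, (some_id, "click-event"))
```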
Look at these references for more :
Adding a datacenter to a cluster
and
Consistency Levels

Cassandra performance issue/understanding

We have 3 Cassandra nodes and are facing a problem where our application runs quite slowly most of the time. We do many writes per second, and that could be one of the reasons. Perhaps what we are doing wrong is that we are continuously writing to one node (let's say Node 1) and reading (through Solr) from the same node (Node 1). So a possible solution we are considering is to build some logic to change the connection string so that we write to all three nodes and read from all three nodes. Will this work, or is there a better way?
We have 100 GB of storage on each of the nodes and approximately 10 GB of data on each node. We also saw a case where Node 1, the node we are writing to, holds more data while the other 2 nodes hold less. This is something else we are trying to figure out.

Apache Giraph Graph Partitioning.... Can a partition p1 reside partially in worker w1 and partially in worker w2?

I am a newbie in Apache Giraph. My question is related to Giraph graph partitioning. As far as I know, Giraph partitions the large graph randomly, possibly with #partitions > #workers in order to balance load. But my question is: is #partitions per worker always an integer? Put another way, can it happen that a partition (say p1) resides partially in worker w1 and partially in worker w2? Or should p1 be entirely in either w1 or w2?
A partition in Giraph refers to a vertex partition, not a graph partition. For example, if a graph has 10 vertices numbered 1 to 10, then a possible partitioning would be {1,2,3}, {4,5,6}, {7,8,9,10}. Each partition knows where its outgoing edges point. Each worker creates a thread for each partition assigned to it; the thread iterates over each vertex in the partition and executes the compute function.
So with this information I would say a partition has to reside on a single worker entirely.
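As a toy sketch (plain Python, not Giraph code) of hash-based vertex-to-partition and partition-to-worker assignment, illustrating that every vertex belongs to exactly one partition and every partition lives on exactly one worker:

```python
NUM_WORKERS = 2
NUM_PARTITIONS = 6      # often a small multiple of the worker count

def partition_of(vertex_id):
    # Each vertex hashes to exactly one partition.
    return hash(vertex_id) % NUM_PARTITIONS

def worker_of(partition_id):
    # Each partition is assigned to exactly one worker; it is never split across two.
    return partition_id % NUM_WORKERS

for v in range(1, 11):  # the 10 vertices from the example above
    p = partition_of(v)
    print(f"vertex {v} -> partition {p} -> worker {worker_of(p)}")
```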
Hello @zahorak,
If Giraph implements Pregel as-is, then as per the Pregel paper it is not necessary to have #partitions == #workers. It says:
The master determines how many partitions the graph will have, and assigns one or more partitions to each worker machine. The number may be controlled by the user. Having more than one partition per worker allows parallelism among the partitions and better load balancing, and will usually improve performance.
UPDATE: I found a similar question on the Giraph user mailing list. The answers given in the replies might be helpful. Here is the link to the thread: https://www.mail-archive.com/user@giraph.apache.org/msg01869.html
AFAIK no; actually, I would have said #partitions == #workers.
The reason for partitioning is to handle parts of the graph on one server each. After the superstep is executed, messages sent to other partitions are exchanged between the servers within the cluster.
Maybe you understand something different by the term partitioning than I do, but for me partitioning means:
Giraph runs on a cluster with multiple servers; in order to leverage all of them, it needs to partition the graph. It then simply assigns each vertex randomly to one of the n servers. Out of this you get n partitions, and the vertices within each partition are processed by the one server they were assigned to, and no other.

Sorted String Table (SSTable) or B+ Tree for a Database Index?

Using two databases to illustrate this example: CouchDB and Cassandra.
CouchDB
CouchDB uses a B+ tree for document indexes (with a clever modification to work in its append-only environment). More specifically, as documents are modified (insert/update/delete), they are appended to the running database file, along with the full leaf-to-root path of the B+ tree nodes affected by the updated revision, written right after the document.
These piecemeal index revisions are inlined right alongside the modifications, such that the full index is a union of the most recent index modifications appended at the end of the file and the additional pieces further back in the data file that are still relevant and haven't been modified yet.
Searching the B+ tree is O(log n).
Cassandra
Cassandra keeps record keys sorted, in memory, in tables (let's think of them as arrays for this question) and writes them out as separate sorted-string tables (SSTables) from time to time.
We can think of the collection of all of these tables as the "index" (from what I understand).
Cassandra is required to compact/combine these sorted-string tables from time to time, creating a more complete file representation of the index.
Searching a sorted array is O(log n).
Question
Assuming a similar level of complexity between maintaining partial B+ tree chunks in CouchDB and partial sorted-string indices in Cassandra, and given that both provide O(log n) search time, which one do you think would make a better representation of a database index, and why?
I am specifically curious if there is an implementation detail about one over the other that makes it particularly attractive or if they are both a wash and you just pick whichever data structure you prefer to work with/makes more sense to the developer.
Thank you for the thoughts.
When comparing a BTree index to an SSTable index, you should consider the write complexity:
When writing randomly to a copy-on-write BTree, you will incur random reads (to copy the leaf node and its path). So while the writes may be sequential on disk, for datasets larger than RAM these random reads will quickly become the bottleneck. For an SSTable-like index, no such read occurs on write - there are only the sequential writes.
You should also consider that in the worst case, every update to a BTree could incur log_B N IOs - that is, you could end up writing 3 or 4 blocks for every key. If the key size is much smaller than the block size, this is extremely expensive. For an SSTable-like index, each write IO will contain as many fresh keys as it can, so the IO cost per key is more like 1/B.
In practice, this makes SSTable-like indexes thousands of times faster (for random writes) than BTrees.
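A rough back-of-the-envelope comparison of the per-key write cost described above, with assumed (illustrative) numbers:

```python
import math

N = 100_000_000   # keys in the index (assumed)
B = 100           # keys per block/page (assumed)

# Worst case for a BTree update: one block rewritten per tree level touched.
btree_ios_per_key = math.log(N) / math.log(B)   # log_B N, about 4 here

# An SSTable flush packs roughly B fresh keys into each sequentially written block.
sstable_ios_per_key = 1 / B                     # 0.01

print(round(btree_ios_per_key, 2))                     # ~4.0 block writes per key
print(sstable_ios_per_key)                             # 0.01 block writes per key
print(round(btree_ios_per_key / sstable_ios_per_key))  # ~400x more write IO per key for the BTree
```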
When considering implementation details, we have found it a lot easier to implement SSTable-like indexes (almost) lock-free, whereas locking strategies for BTrees have become quite complicated.
You should also reconsider your read costs. You are correct that a BTree is O(log_B N) random IOs for random point reads, but an SSTable-like index is actually O(#sstables · log_B N). Without a decent merge scheme, #sstables is proportional to N. There are various tricks to get around this (using Bloom filters, for instance), but these don't help with small, random range queries. This is what we found with Cassandra:
Cassandra under heavy write load
This is why Castle, our (GPL) storage engine, does merges slightly differently and can achieve much better range query performance (O(log^2 N)) with a slight trade-off in write performance (O(log^2 N / B)). In practice we find it to be quicker than Cassandra's SSTable index for writes as well.
If you want to know more about this, I've given a talk about how it works:
podcast
slides
Some things that should also be mentioned about each approach:
B-trees
The read/write operations are supposed to be logarithmic, O(log n). However, a single database write can lead to multiple writes in the storage system. For example, when a node is full it has to be split, which means 2 writes for the 2 new nodes and 1 additional write for updating the parent node. You can see how that could grow if the parent node was also full.
Usually, B-trees are stored in such a way that each node has the size of a page. This creates a phenomenon called write amplification, where even if a single byte needs to be updated, a whole page is written (see the small numeric sketch below).
Writes are usually random (not sequential), and thus slower, especially on magnetic disks.
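A tiny numeric illustration of the write amplification point above (page size and update size are assumed):

```python
PAGE_SIZE = 4096      # bytes per B-tree node/page (assumed)
UPDATED_BYTES = 1     # even a one-byte change rewrites the whole page

write_amplification = PAGE_SIZE / UPDATED_BYTES   # 4096x for this update
split_pages = 3                                   # 2 new nodes + 1 parent update, as described above
bytes_written_on_split = split_pages * PAGE_SIZE  # 12288 bytes to insert one key that triggers a split

print(write_amplification, bytes_written_on_split)
```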
SSTables
SSTables are usually used in the following approach. There is an in-memory structure, called memtable, as you described. Every once in a while, this structure is flushed to disk to an SSTable. As a result, all the writes go to the memtable, but the reads might not be in the current memtable, in which case they are searched in the persisted SSTables.
As a result, writes are O(log n). However, always bear in mind that they are done in memory, so they should be orders of magnitude faster than the logarithmic on-disk operations of B-trees. For the sake of completeness, we should mention that writes also go to a write-ahead log for crash recovery. But, given that these are all sequential writes, they are expected to be much more efficient than the random writes of B-trees.
When served from memory (from the memtable), reads are expected to be much faster as well. But when there's a need to look in the older, disk-based SSTables, reads can potentially become quite a bit slower than with B-trees. There are several optimisations around that, such as the use of Bloom filters, to check whether an SSTable contains a value without performing disk reads (a minimal sketch of this read/write path follows below).
As you mentioned, there's also a background process, called compaction, used to merge SSTables. This helps remove deleted values and prevent fragmentation, but it can cause significant write load, affecting the write throughput of the incoming operations.
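As a rough illustration of the write path, read path, and compaction described above, here is a minimal in-memory LSM sketch (plain Python; no write-ahead log, Bloom filters, or realistic compaction policy):

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}          # most recent writes, held in memory
        self.sstables = []          # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                      # in-memory write
        if len(self.memtable) >= self.memtable_limit:   # flush threshold reached
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                        # newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):             # then newer SSTables before older ones
            i = bisect.bisect_left(run, (key,))         # binary search within a sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Background merge: collapse all runs into one, keeping the newest value per key.
        merged = {}
        for run in self.sstables:                       # oldest first, so newer values overwrite
            merged.update(run)
        self.sstables = [sorted(merged.items())] if merged else []

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.put("k3", "updated")
print(db.get("k3"))   # 'updated' (served from the memtable)
db.compact()
print(db.get("k7"))   # 7 (served from the single compacted SSTable)
```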
As it becomes evident, a comparison between these 2 approaches is much more complicated. In an extremely simplified attempt to provide a concrete comparison, I think we could say that:
SSTables provide a much better write throughput than B-trees. However, they are expected to have less stable behaviour, because of ongoing compactions. An example of this can be seen in this benchmark comparison.
B-trees are usually preferred for use cases where transaction semantics are needed. This is because each key can be found in only a single place (in contrast to the SSTable approach, where a key can exist in multiple SSTables, with obsolete values in some of them), and also because a range of values can be represented as part of the tree. This means that it's easier to implement key-level and range-level locking mechanisms.
References
[1] A Performance Comparison of LevelDB and MySQL
[2] Designing Data-intensive Applications
I think fractal trees, as used by Tokutek, are a better index for a database. They offer real-world 20x to 80x improvements over b-trees.
There are excellent explanations of how fractal tree indices work here.
LSM-trees are better than B-trees as a storage-engine structure.
In a way, they convert random writes into append-only (AOF-style) sequential writes.
Here is a LSM-Tree src:
https://github.com/shuttler/lsmtree
