Why Paxos acceptors must send back any value they have already accepted

I'm taking the MIT 6.824 class and I have a question about Paxos. When a proposer sends a Prepare to an acceptor, the acceptor returns a prepare_ok carrying the n and v of the highest-numbered proposal it has accepted. I wonder: why does the acceptor need to return n and v?

In a nutshell, the acceptor must return v because, if the value was already committed, the new proposer needs to know what it is. There is no global "is_committed" flag, so the proposer gathers these values and their associated rounds from at least a majority of the acceptors. The proposer then sends the value with the highest round to all the acceptors.
As you can see, a proposer that receives a value from an acceptor always finishes what another proposer started. This is somewhat similar to many wait-free algorithms.
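To make this concrete, here is a minimal Java sketch (not from the 6.824 labs; Promise and chooseValue are invented names) of the proposer-side rule described above: among the prepare_ok replies from a majority, adopt the value with the highest accepted round, and fall back to your own value only if none of the replies carries one.

import java.util.List;

// A prepare_ok reply: the round and value this acceptor has already accepted,
// or acceptedValue == null if it has not accepted anything yet.
record Promise(long acceptedRound, Object acceptedValue) {}

class Proposer {
    // Given prepare_ok replies from a majority of acceptors, decide which
    // value must be used in the accept phase.
    Object chooseValue(List<Promise> majorityPromises, Object myOwnValue) {
        long bestRound = -1;
        Object bestValue = null;
        for (Promise p : majorityPromises) {
            if (p.acceptedValue() != null && p.acceptedRound() > bestRound) {
                bestRound = p.acceptedRound();
                bestValue = p.acceptedValue();
            }
        }
        // If any acceptor already accepted a value, finish that proposal;
        // only when no value was reported is the proposer free to use its own.
        return bestValue != null ? bestValue : myOwnValue;
    }
}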

Related

Can Flink Interval Join share a stream?

I'm trying to understand the IntervalJoin operation in Flink and have a question.
Let's assume we have three streams A, B and C.
Here, we interval-join two pairs of streams: A-C and B-C.
In Java code, it would look like this:
// join stream A and stream C
SingleOutputStreamOperator<SensorReadingOutput> joined1 = A
.intervalJoin(C)
.between(Time.seconds(-1), Time.seconds(0))
.process(new IntervalJoinFunction());
// join stream B and stream C
SingleOutputStreamOperator<SensorReadingOutput> joined2 = B
.intervalJoin(C)
.between(Time.seconds(-1), Time.seconds(0))
.process(new IntervalJoinFunction());
As you can see, stream C is joined twice.
Here, can stream C be shared between the two streams A and B?
That is, does stream C exist as a single stream, or is it duplicated (copied) for each of A and B?
I am confused because of two points about the IntervalJoin operation:
1. Every time we call .process at the end of an interval join, a new IntervalJoinOperator is created. I think stream C would be copied.
2. In IntervalJoinOperator, records are cleaned up by an internal timer service that is triggered by the event-time watermark. Streams A and B would have different watermarks, which I think would affect stream C's retention period, so stream C should be copied and managed individually.
However, when I wrote test code to check whether records from the three streams with the same key are collected in the same task instance, they are.
Does anybody know the answer? Thank you!
For anyone who wonders about the same question, the answer is: they don't share the stream.
Instead, a separate, duplicated stream is created for the other IntervalJoin.
I ran some tests that print the address of the records buffered inside the IntervalJoinOperator (a simplified version is sketched below).
For the A-C and B-C joins, the same C record, joined once with A and once with B, shows up at two different addresses.
If the stream were shared, the address of a record from stream C would be the same in both joins.
I think there are two reasons for this:
1. Whenever .intervalJoin is called on a KeyedStream, a new IntervalJoinOperator is created, and it contains its own buffer. New buffers are created every time, so sharing the stream would not make sense.
2. Also, the watermarks of streams A and B will differ. The watermark determines the retention period of the buffers in the IntervalJoinOperator, so sharing the buffer would not make sense here either.
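For reference, here is a simplified Java sketch of that kind of test (IdentityLoggingJoin is an invented name, and the generic parameters stand in for whatever record types A, B and C carry): a ProcessJoinFunction that logs the identity hash code of each right-side (C) record, so you can see that the A-C and B-C operators hold distinct copies of the same logical record.

import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.util.Collector;

// Passed to .process(...) of each interval join; joinName identifies which
// join ("A-C" or "B-C") produced the output line.
public class IdentityLoggingJoin<L, R> extends ProcessJoinFunction<L, R, String> {

    private final String joinName;

    public IdentityLoggingJoin(String joinName) {
        this.joinName = joinName;
    }

    @Override
    public void processElement(L left, R right, Context ctx, Collector<String> out) {
        // "right" is the C record. The same logical C record appears with a
        // different identity hash code in the A-C and B-C operators, because
        // each IntervalJoinOperator buffers its own copy of the stream.
        out.collect(joinName + ": C record " + right
                + " identity=" + System.identityHashCode(right));
    }
}

Used as .process(new IdentityLoggingJoin<>("A-C")) on the first join and .process(new IdentityLoggingJoin<>("B-C")) on the second, the printed identity hash codes differ for the same C element, matching the buffer-address observation above.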

Raft: What happens when the leader changes during an operation

I want to understand the following scenario:
3-node cluster (n1, n2, n3)
Assume n1 is the leader
A client sends an operation/request to n1
For some reason during the operation (e.g. AppendEntries, commit, ...), leadership changes to n2
But n1 has successfully written the entry to its log
Should this be considered a failure of the operation/request, or should it return "leader changed"?
This is an indeterminate result, because at that point we don't know whether the value will be committed. That is, the new leader may decide to keep or overwrite the entry; we just do not have the information to know.
In some systems that I have maintained, the lower-level system (the consensus layer) returns two success results for each request: first Accepted, when the leader puts the entry into its log, and then Committed, when the entry is sufficiently replicated.
The Accepted result means that the value may be committed later, but it also may not be.
The layers that wrap the consensus layer can be smarter about the value returned to the client. They have a broader view of the system and may retry the request and/or query the new leader about the value, so the system as a whole can present the customary single return value.
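A hedged Java sketch of that shape (ProposeResult, ConsensusLayer and ClientFacingService are invented names, not from any particular Raft library): the consensus layer distinguishes Accepted from Committed, and a thin wrapper above it converts the indeterminate outcomes into retries so the client sees a single result.

// Result of handing one command to the consensus layer.
enum ProposeResult {
    ACCEPTED,   // in the leader's log, but commitment still unknown
    COMMITTED,  // replicated to a majority; guaranteed to survive
    NOT_LEADER  // this node is not (or is no longer) the leader
}

interface ConsensusLayer {
    ProposeResult propose(byte[] command) throws Exception;
}

class ClientFacingService {
    private final ConsensusLayer consensus;

    ClientFacingService(ConsensusLayer consensus) {
        this.consensus = consensus;
    }

    // Turns the indeterminate ACCEPTED / NOT_LEADER outcomes into a retry
    // against the (possibly new) leader, so the caller sees one final answer.
    // The command must be idempotent or deduplicated, because an ACCEPTED
    // entry may still be committed later by the new leader.
    boolean submit(byte[] command, int maxRetries) throws Exception {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (consensus.propose(command) == ProposeResult.COMMITTED) {
                return true;
            }
        }
        return false;  // still indeterminate after maxRetries
    }
}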

How to determine Kafka topic partition offset if we haven't consumed any messages yet

librdkafka contains the function rd_kafka_position which fetches the current offsets for the given topic-partitions. But the comment says:
The \p offset field of each requested partition will be set to the offset
of the last consumed message + 1, or RD_KAFKA_OFFSET_INVALID in case there was
no previous message.
In other words, it won't give you any useful information if no messages have been consumed yet.
I'm interested in the case where I've just subscribed to a topic, and I've already called rd_kafka_seek to either:
seek to a known position (in the case of error recovery), or
seek to the very end of the partition.
What I'd like to know, in this context, is what the offset would be for the next message if one were to be consumed. In other words, in the first case, it should be the same offset that was passed to rd_kafka_seek, and in the second case, it should be 1 plus the offset of the last message that was in the partition when rd_kafka_seek was called.
Unfortunately, just like the comment says, rd_kafka_position doesn't return this information. If no messages have been consumed yet, it gives -1001 (RD_KAFKA_OFFSET_INVALID). If I consume a message and then call rd_kafka_position, it gives the correct offset.
Is there some other function that I can call in order to get the offset before consuming any messages?
I'm not sure what you are after. An "offset" is consumer-specific in most cases (except for the two cases I mention below). It tracks the read progress of each specific consumer for each topic/partition, and if that consumer has not read anything yet, there is no consumer-specific offset for that topic/partition. So asking for this consumer's offset in that case does not make sense: the consumer has not read anything, so there is no offset associated with it, and it could be started from any offset you wish.
The two main cases where consumer-unrelated offsets are useful are:
1. when you know which offsets in a topic you want to start processing from, based on the time of the messages or on some custom error logging/reporting in your application, or
2. when you want to start from either the EARLIEST or the LATEST available offset in the topic.
If you know which position in a partition you want a consumer to start reading from, you just seek to that position and let your consumer start consuming messages from there. You can then track the progress of this consumer by asking what offset it is at at any point in time.
And if you want to start from either the earliest or the latest position, you can find out what that position is (using KafkaAdminClient.listOffsets(), for example, available since version 2.5.x; that is Java, and I don't know the equivalent method in Python), then seek to that position and start your consumer from it (sketched below).
So, in brief: you can only expect to get a correct offset for a consumer if it has already read something from the topic; otherwise, the only consumer-unrelated meaningful information is the earliest offset, the latest offset, or some specific (known) offset determined by you.
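As an illustration, here is a minimal Java sketch of that lookup using AdminClient.listOffsets (available since Kafka 2.5; the topic name and bootstrap server are placeholders). The "latest" offset it returns is the offset the next produced message would get, which is also where a consumer that seeks to the end would continue reading.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class OffsetLookup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic

        try (AdminClient admin = AdminClient.create(props)) {
            // OffsetSpec.earliest() would give the first available offset instead.
            ListOffsetsResult result = admin.listOffsets(Map.of(tp, OffsetSpec.latest()));
            long next = result.partitionResult(tp).get().offset();
            System.out.println("next offset for " + tp + " = " + next);
        }
    }
}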
committed() (C++ API) or rd_kafka_committed() (C API) retrieves the committed offsets of every topic/partition. Some discussion here:
https://github.com/edenhill/librdkafka/issues/1964

The purpose of using Q-Learning algorithm

What is the point of using Q-learning? I have used example code that represents a 2D board with a pawn moving on it. At the right end of the board there is a goal we want to reach. After the algorithm completes, I have a Q-table with a value assigned to every state-action pair. Is it all about obtaining this Q-table so you can see which state-action pairs (i.e. which actions are best in particular states) are the most useful? That's how I understand it right now. Am I right?
Is it all about obtaining this Q-table so you can see which state-action pairs (i.e. which actions are best in particular states) are the most useful?
Yep! That's pretty much it. Given a finite state space (and enough exploration), Q-learning is guaranteed to eventually learn the optimal policy. Once an optimal policy is reached (also known as convergence), every time the agent is in a given state s, it looks in its Q-table for the action a with the highest Q-value for that (s, a) pair.
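To make that concrete, here is a toy Java sketch (not the question's board example; sizes and constants are arbitrary) of the two things the Q-table is for: the update rule that fills it in during training, and the greedy lookup that reads the learned policy back out.

public class QTableDemo {
    static final int STATES = 16;    // e.g. a small 4x4 board (placeholder)
    static final int ACTIONS = 4;    // up, down, left, right

    static final double ALPHA = 0.1; // learning rate
    static final double GAMMA = 0.9; // discount factor

    static final double[][] q = new double[STATES][ACTIONS];

    // One Q-learning update:
    // Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
    static void update(int s, int a, double reward, int sNext) {
        double bestNext = q[sNext][0];
        for (double v : q[sNext]) bestNext = Math.max(bestNext, v);
        q[s][a] += ALPHA * (reward + GAMMA * bestNext - q[s][a]);
    }

    // Reading the policy out of the table: in state s, take the action with
    // the highest Q-value.
    static int greedyAction(int s) {
        int best = 0;
        for (int a = 1; a < ACTIONS; a++) {
            if (q[s][a] > q[s][best]) best = a;
        }
        return best;
    }
}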

Paxos question: what happens if a proposer goes down?

I'm going through the kinds of scenarios in which basic Paxos reaches agreement on a final result. There's one case whose outcome I can't explain.
There are two proposers P1 and P2, and three acceptors A1, A2, A3. P1 proposes value u, P2 proposes value v.
1. P1 (sending id n) finishes the prepare step and receives promises from all of A1, A2, A3, so A1, A2 and A3 all store n as the highest promised id.
2. P2 (sending id n+1) prepares, so A1, A2 and A3 now store n+1 as the highest promised id.
3. P2 goes down.
4. P1 sends an accept request with (n, u) to A1, A2, A3; of course A1, A2, A3 refuse the request, and unfortunately P2 is already down.
In such a proposer-down case, what do we do next? Another round of Paxos?
Do a new Paxos round; this is exactly what it is for.
Note that a Prepare message carries only a proposal number, not a value; it is the acceptors that send back any value they have already accepted. In your scenario P2 went down before its accept phase, so no acceptor has accepted v, and in the next round P1 will receive empty promises and is free to propose its own value u under a higher number.
I reviewed all my notes and material from when I studied this in class a few years ago. A correct Paxos implementation must be fault tolerant and must never get permanently stuck, on both sides, proposers and acceptors. Since the question is about fault tolerance for proposers, I'll focus on them.
One solution (though not the only one) is to replicate the proposers: run several instances of each proposer, with one of them acting as the leader/master, chosen at the beginning, and being the one that sends the proposals. If the master fails, another instance, chosen either in a new election or by a priority established at initialization, steps up as the new master and takes its place.
In this case you could have 3 instances of P2: P2-1, P2-2, P2-3, with P2-1 being the leader by default. If P2-1 fails, then P2-2 can step up.
Keep in mind that the acceptors may ask P2 for an acknowledgement while P2-2 is still in the middle of stepping up as the new leader, so it is probably a good idea to retry after a timeout, to give P2-2 enough time to be ready.
"Paxos Made Simple" Lamport
"It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen. Proposer p completes phase 1 for a proposal number n1. Another proposer q then completes phase 1 for a proposal number n2 > n1. Proposer p’s phase 2 accept requests for a proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2. So, proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored. And so on."
Regarding step 4 of your description: A1, A2, A3 would reject P1's accept request and report the higher id (n+1) back to P1; P1 is thereby notified and increases its id to n+2, then sends a new prepare request with n+2 to A1, A2, A3. To avoid livelock between P1 and P2, a better approach is to use a single distinguished proposer (see section 2.4 of "Paxos Made Simple").
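As an aside, besides using a single distinguished proposer, one common practical mitigation is a randomized retry backoff. A toy Java sketch (RetryingProposer and runPrepareAndAccept are invented names, and the actual round logic is left abstract):

import java.util.concurrent.ThreadLocalRandom;

// Subclasses supply the real prepare/accept round; this class only shows the
// retry-with-backoff loop that makes dueling proposers unlikely to persist.
abstract class RetryingProposer {
    private long proposalNumber;

    RetryingProposer(long initialNumber) {
        this.proposalNumber = initialNumber;
    }

    // Run one full Paxos round with proposal number n; return true if a
    // majority accepted, false if some acceptor had promised a higher number.
    abstract boolean runPrepareAndAccept(long n, Object value) throws Exception;

    void propose(Object value) throws Exception {
        while (!runPrepareAndAccept(proposalNumber, value)) {
            // Rejected: bump our proposal number (in a real implementation,
            // above the highest number reported in the rejection) and back off
            // for a random interval so the two proposers stop invalidating
            // each other's rounds in lockstep.
            proposalNumber += ThreadLocalRandom.current().nextLong(1, 10);
            Thread.sleep(ThreadLocalRandom.current().nextLong(10, 100));
        }
    }
}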

Resources