Paxos question: if the proposer goes down, what happens?

I'm going through the kinds of scenarios in which the basic Paxos algorithm reaches agreement on a final result. There's one case whose outcome I can't explain.
There are two proposers P1 and P2, and three acceptors A1, A2, A3. P1 proposes value u, P2 proposes value v.
1. P1 (with id n) finishes the prepare phase and receives promises from all of A1, A2, A3; each acceptor stores n as its promised id.
2. P2 (with id n+1) does the same, so A1, A2, A3 now store n+1.
3. P2 goes down.
4. P1 sends an accept request with (n, u) to A1, A2, A3. Of course A1, A2, A3 refuse the request; unfortunately, by this time P2 is already down.
In such a proposer-down case, what do we do next? Another new round of Paxos?

Run a new Paxos round; this is exactly what it is for.
Acceptors report any value they have already accepted in their promise replies, so if P2's value had been accepted by anyone before P2 went down, P1 would learn it in the next round and propose it itself. (In your scenario P2 never reached the accept phase, so P1 is free to propose u again with a higher id.)
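To make the acceptor side concrete, here is a minimal single-node sketch (my own illustrative names, not from the question) of the two rules involved:
(Java)
// Minimal acceptor sketch; PrepareReply and the field names are illustrative.
class PrepareReply {
    final boolean ok;
    final long acceptedId;      // id of any previously accepted proposal, or -1
    final Object acceptedValue; // previously accepted value, or null
    PrepareReply(boolean ok, long acceptedId, Object acceptedValue) {
        this.ok = ok;
        this.acceptedId = acceptedId;
        this.acceptedValue = acceptedValue;
    }
}

class Acceptor {
    private long promisedId = -1;   // highest prepare id promised so far
    private long acceptedId = -1;   // id of the last accepted proposal, if any
    private Object acceptedValue;   // value of the last accepted proposal

    // Phase 1: promise to ignore proposals with an id at or below the one
    // already promised, and report any value already accepted so that a new
    // proposer adopts it.
    synchronized PrepareReply prepare(long id) {
        if (id <= promisedId) {
            return new PrepareReply(false, -1, null);
        }
        promisedId = id;
        return new PrepareReply(true, acceptedId, acceptedValue);
    }

    // Phase 2: this test is why P1's accept(n, u) is refused after the
    // acceptors promised n+1 to P2.
    synchronized boolean accept(long id, Object value) {
        if (id < promisedId) {
            return false;
        }
        promisedId = id;
        acceptedId = id;
        acceptedValue = value;
        return true;
    }
}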

I reviewed all my notes and material from the class where I studied this a few years ago. A correct Paxos implementation must be fault tolerant and keep running, on both sides, proposers and acceptors. Since the question is about fault tolerance for proposers, I'll focus on them.
One solution (but not the only one) to this issue is to replicate the proposers: run several instances of each proposer, with one of them chosen at the beginning as the leader/master, and that one is the only one that sends proposals. If the master fails, another instance, chosen either by a new election or by a priority established at initialization, steps up as the new master and takes its place.
In this case you could have 3 instances of P2: P2-1, P2-2, P2-3, with P2-1 being the leader by default; if P2-1 fails, then P2-2 can step up.
Keep in mind that the acceptors may ask P2 for an acknowledgement while P2-2 is still in the middle of stepping up as the new leader, so it is probably a good idea to retry after a timeout, to give P2-2 enough time to be ready.
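A minimal sketch of that priority-based step-up (all names here are hypothetical):
(Java)
import java.util.List;

// Instances are ordered by the priority established at initialization; the
// highest-priority live instance acts as the leader that sends proposals.
class ProposerGroup {
    private final List<ProposerInstance> instancesByPriority; // e.g. P2-1, P2-2, P2-3

    ProposerGroup(List<ProposerInstance> instancesByPriority) {
        this.instancesByPriority = instancesByPriority;
    }

    // Called whenever the current leader is suspected to be down.
    ProposerInstance currentLeader() {
        for (ProposerInstance candidate : instancesByPriority) {
            if (candidate.isAlive()) {
                return candidate; // first live instance steps up as new master
            }
        }
        throw new IllegalStateException("no live proposer instance");
    }
}

interface ProposerInstance {
    boolean isAlive();
}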

"Paxos Made Simple" Lamport
"It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen. Proposer p completes phase 1 for a proposal number n1. Another proposer q then completes phase 1 for a proposal number n2 > n1. Proposer p’s phase 2 accept requests for a proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2. So, proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored. And so on."
Applying this to step 4 of the question:
A1, A2, A3 reply to P1's accept request with the higher id n+1; P1 is thereby notified and increases its id to n+2, then sends a prepare request to A1, A2, A3 again with n+2. To avoid livelock between P1 and P2, the better way is to have only one distinguished proposer (see section 2.4 of "Paxos Made Simple").
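As a sketch of that retry loop (sendRound is a hypothetical helper standing in for one full prepare/accept round), with a randomized backoff as one common way to make the livelock unlikely when a single distinguished proposer is not used:
(Java)
import java.util.concurrent.ThreadLocalRandom;

abstract class RetryingProposer {
    // Runs prepare + accept with the given id against a majority of acceptors;
    // returns -1 if the value was chosen, else the higher id that was seen.
    abstract long sendRound(long id);

    long propose(long startId) throws InterruptedException {
        long id = startId;
        while (true) {
            long higherIdSeen = sendRound(id);
            if (higherIdSeen < 0) {
                return id; // proposal chosen
            }
            id = higherIdSeen + 1; // jump past the competing proposal number
            // Random backoff so two proposers stop leapfrogging each other;
            // Lamport's recommendation is a single distinguished proposer.
            Thread.sleep(ThreadLocalRandom.current().nextLong(50, 200));
        }
    }
}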

Related

Raft: What happens when the leader changes during an operation

I want to understand the following scenario:
A 3-node cluster (n1, n2, n3)
Assume n1 is the leader
A client sends an operation/request to n1
For some reason during the operation (i.e. appendEntry, commitEntry, ...) the leader changes to n2
But n1 has successfully written the entry to its log
Should this be considered a failure of the operation/request, or should it return "Leader Change"?
This is an indeterminate result, because at that point we don't know whether the value will be committed. That is, the new leader may decide to keep or overwrite the value; we just do not have the information to know.
In some systems that I have maintained, the lower-level system (the consensus layer) gives two success results for each request: it first returns Accepted when the leader puts the request into its log, and then Committed when the value is sufficiently replicated.
The Accepted result means that the value may be committed later, but it also may not be.
The layers that wrap the consensus layer can be a bit smarter about the return value to the client. They have a broader view of the system and may retry the requests and/or query the new leader about the value. That way the system as a whole can present the customary single return value.
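A rough sketch of that layering (the enum and class names are mine, not from any particular system):
(Java)
// The consensus layer reports two successes; a wrapper turns the
// indeterminate cases into one definitive answer for the client.
enum ReplicationStatus { ACCEPTED, COMMITTED, LEADER_CHANGED }

class ConsensusClient {
    ReplicationStatus submit(byte[] operation) {
        // Stub: a real implementation returns ACCEPTED once the leader has
        // logged the entry and COMMITTED once it is sufficiently replicated.
        return ReplicationStatus.COMMITTED;
    }
}

class RetryingClient {
    private final ConsensusClient consensus = new ConsensusClient();

    // ACCEPTED and LEADER_CHANGED are indeterminate: retry (the operation
    // must be idempotent) or query the new leader until the outcome is known.
    boolean submitAndConfirm(byte[] operation, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (consensus.submit(operation) == ReplicationStatus.COMMITTED) {
                return true;
            }
        }
        return false; // outcome still unknown; surface the ambiguity
    }
}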

Starvation of one of 2 streams in ConnectedStreams

Background
We have 2 streams, let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one every minute).
Stream B receives a single element once every 2 weeks. It uses a flatMap function which receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
    out.collect(value);
}
The valuesList here contains the ~2 million b elements.
We connect those streams (A and B) using connect, key by some key and perform another flatMap on the connected stream:
streamA.connect(streamB).keyBy(AClass::someKey, BClass::someKey).flatMap(processConnectedStreams)
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
The Problem
What we see is starvation: even though there are a elements ready to be processed, they are not processed in processConnectedStreams.
Our attempts to solve the issue
We tried to throttle stream B to 10 elements per second by performing a Thread.sleep() every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
    totalSent++;
    out.collect(value);
    if (totalSent % 10 == 0) {
        Thread.sleep(1000);
    }
}
The processConnectedStreams function is simulated to take 1 second with another Thread.sleep(), and we have tried it with:
* Setting a parallelism of 10 for the whole pipeline - didn't work
* Setting a parallelism of 15 for the whole pipeline - did work
The question
We don't want to use all these resources, since stream B is activated very rarely, and for the stream A elements a high parallelism is overkill.
Is it possible to solve it without setting the parallelism to more than the number of b elements we send every second?
It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data; if that's really the case, then Flink will pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to processConnectedStreams can help, since that operation will then be reading from network buffers.
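For illustration, a hypothetical sketch reusing the types from the question (expandToBElements is an assumed name for the expanding flatMap):
(Java)
// An explicit rebalance() after the expanding flatMap forces a round-robin
// network shuffle, so downstream operators read from network buffers instead
// of being chained into the same task. (A keyBy, as in the question's own
// snippet, introduces a network boundary in the same way.)
DataStream<BElement> expanded = streamB.flatMap(expandToBElements).rebalance();
streamA.connect(expanded)
       .keyBy(AClass::someKey, BClass::someKey)
       .flatMap(processConnectedStreams);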

Weight assignment to define an objective function

I have a set of jobs with execution times (C1, C2, ..., Cn) and deadlines (D1, D2, ..., Dn). Each job completes its execution in some time, i.e., its
response time (R1, R2, ..., Rn). However, there is a possibility that not every job completes its execution before its deadline. So I define a variable called slack for each job, i.e., (S1, S2, ..., Sn). Slack is the difference between the deadline and the response time of a job, i.e.,
S1 = D1 - R1
S2 = D2 - R2, and so on.
I have a set of slacks [S1, S2, ..., Sn]. These slacks can be positive or negative, depending on the deadline and completion time of the tasks, i.e., on D and R.
The problem is that I need to define weights (W) for each job (or slack) such that a job with negative slack (i.e., R > D, a job that misses its deadline) gets more weight (W) than the jobs with positive slack, and based on these weights and slacks I need to define an objective function that can be used to maximize the slack.
The problem doesn't seem to be a difficult one; however, I couldn't find a solution. Some help in this regard is much appreciated.
Thanks
This can often be done easily with variable splitting:
splus(i) - smin(i) = d(i) - r(i)
splus(i) ≥ 0, smin(i) ≥ 0
If we then put these terms in the objective, so that we are minimizing:
sum(i, w1 * splus(i) + w2 * smin(i) )
this will work fine, and we don't need to add the complementarity condition splus(i)*smin(i) = 0: since both weights are positive, increasing splus(i) and smin(i) by the same amount only makes the objective worse, so at an optimum at least one of the two is zero and they recover exactly the positive and negative parts of the slack.
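Written out in full (my notation: $s^+_i$ = splus(i), $s^-_i$ = smin(i); choosing $w_2 > w_1 > 0$ gives deadline misses the larger weight, as the question asks):

\begin{aligned}
\min\; & \sum_i \left( w_1\, s^+_i + w_2\, s^-_i \right) \\
\text{s.t.}\; & s^+_i - s^-_i = D_i - R_i \quad \forall i \\
& s^+_i \ge 0, \quad s^-_i \ge 0
\end{aligned}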

Apache Flink complex analytics stream design & challenges

Problem statement:
Trying to evaluate Apache Flink for modelling advanced real-time, low-latency distributed analytics.
Use case abstract:
Provide complex analytics for instruments I1, I2, I3, etc., each having a product definition P1, P2, P3, configured with (dynamic) user parameters U1, U2, U3, and requiring streaming market data M1, M2, M3, ...
The instrument analytics functions (A1, A2) are complex in terms of computational complexity; some of them could take 300-400 ms, but they can be computed in parallel.
From the above, the market data stream is clearly much faster (<1 ms) than the analytics functions, and the calculations need to consume the latest consistent market data.
The next challenge is multiple dependent enrichment functions E1, E2, E3 (e.g. Risk/PnL) which combine streaming market data with instrument analytics results (e.g. price or yield).
The last challenge is consistency of the calculations: function A1 could be faster than A2, and we need a consistent all-instrument result for a given market input.
Calculation graph dependency examples (scale this to hundreds of instruments & 10-15 market data sources); the dependency flow is:
- M1 + M2 + P1 => A2
- M1 + P1 => A1
- A1 + A2 => E2
- A1 => E1
- E1 + E2 => Result numbers
Questions:
What is the correct design/model for these calculation data streams? Currently I use ConnectedStreams for (P1 + M1); another approach could be an iterative model feeding the same instrument static data back to itself.
I am facing an issue using just the latest market data events in calculations, as the analytics function (A1) is a lot slower than the market data (M1) stream.
Hence I need stale market data eviction for the next iteration, retaining entries for which no newer value is available (LRU-cache-like).
I need to synchronize/correlate the execution of functions of different time complexity, so that iteration 2 starts only when everything in iteration 1 has finished.
This is quite a broad question, and to answer it more precisely one would need a few more details.
Below are a few thoughts that I hope will point you in a good direction and help you to approach your use case:
Connected streams by key (a.keyBy(...).connect(b.keyBy(...))) are the most powerful join- or union-like primitive. Using a CoProcessFunction on a connected stream should give you the flexibility to correlate or join values as needed. You can, for example, store the events from one stream in state while waiting for a matching event to arrive from the other stream.
Always holding the latest data of one input is easily done by putting that value into the state of a CoFlatMapFunction or a CoProcessFunction: for each event from input 1, you store the event in state; for each event from stream 2, you look into the state to find the latest event from stream 1.
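A minimal sketch of that pattern (MarketData, Analytics, and Result are placeholder types standing in for your events):
(Java)
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Keyed two-input function: input 1 only refreshes the stored latest market
// data; input 2 joins each analytics event against that latest value.
public class LatestValueJoin
        extends RichCoFlatMapFunction<MarketData, Analytics, Result> {

    private transient ValueState<MarketData> latestMarketData;

    @Override
    public void open(Configuration parameters) {
        latestMarketData = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latestMarketData", MarketData.class));
    }

    @Override
    public void flatMap1(MarketData md, Collector<Result> out) throws Exception {
        latestMarketData.update(md); // input 1: just remember the latest event
    }

    @Override
    public void flatMap2(Analytics a, Collector<Result> out) throws Exception {
        MarketData md = latestMarketData.value(); // input 2: read latest of input 1
        if (md != null) {
            out.collect(new Result(a, md));
        }
    }
}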
To synchronize on time, you could actually look into using event time. Event time can also be "logical time", meaning just a version number, an iteration number, or anything else; you only need to make sure that the timestamps you assign and the watermarks you generate reflect that consistently.
If you then window by event time, you will get all the data of that version together, regardless of whether one operator is faster than others, or whether the events arrive via paths with different latency. That is the beauty of real event-time processing :-)
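For example, a sketch of using a version number as the event timestamp (Event, getVersion, and getKey are assumed placeholder names; the APIs are Flink's WatermarkStrategy and event-time windows):
(Java)
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Treat the iteration/version number as the event timestamp, so "event time"
// advances one unit per version.
DataStream<Event> versioned = events.assignTimestampsAndWatermarks(
        WatermarkStrategy.<Event>forMonotonousTimestamps()
                .withTimestampAssigner((event, previous) -> event.getVersion()));

// A tumbling window of one "time" unit then collects all data of one version,
// no matter how fast or slow the operators that produced it were.
versioned.keyBy(Event::getKey)
         .window(TumblingEventTimeWindows.of(Time.milliseconds(1)))
         .reduce((a, b) -> b); // e.g. keep the last value per key and version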

Why a Paxos acceptor must send back any value it has already accepted

I'm taking the MIT 6.824 class and I have a question about Paxos. When a proposer sends a prepare to an acceptor, the acceptor returns a prepare_ok with the n and v of the highest-numbered proposal it has accepted. I wonder why the acceptor needs to return n and v?
In a nutshell, the acceptor must return v because, if the value was already committed, the new proposer needs to know what it is. There is no global "is_committed" flag, so the proposer gathers all these values and their associated rounds from at least a majority of the acceptors. The proposer then sends the value with the highest round to all the acceptors.
As you can see, a proposer always finishes what another proposer has started whenever it receives a value from an acceptor. This is a little bit similar to many wait-free algorithms.
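Sketched in code (PrepareOk and the field names are illustrative):
(Java)
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// One prepare_ok reply from an acceptor.
class PrepareOk {
    final long acceptedRound;   // -1 if this acceptor has accepted nothing yet
    final Object acceptedValue; // null if nothing accepted yet
    PrepareOk(long acceptedRound, Object acceptedValue) {
        this.acceptedRound = acceptedRound;
        this.acceptedValue = acceptedValue;
    }
}

class Proposer {
    // Among a majority of replies, adopt the value with the highest accepted
    // round; only if no acceptor reports a value may we propose our own.
    Object chooseValue(List<PrepareOk> majorityReplies, Object ownValue) {
        Optional<PrepareOk> highest = majorityReplies.stream()
                .filter(r -> r.acceptedRound >= 0)
                .max(Comparator.comparingLong((PrepareOk r) -> r.acceptedRound));
        return highest.map(r -> r.acceptedValue).orElse(ownValue);
    }
}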
