On receiving a prepare message from a proposer, the acceptor responds with either a promise or a refusal.
If the proposer does not receive enough promises and times out, it should discard the promises/refusals received and start a new prepare round with a greater ballot number.
If the restart is only due to a timeout rather than a refusal, can we reuse the same ballot number?
Short answer: don't do it.
When discovering Paxos, Lamport assumed that messages could be dropped or duplicated. So at any point you could send the same message again, and the algorithm would deal with it. So technically you can re-use the same ballot number as long as the payload is exactly the same. Here are some off-the-cuff reasons why you shouldn't.
First, the algorithm says to use a higher ballot number. You have to really know what you are doing if you are going to change a distributed algorithm like this. Reasoning about distributed systems can be very, very hard. And even if you did know what you are doing, you have no way of knowing whether the maintainers who come after you know what they are doing.
Second, the base algorithm doesn't actually say anything about refusals/nacks; they are merely optimizations. (Remember, they could be dropped at any time.) So not receiving a response to a prepare should be treated the same as a refusal.
Third, there could be another proposer out there. If you decide to re-use the same ballot number, you are essentially giving up and letting the other one win. But if the other proposer is using the same algorithmic embellishment, it is also giving up. You are effectively choosing the leader beforehand.
Fourth, not receiving a quorum of responses means your system is in trouble: a network partition, or more than half of the hosts not responding quickly enough. These are important things to think about. How will sending the exact same message help?
In the end, re-using the ballot number doesn't buy you anything, but it does complicate things.
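For completeness, here is a minimal Python sketch of the retry rule the algorithm already prescribes: on timeout, throw away the responses you have and try again with a strictly higher ballot number. The helper names (send_prepare, acceptors) are hypothetical placeholders, not any particular library.

    import time

    def next_ballot(round_no, node_id, num_nodes):
        # Partition the ballot space so proposers never collide:
        # round r on node k gets ballot r * num_nodes + k.
        return round_no * num_nodes + node_id

    def run_prepare_phase(node_id, num_nodes, acceptors, send_prepare, timeout=1.0):
        round_no = 1
        while True:
            ballot = next_ballot(round_no, node_id, num_nodes)
            # send_prepare is a hypothetical fan-out RPC returning the promises
            # that arrived before the timeout.
            promises = send_prepare(acceptors, ballot, timeout)
            if len(promises) > len(acceptors) // 2:
                return ballot, promises      # quorum reached: move to the accept phase
            # Timed out or refused: discard the responses and retry higher.
            round_no += 1
            time.sleep(timeout)              # small back-off to reduce duelling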
Your previous proposal was most likely not accepted because its ballot number was smaller than those proposed by the other replicas. So if you insist on the old ballot number in your next proposal, there is a very high chance you will be rejected again.
Let's say there is a DynamoDB key with a value of 0, and there is a process that repeatedly reads from this key using eventually consistent reads. While these reads are occurring, a second process sets the value of that key to 1.
Is it possible for the read process to read a 0 after it first reads a 1? Is it possible in DynamoDB's eventual consistency model for a client to successfully read a key's fully up-to-date value, but then read a stale value on a subsequent request?
Eventually, the write will be fully propagated and the read process will only read 1 values, but I'm unsure if it's possible for the reads to go 'backward in time' while the propagation is occurring.
The property you are looking for is known as monotonic reads, see for example the definition in https://jepsen.io/consistency/models/monotonic-reads.
Obviously, DynamoDB's strongly consistent read (ConsistentRead=true) is also monotonic, but you rightly asked about DynamoDB's eventually consistent read mode.
@Charles in his response gave a link, https://www.youtube.com/watch?v=yvBR71D0nAQ&t=706s, to a nice official talk by Amazon on how eventually-consistent reads work. The talk explains that DynamoDB replicates written data to three copies, but a write completes when two out of three copies (including one designated as the "leader") have been updated. It is possible that the third copy will take some time (usually a very short time) to get updated.
The video goes on to explain that an eventually consistent read goes to one of the three replicas at random.
So in that short window where the third replica has old data, a request might randomly go to one of the updated nodes and return new data, and then another request slightly later might randomly go to the not-yet-updated replica and return old data. This means that the "monotonic read" guarantee is not provided.
To summarize, I believe that DynamoDB does not provide the monotonic read guarantee if you use eventually consistent reads. You can use strongly-consistent reads to get it, of course.
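If you do end up needing monotonic behaviour, the strongly consistent read mentioned above is a one-parameter change. A minimal boto3 sketch, assuming a hypothetical table named my-table with partition key pk:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Eventually consistent read (the default): may return stale data.
    eventual = dynamodb.get_item(
        TableName="my-table",               # hypothetical table name
        Key={"pk": {"S": "counter"}},
    )

    # Strongly consistent read: reflects all prior successful writes,
    # so repeated reads cannot go "backward in time".
    strong = dynamodb.get_item(
        TableName="my-table",
        Key={"pk": {"S": "counter"}},
        ConsistentRead=True,
    )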
Unfortunately I can't find an official document which claims this. It would also be nice to test this in practice, similar to how the paper http://www.aifb.kit.edu/images/1/17/How_soon_is_eventual.pdf tested whether Amazon S3 (not DynamoDB) guaranteed monotonic reads, and discovered, by actually observing monotonic-read violations, that it did not.
One of the implementation details which may make it hard to see these monotonic-read violations in practice is how Amazon handles requests from the same process (which you said is your case). When the same process sends several requests in sequence, it may (but also may not...) use the same HTTP connection to do so, and Amazon's internal load balancers may (but also may not) decide to send those requests to the same backend replica, despite the statement in the video that each request is sent to a random replica. If this happens, it may be hard to see monotonic-read violations in practice. But they may still happen if the load balancer changes its mind, or the client library opens another connection, and so on, so you still can't trust the monotonic read property to hold.
Yes it is possible. Requests are stateless so a second read from the same client is just as likely as any other request to see slightly stale data. If that’s an issue, choose strong consistency.
You will (probably) not ever get the old data after getting the new data.
First off, there's no warning in the docs about repeated reads returning stale data, just that a read after a write may return stale data.
Eventually Consistent Reads
When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.
But more importantly, every item in DDB is stored on three storage nodes. A write to DDB doesn't return a 200 - Success until that data is written to 2 of the 3 storage nodes. Thus, it's only if your read is serviced by the third node that you'd see stale data. Once that third node is updated, every node has the latest.
See Amazon DynamoDB Under the Hood
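To make the failure mode concrete, here is a toy Python simulation (not DynamoDB code) of the model described above: the write lands on two of three replicas immediately, the third lags, and each eventually consistent read picks a replica at random.

    import random

    # Toy model of the 3-replica scheme; the values and timings are made up.
    replicas = [0, 0, 0]      # every replica starts with the old value 0

    # The write returns success once 2 of 3 replicas hold the new value;
    # the third replica stays stale for a short window.
    replicas[0] = 1           # leader
    replicas[1] = 1           # second replica
    # replicas[2] is still 0 during the lag window

    def eventually_consistent_read():
        return random.choice(replicas)    # each read hits a random replica

    print([eventually_consistent_read() for _ in range(10)])
    # e.g. [1, 1, 0, 1, 0, ...]: a 0 can follow a 1, which is exactly
    # a monotonic-read violation during the propagation window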
EDIT
@Nadav's answer points out that it's at least theoretically possible; AWS certainly doesn't seem to guarantee monotonic reads. But I believe the reality depends on your application architecture.
Most languages, nodejs being an exception, will use persistent HTTP/HTTPS connections by default to the DDB request router, especially given how expensive it is to open a TLS connection. I suspect, though I can't find any documents confirming it, that there's at least some level of stickiness from the request router to a storage node. @Nadav discusses this possibility. But only AWS knows for sure, and they haven't said.
Assuming that belief is correct:
curl in a shell script loop - more likely to see the old data again
a loop in C# using a single connection - less likely
The other thing to consider is that in the normal course of things, the third storage node is "only milliseconds behind".
Ironically, if the request router truly picks a storage node at random, a non-persistent connection is then less likely to see old data again given the extra time it takes to establish the connection.
If you absolutely need monotonic reads, then you'd need to use strongly consistent reads.
Another option might be to stick DynamoDB Accelerator (DAX) in front of your DDB, especially if you're retrieving the key with GetItem(). As I read how it works, it does seem to imply monotonic reads, especially if you've written through DAX, though it does not come right out and say so. Even if you've written around DAX, reading from it should still be monotonic; it's just that there will be more latency until you start seeing the new data.
Reading Brewer's conjecture, it says that Partition Tolerance means nodes are not able to pass messages to other nodes in a cluster, not that a few nodes are down.
This idea seems to be strengthened by the definition of Availability which refers to only 'non failing' nodes being able to respond to requests.
Therefore, am I correct in understanding that Partition Tolerance has nothing to do with nodes failing and becoming unresponsive to requests? It only concerns itself with how the still-functioning nodes behave (are they consistent and available) when they are not able to talk to each other?
Thanks.
IMHO a (CAP) partition refers to any sort of reason that may prevent a node from receiving a reply within a reasonable amount of time. In practice you can have many potential causes, but when you are reasoning at a theoretical level it doesn't really matter which. Bear in mind that the CAP theorem is theory.
When a node does not receive an expected message, it cannot make any assumptions about why; maybe the other node died, or maybe it's just latency or some other reason. It doesn't know, nor does it care. What matters is that the node became isolated and that a request may be received during this "isolation". In such a case, what can it do? See the three possible scenarios illustrated in this answer.
I'm confused about using Paxos algorithm.
It seems that Paxos can be used in such a scenario: multiple servers (a cluster; I assume each server has all 3 roles: proposer, acceptor, learner) need to keep the same command sequence to achieve consistency and backup. I assume there are some clients sending commands to these servers (clients may send in parallel). Each command is dispatched to the multiple servers by one Paxos instance.
Different clients can send different commands to different proposers, right?
If so, one command from some client will start a Paxos instance. So,
may multiple Paxos instances run at the same time?
If so, when client-A sends an "A += 1" command to proposer-A and client-B sends a "B += 2" command to proposer-B at nearly the same time, I expect to see each server receive both commands, "A += 1" and "B += 2".
However,
Given 5 servers, say S1-S5, S1 sends command "A += 1" and S5 sends command "B += 1". S2 promises S1, but S3 and S4 promise S5, so finally S3, S4, S5 get "B += 1" while S1 and S2 get nothing, because the number of promises is not a majority. It seems like Paxos does not help at all: we don't get the expected "A += 1" and "B += 2" at all 5 servers?
So I guess in practical applications of Paxos, no parallel Paxos instances are allowed? If so, how do we avoid parallel Paxos instances? It seems we still need a centralized server to flag whether a Paxos instance is running if we allow multiple clients and multiple proposers.
Also, I have questions about the proposal number. I searched the internet and some claim the following is a solution:
With 5 servers, each given a corresponding index k (0-4), each server uses the number 5*i + k for its i-th proposal.
To me, this does not seem to meet the requirements at all, because server-1's first proposal number is always 1 and server-4's first proposal number is always 4; server-4 may raise its proposal earlier than server-1, yet its proposal number is bigger.
So I guess in practical applications of Paxos, no parallel Paxos instances are allowed? If so, how do we avoid parallel Paxos instances? It seems we still need a centralized server to flag whether a Paxos instance is running if we allow multiple clients and multiple proposers.
You don't need a centralised service; you only need nodes to redirect clients to the current leader. If they don't know who the leader is, they should return an error, and the client should select another node from DNS/config until it finds, or is told, the current leader. Clients only need a reasonably up-to-date list of which nodes are in the cluster so that they can contact a majority of the current nodes; then they will find the leader once it becomes stable.
If nodes get separated during a network partition, you may get lucky: a clean and stable partition leads to only one majority, which will elect a new stable leader. If you get an unstable partition, or some dropped directional packets, such that two or more nodes become "duelling leaders", then you can use randomised timeouts and adaptive back-off to minimise the chance of two nodes attempting to get elected at the same time, which leads to failed rounds and wasted messages. Clients will get wasted redirects, errors or timeouts, and will be scanning for the leader during a duel until it is resolved to a stable leader.
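A rough client-side sketch of that redirect loop, in Python. The send_to helper and its ("ok" / "redirect" / "unknown") responses are made-up stand-ins for whatever RPC layer you actually use:

    import random

    def send_via_leader(cluster_nodes, request, send_to, max_attempts=10):
        # cluster_nodes comes from DNS/config; send_to is a hypothetical RPC call
        # returning ("ok", result), ("redirect", leader_addr) or ("unknown", None).
        candidates = list(cluster_nodes)
        random.shuffle(candidates)
        target = candidates.pop()
        for _ in range(max_attempts):
            status, payload = send_to(target, request)
            if status == "ok":
                return payload
            if status == "redirect":          # the node told us who the leader is
                target = payload
            else:                             # node doesn't know: try another one
                if not candidates:
                    candidates = list(cluster_nodes)
                    random.shuffle(candidates)
                target = candidates.pop()
        raise TimeoutError("no stable leader found within max_attempts")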
Effectively, Paxos goes for CP out of CAP, so it can lose the A (availability) due to duelling leaders or no majority being able to communicate. Perhaps if this really were a high risk in practice, people would be writing about having nodes blacklist any node which repeatedly tries to lead but never gets around to committing, due to persistent unstable network partition issues. Pragmatically, one can imagine that the folks monitoring the system will get alerts, kill such a node, and fix the network, rather than try to add complex "works unattended no matter what" features to their protocol.
With respect to the numbering of proposals, a simple scheme is a 64-bit long with a 32-bit counter packed into the highest bits and the 32-bit IP address of the node packed into the lowest bits. That makes all numbers unique and interleaved, and only assumes you don't run multiple nodes on the same machine.
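A small sketch of that packing, assuming IPv4 addresses; the counter dominates the ordering and the IP address breaks ties uniquely:

    import ipaddress

    def proposal_number(counter, node_ip):
        # 32-bit counter in the high bits, 32-bit IPv4 address in the low bits.
        return (counter << 32) | int(ipaddress.IPv4Address(node_ip))

    def unpack(number):
        return number >> 32, str(ipaddress.IPv4Address(number & 0xFFFFFFFF))

    # A higher counter always wins; equal counters are ordered by node address.
    assert proposal_number(2, "10.0.0.1") > proposal_number(1, "10.0.0.9")
    assert proposal_number(1, "10.0.0.9") > proposal_number(1, "10.0.0.1")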
If you take a look at Multi-Paxos on Wikipedia, it's about optimising the flow for a stable leader. Failovers should be really rare, so once you get a leader it can gallop with accept messages only and skip proposals. When you are galloping, if you are bit-packing the leader identity into the event numbers, a node can see that subsequent accepts are from the same leader and are sequential for that leader. Galloping with a stable leader is a good optimisation; creating a new leader with proposals is expensive, requiring proposal messages and risking duelling leaders, so it should be avoided except at cluster startup or failover.
yet its proposal number is bigger.
That's exactly the point of partitioning the proposal space. The rule is that only the most recent proposal, the one with the highest number seen, shall be accepted. Thus, if three proposals were sent out, only the proposal with the largest number will ever be accepted by a majority.
If you do not do that, chances are that multiple parties continue to spit out proposals with simply incremented numbers (nextNumber = ++highestNumberSeen) like crazy and never come to a consensus.
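To see how this plays out with the 5*i + k scheme from the question, here is a tiny sketch; the point is that the per-server sequences never collide, so a rejected proposer simply jumps to its next number above the highest one it has seen:

    # The 5*i + k scheme from the question: server k's i-th proposal number.
    def nth_proposal(server_index, i, num_servers=5):
        return i * num_servers + server_index

    print([nth_proposal(1, i) for i in range(4)])   # server-1: [1, 6, 11, 16]
    print([nth_proposal(4, i) for i in range(4)])   # server-4: [4, 9, 14, 19]

    # If server-1's proposal 1 is rejected because an acceptor has already
    # promised 4, server-1 retries with its next number above 4 (here, 6),
    # rather than both servers fighting over the same incremented value.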
I have always worked with concurrency control, but recently I have been thinking about how non-determinism in the execution of transactions in a database can change the final result from the user's point of view.
Consider that two persons P1 and P2 both want to withdraw 50 euros from a bank account that has precisely 50 euros.
P1 requests the operation from an ATM at time 8:00
P2 requests the operation from an ATM at time 8:02
Both requests eventually arrive at the bank's database system, but due to non-deterministic factors (transaction ordering, OS thread scheduling, etc.) P2's request is executed first and the withdrawal succeeds, while P1's request fails because it was executed after P2's request and hence there was not enough money to withdraw.
We arrive at a situation where the person who requested the operation first ends up without the money. Are these concerns taken into account in real systems? I hear some people saying that these things are not important and the world will go on; the only concern is not to violate the consistency constraints (no money disappears, no money magically appears).
Nonetheless, I think that this time-request fairness is also important.
Thanks for your attention.
I decided to write an extended answer to refer to these considerations in the future.
Nonetheless, I think that this time-request fairness is also important.
Is victory really unfair?
In fact, the system is fair, because time-request fairness is guaranteed based on which request was written to the database first.
The only problem is that it is not explicit to users.
Solution: set clear rules of the game
Ultimately, this is less about the fairness of the ATM system in providing adequate service and more about the SLA it guarantees.
If a withdrawal transaction completes within a maximum timeout of 5 minutes, all a person has to do to guarantee victory is to be the first AND at least 5 minutes ahead of the other one in touching the ATM.
Otherwise, in overlapping time intervals, the winner is selected randomly (as it appears).
Both rules are fair if they are agreed upon.
DETAILS
Article "Race Conditions Don't Exist" review similar example in CQRS context.
Guaranteeing absolute fairness is sometimes prohibitively expensive or completely impossible.
Ultimately, what matters is whether the operational model is adequate for the specific domain.
Is it fair for the system to let users judge it based on their own assumptions?
We deal with multiple views, P1 and P2.
Both views show only the money left at the time of the balance request (fair). Neither view was given an explicit guarantee that the entire sum was exclusively locked for that specific view (fair).
What may get users upset is their own communication (an additional view) outside the system. For example, both persons see each other and know exactly who was the first to submit a withdrawal request at their own ATM. The system is not responsible for consistent views outside itself. Users are upset because they wrongly assume the system selects the winner based on who pressed the ATM button first.
How far can we go with technical guarantees?
The system could lock the entire balance the moment a user touches the ATM. There are lots of reasons why this is not practical. Most importantly, it may not satisfy those who consider it unfair to lock the balance without eventually withdrawing anything. And how do we guarantee that the lock is acquired first by the user who touched the ATM first?
The system could be designed to wait for candidate requests within a timeout adjusted to the maximum allowed delays in request propagation. Then it may have several candidates for victory, and it may use timestamps given by the source ATMs. However, even then there would be unfairness due to the finite precision of clock synchronization.
Effectively, first is never absolutely defined due to physics - see uncertainty principle ;)
I have an application that is receiving a high volume of data that I want to store in a database. My current strategy is to fire off an asynchronous call (BeginExecuteNonQuery) with each record when it's ready. I'm using the asynchronous call to ensure that the rest of the application runs smoothly.
The problem I have is that as the volume of data increases, eventually I get to the point where I'm trying to fire a command down the connection while it's still in use. I can see two possible options:
Buffer the pending data myself until the existing command is finished.
Open multiple connections as needed.
I'm not sure which of these options is best, or if in fact there is a better way. Option 1 will probably lead to my buffer getting bigger and bigger, while option 2 may be very bad form - I just don't know.
Any help would be appreciated.
Depending on your locking strategy, it may be worth using several connections, but certainly not a number "without upper bounds". So a good strategy/pattern to use here is a "thread pool", with each of N dedicated threads holding a connection and picking up write requests as they come in, each thread taking a new request as soon as it finishes the previous one. The number of threads in the pool for best performance is best determined empirically, by benchmarking various possibilities in a realistic experimental/prototype setting.
If the "buffer" queue (in which your main thread queues write requests and the dedicated threads in the pool pick them up) grows beyond a certain threshold, it means you're getting data faster than you can possibly write it out, so, unless you can get more resources, you'll simply have to drop some of the incoming data, maybe with a random-sampling strategy to avoid biasing future statistical analysis. Just count how much you're writing and how much you're having to drop due to the resource shortage in each period of time (say, every minute or so), so you can use "stratified sampling" techniques in future data-mining explorations.
Thanks Alex - so you'd suggest a hybrid method then, assuming that I'll still need to buffer updates if all connections are in use?
(I'm the original poster, I've just managed to get two accounts without realizing)