I'm confused about using the Paxos algorithm.
It seems Paxos can be used in this scenario: multiple servers (a cluster; I assume each server has all 3 roles: proposer, acceptor, learner) need to keep the same command sequence to achieve consistency and redundancy. I assume there are some clients sending commands to these servers (clients may send in parallel). Each command is distributed to all servers by one Paxos instance.
Different clients can send different commands to different proposers, right?
If so, one command from some client will start a Paxos instance. So,
Multiple Paxos instances may run at the same time?
If so, client-A sends an "A += 1" command to proposer-A and client-B sends a "B += 2" command to proposer-B at nearly the same time, and I expect every server to eventually receive both commands, "A += 1" and "B += 2".
However,
Given 5 servers, say S1-S5: S1 proposes "A += 1" and S5 proposes "B += 2". S2 promises S1, but S3 and S4 promise S5, so in the end S3, S4, S5 get "B += 2" while S1 and S2 get nothing, because S1's number of promises is not a majority. It seems like Paxos does not help at all; we don't get the expected "A += 1" and "B += 2" on all 5 servers?
So I guess that in practical applications of Paxos, no parallel Paxos instances are allowed? If so, how do we avoid parallel Paxos instances? It seems that we would still need a centralized server to flag whether a Paxos instance is running, if we allow multiple clients and multiple proposers.
Also, I have a question about the proposal number. I searched the internet and some claim the following is a solution:
With 5 servers, each given a corresponding index k (0-4), each server uses the number 5*i + k for that server's i-th proposal.
To me, this does not seem to meet the requirements at all, because server-1's first proposal number is always 1 and server-4's first proposal number is always 4, yet server-4 may raise its proposal earlier than server-1, however its proposal number is bigger.
So I guess that in practical applications of Paxos, no parallel Paxos instances are allowed? If so, how do we avoid parallel Paxos instances? It seems that we would still need a centralized server to flag whether a Paxos instance is running, if we allow multiple clients and multiple proposers.
You don't need a centralised service; you only need nodes to redirect clients to the current leader. If a node doesn't know who the leader is, it should return an error and the client should select another node from DNS/config until it finds, or is told, the current leader. Clients only need a reasonably up-to-date list of which nodes are in the cluster so that they can contact a majority of current nodes; then they will find the leader once it becomes stable.
If nodes get separated during a network partition, you may get lucky: a clean and stable partition leads to only one majority, and it will elect a new stable leader. If you get an unstable partition, or packets dropped in one direction, such that two or more nodes become "duelling leaders", then you can use randomised timeouts and adaptive back-off to minimise the chance of two nodes attempting to get elected at the same time, which leads to failed rounds and wasted messages. Clients will see wasted redirects, errors or timeouts and will keep scanning for the leader during a duel until it resolves to a stable leader.
Effectively Paxos goes for CP out of CAP, so it can lose the A (availability) due to duelling leaders or no majority being able to communicate. If this were really a high risk in practice, people would presumably be writing about having nodes blacklist any node which repeatedly tries to lead but never gets around to committing because of a persistently unstable network partition. Pragmatically, one can imagine that the folks monitoring the system will get alerts, kill such a node and fix the network, rather than trying to add complex "works unattended no matter what" features to their protocol.
With respect to the proposal numbers, a simple scheme is a 64-bit long with a 32-bit counter packed into the highest bits and the 32-bit IP address of the node packed into the lowest bits. That makes all numbers unique and interleaved, and only assumes you don't run multiple nodes on the same machine.
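A quick sketch of that packing (Python here purely for illustration; the function and variable names are made up):

import socket
import struct

def make_proposal_number(counter, node_ip):
    """Pack a 32-bit counter into the high bits and the node's IPv4
    address into the low bits of a 64-bit proposal number."""
    ip_as_int = struct.unpack("!I", socket.inet_aton(node_ip))[0]
    return (counter << 32) | ip_as_int

# Two nodes issuing their first proposal get distinct, comparable numbers:
n1 = make_proposal_number(1, "10.0.0.1")
n2 = make_proposal_number(1, "10.0.0.2")
assert n1 != n2                                    # unique even for the same counter
assert make_proposal_number(2, "10.0.0.1") > n2    # a higher counter always wins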
If you take a look at Multi-Paxos on Wikipedia, it is about optimising the flow for a stable leader. Failovers should be really rare, so once you have a leader it can gallop with accept messages only and skip the proposal phase. While galloping, if you are bit-packing the leader's identity into the event numbers, a node can see that subsequent accepts come from the same leader and are sequential for that leader. Galloping with a stable leader is a good optimisation; electing a new leader with proposals is expensive, requiring proposal messages and risking duelling leaders, so it should be avoided unless it is cluster startup or failover.
however its proposal number is bigger.
That's exactly the point of partitioning the proposal space. The rule is that only the most recent proposal, the one with the highest number seen, shall be accepted. Thus, if three proposals were sent out, only the one with the largest number will ever get accepted by a majority.
If you do not do that, chances are that multiple parties keep spitting out proposals with simply incremented numbers (nextNumber = ++highestNumberSeen) like crazy and never reach consensus.
Let's say we have an inventory system that tracks the available number of products in a shop (quantity). So we can have something similar to this:
Id | Name   | Quantity
1  | Laptop | 10
We need to think about two things here:
Be sure that Quantity is never negative
If we have simultaneous requests for a product we must ensure valid Quantity.
In other words, we can have:
request1 for 5 laptops (this request will be processed on thread1)
request2 for 1 laptop (this request will be processed on thread2)
When both requests are processed, the database should contain
Id | Name   | Quantity
1  | Laptop | 4
However, that might not be the case, depending on how we write our code.
If on our server we have something similar to this:
var product = _database.GetProduct();
if (product.Quantity - requestedQuantity >= 0)
{
product.Quantity -= requestedQuantity;
_database.Save();
}
With this code, it's possible that both requests (that are executed on separate threads) would hit the first line of the code at the exact same time.
thread1: _database.GetProduct(); // Quantity is 10
thread2: _database.GetProduct(); // Quantity is 10
thread1: product.Quantity = 10 - 5 = 5
thread2: product.Quantity = 10 - 1 = 9
thread1: _database.Save(); // Quantity is 5
thread2: _database.Save(); // Quantity is 9
What has just happened? We have sold 6 laptops, but we reduced the inventory by only 1.
How to approach this problem?
To ensure the quantity is never negative we can use a DB constraint (to imitate an unsigned int).
To deal with the race condition we usually use a lock or similar techniques.
And depending on the case that might work, if we have one instance of the server... But what should we do when we have multiple instances of the server, each running in a multithreaded environment?
It seems to me that the moment you have more than one web server, your only reasonable option for locking is the database. Why do I say reasonable? Because we have Mutex.
A lock allows only one thread to enter the part that's locked and the lock is not shared with any other processes.
A mutex is the same as a lock but it can be system-wide (shared by multiple processes).
Now... this is my personal opinion, but I expect that managing a Mutex between a few processes in a microservice-oriented world, where a new instance of the server can spin up each second or an existing instance can die each second, is tricky and messy (is there a GitHub example of this?).
How to solve the problem then?
Stored procedure - offload the responsibility to the database. Write a new stored procedure and wrap the whole logic in a transaction. Each of the servers will call this SP and we don't need to worry about anything. But this might be slow?
SELECT ... FOR UPDATE - I saw this while I was investigating the problem. With this approach, we still try to solve the problem at the database level (a rough sketch is below).
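Roughly, the idea as I understand it (sketched with Python and a psycopg2-style connection just for illustration, since we use PostgreSQL; the table and column names are made up):

def reserve_laptops(conn, product_id, requested_quantity):
    """Pessimistic variant: lock the row for the duration of the transaction,
    so concurrent requests queue up instead of racing."""
    with conn:                      # psycopg2: commit on success, roll back on error
        cur = conn.cursor()
        cur.execute(
            "SELECT quantity FROM products WHERE id = %s FOR UPDATE",
            (product_id,),
        )
        (quantity,) = cur.fetchone()
        if quantity < requested_quantity:
            raise ValueError("not enough stock")
        cur.execute(
            "UPDATE products SET quantity = quantity - %s WHERE id = %s",
            (requested_quantity, product_id),
        )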
Taking into account all of the above, what should be the best approach to solve this problem? Is there any other solution I am missing? What would you suggest?
I am working in .NET and using EF Core with PostgreSQL, but I think that this is really a language-agnostic question and that principle for solving the issue is similar in all environments (and similar for many relational databases).
After reading the majority of the comments let's assume that you need a solution for a relational database.
The main thing that you need to guarantee is that the write operation at the end of your code only happens if the precondition is still valid (e.g. product.Quantity - requestedQuantity >= 0).
This precondition is evaluated on the application side, in memory. But the application only sees a snapshot of the data at the moment the database read happened: _database.GetProduct(); this can become stale as soon as someone else updates the same data. If you want to avoid using SERIALIZABLE as the transaction isolation level (which has performance implications anyway), the application should detect at the moment of writing whether the precondition is still valid, or, said differently, whether the data is unchanged since it was read.
This can be done by using offline concurrency patterns: either an optimistic offline lock or a pessimistic offline lock. Many ORM frameworks support these features by default.
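As an illustration of the optimistic variant (a minimal sketch, not EF Core specific; it assumes a hypothetical version column on the products table and a psycopg2-style connection):

def reserve_laptops_optimistic(conn, product_id, requested_quantity, max_attempts=3):
    """Optimistic variant: read without locking, then make the UPDATE
    conditional on the version we read. Zero affected rows means someone
    else changed the row in between, so we re-read and try again."""
    for _ in range(max_attempts):
        cur = conn.cursor()
        cur.execute(
            "SELECT quantity, version FROM products WHERE id = %s",
            (product_id,),
        )
        quantity, version = cur.fetchone()
        if quantity < requested_quantity:
            conn.rollback()
            raise ValueError("not enough stock")
        cur.execute(
            """UPDATE products
                  SET quantity = quantity - %s, version = version + 1
                WHERE id = %s AND version = %s""",
            (requested_quantity, product_id, version),
        )
        if cur.rowcount == 1:
            conn.commit()
            return
        conn.rollback()             # lost the race, retry with fresh data
    raise RuntimeError("could not reserve stock after several attempts")

EF Core's concurrency tokens give you essentially the same conditional update without hand-writing the SQL.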
I have a bit of a strange problem. I have a module running on GAE that puts a whole lot of little tasks on the default task queue. The tasks access the same ndb models. Each task reads a bunch of data from a few different tables and then calls put.
The first few tasks work fine, but as time goes on I start getting these on the final put:
suspended generator _put_tasklet(context.py:358) raised TransactionFailedError(too much contention on these datastore entities. please try again.)
So I wrapped the put in a try with a randomised timeout so it retries a couple of times. This mitigated the problem a little; it just happens later on.
Here is some pseudocode for my task:
import logging
import random
import time

def my_task(request):
    stuff = get_ndb_instances()      # this reads a few things from different tables
    better_stuff = process(stuff)    # pretty much just a summation
    try_put(better_stuff)
    return {'status': 'Groovy'}

def try_put(oInstance, iCountdown=10):
    if iCountdown < 1:
        return oInstance.put()       # out of retries: let any exception propagate
    try:
        return oInstance.put()
    except Exception:
        logging.info("sleeping")
        time.sleep(random.random() * 20)
        return try_put(oInstance, iCountdown - 1)
Without using try_put the queue gets about 30% of the way through before it stops working. With try_put it gets further, to about 60%.
Could it be that a task is holding onto ndb connections after it has completed somehow? I'm not making explicit use of transactions.
EDIT:
There seems to be some confusion about what I'm asking. The question is: why does ndb contention get worse as time goes on? I have a whole lot of tasks running simultaneously, and they access ndb in a way that can cause contention. If contention is detected, a randomly timed retry happens, and this eliminates the contention perfectly well, for a little while. Tasks keep running and completing, and the more that successfully return, the more contention happens, even though the processes using the contended data should be finished. Is something holding onto datastore handles that shouldn't be? What's going on?
EDIT2:
Here is a little bit about the key structures in play:
My ndb models sit in a hierarchy like this (the direction of the arrows specifies parent-child relationships, i.e. a Type has a bunch of child Instances, etc.):
Type->Instance->Position
The ids of the Positions are limited to a few different names; there are many thousands of Instances and not many Types.
I calculate a bunch of Positions and then do a try_put_multi (similar to try_put in an obvious way) and get contention. I'm going to run the code again pretty soon and get a full traceback to include here.
Contention will get worse over time if you continually exceed 1 write/transaction per entity group per second. The answer lies in how Megastore/Paxos work and how Cloud Datastore handles contention in the backend.
When 2 writes are attempted at the same time on different nodes in Megastore, one transaction will win and the other will fail. Cloud Datastore detects this contention and will retry the failed transaction several times. Usually this results in the transaction succeeding without any errors being raised to the client.
If sustained writes above the recommended limit are being attempted, the chance that a transaction needs to be retried multiple times increases. The number of transactions in an internal retry state also increases. Eventually, transactions will start reaching our internal retry limit and will return a contention error to the client.
A randomized sleep is not the right way to handle these error responses. You should instead look into exponential back-off with jitter (example).
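A minimal sketch of that retry shape (illustrative only; oInstance.put() stands in for whatever call is hitting contention, and ideally you would catch only the datastore's contention error rather than a bare Exception):

import random
import time

def put_with_backoff(oInstance, max_attempts=6, base_delay=0.1, max_delay=10.0):
    """Retry with exponential back-off plus full jitter: the sleep is a
    random value between 0 and an exponentially growing cap."""
    for attempt in range(max_attempts):
        try:
            return oInstance.put()
        except Exception:               # narrow this to the contention error in real code
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))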
More fundamentally, the core of your problem is a high write rate into a single entity group. You should look into whether the explicit parenting is required (and remove it if not), or whether you should shard the entity group in some manner that makes sense for your queries and consistency requirements.
I want to make a "TRAP AGENT" library. The trap-agent library keeps track of various parameters of the client system. If a parameter of the client system crosses its threshold, the trap-agent library on the client side notifies the server about that parameter. For example, if CPU usage exceeds the threshold, it notifies the server that CPU usage has been exceeded. I have to measure 50-100 parameters (like memory usage, network usage, etc.) on the client side.
Now I have the basic idea, but I am stuck on the overall library design.
I have thought of the solutions below:
I can create a thread for each parameter (i.e. each thread monitors a single parameter).
I can create a process for each parameter (i.e. each process monitors a single parameter).
I can classify the parameters into groups, e.g. the data-usage parameter falls into the network group and the CPU and memory usage parameters fall into the system group, and then create a thread for each group.
Now the 1st solution looks better than the 2nd. But if I adopt the 1st solution, it may fail when I want to scale the library from 100 to 1000 parameters, because I would have to create 1000 threads, which is not a good design (I think so; if I am wrong, correct me).
The 3rd solution is good, but response time will be high since many parameters are monitored in a single thread.
Is there any better approach?
In general, it's a bad idea to spawn threads 1-to-1 for any logical mapping in your code. You can quickly exhaust the available threads of the system.
In .NET this is very elegantly handled using thread pools:
Thread vs ThreadPool
Here is a C++ discussion, but the concept is the same:
Thread pooling in C++11
Processes are also high overhead on Windows. Both designs sound like they would ironically be quite taxing on the very resources you are trying to monitor.
Threads (and processes) give you parallelism where you need it. For example, letting the GUI be responsive while some background task is running. But if you are just monitoring in the background and reporting to a server, why require so much parallelism?
You could just run each check, one after the other, in a tight event loop in one single thread. If you are worried about not sampling the values as often, I'd say that's actually a benefit: it does not help to consume 50% of the CPU to monitor your CPU. If you are spot-checking values once every few seconds, that is probably fine resolution.
In fact, high resolution is of no help if you are reporting to a server. You don't want to denial-of-service your own server by making an HTTP call to it multiple times a second whenever some value triggers.
NOTE: this doesn't mean you can't have a pluggable architecture. You could create some base class that represents checking a resource and then create subclasses for each specific type. Your event loop could iterate over an array or list of objects, calling each one successively and aggregating the results. At the end of the loop you report back to the server if any are out of range.
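A rough sketch of that shape (Python just for illustration; the class, threshold and report_to_server names are made up):

import time

class Check:
    """Base class: one subclass per monitored parameter."""
    name = "base"
    threshold = None

    def sample(self):
        raise NotImplementedError

    def is_out_of_range(self, value):
        return self.threshold is not None and value > self.threshold

class LoadAverageCheck(Check):
    name = "load_average"
    threshold = 4.0

    def sample(self):
        import os
        return os.getloadavg()[0]       # 1-minute load average (Unix only)

def monitor_loop(checks, report_to_server, interval_seconds=5):
    """Single-threaded event loop: run every check in turn, aggregate the
    out-of-range ones, and report at most once per pass."""
    while True:
        alerts = []
        for check in checks:
            value = check.sample()
            if check.is_out_of_range(value):
                alerts.append((check.name, value))
        if alerts:
            report_to_server(alerts)    # one call per pass, not one per parameter
        time.sleep(interval_seconds)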
You may want to add logic to stop checking (or at least stop reporting back to the server) for some "cool down period" once a trap hits. You don't want to tax your server or spam your logs.
You can follow the methodology below:
1. You can have two threads: one dedicated to measuring the emergency parameters and a second that monitors the non-emergency parameters. Hence the response time for the emergency parameters will be lower.
2. You can define 3 threads: the first monitors the high-priority (emergency) parameters, the second monitors the intermediate-priority parameters, and the last monitors the lowest-priority parameters. So the overall response time is improved compared to the first solution.
3. If response time is not a concern, you can monitor all the parameters in a single thread. But in this case the response time becomes worst when you upgrade your library to monitor 100 to 1000 parameters.
So in the 1st case there is a longer response time for the non-emergency parameters, while in the 3rd case the response time is definitely very high.
So solution 2 is better.
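A sketch of solution 2 (illustrative only; the check lists here are placeholders for real parameter checks):

import threading
import time

def poll_group(checks, interval_seconds):
    """One thread per priority group, each with its own polling rate."""
    while True:
        for check in checks:
            check()                     # sample, compare to threshold, notify if needed
        time.sleep(interval_seconds)

# Placeholder check functions; real ones would sample CPU, memory, network, etc.
emergency_checks = [lambda: None]
intermediate_checks = [lambda: None]
low_priority_checks = [lambda: None]

# Emergency parameters are polled most often, low-priority ones least often.
for checks, interval in ((emergency_checks, 1),
                         (intermediate_checks, 5),
                         (low_priority_checks, 30)):
    threading.Thread(target=poll_group, args=(checks, interval), daemon=True).start()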
I have an application that is receiving a high volume of data that I want to store in a database. My current strategy is to fire off an asynchronous call (BeginExecuteNonQuery) with each record when it's ready. I'm using the asynchronous call to ensure that the rest of the application runs smoothly.
The problem I have is that as the volume of data increases, eventually I get to the point where I'm trying to fire a command down the connection while it's still in use. I can see two possible options:
Buffer the pending data myself until the existing command is finished.
Open multiple connections as needed.
I'm not sure which of these options is best, or if in fact there is a better way. Option 1 will probably lead to my buffer getting bigger and bigger, while option 2 may be very bad form - I just don't know.
Any help would be appreciated.
Depending on your locking strategy, it may be worth using several connections, but certainly not a number "without upper bounds". A good strategy/pattern to use here is a thread pool, with each of N dedicated threads holding a connection and picking up write requests as they come in and as the thread finishes its previous one. The number of threads in the pool for best performance is best determined empirically, by benchmarking various possibilities in a realistic experimental/prototype setting.
If the "buffer" queue (in which your main thread queues write requests and the dedicated threads in the pool picks them up) grows beyond a certain threshold, it means you're getting data faster than you can possibly write it out, so, unless you can get more resources, you'll simply have to drop some of the incoming data -- maybe by a random-sampling strategy to avoid biasing future statistical analysis. Just count how much you're writing and how much you're having to drop due to the resource shortage in each period of time (say every minute or so), so you can use "stratified sampling" techniques in future data-mining explorations.
Thanks Alex - so you'd suggest a hybrid method then, assuming that I'll still need to buffer updates if all connections are in use?
(I'm the original poster, I've just managed to get two accounts without realizing)
I’m building a system that generates “work items” that are queued up for back-end processing. I recently completed a system that had the same requirements and came up with an architecture that I don’t feel is optimal and was hoping for some advice for this new system.
Work items are queued up centrally and need to be processed in an essentially FIFO order. If this were the only requirement, then I would probably favor an MSMQ or SQL Server Service Broker solution. However, in reality, I need to select work items in a modified FIFO order. A work item has several attributes, and they need to be assigned in FIFO order where certain combinations of attribute values exist.
As an example, a work item may have the following attributes: Office, Priority, Group Number and Sequence Number (within group). When multiple items are queued for the same Group Number, they are guaranteed to be queued in Sequence Number order and will have the same priority.
There are several back-end processes (currently implemented as Windows Services) that pull work items in modified FIFO order given certain configuration parameters for the given service. The service running in Washington, DC is configured to process only work items for DC, while the service in NY may be configured to process both NY and DC items (mainly to increase overall throughput). In addition to this type of selectivity, higher-priority items should be processed first, and items that share the same "Group Number" must be processed in Sequence Number order. So if the NY service is working on a DC item in group 100 with sequence 1, I don't want the DC service to pull the DC item in group 100 with sequence 2, because sequence 1 is not yet complete. Items in other groups should remain eligible for processing.
In the last system, I implemented the queues with SQL tables. I created stored procedures to submit items and, more importantly, to “assign” items to the Windows Services that were responsible for processing them. The assignment stored procedures contain the selection logic I described above. Each Windows Service would call the assignment stored procedure, passing it the parameters that were unique to that instance of the service (e.g. the eligible offices). This assignment stored procedure stamps the work item as assigned (in process) and when the work is complete, a final stored procedure is called to remove the item from the “queue” (table).
This solution does have some advantages in that I can quickly examine the state of these “queues” by a simple SQL select statement. I’m also able to manipulate the queues easily (e.g. I can bump priorities with a simple SQL update statement). However, on the downside, I occasionally have to deal with deadlocks on these queue tables and have the burden of writing these stored procedures (which gets tedious after a while).
Somehow I think that either MSMQ (with or without WCF) or Service Broker should be able to provide a more elegant solution. Rolling my own queuing/work-item-processing system just feels wrong. But as far as I know, these technologies don't offer the flexibility that I need in the assignment process. I am hoping that I am wrong. Any advice would be welcome.
It seems to me that your concept of an atomic unit of work is a Group. So I would suggest that you only queue up a message that identifies a Group Id, and then your worker will have to go to a table that maps Group Id to 1 or more Work Items.
You can handle your other problems by using more than one queue - NY-High, NY-Low, DC-High, DC-Low, etc.
In all honesty, though, I think you are better served by fixing the deadlock issues in your current architecture. You should be reading the TOP 1 message from your queue table with the UPDLOCK and READPAST hints, ordered by your priority logic and whatever filter criteria you want (Office/Location). Then you process that one message and change its status or move it to another table. You should be able to call that stored procedure in parallel without a deadlock issue; a sketch follows.
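Roughly like this (T-SQL issued through pyodbc purely for illustration; the table and column names are invented, and a real query would also need a NOT EXISTS clause to skip groups whose earlier sequence number is still in process):

import pyodbc

DEQUEUE_SQL = """
    SELECT TOP (1) id
      FROM work_items WITH (UPDLOCK, READPAST, ROWLOCK)
     WHERE status = 'Queued'
       AND office IN ('NY', 'DC')              -- this service's eligible offices
     ORDER BY priority DESC, group_number, sequence_number
"""

def dequeue_next(conn_str):
    """Claim one item. UPDLOCK holds the row until commit; READPAST makes
    concurrent services skip rows another service has locked instead of
    blocking or deadlocking."""
    conn = pyodbc.connect(conn_str)            # autocommit is off by default
    try:
        cur = conn.cursor()
        row = cur.execute(DEQUEUE_SQL).fetchone()
        if row is None:
            conn.commit()
            return None
        cur.execute("UPDATE work_items SET status = 'InProcess' WHERE id = ?", row.id)
        conn.commit()
        return row.id
    finally:
        conn.close()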
Queues are for FIFO order, not random-access order. Even though you say that you want FIFO order, you want FIFO order with respect to a random set of variables, which is essentially random order. If you want to use queues, you need to be able to determine the order before the message goes into the queue, not after.