How does Raft deal with delayed replies in AppendEntries RPC? - distributed

I came up with a question while reading the Raft paper. The scenario is as follows. There are 3 newly started Raft instances, R1, R2, and R3. R1 is elected leader with nextIndex {1, 1, 1}, matchIndex {0, 0, 0}, and term 1. It then receives command 1 from the client, and the logs of the instances are as follows:
R1: [index 0, command 0], [index 1, command 1]
R2: [index 0, command 0]
R3: [index 0, command 0]
What if the network is not reliable? If R1 sends this entry to R2 but the AppendEntries RPC times out, the leader R1 has to resend [index 1, command 1]. It may then receive the reply {term: 1, success: true} twice.
The paper says:
If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
• If successful: update nextIndex and matchIndex for follower (§5.3)
• If AppendEntries fails because of log inconsistency: decrement nextIndex and retry (§5.3)
So the leader R1 will increase nextIndex and matchIndex twice, giving nextIndex {1, 3, 1} and matchIndex {0, 2, 0}, which is not correct. When the leader sends the next AppendEntries RPC, i.e., a heartbeat or log replication, it can fix nextIndex, but matchIndex will never have a chance to be fixed.
My solution is to add a sequence number to both the AppendEntries arguments and results for every RPC call. However, I was wondering if there is a way to solve this problem using only the arguments given in the paper, that is, without the sequence number.
Any advice will be appreciated and thank you in advance.

The protocol assumes that there’s some context with respect to which AppendEntries RPC a follower is responding to. So, at some level there does need to be a sequence number (or more accurately a correlation ID), whether that be at the protocol layer, messaging layer, or in the application itself. The leader has to have some way to correlate the request with the response to determine which index a follower is acknowledging.
But there’s actually an alternative to this that’s not often discussed. Some modifications of the Raft protocol have the followers send their last log index in responses. You could also use that last log index to determine which indexes have been persisted on the follower.
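To make the correlation idea concrete, here is a minimal Python sketch, assuming the RPC layer hands the leader the original request alongside each reply. The class and method names (`Leader`, `handle_reply`, and so on) are made up for illustration and do not come from the paper. Instead of incrementing, the leader sets matchIndex from what the request actually carried and never moves it backwards, so a duplicated or delayed success reply changes nothing:

```python
from dataclasses import dataclass, field


@dataclass
class AppendEntriesArgs:
    term: int
    prev_log_index: int
    entries: list


@dataclass
class AppendEntriesReply:
    term: int
    success: bool


@dataclass
class Leader:
    current_term: int
    next_index: dict = field(default_factory=dict)
    match_index: dict = field(default_factory=dict)

    def handle_reply(self, follower, args, reply):
        # Drop replies that belong to another term (stepping down on a
        # higher term is omitted from this sketch).
        if reply.term != self.current_term or args.term != self.current_term:
            return
        match = self.match_index.get(follower, 0)
        if reply.success:
            # The follower now holds everything up to the last entry sent
            # in THIS request; max() keeps stale replies from regressing.
            acked = args.prev_log_index + len(args.entries)
            match = max(match, acked)
            self.match_index[follower] = match
            self.next_index[follower] = match + 1
        else:
            # Back off, but never below what is already known to match.
            current = self.next_index.get(follower, match + 1)
            self.next_index[follower] = max(match + 1, current - 1)


leader = Leader(current_term=1, next_index={"R2": 1}, match_index={"R2": 0})
request = AppendEntriesArgs(term=1, prev_log_index=0, entries=["command 1"])
reply = AppendEntriesReply(term=1, success=True)
leader.handle_reply("R2", request, reply)
leader.handle_reply("R2", request, reply)  # duplicate reply after a resend
```

With the scenario from the question, both replies leave R2 at matchIndex 1 and nextIndex 2. If followers report their last log index in the reply instead, `acked` would come from the reply rather than from the request.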

Related

Negative Number in SPID

I am constantly getting Azure SQL Server sessions with a "Blk By" (blocked by) column value of -5.
What could this session be, please? I have searched for negative SPID numbers and I can find information for -2, -3, and -4, but not for -5.
(I have removed identifiable information for hostname, login dbname columns)
Copied below is a listing from sp_who2
According to the documentation for blocking_session_id
-5 = Session ID of the blocking latch owner could not be determined because it is not tracked for this latch type (for example, for an SH latch).
Edit: I came across this from a Microsoft engineer. The bit I found interesting:
A blocking session id of -5 alone does not indicate a performance problem. The addition of -5 is just an indication that the session is waiting on an asynchronous action to complete where-as prior to the addition, the same session wait would have showed blocking session = 0 but was still in a wait state.

Plone site update has been running for ages

I updated a Plone instance from 4.0.3 to 4.3.11, and the site update has now been running for about 16 hours. Sure, the webserver timed out after an hour or so, but the process is still running. strace says:
select(12, [4 11], [], [4 11], {25, 609847}) = 0 (Timeout)
futex(0x1d01d30, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1d01d30, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1d01d30, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x1d01d30, FUTEX_WAKE_PRIVATE, 1) = 1
select(12, [4 11], [], [4 11], {30, 0}
while this line
select(12, [4 11], [], [4 11], {30, 0}
repeats very often, and sometimes this occurs:
futex(0x1d01d30, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
iostat tells me that the disk (a new SSD) is utilized 10% at most, and is mostly idle. It is also the system disk, so I am not seeing only Plone's disk IO.
The database for the site contains about 60,000 objects, mostly of the same type. They are very small objects with nothing fancy.
The machine has 16GB memory and 8 cores, yet only one core is performing the actual Plone upgrade (why?)
Does it really take this long to upgrade the ZEO DB with 60,000 objects? How can I know that it is really doing something? (strace is not very telling here.)
The machine has 16GB memory and 8 Cores. While only one core is performing the actual plone-upgrade (why?)
Because only one thread (so one CPU) is running the upgrade.
Does it really take this long to Upgrade the ZEO DB with 60.000 objects?
It is not normal. Maybe you have some custom code which is doing strange things. Are you connecting to other services (solr, other databases, ...)? Are you generating document previews? How big is your Data.fs and how many blobs do you have?
How can I know, that he is really doing something?
The first step in debugging is to find out what is happening. Try installing https://pypi.python.org/pypi/Products.LongRequestLogger (or a similar add-on).
It will show you where you are stuck.
If the instance is running in the foreground, you can also get a traceback by sending it the USR1 signal. See:
What's the modern way to solve Plone deadlock issues?
for more complete insight.
webserver timeouted after an hour or so
This also sounds strange. If the webserver is Apache or nginx, the timeout should be in the range of minutes.
If you call the instance port directly, you should not get any timeout at all.
I suggest you do so.
The instance logs (usually under $BUILDOUT_DIRECTORY/var/log/) should also tell you about the status of your upgrades.

What could be the reason for high "SQL Server parse and compile time"?

A little background first:
I have a legacy application with the security rules inside the app.
To reuse the database model in an additional app, with the security model integrated inside the database, I decided to use views with the security rules inside the view SQL. The logic works well, but performance was not good (high IO caused by scans of many tables). So I used indexed views instead of standard views for the basic columns I need for security, and added another view with the security rules on top of the indexed view. This works perfectly in terms of IO, but now I have poor parse and compile times.
When I clear all buffers, a simple SQL statement against the top view gives these timings:
SQL Server parse and compile time:
CPU time = 723 ms, elapsed time = 723 ms.
-----------
7
(1 row(s) affected)
Table '#A7F38F33'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'xADSDocu'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
When I execute the same statement again, the parse time is of course zero.
In the past I sometimes saw a very long parse time (>1 sec) when re-executing the same statement later, even though no DML had run in the meantime. I have since deactivated all automatic statistics updates and have not seen these long parse times again.
But what could be the reason for such a long initial parse and compile time? It is huge and makes the app perform very badly with this solution.
Is there a way to look deeper into the parse time to find the root cause?
The reason for the poor compile time is the number of indexed views.
“The query optimizer may use indexed views to speed up the query execution. The view does not have to be referenced in the query for the optimizer to consider that view for a substitution.”
https://msdn.microsoft.com/en-us/library/ms191432(v=sql.120).aspx
This means that when compiling a statement, the optimizer may consider the indexes of all indexed views.
I have a sample in my DB where I can see this behavior: a simple statement on base tables uses the index from an indexed view.
So far so good, but once you reach about 500 indexes, the cost escalates and the optimizer needs more than 10 times as much CPU and memory to compute the plan. This behavior is nearly the same from version 2008 to 2014.

Identical messages committed during a network partition

I'm working on a distributed database. I'm in a situation where, during the healing of a partition (nodes are beginning to recognize the ones that they were split from) two different clients try and commit a Compare-and-Set of 3 to 4, and both are successful. Logically, this should not be possible, but I'm curious if there is any functional problem with both returning successful. Both clients correctly believe what the final state is, and the command that they sent out was successful. I can't think of any serious problems. Are there any?
The "standard" definition of CAS (to the extent that there is such a thing?) guarantees that at most one writer will see a successful response for a particular transition. A couple examples that depend on this guarantee:
// generating a unique id
while (true) {
    unique_id = read(key)
    if (compare_and_set(key, unique_id, unique_id + 1)) {
        return unique_id
    }
}
If two clients both read 3 and successfully execute compare_and_set(key, 3, 4), they'll both think they've "claimed" 3 as their unique id and may end up colliding down the road.
// distributed leases/leader election
while (true) {
    locked_until = read(key)
    if (locked_until < now()) {
        if (compare_and_set(key, locked_until, now() + TEN_MINUTES)) {
            // I'm now the leader for ~10 minutes.
            return;
        }
    }
    sleep(TEN_MINUTES)
}
Similar problem here: if two clients see that the lock is available and both successfully CAS to acquire it, they'll both believe that they are the leader at the same time.
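For contrast, here is a minimal single-process sketch in Python of the guarantee both examples rely on. The `Register` class and `race` helper are hypothetical, and a lock stands in for whatever your database uses to serialize writes. When two clients race on the same transition, exactly one sees success:

```python
import threading


class Register:
    """A single value with an atomic compare-and-set."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        # The check and the write happen under one lock, so no other
        # caller can slip in between them.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def race(register, expected, new, clients=2):
    """Have several clients attempt the same transition concurrently."""
    results = []

    def attempt():
        results.append(register.compare_and_set(expected, new))

    threads = [threading.Thread(target=attempt) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


register = Register(3)
outcomes = race(register, expected=3, new=4)
```

Here `outcomes` contains one `True` and one `False`, and the loser knows it has to re-read and retry. If both returned `True`, as in the partition-healing case described in the question, neither client would retry.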

Appengine looping across large datasets

I need to loop over a large dataset within appengine. Of course, as the datastore times out after a small amount of time, I decided to use tasks to solve this problem. Here's an attempt to explain the method I'm trying to use:
Initialization of task via http post
0) Create query (entity.query()), and set a batch_size limit (i.e. 500)
1) Check if there are any cursors--if this is the first time running, there won't be any.
2a) If there are no cursors, use iter() with the following options: produce_cursors = true, limit= batch_size
2b) If there are cursors, use iter() with the same options as 2a + set start_cursor to the cursor.
3) Do a for loop to iterate through the results pulled by iter()
4) Get cursor_after()
5) Queue new task (basically re-run the task that was running) passing the cursor into the payload.
So if this code were to work the way I wanted, there'd only be 1 task running at any particular time in the queue. However, I started running the task this morning and 3 hours later when I looked at the queue, there were 4 tasks in it! This is weird because the new task should only be launched at the end of the task launching it.
Here's the actual code with no edits:
class send_missed_swipes(BaseHandler): #disabled
    def post(self):
        """Loops across entire database (as filtered) """
        #Settings
        BATCH_SIZE = 500
        cursor = self.request.get('cursor')
        start = datetime.datetime(2014, 2, 13, 0, 0, 0, 0)
        end = datetime.datetime(2014, 3, 5, 0, 0, 00, 0)
        #Filters
        swipes = responses.query()
        swipes = swipes.filter(responses.date>start)
        if cursor:
            num_updated = int(self.request.get('num_updated'))
            cursor = ndb.Cursor.from_websafe_string(cursor)
            swipes = swipes.iter(produce_cursors=True,limit=BATCH_SIZE,start_cursor=cursor)
        else:
            num_updated = 0
            swipes = swipes.iter(produce_cursors=True,limit=BATCH_SIZE)
        count = 0
        for swipe in swipes:
            count += 1
            if swipe.date>end:
                pass
            else:
                uKey = str(swipe.uuId.urlsafe())
                pKey = str(swipe.pId.urlsafe())
                act = swipe.act
                taskqueue.add(queue_name="analyzeData", url="/admin/analyzeData/send_swipes", params={'act':act,'uKey':uKey,'pKey':pKey})
                num_updated += 1
        logging.info('count = '+str(count))
        logging.info('num updated = '+str(num_updated))
        cursor = swipes.cursor_after().to_websafe_string()
        taskqueue.add(queue_name="default", url="/admin/analyzeData/send_missed_swipes", params={'cursor':cursor,'num_updated':num_updated})
This is a bit of a complicated question, so please let me know if I need to explain it better. And thanks for the help!
p.s. Threadsafe is false in app.yaml
I believe a task can be executed multiple times, so it is important to make your process idempotent.
From the docs at https://developers.google.com/appengine/docs/python/taskqueue/overview-push:
Note that this example is not idempotent. It is possible for the task queue to execute a task more than once. In this case, the counter is incremented each time the task is run, possibly skewing the results.
You can give the task a name to handle this:
https://developers.google.com/appengine/docs/python/taskqueue/#Python_Task_names
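A rough sketch of the naming idea in plain Python: derive the follow-up task's name from the cursor it will resume at, so a task that runs twice tries to enqueue the same name twice and the second attempt is rejected. `InMemoryQueue` and `task_name_for` are hypothetical stand-ins so the example runs anywhere; they are not the App Engine taskqueue API, and the real queue's behaviour on a duplicate name should be checked against the docs linked above.

```python
import hashlib


def task_name_for(cursor):
    """Deterministic task name for the batch starting at this cursor."""
    # Hashing keeps the name short and limited to safe characters.
    digest = hashlib.sha1(cursor.encode("utf-8")).hexdigest()
    return "send-missed-swipes-" + digest


class InMemoryQueue:
    """Hypothetical stand-in for a queue that deduplicates by task name."""

    def __init__(self):
        self._seen = set()
        self.pending = []

    def add(self, name, params):
        if name in self._seen:
            return False  # duplicate enqueue is dropped
        self._seen.add(name)
        self.pending.append((name, params))
        return True


queue = InMemoryQueue()
next_cursor = "websafe-cursor-after-batch-1"  # made-up cursor value
params = {"cursor": next_cursor, "num_updated": 42}

first = queue.add(task_name_for(next_cursor), params)
# The same batch runs again (a retry) and tries to chain the same follow-up.
second = queue.add(task_name_for(next_cursor), params)
```

Only one follow-up task ends up in `queue.pending`, so the chain does not fan out when a batch is executed more than once.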
I'm curious why threadsafe=False in your yaml?
A bit off topic (since I'm not addressing your problems), but this sounds like a job for map reduce.
On topic: you can create custom queue with max_concurrent_requests=1. You could still have multiple tasks in the queue, but only one would be executing at a time.
