Sequential consistency in newbie terms?

Sequential consistency
The result of any execution is the same as if the operations of all
the processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in the
order specified by its program.
I'm new to distributed systems. What does "execution" mean in this context, and can you explain this definition in a simple way?

A program running in a sequentially consistent distributed environment behaves as if all the instructions were interleaved in some sequential order. This means multiple execution orders are possible and allowed, provided that the instruction order of each thread of execution is preserved.
Example:
Let's say we have a program with two threads that runs on a distributed system with 2 processors:
Thread 1: print "Hello\n";
          print "world\n"
Thread 2: print "Hi!\n"
Assumption: In this language "print" is thread-safe and not buffered.
Sequential consistency rule: "Hello" will always be printed before "world".
Possible execution 1:
Processor 1 | Processor 2
Hello       |
world       |
            | Hi!
Possible execution 2:
Processor 1 | Processor 2
Hello       |
            | Hi!
world       |
Possible execution 3:
Processor 1 | Processor 2
            | Hi!
Hello       |
world       |
Impossible execution (printing "world" before "Hello" breaks sequential consistency):
Processor 1 | Processor 2
            | Hi!
world       |
Hello       |
Now, revisiting your definition:
The result of any execution is the same as if the operations of all
the processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in the
order specified by its program.
And rewording it with the example above:
In a distributed environment that is sequentially consistent, the result of any execution (see the three possible executions in the example above) is the same as executing the instructions of processor 1 and processor 2 in some sequential order, while preserving the instruction order specified by the program ("Hello" must be printed before "world").
That is, the execution results are the same as if the instructions executed on the different processors were interleaved and run sequentially on a single-core processor. Being sequentially consistent thus makes the distributed system predictable and establishes certain important guarantees when memory access is involved.
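If it helps, here is the example above as a minimal C/pthreads sketch (my own illustration, assuming as above that the prints are thread-safe and unbuffered):

#include <pthread.h>
#include <stdio.h>

/* Thread 1: its two prints must appear in program order. */
void *thread1(void *arg) {
    (void)arg;
    printf("Hello\n");   /* always printed before "world" */
    printf("world\n");
    return NULL;
}

/* Thread 2: "Hi!" may appear before, between, or after thread 1's prints. */
void *thread2(void *arg) {
    (void)arg;
    printf("Hi!\n");
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Running this repeatedly can produce any of the three orderings shown above, but never "world" before "Hello".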
Hope this helps!

Related

Parallel efficiency drops inconsistently

My question is probably trivial in nature. I parallelised a CFD code using MPI libraries and now I am trying to investigate my parallel efficiency. To start with, I created a case that provides equal loads among the ranks and a constant ratio of volume of calculations to transferred data. Thus, my expectation was that as I increase the ranks, any runtime changes would be attributable to communication delays only. However, I realised that subroutines that do not invoke rank communication (they only do domain calculations, hence they deal with the same load on all ranks) contribute significantly (actually the most) to the runtime increases. What am I missing here? Does this even make sense?
Does this even make sense?
Yes!
The more processes you create (every process has a rank), the closer you get to the limit of your system's ability to execute processes in a truly parallel manner.
Your system (e.g. your computer) can run a certain number of processes in parallel. When this limit is exceeded, some processes wait to be executed (so not all processes run in parallel), which harms performance.
For example, assuming a computer has 4 cores and you create 4 processes, every core can execute one process, so your performance is affected only by the communication between the processes, if any.
Now, on the same computer, you create 8 processes. What will happen?
Four of the processes will start executing in parallel, but the other four will wait for a core to become available so that they can run too. This is not truly parallel execution (some processes will execute in a serial fashion). Moreover, depending on the OS scheduling policy, some processes may be interleaved, causing overhead at every context switch.
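As a rough sanity check, you can compare the number of ranks you intend to launch with the number of online cores (a POSIX/Linux sketch; the requested count here is just an illustration):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of cores currently online (POSIX). */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    int requested = 8;   /* e.g. what you would pass to mpirun -np */

    printf("online cores: %ld, requested ranks: %d\n", cores, requested);
    if (requested > cores)
        printf("oversubscribed: some ranks will time-share a core\n");
    return 0;
}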

fsync() atomicity across data blocks

When calling fsync() on a file, can the file become corrupted?
For example, say my file spreads across two disk blocks:
     A              B
|---------|    |---------|
| Hello,  | -> | World!  |
|---------|    |---------|
| 1234567 |    | 89abcd  |
|---------|    |---------|
Say I want to change the entire file contents to lower case (in a very inefficient manner). So I seek to position 1 of the file to change "H" into "h" and then position 8 to change "W" to "w". I then call fsync() on the file. The file is spread across two disk blocks.
Is the ordering of the writes maintained?
Is the fsync() operation atomic across the disk blocks?
The fsync call won't return until both writes are written to disk, along with any associated metadata. If your computer crashes (typically by losing power) and you have a corrupted file then log a bug report with the filesystem maintainers - that shouldn't happen. If fsync returns then the data is safely on disk.
To answer your questions, though: there's no reason why the filesystem and disk driver can't reorder the writes (they see them as non-overlapping, and it might be useful to write the second one first if that's where the disk head is on rotating media). And secondly, there's no way for fsync to be atomic, as it deals with real hardware. It should act atomically to the user, though (you will see the first version of the file or the second, but not something corrupted).

What causes an openmp program to run some threads in D state for large dataset size?

I am implementing an OpenMP multithreaded program on the following machine:
Architecture:        x86_64
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           2
It is a multithreaded clustering program. It shows the expected speedup for dataset sizes up to 2 million rows (~250 MB of data), but while testing on a larger dataset, many of the threads in htop show the D state, with CPU% substantially less than 99-100%. Note that for datasets up to that size, every thread runs in the R state with CPU% ~100%. The running time becomes ~100 times longer than in the sequential case.
Free memory seems to be available, and swap usage is 0 in all cases.
Regarding the data structures used, there are 3 shared data structures of size O(n), and each thread creates a private linked list that is stored for the later merging step. I suspected the extra memory used by this per-thread data structure, but even if I comment it out the program shows the same problem. Please let me know if I should provide more details.
I only picked up OpenMP and parallel computing a few months ago, so please let me know what the possible problems could be.

Is this enough to detect race conditions?

Say I have a multithreaded application and I run it with the same inputs. Is it enough to instrument every load and store to detect write-write and write-read data races? I mean, from the logged load and store addresses, if we can see which thread did which load and which thread did which store, we can detect write-read and write-write data races by noticing the overlapping addresses. Or am I missing something?
Or am I missing something?
You are missing a lot. As Pubby said, if you see a read then a write in T1, and later a read then a write in T2, you can't say anything about the absence of races. You need to know about the locks involved.
You may want to use a tool, such as Google's ThreadSanitizer instead.
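For example, ThreadSanitizer will flag a race like the following at runtime, whichever interleaving actually occurs (a minimal sketch; build with gcc -g -fsanitize=thread -pthread):

#include <pthread.h>
#include <stdio.h>

int counter = 0;            /* shared, with no synchronization */

void *worker(void *arg) {
    (void)arg;
    counter++;              /* unsynchronized read-modify-write: a data race */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);
    return 0;
}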
Update:
But will my approach cover all races or at least some of the races?
Your comments here and on other answers appear to show that you don't understand what a race is.
Your approach may expose some of the races, yes. It is guaranteed to not cover most of them (which will make the exercise futile).
Here is a simple example from Wikipedia that I have slightly modified:
As a simple example, let us assume that two threads T1 and T2 each want
to perform arithmetic on the value of a global integer. Ideally, the
following sequence of operations would take place:
Integer i = 0; (memory)
T1 reads the value of i from memory into register1: 0
T1 increments the value of i in register1: (register1 contents) + 1 = 1
T1 stores the value of register1 in memory: 1
T2 reads the value of i from memory into register2: 1
T2 multiplies the value of i in register2: (register2 contents) * 2 = 2
T2 stores the value of register2 in memory: 2
Integer i = 2; (memory)
In the case shown above, the final value of i is 2, as expected.
However, if the two threads run simultaneously without locking or
synchronization, the outcome of the operation could be wrong. The
alternative sequence of operations below demonstrates this scenario:
Integer i = 0; (memory)
T1 reads the value of i from memory into register1: 0
T2 reads the value of i from memory into register2: 0
T1 increments the value of i in register1: (register1 contents) + 1 = 1
T2 multiplies the value of i in register2: (register2 contents) * 2 = 0
T1 stores the value of register1 in memory: 1
T2 stores the value of register2 in memory: 0
Integer i = 0; (memory)
The final value of i is 0 instead of the expected result of 2. This
occurs because the operations in the second case are not mutually
exclusive. Mutually exclusive operations are those that cannot be
interrupted while accessing some resource such as a memory location.
In the first case, T1 was not interrupted while accessing the variable
i, so its operation was mutually exclusive.
All of these individual operations are atomic. The race condition occurs because this particular order does not have the same semantics as the first. How do you prove the semantics are not the same as the first? You know they are different for this case, but in general you would need to check every possible order to determine that you have no race conditions. This is a very hard thing to do with immense complexity (probably NP-hard, or worse), and thus can't be checked reliably.
What happens if a certain order never halts? How do you even know it will never halt in the first place? You're basically left with solving the halting problem, which is an impossible task.
If you're talking about using consecutive reads or writes to determine the race, then observe this:
Integer i = 0; (memory)
T2 reads the value of i from memory into register2: 0
T2 multiplies the value of i in register2: (register2 contents) * 2 = 0
T2 stores the value of register2 in memory: 0
T1 reads the value of i from memory into register1: 0
T1 increments the value of i in register1: (register1 contents) + 1 = 1
T1 stores the value of register1 in memory: 1
Integer i = 1; (memory)
This has the same read/store pattern as the first but gives different results.
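For reference, here is the example written out as a C/pthreads sketch (my own illustration); running it repeatedly may print 2, 1, or 0, depending on the interleaving:

#include <pthread.h>
#include <stdio.h>

int i = 0;   /* the shared global integer from the example */

void *t1_increment(void *arg) {
    (void)arg;
    i = i + 1;   /* read i, add 1, store back: not one atomic step */
    return NULL;
}

void *t2_double(void *arg) {
    (void)arg;
    i = i * 2;   /* read i, multiply by 2, store back: not one atomic step */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, t1_increment, NULL);
    pthread_create(&t2, NULL, t2_double, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("i = %d\n", i);   /* 2, 1, or 0 */
    return 0;
}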
The most obvious thing you'll learn is that there are several threads using the same memory. That's not necessarily bad in itself.
Good uses would include protection by semaphores, atomic access and mechanisms like RCU or double buffering.
Bad uses would include race conditions, and true and false sharing:
Race conditions mostly stem from ordering issues: if a task A writes something at the end of its execution and task B needs that value at its start, you had better make sure that B's read only happens after A has completed. Semaphores, signals, or similar mechanisms are a good solution to this. Or run both in the same thread, of course.
True sharing means that two or more cores are aggressively reading and writing the same memory address. This slows down the processor, as it constantly has to propagate any changes to the caches of the other cores (and to memory, of course). Your approach could catch this, but probably not highlight it.
False sharing is even more subtle than true sharing: processor caches do not work on single bytes but on "cache lines", which hold more than one value. If core A keeps hammering byte 0 of a line while core B keeps writing to byte 4, the cache updating will still stall the whole processor.
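A common mitigation, sketched below, is to pad (and align) each thread's data so it occupies its own cache line; the 64-byte line size is an assumption, not a universal constant:

#include <pthread.h>

/* One counter per thread, padded and aligned to a 64-byte cache
 * line so the two threads never write to the same line. */
struct padded_counter {
    _Alignas(64) long value;
    char pad[64 - sizeof(long)];
};

struct padded_counter counters[2];

void *hammer(void *arg) {
    struct padded_counter *c = arg;
    for (long n = 0; n < 10000000; n++)
        c->value++;          /* each thread touches only its own line */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, hammer, &counters[0]);
    pthread_create(&b, NULL, hammer, &counters[1]);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}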

How to know number of existing Openmp threads

I have an OpenMP program running with, say, 6 threads on an 8-core machine. How can I extract this information (num_threads = 6) from another program (a non-OpenMP, plain C program)? Can I get this info from the underlying kernel?
I was looking at run-queue lengths using "sar -q 1 0", but this doesn't yield consistent results: sometimes it gives 8, sometimes more or less.
In Linux, threads are processes (see first post here), so you can ask for a list of running processes with ps -eLf. However, if the machine has 8 cores, it is possible that OpenMP created 8 threads (even though it currently uses 6 of them for your computation); in this case, it is your code that must store somewhere (e.g. a file, or a FIFO) information about the threads that it is using.
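A minimal sketch of that suggestion, assuming the OpenMP program can be modified; it records its own team size (the file name is just an illustration):

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(6)
    {
        #pragma omp single   /* exactly one thread writes the count */
        {
            FILE *f = fopen("/tmp/omp_threads.txt", "w");
            if (f) {
                fprintf(f, "%d\n", omp_get_num_threads());
                fclose(f);
            }
        }
    }
    return 0;
}

The other (plain C) program can then simply read /tmp/omp_threads.txt.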
