Software mutex not working with out of order execution - c

In the article https://en.m.wikipedia.org/wiki/Mutual_exclusion#Software_solutions it is given that
These algorithms do not work if out-of-order execution is used on the platform that executes them. Programmers have to specify strict ordering on the memory operations within a thread.
But those algorithms are simple C programs and if we can't be sure of them working as expected in ooo systems how can we be sure that our other programs will work correctly?
What do these programs do that fails in ooo situation?
Basically when do ooo programs not work, so we can be careful in their usage?
Can we trust ooo processors to do what we code?

There are no problems with out-of-order execution within a program (more precisely within a thread). This can only lead to problems if there is concurrency (two or more threads run in parallel), for example with a software mutex.
Barriers with mfence (or lfence and sfence in special cases) can help on x86 plattforms. They instruct the processor that no out-of-order execution is to take place at this point. These are assembler instructions, so in C you have to write
asm volatile("mfence");
or use the corresponding instrinct.
Another problem could be that the compiler arranges the instructions differently than they are in the program or makes other optimizations (for example, does not write the values in the memory at all but keeps them in a register). To prevent this, the keyword volatile must be used at the variables you use for the software mutex.

Related

Do I need to use smp_mb() after binding the CPU

Suppose my system is a multicore system, if I bind my program on a cpu core, still I need the smp_mb() to guard the cpu would not reorder
the cpu instructions?
I have this point because I know that the smp_mb() on a single-core systems is not necessary,but I'm no sure this point is correct.
You rarely need a full barrier anyway, usually acquire/release is enough. And usually you want to use C11 atomic_load_explicit(&var, memory_order_acquire), or in Linux kernel code, use one of its functions for an acquire-load, which can be done more efficiently on some ISAs than a plain load and an acquire barrier. (Notably AArch64 or 32-bit ARMv8 with ldar or ldapr)
But yeah, if all threads are sharing the same logical core, run-time memory reordering is impossible, only compile-time. So you just need a compiler memory barrier like asm("" ::: "memory") or C11 atomic_signal_fence(seq_cst), not a CPU run-time barrier like atomic_thread_fence(seq_cst) or the Linux kernel's SMP memory barrier (smp_mb() is x86 mfence or equivalent, or ARM dmb ish, for example).
See Why memory reordering is not a problem on single core/processor machines? for more details about the fact that all instructions on the same core observe memory effects to have happened in program order, regardless of interrupts. e.g. a later load must see the value from an earlier store, otherwise the CPU is not maintaining the illusion of instructions on that core running in program order.
And if you can convince your compiler to emit atomic RMW instructions without the x86 lock prefix, for example, they'll be atomic wrt. context switches (and interrupts in general). Or use gcc -Wa,-momit-lock-prefix=yes to have GAS remove lock prefixes for you, so you can use <stdatomic.h> functions efficiently. At least on x86; for RISC ISAs, there's no way to do a read-modify-write of a memory location in a single instruction.
Or if there is (ARMv8.1), it implies an atomic RMW that's SMP-safe, like x86 lock add [mem], eax. But on a CISC like x86, we have instructions like add [mem], eax or whatever which are just like separate load / ADD / store glued into a single instruction, which either executes fully or not at all before an interrupt. (Note that "executing" a store just means writing into the store buffer, not globally visible cache, but that's sufficient for later code on the same core to see it.)
See also Is x86 CMPXCHG atomic, if so why does it need LOCK? for more about non-locked use-cases.

Lock-free atomic operations on ARM architectures prior to ARMv6

Are there any implementations thereof in C? All those I've seen so far are based on the LDREX/STREX instructions, which were introduced only in the ARMv6 architecture. The only possible solution for previous architectures seems to be to disable/enable IRQs, which makes the operations blocking.
Are there any implementations thereof in C?
No. It is impossible to implement this is 'C' without support from the compiler or assembly (assembly support in compiler). 'C' has no instruction to guarantee that something executes atomically.
The only possible solution for previous architectures seems to be to disable/enable IRQs, which makes the operations blocking.
Many lock-free algorithms need 'CAS' (compare-and-set). swp and swpb can be used to do some primitive four value operations, but they are not CAS. In order to do four sources and one consumer, you can give each of four bytes to the sources, using swpb and have the consumer use swp to transfer the four 'work' bytes. Most ARM cpus prior to ARMv6 are single core and locking interrupts is the common way to do things. ARMv6 cores have support for LDREX/STREX. The swp instruction is not multi-cpu friendly as it locks the entire bus for the transaction (read/write). However, swp can be used for spin locks if it is the only thing available.
Linux has support for a 'compare and exchange' with OS help. The gist is that a small fixed assembler sequence does the compare and exchange. Interrupt and data abort code is hooked to make sure if this code is interrupted that it is restarted.

What is meant by interleaved "multi-threading" in c?

Can anybody explain what is an interleaved multi-threading means?
Real-time examples are also allowed.
This is described by Intel as hyper-threading. The CPU has a single core with 2 register sets.
These can be used to increase the utilisation of the core.
This is opaque to code, in that it behaves like 2 cores. However only one at a time can run.
If multi-threaded, your code still needs atomic, mutexes etc
The wiki says:
The purpose of interleaved multithreading is to remove all data
dependency stalls from the execution pipeline. Since one thread is
relatively independent from other threads, there's less chance of one
instruction in one pipe stage needing an output from an older
instruction in the pipeline. Conceptually, it is similar to preemptive
multitasking used in operating systems; an analogy would be that the
time slice given to each active thread is one CPU cycle.
This Wikipedia page explains it pretty well.
Basically it's about interleaving instructions from different OS-level threads on the CPU, to reduce the risk of costly dependencies between instructions.

Performance of System()

For the function in c, system(), would it affect the hardware counters if you are trying to see how that command you ran performed
For example lets say im using the Performance API(PAPI) and the program is a precompiled matrix multiplication application
PAPI_start_counters();
system("./matmul");
PAPI_read_counters();
//Print out values
PAPI_stop_counters();
I am obviously missing a bit but what I am trying to find out is it is possible, through the use of said counters to get the performance of a program im running.
from my tests I would get wild numbers like the ones below. they are obviously wrong, just want to find out why
Total Cycles =========== 140733358872510
Instructions Completed =========== 4203968
Floating Point Instructions =========== 0
Floating Point Operations =========== 4196867
Loads =========== 140733358872804
Stores =========== 4204037
Branches Taken =========== 15774436
system() is a very slow function in general. On Linux, it spawns /bin/sh (forking and executing a full shell process), which parses your command, and spawns the second program. Loading these two programs requires loading the code to memory, initializing all their libraries, executing startup code, etc. Only then will the program code actually start executing.
Because of the unpredictability of disk access and Linux process scheduling, timing system() calls has a very high inherent variability. Therefore, you won't get accurate results even if you use a high-performance counter.
The better solution would be to compile the target program as a library instead. Load it before initializing your counters, then just execute the main function from the library. That way, all the code executes in your process, and you have negligible startup time. Your performance numbers will be much more precise this way.
Do you have access to the code of matmul? If so, it's much more precise to instrument and measure only the code you're interested in. That means you wrap only those instructions (or C statements) in counters that you want to measure.
For more information see:
Related discussion here
Intel® Performance Counter Monitor here
Performance measurements with x86 RDTSC instruction here
As stated above, measuring using PAPI to wrap system() invocations carries way too much process overhead to give you any idea of how fast your math code is actually running.
The numbers you are getting are odd, but not necessarily wrong. The huge disparity between the instructions completed and the cycles probably indicate that the executable "matmul" is doing a lot of waiting for external processes (e.g. disk I/O) to complete. I do not know the specifics of the msg FP Instructions and FP ops, but if they are displaying those values differently PAPI has a reason.
What is interesting is that the loads and cycles are obviously connected as well as instructions/fp ops and stores.
I would have to know about the internals of "matmul" in order to give you a better description.

Measuring CPU clocks consumed by a process

I have written a program in C. Its a program created as result of a research. I want to compute exact CPU cycles which program consumes. Exact number of cycles.
Any idea how can I find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
No you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It use various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of it's use in the article, Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing a interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).

Resources