How can I copy multiple data elements between CPUs using cacheline atomicity in C?

I'm trying to implement an atomic copy of multiple data elements between CPUs. I packed multiple elements of data into a single cacheline so I could manipulate them atomically, and wrote the following code.
In this code (compiled with -O3), I aligned a global struct to a single cacheline; one CPU sets the elements and then issues a store barrier, which is meant to make the data globally visible to the other CPU.
At the same time, on the other CPU, I used a load barrier to access the cacheline atomically. My expectation was that the reader (consumer) CPU would bring the cacheline of data into its own cache hierarchy (L1, L2, etc.), and, since I do not use a load barrier again until the next read, that the elements of the data would all be consistent. But it does not work as expected: I can't keep the cacheline atomicity in this code. The writer CPU seems to put elements into the cacheline piece by piece. How is that possible?
#include <emmintrin.h>
#include <pthread.h>
#include "common.h"
#define CACHE_LINE_SIZE 64
struct levels {
uint32_t x1;
uint32_t x2;
uint32_t x3;
uint32_t x4;
uint32_t x5;
uint32_t x6;
uint32_t x7;
} __attribute__((aligned(CACHE_LINE_SIZE)));
struct levels g_shared;
void *worker_loop(void *param)
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(15, &cpuset);
pthread_t thread = pthread_self();
int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
fatal_relog_if(status != 0, status);
struct levels shared;
while (1) {
_mm_lfence();
shared = g_shared;
if (shared.x1 != shared.x7) {
printf("%u %u %u %u %u %u %u\n",
shared.x1, shared.x2, shared.x3, shared.x4, shared.x5, shared.x6, shared.x7);
exit(EXIT_FAILURE);
}
}
return NULL;
}
int main(int argc, char *argv[])
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(16, &cpuset);
pthread_t thread = pthread_self();
memset(&g_shared, 0, sizeof(g_shared));
int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
fatal_relog_if(status != 0, status);
pthread_t worker;
int istatus = pthread_create(&worker, NULL, worker_loop, NULL);
fatal_elog_if(istatus != 0);
uint32_t val = 0;
while (1) {
g_shared.x1 = val;
g_shared.x2 = val;
g_shared.x3 = val;
g_shared.x4 = val;
g_shared.x5 = val;
g_shared.x6 = val;
g_shared.x7 = val;
_mm_sfence();
// _mm_clflush(&g_shared);
val++;
}
return EXIT_SUCCESS;
}
The output looks like this:
3782063 3782063 3782062 3782062 3782062 3782062 3782062
UPDATE 1
I updated the code as below to use AVX-512, but the problem is still there.
#include <emmintrin.h>
#include <pthread.h>
#include "common.h"
#include <immintrin.h>
#define CACHE_LINE_SIZE 64
/**
* Copy 64 bytes from one location to another,
* locations should not overlap.
*/
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
__m512i zmm0;
zmm0 = _mm512_load_si512((const void *)src);
_mm512_store_si512((void *)dst, zmm0);
}
struct levels {
uint32_t x1;
uint32_t x2;
uint32_t x3;
uint32_t x4;
uint32_t x5;
uint32_t x6;
uint32_t x7;
} __attribute__((aligned(CACHE_LINE_SIZE)));
struct levels g_shared;
void *worker_loop(void *param)
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(15, &cpuset);
pthread_t thread = pthread_self();
int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
fatal_relog_if(status != 0, status);
struct levels shared;
while (1) {
mov64((uint8_t *)&shared, (uint8_t *)&g_shared);
// shared = g_shared;
if (shared.x1 != shared.x7) {
printf("%u %u %u %u %u %u %u\n",
shared.x1, shared.x2, shared.x3, shared.x4, shared.x5, shared.x6, shared.x7);
exit(EXIT_FAILURE);
} else {
printf("%u %u\n", shared.x1, shared.x7);
}
}
return NULL;
}
int main(int argc, char *argv[])
{
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(16, &cpuset);
pthread_t thread = pthread_self();
memset(&g_shared, 0, sizeof(g_shared));
int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
fatal_relog_if(status != 0, status);
pthread_t worker;
int istatus = pthread_create(&worker, NULL, worker_loop, NULL);
fatal_elog_if(istatus != 0);
uint32_t val = 0;
while (1) {
g_shared.x1 = val;
g_shared.x2 = val;
g_shared.x3 = val;
g_shared.x4 = val;
g_shared.x5 = val;
g_shared.x6 = val;
g_shared.x7 = val;
_mm_sfence();
// _mm_clflush(&g_shared);
val++;
}
return EXIT_SUCCESS;
}

I used a load barrier to access the cacheline atomically
No, barriers do not create atomicity. They only order your own operations; they do not stop operations from other threads from appearing between two of your own.
Non-atomicity happens when another thread's store becomes visible between two of your loads. lfence does nothing to stop that.
lfence here is pointless; it just makes the CPU running this thread stall until it drains its ROB/RS before executing the loads. (lfence serializes execution, but has no effect on memory ordering unless you're using NT loads from WC memory e.g. video RAM).
Your options are:
Recognize that this is an X-Y problem and do something that doesn't require 64-byte atomic loads/stores. e.g. atomically update a pointer to non-atomic data. The general case of that is RCU, or perhaps a lock-free queue using a circular buffer.
Or
Use a software lock to get logical atomicity (like _Atomic struct levels g_shared; with C11) for threads that agree to cooperate by respecting the lock.
A SeqLock might be a good choice for this data if it's read more often than it changes, or especially with a single writer and multiple readers. Readers retry when tearing may have been possible; they check a sequence number before/after the read, using sufficient memory-ordering. See Implementing 64 bit atomic counter with 32 bit atomics for a C++11 implementation; C11 is easier because C allows assignment from a volatile struct to a non-volatile temporary. (A minimal C11 sketch is shown at the end of this answer.)
Or hardware-supported 64-byte atomicity:
Intel transactional memory (TSX), available on some CPUs. This would even let you do an atomic RMW on it, or atomically read from one location and write to another. But more complex transactions are more likely to abort. Putting 4x 16-byte or 2x 32-byte loads into a transaction should hopefully not abort very often even under contention. Same for grouping stores into a separate transaction. (Hopefully the compiler is smart enough to end the transaction with the loaded data still in registers, so it doesn't have to be atomically stored to a local on the stack, too.)
There are GNU C/C++ extensions for transactional memory. https://gcc.gnu.org/wiki/TransactionalMemory
AVX512 (allowing a full-cache-line load or store) on a CPU which happens to implement it in a way that makes aligned 64-byte loads/stores atomic. There's no on-paper guarantee that anything wider than an 8-byte load/store is ever atomic on x86, except for lock cmpxchg16b and movdir64b.
In practice we're fairly sure that modern Intel CPUs like Skylake transfer whole cache-lines atomically between cores, unlike AMD. And we know that on Intel (not AMD) a vector load or store that doesn't cross a cache-line boundary does make a single access to L1d cache, transferring all the bits in the same clock cycle. So an aligned vmovaps zmm, [mem] on Skylake-avx512 should in practice be atomic, unless you have an exotic chipset that glues many sockets together in a way that creates tearing. (Multi-socket K10 vs. single-socket K10 is a good cautionary tale: Why is integer assignment on a naturally aligned variable atomic on x86?)
MOVDIR64B - only atomic for the store part, and only supported on Intel Tremont (next-gen Goldmont successor). This still doesn't give you a way to do a 64-byte atomic load. Also it's a cache-bypassing store so not good for inter-core communication latency. I think the use-case is generating a full-size PCIe transaction.
See also SSE instructions: which CPUs can do atomic 16B memory operations? re: lack of atomicity guarantees for SIMD load/store. CPU vendors have for some reason not chosen to provide any written guarantees or ways to detect when SIMD loads/stores will be atomic, even though testing has shown that they are on many systems (when you don't cross a cache-line boundary.)
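As promised above for the SeqLock option, here is a minimal single-writer sketch in C11. It is illustrative only: struct levels is the type from the question, the levels_seq wrapper and the seq_write / seq_read names are mine, and the memory-ordering choices follow the linked 32-bit-counter answer.
#include <stdatomic.h>
#include <stdint.h>
struct levels_seq {
    _Atomic uint32_t seq;          /* even = stable, odd = write in progress */
    volatile struct levels data;   /* C allows copying a volatile struct to a plain one */
};
static void seq_write(struct levels_seq *s, const struct levels *src)
{
    uint32_t s0 = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, s0 + 1, memory_order_relaxed);   /* mark "busy" */
    atomic_thread_fence(memory_order_release);    /* keep the payload stores after seq+1 */
    s->data = *src;
    atomic_store_explicit(&s->seq, s0 + 2, memory_order_release);   /* mark "stable" */
}
static struct levels seq_read(struct levels_seq *s)
{
    struct levels tmp;
    uint32_t s0, s1;
    do {
        s0 = atomic_load_explicit(&s->seq, memory_order_acquire);
        tmp = s->data;             /* may tear; detected by the re-check below */
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1); /* retry if a write was in progress or happened meanwhile */
    return tmp;
}
The reader never blocks the writer; it just retries in the (rare) case where the copy raced with an update.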

Related

What is responsible for a RAM bottleneck when performing random memory accesses?

I have a program which accesses single bytes in a large array at random. Since this array exceeds the L2 cache, it requires many queries to RAM. I created a benchmark which emulates this program by generating random numbers and querying a large array. It seems like the benchmark is under-performing relative to my RAM's advertised speed.
I have DDR4-2933 RAM which is supposed to handle 2933 MT/s. The maximum transfer rate I have been able to achieve is 56.8 MT/s. What would be the bottleneck preventing faster execution? If I had to speculate, I might say the CPU could only reorder instructions within some fixed window and this would limit the parallelization of the fetches. Although, I have no evidence beyond the benchmarks.
Benchmark & Methodology
I created a short program which populates a large array with random numbers. Then, it loads values at random offsets from the array and XORs them together. Memory accesses should be reorderable.
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
/* XOROSHIRO PRNG */
static inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k));
}
static uint64_t s[4];
uint64_t next(void) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0];
s[3] ^= s[1];
s[1] ^= s[2];
s[0] ^= s[3];
s[2] ^= t;
s[3] = rotl(s[3], 45);
return result;
}
/* BENCHMARK */
int main(int argc, char ** argv) {
char const * prog = argc > 0 ? argv[0] : "[program]";
if(argc != 3) {
fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs]", prog);
exit(1);
}
unsigned int sshift = strtol(argv[1], NULL, 10);
uint64_t calls = strtol(argv[2], NULL, 10);
size_t size = 1ull << sshift;
uint64_t smask = (1ull << sshift) - 1;
uint8_t * buf = malloc(size);
if(!buf) {
fprintf(stderr, "no mem");
exit(1);
}
// Seed PRNG with magic values
s[0] = 0x0BE38E2AC;
s[1] = 0x23D933C53;
s[2] = 0xE72482E32;
s[3] = 0x35C339D23;
for(size_t i = 0; i < size; i++) {
buf[i] = next();
}
clock_t start = clock();
double cpu_time_used;
uint8_t val = 0;
for(size_t i = 0; i < calls; i++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
cpu_time_used = ((double) (clock() - start)) / CLOCKS_PER_SEC;
printf("time: %.3f\n", cpu_time_used);
printf("calls: %ld\n", calls);
printf("time/call: %.3e\n", cpu_time_used / calls);
printf("MT / sec: %.3e\n", calls / cpu_time_used / 1e6);
printf("val: %d\n", val); // ensure val is not optimized out
}
All benchmarks were executed on an Intel(R) Core(TM) i9-10885H CPU running at 2.40 GHz with CPU scaling disabled. The program was compiled with gcc -O3 main.c -o main -g -O3 -flto and called with nice -n -2 ./main 31 1000000000 for an array of size 2^31 and 1e9 accesses. The number of transactions per second issued to RAM can be approximated by the number of bytes fetched, since virtually all of the lookups result in cache misses. I tested this using perf, and it seemed like around 95% of cache lookups resulted in misses.
The random number generator accounts for about 7.1% of the runtime. This was measured by removing the line val ^= buf[idx]; which executes the fetch. perf reported that around 99% of the memory overhead was due to last-level cache misses.
Follow-up Benchmarks
Loop unrolling:
By default, GCC 12.2.0 did not unroll the loop. It took some cajoling. I replaced:
for(size_t i = 0; i < calls; i++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
with:
for(size_t i = 0; i < batches; i++) {
#pragma GCC unroll 3
for(size_t j = 0; j < 8; j++) {
uint64_t idx = next() & smask;
val ^= buf[idx];
}
}
I took five timings at 2.0 GHz. The version without loop unrolling seemed to perform better, but not significantly. I'm not very surprised, since the branch would seem highly predictable.
no unrolling (s)   unrolling (s)
23.249             24.287
24.044             24.520
23.455             24.358
23.233             24.167
22.969             23.833
Multi-processing
I implemented a threaded version of the benchmark. It evenly divides the accesses among the threads. I also switch to measuring wall clock time. Here are the measurements (taken at 2.4 GHz):
threads   time (s)   MT/s
1         17.345     57.653502
2         8.949      111.744329
4         5.022      199.123855
8         3.533      283.045570
16        3.203      312.207306
32        3.200      312.500000
64        3.173      315.159155
124       3.165      315.955766
Modified program:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <sys/mman.h>
#include <threads.h>
#include <pthread.h>
/* XOROSHIRO PRNG */
static inline uint64_t rotl(const uint64_t x, int k) {
return (x << k) | (x >> (64 - k));
}
static thread_local uint64_t s[4];
static uint64_t next(void) {
const uint64_t result = rotl(s[1] * 5, 7) * 9;
const uint64_t t = s[1] << 17;
s[2] ^= s[0];
s[3] ^= s[1];
s[1] ^= s[2];
s[0] ^= s[3];
s[2] ^= t;
s[3] = rotl(s[3], 45);
return result;
}
struct xor_args {
size_t iters;
uint8_t * buf;
uint8_t res;
uint64_t smask;
pthread_t id;
};
#include <sys/random.h>
void * xor_worker(struct xor_args * a) {
if(-1 == getrandom(&s, sizeof(s), GRND_RANDOM)) {
fprintf(stderr, "getrandom() failed\n");
exit(1);
}
uint8_t val = 0;
for(size_t i = 0; i < a->iters; i++) {
uint64_t idx = next() & a->smask;
val ^= a->buf[idx];
}
a->res = val;
return NULL;
}
/* BENCHMARK */
int main(int argc, char ** argv) {
char const * prog = argc > 0 ? argv[0] : "[program]";
if(argc != 4) {
fprintf(stderr, "Usage: %s [size (power of 2)] [n_runs] [threads]", prog);
exit(1);
}
unsigned int sshift = strtol(argv[1], NULL, 10);
uint64_t calls = strtol(argv[2], NULL, 10);
int threads = strtol(argv[3], NULL, 10);
size_t size = 1ull << sshift;
uint64_t smask = (1ull << sshift) - 1;
uint8_t * buf = malloc(size);
if(!buf) {
fprintf(stderr, "no mem");
exit(1);
}
// Seed PRNG with magic values
s[0] = 0x0BE38E2AC;
s[1] = 0x23D933C53;
s[2] = 0xE72482E32;
s[3] = 0x35C339D23;
for(size_t i = 0; i < size; i++) {
buf[i] = next();
}
struct timespec start, finish;
double elapsed;
clock_gettime(CLOCK_MONOTONIC, &start);
struct xor_args * args = malloc(sizeof(struct xor_args) * threads);
int s;
for(int i = 0; i < threads; i++) {
struct xor_args * arg = &args[i];
arg->iters = calls / (uint64_t) threads;
if(i == 0) {
args->iters += calls % threads;
}
arg->smask = smask;
arg->buf = buf;
s = pthread_create(&arg->id, NULL, (void * (*)(void *)) xor_worker, arg);
if(s != 0) {
fprintf(stderr, "pthread_create() failed");
exit(1);
}
}
uint8_t val = 0;
for(int i = 0; i < threads; i++) {
struct xor_args * arg = &args[i];
s = pthread_join(arg->id, NULL);
if(s != 0) {
fprintf(stderr, "pthread_join() failed");
exit(1);
}
val ^= arg->res;
}
clock_gettime(CLOCK_MONOTONIC, &finish);
elapsed = (finish.tv_sec - start.tv_sec);
elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
printf("time: %.3f\n", elapsed);
printf("calls: %ld\n", calls);
printf("time/call: %.3e\n", elapsed / calls);
printf("M T / sec: %.3e\n", calls / elapsed / 1e6);
printf("val: %d\n", val); // ensure val is not optimized out
}
The program is bound by the memory hierarchy. On modern processors, accessing 1 byte of RAM causes a whole cache line to be fetched, which means 64 bytes on your processor. One channel of DDR4-2933 RAM can reach a theoretical bandwidth of 8*2933e6/1024**3 = 21.85 GiB/s. However, given that the cache is useless here and that 64 bytes are fetched for every useful byte, the maximum useful throughput is 21.85/64*1024 = 350 MiB/s. Based on this, the program should take at least 1000000000/(350*1024**2) = 2.72 s.
In practice, no modern processor can fully saturate DDR4 memory (especially using only 1 core). Modern RAM is designed to be fast mainly for contiguous accesses, not random ones, despite the name. This is because speeding up random accesses is very hard, while reading contiguous data is the common case (and far simpler to optimize at the hardware level).
The main problem with DDR RAM is its latency, which has been huge for decades (50-120 ns per cache-line fetch). It has not improved much over the last decade, while processors have become significantly faster. As a result, it is critical to mitigate this latency by keeping many requests in flight at once, that is, by increasing the memory-level concurrency. However, the execution flow is sequential and the number of simultaneous in-flight requests one core can sustain is limited. Modern processors can execute a few instructions in parallel from a sequential program as long as those instructions are independent. Here, the program is fairly serial because of the random number generator, although multiple loads can still be executed in parallel.
Multiple read requests can be in flight concurrently, but the buffers that receive data from the RAM and the caches are limited in size. On Intel processors, the dedicated units are the line-fill buffers (LFB) between the L1 and L2 caches, the super-queue between the L2 and the L3, and the integrated memory controller (iMC) between the L3 and the RAM. You can use perf to check which one is the bottleneck. AFAIK, the LFB has about 12-14 slots on your target processor, so it should not be the bottleneck here. The super-queue can read 32 bytes/cycle, that is, 1 cache line every 2 cycles, or 1 byte of buf every 2 cycles. That being said, the L3 latency is fairly large: at least a dozen nanoseconds. In general, prefetching units are used to hide the latency, but random accesses are unpredictable, so hardware prefetchers are completely useless in this case.
Virtual memory makes random accesses slower. When a cache line is fetched, the processor needs to translate the virtual address to a physical one. This is done by the Translation Lookaside Buffer (TLB) of a processor core, which is basically a cache of virtual-page-to-physical-page translations. There are multiple TLBs for different cache levels, and their reach depends on the page size. Pages are typically 4K on your processor unless your OS decides to use huge pages in this specific case. Assuming it does not, the biggest TLB (the STLB) has 1536 entries for 4K pages, so it cannot be efficient for random accesses over a buffer bigger than 1536*4K = 6 MiB; beyond this limit, TLB misses become frequent. When a TLB miss happens, the processor has to walk the page-table structures in memory to find the translation, which adds significant extra latency to the access. If huge pages are used, the threshold is about 3 GiB (for 2 MiB pages) or even 16 GiB (for 1 GiB pages). Using huge pages can thus significantly speed things up compared to basic 4K pages.
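If you want to experiment with the huge-page effect, one possibility (a sketch, Linux-specific, assuming transparent huge pages are enabled; alloc_thp is a made-up helper name, and the size must be a multiple of the 2 MiB alignment, which holds for the 2^31-byte buffer used here) is to align the buffer and ask the kernel for huge pages:
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <stdint.h>
#include <sys/mman.h>
/* Back the benchmark buffer with 2 MiB transparent huge pages if possible.
 * madvise() is only a hint; the kernel silently falls back to 4K pages. */
static uint8_t *alloc_thp(size_t size)
{
    uint8_t *buf = aligned_alloc(2 * 1024 * 1024, size);
    if (buf)
        madvise(buf, size, MADV_HUGEPAGE);
    return buf;
}
Replacing malloc(size) with alloc_thp(size) in the benchmark should then show whether STLB misses are a significant part of the per-access latency.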
Assuming the processor has large enough buffers and TLB caches not to limit the concurrency, the DDR4 SDRAM itself is typically the main bottleneck. It not only has a huge latency, it is also optimized for contiguous accesses: such RAM devices are split into banks so as to reduce the device latency. Contiguous memory requests can be handled efficiently by multiple banks, but random requests are often significantly less efficient because of bank conflicts: when 2 requests map to the same bank, they are processed serially rather than concurrently. While random requests tend to be spread across banks, this load balancing is not as good as for contiguous requests, resulting in somewhat lower throughput.
In the end, the speed of the program is certainly latency-bound and the time taken is calls * latency / concurrency. It looks like latency / concurrency is 10-15 ns in your case.
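As a rough cross-check (assuming a typical 80-120 ns DRAM load latency for this class of processor): the single-threaded run took 17.345 s for 1e9 mostly-missing loads, i.e. about 17.3 ns per miss, which corresponds to roughly 80/17.3 ≈ 5 to 120/17.3 ≈ 7 misses kept in flight per core on average.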
One way to speed this program up a bit is to use prefetch instructions. Prefetching often helps hide the latency by telling the processor to fetch data into the caches (or temporary buffers) early, so that subsequent accesses are faster because the data is already closer to the computing units. This optimization is brittle: data must be prefetched early enough (otherwise the processor stalls on cache misses) and must stay in the target cache (not be evicted by newer prefetches). In your case this is not so easy since the index is random. Prefetching relatively large batches of indices and then reading the values should help a bit: it should be slower for small arrays and faster for big ones. Note that software prefetching is not a silver bullet: it is generally not as good as hardware prefetching (due to the limited hardware concurrency).
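Here is a sketch of what batched software prefetching could look like for this benchmark. It is illustrative only: next(), buf and smask come from the question, xor_prefetch and BATCH are made-up names, calls is assumed to be a multiple of BATCH, and both the batch size and the prefetch hint would need tuning.
#include <stdint.h>
#include <xmmintrin.h>                /* _mm_prefetch */
uint64_t next(void);                  /* the xoshiro PRNG from the question */
static uint8_t xor_prefetch(const uint8_t *buf, uint64_t smask, uint64_t calls)
{
    enum { BATCH = 16 };
    uint8_t val = 0;
    for (uint64_t i = 0; i < calls; i += BATCH) {
        uint64_t idx[BATCH];
        /* first pass: generate the indices and start all the cache misses */
        for (int j = 0; j < BATCH; j++) {
            idx[j] = next() & smask;
            _mm_prefetch((const char *)&buf[idx[j]], _MM_HINT_T0);
        }
        /* second pass: by now some of the requested lines should have arrived */
        for (int j = 0; j < BATCH; j++)
            val ^= buf[idx[j]];
    }
    return val;
}
The out-of-order window already overlaps a handful of misses on its own, so the gain (if any) comes from keeping more misses in flight than the scheduler would otherwise manage.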
An alternative solution is to use multiple cores (i.e. multiple threads). That being said, this is not easy here either because of the (inherently sequential) random number generator, unless it is OK to use several RNGs. In that case, fetching full cache lines is generally the main bottleneck due to the required concurrency, the limited RAM bandwidth, and mainly the poor efficiency (1 byte used out of 64 fetched). Unfortunately, there is currently no way to efficiently fetch single bytes from DDR4 SDRAM on any mainstream x86-64 processor (including yours).
For more information, please read:
What Every Programmer Should Know About Memory
How much of ‘What Every Programmer Should Know About Memory’ is still valid?
How does the CPU cache affect the performance of a C program

Having a race condition in my MPSC ring buffer

I was trying to build an MPSC lock-free ring buffer for learning purposes, and I am running into race conditions.
A description of the MPSC ring buffer:
It is guaranteed that poll() is never called when the buffer is empty.
Instead of mod'ing head and tail like a traditional ring buffer, it lets them proceed linearly, and AND's them before using them (since the buffer size is a power of 2, this works ok with overflow).
We keep MAX_PRODUCERS-1 slots open in the queue so that if multiple producers come and see one slot is available and proceed, they can all place their entries.
It uses 32-bit quantities for head and tail, so that it can snapshot them with a 64-bit atomic read without a lock.
My test involves a couple of threads writing some known set of values to the queue, and a consumer thread polling (when the buffer is not empty) and summing all, and verifying the correct result is obtained. With 2 or more producers, I get inconsistent sums (and with 1 producer, it works).
Any help would be much appreciated. Thank you!
Here is the code:
struct ring_buf_entry {
uint32_t seqn;
};
struct __attribute__((packed, aligned(8))) ring_buf {
union {
struct {
volatile uint32_t tail;
volatile uint32_t head;
};
volatile uint64_t snapshot;
};
volatile struct ring_buf_entry buf[RING_BUF_SIZE];
};
#define RING_SUB(x,y) ((x)>=(y)?((x)-(y)):((x)+(1ULL<<32)-(y)))
static void ring_buf_push(struct ring_buf* rb, uint32_t seqn)
{
size_t pos;
while (1) {
// rely on aligned, packed, and no member-reordering properties
uint64_t snapshot = __atomic_load_n(&(rb->snapshot), __ATOMIC_SEQ_CST);
// little endian.
uint64_t snap_head = snapshot >> 32;
uint64_t snap_tail = snapshot & 0xffffffffULL;
if (RING_SUB(snap_tail, snap_head) < RING_BUF_SIZE - MAX_PRODUCERS + 1) {
uint32_t exp = snap_tail;
if (__atomic_compare_exchange_n(&(rb->tail), &exp, snap_tail+1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
pos = snap_tail;
break;
}
}
asm volatile("pause\n": : :"memory");
}
pos &= RING_BUF_SIZE-1;
rb->buf[pos].seqn = seqn;
asm volatile("sfence\n": : :"memory");
}
static struct ring_buf_entry ring_buf_poll(struct ring_buf* rb)
{
struct ring_buf_entry ret = rb->buf[__atomic_load_n(&(rb->head), __ATOMIC_SEQ_CST) & (RING_BUF_SIZE-1)];
__atomic_add_fetch(&(rb->head), 1, __ATOMIC_SEQ_CST);
return ret;
}

understand membarrier function in linux

Example of using the membarrier() function, from the Linux manual: https://man7.org/linux/man-pages/man2/membarrier.2.html
#include <stdlib.h>
static volatile int a, b;
static void
fast_path(int *read_b)
{
a = 1;
asm volatile ("mfence" : : : "memory");
*read_b = b;
}
static void
slow_path(int *read_a)
{
b = 1;
asm volatile ("mfence" : : : "memory");
*read_a = a;
}
int
main(int argc, char **argv)
{
int read_a, read_b;
/*
* Real applications would call fast_path() and slow_path()
* from different threads. Call those from main() to keep
* this example short.
*/
slow_path(&read_a);
fast_path(&read_b);
/*
* read_b == 0 implies read_a == 1 and
* read_a == 0 implies read_b == 1.
*/
if (read_b == 0 && read_a == 0)
abort();
exit(EXIT_SUCCESS);
}
The code above transformed to use membarrier() becomes:
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>
static volatile int a, b;
static int
membarrier(int cmd, unsigned int flags, int cpu_id)
{
return syscall(__NR_membarrier, cmd, flags, cpu_id);
}
static int
init_membarrier(void)
{
int ret;
/* Check that membarrier() is supported. */
ret = membarrier(MEMBARRIER_CMD_QUERY, 0, 0);
if (ret < 0) {
perror("membarrier");
return -1;
}
if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
fprintf(stderr,
"membarrier does not support MEMBARRIER_CMD_GLOBAL\n");
return -1;
}
return 0;
}
static void
fast_path(int *read_b)
{
a = 1;
asm volatile ("" : : : "memory");
*read_b = b;
}
static void
slow_path(int *read_a)
{
b = 1;
membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);
*read_a = a;
}
int
main(int argc, char **argv)
{
int read_a, read_b;
if (init_membarrier())
exit(EXIT_FAILURE);
/*
* Real applications would call fast_path() and slow_path()
* from different threads. Call those from main() to keep
* this example short.
*/
slow_path(&read_a);
fast_path(&read_b);
/*
* read_b == 0 implies read_a == 1 and
* read_a == 0 implies read_b == 1.
*/
if (read_b == 0 && read_a == 0)
abort();
exit(EXIT_SUCCESS);
}
This membarrier() description is taken from the Linux manual. I am still confused about how the membarrier() call adds overhead to the slow side and removes overhead from the fast side, resulting in an overall performance gain as long as the slow side is infrequent enough that the cost of the membarrier() calls does not outweigh the gain on the fast side.
Could you please explain it in more detail?
Thanks!
This pair of writes-then-read-the-other-var is https://preshing.com/20120515/memory-reordering-caught-in-the-act/, a demo of StoreLoad reordering (the only kind x86 allows, given its program-order + store buffer with store forwarding memory model).
With only one local MFENCE you could still get reordering:
FAST                                   SLOW (using just mfence, not membarrier)
a = 1         exec
read_b = b;   // 0
                                       b = 1;
                                       mfence   (force b=1 to be visible before reading a)
                                       read_a = a;     // 0
a = 1         visible (global vis. delayed by store buffer)
But consider what would happen if an mfence-on-every-core had to be part of every possible order, between the slow-path's store and its reload.
This ordering would no longer be possible. If read_b = b has already read a 0, then a = 1 is already pending (see footnote 1), if it isn't visible already. It's impossible for it to stay private until after read_a = a, because membarrier() makes sure a full barrier runs on every core, and SLOW waits for that to happen (for membarrier() to return) before reading a.
And there's no way to get 0,0 from having SLOW execute first; it runs membarrier itself so its store is definitely visible to other threads before it reads a.
footnote 1: Waiting to execute, or waiting in the store buffer to commit to L1d cache. The asm("":::"memory") ensures that, but is actually redundant because volatile itself guarantees that the accesses happen in asm in program order. And we basically need volatile for other reasons when hand-rolling atomics instead of using C11 _Atomic. (But generally don't do that unless you're actually writing kernel code. Use atomic_store_explicit(&a, 1, memory_order_release);).
Note it's actually the store buffer that creates StoreLoad reordering (the only kind x86 allows), not so much OoO exec. In fact, a store buffer also lets x86 execute stores out-of-order and then make them globally visible in program order (if it turns out they weren't the result of mis-speculation or something!).
Also note that in-order CPUs can do their memory accesses out of order. They start instructions (including loads) in order, but can let them complete out of order, e.g. by scoreboarding loads to allow hit-under-miss. See also How is load->store reordering possible with in-order commit?
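As footnote 1 suggests, outside of kernel code the hand-rolled volatile + inline-asm version is better written with C11 atomics. Here is a sketch of the mfence-based variant only (illustrative; on x86 a seq_cst store compiles to a store plus a full barrier, e.g. xchg or mov + mfence, which is what the manual's first listing hand-codes):
#include <stdatomic.h>
static _Atomic int a, b;
static void fast_path(int *read_b)
{
    atomic_store_explicit(&a, 1, memory_order_seq_cst);   /* store + full barrier on x86 */
    *read_b = atomic_load_explicit(&b, memory_order_seq_cst);
}
static void slow_path(int *read_a)
{
    atomic_store_explicit(&b, 1, memory_order_seq_cst);
    *read_a = atomic_load_explicit(&a, memory_order_seq_cst);
}
The membarrier()-based variant is subtler: the fast path must still prevent compile-time reordering without paying for a full runtime barrier, which is why the manual's example keeps volatile plus a compiler barrier there.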

Audio samples producer multiple threads OSX

This question is a follow-up to a former question (Audio producer threads with OSX AudioComponent consumer thread and callback in C), including a test example which works and behaves as expected but does not quite answer the question. I have substantially rephrased the question and re-coded the example so that it contains only plain C code. (I found that the few Objective-C portions of code in the former example only caused confusion and distracted the reader from what's essential in the question.)
In order to take advantage of multiple processor cores, as well as to keep the CoreAudio pull-model render thread as lightweight as possible, the LPCM samples' producer routine clearly has to "sit" on a different thread, outside the real-time-priority render thread/callback. It must feed the samples into a circular buffer (TPCircularBuffer in this example), from which the system schedules data pull-out in quanta of inNumberFrames.
The Grand Central Dispatch API offers a simple solution, which I arrived at after some individual research (including trial-and-error coding). This solution is elegant, since it doesn't block anything nor conflict between the push and pull models. Yet GCD, which is supposed to take care of "sub-threading", does not come close to meeting the specific parallelization requirements of the producer's worker threads, so I had to explicitly spawn a number of POSIX threads, depending on the number of logical cores available. Although the results are already remarkable in terms of speeding up the computation, I still feel a bit uncomfortable mixing POSIX and GCD. In particular, this goes for the variable wait_interval and how to compute it properly, rather than by predicting how many PCM samples the render thread may require for the next cycle.
Here's the shortened and simplified (pseudo)code for my test program, in plain-C.
Controller declaration:
#include "TPCircularBuffer.h"
#include <AudioToolbox/AudioToolbox.h>
#include <AudioUnit/AudioUnit.h>
#include <dispatch/dispatch.h>
#include <sys/sysctl.h>
#include <pthread.h>
typedef struct {
TPCircularBuffer buffer;
AudioComponentInstance toneUnit;
Float64 sampleRate;
AudioStreamBasicDescription streamFormat;
Float32* f; //array of updated frequencies
Float32* a; //array of updated amps
Float32* prevf; //array of prev. frequencies
Float32* preva; //array of prev. amps
Float32* val;
int* arg;
int* previous_arg;
UInt32 frames;
int state;
Boolean midif; //wip
} MyAudioController;
MyAudioController gen;
dispatch_semaphore_t mSemaphore;
Boolean multithreading, NF;
typedef struct data{
int tid;
int cpuCount;
}data;
Controller management:
void setup (void){
// Initialize circular buffer
TPCircularBufferInit(&(self->buffer), kBufferLength);
// Create the semaphore
mSemaphore = dispatch_semaphore_create(0);
// Setup audio
createToneUnit(&gen);
}
void dealloc (void) {
// Release buffer resources
TPCircularBufferCleanup(&buffer);
// Clean up semaphore
dispatch_release(mSemaphore);
// dispose of audio
if(gen.toneUnit){
AudioOutputUnitStop(gen.toneUnit);
AudioUnitUninitialize(gen.toneUnit);
AudioComponentInstanceDispose(gen.toneUnit);
}
}
Dispatcher call (launching producer queue from the main thread):
void dproducer (Boolean on, Boolean multithreading, Boolean NF)
{
if (on == true)
{
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
if((multithreading)||(NF))
producerSum(on);
else
producer(on);
});
}
return;
}
Threadable producer routine:
void producerSum(Boolean on)
{
int rc;
int num = getCPUnum();
pthread_t threads[num];
data thread_args[num];
void* resulT;
static Float32 frames [FR_MAX];
Float32 wait_interval;
int bytesToCopy;
Float32 floatmax;
while(on){
wait_interval = FACT*(gen.frames)/(gen.sampleRate);
Float32 damp = 1./(Float32)(gen.frames);
bytesToCopy = gen.frames*sizeof(Float32);
memset(frames, 0, FR_MAX*sizeof(Float32));
availableBytes = 0;
fbuffW = (Float32**)calloc(num + 1, sizeof(Float32*));
for (int i=0; i<num; ++i)
{
fbuffW[i] = (Float32*)calloc(gen.frames, sizeof(Float32));
thread_args[i].tid = i;
thread_args[i].cpuCount = num;
rc = pthread_create(&threads[i], NULL, producerTN, (void *) &thread_args[i]);
}
for (int i=0; i<num; ++i) rc = pthread_join(threads[i], &resulT);
for(UInt32 samp = 0; samp < gen.frames; samp++)
for(int i = 0; i < num; i++)
frames[samp] += fbuffW[i][samp];
//code for managing producer state and GUI updates
{ ... }
float *head = TPCircularBufferHead(&(gen.buffer), &availableBytes);
memcpy(head,(const void*)frames,MIN(bytesToCopy, availableBytes));//copies frames to head
TPCircularBufferProduce(&(gen.buffer),MIN(bytesToCopy,availableBytes));
dispatch_semaphore_wait(mSemaphore, dispatch_time(DISPATCH_TIME_NOW, wait_interval * NSEC_PER_SEC));
if(gen.state == stopped){gen.state = idle; on = false;}
for(int i = 0; i <= num; i++)
free(fbuffW[i]);
free(fbuffW);
}
return;
}
A single producer thread may look somewhat like this:
void *producerT (void *TN)
{
Float32 samples[FR_MAX];
data threadData;
threadData = *((data *)TN);
int tid = threadData.tid;
int step = threadData.cpuCount;
int *ret = calloc(1,sizeof(int));
do_something(tid, step, &samples);
{ … }
return (void*)ret;
}
Here is the render callback (CoreAudio real-time consumer thread):
static OSStatus audioRenderCallback(void *inRefCon,
AudioUnitRenderActionFlags *ioActionFlags,
const AudioTimeStamp *inTimeStamp,
UInt32 inBusNumber,
UInt32 inNumberFrames,
AudioBufferList *ioData) {
MyAudioController *THIS = (MyAudioController *)inRefCon;
// An event happens in the render thread- signal whoever is waiting
if (THIS->state == active) dispatch_semaphore_signal(mSemaphore);
// Mono audio rendering: we only need one target buffer
const int channel = 0;
Float32* targetBuffer = (Float32 *)ioData->mBuffers[channel].mData;
memset(targetBuffer,0,inNumberFrames*sizeof(Float32));
// Pull samples from circular buffer
int32_t availableBytes;
Float32 *buffer = TPCircularBufferTail(&THIS->buffer, &availableBytes);
//copy circularBuffer content to target buffer
int bytesToCopy = ioData->mBuffers[channel].mDataByteSize;
memcpy(targetBuffer, buffer, MIN(bytesToCopy, availableBytes));
{ … };
TPCircularBufferConsume(&THIS->buffer, availableBytes);
THIS->frames = inNumberFrames;
return noErr;
}
Grand Central Dispatch already takes care of dispatching operations to multiple processor cores and threads. In typical real-time audio rendering or processing, one never needs to wait on a signal or semaphore, as the circular buffer consumption rate is very predictable and drifts extremely slowly over time. The AVAudioSession API (if available) and the Audio Unit API and callback allow you to set and determine the callback buffer size, and thus the maximum rate at which the circular buffer can change. Thus you can dispatch all render operations on a timer, render exactly the number of samples needed per timer period, and let the buffer size and state compensate for any jitter in thread dispatch time.
In extremely long running audio renders, you might want to measure the drift between timer operations and real-time audio consumption (sample rate), and tweak the number of samples rendered or the timer offset.
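A minimal sketch of that timer-driven approach, using a GCD timer source (illustrative only: start_render_timer and the queue label are made-up names, render_block is assumed to render one period's worth of frames into the ring buffer, and the period should be derived from the configured callback buffer duration):
#include <dispatch/dispatch.h>
#include <stdint.h>
static dispatch_source_t start_render_timer(double period_seconds,
                                            dispatch_block_t render_block)
{
    dispatch_queue_t q = dispatch_queue_create("producer.timer", DISPATCH_QUEUE_SERIAL);
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, q);
    uint64_t interval = (uint64_t)(period_seconds * NSEC_PER_SEC);
    dispatch_source_set_timer(timer,
                              dispatch_time(DISPATCH_TIME_NOW, interval),
                              interval,
                              interval / 10);              /* leeway for scheduling jitter */
    dispatch_source_set_event_handler(timer, render_block); /* produce into the circular buffer */
    dispatch_resume(timer);
    return timer;
}
The caller keeps the returned source alive and eventually calls dispatch_source_cancel() and dispatch_release() on it; the ring buffer's fill level absorbs the difference between the timer period and the actual callback cadence.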

Why is the multithreaded version of this program slower?

I am trying to learn pthreads and I have been experimenting with a program that tries to detect changes in an array. Function array_modifier() picks a random element and toggles its value (1 to 0 and vice versa) and then sleeps for some time (big enough so that race conditions do not appear; I know this is bad practice). change_detector() scans the array, and when an element doesn't match its prior value and is equal to 1, the change is detected and the diff array is updated with the detection delay.
When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#define TIME_INTERVAL 100
#define CHANGES 5000
#define UNUSED(x) ((void) x)
typedef struct {
unsigned int tid;
} parm;
static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;
static unsigned long int diff[NTHREADS] = {0};
void* array_modifier(void* args);
void* change_detector(void* arg);
int main(int argc, char** argv) {
if (argc < 2) {
exit(1);
}
N = (unsigned int)strtoul(argv[1], NULL, 0);
my_array = calloc(N, sizeof(int));
time_array = malloc(N * sizeof(struct timeval));
old_value = calloc(N, sizeof(int));
parm* p = malloc(NTHREADS * sizeof(parm));
pthread_t generator_thread;
pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));
for (unsigned int i = 0; i < NTHREADS; i++) {
p[i].tid = i;
pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
}
pthread_create(&generator_thread, NULL, array_modifier, NULL);
pthread_join(generator_thread, NULL);
usleep(500);
for (unsigned int i = 0; i < NTHREADS; i++) {
pthread_cancel(detector_thread[i]);
}
for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
fprintf(stderr, "\n");
_exit(0);
}
void* array_modifier(void* arg) {
UNUSED(arg);
srand(time(NULL));
unsigned int changing_signals = CHANGES;
while (changing_signals--) {
usleep(TIME_INTERVAL);
const unsigned int r = rand() % N;
gettimeofday(&time_array[r], NULL);
my_array[r] ^= 1;
}
pthread_exit(NULL);
}
void* change_detector(void* arg) {
const parm* p = (parm*) arg;
const unsigned int tid = p->tid;
const unsigned int start = tid * (N / NTHREADS) +
(tid < N % NTHREADS ? tid : N % NTHREADS);
const unsigned int end = start + (N / NTHREADS) +
(tid < N % NTHREADS);
unsigned int r = start;
while (1) {
unsigned int tmp;
while ((tmp = my_array[r]) == old_value[r]) {
r = (r < end - 1) ? r + 1 : start;
}
old_value[r] = tmp;
if (tmp) {
struct timeval tv;
gettimeofday(&tv, NULL);
// detection time in usec
diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
}
}
}
when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100
I get:
665
but when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100
I get:
152 190 164 242
(this sums up to 748).
So, the delay for the multithreaded program is larger.
My CPU has 6 cores.
Short Answer
You are sharing memory between threads, and sharing memory between threads is slow.
Long Answer
Your program has one thread writing to my_array and a number of other threads reading from it. Effectively, my_array is shared by several threads.
Now let's assume you are benchmarking on a multicore machine; you are probably hoping that the OS will assign a different core to each thread.
Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance, CPUs have multi-level caches. The fastest cache is the small L1 cache; a core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20-30 cycles.
In many CPU architectures each core has its own L1 cache, but the outer cache level (L2 or L3, depending on the CPU) is shared. This means any data that is shared between threads (cores) has to go through that shared cache, which is much slower than the L1 cache, so shared memory accesses tend to be quite slow.
The bottom line is that if you want your multithreaded programs to perform well, you need to ensure that threads do not share memory. Sharing memory is slow.
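One concrete instance in the posted program is the diff[NTHREADS] array: each detector thread only updates its own element, but neighbouring elements sit on the same cache line, so the line still bounces between cores (false sharing). Whether or not that is the dominant cost in this particular program, the usual fix is to give each per-thread counter its own cache line, for example:
#define CACHE_LINE 64
struct padded_counter {
    unsigned long int value;
    char pad[CACHE_LINE - sizeof(unsigned long int)];  /* fill the rest of the line */
} __attribute__((aligned(CACHE_LINE)));
static struct padded_counter diff[NTHREADS];           /* threads update diff[tid].value */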
Aside
Never rely on volatile to do the correct thing when sharing memory between threads; either use your library's atomic operations or use mutexes. This is because some CPUs allow out-of-order reads and writes that may do strange things if you do not know what you are doing.
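For example (a sketch assuming C11 <stdatomic.h>; the toggle/observe helper names and the particular memory orderings are illustrative), the shared flag array could be declared _Atomic and accessed explicitly instead of through volatile:
#include <stdatomic.h>
static _Atomic unsigned int *my_array;      /* was: static volatile unsigned int *my_array */
static void toggle(unsigned int r)          /* modifier thread */
{
    atomic_fetch_xor_explicit(&my_array[r], 1u, memory_order_release);
}
static unsigned int observe(unsigned int r) /* detector threads */
{
    return atomic_load_explicit(&my_array[r], memory_order_acquire);
}
The calloc() allocation still works here, since _Atomic unsigned int has the same size and representation as unsigned int on mainstream platforms.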
It is rare for a multithreaded program to scale perfectly with the number of threads. In your case you measured a speed-up factor of about 0.9 (665/748) with 4 threads. That is not so good.
Here are some factors to consider:
The overhead of starting threads and dividing the work. For small jobs, the cost of starting additional threads can be considerably larger than the actual work. This is not applicable to your case, since the thread-start overhead isn't included in the time measurements.
"Random" variations. Your threads varied between 152 and 242. You should run the test multiple times and use either the mean or the median value.
The size of the test. Generally you get more reliable measurements on larger tests (more data). However, you need to consider how having more data affects caching in the L1/L2/L3 caches, and if the data is too large to fit into RAM you need to factor in disk I/O. Usually multithreaded implementations are then slower, because they want to work on more data at a time, but in rare instances they can be faster, a phenomenon called super-linear speedup.
Overhead caused by inter-thread communication. Maybe not a factor in your case, since you don't have much of that.
Overhead caused by resource locking. Usually has a low impact on cpu utilization but may have a large impact on the total real time used.
Hardware optimizations. Some CPUs change the clock frequency depending on how many cores you use.
The cost of the measurement itself. In your case a change will be detected within 25 (100/4) iterations of the scan loop. Each iteration takes only a few clock cycles. Then you call gettimeofday, which probably costs thousands of clock cycles. So what you are actually measuring is more or less the cost of calling gettimeofday.
I would increase the number of values to check and the cost to check each value. I would also consider turning off compiler optimizations, since these can cause the program to do unexpected things (or skip some things entirely).
