What does "cacheline aligned" mean?

What does "cacheline aligned" mean? - c

I read this article about PostgreSQL performance: http://akorotkov.github.io/blog/2016/05/09/scalability-towards-millions-tps/
One optimization was "cacheline aligment".
What is this? How does it help and how to apply this in code?

CPU caches transfer data from and to main memory in chunks called cache lines; a typical size for this seems to be 64 bytes.
Data that are located closer to each other than this may end up on the same cache line.
If these data are needed by different cores, the system has to work hard to keep the data consistent between the copies residing in the cores' caches. Essentially, while one thread modifies the data, the other thread is blocked by a lock from accessing the data.
The article you reference talks about one such problem that was found in PostgreSQL in a data structure in shared memory that is frequently updated by different processes. By introducing padding into the structure to inflate it to 64 bytes, it is guaranteed that no two such data structures end up in the same cache line, and the processes that access them are not blocked more that absolutely necessary.
This is only relevant if your program parallelizes execution and accesses a shared memory region, either by multithreading or by multiprocessing with shared memory. In this case you can benefit by making sure that data that are frequently accessed by different execution threads are not located close enough in memory that they can end up in the same cache line.
The typical way to do that is by adding “dead” padding space at the end of a data structure.
I found some interesting articles on the topic that you may want to read:
http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273?pgno=3
http://www.drdobbs.com/tools/memory-constraints-on-thread-performance/231300494
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206

Related

why mmap is faster than traditional file io [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
mmap() vs. reading blocks
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
mmap() is not reading sequentially.
mmap() has to fetch from the disk itself same as read() does
The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?

I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
is wrong... mmap() assigns a region of virtual address space corresponding to file content... whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmaping the file and starting to do access means the disk I/O is not done sequentially and may therefore result in longer elapsed time until the entire file is read into memory, but while that's happening lookups are succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're not read (allow for the granularity of disk and memory pages, and that even when using memory mapping many OSes allow you to specify some performance-enhancing / memory-efficiency tips about your planned access patterns so they can proactively read ahead or release memory more aggressively knowing you're unlikely to return to it).
absolutely true
"The mapped area is not sequential" is vague. Memory mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or, are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly length)
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
The are also reasons why mmap may be slower - do read Linus Torvald's post here which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4kb) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.

mmap() can share between process.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with kernel block cache if possible. So there is lessor copying.
Memory for mmap is allocated by kernel, it is always aligned.

"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
what makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user space buffer involved, the "read" takes place there where the OS kernel sees fit and in chunks that can be optimized. This may be an advantage in speed, but first of all this is just an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.

Is it safe to read and write to an array at different positions from multiple threads in C with phtreads?

Let's suppose that there are two threads, A and B. There is also a shared array: float X[100].
Thread A writes to the array one element at a time in order, every 10 steps it updates a shared variable index (in a safe way) that indicates the current index, and it also sends a signal to thread B.
As soon as thread B receives the signal, it reads index in a safe way, and then proceed to read the elements of X until position index.
Is it safe to do this? Thread A really updates the array or just a copy in cache?

Every sane way of one thread sending a signal to another provides the assurance that anything written by a thread before sending a signal is guaranteed to be visible to a thread after it receives that signal. So as long as you sent the signal through some means that provided this guarantee, which they pretty much all do, you are safe.
Note that attempting to use a condition variable without a predicate protected by a mutex is not a sane way of one thread sending a signal to another! Among other things, it doesn't guarantee that the thread that you think received the signal actually received the signal. You do need to make sure the thread that does the reads in fact received the very signal sent by the thread that does the writes.

Is it safe to do this?
Provided your data modification is rendered safe and protected by critical sections, locks or whatever, this kind of access is perfectly safe for what concerns hardware access.
Thread A really updates the array or just a copy in cache?
Just a copy in cache. Most caches are presently write-back and just write data back to memory when a line is ejected from the cache if it has been modified. This largely improves memory bandwidth, especially in a multicore context.
BUT all happens as if the memory had been updated.
For shared memory processors, there are generally cache coherency protocols (except in some processors for real time applications). The basic idea of these protocols is that a state is associated with every cache line.
State describes informations concerning the line in the cache of the different processors.
These states indicate, for instance, if the line is only present in the current cache, or is shared by several caches, in sync with memory, invalid... See for instance this description of the popular MESI cache coherence protocol.
So what happens, when a cache line is written and is also present in another processor?
Thanks to the state, the cache knows that one or more other processor also have a copy of the line and it will send an invalidate signal. The line will be invalidated in the other caches and when they want to read or write it, they have to reload its content. Actually, this reload will be served by the cache that has the valid copy to limit memory accesses.
This way, whilst data is only written in the cache, the behavior is similar to a situation where data would have been written to memory.
BUT, despite the fact that functionally the hardware will ensure correctness of the transfer, one must be take into account the cache existence, to avoid performances degradation.
Assume cache A is updating a line and cache B is reading it. Whenever cache A writes, the line in cache B is invalidated. And whenever cache B wants to read it, if the line has been invalidated, it must fetch it from cache A. This can lead to many transfers of the line between the caches and render inefficient the memory system.
So concerning your example, probably 10 is not a good idea, and you should use informations on the caches to improve your exchanges between sender and receiver.
For instance, if you are on a pentium with 64 bytes cache lines, you should declare X as
_Alignas(64) float X[100];
This way the starting address of X will be a multiple of 64 and fit cache lines boundaries. The _Alignas quaiifier exists since C17, and by including stdalign.h, you can also use similarly alignas(64). Before C17, there were several extensions in most compilers in order to have an aligned placement.
And of course, you should indicate process B to read data only when a full 64 bytes line (16 floats) has been written.
This way, when thread B accesses the data, the cache line will not be modified any longer by thread A and only one initial transfer between caches A and B Will take place. This reduction in the number of transfers between the caches may have a significant impact on performances depending on your program.

If you're using a variable to that tracks readiness to read the index, the variable is protected by a mutex and the signalling is done via a pthread condition variable that thread B waits on under the mutex, then yes.
If you're using POSIX signals, then I believe you need a synchronization mechanism on top of that. Writing to an atomic variable with memory_order_release in thread A, and reading it with memory_order_acquire in thread B should guarantee in the most lightweight fashion that writes in A preceding the write to the atomic should be visible in B after it has read the atomic.
For best performance, the array sharing should be also done in such a way that the shared parts of the array do not cross cache-line boundaries (or else you're performance might degrade due to false sharing).

How does mmap work when 2 programs map the same file

I am trying to understand how mmap works while looking at man mmap.
As I understand it, it adds a mapping to the page table that maps between the file and the virtual address (which is the address that is given void *addr)
So, what happens when 2 programs map the same file?
Are there 2 entries in the page table, one for each program?

So, what happens when 2 programs map the same file? Are there 2 entries in the page table, one for each program?
In modern operating systems, each process has its own page table for its memory, that may point to pages of physical memory shared with other user and kernel processes.
With MAP_SHARED, this mapping is shared: updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
This seems very interesting, but there are numerous caveats:
The actual pages mmapped by both processes for the same file may reside at the same address or at a different address in each process, storing pointers into this shared memory may not allow the other process to use them as they might point to inconsistent addresses.
The implementation may use the same physical memory pages for both mappings or not: for subtile reasons (cache strategies, out of sync reading...), even if it is the same physical memory, modifications done by one process to its memory may not be immediately reflected in the memory of the other process.
So the modification may or may not be visible to the other processes mmapping the file nor reading it via read or the FILE* stream API.
If one of the processes calls msync(), the modifications should be visible in all maps and for all yet unread portions of the file, bearing in mind that the FILE* streaming APIs may have buffered some data in internal unshared buffers: modifications in this area will not be reflected.
Conclusion: it is risky and unreliable to use these mechanisms to implement inter process communication. The behavior may depend on system specific characteristics such as the OS strategies, the CPU and cache architectures, the type of RAM in use, the clock speed, and who knows what else. It is safer to rely on proven APIs that may indeed be implemented using mmapped memory, but only if it is know to provide the correct semantics.

The actual system implementation is different. At the risk of over simplification (and omitting paging here):
A mmap will map physical page frames to a file.
So, what happens when 2 programs map the same file? Are there 2 entries in the page table, one for each program?
If two processes (P and Q) map to the same file, then P and Q will each have there own page table; each page table will have entry mapping to the same physical page frame (which could be mapped to different addresses within P and Q).

mmap thread safety in a multi-core and multi-cpu environment

I am a little confused as to the real issues between multi-core and multi-cpu environments when it comes to shared memory, with particular reference to mmap in C.
I have an application that utilizes mmap to share multiple segments of memory between 2 processes. Each process has access to:
A Status and Control memory segment
Raw data (up to 8 separate raw data buffers)
The Status and Control segment is used essentially as an IPC. IE, it may convey that buffer 1 is ready to receive data, or buffer 3 is ready for processing or that the Status and Control memory segment is locked whilst being updated by either parent or child etc etc.
My understanding is, and PLEASE correct me if I am wrong, is that in a multi-core CPU environment on a single boarded PC type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Does this assumption of single-process RAM access also apply to multi-cpu systems? That is, a single PC style board with multiple CPU's (and I guess, multiple cores within each CPU).
If not, I will need to seriously rethink my logic to allow for multi-cpu'd single-boarded machines!
Any thoughts would be greatly appreciated!
PS - by single boarded I mean a single, standalone PC style system. This excludes mainframes and the like ... just to clarify :)

RAM is only ever accessed by a single core (or process) at any one time.
Take a step back and think about your assumption means. Theoretically, yes, this statement is true, but I don't think it means what you think it means. There are no practical conclusions you can draw from this other than maybe "the memory will not catch fire if two CPUs write to the same address at the same time". Let me explain.
If one CPU/process writes to a memory location, then a different CPU/process writes to the same location, the memory writes will not happen at the same time, they will happen one at a time. You can't generally reason about which write will happen before the other, you can't reason about if a read from one CPU will happen before the write from the other CPU, one some older CPUs you can't even reason if multi-byte (multi-word, actually) values will be stored/accessed one byte at a time or multiple bytes at a time (which means that reads and writes to multibyte values can get interleaved between CPUs or processes).
The only thing multiple CPUs change here is the order of memory reads and writes. On a single CPU reading memory you can be pretty sure that your reads from memory will see earlier writes to the same memory (iff no other hardware is reading/writing the memory, then all bets are off). On multiple CPUs the order of reads and writes to different memory locations will surprise you (cpu 1 writes to address 1 and then 2, but cpu 2 might just see the new value at address 2 and the old value at address 1).
So unless you have specific documentation from your operating system and/or CPU manufacturer you can't make any assumptions (except that when two writes to the same memory location happen one will happen before the other). This is why you should use libraries like pthreads or stdatomic.h from C11 for proper locking and synchronization or really dig deep down into the most complex parts of the CPU documentation to actually understand what will happen. The locking primitives in pthreads not only provide locking, they are also guarantee that memory is properly synchronized. stdatomic.h is another way to guarantee memory synchronization, but you should carefully read the C11 standard to see what it promises and what it doesn't promise.

One potential issue is that each core has it's own cache (usually just level1, as level2 and level3 caches are usually shared). Each cpu would also have it's own cache. However most systems ensure cache coherency, so this isn't the issue (except for performance impact of constantly invalidating caches due to writes to the same memory shared in a cache line by each core or processor).
The real issue is that there is no guarantee against reordering of reads and writes due to optimizations by the compiler and/or the hardware. You need to use a Memory Barrier to flush out any pending memory operations to synchronize the state of the threads or shared memory of processes. The memory barrier will occur if you use one of the synchronization types such as an event, mutex, semaphore, ... . Not all of the shared memory reads and writes need to be atomic, but you need to use synchronization between threads and/or processes before accessing any shared memory possibly updated by another thread and/or process.

This does not sound right to me. Two processes on two different cores can both load and store data to RAM at the same time. In addition to this caching strategies can result in all kinds of strangeness-es. So please make sure all access to shared memory is properly synchronized using (interprocess) synchronization objects.

My understanding is, and PLEASE correct me if I am wrong, is that in a multi-core CPU environment on a single boarded PC type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Even if this holds true for some particular architecture, such an assumption is entirely wrong in general. You should have proper synchronisation between the processes that modify the shared memory segment, unless atomic intrinsics are used and the algorithm itself is lock-free.
I would advise you to put a pthread_mutex_t in the shared memory segment (shared across all processes). You will have to initialise it with the PTHREAD_PROCESS_SHARED attribute:
pthread_mutexattr_t mutex_attr;
pthread_mutexattr_init(&mutex_attr);
pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(mutex, &mutex_attr);

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.

My though is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file works, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.

...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.