char x[2048] and cache line issue - c

The following is a simple C source file. char x[2048] is a global variable; func1 is called by thread1 and func2 is called by thread2:
#include <stdio.h>
#include <string.h>

char x[2048] = {0}, y[16] = {0};

void func1() {
    strcpy(x, y);
}

void func2() {
    printf("(%s)\n", x);
}

int main(int argc, char **argv) {
    strncpy(y, argv[1], sizeof(y) - 1);
}
On an Intel CPU one cache line holds 64 bytes, so x spans 32 cache lines. My questions are:
When thread1 calls func1, must all 32 cache lines be brought into that CPU's cache before the strcpy runs, or does the compiler know that just one cache line is enough to do the job?
When thread2 calls func2, must all 32 cache lines be brought into that CPU's cache before the printf runs, or can the compiler tell that one cache line is enough?

I suggest you read the Wikipedia page: https://en.wikipedia.org/wiki/CPU_cache
Some background:
Normally, cache lines ($L) are transparent to programs, so most programmers don't deal with cache lines directly (bringing them in, kicking them out). When the CPU finds that code/data is not in a $L, it stalls on that memory access and brings the line in on demand.
Although there are coding techniques to pull data into a cache line explicitly (e.g. via a prefetch instruction), the compiler normally isn't smart enough to do this for you: it might prefetch too early (so the $L has already been evicted by the time it is used) or too late (the CPU still has to stall for the memory access).
Answer to your questions:
No. The compiler doesn't know how many $Ls need to be brought in (it cannot know whether a piece of data is already cached, so it stays on the safe side and doesn't try to outsmart itself). The compiler just issues, for example, a MOV instruction; the CPU, while executing it, finds that the operand is not in the cache and brings the line in on demand. Since your program only copies up to the '\0', the cache-line fills stop there as well.
Same as #1. Only the $Ls that are actually read are brought in, and the compiler has nothing to do with it.
More Info:
The CPU's hardware prefetcher might bring in additional $Ls beyond those currently needed. For example, it might fetch the next $L, hoping to exploit spatial locality.
Some advanced programs use prefetch instructions to improve performance. If you know your code will access some location in the near future, you can prefetch it so that it is already cached by the time you need it and you avoid the $L miss penalty. But it is hard to get right: you have to know your code's memory access pattern and insert the prefetch instruction at the right place. Some high-performance code builds a software pipeline to do this, but that is an advanced topic.
https://en.wikipedia.org/wiki/Instruction_prefetch
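To make the prefetch point concrete, here is a minimal sketch using the GCC/Clang builtin __builtin_prefetch (the function name and the prefetch distance of 16 elements are illustrative assumptions, not tuned values):

#include <stddef.h>

/* Sum an array while prefetching a few cache lines ahead.
   PREFETCH_DISTANCE is a guess; real code would tune it. */
#define PREFETCH_DISTANCE 16

long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1); /* 0 = read, 1 = low temporal locality */
        sum += a[i];
    }
    return sum;
}

Prefetching too far ahead or not far enough wastes the effort, which is exactly the too-early/too-late problem described above.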

On x86 and x64 (as well as modern ARMs and other common CPUs), the cache is fully transparent to user-mode programs.
As a result, strcpy performs the first read, the CPU pulls in one cache line automatically, strcpy stops at the '\0', and it's done. The same thing happens with printf("(%s)\n", x).


Why is memory barrier not required for UP? [duplicate]

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors, due to out-of-order execution.
However, I don't understand why this is not a problem on a single-core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know, single-core CPUs also reorder memory accesses (if their memory model is weak), so what makes sure program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to reorder the store to x until after the load, it would change the result. While the CPU is allowed to execute out of order, it only does so when that doesn't change the result. If it were allowed to change results, virtually every sequence of instructions could fail; it would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU itself can see, it is not reordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, the store may not have been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side-effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. a single-core machine or green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure it's globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 architectures, which had cache-control instructions designed in from the start, do require the OS to be aware of them: i.e. to make sure the cache is invalidated before reading in new contents from disk, and to make sure data is at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
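On an SMP system, where that reordering is actually observable, the usual fix for the example in the question is a release/acquire pair. Here is a minimal sketch with C11 atomics (variable names follow the question; this is one way to write it, not the only way):

#include <stdatomic.h>
#include <stdio.h>

_Atomic int x = 0;
_Atomic int f = 0;

void writer(void)   /* Thread #2 */
{
    atomic_store_explicit(&x, 42, memory_order_relaxed);
    atomic_store_explicit(&f, 1, memory_order_release);   /* publish x */
}

void reader(void)   /* Thread #1 */
{
    while (atomic_load_explicit(&f, memory_order_acquire) == 0)
        ;                                                  /* spin until published */
    printf("%d\n", atomic_load_explicit(&x, memory_order_relaxed)); /* prints 42 */
}

On a single core the hardware ordering isn't needed for correctness, for the reasons given above; the compiler-level ordering the atomics provide still matters, though.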
so what makes sure the program order is preserved?
Hardware dependency tracking: loads snoop the store buffer for older stores to the same location. This makes sure loads take data from the last program-order write to any given memory location1.
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. It would be insane and unusable (and would kill performance) if you had to put a memory barrier after every store just so your own reloads reliably saw the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be a different address and load from cache instead of waiting for the store-address part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...

Random Memory Reads vs Random Memory Writes

In low-level languages like C, I know you should try to use the CPU cache to your benefit as much as possible, since a cache miss means your program temporarily has to wait for RAM in order to dereference a pointer. However, are writes to memory also affected by this? If you write to memory, it would seem that the CPU does not need to wait for a response.
I'm trying to decide whether reordering an array of items would truly be worth it when I need to access items in the array in certain groups repeatedly (i.e. sorting it based on those groups). However, those groups will change frequently, so I would need to keep reordering the array if I do this.
Depending on your architecture, random memory writes can be expensive for at least two reasons.
On today's multi-core machines, almost all writes will require some kind of cache coherence protocol to be run so that the corresponding cache lines on other caches will be invalidated.
Ordinary writes will either always cost a memory operation (with a write-through cache) or only sometimes (with a write-back cache, where the line is written out later, when it is evicted).
You can read more details about the possible behaviors of caches on Wikipedia.
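If you want to see the effect on your own machine, a rough sketch like the following compares sequential with random writes (the buffer size, the xorshift PRNG, and POSIX clock_gettime are arbitrary choices here, not a rigorous benchmark):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32 * 1024 * 1024)   /* 32M ints, assumed to be much larger than any cache */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;

    double t0 = seconds();
    for (size_t i = 0; i < N; i++)                /* sequential writes: prefetcher-friendly */
        a[i] = (int)i;
    double t1 = seconds();

    unsigned long long s = 88172645463325252ULL;  /* xorshift PRNG state */
    for (size_t i = 0; i < N; i++) {              /* random writes: mostly cache misses */
        s ^= s << 13; s ^= s >> 7; s ^= s << 17;
        a[s % N] = (int)i;
    }
    double t2 = seconds();

    printf("sequential: %.3f s, random: %.3f s\n", t1 - t0, t2 - t1);
    free(a);
    return 0;
}

On typical hardware the random pass is several times slower, even though both passes perform the same number of writes.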
This is a very broad question, so my answer is nearly as broad.
The source code, the compiled code, and the underlying hardware are not necessarily all in sync when it comes to reading and writing memory. Your C/C++ code simply references variables. The compiled code will turn that into appropriate machine language which is close to the source code but can vary in the case of optimization, volatile keyword, etc. Finally the hardware will optimize the 3 main levels of storage: CPU cache (fastest), RAM, and hard disk (yes, your program variables can actually be stored on the hard disk, in the case of swapping).
Whether the CPU waits or not depends partially on what's going on at the hardware layer combined with the machine code (again for example consider data specified as volatile).

Is there a way to avoid cache misses _completely_?

I read the very basics on how the cache works here: How and when to align to cache line size? and here: What is "cache-friendly" code? , but none of these posts answered my question: is there a way to execute some code entirely within the cache, i.e., without using any access to RAM (beyond perhaps during the initial process of reading the file from the HDD)? As far as I understand the bottleneck in computation nowadays is mostly memory bandwidth, and "as long as you are within the CPU, you are just fine".
Is there a way to load a program into the cache and keep it there until it terminates? Say I have a 1MB compiled C program which does some scientific computation with a memory requirement of another 1MB and runs for 5 days. Is there a way to flag this code so that it does not get evicted from the cache during execution? I am thinking of giving this code a higher priority, or something similar, while it runs.
In other words, how much cache does an idling computer use, one that loads its OS (say Ubuntu) and then does nothing? Is there significant cache use during idling? Should I expect my small program to stay in the cache if the OS does nothing besides executing it? Say that after 5 minutes the screensaver starts. Does this lead to massive cache misses (and hence a drastic reduction in performance), since it now competes with my program for the cache space? My experience is that running several non-demanding programs (a screensaver, a simple audio player, a pdf reader, etc.) at the same time does not significantly decrease the performance of my scientific program, even though I would expect it to be bounced in and out of the cache all the time. The question is: why is its speed not affected? Would it make sense to use an absolutely minimalistic OS (if so, which one?) to improve, or rather maintain, the speed of the computation?
Just for clarity, we can assume that the code is something very simple, say it is a bunch of nested for loops where the innermost part sums up all the increment variables modulo 97. The point is that it is small enough to be put and executed in the cache.
There are different types of CPU cache misses: compulsory, conflict, capacity, coherence.
Compulsory misses can't be avoided, as they happen on the first reference to a location in memory. So no, you definitely can't avoid cache misses completely.
Besides that, typical L1 cache sizes today are 32 KB-64 KB per core, and L2 caches are around 256 KB per core, so 1 MB of data would also cause capacity or conflict misses, depending on the cache's associativity.
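To see those capacity misses directly, a rough sketch like the following times one touch per cache line over growing working sets; the time per touch jumps as the working set stops fitting in each cache level (the stride, sizes, and repetition count are arbitrary choices):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t max = 64 * 1024 * 1024;        /* assumed larger than the last-level cache */
    volatile char *buf = malloc(max);
    if (!buf) return 1;

    for (size_t set = 16 * 1024; set <= max; set *= 2) {
        struct timespec a, b;
        size_t touches = 0;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int rep = 0; rep < 64; rep++)
            for (size_t i = 0; i < set; i += 64) {   /* one touch per 64-byte line */
                buf[i]++;
                touches++;
            }
        clock_gettime(CLOCK_MONOTONIC, &b);
        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("%8zu KiB: %.2f ns per line touched\n", set / 1024, ns / touches);
    }
    free((void *)buf);
    return 0;
}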
No, on most standard architectures, CPU cache is not addressable.*
And even if you could, what kind of performance improvement are you anticipating here? What percentage of your program's execution time do you believe is being spent loading from main memory into (L3) cache? You should profile your program to determine where it's actually spending its time, rather than dreaming up solutions to problems that don't exist!
* I think x86 CPUs might have a hardware configuration which allows them to operate without attached RAM, but that's basically irrelevant.
Short answer: NO. The cache is managed by the CPU/OS, and it is a bad idea to let programs force themselves to stay in cache. Say you have two programs running at the same time and both try to force themselves to stay in the cache; chaos would ensue, wouldn't it?
Newer Intel CPUs have added "Cache Allocation Technology" (CAT) under the general rubric of their Resource Director Technology. This allows software directives to reserve certain cache (and other) resources for particular computational units (application, container, VM, etc). So, if the process in question has enough cache space set aside for it under CAT, it should experience only its initial compulsory misses (to bring its code and data into cache) and self-induced conflict misses, avoiding capacity misses and conflict misses created by other processes.
I am not sure whether this will fully answer your questions.
is there a way to execute some code entirely within the cache, i.e., without using any access to RAM?
Is there a way to load a program into the cache, and keep it there until it terminates?
It is possible to use tightly coupled memories (TCMs), which have single-cycle access times (this is realistic only in very small embedded systems). It is general practice in embedded systems to put time-critical code in TCM, as it provides predictability.
With set-associative caches it is possible to lock down cache lines or ways (e.g. via CP15 on ARM) so that the eviction algorithm never picks them as victims for a cache fill.
As a side note, it is also sometimes useful to use cache-as-RAM for bring-up of non-booting boards when the caches are in debug mode.
(http://www.asset-intertech.com/Products/Processor-Controlled-Test/PCT-Software/Cache-as-RAM-for-board-bring-up-of-non-boothing-ci)

Does a cache write take longer with more caches to invalidate?

Can you please help me figure out whether a cache write takes longer to finish when more cores/caches hold a copy of that line?
I also want to measure/quantify how much longer it actually takes.
I couldn't find anything useful on Google, and I have trouble measuring it myself, plus interpreting what I measure, because of the many things that can happen on a modern processor (reordering, prefetching, buffering, and who knows what else).
Details:
My basic process of measuring it is roughly as follows:
write something to the cache line on processor 0
read it on processors 1 to n.
rdtsc
write it on processor 0
rdtsc
I am not even sure which instructions to actually use for the read/write on processor 0 in order to make sure the write/invalidate has finished before the final time measurement.
At the moment I fiddle with an atomic read-modify-write (__sync_fetch_and_add()), but it seems that the number of threads itself matters for the length of this operation (not just the number of threads whose copies need invalidating), which is probably not what I want to measure.
I also tried a read, then a write, then a memory barrier (__sync_synchronize()). This looks more like what I expect to see, but here too I am not sure whether the write has finished when the final rdtsc takes place.
As you can guess my knowledge of CPU internals is somewhat limited.
Any help is very much appreciated!
ps:
* I use Linux, gcc, and pthreads for the measurements.
* I want to know this for modeling a parallel algorithm of mine.
Edit:
In a week or so (I'm going on vacation tomorrow) I'll do some more research, post my code and notes, and link them here (in case someone is interested), because the time I can spend on this right now is limited.
I started writing a very long answer, describing exactly how this works, then realized I probably don't know enough about the exact details. So I'll give a shorter answer....
So, when you write something on one processor, if it's not already in that processor's cache, it will have to be fetched in, and after the processor has read the data, it will perform the actual write. In doing so, it will send a cache-invalidate message to ALL other processors in the system, which then throw away their copies. If another processor has "dirty" content, it will write that data out itself and ask for an invalidation, in which case the first processor will have to RELOAD the data before finishing its write (otherwise, some other element in the same cache line might get destroyed).
Reading it back into the cache will be required on every other processor that is interested in that cache-line.
The __sync_fetch_and_add() will use a "lock" prefix [on x86; other processors may vary, but the general idea on processors that support "per instruction" locks is roughly the same] - this issues an "I want this cache line EXCLUSIVELY, everyone else please give it up and invalidate it" request. Just like in the first case, the processor may well have to re-read anything that another processor may have made dirty.
A memory barrier will not ensure that data is updated "safely" - it will just make sure that "whatever happened (to memory) before now is visible to all processors by the time this instruction finishes".
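If you do want to put a number on it, here is a hedged sketch of the timing idea from the question, assuming x86 and the GCC/Clang intrinsics __rdtsc and _mm_mfence from <x86intrin.h> (the fencing choices are an assumption, not a guaranteed-correct measurement recipe):

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_mfence */

volatile int shared_line;   /* written by processor 0, read by processors 1..n beforehand */

static inline uint64_t time_one_store(int value)
{
    _mm_mfence();                 /* let earlier memory traffic settle */
    uint64_t t0 = __rdtsc();
    shared_line = value;          /* the store that must invalidate the other copies */
    _mm_mfence();                 /* wait until the store is globally visible */
    uint64_t t1 = __rdtsc();
    return t1 - t0;
}

Run it many times and look at the distribution rather than a single sample, because interrupts, frequency changes, and prefetching will add noise.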
The best way to optimize the use of processors is to share as little as possible, and in particular to avoid "false sharing". In a benchmark many years ago, there was a structure like [simplified] this:
struct stuff {
    int x[2];
    ... other data ...          /* total data: a few cache lines */
} data;

void thread1()
{
    for( ... big number ...)
        data.x[0]++;
}

void thread2()
{
    for( ... big number ...)
        data.x[1]++;
}

int main()
{
    start = timenow();
    create(thread1);
    create(thread2);
    end = timenow() - start;
}
Since EVERY time thread1 wrote to x[0], thread2's processor had to get rid of its copy of x[1], and vice versa, the result was that the SMP test [vs. just running thread1] ran about 15 times slower. By altering the struct like this:
struct stuff {
    int x;
    ... other data ...
} data[2];

and

void thread1()
{
    for( ... big number ...)
        data[0].x++;
}
we got 200% of the single-thread variant's performance [give or take a few percent].
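A minimal runnable sketch of that benchmark, assuming pthreads (the iteration count and the 64-byte line size/padding are illustrative assumptions; drop the padding to reproduce the false-sharing case):

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

struct padded_counter {
    volatile unsigned long count;
    char pad[64 - sizeof(unsigned long)];   /* keep each counter on its own cache line */
};

static _Alignas(64) struct padded_counter counters[2];

static void *worker(void *arg)
{
    struct padded_counter *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c->count++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, &counters[0]);
    pthread_create(&t2, NULL, worker, &counters[1]);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%lu %lu\n", counters[0].count, counters[1].count);
    return 0;
}

Compile with -pthread and time it with and without the padding; the padded version corresponds to the "200%" case, the unpadded one to the false-sharing case.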
Right, so the processor has queues of buffers where write operations are stored while the processor is writing to memory. A memory barrier (mfence, sfence or lfence) instruction is there to ensure that any outstanding read/write, write, or read type operation has completely finished before the processor proceeds to the next instruction. Normally, the processor would just continue on its jolly way through any following instructions, and eventually the memory operation gets fulfilled one way or another. Since modern processors have a lot of parallel operations and buffers all over the place, it can take quite some time before something ACTUALLY trickles through to where it will eventually end up. So, when it's CRITICAL to make sure that something has ACTUALLY been done before proceeding (for example, if we have written a bunch of instructions to video memory and now want to kick off the run of those instructions, we need to make sure that the writing of the 'instructions' has actually finished and some other part of the processor isn't still working on it), use an sfence to make sure that the write has really happened. That may not be a very realistic example, but I think you get the idea.
Cache writes have to gain line ownership before dirtying the cache line. Depending on the cache coherence model implemented in the processor architecture, the time taken for this step varies. The most common coherence protocols that I know of are:
Snooping coherence protocol: all caches monitor the address lines for the memory lines they have cached, i.e. every memory request has to be broadcast to all CPUs, which does not scale as the number of CPUs increases.
Directory-based coherence protocol: the sharing state of cache lines shared among many CPUs is kept in a directory, so invalidating/gaining ownership is a point-to-point CPU request rather than a broadcast, i.e. more scalable, but latency suffers because the directory is a single point of contention.
Most CPU architectures support something called a PMU (performance monitoring unit). This unit exports counters for many things, such as cache hits, misses, cache write latency, read latency, TLB hits, etc. Please consult your CPU manual to see whether this information is available.
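One concrete way to get at those counters from user space on Linux is the perf_event_open syscall; here is a hedged sketch counting cache misses around a region of code (availability and permissions vary by kernel and CPU, and how the generic event maps to the hardware is up to the platform):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under test goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}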

How to flush the CPU cache in Linux from a C program?

I am writing a C program in which I need to flush my memory. I would like to know whether there is any UNIX system command to flush the CPU cache.
This is a requirement for my project, which involves measuring the time taken by my logic.
I have read about the cacheflush(char *s, int a, int b) function, but I am not sure whether it is suitable and what to pass as the parameters.
I take it you mean "CPU cache", not memory cache.
The link above is good: the suggestion "write a lot of data via the CPU" is not Windows-specific.
Here's another variation on the same theme:
How to clear CPU L1 and L2 cache
Here's an article about Linux and CPU cache:
http://lwn.net/Articles/252125/
NOTE:
At this (very, very low) level, "Linux" != "Unix"
This is how Intel suggests flushing the cache:
#include <stddef.h>   /* for size_t */

void mem_flush(const void *p, unsigned int allocation_size)
{
    const size_t cache_line = 64;
    const char *cp = (const char *)p;
    size_t i = 0;

    if (p == NULL || allocation_size <= 0)
        return;

    for (i = 0; i < allocation_size; i += cache_line) {
        asm volatile("clflush (%0)\n\t"
                     :
                     : "r"(&cp[i])
                     : "memory");
    }

    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");
}
If you're writing a user-mode (not kernel-mode) program, and if it's single-threaded, then there's really no reason for you to ever bother flushing your cache in the first place. Your user-mode program can just forget that it even exists; it's just there to speed up your program's execution, and the OS manages it via the processor's MMU.
There are only a couple reasons I can think of that you might actually want to flush the cache from your user-mode application:
Your app is intended to run on a symmetric multiprocessor system, or has data transactions with external hardware.
You're simply testing your cache for some sort of performance test (in which case you probably should be writing your test to operate in kernel mode, perhaps as a driver).
In any case, assuming you're using Linux...
#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);
This assumes you have a block of memory you just wrote to and you want to make sure it's flushed out of the cache back to main memory. The block begins at addr, and it's nbytes long, and it's in one of the two caches (or both):
ICACHE Flush the instruction cache.
DCACHE Write back to memory and invalidate the affected valid cache lines.
BCACHE Same as (ICACHE|DCACHE).
Normally you'd only need to flush the DCACHE, since when you write data to "memory" (i.e. to the cache), it's normally data, not instructions.
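For example, assuming the prototype above (the buffer and its size here are purely illustrative, and cacheflush() itself is architecture-specific, so check that your platform actually provides it):

#include <asm/cachectl.h>
#include <stdio.h>

char buf[4096];

void flush_buf(void)
{
    /* ... code that writes buf ... */
    if (cacheflush(buf, sizeof(buf), DCACHE) != 0)
        perror("cacheflush");
}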
If you want to flush "all of the cache" for some strange testing reason, you could malloc() a big block that you know is larger than your CPU's cache (shoot, make it 8 times as big!), write any old garbage into it, and just flush that entire block.
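A hedged sketch of that "write through a huge buffer" approach (the 64 MiB size is an arbitrary guess; pick something comfortably larger than your last-level cache):

#include <stdlib.h>

/* Evict (most of) the data cache by writing through a buffer much
   larger than the cache. */
void trash_data_cache(void)
{
    const size_t size = 64 * 1024 * 1024;
    volatile char *buf = malloc(size);
    if (!buf)
        return;
    for (size_t i = 0; i < size; i += 64)   /* touch one byte per 64-byte line */
        buf[i] = (char)i;
    free((void *)buf);
}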
See also: How to perform cache operations in C++?
OK, sorry about my first answer. I later read your follow-up comments below your question, so I realize now that you want to flush the INSTRUCTION CACHE to boot your program (or parts of it) out of the cache, so that when you test its performance, you also test its initial load time out of main memory into the instruction cache. Do you also need to flush any data your code will use out to main memory, so that both data and code are fresh loads?
Before anything else, I'd like to mention that main memory itself is also a form of cache, with your hard disk (either the program on disk, or swap space on disk) being the lowest, slowest place your program's instructions could come from. That said, when you run through a routine for the first time, if it hasn't already been loaded into main memory from disk by virtue of being near other code that has already executed, then its CPU instructions will first have to be loaded from disk. That takes an order of magnitude or more longer than loading it from main memory. Then, once it's loaded into main memory, it takes something like an order of magnitude longer to load from main memory into the cache than it takes to load from the cache into the CPU's instruction fetcher. So if you want to test your code's cold-start performance, you have to decide what cold-start means: pulling it off disk, or pulling it out of main memory. I don't know of any command to "flush" instructions/data from main memory out to swap space, so flushing it out to main memory is about as much as you can do (that I know of), but keep in mind that your test results may still differ between the first run (when it may be pulling code off disk) and subsequent runs, even if you do flush the instruction cache.
Now, how would one go about flushing the instruction cache to ensure that their own code is flushed out to main memory?
If I needed to do this (a very odd thing to do, in my opinion), I'd probably start by finding the length and approximate placement of my functions in memory. Since I'm using Linux, I'd run "objdump -d {myprogram} > myprogram.dump.txt", open myprogram.dump.txt in an editor, search for the functions I want to flush out, and figure out how long they are by subtracting their start address from their end address using a hex calculator. I'd write down the size of each. Later I'd add cacheflush() calls to my code, giving the address of each function I want to flush as 'addr', the length I found as 'nbytes', and ICACHE as the cache argument. Just for safety I'd probably fudge a little and add about 10% to the size, in case I make a few tweaks to the code and forget to adjust nbytes. I'd make a call to cacheflush() like this for each function I want to flush out. Then, if I need to flush the data as well, if it's global/static data I can flush that too (DCACHE), but if it's stack or heap data there's really nothing realistic that I can (or should) do to flush it out of the cache. Trying to do so would be an exercise in silliness, because it would be creating a condition that would never, or very rarely, exist in normal execution.
BTW, is this homework for a class?
