How do I invalidate all (data) caches from the command line?

For some precise measurements, I would like to invalidate/flush all caches up to RAM (main memory) from the command line, so that the timing of the main program is not affected by the flushing itself. I have found the following options (the first and last from here):
1. echo 3 > /proc/sys/vm/drop_caches
and I could build a (pre-executed) program with the following
2. #include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);
or I could finally do a
3. #include <stdlib.h>

int main() {
    const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2.
    char *c = (char *)malloc(size);
    for (int i = 0; i < 0xffff; i++)     // repeatedly...
        for (int j = 0; j < size; j++)   // ...walk the whole buffer so every line gets evicted and refetched
            c[j] = i*j;
    free(c);
}
My question is: for what I need to do, which version is best, and if it is #2, what address should I give it as the start address? My uname -a is Linux 3.2.0-33-generic #52-Ubuntu SMP Thu Oct 18 16:19:45 UTC 2012 i686

You're running on an operating system that will do other things behind your back. The operating system will handle interrupts, run various daemons in the background and do various maintenance tasks and potentially move your running process to a different cpu, etc.
Invalidating caches is the least of your worries and if your measurements have to be this accurate, you need to reevaluate the environment where you run your test. Even if you manage to get everything the operating system does under control (which basically means making your tested code part of the operating system), you still need to consider TLB behavior and branch prediction buffers (which will affect your performance more than caches), get control over SMM (which you typically can't unless you have control over your BIOS) and understand how the clocks you use for measuring really behave (I'd guess that a temperature difference of 10 degrees will affect your measurement more than having a clean cache).
In other words - forget it. A typical way to measure things realistically is to run it "enough" times and take an average (or minimum or maximum or median, depending on what you want to prove).
To add more: your method number 1 flushes the filesystem caches and has nothing to do with the data caches on the CPU. Number 2 I have no idea about; my Linux flavour doesn't have it. Number 3 might work if your CPU's caches had perfect associativity, which they don't, and you would also have to make sure that the physical pages allocated by the operating system touch every possible cache line, which you can't. You'd also have to make sure that you either execute it on the same CPU your test will run on, or on all CPUs, with nothing scheduled to run in between. Since you want to run this from the command line, your shell will stomp all over the caches long before your program runs (and the exec system call and filesystem operations won't help).
The only way to reliably clear caches on your architecture is the wbinvd instruction which you're not allowed to call because you're not the kernel and are not supposed to mess around with caches.
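If all you need is stable numbers, a minimal sketch of the "run it many times and keep the best" approach is below (the function name work_under_test is a placeholder for whatever you are measuring, not something from the question):

#include <stdio.h>
#include <time.h>

static void work_under_test(void) { /* the code being measured */ }

int main(void) {
    struct timespec t0, t1;
    double best_ns = 1e18;

    for (int run = 0; run < 1000; run++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work_under_test();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        if (ns < best_ns)
            best_ns = ns;   /* the minimum filters out interrupts, daemons, etc. */
    }
    printf("best of 1000 runs: %.0f ns\n", best_ns);
    return 0;
}

(On older glibc you may need to link with -lrt for clock_gettime.)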

Related

Why is a memory barrier not required for UP?

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors, due to out-of-order execution.
However, I don't understand why this is not a problem on a single-core machine with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed to execute out of order, it only does so when it doesn't change the result. If it were allowed to do that, virtually every sequence of instructions could fail; it would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, the store may not have been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. a single-core machine or green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure it's globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 architectures that had cache-control instructions designed in from the start do require the OS to be aware of them, i.e. to make sure the cache is invalidated before reading in new contents from disk, and to make sure data is at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
so what makes sure the program order is preserved?
Hardware dependency tracking: loads snoop the store buffer for stores to the location being loaded. This makes sure loads take data from the last program-order write to any given memory location1.
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...

char x[2048] and cache line issue

The following is the simple c source code, where char x[2048] is a global var and func1 is called by thread1, func2 is called by thread2:
char x[2048] = {0}, y[16] = {0};

void func1() {
    strcpy(x, y);
}

void func2() {
    printf("(%s)\n", x);
}

int main(int argc, char **argv) {
    strncpy(y, argv[1], sizeof(y) - 1);
}
In Intel CPUs, one cache line holds 64 bytes, so x should occupy 32 cache lines. My questions are:
1. When thread1 calls func1, do all 32 cache lines have to be brought into that CPU's cache before the strcpy can proceed, or does the compiler know that just one cache line is enough to do the job?
2. When thread2 calls func2, do all 32 cache lines have to be brought into that CPU's cache before the printf can proceed, or can the compiler identify that one cache line is enough?
I suggest you read the Wikipedia page: https://en.wikipedia.org/wiki/CPU_cache
Some background:
Normally, cache lines ($L) are transparent to programs, so most programmers don't deal with cache lines (bringing them in, kicking them out) directly. The CPU, once it finds that code/data is not in a $L, stalls on that memory access and brings the $L in on demand.
Although there are coding techniques to bring data into a cache line in code (e.g. via a prefetch instruction), normally the compiler won't be smart enough to do this for you, as it might prefetch too early (so by the time the $L is used, it has already been kicked out) or too late (the CPU still has to stall on the memory access).
Answers to your questions:
1. No. The compiler doesn't know how many $Ls need to be brought in (how could it know whether a piece of data is already in a $L or not? It stays on the safe side and doesn't try to outsmart itself). The compiler just issues, for example, a MOV instruction, and the CPU, while executing that instruction, finds that the operand is not in the cache and brings it in on demand. Since your program only copies up to the '\0', the cache-line fetching stops there.
2. Same as #1. Only the $Ls that are actually read are brought in, and the compiler has nothing to do with this.
More Info:
The CPU prefetcher might bring in additional $Ls besides those currently needed. For example, it might bring in the next $L, hoping for data locality.
Some advanced programs use prefetch instructions to improve performance. If you know that your code will access some location in the near future, you can prefetch it, and by the time you need it, it is already there and you don't incur the $L-miss penalty. But it's hard to get right: you have to know the memory access pattern of your code and insert the prefetch instruction at the right place. Some high-performance code uses software pipelining to do this, but again, that's an advanced topic.
https://en.wikipedia.org/wiki/Instruction_prefetch
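To illustrate the software-prefetch idea from the paragraph above, here is a sketch using GCC's __builtin_prefetch; the distance of 128 elements (8 cache lines of 4-byte ints) ahead is a made-up tuning parameter, not a recommendation:

#include <stddef.h>

/* Sum an array while asking the CPU to start fetching data
 * several cache lines ahead of the current read position. */
long sum_with_prefetch(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 128 < n)
            __builtin_prefetch(&a[i + 128], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}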
On x86 and x64 (as well as modern ARMs and other common CPU's), the cache is fully transparent to user-mode programs.
As a result, strcpy performs the first read, the CPU pulls in one cache line automatically, strcpy quits on the \0 and it's done. Same thing with printf("%s",x).

How should I read Intel PCI uncore performance counters on Linux as non-root?

I'd like to have a library that allows 'self profiling' of critical sections of Linux executables. In the same way that one can time a section using gettimeofday() or RDTSC I'd like to be able to count events such as branch misses and cache hits.
There are a number of tools that do similar things (perf, PAPI, likwid) but I haven't found anything that matches what I'm looking for. Likwid comes closest, so I'm mostly looking at ways to modify its existing Marker API.
The per-core counter values are stored in MSRs (Model-Specific Registers), but for current Intel processors (Sandy Bridge onward) the "uncore" measurements (memory accesses and other things that pertain to the CPU as a whole) are accessed over PCI.
The usual approach is that the MSRs are read using the msr kernel module, and the PCI counters (if supported) are read from the sysfs-pci hierarchy. The problem is that both of these require the reader to be running as root and to have 'setcap cap_sys_rawio'. This is difficult (or impossible) for many users.
It's also not particularly fast. Since the goal is to profile small pieces of code, the 'skew' from reading each counter with a syscall is significant. It turns out that the MSR registers can be read by a normal user using RDPMC. I don't yet have a great solution for reading the PCI registers.
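For reference, the RDPMC read itself is tiny; a sketch (which counter index maps to which event is whatever the kernel or a driver has programmed, and the instruction faults unless the kernel has enabled user-space counter access via CR4.PCE):

#include <stdint.h>

static inline uint64_t read_pmc(uint32_t counter) {
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}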
One way would be to proxy everything through an 'access server' running as root. This would work, but would be even slower (and hence less accurate) than using /proc/bus/pci. I'm trying to figure out how best to make the PCI 'configuration' space of the counters visible to a non-privileged program.
The best I've come up with is to have a server running as root, to which the client can connect at startup via a Unix local domain socket. As root, the server will open the appropriate device files, and pass the open file handle to the client. The client should then be able to make multiple reads during execution on its own. Is there any reason this wouldn't work?
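The descriptor-passing part of that design is standard SCM_RIGHTS machinery; a minimal sketch of the sending side, error handling omitted:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send an already-open file descriptor 'fd' over a connected
 * UNIX-domain socket 'sock'. The receiver gets its own handle to
 * the same open file description via recvmsg(). */
static int send_fd(int sock, int fd) {
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}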
But even if I do that, I'll still be using a pread() system call (or something comparable) for every access, of which there might be billions. If trying to time small sub-1000 cycle sections, this might be too much overhead. Instead, I'd like to figure out how to access these counters as Memory Mapped I/O.
That is, I'd like to have read-only access to each counter represented by an address in memory, with the I/O mapping happening at the level of the processor and IOMMU rather than involving the OS. This is described in the Intel Architectures Software Developer Vol 1 in section 16.3.1 Memory Mapped I/O.
This seems almost possible. In proc_bus_pci_mmap() the device handler for /proc/bus/pci seems to allow the configuration area to be mapped, but only by root, and only if I have CAP_SYS_RAWIO.
static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct pci_dev *dev = PDE_DATA(file_inode(file));
    struct pci_filp_private *fpriv = file->private_data;
    int i, ret;

    if (!capable(CAP_SYS_RAWIO))
        return -EPERM;

    /* Make sure the caller is mapping a real resource for this device */
    for (i = 0; i < PCI_ROM_RESOURCE; i++) {
        if (pci_mmap_fits(dev, i, vma, PCI_MMAP_PROCFS))
            break;
    }
    if (i >= PCI_ROM_RESOURCE)
        return -ENODEV;

    ret = pci_mmap_page_range(dev, vma,
                              fpriv->mmap_state,
                              fpriv->write_combine);
    if (ret < 0)
        return ret;

    return 0;
}
So while I could pass the file handle to the client, it can't mmap() it, and I can't think of any way to share an mmap'd region with a non-descendent process.
(Finally, we get to the questions!)
So presuming I really want to have a pointer in a non-privileged process that can read from PCI configuration space without help from the kernel each time, what are my options?
1) Maybe I could have a root process open /dev/mem and then pass that open file descriptor to the child, which can then mmap the part that it wants. But I can't think of any way to make that even remotely secure.
2) I could write my own kernel module, which looks a lot like linux/drivers/pci/proc.c but omits the check for the usual permissions. Since I can lock this down so that it is read-only and just for the PCI space that I want, it should be reasonably safe.
3) ??? (This is where you come in)
Maybe this answer is a little late, but the answer is to use likwid.
As you said, reading the MSRs/sysfs-pci has to be done as root. Building the likwid accessDaemon and giving it the rights to access the MSRs bypasses this issue. Of course, due to some inter-process communication, the performance values can have some delay, but this delay is not very high.
(For small code sections, the performance counters are somewhat imprecise in any case.)
Likwid can also handle uncore events.
Best
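For reference, the Marker API mentioned in the question is used roughly like this (a sketch only; it has to be compiled with -DLIKWID_PERFMON and -llikwid and run under likwid-perfctr -m, and the region name "critical" is just an example):

#include <likwid.h>   /* newer likwid versions ship this as likwid-marker.h */

void critical_section(void) {
    LIKWID_MARKER_START("critical");
    /* ... code to profile ... */
    LIKWID_MARKER_STOP("critical");
}

int main(void) {
    LIKWID_MARKER_INIT;
    critical_section();
    LIKWID_MARKER_CLOSE;
    return 0;
}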

Does a cache write take longer with more caches to invalidate?

Can you please help me find out whether it takes longer for a cache write to finish when there are more cores/caches holding a copy of that line?
I also want to measure/quantify how much longer it actually takes.
I couldn't find anything useful on Google, and I have trouble measuring it myself, plus interpreting what I measure, because of the many things that can happen on a modern processor (reordering, prefetching, buffering and god knows what).
Details:
My basic process of measuring it is roughly as follows:
write something to the cache line on processor 0
read it on processors 1 to n
rdtsc
write it on processor 0
rdtsc
I am not even sure which instructions to actually use for the read/write on processor 0 in order to make sure the write/invalidate is finished before the final time measurement.
At the moment I fiddle with an atomic exchange (__sync_fetch_and_add()), but it seems that the total number of threads is itself important for the length of this operation (not just the number of threads whose copies need invalidating), which is probably not what I want to measure.
I also tried a read, then a write, then a memory barrier (__sync_synchronize()). This looks more like what I expect to see, but here I am also not sure whether the write is finished when the final rdtsc takes place.
As you can guess my knowledge of CPU internals is somewhat limited.
Any help is very much appreciated!
ps:
* I use linux, gcc and pthreads for the measurements.
* I want to know this for modeling a parallel algorithm of mine.
Edit:
In a week or so (going on vacation tomorrow) I'll do some more research and post my code and notes and link it here (In case someone is interested), because the time I can spend on this is limited.
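(For what it's worth, the kind of timing kernel being described looks roughly like the sketch below; the fences are an attempt to keep the store inside the timed window, but as the answers below note, they still don't prove that the remote invalidations have completed.)

#include <stdint.h>
#include <x86intrin.h>

volatile uint64_t shared_line;   /* the contended cache line */

/* Time a single write to a line that other cores currently hold. */
static uint64_t time_one_write(uint64_t value) {
    unsigned aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    shared_line = value;
    _mm_mfence();                  /* wait for the store buffer to drain */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}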
I started writing a very long answer, describing exactly how this works, then realized, I probably don't know enough about the exact details. So I'll do a shorter answer....
So, when you write something on one processor, if it's not already in that processor's cache, it will have to be fetched in, and after the processor has read the data, it will perform the actual write. In doing so, it will send a cache-invalidate message to ALL other processors in the system. These will then throw away their copies. If another processor has "dirty" content, it will itself write out the data and ask for an invalidation - in which case the first processor will have to RELOAD the data before finishing its write (otherwise, some other element in the same cache line may get destroyed).
Reading it back into the cache will be required on every other processor that is interested in that cache-line.
The __sync_fetch_and_add() will use a "lock" prefix [on x86; other processors may vary, but the general idea on processors that support "per instruction" locks is roughly the same] - this issues an "I want this cacheline EXCLUSIVELY, everyone else please give it up and invalidate it" message. Just like in the first case, the processor may well have to re-read anything that another processor may have made dirty.
A memory barrier will not ensure that data is updated "safely" - it will just make sure that "whatever happened (to memory) before now is visible to all processors by the time this instruction finishes".
The best way to optimize the use of processors is to share as little as possible, and in particular, avoid "false sharing". In a benchmark many years ago, there was a structure like [simplified] this:
struct stuff {
    int x[2];
    ... other data ...    /* total data: a few cache lines */
} data;

void thread1()
{
    for( ... big number ...)
        data.x[0]++;
}

void thread2()
{
    for( ... big number ...)
        data.x[1]++;
}

int main()
{
    start = timenow();
    create(thread1);
    create(thread2);
    end = timenow() - start;
}
Since EVERY time thread1 wrote to x[0], thread2's processor had to get rid of its copy of x[1], and vice versa, the result was that the SMP test [vs. just running thread1] ran about 15 times slower. By altering the struct like this:
struct stuff {
    int x;
    ... other data ...
} data[2];

and

void thread1()
{
    for( ... big number ...)
        data[0].x++;
}
we got 200% of the single-thread performance [give or take a few percent].
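A self-contained version of that experiment (a sketch: the iteration count, the 64-byte padding and the volatile counters are choices made here so the loops don't get optimized away, not details from the original benchmark):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Two counters packed into the same cache line (false sharing)... */
struct { volatile long a, b; } same_line;

/* ...versus two counters spaced 64 bytes apart, i.e. on separate lines. */
struct padded { volatile long v; char pad[64 - sizeof(long)]; };
struct padded separate[2];

static void *bump_same_a(void *arg) { for (long i = 0; i < ITERS; i++) same_line.a++; return arg; }
static void *bump_same_b(void *arg) { for (long i = 0; i < ITERS; i++) same_line.b++; return arg; }
static void *bump_sep_0(void *arg)  { for (long i = 0; i < ITERS; i++) separate[0].v++; return arg; }
static void *bump_sep_1(void *arg)  { for (long i = 0; i < ITERS; i++) separate[1].v++; return arg; }

static double run_pair(void *(*f)(void *), void *(*g)(void *)) {
    struct timespec t0, t1;
    pthread_t x, y;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&x, NULL, f, NULL);
    pthread_create(&y, NULL, g, NULL);
    pthread_join(x, NULL);
    pthread_join(y, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("same cache line:      %.2f s\n", run_pair(bump_same_a, bump_same_b));
    printf("separate cache lines: %.2f s\n", run_pair(bump_sep_0, bump_sep_1));
    return 0;
}

Compile with -pthread; the "same cache line" phase is typically several times slower on a multi-core machine.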
Right, so the processor has queues of buffers where write operations are stored while the processor is writing to memory. A memory barrier (mfence, sfence or lfence) instruction is there to ensure that any outstanding read/write, write, or read operation has completely finished before the processor proceeds to the next instruction. Normally, the processor would just continue on its jolly way through any following instructions, and eventually the memory operation becomes fulfilled one way or another. Since modern processors have a lot of parallel operations and buffers all over the place, it can take quite some time before something ACTUALLY trickles through to where it will eventually end up. So when it's CRITICAL to make sure that something has ACTUALLY been done before proceeding (for example, if we have written a bunch of commands to video memory and now want to kick off their execution, we need to make sure the writing has actually finished and that some other part of the processor isn't still working on it), use an sfence to make sure the write has really happened. That may not be a very realistic example, but I think you get the idea.
Cache writes have to get line ownership before dirtying the cache line. Depending on the cache coherence model implemented in the processor architecture, the time taken for this step varies. The most common coherence protocols that I know of are:
Snooping coherence protocol: all caches monitor the address lines for cached memory lines, i.e. every memory request has to be broadcast to all CPUs, which does not scale as the number of CPUs increases.
Directory-based coherence protocol: the set of CPUs sharing each cache line is kept in a directory, so invalidating/gaining ownership is a point-to-point CPU request rather than a broadcast, i.e. more scalable, but latency suffers because the directory is a single point of contention.
Most CPU architectures have something called a PMU (performance monitoring unit). This unit exports counters for many things like cache hits, misses, cache write latency, read latency, TLB hits, etc. Please consult the CPU manual to see whether this info is available.
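On Linux, the portable way to read those PMU counters is the perf_event_open(2) syscall; a minimal sketch counting hardware cache misses around a region of code (error handling omitted, and the event selection is just an example):

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Count for this thread, on any CPU. */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... the code whose cache behaviour you want to measure ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %lld\n", misses);
    close(fd);
    return 0;
}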

How to flush the CPU cache in Linux from a C program?

I am writing a C program in which I need to flush my memory. I would like to know if there is any UNIX system command to flush the CPU cache.
This is a requirement for my project which involves calculating the time taken for my logic.
I have read about the cacheflush(char *s, int a, int b) function but I am not sure as to whether it will be suitable and what to pass in the parameters.
I take it you mean "CPU cache", not memory cache
The link above is good: the suggestion "write a lot of data via CPU" is not Windows specific
Here's another variation on the same theme:
How to clear CPU L1 and L2 cache
Here's an article about Linux and CPU cache:
http://lwn.net/Articles/252125/
NOTE:
At this (very, very low) level, "Linux" != "Unix"
This is how Intel suggests flushing the cache:
#include <stddef.h>

void mem_flush(const void *p, size_t allocation_size)
{
    const size_t cache_line = 64;
    const char *cp = (const char *)p;
    size_t i = 0;

    if (p == NULL || allocation_size == 0)
        return;

    /* clflush evicts the line containing the given address from every
       level of the cache hierarchy, writing it back if it is dirty. */
    for (i = 0; i < allocation_size; i += cache_line) {
        asm volatile("clflush (%0)\n\t"
                     :
                     : "r"(&cp[i])
                     : "memory");
    }

    /* Fence so the flushes are ordered before later stores. */
    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");
}
If you're writing a user-mode (not kernel-mode) program, and if it's single-threaded, then there's really no reason for you to ever bother flushing your cache in the first place. Your user-mode program can just forget that it even exists; it's just there to speed up your program's execution, and the OS manages it via the processor's MMU.
There are only a couple reasons I can think of that you might actually want to flush the cache from your user-mode application:
1. Your app is intended to run on a symmetric multiprocessor system, or has data transactions with external hardware.
2. You're simply testing your cache for some sort of performance test (in which case you should probably be writing your test to operate in kernel mode, perhaps as a driver).
In any case, assuming you're using Linux...
#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);
This assumes you have a block of memory you just wrote to and you want to make sure it's flushed out of the cache back to main memory. The block begins at addr, and it's nbytes long, and it's in one of the two caches (or both):
ICACHE Flush the instruction cache.
DCACHE Write back to memory and invalidate the affected valid cache lines.
BCACHE Same as (ICACHE|DCACHE).
Normally you'd only need to flush the DCACHE, since when you write data to "memory" (i.e. to the cache), it's normally data, not instructions.
If you want to flush "all of the cache" for some strange testing reason, you could malloc() a big block that you know is larger than your CPU's cache (shoot, make it 8 times as big!), write any old garbage into it, and just flush that entire block.
See also: How to perform cache operations in C++?
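Putting those two ideas together, a sketch that allocates a buffer much larger than a typical cache, dirties it, and pushes it all out with the mem_flush() routine shown earlier (the 64 MB size is an arbitrary "bigger than my cache" guess):

#include <stdlib.h>
#include <string.h>

/* mem_flush() as defined in the answer above */
void mem_flush(const void *p, size_t allocation_size);

void trash_and_flush_cache(void) {
    const size_t size = 64u * 1024 * 1024;   /* well beyond a typical last-level cache */
    char *buf = malloc(size);
    if (!buf)
        return;
    memset(buf, 0xA5, size);   /* dirty every line so there is something to write back */
    mem_flush(buf, size);      /* clflush each line, then sfence */
    free(buf);
}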
OK, sorry about my first answer. I later read your follow-up comments below your question, so I realize now that you want to flush the INSTRUCTION CACHE to boot your program (or parts of it) out of the cache, so that when you test its performance, you also test its initial load time out of main memory into the instruction cache. Do you also need to flush any data your code will use out to main memory, so that both data and code are fresh loads?
Before anything else, I'd like to mention that main memory itself is also a form of cache, with your hard disk (either the program on disk, or swap space on disk) being the lowest, slowest place your program's instructions could be coming from. That said, when you run through a routine for the first time, if it hasn't already been loaded into main memory from disk by virtue of being near other code that has already executed, then its CPU instructions will first have to be loaded from disk. That takes an order of magnitude or more longer than loading from main memory into the cache. Then, once it's loaded into main memory, it takes roughly an order of magnitude longer to load from main memory into the cache than it takes to load from the cache into the CPU's instruction fetcher. So if you want to test your code's cold-start performance, you have to decide what cold-start means: pulling it off disk, or pulling it out of main memory. I don't know of any command to "flush" instructions/data from main memory to swap space, so flushing to main memory is about as much as you can do (that I know of), but keep in mind that your test results may still differ from the first run (when it may be pulling it off disk) to subsequent runs, even if you do flush the instruction cache.
Now, how would one go about flushing the instruction cache to ensure that their own code is flushed out to main memory?
If I needed to do this (a very odd thing to do, in my opinion), I'd probably start by finding the length and approximate placement of my functions in memory. Since I'm using Linux, I'd issue the command "objdump -d {myprogram} > myprogram.dump.txt", then I'd open myprogram.dump.txt in an editor, search for the functions I want to flush out, and figure out how long they are by subtracting their start address from their end address using a hex calculator. I'd write down the sizes of each. Later I'd add cacheflush() calls in my code, giving the address of each function I want to flush out as 'addr' and the length I found as 'nbytes', with ICACHE. Just for safety I'd probably fudge a little and add about 10% to the size, in case I make a few tweaks to the code and forget to adjust the nbytes. I'd make a call to cacheflush() like this for each function I want to flush out. Then if I need to flush out the data also, if it's global/static data, I can flush that too (DCACHE), but if it's stack or heap data, there's really nothing realistic that I can (or should) do to flush it out of the cache. Trying to do so would be an exercise in silliness, because it would be creating a condition that would never or very rarely exist in normal execution.
Assuming you're using Linux...
#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);
...where cache is one of:
ICACHE Flush the instruction cache.
DCACHE Write back to memory and invalidate the affected valid cache lines.
BCACHE Same as (ICACHE|DCACHE).
BTW, is this homework for a class?
