After implementing the basic L1 and L2 cache routines in the Linux kernel (arch/arm/mm/cache-X.S), for example for an ARM11 processor, is there a test utility or program that checks whether the cache works properly, i.e. that invalidation and flushing happen correctly? How can we verify this instead of relying only on our own programs?
You can use the perfcounters subsystem (called perf_events in current kernels). It is an abstraction over CPU performance counters, which are hardware registers recording events like cache misses, instructions executed, branch mispredictions etc. It also abstracts software events such as minor/major page faults, task migrations, task context switches and tracepoints. The perf tool can be used to monitor and verify cache behaviour: for instance, to check that cache flushing works, fill the cache, flush it, measure the cache misses on subsequent memory accesses, and compare the count with what you expect.
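As a rough illustration of the counter-based check, here is a minimal C sketch that counts hardware cache misses around a buffer walk using the perf_event_open system call. The helper names (open_cache_miss_counter, touch_buffer, count_misses_for_walk) and the 64-byte LINE_SIZE are assumptions made for this example and are not part of any kernel test suite. The counter may be unavailable if the PMU driver, the permissions, or the exclude_kernel filter are not supported on your platform; in that case the function returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed cache line size; check your CPU's documentation. */
enum { LINE_SIZE = 64 };

/* glibc has no perf_event_open() wrapper, so go through syscall(2). */
static int open_cache_miss_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;       /* start stopped, enable explicitly later */
    attr.exclude_kernel = 1; /* count user-space accesses only */
    attr.exclude_hv = 1;

    /* pid = 0: this thread, cpu = -1: any CPU, no group, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Read one byte per assumed cache line so every line is touched once. */
static uint64_t touch_buffer(volatile unsigned char *buf, size_t len)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < len; i += LINE_SIZE)
        sum += buf[i];
    return sum;
}

/* Returns the cache misses counted during one walk of buf,
 * or -1 if the counter could not be opened or read. */
static long long count_misses_for_walk(volatile unsigned char *buf, size_t len)
{
    long long count = -1;
    int fd;

    assert(buf != NULL && len > 0);

    fd = open_cache_miss_counter();
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    (void)touch_buffer(buf, len);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        count = -1;
    close(fd);
    return count;
}
```

Calling count_misses_for_walk on the same buffer before and after the flush or invalidate path has run, and comparing the two counts with what you expect, is one way to apply the check described above. The absolute numbers are noisy, so repeat the measurement several times.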
Have a look at LMBench, a thorough benchmark suite that runs on almost every Linux platform (I have used it successfully on x86, ARM9 and Cortex-A8). It will let you measure cache performance.
If the cache you mean is the page cache that Linux keeps in RAM (not the CPU cache), you can drop it:
echo "1" > /proc/sys/vm/drop_caches
echo "0" > /proc/sys/vm/drop_caches
free -m
I have the following problem: I have a low-latency application running on core 0, and a regular application running on core 1. I want to make sure that the core 0 app gets as much cache as possible, so I want core 1 to bypass the L3 cache (not use it at all) and go directly to memory for data.
Are there any other ways I can achieve that core 0 app gets the priority in using the L3 cache?
Some Intel CPUs support partitioning the L3 cache between different workloads or VMs; the feature is called Cache Allocation Technology (CAT). It's been supported since Haswell Xeon (v3), and apparently 11th-gen desktop/laptop CPUs.
Presumably you need to let each workload have some L3, probably even on Skylake-Xeon and later where L3 is non-inclusive, but you might be able to give it a pretty small share and still achieve your goal.
More generally, https://github.com/intel/intel-cmt-cat has tools (for Linux and somewhat for FreeBSD) for managing that and other parts of what Intel's now calling "Resource Director Technology (RDT)" for monitoring, CAT, and Memory Bandwidth Allocation. It also has a table of features by CPU.
What you describe would be literally impossible on a desktop Intel CPU (or Xeon before Skylake), as they use inclusive L3 cache: a line can only be in L2/L1 if it's in L3 (at least tags, not the data if a core has it in Modified or Exclusive state). Skylake-X and later Xeons have non-inclusive L3, so it would be possible in theory; IDK if CAT lets you give one set of cores zero L3.
I don't know if any AMD or ARM CPUs have something similar. I just happen to know of the existence of Intel's hardware support for this, not something I've ever gone looking for or used myself.
My C program, which sorts integers read from a file, runs 10x slower the first time than on later runs. Even if I change the numbers in the file, the program still runs fast. After I restart the PC, the very first run is 10x slower again. I use time to measure the run time.
The operating system holds the data in RAM even if it's not needed anymore (this is called "caching"), so when the program runs again, it gets all data from there and there's no disk I/O. Even when you change the data, that change happens in RAM first, and it stays there even after it's written to the file.
It doesn't stay in RAM forever though, mind you. If the memory is needed for something else, the cache is deleted. At that point, a disk access is needed (and it's cached in RAM again at that point.)
This is why first access after a reboot is always slow; the data hasn't been cached yet since it was never read from the file.
You have to form hypotheses and test them against reality. The first one you can reasonably make is that this smells a lot like a caching issue!
Ask yourself these questions:
Does my data fit in free RAM (i.e. is my file cached by the OS filesystem cache)?
Does my data fit in the CPU data cache?
Does my data fit in the HDD's internal cache?
The easiest hypothesis to rule out is the FS cache. Under Linux, just run sync; echo 3 > /proc/sys/vm/drop_caches (as root) between each call to your program. The first command makes sure cached data reaches the physical medium (hard drive); the second drops the contents of the filesystem cache from memory.
The 'physical medium' might be the HDD cache itself, so beware... Under Linux you can disable this "write-back" cache with hdparm -W 0 <device>; for instance, if you are working with drive sda, hdparm -W 0 /dev/sda will do the job. You might want to re-enable it after you have finished your tests :)
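For the filesystem-cache hypothesis, a narrower variation that does not need root is to ask the kernel to drop the cached pages of just the input file with posix_fadvise(POSIX_FADV_DONTNEED), then time a full read. This is a different technique from writing to drop_caches, and the advice is only a hint that the kernel may ignore. The function names below are made up for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Ask the kernel to drop the page-cache pages of one file.
 * Returns 0 if the advice was accepted, -1 otherwise. */
static int evict_file_from_page_cache(const char *path)
{
    int fd, rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Dirty pages cannot be dropped, so write them back first. */
    fsync(fd);
    /* offset 0, length 0 means "the whole file". */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return rc == 0 ? 0 : -1;
}

/* Read the whole file once and return the elapsed seconds,
 * or -1.0 on error. The byte count is stored in *bytes_out. */
static double time_full_read(const char *path, long long *bytes_out)
{
    static char chunk[1 << 16];
    struct timespec t0, t1;
    long long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, chunk, sizeof chunk)) > 0)
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (n < 0)
        return -1.0;
    if (bytes_out != NULL)
        *bytes_out = total;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_full_read twice in a row, and then again after evict_file_from_page_cache, should show whether the slow first run can be reproduced without rebooting.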
Another hypothesis is the CPU cache; have a look at "How can I do a CPU cache flush in x86 Windows?" and "How to clear CPU L1 and L2 cache".
Well, it may or may not be one of those, but it doesn't hurt trying :)
If your program does network access, that could be the reason for the initial delay. Many network protocols need time to set things up. Some examples:
DNS: if your program does any network access, chances are it needs to resolve a hostname to an IP address. The first time it would need at least a network round trip to populate a local cache. Following requests would be shorter.
Networked filesystems (NFS, CIFS and others): opening files can happen through the network.
Even some seemingly innocuous library functions can require network access: the users list for the host can be on a remote directory server.
Apart from this, you could use a low-level tracing tool to see where the time is spent. On Linux a basic tool is strace -r. There are probably similar tools for other systems. Your toolchain probably also comes with a profiler (e.g. gprof for GCC), or you could try Valgrind.
I had a very similar issue but I wasn't loading in a large file - so I was baffled at the long first execution time (caching couldn't have been the issue).
This answer pointed me in the right direction - it was my real-time anti-virus protection. Every time I recompiled the program it would re-scan it as being potentially malicious. I added my project path as an "Exception" to Avira's (in my case) real-time virus protection.
Program execution on the first run is now lightning quick!
This is nothing new, and it's not just your program: many popular commercial software packages face this problem.
To start with, check this MATLAB article about slow first-time execution.
For languages that run on a virtual machine, like C# or Java, this is quite common:
http://en.wikipedia.org/wiki/Just-in-time_compilation#Startup_delay_and_optimizations
Caching is a plausible reason for that to happen in C, but 10x is still quite a large difference. It is also possible that your system was still loading other resources after the restart.
For better results, run the program some time after the restart, say 10 minutes, so that all the startup applications have been loaded by then. (The right delay depends on the number of startup applications and how long each takes to start.)
This is because of compiler optimization: it caches the result for temporal locality and the activation record is saved; time is also saved because the bound objects do not have to be reloaded during the linking stage.
There are two components to the time measured if you are reading a file from disk, loading it into memory, and sorting it:
1) Time to read the file and store it in an array
2) Time to sort
Were these measured separately?
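One way to measure the two components separately is to time each phase with clock_gettime. The sketch below assumes the input is a text file of whitespace-separated integers and uses qsort as a stand-in for whatever sort the program really uses; read_then_sort and its helpers are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Seconds elapsed since *start on the monotonic clock. */
static double seconds_since(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (double)(now.tv_sec - start->tv_sec)
         + (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/* Three-way comparison that cannot overflow, unlike x - y. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;

    return (x > y) - (x < y);
}

/* Reads integers from path, sorts them, and reports the time of
 * each phase. Returns the element count, or -1 on error.
 * On success the caller owns *out and must free() it. */
static long read_then_sort(const char *path, int **out,
                           double *read_s, double *sort_s)
{
    struct timespec t0;
    size_t cap = 1024, n = 0;
    int *v, x;
    FILE *f;

    assert(path != NULL && out != NULL && read_s != NULL && sort_s != NULL);

    /* Phase 1: open, parse, and store (this is where disk I/O shows up). */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    v = malloc(cap * sizeof *v);
    if (v == NULL) {
        fclose(f);
        return -1;
    }
    while (fscanf(f, "%d", &x) == 1) {
        if (n == cap) {
            int *grown = realloc(v, 2 * cap * sizeof *v);
            if (grown == NULL) {
                free(v);
                fclose(f);
                return -1;
            }
            v = grown;
            cap *= 2;
        }
        v[n++] = x;
    }
    fclose(f);
    *read_s = seconds_since(&t0);

    /* Phase 2: pure in-memory sorting. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    qsort(v, n, sizeof *v, compare_ints);
    *sort_s = seconds_since(&t0);

    *out = v;
    return (long)n;
}
```

If only the read phase is slow on the first run while the sort phase stays the same, the difference comes from file I/O and caching rather than from the sorting code.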
Can you check this out: "Invalidating Linux Buffer Cache"?
Instead of restarting, clear the cache and repeat the experiment. If that gives the same slow result as a restart, you can infer that file buffer caching explains the difference and was not accounted for in your measurement.
We know that there are levels in the memory hierarchy: cache, primary storage and secondary storage.
Can we use a C program to selectively store a variable in a specific level of the memory hierarchy?
Reading your comments in the other answers, I would like to add a few things.
Inside an operating system you cannot choose in which level of the memory hierarchy your variables will be stored, since the operating system controls the metal and forces you to play by its rules.
Despite this, you can do something that MAY get you close to testing access time in cache (mostly L1, depending on your test algorithm) and in RAM memory.
To test cache access: warm up by accessing a variable a few times. Then access the same variable many times (cache is super fast, so you need a lot of accesses to measure the access time).
To test the main memory (aka RAM): disable the cache memory in the BIOS and run your code.
To test secondary memory (aka disk): disable disk cache for a given file (you can ask your Operating System for this, just Google about it), and start reading some data from the disk, always from the same position. This might or might not work depending on how much your OS will allow you to disable disk cache (Google about it).
To test other levels of memory, you must implement your own "Test Operating System", and even with that it may not be possible to disable some caching mechanisms due to hardware limitations (well, not actually limitations...).
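For the cache and RAM cases, a common way to make the "access it many times" idea measurable is pointer chasing: every load depends on the result of the previous one, so the average time per step approximates the access latency of whichever level the working set fits in. The sketch below is only an illustration; the function names are hypothetical, and the choice of array sizes is left to the reader.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with one random cycle through all n slots (Sattolo's
 * algorithm), so that hardware prefetchers cannot guess the next
 * address and every slot is visited exactly once per lap. */
static void build_random_cycle(size_t *next, size_t n)
{
    assert(next != NULL && n > 1);

    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i; /* j is in [0, i - 1] */
        size_t tmp = next[i];

        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads starting at slot 0.
 * Returns the average nanoseconds per load; the final slot is stored
 * in *end_out so the compiler cannot discard the loop. */
static double chase_ns_per_load(const size_t *next, size_t steps,
                                size_t *end_out)
{
    struct timespec t0, t1;
    size_t i = 0;
    double ns;

    assert(next != NULL && steps > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        i = next[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (end_out != NULL)
        *end_out = i;
    ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
       + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

Run one full lap as a warm-up, then time many laps. Repeating this with an array small enough for L1, then for L2, then much larger than the last-level cache should show the per-load time rising in steps, although the OS and other processes will add noise.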
Hope I helped.
Not really. Cache is designed to work transparently. Almost anything you do will end up in cache, because it's being operated on at the moment.
As for secondary storage, I assume you mean HDD, file, cloud, and so on. Nothing really ever gets stored there unless you do so explicitly, or set up a memory mapped region, or something gets paged to disk.
No. A normal computer program only has access to main memory. Even secondary storage (disk) is usually only available via operating system services, typically used via the stdio part of the library. There's basically no direct way to inspect or control the hierarchical caches closer to the CPU.
That said, there are cache profilers (like Valgrind's cachegrind tool) which give you an idea about how well your program uses a given type of cache architecture (as well as branch prediction), and which can be very useful to help you spot code paths that have poor cache performance. They do this essentially by emulating the hardware.
There may also be architecture-specific instructions that give you some control over caching (such as "nontemporal hints" on x86, or "prefetch" instructions), but those are rare and peculiar, and not commonly exposed to C program code.
It depends on the specific architecture and compiler that you're using.
For example, on x86/x64, most compilers have a variety of levels of prefetch instructions which hint to the CPU that a cache line should be moved to a specific level in the cache from DRAM (or from higher-order caches, e.g. from L3 to L2).
On some CPUs, non-temporal prefetch instructions are available that when combined with non-temporal read instructions, allow you to bypass the caches and read directly into a register. (Some CPUs have non-temporals implemented by forcing the data to a specific way of the cache though; so you have to read the docs).
In general though, between L1 and L2 (or L3, or L4, or DRAM) it's a bit of a black-hole. You can't specifically store a value in one of these - some caches are inclusive of each other (so if a value is in L1, it's also in L2 and L3), some are not. And the caches are designed to periodically drain - so if a write goes to L1, eventually it works its way out to L2, then L3, then DRAM - especially in multi-core architectures with strong-memory models.
You can bypass them entirely on write (use streaming-store or mark the memory as write-combining).
You can measure the different access times by:
Using memory-mapped files as a backing store for your data (this will measure the time for first access to reach the CPU - just wrap it in a timer call, such as QueryPerformanceCounter or __rdtscp). Be sure to unmap and close the file from memory between each test, and turn off any caching. It'll take a while.
Flushing the cache between accesses to get the time-to-access from DRAM.
It's harder to measure the difference between different cache levels, but if your architecture supports it, you can prefetch into a cache level, spin in a loop for some period of time, and then time the access.
It's going to be hard to do this on a system using a commercial OS, because they tend to do a LOT of work all the time, which will disturb your measurements.
In modern multicore processors, we normally have a local L1 cache per core but a shared L2 cache. Is it possible to bypass the L1 cache for some portion of memory while still using the L2 cache for it? I want to do this to improve timing predictability, possibly at the cost of performance.
As far as I know, there is no way to bypass the L1 cache on mainstream CPUs.
However, to achieve your goal (i.e. avoid cache misses that may cause variation in timing measurements), you may try asking your compiler to prefetch the data into the cache.
If you use GCC or LLVM, see __builtin_prefetch.
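A minimal sketch of how __builtin_prefetch might be used, assuming an indexed gather loop where the access pattern is known a few iterations ahead. The function name and the PREFETCH_AHEAD distance are assumptions for illustration; the right distance depends on the CPU and has to be tuned by measurement. The builtin takes the address, a read/write flag (0 for read), and a temporal-locality hint from 0 to 3, and the CPU is free to ignore it.

```c
#include <assert.h>
#include <stddef.h>

/* How many iterations ahead to prefetch; a guess that needs tuning. */
enum { PREFETCH_AHEAD = 16 };

/* Sum data[idx[i]] for i in [0, n), hinting the cache about the
 * element that will be needed PREFETCH_AHEAD iterations later. */
static long sum_with_prefetch(const int *data, const size_t *idx, size_t n)
{
    long sum = 0;

    assert(data != NULL && idx != NULL);

    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* 0 = prefetch for reading, 3 = high temporal locality */
            __builtin_prefetch(&data[idx[i + PREFETCH_AHEAD]], 0, 3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch does not change the result, only (possibly) the timing, so it does not guarantee predictability: a line can still be evicted before it is used.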
However, your question is quite vague, and I am unsure that your proposal will suit your needs.
Caches
I strongly suspect that you have misunderstood what a cache does and what it is for.
Caches are transparent from the point of view of memory contents. If one core writes to a memory location then every other core whose caches (L1, L2, L3 etc), shared or not, happen to be caching that location will get updated also.
Note that that does not mean that the cores aren't able to race for the value. You can still have a race condition whereby one core reading a location fractionally before another writes it 'gets the wrong value'. Furthermore that will happen whether or not your CPU has caches of any sort. To solve that 'ordering' problem you have to use semaphores or other IPC primitives in your source code.
Some cache systems do allow you to 'drop hints' to them. Matthieu Rouget gave an example of that with __builtin_prefetch. These sorts of things allow the programmer to tell the cache system that it might well be worth getting some data in advance. Some systems (e.g. PowerPC 7450) sort of allowed the programmer to use part of the cache as memory instead of cache, kind of the ultimate in programmer cache control.
However, none of these things make any difference to the view of memory that all the caches have. If one cache's contents get updated, the rest are also updated.
Caches and Performance Programming
The very best programmers are able to extract peak performance from a CPU by coding around the behaviour of the cache. In that realm one generally finds oneself wishing that the cache wasn't there at all. The ultimate embodiment of this is the Cell processor in the PS3. The maths cores on that have no cache at all. Instead you have to in effect do all your own data fetching and write back yourself in your source code, rather than leave it up to some cache to second guess what data your program is going to ask for. Get it right and the performance is still blisteringly good.
Bus Snooping
Some CPUs don't have cache bus snooping, which can be a particular problem when writing device drivers. Bus snooping is a mechanism whereby the CPU caches spot the content of memory being updated by something other than the CPU cores (e.g. by a DMA controller reading data from a device). And the same the other way round - DMAs from memory get values currently stuck in cache. AFAIK almost all CPUs these days do bus snooping, so that is not likely to be a problem.
On systems with IO as well as memory address spaces (e.g. Intel) I don't think that I/O address space is cached anyway. For systems with memory mapped devices their memory is generally not cached either, and the OS sets up the CPU that way (see this).
Timing Predictability
To return to the reason for your question - timing predictability. You may be using the wrong technology. If your system has timing constraints whereby the problem is variations in main memory write times, then frankly using a multicore CPU sounds like the wrong thing in the first place. @Griwes is quite right on that point (and indeed the entire comment). You'll more likely need to resort to a pure hardware design, something along the lines of an FPGA (no comments about whether firmware is really software please!).
If, as I suspect, you're actually trying to avoid using semaphores and other IPC primitives to synchronise two threads in your system then you're not going to succeed, shared caches or not. You need to use semaphores and such to make your code work properly.