Is there a programming approach to disable and enable smart cache capability in an intel cpu through c or c++ or maybe assembly code. i would like to measure algorithm performance with and without smart cache!is there such option availables or not?. I search alot but did not find anything useful. my cpu is intel 6700hq.
Smart cache is a architectural feature, and relies on a certain hardware structure to be present (in detail, the L2/L3 caches of individual cores to not be separated, as well as certain optimizations in data prefetch logic etc.). As such, it is highly unlikely that this feature can be disabled (although I was unable to find any reference on this).
Related
I have read an article C Is Not a Low-level Language, where is such paragraph:
Unfortunately, simple translation providing fast code is not true for
C. In spite of the heroic efforts that processor architects invest in
trying to design chips that can run C code fast, the levels of
performance expected by C programmers are achieved only as a result of
incredibly complex compiler transforms. The Clang compiler, including
the relevant parts of LLVM, is around 2 million lines of code. Even
just counting the analysis and transform passes required to make C run
quickly adds up to almost 200,000 lines (excluding comments and blank
lines).
What does a bold sentence mean? Does it mean that manufacturers design processors with some optimizations and architecture decisions targeted firstly or even specifically to C (C++) code? Or it just means that they are trying to design processors that executes any code faster, including the code written in C language?
If some preferences to C exists, what are they?
My couple of thoughts:
a branch prediction algorithm tuned in to patterns happening mainly in C code.
instructions which are useful and used in C but aren't useful in other languages. Otherwise other languages (compilers) will use them too.
I knows about language specific processors like Jazelle or Lisp machine for Java and Lisp respectively, but similar technologies can't be applied to C, because there are no bytecode.
Processors don't necessarily have optimizations targeted at C, but they do provide features to make C (and other procedural languages in general) map more cleanly to the platform.
Take cache-coherency in a multi-threaded environment as an example. From a C perspective, a global variable shared by two threads should look the same to both threads. If one thread writes to it the other should be able to see those modifications. But in a multi-core CPU with independent caches, that takes extra effort to support. Core 1 has to be able to detect that core 2 is accessing an address it has modified in cache and flush that out to memory (or somehow share it directly to core 2's cache).
That's essentially the thesis of that entire article. C's abstract machine model doesn't necessarily map cleanly to real modern high-performance processors like it did to the (by comparison extremely simple) PDP-11, and CPUs and compilers have to take great pains to paper over those differences.
The "heroic efforts" of the processor architects is largely referring to the design of cache and memory subsystems on the CPUs.
For a very long time now, the instruction executions circuits inside the CPUs have been far, far quicker than the electronics that looks after fetching/writing data from/to memory, largely because the technologies we have for RAM chips is hasn't really got better. Where the cores have speeded up the memory hasn't, and so the cache and memory subsystem has to get ever more elaborate in order to be able to pre-fetch data and move it towards the execution circuits ahead of time. Needless to say, this doesn't always pan out well.
It's also partly because of the physical distance between the CPU and RAM chips. Though only a few inches (if that) of track on a motherboard, that distance is significant; the speed of a signal down the track is about 1ns every 8 inches. For signals clocked in the GHz range (1 cycle << 1ns), a short track is a long way. This is partly why Apple have gone down the route of putting RAM onto the same package as the CPU in the home-grown M1 silicon.
Back to caches - the likes of Intel (and AMD, ARM) have strived to make CPUs that have good, general purpose performance, so that they run pretty much any code well. Modern compilers help a lot - if they know what the cache in the CPU is likely to do in any particular circumstance, the compilers can arrange code to fit in with what the hardware is likely to do.
A reasonable question then is, is that effective? Well, yes and no. Yes, because compiled code does run quite well, but no for a couple of reasons. The first is that ultimate performance for any given algorightm is rarely reached by the compiler / CPU, and secondly all this complexity makes it nigh on impossible for a good programmer to do their own optimisation.
Some CPUs help out the hero-programmer here. PowerPC (at least some variants) has instructions where the programmer can give the cache system a hint that the programme will shortly need data from such-and-such a location in RAM. The CPU uses that instruction to pre-load the L1 cache with that data, so that when the program actually starts to perform operations on data at that address it's already in cache.
The IBM Cell processor took this to a whole new level. The SPE math cores (there were 8 of them) had no cache, and no way of addressing data in CPU RAM at all. What there was instead was 256K of static RAM per core into which all code and data had to fit, and a way for code to push code and data in and out of that static RAM very quickly (256Gbyte/sec at the time, which was very very quick). The developer was completely on their own; they had to write code to load code and data into a core, get that executed, and then write more code to get the results out to wherever. This was actually pretty liberating; instead of having a cache and memory subsystem trying to automatically deliver data to executions cores, get in the way or (worse) just hide inefficiencies from you, one had the freedom to break down an algorigthm into core-sized lumps knowing that if it fitted, it'd be very quick, or knowing for sure it didn't fit.
Miles Budnek's answer addresses the issues that arise from multi-core CPUs with a cache-coherency and a Symetric Multi Processing (SMP) environment. It's even harder for the cache designer to get it right if there's multuple cores that might very well start tampering with a value. The difficulties involved has lead to vulnerabilities like Meltdown and Spectre.
SMP could be said to be an "optimisation" put into CPUs by designers to aid the C (or other) developer in transitioning code from single to multiple thread. It's an attractive thought - in the way that a single thread programme can see all of it's data merely by addressing it, why not extend the same visibility of data to all threads in the programme?
Turns out that this is what makes it very difficult to design modern CPUs. However the reasons why the industry went this way are plain enough - the smallest possible delta between single and multicore CPUs was going to be the least troublesome for the existing software community to adopt. That's perfectly reasonable.
But it is running out of steam, fast. A better approach (if the goal is the outright pursuit of performance) would be to go back to the old Transputer architectures from Inmos from the 1980s, early 1990s. In such architectures, data held by one core could only be processed by another if the software was written to explicitly transfer the data. Sounds familiar? Yes - Cell process was a bit like that.
Interestingly, languages such as Rust, Go, Erlang have all implemented Communicating Sequential Processes as a parallel processing paradigm. The irony is that, these days, CSP has to be implemented on top of a SMP environment, which is itself an artificial construct brought about by the interconnect between CPUs, cores and memory (e.g. QPI, Hypertransport). Basically, if the whole software world got fully comfortable with CSP then CPU designers wouldn't have to design cache-coherency into their multi-core CPUs. Rust in particular is very well suited, as it already has a strong concept of data ownership in its syntax (which could be leveraged to shovel data around between cores automatically).
The article referred to by the OP seems to me to have it in for C for some reason. There were so many points in it I felt triggered by, but I don't want to go addressing each one point by point. Maybe there is some bias or special interest that has not been declared. As a C programmer, with a particular interest in writing high performance programs, I thought I'd give my two cents on some of the issues raised. Hopefully, this might be of interest to others in the industry with or without a programming background.
From my point of view, the strengths of C are mainly as follows....
C allows you to do things you just can't do in 'higher level' languages.
A well written C (see weakness no.1) program is hard to beat on performance on the same hardware, written in another language.
C is comfortable handling binary data allowing for memory conservation.
C is well established with lots of libraries and programmers.
Objects in memory can be made easy to process from anywhere in the program by using pointers so the data itself doesn't need to be passed around.
Multi-threaded and multi-process programs are quite easy to implement.
It has Read-Write shared memory between threads (and processes with some fancy low-level stuff?)
Assembly can be inlined where needed (though it's not C then I know!).
... and main weaknesses...
Utilising SIMD capabilities is not possible in standard C, and difficult to implement in a portable way with intrinsics.
It takes a lot of code to do simple things for which there are no library functions.
Buffer overflow potential is easily missed, even for experienced programmers.
C pointers can be confusing.
The C programming language has a special place in the evolution of programming languages and I for one, would welcome a replacement that is a better fit to what is possible with modern hardware if it doesn't tie the hands of the programmer and offers better security and performance. From the article,...
'A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.'
Such things exist already, GPUs! Modern CPUs are much more like GPUs than they used to be now core counts can be 100+. I have used OpenCL C to write programs with amazing computational performance but they can't do everything well. Some applications can not be efficiently parallelised, if at all. OpenCL C program performance can become terrible when there is even a small amount of branching. Also, it is so much easier to exhause your memory bandwidth and fast cache when running many threads that it might be judged not worth the added complexity over a good single threaded implementation.
In OpenCL C, the programmer has somewhat more control of where data is stored in memory which can definately aid performance. Maybe it's a costly mistake to try to make programming languages too hardware independent. Might it be better to review some (LLVM like) intermediate standard, like in OpenCL C, where one can define 'private', 'local' and 'constant' memory objects to get performance improvements over 'global' memory objects. Such a standard wouldn't need to be tied to an instruction set. As a programmer, I welcome fast CPU instructions but it would be nice if they could be much more easily utilised in portable code AND compilable to portable binaries. Maybe this is something compiler writers could look into along with using SIMD vector registers rather than memory for pushing and popping. As I see it, there are four levels of portability.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the compiler to create binaries that will run correctly and efficiently on any hardware conforming to the intermediate standard.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the host compiler to create binaries that will run on the host's hardware configuration conforming to the intermediate standard, but may not run on other hardware conforming to the same.
Hardware dependent source code where the logical execution path through the source depends on the architecture of the hardware on which it is run. Programs need to 'query' the hardware configuration.
Hardware specific source code.
In a fantasy world where one can just imagine new standards, hardware, and programming languages, one could choose which level of portablity to aim for. I think that C was supposed to be hardware independent, but it isn't really if you want to get the best performance out of your hardware. OpenCL C tries also, but doesn't quite make it, though with run-time kernel compilation it does a pretty good job. The host program has the same issues though as any other. I don't think there are any 'Level 1' portable languages currently.
Sorry my response is a bit rambling. It's unfortunate that it's difficult to have an objective constructive discussion about the pros and cons of different ideas about future changes in software and hardware. Personally, I think FPGAs have huge potential but are still a long way from where they would need to be to go mainstream. Any new computing language will probably become out of date when hardware changes occur and software trends change. It's remarkable that C still occupies such a prominent space. In another 10 or 20 years time, C will probably still be going strong. How many other modern languages will still be commonplace then?
I have recently bought an STM32 NUCLEO Dev Kit and wondered if this what an actual Embedded Systems Engineer would use in the industry when developing a product?
I'm using Kiel Uvision 5, STMCubeMX and STM32 ST-LINK Utility to develop certain projects. As I am used to using PIC and using registers like PORTA, OSCCON, TIMER0 etc, I see that Kiel Uvision 5 uses ready made functions like HAL_GPIO_TogglePin(.........) etc. Is this the usual way they do this in industry or work more directly with the registers?
It's heavily opinion based and I wouldn't get surprised if this question got closed for this reason. This answer only touches on few aspects of what you're asking about. It's a very broad topic and it's going to be hard - if not impossible - to include everything in one post which wouldn't end up being several pages long. However to give you my perspective on the topic, while trying to remain unbiased, the short answer is.. it depends.
If you're asking about what is used in most common cases, it's likely going to be the HAL (previously StdPeriph) functions you've mentioned. The reason is - they get the job done in most common cases. After all it always comes down to what the cost of creating a product is going to be. If HAL functions are "good enough" for the purpose, they're going to be used simply because they're faster to develop with. The higher the development cost, the more you'll want to cut it (or move it elsewhere) and using abstractions is one way of doing so.
However, even though I think it's safe to assume that HAL / Std Periph / any other (including proprietary) abstraction layer is generally used, it's not always the case for at least two reasons I can think of:
Existing functions may not be suitable for your purpose. Giving HAL as an example, it works pretty well for most common cases, but sometimes your needs may be so specific that you'll have to go and mess "under the hood", often ending up either writing your own variation of the functions of building something new on top of HAL. Personally I can think of at least few examples where HAL functions weren't exactly what I needed. It doesn't necessarily mean that the library is bad, it's just sometimes the requirements are very specific.
Messing with registers directly may sometimes be required for performance reasons. HAL and similar are an abstraction layer and as any abstraction, they take more time to execute than using the registers directly. If you're trying to squeeze absolute maximum out of given peripheral, you'll sometimes have to go down to register level.
Now to a more biased portion of my answer.. I can see why you ask this question. Coming from PIC world where Flash or CPU clocks were more precious, it does make sense to use registers directly there. In case of STM32, it's not as critical anymore. Having said that, you'll sometimes stumble upon opinions that "using registers is the only true way", but personally I find such discussions ending up being purely academic. I see registers or any abstraction built on top of it as tools and you should use the right tools for the right job. Two examples of NOT using the right tools:
You use only registers as "the only right way" either because you believe it yourself or you've been told so. Your products take twice (if not more) time to develop, your code takes less space in flash (so now you use 46% of 1MB flash instead of 48%). Code that is performance-critical meets its goals. Code that has relaxed execution time constraints is also super efficient, but it doesn't affect the end customer much, if at all. Your code is also less reusable - you find yourself rewriting same portions of code over and over every time you release new product for a new MCU family.
You only use HAL / any other similar abstraction because "you didn't pick so powerful MCU to have to go down to register level", or because you're told you should never ever touch registers. You develop much faster and you're able to release two products instead of just one using registers. However when there are execution time constraints / transmission speeds you have to hit, you find yourself picking MCUs more powerful than should theoretically be needed. Sometimes you find yourself writing wrappers around HAL because they don't give you exactly the functionality you need - it feels like making it more complicated than it should be.
So after all, if there was anything to take out of what I'm trying to say is that you should use what is suitable for the job on a case-by-case basis. In case of STM32, you nowadays have 3 options: HAL (top abstraction level), HAL LL (Low Level abstraction - often simple wrapper functions around register acceses) or using registers directly. Which one you choose should come from what your requirements are.
I use Nucleo boards all the time. It allows me to start write software before I have the actual hardware ready.
HAL & register way is rather the programmer choice than the "industry standard". I personally use HAL drivers when program more complicated peripherals like USB and Ethernet to avoid wiring the full stacks from the scratch.
I have just finished my degree and we heavily used the STM32 platform with CMSIS and HAL.
It is a matter of preference. The HAL libraries offer higher abstraction but come with some quirks that are not very intuitive. The HAL (was/)is buggy sometimes. We have encountered SPI transfer bugs which made the higher transfer rates of SPI unusable because of a delay in the per byte transfer.
CMSIS offers lower level access but still abstracts away from simple bit manipulation. I don't think that direct register access is a great way to program anymore and at least CMSIS should be used. But it still a matter of opinion, preference and what is right for the job at hand. If you need something quick: HAL. If you need really fine control: CMSIS.
(sidenote I believe that CMSIS has been phased out in favor of HAL but it is still usable at this time)
All of the other answers are valid.
Just wanted to chime in and say I used STM32 regularly at my job (and at two different companies). We used the HAL drivers at both companies. There are obviously some issues with them, but used regularly enough by others that you can easily find support online. Additionally CubeMx does a decent job with them, at least enough to get you started on using the peripherals. So the 0->something step feel smaller. But getting really optimized code and design, you may want to dive deeper to really understand what each of the HAL drivers are doing and decide which method works for your project, goals, and requirements.
I am using openmp to parallelize loops in my code in order to optimize it
I hear that openmp also shows something good or bad cache behavior
How do i see these cache interactions to arrange good cache behavior for my openmp omp pragma loop program?
OpenMP itself cannot be used to get information about the cache usage of your program. Depending on your platform there are some tools that will give insights to the cache behavior.
On Linux systems you can use perf.
perf stat -e cache-references,cache-misses <your-exe>
outputs statistics about the cache-misses. There are a lot more events that can be used (see here for further details). Common events are collected if you simply run:
perf stat <your-exe>
Another tool that can also be used for Windows is the IntelĀ® Performance Counter Monitor. Although it only works with Intel CPUs it can collect additional information like the occupied memory bandwidth (on supported models).
However, the tools can help you to measure the cache usage of your program, but did not improve it. You have to manually optimize your code and recheck if the cache misses have been reduced.
If you're looking to a specific kernel, you might want to consider [PAPI].1
I would like to model the behavior of caches in Intel architectures (LRU, inclusive, K-Way Associative, etc)., I've read wikipedia, Ulrich Drepper's great paper on memory, and the Intel Manual Volume 3A: System Programming Guide (chapter 11, but it's not very helpful, because they only explain what can be manipulated at the software level). I've also read a bunch of academic papers, but as usual, they do not make their code available for replication... even after asking for it. My question is, is there already a publicly available framework to model cache behavior? If not, is there a document detailing the behavior of caches from Intel at the deepest levels? I could not find one.
There are plenty of cache simulators out there, Dinero for e.g. (pun obviously intended) should be fairly simple and is often used for educational purposes.
Note that this simulator is trace-driven, it means it feeds on a list of memory access addresses, it doesn't know how to run a binary. You can produce such traces by emulating them with binary instrumentation tools, for e.g.
pin
qemu
bochs
etc.. Note that some of these offer internal cache simulators already, and may be possible to play with.
Other simulators can simulate full CPU/system behavior, not just caches, and can therefore support running a binary. Most of them include within them a simulated cache system. For e.g.:
gem5
simplescalar
multi2sim
marss86
and many others
On the other hand, writing your own cache simulator is fairly simple - if you can work on a memory trace (writing an actual fronend is way more complicated). You won't be able to get a too detailed spec on actual caches in Intel/AMD products, but the basic functionality is detailed in any computer architecture textbook or even wikipedia, the parameters (size, associativity, coherency policies) are mostly documented in the published guides, and may often change between product generations. You can always ask here if you encounter any specific question :)
Edit:
Regarding the second part of the question - there's no publicly available documentation of the exact cache implementation of Intel CPUs, but the dry "specs" (size, associativity, policies) are in the optimization guide:
Now, modeling these caches should be straightforward, but there may be some hidden caveats, like powerdown features or specialized LRU behaviors. One such reported example can be found here - http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ (if this is true, it might be worth implementing for accuracy), but aside from that I believe the overall behavior shouldn't be affected by these details too much, for any practical use.
Is there a way I can access the (Intel]) hardware counters for each core programmatically? (that is, no perf, perfmon, or valgrind, and I should add "simple", so no PAPI, e.g.) I'd like to know (for each core) how many L1-LLC cache hits/misses it (= a certain program running on that core) incurred in. This is for Linux 3.2.0-32, C, and using GCC.
The performance counters in the processor can not be read from "user-mode" code, so you need some sort of kernel module to do this. Once you have that, it's not terribly hard, there are a number of MSR's.
You can perhaps also use /dev/cpu/core-number/msr to read the values without a kernel module.
To describe all the details of how you do this is a little too much for an answer (unless I copy'n'paste the entire section of Intel's programmers manual (Vol3) - which I don't think is quite what we want here...)