Simple cache profiling API - c

Is there a way I can access the (Intel]) hardware counters for each core programmatically? (that is, no perf, perfmon, or valgrind, and I should add "simple", so no PAPI, e.g.) I'd like to know (for each core) how many L1-LLC cache hits/misses it (= a certain program running on that core) incurred in. This is for Linux 3.2.0-32, C, and using GCC.

The performance counters in the processor can not be read from "user-mode" code, so you need some sort of kernel module to do this. Once you have that, it's not terribly hard, there are a number of MSR's.
You can perhaps also use /dev/cpu/core-number/msr to read the values without a kernel module.
To describe all the details of how you do this is a little too much for an answer (unless I copy'n'paste the entire section of Intel's programmers manual (Vol3) - which I don't think is quite what we want here...)

Related

Question to any embedded systems engineers out there using STM32 NUCLEO

I have recently bought an STM32 NUCLEO Dev Kit and wondered if this what an actual Embedded Systems Engineer would use in the industry when developing a product?
I'm using Kiel Uvision 5, STMCubeMX and STM32 ST-LINK Utility to develop certain projects. As I am used to using PIC and using registers like PORTA, OSCCON, TIMER0 etc, I see that Kiel Uvision 5 uses ready made functions like HAL_GPIO_TogglePin(.........) etc. Is this the usual way they do this in industry or work more directly with the registers?
It's heavily opinion based and I wouldn't get surprised if this question got closed for this reason. This answer only touches on few aspects of what you're asking about. It's a very broad topic and it's going to be hard - if not impossible - to include everything in one post which wouldn't end up being several pages long. However to give you my perspective on the topic, while trying to remain unbiased, the short answer is.. it depends.
If you're asking about what is used in most common cases, it's likely going to be the HAL (previously StdPeriph) functions you've mentioned. The reason is - they get the job done in most common cases. After all it always comes down to what the cost of creating a product is going to be. If HAL functions are "good enough" for the purpose, they're going to be used simply because they're faster to develop with. The higher the development cost, the more you'll want to cut it (or move it elsewhere) and using abstractions is one way of doing so.
However, even though I think it's safe to assume that HAL / Std Periph / any other (including proprietary) abstraction layer is generally used, it's not always the case for at least two reasons I can think of:
Existing functions may not be suitable for your purpose. Giving HAL as an example, it works pretty well for most common cases, but sometimes your needs may be so specific that you'll have to go and mess "under the hood", often ending up either writing your own variation of the functions of building something new on top of HAL. Personally I can think of at least few examples where HAL functions weren't exactly what I needed. It doesn't necessarily mean that the library is bad, it's just sometimes the requirements are very specific.
Messing with registers directly may sometimes be required for performance reasons. HAL and similar are an abstraction layer and as any abstraction, they take more time to execute than using the registers directly. If you're trying to squeeze absolute maximum out of given peripheral, you'll sometimes have to go down to register level.
Now to a more biased portion of my answer.. I can see why you ask this question. Coming from PIC world where Flash or CPU clocks were more precious, it does make sense to use registers directly there. In case of STM32, it's not as critical anymore. Having said that, you'll sometimes stumble upon opinions that "using registers is the only true way", but personally I find such discussions ending up being purely academic. I see registers or any abstraction built on top of it as tools and you should use the right tools for the right job. Two examples of NOT using the right tools:
You use only registers as "the only right way" either because you believe it yourself or you've been told so. Your products take twice (if not more) time to develop, your code takes less space in flash (so now you use 46% of 1MB flash instead of 48%). Code that is performance-critical meets its goals. Code that has relaxed execution time constraints is also super efficient, but it doesn't affect the end customer much, if at all. Your code is also less reusable - you find yourself rewriting same portions of code over and over every time you release new product for a new MCU family.
You only use HAL / any other similar abstraction because "you didn't pick so powerful MCU to have to go down to register level", or because you're told you should never ever touch registers. You develop much faster and you're able to release two products instead of just one using registers. However when there are execution time constraints / transmission speeds you have to hit, you find yourself picking MCUs more powerful than should theoretically be needed. Sometimes you find yourself writing wrappers around HAL because they don't give you exactly the functionality you need - it feels like making it more complicated than it should be.
So after all, if there was anything to take out of what I'm trying to say is that you should use what is suitable for the job on a case-by-case basis. In case of STM32, you nowadays have 3 options: HAL (top abstraction level), HAL LL (Low Level abstraction - often simple wrapper functions around register acceses) or using registers directly. Which one you choose should come from what your requirements are.
I use Nucleo boards all the time. It allows me to start write software before I have the actual hardware ready.
HAL & register way is rather the programmer choice than the "industry standard". I personally use HAL drivers when program more complicated peripherals like USB and Ethernet to avoid wiring the full stacks from the scratch.
I have just finished my degree and we heavily used the STM32 platform with CMSIS and HAL.
It is a matter of preference. The HAL libraries offer higher abstraction but come with some quirks that are not very intuitive. The HAL (was/)is buggy sometimes. We have encountered SPI transfer bugs which made the higher transfer rates of SPI unusable because of a delay in the per byte transfer.
CMSIS offers lower level access but still abstracts away from simple bit manipulation. I don't think that direct register access is a great way to program anymore and at least CMSIS should be used. But it still a matter of opinion, preference and what is right for the job at hand. If you need something quick: HAL. If you need really fine control: CMSIS.
(sidenote I believe that CMSIS has been phased out in favor of HAL but it is still usable at this time)
All of the other answers are valid.
Just wanted to chime in and say I used STM32 regularly at my job (and at two different companies). We used the HAL drivers at both companies. There are obviously some issues with them, but used regularly enough by others that you can easily find support online. Additionally CubeMx does a decent job with them, at least enough to get you started on using the peripherals. So the 0->something step feel smaller. But getting really optimized code and design, you may want to dive deeper to really understand what each of the HAL drivers are doing and decide which method works for your project, goals, and requirements.

Will it be feasible to use gcc's function multi-versioning without code changes?

According to most benchmarks, Intel's Clear Linux is way faster than other distributions, mostly thanks to a GCC feature called Function Multi-Versioning. Right now the method they use is to compile the code, analyze which function contains vectorized loops, then patch the code with FMV attributes and compile it again.
How feasible will it be for GCC to do it automatically? For example, by passing -mmultiarch=sandybridge,skylake (or a similar -m option listing CPU extensions like AVX and AVX2).
Right now I'm interested in two usage scenarios:
Use this option for our large math-heavy program for delivering releases to our customers. I don't want to pollute the code with non-standard attributes and I don't want to modify the third-party libraries we use.
The other Linux distributions will be able to do this easily, without patching the code as Intel does. This should give all Linux users massive performance gains.
No, but it doesn't matter. There's very, very little code that will actually benefit from this; for the most part by doing it globally you'll just (without special effort to sort matching versions in pages together) make your system much more memory-constrained and slower due to the huge increase in code size. Most actual loads aren't even CPU-bound; they're syscall-overhead-bound, GPU-bound, IO-bound, etc. And many of the modern ones that are CPU-bound aren't running precompiled code but JIT'd code (i.e. everything running in a browser, whether that's your real browser or the outdated and unpatched fork of Chrome in every Electron app).

How do I make many system calls at once with the linux kernel?

I was wondering if I could make a large number of system calls at the same time, with only one switch overhead. I need this because I have a need to make many (128) system calls at the same time. If I could do this without switching between kernel and userland 256+ times I think it could make my (speed sensitive) library significantly faster.
You really can't do that from an application program. What you could do is build a loadable kernel module that implements those operations and presents a simple API -- then you can change context once, do all the work, and return.
However, as with most of these sorts of optimization questions, the first thing to ask is "why do you think it's going to be necessary?" Do you have timing information etc? Have you profiled? How much of a performance issue do you really have, and is the additional complexity going to be worth the speedup?
I don't think Linux will support syscall chaining anytime soon. You might have more luck implementing this on another kernel and porting your application.
That said, it's not difficult to write a proxy to do the job in kernelspace for you, but don't expect it to be merged upstream. I've worked on real-time stuff and we had a solution like that, but never used in production because of support issues :/.

Timing Kernel Executions on CUDA

I've used code from CUDA C Best Practices to implement an execution timer. However their is something strange and I don't know if it's an anomaly or if that's normal. I get different read outs each time I run my CUDA app.
Could these readings by related to design or is that something I should expect.
I'm not running any graphic intensive applications on my machine, other than Windows 7.
Well it depends how big the differences are. One thing you can see anomalies caused by is the kernel scheduler. It may just happen that the scheduler is giving some extra timeslices to kernel functions (because graphics API calls have error checking involved) which shows more execution time. If the differences are very large I would say check your code but if it's very low in orders of milliseconds I wouldn't worry about it +- 10msecs is the usual for the timeslicing quantum in most OS's (windows probably included).
Also Aero is kind of intensive so that may be adding to the discrepancies you are seeing.
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.

How would I go about creating my own VM?

I'm wondering how to create a minimal virtual machine that'll be modeled after the Intel 16 bit system. This would be my first actual C project, most of my code is 100 lines or less, but I have the core fundamentals down, read K&R, and understand how things ought to work, so this pretty much is a test of wits.
Could anyone guide me in as far as documentation, tools, tutorials, or plain old tips/pointers on how to go about this, thus far I understand that I require somewhere to store data, a CPU of sorts and some sort of mechanism that functions as an interrupt controller.
I'm doing this to learn: Systems internals, ASM internals and C - three facets of computing that I want to learn in a singular project.
Please be kind enough not to tell me to do something simpler - that would only be annoying. :)
Thanks for reading, and hopefully writing!
Virtual machines fall into two categories: those that interpret the code instruction at a time and those that compile the code to native instructions (e.g. "JIT").
The interpretation category is usually built around an instruction execution loop, using a switch statement, computed gotos or function pointers to determine how to execute each instruction.
There is a fun platform that is worth studying for its simplicity and fun: Corewars.
Corewars is a programming challenge game where programs written in "Redcode" run on a MARS VM. There are many MARS VMs, typically written in C.
It has also inspired 8086-based versions, where programs written in 8086 assembler battle.
Well, for starters I would pick up a reference book for assembly language for the processor you intend to virtualize, like 80286 or similar.
For a JIT, you might want to dynamically generate and execute x86 code.
If you want to write a Virtual Machine using the x86 VMM technology you will need quite a bit of things.
There are a few instructions that are critical such as VM_ENTER/VM_EXIT (name can change depending on the chip, AMD and INTEL use different names but the functionalities are the same). Those instructions are actually privileged and therefore, you will need to write a kernel module to use them.
The first step for your VM to start is to boot it and therefore, you will need a 'BIOS' which will be loaded. Then you need to emulate devices, etc. You could even run an old version of MSDOS in such a VM if you wanted to.
All in all, it clearly isn't trivial and requires a lot of time and effort.
Now, you could do something similar to what VMWare used to do before the Virtualization ready CPUs appeared.

Resources