Does anybody have any experience in maintaining a single codebase for both CPU and GPU?
I want to create an application which, when possible, would use the GPU for some long-running calculations, but would fall back to a regular CPU version if a compatible GPU is not present on the target machine. It would be really helpful if I could write a portion of the code using conditional compilation directives so that it compiles to both a CPU version and a GPU version. Of course there will be some parts which are different for CPU and GPU, but I would like to keep the essence of the algorithm in one place. Is this possible at all?
OpenCL is a C-based language. OpenCL platforms exist that run on GPUs (from NVIDIA and AMD) and CPUs (from Intel and AMD).
While it is possible to execute the same OpenCL code on both GPUs and CPUs, it really needs to be optimized for the target device. Different code would need to be written for different GPUs and CPUs to get the best performance. However, a CPU OpenCL platform can serve as a low-performance fallback, even for GPU-optimized code.
If you are happy writing conditional directives that select code depending on the target device (CPU or GPU), that can help the performance of OpenCL code across multiple devices.
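One way to keep the essence of the algorithm in one place, as the question asks, is to write the per-element math once and let a macro decide how it gets compiled. The sketch below assumes CUDA as the GPU toolchain rather than OpenCL (in OpenCL the kernel would normally live in a separate source string instead). `HOST_DEVICE`, `saxpy_element`, `saxpy_gpu` and `saxpy_cpu` are made-up names for illustration, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// When compiled by nvcc, __CUDACC__ is defined and the shared function is built
// for both host and device. With a plain C++ compiler the macro expands to nothing.
#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// The shared "essence" of the algorithm: one element of y = a * x + y.
HOST_DEVICE inline float saxpy_element(float a, float x, float y) {
    return a * x + y;
}

#if defined(__CUDACC__)
// GPU wrapper, one thread per element (only compiled under nvcc).
__global__ void saxpy_gpu(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) y[i] = saxpy_element(a, x[i], y[i]);
}
#endif

// CPU wrapper: a plain loop over the same per-element function.
void saxpy_cpu(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] = saxpy_element(a, x[i], y[i]);
    }
}
```

In a CUDA build, the program could call `cudaGetDeviceCount` at startup and use `saxpy_cpu` whenever no usable device is reported, so only the thin wrappers differ between the two paths.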
When writing a program that needs high computational performance, multiple threads, SIMD vectorization, or other extensions are often required. One can query the CPU using CPUID to find out which instruction sets it supports. However, since the programmer has no control over which cores actually execute the different threads, it could be a problem if different cores support different instruction sets.
If one queries the CPU at the start of the program, is it safe to assume that all threads will run on cores supporting the same instruction set? If not, does this break programs that assume all cores support the same instructions, or are the CPUs clever enough to realize they shouldn't use those cores?
Does one need to query CPUID on each thread separately?
Is there any way a program can avoid running on E-cores?
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
This question is posed out of curiosity, but the answers may affect how I write programs in the future. I would appreciate any informed comments on these questions but please don't just share your thoughts and opinions on how you think it works if you don't know with high confidence. Thanks.
I have only tried to find information on the internet, but found nothing of sufficiently low level to answer these questions adequately.
Do efficiency cores support the same instructions as performance cores?
Yes (for Intel's Alder Lake, and also for ARM big.LITTLE).
For Alder Lake, operating systems were "deemed unable" to handle heterogeneous CPUs, so Intel disabled extensions that already existed in the performance cores (primarily AVX-512) to match the features present in the efficiency cores.
Sadly, supporting heterogeneous CPUs isn't actually hard in some cases (e.g. hypervisors that don't give all CPUs to a single guest) and is solvable in the general case. Failing to provide a way to re-enable the disabled extensions (for an OS that does support heterogeneous CPUs) prevents any OS from trying to support heterogeneous CPUs in future, essentially turning a temporary solution into a permanent problem.
Does one need to query CPUID on each thread separately?
Not for the purpose of determining feature availability. If you have highly optimized code (e.g. code tuned differently for different CPU types) you might still want to, even though it's not strictly needed; but then you will also need to pin the thread to a specific CPU or group of CPUs.
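Since feature availability is the same on every core here, a program can query once at startup and reuse the answer from every thread. A minimal sketch of that pattern, assuming GCC or Clang on x86 (where the `__builtin_cpu_supports` builtin exists); `cpu_has_avx2`, `Kernel` and `choose_kernel` are illustrative names, and on any other compiler or architecture the sketch simply reports no AVX2:

```cpp
// Illustrative choice between two builds of the same routine.
enum class Kernel { Scalar, Avx2 };

// Queries the CPU once. The function-local static caches the result and its
// initialization is thread-safe (C++11), so any thread may call this cheaply.
bool cpu_has_avx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    static const bool cached = [] {
        __builtin_cpu_init();  // safe to call more than once
        return __builtin_cpu_supports("avx2") != 0;
    }();
    return cached;
#else
    return false;  // Assumption: treat unknown platforms as lacking AVX2.
#endif
}

// Every thread gets the same answer, so dispatch can be decided once.
Kernel choose_kernel() {
    return cpu_has_avx2() ? Kernel::Avx2 : Kernel::Scalar;
}
```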
Is there any way a program can avoid running on E-cores?
Potentially, via CPU affinity. Typically it just makes things worse though (better to run on an E-core than to not run at all because the P-cores are already busy).
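For reference, this is roughly what setting CPU affinity looks like on Linux with `sched_setaffinity`. It is only a sketch: it does not discover which CPU numbers are P-cores (the caller is assumed to know that from the OS topology information), and `first_allowed_cpu` and `pin_current_thread_to_cpu` are made-up helper names. On non-Linux systems the helpers just report failure.

```cpp
#include <cassert>
#if defined(__linux__)
#include <sched.h>
#endif

// Returns the lowest-numbered CPU the calling thread may run on, or -1.
int first_allowed_cpu() {
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
#endif
    return -1;
}

// Restricts the calling thread to a single CPU. Which IDs are P-cores is NOT
// determined here; that knowledge is assumed to come from elsewhere.
bool pin_current_thread_to_cpu(int cpu) {
#if defined(__linux__)
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 = calling thread
#else
    (void)cpu;
    return false;  // Assumption: no affinity support in this sketch off Linux.
#endif
}
```

As noted above, pinning like this removes the scheduler's freedom to use an idle core, so it is worth measuring before keeping it.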
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
Lower clock, shorter pipeline, less aggressive speculative execution, ...
Say I am running an ARM simulator using QEMU. Is it possible to find the execution time of a program as it would be on a real ARM processor? In other words, if I use functions such as gettimeofday in a program running on the simulator to check the elapsed time, will the elapsed time be accurate, i.e. is the simulation cycle accurate?
An investigation into this issue at our company concluded that QEMU (for ARM) is not cycle accurate. If I remember correctly, cycle accuracy is not a goal of QEMU; instead it aims at fast emulation. Beware also that exact timing depends on quite unpredictable things like cache hits and misses. It will also depend on the actual architecture chosen. Note that ARM is merely an instruction set IP and several different implementations exist. If an operating system is emulated in addition, things get even more unpredictable.
We use the simulator from ARM to evaluate performance, but even that one is not fully cycle accurate for the latest versions of the ARM architecture.
GEM5
I have seen a researcher use gem5 for this. This paper evaluates how accurate it is. And I've created an easy to get started setup on GitHub.
As Bryan mentioned, QEMU is designed for speed: only valid instruction-set-level behavior must be reproduced, not necessarily with the right number of cycles or in the same pipeline order. This is also called functional emulation.
Furthermore, DRAM accesses are assumed to be immediate, so it makes no sense to emulate caches either. And as we know, current CPUs are basically memory-latency-hiding machines.
Cycle-accurate emulators, on the other hand, also emulate CPU internals, and are therefore far slower.
The root of the problem is of course the under-documented performance features of processors, which vendors don't release in order to prevent intellectual property leakage.
GEM5 appears to implement a generic version of common CPU internals, so it should be more cycle accurate than functional emulators, but true cycle-accurate emulation is likely impossible without insider knowledge.
Third-party emulator implementors must then reverse engineer CPU performance from experiments and existing documentation.
Some of the key "internals" are cache, pipeline and branch prediction.
Related:
Question that asks how cycle accurate emulators are possible at all: How can CAS simulators like PTLsim achieve cycle accurate simulation of x86 hardware?
ARM Cycle-Accurate Simulator
I'd like to use hardware performance counters, specifically on x86 CPUs, to obtain cache misses or branch mispredictions. Performance counters are heavily used in advanced profilers like Intel VTune. Please don't confuse these with the performance counters of the Windows operating system.
In order to use these counters in a C/C++ program, one may use PAPI: http://icl.cs.utk.edu/papi/
This allows you to easily use performance counters, but only on Linux. PAPI once supported Windows, but no longer does.
Is there anyone who recently tried PAPI or other APIs to use hardware performance counters on Windows?
You can use the RDPMC instruction or the __readpmc MSVC compiler intrinsic, which is the same thing.
However, Windows prohibits user-mode applications from executing this instruction by setting CR4.PCE to 0. Presumably, this is done because the meaning of each counter is determined by MSRs (model-specific registers), which are only accessible in kernel mode. In other words, unless you're a kernel-mode module (e.g. a device driver), you are going to get a "privileged instruction" trap if you attempt to execute this instruction.
If you're writing a user-mode application, your only option is (as @Christopher mentioned in the comments) to write a kernel module which executes this instruction for you (you'll incur a user->kernel call penalty) and to enable test signing on your machine so that your presumably self-signed "driver" can be loaded. This means you can't easily distribute the app, but it will work for in-house tuning.
What about this HCP Reference? Does it not provide what you want?
I'm working on optimizing a piece of software and want to measure its performance. I am currently simulating an ARM platform with OVP (Open Virtual Platforms), and I get statistics such as simulation time and simulated instructions.
My question is: why is the number of simulated instructions different every time I run the software (different, but close)? Shouldn't it be the same every time? As I understand it, the software that I write in C is compiled into ARM assembler instructions, and each time the software runs, the simulated instruction count is the number of times these ARM instructions are executed. Shouldn't that be the same every time?
How should I measure performance? Take 10 samples of simulated instructions and get the average?
From my experience on a real (non-simulated) ARM, if I take cycle counts for a section of the code, the number of cycles will vary. This is because:
There can be context switches in the middle of your executing code.
The initial state of the CPU may be different upon entering the code section. (e.g. the content of the pipeline, branch prediction etc.)
The cache state will be different on entry to the code section.
External factors such as other hardware accessing external memory.
Due to all these, taking an average (plus some other statistical measures) is really the only practical approach for real hardware and a real OS. In a good simulator some of these factors are potentially eliminated.
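To make the "average plus some other statistical measures" concrete, here is a small sketch that summarizes repeated cycle or instruction counts. The choice of measures (mean, sample standard deviation, median, minimum) is my assumption about what is useful, and `SampleStats` and `summarize` are illustrative names. The median and minimum are included because they are less affected by the occasional run that was hit by a context switch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct SampleStats {
    double mean;
    double stddev;  // sample standard deviation (n - 1 in the denominator)
    double median;
    double min;
};

// Takes the samples by value so they can be sorted locally.
SampleStats summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / static_cast<double>(n);

    double squared_dev = 0.0;
    for (double s : samples) squared_dev += (s - mean) * (s - mean);
    const double stddev =
        n > 1 ? std::sqrt(squared_dev / static_cast<double>(n - 1)) : 0.0;

    // Middle element for odd n, average of the two middle elements for even n.
    const double median = (n % 2 == 1)
                              ? samples[n / 2]
                              : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);

    return {mean, stddev, median, samples.front()};
}
```

A large gap between the mean and the median is a hint that a few runs were disturbed and that more samples are needed.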
On some real chips (or if supported by the simulator) the ARM Performance Monitoring Unit can be useful.
If you're coding for the Cortex A8 this is a cool online cycle counter that can really help you squeeze more performance out of your code.
I was running a CUDA program on a machine which has a CPU with four cores. How can I change my CUDA C program to use all four cores and all available GPUs?
I mean, my program also does things on the host side before computing on the GPUs...
thanks!
CUDA is not intended to do this. The purpose of CUDA is to provide access to the GPU for parallel processing. It will not use your CPU cores.
From the What is CUDA? page:
CUDA is NVIDIA’s parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit).
Using the CPU cores should be handled via more traditional multi-threading techniques.
CUDA code runs only on the GPU.
So if you want parallelism on your CPU cores, you need to use threads, for example via Pthreads or OpenMP.
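As a rough idea of what the OpenMP route looks like for the host-side work, here is a sketch of a loop spread across the CPU cores. The dot product is just a stand-in for whatever the program does on the host before launching GPU kernels, and `dot_product` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side computation shared across CPU cores with OpenMP.
double dot_product(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    double total = 0.0;
    // With OpenMP enabled (e.g. -fopenmp on GCC/Clang) the iterations are split
    // across threads and the partial sums are combined by the reduction clause.
    // Without OpenMP the pragma is ignored and the loop simply runs serially.
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += x[static_cast<std::size_t>(i)] * y[static_cast<std::size_t>(i)];
    }
    return total;
}
```

The GPU part stays in CUDA as before; only the host-side loops get the pragma.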
Convert your program to OpenCL :-)