If I don't utilize multithreaded paradigms when designing my code, will hyperthreading split the load automagically over the logical cores, or would my code have to be specifically written to take advantage of the other logical cores, as it would have to be for physical cores?
On suggestion of #us2012 I post this here from my comment...
There is no such magic. Superscalar CPUs, especially OOO (Out Of Order execution) processors do magic - but that is inside one core.
On the contrary, Hyperthreading can be thought of as (very simplified) two pipelines in front of one complete core.
AMD Bulldozer CPUs have a similar bit, but they went a step further: the integer core is split into two too, but the two pipelines + integer cores share one floating point unit. This whole is called a "module", having two threads.
TL;DR
Superscalar (from the Wiki)
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
Out of order execution (from the Wiki)
In computer engineering, out-of-order execution (OoOE or OOE) is a paradigm used in most high-performance microprocessors to make use of instruction cycles that would otherwise be wasted by a certain type of costly delay. In this paradigm, a processor executes instructions in an order governed by the availability of input data, rather than by their original order in a program. In doing so, the processor can avoid being idle while data is retrieved for the next instruction in a program, processing instead the next instructions which are able to run immediately.
Hyperthreading (from... you know where...)
Hyper-threading (officially Hyper-Threading Technology or HT Technology, abbreviated HTT or HT) is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on PC microprocessors. It first appeared in February 2002 on Xeon server processors and in November 2002 on Pentium 4 desktop CPUs. Later, Intel included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.
Bulldozer (not from not the wiki)
Bulldozer is the first major redesign of AMD’s processor architecture since 2003, when the firm launched its K8 processors, and also features two 128-bit FMA-capable FPUs which can be combined into one 256-bit FPU. This design is accompanied by two integer clusters, each with 4 pipelines (the fetch/decode stage is shared). Bulldozer will also introduce shared L2 cache in the new architecture. AMD's marketing service calls this design a "Module". A 16-threads processor design would feature eight of these "modules",[7] but the operating system will recognize each "module" as two logical cores.
Related
When writing a program that requires high computational performance, it is often necessary to use multiple threads, SIMD vectorization, or other extensions. One can query the CPU using CPUID to find out which instruction sets it supports. However, since the programmer has no control over which cores actually execute the different threads, it could be a problem if different cores support different instruction sets.
If one queries the CPU at the start of the program, is it safe to assume all threads will support the same instruction set? If not, then does this break programs that assume they do all support the same instructions or are the CPUs clever enough to realize they shouldn't use those cores?
Does one need to query CPUID on each thread separately?
Is there any way a program can avoid running on E-cores?
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
This question is posed out of curiosity, but the answers may affect how I write programs in the future. I would appreciate any informed comments on these questions but please don't just share your thoughts and opinions on how you think it works if you don't know with high confidence. Thanks.
I have only tried to find information on the internet, but found nothing of sufficiently low level to answer these questions adequately.
Do efficiency cores support the same instructions as performance cores?
Yes (for Intel's Alder Lake, but also for big.LITTLE ARM).
For Alder Lake, operating systems were "deemed unable" to handle heterogeneous CPUs, so Intel nerfed the performance cores by disabling extensions they already supported (primarily AVX-512) to match the features present in the efficiency cores.
Sadly, supporting heterogeneous CPU isn't actually hard in some cases (e.g. hypervisors that don't give all CPUs to a single guest) and is solvable in the general case; and failing to provide a way to re-enable disabled extensions (if an OS supports heterogeneous CPUs) prevents an OS from trying to support heterogeneous CPUs in future; essentially turning a temporary solution into a permanent problem.
Does one need to query CPUID on each thread separately?
Not for the purpose of determining feature availability. If you have highly optimized code (e.g. code tuned differently for different CPU types) you might still want to, even though it's not a strict need; but you will then also need to pin the thread to a specific CPU or group of CPUs.
Is there any way a program can avoid running on E-cores?
Potentially, via CPU affinity. Typically it just makes things worse though (better to run on an E core than to not run at all because the P cores are already busy).
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
Lower clock, shorter pipeline, less aggressive speculative execution, ...
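To make the CPU affinity remark concrete, here is a minimal Linux-only sketch using glibc's sched_getaffinity and sched_setaffinity. The helper name pin_to_first_allowed_cpu is made up for illustration. It simply pins the calling thread to the lowest-numbered CPU it is already allowed to use; working out which CPU numbers are P cores or E cores is platform-specific and is not attempted here.

```c
#define _GNU_SOURCE            /* exposes cpu_set_t and the CPU_* macros in glibc */
#include <sched.h>

/* Hypothetical helper: pin the calling thread to the lowest-numbered CPU
 * it is currently allowed to run on.
 * Returns that CPU number, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;

    /* pid 0 means "the calling thread" */
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;                 /* empty mask: should not happen in practice */
}
```

A thread that wants CPU-specific tuning would call this (or a variant that takes an explicit CPU number) before querying CPUID, so that the query and the tuned code run on the same core.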
I have been asked a question but I am not sure if I answered it correctly.
"Is it possible to rely only on software timer?"
My answer was "yes, in theory".
But then I added:
"Just relying on hardware timer at the kernel loading (rtc) and then
software only is a mess to manage since we must be able to know
how many cpu cycles each instruction took + eventual cache miss +
branching cost + memory speed and put a counter after each one or
group (good luck with out-of-order cpu).
And do the calculation to derive the current cpu cycle. That is
insane.
Not talking about the overall performance drop.
The best we could have is a brittle approximation of the time which
becomes more wrong over time. Possibly even over short intervals."
But even if it seems logical to me, did my thinking go wrong?
Thanks
On current processors and hardware (e.g. Intel or AMD or ARM in laptops or desktops or tablets) with common operating systems (Linux, Windows, FreeBSD, MacOSX, Android, iOS, ...) processes are scheduled at random times. So cache behavior is non deterministic. Hence, instruction timing is non reproducible. You need some hardware time measurement.
A typical desktop or laptop gets hundreds, or thousands, of interrupts every second, most of them time related. Try running cat /proc/interrupts on a Linux machine twice, with a few seconds between the runs.
I guess that even with a single-tasked MS-DOS like operating system, you'll still get random behavior (e.g. induced by ACPI, or SMM). On some laptops, the processor frequency can be throttled by its temperature, which depends upon the CPU load and the external temperature...
In practice you really want to use some timer provided by the operating system. For Linux, read time(7)
So in practice you cannot rely on a purely software timer. The processor does have internal timers, however, so even in principle you cannot avoid timers on current processors.
You might be able, if you can put your hardware in a very controlled (thermostatic) environment, to run very limited software (an OS-like freestanding thing) sitting entirely in the processor cache and perhaps then get some determinism, but in practice current laptop or desktop (or tablet) hardware is non-deterministic and you cannot predict the time needed for a given small machine routine.
Timers are extremely useful in interesting (non-trivial) software, see e.g. J.Pitrat CAIA, a sleeping beauty blog entry for an interesting point. Also look at the many uses of watchdog timers in software (e.g. in the Parma Polyhedra Library)
Read also about Worst Case Execution Time (WCET).
So I would say that even in theory it is not possible to rely upon a purely software timer (unless of course that software uses the processor timers, which are hardware circuits). In the previous century (up to the 1980s or 1990s) hardware was much more deterministic, and the number of clock cycles or microseconds needed for each machine instruction was documented (but some instructions, e.g. division, needed a variable amount of time, depending on the actual data!).
If I am writing a multi-threaded C application on Linux (using pthreads), can I take advantage of a multi-core processor?
I mean, what should an application programmer do to take advantage of a multi-core processor? Or is it that the OS alone does so with its various scheduling algorithms?
You don't need to do anything special. Create as many threads as you want and the OS will schedule them, together with the threads from all the other processes, across all available cores.
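A minimal sketch of that idea with pthreads: the program only creates and joins threads, and placement on cores is left entirely to the scheduler. The names parallel_sum, sum_slice and MAX_THREADS are made up for this example, and each thread accumulates into a local variable before writing its result once.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16   /* illustrative upper bound, not a tuned value */

struct slice {
    const int *data;
    size_t begin, end;   /* half-open range [begin, end) */
    long sum;
};

/* Thread body: sum one slice into a local, then store the result once. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long local = 0;

    for (size_t i = s->begin; i < s->end; i++)
        local += s->data[i];
    s->sum = local;
    return NULL;
}

/* Splits data[0..n) into nthreads slices and sums them concurrently.
 * The OS decides which core runs each thread. */
long parallel_sum(const int *data, size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct slice part[MAX_THREADS];
    int started[MAX_THREADS];
    long total = 0;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    for (int t = 0; t < nthreads; t++) {
        part[t].data = data;
        part[t].begin = n * (size_t)t / (size_t)nthreads;
        part[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        part[t].sum = 0;
        started[t] = pthread_create(&tid[t], NULL, sum_slice, &part[t]) == 0;
        if (!started[t])
            sum_slice(&part[t]);   /* could not create a thread: run inline */
    }
    for (int t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += part[t].sum;
    }
    return total;
}
```

Build with the -pthread compiler flag. Whether this runs faster than a plain loop depends on the input size and the machine; for small arrays the thread creation cost dominates.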
"Take advantage of multi-core" could be understood to mean "utilize multi-core."
Or it could mean "gaining a qualitative advantage from the utilization of multi-core."
Anyone can do the former. They often end up with software that runs slower than if it were single-threaded.
The latter is an entirely different proposition. It requires writing the software such that usage of and accessing computing resources shared by all cores (bus-locking, RAM and L3 cache) are economized upon and focusing on doing as much computing as possible primarily in the individual cores and their L1 caches. The L2 cache is usually shared by two cores so it falls somewhere in-between the two categories in that yes, it is a shared resource but it is shared by just two cores and it is much faster than the resources shared by all cores.
This is at the implementation level, writing and testing the code.
The decisions made at earlier stages - specifically the system's software architecture phase - are usually much more important to the system's long-term quality and performance.
Say I am running an ARM simulator using Qemu: is it possible to find the execution time of a program as it would be on the real ARM processor? In other words, if I use functions such as gettimeofday in a program running on the simulator to check the elapsed time, will the elapsed time be given accurately, as in a cycle-accurate simulation?
An investigation into this issue at our company concluded that Qemu (for the ARM) is not cycle accurate. If I remember correctly, cycle accuracy is not a goal of Qemu; instead it aims at fast emulation. Beware also that exact timing depends on quite unpredictable things like cache hits and misses. It will also depend on the actual architecture chosen. Note that ARM is merely an instruction set IP and several different implementations exist. If in addition an operating system is emulated, things get even more unpredictable.
We use the simulator from ARM to evaluate performance, but even that one is not fully cycle accurate for the latest versions of the ARM architecture.
GEM5
I have seen a researcher use gem5 for this. This paper evaluates how accurate it is. And I've created an easy to get started setup on GitHub.
As Bryan mentioned QEMU is designed for speed: only a valid x86 API behavior must be reached, not necessarily with the right number of cycles or in the same pipeline order. This is also called functional emulation.
Furthermore, DRAM memory accesses are assumed to be immediate, and therefore it makes no sense to emulate caches either. And as we know, current CPUs are basically memory latency hiding machines.
Cycle accurate emulators on the other hand, also emulate CPU internals, and are therefore way slower.
The root of the problem is of course the under-documented performance features of processors, which vendors don't release in order to prevent intellectual property leakage.
GEM5 appears to implement a generic version of common CPU internals, so it should be more cycle accurate than functional emulators, but true cycle accurate emulation is likely impossible without insider knowledge.
Third party emulation implementors must then reverse engineer CPU performance from experiments and existing documentation.
Some of the key "internals" are cache, pipeline and branch prediction.
Related:
Question that asks how cycle accurate emulators are possible at all: How can CAS simulators like PTLsim achieve cycle accurate simulation of x86 hardware?
ARM Cycle-Accurate Simulator
I'm working on optimizing some software and want to measure its performance. I am currently simulating an ARM platform with OVP (Open Virtual Platforms), and I get statistics such as simulation time and simulated instructions.
My question is: why is the number of simulated instructions different every time I run the software (different, but close)? Should it not be the same every time? As I understand it, the software that I write in C is compiled into ARM assembler instructions, and each time the software runs, the simulated instruction count is how many times these ARM instructions are executed. Shouldn't that be the same every time?
How should I measure performance? Take 10 samples of simulated instructions and get the average?
In my experience on a real (non-simulated) ARM, if I take cycle counts for a section of code, the number of cycles varies. This is because:
There can be context switches in the middle of your executing code.
The initial state of the CPU may be different upon entering the code section. (e.g. the content of the pipeline, branch prediction etc.)
The cache state will be different on entry to the code section.
External factors such as other hardware accessing external memory.
Due to all these, taking an average (plus some other statistical measures) is really the only practical approach for real hardware and a real OS. In a good simulator some of these factors are potentially eliminated.
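As a rough sketch of "an average plus some other statistical measures", the helper below summarizes a set of measurements (cycle counts, instruction counts or times) with their mean, sample variance, minimum and maximum. The names sample_stats and summarize are made up for this example, and which measures are most useful for a given benchmark is a judgment call.

```c
#include <stddef.h>

struct sample_stats {
    double mean;
    double variance;   /* sample variance (divides by n - 1); 0 when n == 1 */
    double min;
    double max;
};

/* Hypothetical helper: summarize n >= 1 measurements stored in x.
 * Uses two passes: one for mean/min/max, one for the squared deviations. */
struct sample_stats summarize(const double *x, size_t n)
{
    struct sample_stats s = { 0.0, 0.0, x[0], x[0] };
    double sum = 0.0, sq = 0.0;

    for (size_t i = 0; i < n; i++) {
        sum += x[i];
        if (x[i] < s.min) s.min = x[i];
        if (x[i] > s.max) s.max = x[i];
    }
    s.mean = sum / (double)n;

    for (size_t i = 0; i < n; i++) {
        double d = x[i] - s.mean;
        sq += d * d;
    }
    if (n > 1)
        s.variance = sq / (double)(n - 1);
    return s;
}
```

The square root of the variance gives the standard deviation. A large gap between the minimum and the mean is often a sign that some runs were disturbed by context switches or cold caches.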
On some real chips (or if supported by the simulator) the ARM Performance Monitoring Unit can be useful.
If you're coding for the Cortex A8 this is a cool online cycle counter that can really help you squeeze more performance out of your code.