How to find Cycles per instruction of an i7 processor - intel-vtune

I was trying to see the CPI value of a program on an i7 processor with VTune Amplifier XE 2011 (on Win8 x64).
According to the tutorial, the following viewpoints
Hardware Event Counts
Hardware Event Sample Counts
Lightweight Hotspots
Hardware Issues
will show the CPI value. But in my version I have only Lightweight Hotspots, and when I try to run that analysis it gives the message "unsupported architecture type".
Can anyone tell me:
how can I see the CPI of a program on an i7, x64, Win8, using VTune 2011? If that is impossible, why?
or
which version (or any other way) can measure the CPI on the above system?

You need to upgrade to the latest VTune version (2015 Update 1) to get CPI for new CPUs.

The 2011 version is out of life for technical support, and it is too old anyway: it supports Core(TM) 2 processors. The 2013 version might support the i7 processor; try the latest trial version.

"Cycles per instruction" has not been a simple metric since instruction pipelining and superscalar architectures were introduced; as a single number, the concept loses much of its meaning.
As an analogy, consider Ford's car factory after the introduction of the assembly line, except that it is making 20 different models of cars with a huge variety of complexity, and you are trying to determine how many workers are needed to make each car: you would be making huge (false) assumptions about the nature of execution.

Related

How to use qemu to do profiling on an algorithm

I have a program that runs well on Ubuntu now. The program is written purely in C, and it will eventually run on an embedded processor. I would like to know its execution speed on different targets, such as Cortex-M3, Cortex-M4 or an A-series core. As there is quite a lot of double-precision arithmetic, the difference should be obvious. Currently, my idea is to use qemu to count the instructions executed for some set of data. As the program only does data processing, the only required resource should be RAM.
I don't need a very accurate result, as it will only serve as a guide to choose a CPU. Is there some easy guide for the task? I have little experience with qemu. I saw there are two ways to invoke qemu: qemu-system-arm and qemu-user. I guess the most accurate simulation result would come from qemu-system-arm. What's more, the Cortex-M series should not support Linux due to the lack of an MMU, right?
There's not a lot out there on how to do this because it is in general pretty difficult to do profiling of guest code on an emulated CPU/system and get from that useful information about performance on real hardware. This is because performance on real hardware is typically strongly dependent on events which most emulation (and in particular QEMU) does not model, such as:
branch mispredictions
cache misses
TLB misses
memory latency
as well as (usually less significantly than the above) differences in number of cycles between instructions -- for instance on the Cortex-M4 VMUL.F32 is 1 cycle but VDIV.F32 is 14.
For a Cortex-M CPU the hardware is simple enough (ie no cache, no MMU) that a simple instruction count may not be too far out from real-world performance, but for an A-class core instruction count alone is likely to be highly misleading.
The other approach people sometimes want to take is to measure run-time under a model; this can be even worse than counting instructions, because some things that are very fast on real hardware are very slow in an emulator (eg floating point instructions), and because the JIT process introduces extra overhead at unpredictable times.
On top of the conceptual difficulties, QEMU is not currently a very helpful environment for obtaining information like instruction counts. You can probably do something with the TCG plugin API (if you're lucky one of the example plugins may be sufficient).
In summary, if you want to know the performance of a piece of code on specific hardware, the easiest and most accurate approach is to run and profile the code on the real hardware.
I post my solution here, in case someone just wants a rough estimate, as I did.
Eclipse Embedded CDT provides a good starting point. You can start with a simple LED blink template (it supports soft-FP arithmetic only for now). You can start qemu with the built embedded program, and a picture of the STM32F407 board will appear; the LED in the picture blinks as the program runs.
The key point is that I can use the script from Counting machine instructions using gdb to count instructions on the qemu target.
However, Eclipse Embedded CDT seems to get stuck when some library code is executed. Here is my workaround: start qemu manually (the command line can be obtained with 'ps' while Eclipse has qemu running):
In the first terminal:
qemu-system-gnuarmeclipse --verbose --verbose --board STM32F4-Discovery --mcu STM32F407VG --gdb tcp::1235 -d unimp,guest_errors --semihosting-config enable=on,target=native --semihosting-cmdline blinky_c
Then in the second terminal:
arm-none-eabi-gdb blinky_c.elf
and below is the command history I entered in the gdb terminal
(gdb) show commands
1 target remote :1235
2 load
3 info register
4 set $sp = 0x20020000
5 info register
6 b main
7 c
Then you can use the gdb to count instruction as in Counting machine instructions using gdb.
One big problem with this method is that it is really slow, since gdb uses stepi to step through all the code being counted before producing a result. It took me around 3 hours in my Ubuntu VMware machine to count 5.5M executed instructions.
One thing that you can do is use a simulation setup like the one used in this sample: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/src/main.c
This may look like an ordinary embedded application, but the data structure vdev actually resides in a different application running on the computer (in this case a DC motor simulator), and all reads and writes to it are automatically done over the network by the simulator that runs this. The platform definition is here: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/boards/custom_board.repl This is how the structure is mapped.
From here it is not hard to implement advanced memory profiling by directly capturing reads and writes from the simulated application (which in this case is compiled for STM32 ARM).

Choosing WebView2 Fixed Version for Distribution

We are moving from CefSharp to WebView2. Because of certain requirements, we are thinking of going ahead with the Fixed Version, where the updates can be controlled by us. Now, on Microsoft's official distribution page there are 3 options available: x86, x64 and ARM64. We have users who use different combinations of OS and CPU architecture; one example is 32-bit Windows 10 Pro running on a 64-bit Intel processor. Here is where I am confused: which one should we ship to agents, depending on their combination of OS and CPU architecture? Can anybody help here? Here are the combinations -
I have not tried it out, so this may be a blunt question: can the x86 distributable be a safe bet for all these combinations? If yes, then what are the trade-offs?
I think the x86 distribution is safe. If a 32-bit OS is running, the entire system acts as purely 32-bit: it's impossible to use any 64-bit piece of code, so 64-bit applications won't work. You can also check this thread: if you want to run a 64-bit app on a 32-bit OS, you have to install a VM or something, and I think that's not what you want.
In conclusion, I think you should choose the WebView2 Fixed Version according to the OS architecture, not the CPU: x86 for a 32-bit OS (even on a 64-bit processor), x64 or ARM64 to match a 64-bit OS.

What's the advantage of running OpenCL code on a CPU? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am learning OpenCL programming and am noticing something weird.
Namely, when I list all OpenCL enabled devices on my machine (Macbook Pro), I get the following list:
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Iris Pro
GeForce GT 750M
The first is my CPU, the second is the onboard graphics solution by Intel and the third is my dedicated graphics card.
Research shows that Intel has made their hardware OpenCL compatible so that I can tap into the power of the onboard graphics unit. That would be the Iris Pro.
With that in mind, what is the purpose of the CPU being OpenCL compatible? Is it merely for convenience so that kernels can be run on a CPU as backup should no other cards be found or is there any kind of speed advantage when running code as OpenCL kernels instead of regular (C, well threaded) programs on top of a CPU?
See https://software.intel.com/sites/default/files/m/d/4/1/d/8/Writing_Optimal_OpenCL_28tm_29_Code_with_Intel_28R_29_OpenCL_SDK.pdf for basic info.
Basically, the Intel OpenCL compiler performs horizontal autovectorization for certain types of kernels. That means that with 128-bit SSE you get 4 float work-items (8 with 256-bit AVX) running in parallel in a single core, in a similar fashion to how an Nvidia GPU runs 32 threads on a single 32-wide SIMD unit.
There are two major benefits to this approach. First, what happens if a future generation doubles the vector width? You instantly get autovectorization across the wider unit when you run on that CPU; no need to recompile your code. Second, it's far easier to write an OpenCL kernel that autovectorizes well than to write it in assembly or C and get your compiler to produce equally efficient code.
As OpenCL implementations mature, it's possible to achieve good levels of performance portability for your kernels across a wide range of devices. Some recent work in my research group shows that, in some cases, OpenCL codes achieve a similar fraction of hardware peak performance on the CPU and the GPU. On the CPU, the OpenCL kernels were being very effectively auto-vectorised by Intel's OpenCL CPU implementation. On the GPU, efficient code was being generated for HPC and desktop devices from Nvidia (whose OpenCL still works surprisingly well) and AMD.
If you want to develop your OpenCL code anyway in order to exploit the GPU, then you're often getting a fast multi-core+SIMD version "for free" by running the same code on the CPU.
For two recent papers from my group detailing the performance portability results we've achieved across four different real applications with OpenCL, see:
"On the performance portability of structured grid codes on many-core computer architectures", S.N. McIntosh-Smith, M. Boulton, D. Curran and J.R. Price. ISC, Leipzig, June 2014. DOI: 10.1007/978-3-319-07518-1_4
"High Performance in silico Virtual Drug Screening on Many-Core Processors", S. McIntosh-Smith, J. Price, R.B. Sessions, A.A. Ibarra, IJHPCA 2014. DOI: 10.1177/1094342014528252
I have considered this for a while. You can get most of the advantages of OpenCL for the CPU without using OpenCL and without too much difficulty in C++. To do this you need:
Something for multi-threading - I use OpenMP for this
A SIMD library - I use Agner Fog's Vector Class Library (VCL) for this, which covers SSE2 through AVX-512.
A SIMD math library. Once again I use Agner Fog's VCL for this.
A CPU dispatcher. Agner Fog's VCL has an example to do this.
Using the CPU dispatcher you determine what hardware is available and choose the best code path based on the hardware. This provides one of the advantages of OpenCL.
This gives you most of the advantages of OpenCL on the CPU without all its disadvantages. You never have to worry that a vendor stops supporting a driver. Nvidia has only a minimal amount of support for OpenCL - including several-year-old bugs it will likely never fix (on which I wasted too much time). Intel only has Iris Pro OpenCL drivers for Windows. With my suggested method your kernels can use all C++ features, including templates, instead of OpenCL's restricted and extended version of C (though I do like the extensions). You can be sure your code does what you want this way and are not at the whim of some device driver.
The one disadvantage with my suggested method is that you can't just install a new driver and have it optimize for new hardware. However, the VCL already supports AVX512 so it's already built for hardware that is not out yet and won't be superseded for several years. And in any case to get the most use of your hardware you will almost certainly have to rewrite your kernel in OpenCL for that hardware - a new driver can only help so much.
More info on the SIMD math library: you could use Intel's expensive closed-source SVML (which is what the Intel OpenCL driver uses - search for svml after you install the Intel OpenCL drivers; don't confuse the SDK with the drivers). Or you could use AMD's free but closed-source LibM. However, neither of these works well on the competitor's processors. Agner Fog's VCL works well on both vendors' processors, is open source, and free.

Can the announced Tegra K1 be a contender against x86 and x64 chips in supercomputing applications?

To clarify, can this RISC-based processor (the Tegra K1) be used without significant changes to today's supercomputer programs, and perhaps be a game changer because of its power, size, cost, and energy usage? I know it's going up against some x64 or x86 processors. Can the code used for current supercomputers be easily converted to code that will run well on these mobile chips? Thanks.
Can the code used for current supercomputers be easily converted to code that will run well on these Mobile chips?
It depends what you call "supercomputer code". Usually supercomputers run high-level application code (usually fully compiled code, like C++, sometimes VM-dependent code, like Java) on top of lower-level code and technologies such as OpenCL or CUDA for accelerators, or MPICH for communication between nodes.
All these technologies have ARM implementations, so the real task is to make the application code ARM-compatible. This is usually straightforward, as code written in a high-level language is mostly hardware-independent. So the short answer is: yes.
However, what may be more complicated is to scale this code to these new processors.
Tegra K1 is nothing like the GPUs embedded in supercomputers. It has far less memory, runs slightly slower and has only 192 cores.
Its price and power consumption make it possible, however, to build supercomputers with hundreds of them inside.
So code which has been written for traditional supercomputers (with a few high-performance GPUs embedded) will not reach the peak performance of these 'new' supercomputers (built with a lot of cheap and weak GPUs). There will be a price to pay to port existing code to these new architectures.
For modern supercomputing needs, you'd need to ask whether a processor performs well for the energy it consumes. Intel's current architecture combined with GPUs fulfils those needs, and the Tegra architecture does not perform as well as Intel processors in terms of power-performance.
The question is: should it? Intel keeps proving that ARM is inferior here, and the only factor speaking for RISC-based processors is their price, which I highly doubt is a concern when building a supercomputer.

Intel Core for a C programmer

First Question
From a C programmer's point of view, what are the differences between Intel Core processors and their AMD equivalents ?
Related Second Question
I think that there are some instructions that differentiate the Intel Core from other processors, and vice versa. How important are those instructions? Are they taken into account by compilers? Would performance be better if there were a special Intel compiler only for the Core family?
If you are programming user-level code and most driver code, there aren't many differences (one exception is the availability of certain instruction sets - which may differ for different processors, see below). If you are writing kernel code dealing with CPU-specific features (profiling using internal counters, memory management, power management, virtualization), the architectures differ in implementation, sometimes greatly.
Most compilers do not automatically take advantage of SSE instructions. However, most do provide SSE-based intrinsics, which will allow you to write SSE-aware code. The subset of all SSE levels available differs for each processor architecture and maker.
See this page for instruction listings. Follow the links to see which architectures the specific instructions are supported on. Also, read the Intel and AMD architecture development manuals for exact details about support and implementation of any and all instruction sets.
First Question: From a C programmer's point of view, what are the differences between Intel Core processors and their AMD equivalents?
The most significant differences are likely to show up only in highly specialized code that makes use of new generation instructions, such as vector maths, parallelization, SSE.
Would performance be better if there were a special Intel compiler only for the Core family?
Not sure if you are aware of it, but there's a compiler specifically for Intel cores: icc. It's generally considered to be the best compiler from an optimization point of view.
You might want to check out its wikipedia article.
According to the Intel Core Wikipedia article, there were notable improvements to the SSE, SSE2, and SSE3 instructions. These instructions are SIMD (single instruction, multiple data), meaning that they are designed to apply a single arithmetic operation to a vector of values. They are certainly important, and have been exploited by compilers such as GCC for quite a while.
Of course, recent AMD processors have adopted the newest Intel instructions, and vice-versa. This is an ongoing trend.
