How to access the CPU's heat sensors? - c

I am working on software in which I need to access the CPU's temperature sensors and get control over them.
I don't know much about hardware interfacing; I only know how to interface with the mouse. I have googled a lot about it but failed to find any relevant information or piece of code.
I really need to add this to my software. Please guide me on how to control the sensors using C, C++ or ASM.

Without a specific kernel driver, it's difficult to query the temperature other than through WMI. Here is a piece of C++ code that does it, based on WMI's MSAcpi_ThermalZoneTemperature class:
#include <windows.h>
#include <comdef.h>    // COM support
#include <Wbemidl.h>   // WMI interfaces
#pragma comment(lib, "wbemuuid.lib")

HRESULT GetCpuTemperature(LPLONG pTemperature)
{
    if (pTemperature == NULL)
        return E_INVALIDARG;

    *pTemperature = -1;
    HRESULT ci = CoInitialize(NULL); // needs comdef.h
    HRESULT hr = CoInitializeSecurity(NULL, -1, NULL, NULL, RPC_C_AUTHN_LEVEL_DEFAULT, RPC_C_IMP_LEVEL_IMPERSONATE, NULL, EOAC_NONE, NULL);
    if (SUCCEEDED(hr))
    {
        IWbemLocator *pLocator; // needs Wbemidl.h & Wbemuuid.lib
        hr = CoCreateInstance(CLSID_WbemAdministrativeLocator, NULL, CLSCTX_INPROC_SERVER, IID_IWbemLocator, (LPVOID*)&pLocator);
        if (SUCCEEDED(hr))
        {
            IWbemServices *pServices;
            BSTR ns = SysAllocString(L"root\\WMI");
            hr = pLocator->ConnectServer(ns, NULL, NULL, NULL, 0, NULL, NULL, &pServices);
            pLocator->Release();
            SysFreeString(ns);
            if (SUCCEEDED(hr))
            {
                BSTR query = SysAllocString(L"SELECT * FROM MSAcpi_ThermalZoneTemperature");
                BSTR wql = SysAllocString(L"WQL");
                IEnumWbemClassObject *pEnum;
                hr = pServices->ExecQuery(wql, query, WBEM_FLAG_RETURN_IMMEDIATELY | WBEM_FLAG_FORWARD_ONLY, NULL, &pEnum);
                SysFreeString(wql);
                SysFreeString(query);
                pServices->Release();
                if (SUCCEEDED(hr))
                {
                    IWbemClassObject *pObject;
                    ULONG returned;
                    hr = pEnum->Next(WBEM_INFINITE, 1, &pObject, &returned);
                    pEnum->Release();
                    if (SUCCEEDED(hr))
                    {
                        BSTR temp = SysAllocString(L"CurrentTemperature");
                        VARIANT v;
                        VariantInit(&v);
                        hr = pObject->Get(temp, 0, &v, NULL, NULL);
                        pObject->Release();
                        SysFreeString(temp);
                        if (SUCCEEDED(hr))
                        {
                            *pTemperature = V_I4(&v); // reported in tenths of degrees Kelvin
                        }
                        VariantClear(&v);
                    }
                }
            }
            if (ci == S_OK)
            {
                CoUninitialize();
            }
        }
    }
    return hr;
}
and some test code:
HRESULT GetCpuTemperature(LPLONG pTemperature);

int _tmain(int argc, _TCHAR* argv[])
{
    LONG temp;
    HRESULT hr = GetCpuTemperature(&temp);
    printf("hr=0x%08x temp=%i\n", hr, temp);
}

I assume you are interested in an IA-32 (Intel Architecture, 32-bit) CPU and Microsoft Windows.
The Model Specific Register (MSR) IA32_THERM_STATUS has 7 bits encoding the "Digital Readout (bits 22:16, RO) — Digital temperature reading in 1 degree Celsius relative to the TCC activation temperature." (see "14.5.5.2 Reading the Digital Sensor" in "Intel® 64 and IA-32 Architectures - Software Developer’s Manual - Volume 3 (3A & 3B): System Programming Guide" http://www.intel.com/Assets/PDF/manual/325384.pdf).
So IA32_THERM_STATUS will not give you the "CPU temperature" but some proxy for it.
In order to read the IA32_THERM_STATUS register you use the asm instruction rdmsr; rdmsr cannot be called from user-space code, so you need some kernel-space code (maybe a device driver?).
You can also use the intrinsic __readmsr (see http://msdn.microsoft.com/en-us/library/y55zyfdx(v=VS.100).aspx), which has the same limitation: "This function is only available in kernel mode".
Every CPU core has its own Digital Thermal Sensor (DTS), so some more code is needed to get all the temperatures (maybe with the affinity mask? see the Win32 API SetThreadAffinityMask).
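As a rough sketch of the readout itself (kernel-mode only; the MSR index and bit layout come from the manual section cited above, so treat this as illustrative rather than a drop-in driver):
#include <intrin.h>                 /* __readmsr, usable only from kernel-mode code */

#define IA32_THERM_STATUS 0x19C     /* per-core thermal status MSR */

/* Returns the Digital Readout (bits 22:16): degrees Celsius below the
   TCC activation temperature, for whichever core this code runs on. */
unsigned int ReadDtsReadout(void)
{
    unsigned __int64 msr = __readmsr(IA32_THERM_STATUS);
    return (unsigned int)((msr >> 16) & 0x7F);
}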
I did some tests and actually found a correlation between the IA32_THERM_STATUS DTS readouts and the Prime95 "In-place large FFTs (maximum heat, power consumption, some RAM tested)" test. Prime95 is ftp://mersenne.org/gimps/p95v266.zip
I did not find a formula to get the "CPU temperature" (whatever that may mean) from the DTS readout.
Edit:
Quoting from an interesting post TJunction Max? #THERMTRIP? #PROCHOT? by "fgw" (December 2007):
There is no way to find the TjMax of a certain processor in any register, thus no software can read this value. What various software developers are doing is simply assuming a certain Tjunction for a certain processor and holding this information in a table within the program. Besides that, TjMax is not even the correct value they are after; in fact they are looking for the TCC activation temperature threshold. This temperature threshold is what current absolute core temperatures are calculated from. Theoretically you can say: absolute core temperature = TCC activation temperature threshold - DTS. I had to say theoretically because, as stated above, this TCC activation temperature threshold can't be read by software and has to be assumed by the programmer. In most situations (CoreTemp, Everest, ...) they assume a value of 85C or 100C depending on processor family and revision. As this TCC activation temperature threshold is calibrated during manufacturing individually per processor, it could be 83C for one processor but may be 87C for another. Taking into account the way those programs calculate core temperatures, you can figure out on your own how accurate absolute core temperatures are! Neither TjMax nor the "most wanted" TCC activation temperature threshold can be found in any public Intel documents. Following some discussions over on the Intel developer forum, Intel shows no sign of making this information available.
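In other words, the absolute temperatures such tools display come from an assumed constant, roughly like this (TJMAX_GUESS is the per-family assumption described above, not a value read from hardware):
#define TJMAX_GUESS 100   /* assumed TCC activation temperature in degrees C (85 or 100 depending on family) */

long EstimateCoreTemperature(unsigned int dtsReadout)
{
    return (long)TJMAX_GUESS - (long)dtsReadout;   /* only as accurate as the guess above */
}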

You can read it from the MSAcpi_ThermalZoneTemperature class in WMI.
Using WMI from C++ is a bit involved; see the MSDN explanation and examples.

Related

It's like OpenCL kernel instance ends abruptly

I'm new to OpenCL and I'm working on converting an existing algorithm to OpenCL.
In this process, I am experiencing a phenomenon that I cannot solve on my own, and I would like to ask for some help.
Here are the details.
My kernel is applied to images of different sizes (to be precise, to each layer of a Laplacian pyramid).
I get normal results for images of larger size such as 3072 x 3072, 1536 x 1536.
But I get abnormal results for smaller images such as 12 x 12, 6 x 6, 3 x 3, 2 x 2.
At first, I suspected that clEnqueueNDRangeKernel had a lower limit on dimensions, causing this problem. So I added a printf to the beginning of the kernel as follows, and it confirmed that all the necessary kernel instances are executed.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
}
So after wandering around for a while, I added the same printf to the end of the kernel. When I did this, it was confirmed that printf works only for some pixel positions. For the pixel positions not output by printf, the calculated values in the resulting image are incorrect, and as a result I concluded that some kernel instances terminate abnormally before completing their calculations.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
    printf("(%d, %d, %f)\n", xB, yB, result_for_this_position);
}
It seems that there is no problem with the calculation in the kernel. If I compile the kernel with optimization turned off via the -cl-opt-disable option, I get perfectly correct results for all images regardless of their size. In addition, with an NVIDIA P4000 it works correctly. Of course, in these cases I confirmed that the printf added at the bottom of the kernel works for all pixels.
Below I put additional information and attach a part of the code I wrote.
Any advice is welcomed and appreciated.
Thank you.
SDK: Intel® SDK For OpenCL™ Applications 2020.3.494
Platform: Intel(R) OpenCL HD Graphics
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };

    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
        NULL, globalSize, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        return -1;

    // I tried with this but it didn't make any difference
    //std::this_thread::sleep_for(std::chrono::seconds(1));

    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;

    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
        0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
        vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
I also tried it with an event, but it behaves the same way.
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };

    cl_event event;
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
    if (CL_SUCCESS != err)
        return -1;

    err = clWaitForEvents(1, &event);
    if (CL_SUCCESS != err)
        return -1;

    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;

    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
        0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
        vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
/////// Added contents ////////////////////////////////////////////
Would you please take a look at this issue from the perspective of clFinish or clWaitForEvents? Am I missing something in this regard?
Sometimes I get fewer correct values and sometimes I get more correct values.
To be more specific, let's say I'm applying the kernel to a 12 x 12 image, so there are 144 pixel values.
Sometimes I get correct values for 56 pixels.
Sometimes I get correct values for 89 pixels.
Some other time I get correct values for n (fewer than 144) pixels.
If I turn off the OpenCL optimization when compiling the kernel by specifying the -cl-opt-disable option, I get correct values for all 144 pixels.
The other thing that makes me think the calculation code is correct is that the same OpenCL code, with no modification other than the device-selection code, runs perfectly correctly on an NVIDIA P4000.
At first, I was really suspicious of the calculation code, but the more I inspect it, the more confident I am that there is nothing wrong with it.
I know there is still a chance that there is an error in the calculation code such that an exception happens somewhere during the calculations.
I have plain C++ code for the same task, and I'm comparing the results from the two.
/////// Another added contents ////////////////////////////////////////////
I made a minimal example (aside from the project template) to reproduce the phenomenon.
What's even odder is that if I install the "Intel® Distribution for GDB Target" I get correct results.
https://github.com/heysweetethan/GPUOpenCLProjectforWindows
OpenCL kernels run threads in parallel on a specified global range, which in your case is the image size, with one thread per pixel.
The threads are grouped into workgroups. The workgroup size should be a multiple of 32, ideally 64 (8x8 pixels in 2D) to make full use of the hardware. These workgroups cannot be split, so the global range must be a multiple of the workgroup size.
What happens if the global range is not evenly divisible by the workgroup size, or is smaller than the workgroup size, like 3x3 pixels? Then the last workgroup is still executed with all 8x8 threads. The first 3x3 threads work on valid data in memory, but all the other threads read/write unallocated memory. This can cause undefined behavior or even crashes.
If you cannot make the global size a multiple of the workgroup size, there is still a solution: a guard clause at the very beginning of the kernel:
if (xB >= xImage || yB >= yImage) return;
This ensures that no threads access unallocated memory.
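On the host side, the usual companion to that guard clause is to pass an explicit local size and round the global range up to a multiple of it. A minimal sketch (the queue, kernel and image-size variable names here are placeholders, not taken from your code):
const size_t local[2]  = { 8, 8 };                            // 64 threads per workgroup
const size_t global[2] = {
    ((imageWidth  + local[0] - 1) / local[0]) * local[0],     // round up to a multiple of 8
    ((imageHeight + local[1] - 1) / local[1]) * local[1]
};
// the padding threads are skipped by the guard clause inside the kernel
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);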
As you don't supply a complete reproducible code sample, here's a loose collection of comments/suggestions/advice:
1. printf in kernel code
Don't rely on large amounts of printf output from kernels. It's necessarily buffered, and some implementations don't guarantee delivery of messages - often there's a fixed size buffer and when that's full, messages are dropped.
Note that your post-calculation printf increases the total amount of output, for example.
The reliable way to check or print kernel output is to write it to a global buffer and print it in host code. For example, if you want to verify each work-item reaches a specific point in the code, consider creating a zero-initialised global buffer where you can set a flag in each work-item.
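A minimal sketch of that idea (buffer and variable names are invented for illustration; the kernel would set flags[yB * width + xB] = 1 as its very last statement):
// needs <vector> and <algorithm>; one byte per work item, zero-initialised on the host
std::vector<cl_uchar> flags(width * height, 0);
cl_mem memFlags = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 flags.size(), flags.data(), &err);
// ... pass memFlags as an extra kernel argument, enqueue the kernel, clFinish ...
err = clEnqueueReadBuffer(queue, memFlags, CL_TRUE, 0, flags.size(), flags.data(),
                          0, nullptr, nullptr);
auto missing = std::count(flags.begin(), flags.end(), 0);   // work items that never reached the end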
2. Events
As for events, flushing, etc.: your clFinish call certainly should suffice to ensure everything has executed. If anything it's overkill, but especially while you're debugging other issues it's a good way to rule out queueing issues.
The clWaitForEvents() call preceding it is not a great idea, as you haven't called clFlush() after queueing the kernel whose event you're waiting for. It's fairly minor, but it could be a problem on some implementations.
3. Small image sizes
You've not actually posted any of the code that deals with the images themselves, so I can only guess at potential issues there. It looks like you're not using workgroups, so you shouldn't be running into the usual multiple-of-group-size pitfall.
However, are you sure you're loading the source data correctly, and you're correctly indexing into it? There could be all sorts of pitfalls here, from alignment of pixel rows in the source data, enqueueing the kernel before filling the source buffers has completed, creating source buffers with the wrong flags, etc.
So in summary, I'd suggest:
Don't believe in-kernel-printf if something strange is going on. Switch to something more reliable for observing the behaviour of your kernel code.
At minimum, post all your OpenCL API calling host code. Buffer creation, setting arguments, etc. Any fragments of kernel code accessing the buffers are probably not a bad idea either.
Thanks to a person from the Intel community, I was able to understand the phenomenon.
Briefly, if a single kernel instance spends too much time, 'Timeout Detection and Recovery (TDR)' stops it.
For more information about this, you can refer to the following:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/
https://community.intel.com/t5/GPU-Compute-Software/It-s-like-OpenCL-kernel-instance-ends-abruptly/m-p/1386883#M478
I appreciate all the people who gave me advice.

How does Qemu emulate PCIe devices?

I'm writing an open-source document about QEMU internals, so if you help me you're helping the growth of the QEMU project.
The closest answer I found was: In which conditions the ioctl KVM_RUN returns?
This is the thread loop for a single CPU running on KVM:
static void *qemu_kvm_cpu_thread_fn(void *arg)
{
    CPUState *cpu = arg;
    int r;

    rcu_register_thread();

    qemu_mutex_lock_iothread();
    qemu_thread_get_self(cpu->thread);
    cpu->thread_id = qemu_get_thread_id();
    cpu->can_do_io = 1;
    current_cpu = cpu;

    r = kvm_init_vcpu(cpu);
    if (r < 0) {
        error_report("kvm_init_vcpu failed: %s", strerror(-r));
        exit(1);
    }

    kvm_init_cpu_signals(cpu);

    /* signal CPU creation */
    cpu->created = true;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_guest_random_seed_thread_part2(cpu->random_seed);

    do {
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu);
            if (r == EXCP_DEBUG) {
                cpu_handle_guest_debug(cpu);
            }
        }
        qemu_wait_io_event(cpu);
    } while (!cpu->unplug || cpu_can_run(cpu));

    qemu_kvm_destroy_vcpu(cpu);
    cpu->created = false;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_mutex_unlock_iothread();
    rcu_unregister_thread();
    return NULL;
}
You can see here:
do {
    if (cpu_can_run(cpu)) {
        r = kvm_cpu_exec(cpu);
        if (r == EXCP_DEBUG) {
            cpu_handle_guest_debug(cpu);
        }
    }
    qemu_wait_io_event(cpu);
} while (!cpu->unplug || cpu_can_run(cpu));
that every time KVM returns, it gives QEMU an opportunity to emulate things. I suppose that when the guest kernel tries to access a PCIe device, KVM on the host returns. I don't know how KVM knows when to return. Maybe KVM maintains the addresses of the PCIe device and tells Intel's VT-d or AMD's I/O virtualization which addresses should generate an exception. Can someone clarify this?
Well, by the look of qemu_kvm_cpu_thread_fn, the only place where a PCIe access could be emulated is qemu_wait_io_event(cpu), which is defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus.c#L1266. It calls qemu_wait_io_event_common, defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus.c#L1241, which in turn calls process_queued_cpu_work, defined here: https://github.com/qemu/qemu/blob/stable-4.2/cpus-common.c#L309
Let's look at the code which executes the queued functions:
while (cpu->queued_work_first != NULL) {
    wi = cpu->queued_work_first;
    cpu->queued_work_first = wi->next;
    if (!cpu->queued_work_first) {
        cpu->queued_work_last = NULL;
    }
    qemu_mutex_unlock(&cpu->work_mutex);
    if (wi->exclusive) {
        /* Running work items outside the BQL avoids the following deadlock:
         * 1) start_exclusive() is called with the BQL taken while another
         * CPU is running; 2) cpu_exec in the other CPU tries to takes the
         * BQL, so it goes to sleep; start_exclusive() is sleeping too, so
         * neither CPU can proceed.
         */
        qemu_mutex_unlock_iothread();
        start_exclusive();
        wi->func(cpu, wi->data);
It looks like the only power the VCPU thread qemu_kvm_cpu_thread_fn has when KVM returns is to execute the queued functions:
wi->func(cpu, wi->data);
This means that a PCIe device would have to constantly queue itself as a function for QEMU to execute. I don't see how that would work.
The functions that are able to queue work on this CPU have run_on_cpu in their names. By searching for it in VSCode I found some functions that queue work, but none related to PCIe or even emulation. The nicest function I found was this one, which apparently patches instructions: https://github.com/qemu/qemu/blob/stable-4.2/hw/i386/kvmvapic.c#L446. Nice, I wanted to know that as well.
Device emulation (of all devices, not just PCI) under KVM gets handled by the "case KVM_EXIT_IO" (for x86-style IO ports) and "case KVM_EXIT_MMIO" (for memory mapped IO including PCI) in the "switch (run->exit_reason)" inside kvm_cpu_exec(). qemu_wait_io_event() is unrelated.
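Paraphrasing the relevant part of kvm_cpu_exec() (heavily simplified, argument lists shortened and error handling omitted; see accel/kvm/kvm-all.c in the QEMU tree for the real code):
do {
    /* enter the guest; returns when the guest does something QEMU must emulate */
    run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);

    switch (run->exit_reason) {
    case KVM_EXIT_IO:                      /* x86 port I/O (in/out instructions) */
        kvm_handle_io(run->io.port, ...);  /* routed to whatever device registered that port */
        ret = 0;
        break;
    case KVM_EXIT_MMIO:                    /* memory-mapped I/O, e.g. an access to a PCI BAR */
        address_space_rw(&address_space_memory,
                         run->mmio.phys_addr, attrs,
                         run->mmio.data, run->mmio.len,
                         run->mmio.is_write);
        /* address_space_rw() finds the MemoryRegion that covers the address
           and calls the owning device's read/write callback */
        ret = 0;
        break;
    /* ... other exit reasons ... */
    }
} while (ret == 0);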
Want to know how execution gets to "emulate a register read on a PCI device"? Run QEMU under gdb, set a breakpoint on, say, the register read/write function for the Ethernet PCI card you're using, and then, when you get dropped into the debugger, look at the stack backtrace. (Configure QEMU with --enable-debug to get better debug info for this kind of thing.)
PS: If you're examining QEMU internals for educational purposes, you'd be better to use the current code, not a year-old release of it.

PAPI_num_counters() shows the system doesn't have available counters

I have a question regarding PAPI (Performance Application Programming Interface). I downloaded and installed the PAPI library, but I'm still not sure how to use it correctly and what additional things I need to make it work. I am trying to use it in C. I have this simple program:
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int retval;

    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT && retval > 0) {
        printf("PAPI error: 1\n");
        exit(1);
    }
    if (retval < 0)
        printf("PAPI error: 2\n");

    retval = PAPI_is_initialized();
    if (retval != PAPI_LOW_LEVEL_INITED)
        printf("PAPI error: 2\n");

    int num_hwcntrs = 0;
    if ((num_hwcntrs = PAPI_num_counters()) <= PAPI_OK)
        printf("This system has %d available counters. \n", num_hwcntrs);

    return 0;
}
I have included papi.h and I am compiling with gcc and the -lpapi flag. I added the library to the path so it compiles and runs, but as a result I get this:
This system has 0 available counters.
Though initialization seems to work, as it doesn't give an error code.
Any advice or suggestion would be helpful to determine what I have not done right or have missed to run it correctly. I mean, I should have available counters on my system; more precisely, I need cache miss and cache hit counters.
I tried to count the available counters after running this other simple program, and it gave error code -25:
int numEvents = 2;
long long values[2];
int events[2] = {PAPI_L3_TCA, PAPI_L3_TCM};

printf("PAPI error: %d\n", PAPI_start_counters(events, numEvents));
UPDATE: I just tried to check the hardware information from the terminal with the command papi_avail | more, and I got this:
Available PAPI preset and user defined events plus hardware information.
PAPI version : 5.7.0.0
Operating system : Linux 4.15.0-45-generic
Vendor string and code : GenuineIntel (1, 0x1)
Model string and code : Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz (78, 0x4e)
CPU revision : 3.000000
CPUID : Family/Model/Stepping 6/78/3, 0x06/0x4e/0x03
CPU Max MHz : 2800
CPU Min MHz : 400
Total cores : 4
SMT threads per core : 2
Cores per socket : 2
Sockets : 1
Cores per NUMA region : 4
NUMA regions : 1
Running in a VM : no
Number Hardware Counters : 0
Max Multiplex Counters : 384
Fast counter read (rdpmc): no
PAPI Preset Events
Name Code Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000 No No Level 1 data cache misses
PAPI_L1_ICM 0x80000001 No No Level 1 instruction cache misses
PAPI_L2_DCM 0x80000002 No No Level 2 data cache misses
PAPI_L2_ICM 0x80000003 No No Level 2 instruction cache misses
.......
So because "Number Hardware Counters" is 0, I can't use this tool to count cache misses with PAPI's preset events? Is there any configuration that could help, or should I forget about it until I change my laptop?

How can I create a truly random number in C? (IDE: MPLAB)

I am making a game on a PIC18F2550 and I need to create a random number between 1 and 4.
I've found out already that the rand() function sucks for truly random numbers.
I've tried srand(time(NULL)); this is the main file:
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

void main(void) {
    srand(time(NULL));
    init();
    while(timed_to_1ms()) {
        fsm_game();
    }
}
this is the fsm_game.c file
void randomSpawn(int time2) {
    if(counter%time2==0) {
        int x = rand()%4 +1; // Makes number between 1 and 4
    }
}
When I try to build this I get an error saying:
":0: error: (499) undefined symbol:
_time(dist/default/production\Dontcrash.production.obj) "
I think the problem may be that a microprocessor doesn't 'know the time', so if that's the case, what can I do to solve this problem?
To initialize the random number generator you could read the voltage present on a floating pin with the PIC's ADC, and set the seed with srand().
Additionally, you could save the seed to EEPROM every time the program starts, read the previous seed value back from EEPROM, and combine it with the ADC value to make things less predictable. I think this would be good enough for a game; otherwise it would be too crude, I guess.
unsigned int seed = read_seed_from_adc();
seed ^= read_seed_from_eeprom(); /* use something else than XOR here */
srand(seed);
write_seed_to_eeprom(seed);
I think the problem may be in the fact that a microProcessor doesn't
'know the time', so if that's the case, what can I do to solve this
problem?
Time is usually measured with an RTC (real-time clock) on a microcontroller, so it depends on your hardware whether an RTC can be used (the RTC usually needs a quartz crystal and a backup battery to keep it running all the time; some micros use an external RTC). Since most often only a small C library is used on microcontrollers, time() usually will not be available, and you would need to read out the RTC registers yourself. It could also be used to initialize the PRNG.
Since your PIC doesn't have a hardware RNG, you'll have to settle for a software PRNG. Yes, the one in many C libraries is not very good (though many modern compilers are getting better). For small embedded systems, I like to use Marsaglia's XOR shift:
#include <stdint.h>

static uint32_t rng_state;   /* must be seeded to a non-zero value */

uint32_t rng_next(void) {
    rng_state ^= (rng_state << 13);
    rng_state ^= (rng_state >> 17);
    rng_state ^= (rng_state << 5);
    return rng_state - 1;
}
You seed the above by just setting rng_state to an initial value (other than 0).
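Putting it together with the ADC idea from the other answer (read_seed_from_adc() is a placeholder for whatever noise source you end up using):
rng_state = read_seed_from_adc();          /* any non-deterministic source will do */
if (rng_state == 0)
    rng_state = 1;                         /* xorshift must never be seeded with 0 */

int spawn = (int)(rng_next() % 4) + 1;     /* a number between 1 and 4, as in randomSpawn() */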
"Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin." - John von Neumann
There are sources of true random numbers on the internet, such as LavaRnd.
It is also possible to get some physical random input from your computer's microphone socket, as long as there is no microphone connected. That will deliver thermal noise which can be used as a basis for a TRNG. It is best to gather a large quantity of the data (at least fifty times as many bits as you need, or more for security) and extract the entropy with a good-quality cryptographic hash function. You will need to experiment, depending on how much entropy your equipment delivers.
More complex would be a full implementation of Fortuna or similar, which gather entropy from a number of sources.
Most expensive would be to purchase a true random source on a card and install that.

Time measurement for getting speedup of OpenCL code on Intel HD Graphics vs C host code

I'm new to OpenCL and want to compare the performance gain of OpenCL kernels against C code.
Can someone please elaborate on which of these two methods is better/correct for profiling OpenCL code when comparing performance with the C reference code:
Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a getTime function):
ret |= clFinish(command_queue); //Empty the queue
getTime(&begin);
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL); //Profiling Disabled.
ret |= clFinish(command_queue);
getTime(&end);
g_NDRangePureExecTimeSec = elapsed_time(&begin, &end); //Performs: (end-begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
Using events profiling:
ret = clEnqueueMarker(command_queue, &evt1);
//Empty the Queue
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
ret |= clWaitForEvents(1, &evt1);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_long), &begin, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_long), &end, NULL);
g_NDRangePureExecTimeSec = (cl_double)(end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE); //nSec to Sec
ret |= clReleaseEvent(evt1);
Furthermore, I'm not using a dedicated graphics card; I'm utilizing Intel HD 4600 integrated graphics for the following piece of OpenCL code:
__kernel void filter_rows(__global float *ip_img,\
__global float *op_img, \
int width, int height, \
int pitch,int N, \
__constant float *W)
{
__private int i=get_global_id(0);
__private int j=get_global_id(1);
__private int k;
__private float a;
__private int image_offset = N*pitch +N;
__private int curr_pix = j*pitch + i +image_offset;
// apply filter
a = ip_img[curr_pix-8] * W[0 ];
a += ip_img[curr_pix-7] * W[1 ];
a += ip_img[curr_pix-6] * W[2 ];
a += ip_img[curr_pix-5] * W[3 ];
a += ip_img[curr_pix-4] * W[4 ];
a += ip_img[curr_pix-3] * W[5 ];
a += ip_img[curr_pix-2] * W[6 ];
a += ip_img[curr_pix-1] * W[7 ];
a += ip_img[curr_pix-0] * W[8 ];
a += ip_img[curr_pix+1] * W[9 ];
a += ip_img[curr_pix+2] * W[10];
a += ip_img[curr_pix+3] * W[11];
a += ip_img[curr_pix+4] * W[12];
a += ip_img[curr_pix+5] * W[13];
a += ip_img[curr_pix+6] * W[14];
a += ip_img[curr_pix+7] * W[15];
a += ip_img[curr_pix+8] * W[16];
// write output
op_img[curr_pix] = (float)a;
}
And similar code for column-wise processing. I'm observing a gain (OpenCL vs. optimized, vectorized C reference) of around 11x using method 1 and around 16x using method 2.
However, I've noticed people claiming gains on the order of 200-300x when using dedicated graphics cards.
So my questions are:
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform Intel HD graphics?
Can I map the warp and thread concepts from CUDA to Intel HD graphics (i.e. the number of threads executing in parallel)?
I'm observing gain around 11x using method 1 and around 16x using method 2.
This looks suspicious. You are using high-resolution counters in both cases. I think that your input size is too small and generates high run-to-run variation. The event-based measurement is slightly more accurate as it does not include some OS + application overhead in the measurements; however, the difference is very small. But when the kernel duration is very small, the choice of measurement methodology does matter.
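One more thing worth double-checking with the event-based method: CL_PROFILING_COMMAND_START/END are only valid if the command queue was created with profiling enabled, and the returned values are in nanoseconds. A sketch (OpenCL 1.x-style call, variable names assumed):
cl_int err;
cl_command_queue command_queue =
    clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);

/* later: seconds = (cl_double)(end - begin) * 1e-9, since the event counters are in ns */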
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform Intel HD graphics?
It depends very much on the card's capabilities. While Intel HD Graphics is a good GPU for office work, movies and some games, it cannot compare to a high-end dedicated graphics card. Consider that such a card has a much higher power envelope, a much larger die area and far more computing resources. It's expected that dedicated cards will show greater speedups. Your GPU has around 600 GFLOPS peak performance, while a discrete card can reach 3000 GFLOPS, so you could roughly expect your GPU to be about 5 times slower than a discrete one. However, pay attention to what people are comparing when they claim 300x speedups. If they compare against an old-generation CPU, they might be right, but a new-generation i7 CPU can really close the gap.
Can I map the warp and thread concepts from CUDA to Intel HD graphics (i.e. the number of threads executing in parallel)?
Intel HD graphics does not have warps. Warps are closely tied to CUDA hardware: basically, a warp is the same instruction, dispatched by a warp scheduler, executing on 32 CUDA cores. However, OpenCL is very similar to CUDA, so you can launch a high number of threads that will execute in parallel on your graphics card's compute units. But when programming your integrated GPU, it is best to forget about warps and instead know how many compute units your card has. Your code will run on several threads in parallel on your compute units. In other words, your code will look very similar to CUDA code, but it will be parallelized depending on the compute units available in the integrated GPU. Each compute unit can then parallelize execution in a SIMD fashion, for example. Bear in mind that the optimization techniques for CUDA are different from the optimization techniques for Intel HD graphics.
You can't directly compare performance across vendors; a basic comparison and expectation can be made using the number of parallel threads running multiplied by their frequency.
You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EUs), each EU runs 7 hardware threads, and each thread is capable of executing SIMD8, SIMD16 or SIMD32 instructions, with each SIMD lane corresponding to one work item (WI) in OpenCL terms.
SIMD16 is typical for simple kernels like the one you are trying to optimize, so we are talking about 20*7*16 = 2240 work items executing in parallel. Keep in mind that each work item is capable of processing vector data types, e.g. float4, so you should definitely try rewriting your kernel to take advantage of them. I hope this also helps you compare with NVIDIA's offerings.
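As a tiny illustration of that float4 idiom (not a rewrite of your filter, just the general shape of a vectorized work item):
__kernel void scale4(__global const float4 *in, __global float4 *out, const float s)
{
    int i = get_global_id(0);
    out[i] = in[i] * s;   // one work item processes four floats at once
}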
