Controlling OS X volume in Snow Leopard - osx-snow-leopard

This is a follow-up to Controlling volume of running applications in Mac OS X via Objective-C, which explains how to set the volume on 10.5 or earlier. The AudioXXXXXGetProperty and AudioXXXXXSetProperty (and related) functions are deprecated in 10.6, per Technical Note TN2223.
I'm not an expert in OS X or CoreAudio programming, so I'm hoping someone has muddled through what's required in Snow Leopard and can help me (and others) out here.

Here's an example to set volume to 50%:
Float32 volume = 0.5;
UInt32 size = sizeof(Float32);
AudioObjectPropertyAddress address = {
    kAudioDevicePropertyVolumeScalar,
    kAudioDevicePropertyScopeOutput,
    1 // use values 1 and 2 here, 0 (master) does not seem to work
};
OSStatus err;
err = AudioObjectSetPropertyData(device, &address, 0, NULL, size, &volume);
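For completeness, here is a minimal sketch (not part of the original question) of one way to obtain the device ID used above, via the same AudioObject API:

// Query the system object for the default output device.
AudioObjectPropertyAddress defaultAddr = {
    kAudioHardwarePropertyDefaultOutputDevice,
    kAudioObjectPropertyScopeGlobal,
    kAudioObjectPropertyElementMaster
};
AudioDeviceID device = kAudioObjectUnknown;
UInt32 deviceSize = sizeof(device);
OSStatus status = AudioObjectGetPropertyData(kAudioObjectSystemObject,
                                             &defaultAddr, 0, NULL,
                                             &deviceSize, &device);
// 'device' can then be passed to AudioObjectSetPropertyData as shown above.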

Related

It's like OpenCL kernel instance ends abruptly

I'm new to OpenCL and I'm working on converting an existing algorithm to OpenCL.
In this process, I am experiencing a phenomenon that I cannot solve on my own, and I would like to ask for some help.
Here are the details.
My kernel is applied to images of different sizes (to be precise, to each layer of a Laplacian pyramid).
I get normal results for larger images such as 3072 x 3072 and 1536 x 1536,
but I get abnormal results for smaller images such as 12 x 12, 6 x 6, 3 x 3, and 2 x 2.
At first, I suspected that clEnqueueNDRangeKernel had a lower limit on its dimensions, causing this problem, so I added a printf at the beginning of the kernel as follows. This confirmed that all the expected kernel instances are launched.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
}
After poking around for a while, I added the same printf to the end of the kernel. This showed that the second printf fires only for some pixel positions. For the pixel positions that never print, the calculated values in the resulting image are incorrect, so I concluded that some kernel instances terminate abnormally before completing their calculations.
__kernel void GetValueOfB(/* parameters */)
{
    uint xB = get_global_id(0);
    uint yB = get_global_id(1);
    printf("(%d, %d)\n", xB, yB);
    // calculation code is omitted
    printf("(%d, %d, %f)\n", xB, yB, result_for_this_position);
}
The kernel's calculation itself does not seem to be the problem. If I compile the kernel with optimization turned off via the -cl-opt-disable option, I get perfectly correct results for all images regardless of their size. In addition, on an NVIDIA P4000 it works correctly. In both of these cases, I confirmed that the printf added at the bottom of the kernel fires for all pixels.
Below I've put additional information and attached part of the code I wrote.
Any advice is welcome and appreciated.
Thank you.
SDK: Intel® SDK For OpenCL™ Applications 2020.3.494
Platform: Intel(R) OpenCL HD Graphics
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
                                 NULL, globalSize, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        return -1;

    // I tried with this but it didn't make any difference
    //std::this_thread::sleep_for(std::chrono::seconds(1));

    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;

    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
                              0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
                              vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
I also tried using an event, but it behaves the same way.
for all images
{
    ...
    const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
    cl_event event;
    err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
    if (CL_SUCCESS != err)
        return -1;

    err = clWaitForEvents(1, &event);
    if (CL_SUCCESS != err)
        return -1;

    err = clFinish(_pOpenCLManager->GetCommandQueue());
    if (CL_SUCCESS != err)
        return -1;

    err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
                              0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
                              vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
    if (CL_SUCCESS != err)
        return -1;
    ...
}
/////// Added contents ////////////////////////////////////////////
Would you please take a look at this issue from the perspective of clFinish or clWaitForEvents? Am I missing something in this regard?
Sometimes I get fewer correct values and sometimes I get more.
To be more specific, let's say I'm applying the kernel to a 12 x 12 image, so there are 144 pixel values.
Sometimes I get correct values for 56 pixels.
Sometimes I get correct values for 89 pixels.
Some other time I get correct values for n (fewer than 144) pixels.
If I turn off the OpenCL optimization when compiling the kernel by specifying the -cl-opt-disable option, I get correct values for all 144 pixels.
The other thing that makes me think the calculation code is correct is that the same OpenCL code with no modification (other than the device-selection code) runs perfectly correctly on an NVIDIA P4000.
At first, I was really suspicious about the calculation code, but the more I inspect it, the more confident I am that there's nothing wrong with it.
I know there's still a chance that there is an error in the calculation code that triggers an exception somewhere during the calculations.
I have plain C++ code for the same task, and I'm comparing the results from the two.
/////// Another added contents ////////////////////////////////////////////
I made a minimal piece of code (aside from the project template) to reproduce the phenomenon.
What's odder is that if I install the "Intel® Distribution for GDB Target", I get correct results.
https://github.com/heysweetethan/GPUOpenCLProjectforWindows
OpenCL kernels run threads in parallel on a specified global range, which in your case is the image size, with one thread per pixel.
The threads are grouped into workgroups. Workgroup size should be a multiple of 32, ideally 64 (or 8x8 in 2D) to make full use of the hardware. Workgroups cannot be split, so the global range must be a multiple of the workgroup size.
What happens if the global range is not evenly divisible by the workgroup size, or is smaller than the workgroup size, like 3x3 pixels? The last workgroup is still executed with all 8x8 threads. The first 3x3 operate on valid data in memory, but all the other threads read/write unallocated memory. This can cause undefined behavior or even crashes.
If you cannot make the global size a multiple of the workgroup size, there is still a solution: a guard clause at the very beginning of the kernel:
if (xB >= xImage || yB >= yImage) return;
This ensures that no threads access unallocated memory.
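A minimal, untested sketch of that approach (the parameter names xImage/yImage, the 8x8 local size, and the host variable names are placeholders, not from the original question):

// Host side: round the global range up to a multiple of the workgroup size.
const size_t local[2]  = { 8, 8 };
const size_t global[2] = {
    ((cols + local[0] - 1) / local[0]) * local[0],
    ((rows + local[1] - 1) / local[1]) * local[1]
};
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);

// Kernel side: threads in the padded region exit immediately.
__kernel void GetValueOfB(__global const float *src, __global float *dst,
                          const int xImage, const int yImage)
{
    const int xB = get_global_id(0);
    const int yB = get_global_id(1);
    if (xB >= xImage || yB >= yImage)
        return; // guard clause: skip threads outside the real image
    // ... calculation for pixel (xB, yB) ...
}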
As you don't supply a complete reproducible code sample, here's a loose collection of comments/suggestions/advice:
1. printf in kernel code
Don't rely on large amounts of printf output from kernels. It's necessarily buffered, and some implementations don't guarantee delivery of messages - often there's a fixed-size buffer, and when that's full, messages are dropped.
Note that your post-calculation printf increases the total amount of output, for example.
The reliable way to check or print kernel output is to write it to a global buffer and print it in host code. For example, if you want to verify each work-item reaches a specific point in the code, consider creating a zero-initialised global buffer where you can set a flag in each work-item, as sketched below.
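A rough sketch of that flag-buffer idea (the extra flags argument, the width parameter, and the host variable names are hypothetical, not from the original code):

// Kernel: set the flag as the very last statement, after all calculation.
__kernel void GetValueOfB(/* original parameters */,
                          __global uchar *flags, const int width)
{
    const int xB = get_global_id(0);
    const int yB = get_global_id(1);
    // ... original calculation ...
    flags[yB * width + xB] = 1;
}

// Host: create the buffer zero-filled, run the kernel, then read it back and count.
unsigned char *flags = (unsigned char *)calloc((size_t)cols * rows, 1);
cl_mem memFlags = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 (size_t)cols * rows, flags, &err);
// ... set the extra kernel arguments, enqueue the kernel, clFinish ...
clEnqueueReadBuffer(queue, memFlags, CL_TRUE, 0, (size_t)cols * rows, flags, 0, NULL, NULL);
int completed = 0;
for (int p = 0; p < cols * rows; ++p)
    completed += flags[p];
printf("work-items that reached the end: %d of %d\n", completed, cols * rows);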
2. Events
As you asked about events, flushing, etc.: your clFinish call certainly should suffice to ensure everything has executed - if anything, it's overkill, but especially while you're debugging other issues it's a good way to rule out queuing problems.
The clWaitForEvents() call preceding it is not a great idea, as you haven't called clFlush() after queueing the kernel whose event you're waiting for. It's fairly minor, but could be a problem on some implementations.
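For illustration, that would look something like this (queue stands in for the question's _pOpenCLManager->GetCommandQueue()):

err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
clFlush(queue); // make sure the command is actually submitted to the device
err = clWaitForEvents(1, &event);
clReleaseEvent(event);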
3. Small image sizes
You've not actually posted any of the code that deals with the images themselves, so I can only guess at potential issues there. It looks like you're not using workgroups, so you shouldn't be running into the usual multiple-of-group-size pitfall.
However, are you sure you're loading the source data correctly, and you're correctly indexing into it? There could be all sorts of pitfalls here, from alignment of pixel rows in the source data, enqueueing the kernel before filling the source buffers has completed, creating source buffers with the wrong flags, etc.
So in summary, I'd suggest:
Don't believe in-kernel printf if something strange is going on. Switch to something more reliable for observing the behaviour of your kernel code.
At minimum, post all of your OpenCL-API-calling host code: buffer creation, setting arguments, etc. Any fragments of kernel code accessing the buffers would probably not be a bad idea either.
Thanks to a person from the Intel community, I was able to understand the phenomenon.
Briefly, if too much time is spent on a single kernel invocation, 'Timeout Detection and Recovery' (TDR) stops the kernel.
For more information about this, you can refer to the following:
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/
https://community.intel.com/t5/GPU-Compute-Software/It-s-like-OpenCL-kernel-instance-ends-abruptly/m-p/1386883#M478
I appreciate all the people who gave me advice.

Why would D3D11CreateDeviceAndSwapChain fail when back buffer pixel format set to DXGI_FORMAT_B8G8R8X8_UNORM?

I am trying to port a piece of Direct3D 9 code to Direct3D 11. The original code uses an adapter format of D3DFMT_X8R8G8B8. I searched MSDN and found the equivalent in Direct3D 11 is DXGI_FORMAT_B8G8R8X8_UNORM. Here is my modified code for Direct3D 11 after creating the window:
DXGI_SWAP_CHAIN_DESC swap_chain_description;
ZeroMemory(&swap_chain_description, sizeof swap_chain_description);
swap_chain_description.BufferDesc.Width = window_width;
swap_chain_description.BufferDesc.Height = window_height;
swap_chain_description.BufferDesc.RefreshRate.Denominator = 1;
swap_chain_description.BufferDesc.Format = DXGI_FORMAT_B8G8R8X8_UNORM;
swap_chain_description.SampleDesc.Count = 1;
swap_chain_description.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
swap_chain_description.BufferCount = 1;
swap_chain_description.OutputWindow = hWnd;
swap_chain_description.Windowed = TRUE;
swap_chain_description.SwapEffect = DXGI_SWAP_EFFECT_DISCARD;
swap_chain_description.Flags = 0;
HRESULT hr = D3D11CreateDeviceAndSwapChain(NULL, D3D_DRIVER_TYPE_HARDWARE, NULL, 0, NULL, 0, D3D11_SDK_VERSION, &swap_chain_description, &dxgi_swap_chain, &d3d11_device, supported_feature_level, &d3d11_device_context);
if (FAILED(hr)) {
    return EXIT_FAILURE;
}
My code failed miserably when D3D11CreateDeviceAndSwapChain returned E_INVALIDARG. I changed BufferDesc.Format to DXGI_FORMAT_R8G8B8A8_UNORM and the function returned S_OK as intended. Why would this happen? Is DXGI_FORMAT_B8G8R8X8_UNORM deprecated, or is something else going on beyond my knowledge?
For Direct3D 11, you should familiarize yourself with Direct3D hardware feature levels--see this blog post for some background. Modern Direct3D only supports a specific set of formats for 'display out' (i.e. a backbuffer) indicated by D3D11_FORMAT_SUPPORT_DISPLAY. These are supported by all Direct3D 11 compatible hardware (Feature levels 9.1 - 12.1):
DXGI_FORMAT_R8G8B8A8_UNORM
DXGI_FORMAT_B8G8R8A8_UNORM
DXGI_FORMAT_R8G8B8A8_UNORM_SRGB and B8G8R8A8_UNORM_SRGB are also supported as backbuffer formats when using 'older' presentation styles, but aren't directly supported for new flip style models DXGI_SWAP_EFFECT_FLIP_*. In those cases, you provide the *_SRGB format only for the render target view, not the backbuffer itself.
Direct3D Hardware Feature Level 10.0 or better also supports:
DXGI_FORMAT_R16G16B16A16_FLOAT
DXGI_FORMAT_R10G10B10A2_UNORM
See Anatomy of Direct3D 11 Create Device
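As an illustrative sketch (assuming the device is created first with D3D11CreateDevice and the swap chain afterwards via IDXGIFactory::CreateSwapChain, which is not how the question's single-call code is structured), you can confirm display support for a format before using it, or simply switch to one of the formats listed above:

UINT support = 0;
HRESULT hr = d3d11_device->CheckFormatSupport(DXGI_FORMAT_B8G8R8A8_UNORM, &support);
if (SUCCEEDED(hr) && (support & D3D11_FORMAT_SUPPORT_DISPLAY))
{
    // B8G8R8A8_UNORM keeps the BGRA byte order closest to D3DFMT_X8R8G8B8
    // while still being a valid backbuffer format; simply ignore the alpha channel.
    swap_chain_description.BufferDesc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
}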
If you are running Windows 8.1 or Windows 10 on your development machine, you should also look at how to enable the DXGI Debug Layer, as it would have provided more information for this failure:
DXGI WARNING: IDXGIFactory::CreateSwapChain: Blt-model swap effects (DXGI_SWAP_EFFECT_DISCARD and DXGI_SWAP_EFFECT_SEQUENTIAL)
are legacy swap effects that are predominantly superceded by their
flip-model counterparts (DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL and DXGI_SWAP_EFFECT_FLIP_DISCARD).
Please consider updating your application to leverage flip-model swap effects
to benefit from modern presentation enhancements. More information
is available at http://aka.ms/dxgiflipmodel. [ MISCELLANEOUS WARNING #294: ]
DXGI ERROR: IDXGIFactory::CreateSwapChain: Flip model swapchains (DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL
and DXGI_SWAP_EFFECT_FLIP_DISCARD) only support the following Formats:
(DXGI_FORMAT_R16G16B16A16_FLOAT, DXGI_FORMAT_B8G8R8A8_UNORM,
DXGI_FORMAT_R8G8B8A8_UNORM, DXGI_FORMAT_R10G10B10A2_UNORM), assuming
the underlying Device does as well.
DXGI_SWAP_CHAIN_DESC{ SwapChainType = ..._HWND, BufferDesc = DXGI_MODE_DESC1{Width = 800, Height = 600, RefreshRate = DXGI_RATIONAL{ Numerator = 0, Denominator = 0 }, Format = B8G8R8X8_UNORM, ScanlineOrdering = ..._UNSPECIFIED, Scaling = ..._UNSPECIFIED, Stereo = FALSE }, SampleDesc = DXGI_SAMPLE_DESC{ Count = 1, Quality = 0 }, BufferUsage = 0x20, BufferCount = 2, OutputWindow = 0x0023094C, Scaling = ..._STRETCH, Windowed = TRUE, SwapEffect = ..._FLIP_DISCARD, AlphaMode = ..._IGNORE, Flags = 0x0 }[ MISCELLANEOUS ERROR #101: ]
Now, I've never coded in Direct3D, but the way I see it, this is generic.
D3DFMT_X8R8G8B8 is, according to MSDN, a 32-bit RGB pixel format where 8 bits are reserved for each color. This means there is no alpha; the first byte is not used.
Your selected counterpart is similar in that respect - no alpha. MSDN: A four-component, 32-bit unsigned-normalized-integer format that supports 8 bits for each color channel and 8 bits unused.
The latter format is not an exact counterpart, as it places the colours in different bytes, but this should not matter as long as you render all textures properly.
I think the problem is your hardware not supporting a non-alpha format. Did you try running the old code on the same hardware? And are you sure the old Direct3D does not add the alpha channel silently? Typically the API will ask the driver for supported formats and try to match yours; on failure it produces an error like that. Also, some drivers will not accept a non-alpha format for hardware buffers. This is why the R8G8B8A8 format works: it has an alpha channel.

Time measurement for getting speedup of OpenCL code on Intel HD Graphics vs C host code

I'm new to OpenCL and want to compare the performance gain between C code and OpenCL kernels.
Can someone please explain which of these two methods is better/correct for profiling OpenCL code when comparing performance against C reference code?
Method 1: using QueryPerformanceCounter()/__rdtsc() cycles (called inside the getTime function):
ret |= clFinish(command_queue); //Empty the queue
getTime(&begin);
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL); //Profiling Disabled.
ret |= clFinish(command_queue);
getTime(&end);
g_NDRangePureExecTimeSec = elapsed_time(&begin, &end); //Performs: (end-begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
Method 2: using event profiling:
ret = clEnqueueMarker(command_queue, &evt1);
//Empty the Queue
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
ret |= clWaitForEvents(1, &evt1);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_long), &begin, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_long), &end, NULL);
g_NDRangePureExecTimeSec = (cl_double)(end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE); //nSec to Sec
ret |= clReleaseEvent(evt1);
Furthermore, I'm not using a dedicated graphics card; I'm using Intel HD 4600 integrated graphics for the following piece of OpenCL code:
__kernel void filter_rows(__global float *ip_img,
                          __global float *op_img,
                          int width, int height,
                          int pitch, int N,
                          __constant float *W)
{
    __private int i = get_global_id(0);
    __private int j = get_global_id(1);
    __private int k;
    __private float a;
    __private int image_offset = N*pitch + N;
    __private int curr_pix = j*pitch + i + image_offset;
    // apply filter
    a  = ip_img[curr_pix-8] * W[0 ];
    a += ip_img[curr_pix-7] * W[1 ];
    a += ip_img[curr_pix-6] * W[2 ];
    a += ip_img[curr_pix-5] * W[3 ];
    a += ip_img[curr_pix-4] * W[4 ];
    a += ip_img[curr_pix-3] * W[5 ];
    a += ip_img[curr_pix-2] * W[6 ];
    a += ip_img[curr_pix-1] * W[7 ];
    a += ip_img[curr_pix-0] * W[8 ];
    a += ip_img[curr_pix+1] * W[9 ];
    a += ip_img[curr_pix+2] * W[10];
    a += ip_img[curr_pix+3] * W[11];
    a += ip_img[curr_pix+4] * W[12];
    a += ip_img[curr_pix+5] * W[13];
    a += ip_img[curr_pix+6] * W[14];
    a += ip_img[curr_pix+7] * W[15];
    a += ip_img[curr_pix+8] * W[16];
    // write output
    op_img[curr_pix] = (float)a;
}
And similar code for column-wise processing. I'm observing a gain (OpenCL vs. an optimized, vectorized C reference) of around 11x using method 1 and around 16x using method 2.
However, I've noticed people claiming gains on the order of 200-300x when using dedicated graphics cards.
So my questions are:
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform Intel HD Graphics?
Can I map the warp and thread concepts from CUDA to Intel HD Graphics (i.e. the number of threads executing in parallel)?
I'm observing a gain of around 11x using method 1 and around 16x using method 2.
This looks suspicious. You are using high-resolution counters in both cases. I think your input size is too small and generates high run-to-run variation. The event-based measurement is slightly more accurate, as it does not include some OS + application overhead in the measurements. However, the difference is very small. But when your kernel duration is very short, the difference between measurement methodologies starts to matter.
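For reference, a minimal sketch of the event-based measurement (note that event profiling only works if the command queue was created with CL_QUEUE_PROFILING_ENABLE; the context/device variables are placeholders, and the timestamps are in nanoseconds):

cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              CL_QUEUE_PROFILING_ENABLE, &err);
cl_event evt;
err  = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt);
err |= clWaitForEvents(1, &evt);
cl_ulong t_start = 0, t_end = 0;
err |= clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
err |= clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
double seconds = (double)(t_end - t_start) * 1e-9; // ns -> s
clReleaseEvent(evt);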
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform Intel HD Graphics?
It depends very much on the card's capabilities. While Intel HD Graphics is a good GPU for office work, movies and some games, it cannot compare to a high-end dedicated graphics card. Such a card has a much higher power envelope, a much larger die area and far more computing resources, so it's expected that dedicated cards will show greater speedups. Your GPU has around 600 GFLOPS peak performance, while a discrete card can reach 3000 GFLOPS, so you could roughly expect your GPU to be about 5 times slower than a discrete one. However, pay attention to what people are comparing when they claim 300x speedups. If they compare against an old-generation CPU, they might be right, but a new-generation i7 CPU can really close the gap.
Can I map the warp and thread concepts from CUDA to Intel HD Graphics (i.e. the number of threads executing in parallel)?
Intel HD Graphics does not have warps; warps are closely tied to CUDA hardware. Basically, a warp is the same instruction dispatched by a warp scheduler to execute on 32 CUDA cores. However, OpenCL is very similar to CUDA, so you can launch a high number of threads that will execute in parallel on your graphics card's compute units. When programming your integrated GPU, though, it's best to forget about warps and instead know how many compute units your card has. Your code will run on several threads in parallel on those compute units. In other words, it will look very similar to the CUDA code, but it will be parallelized according to the compute units available in the integrated GPU. Each compute unit can then parallelize execution in a SIMD fashion, for example. Keep in mind that the optimization techniques for CUDA are different from those for programming Intel HD Graphics.
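For example, the compute-unit count can be queried like this (a small sketch; device is whatever cl_device_id you selected):

cl_uint computeUnits = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
printf("Device exposes %u compute units\n", computeUnits);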
You can't directly compare performance across vendors; a basic comparison and expectation can be made using the number of threads running in parallel multiplied by their frequency.
You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EU), each EU runs 7 hardware threads, each thread is capable of executing SIMD8, SIMD16 or SIMD32 instructions, each SIMD lane corresponding to one work item (WI) in OpenCL speak.
SIMD16 is typical for simple kernels, like the one you are trying to optimize, so we are talking about 20*7*16 = 2240 work items executing in parallel. Keep in mind that each work item is capable of processing vector data types, e.g. float4, so you should definitely try rewriting your kernel to take advantage of them, as sketched below. I hope this also helps you compare with NVIDIA's offerings.
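A rough, untested sketch of what a float4 variant of the row filter from the question might look like (it assumes the row width is a multiple of 4 and that the x dimension of the global range is divided by 4 accordingly):

__kernel void filter_rows_vec4(__global const float *ip_img,
                               __global float *op_img,
                               int width, int height,
                               int pitch, int N,
                               __constant float *W)
{
    // Each work item produces 4 horizontally adjacent output pixels.
    int i = get_global_id(0) * 4;
    int j = get_global_id(1);
    int image_offset = N * pitch + N;
    int curr_pix = j * pitch + i + image_offset;

    float4 acc = (float4)(0.0f);
    for (int k = -8; k <= 8; ++k)
        acc += vload4(0, ip_img + curr_pix + k) * W[k + 8]; // 4 taps at once

    vstore4(acc, 0, op_img + curr_pix);
}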

cudaHostRegister returns cudaErrorInvalidValue on GPUs with compute capability 1.1

I have a simple program that allocates an unsigned __int64 (8 bytes on the stack) and then attempts to register that memory on the GPU using cudaHostRegister. The section of the program making this call is shown below:
unsigned __int64 mem;
unsigned __int64 *pMem = &mem;
cudaError_t result;
result = cudaHostRegister(pMem, sizeof(unsigned __int64), cudaHostRegisterMapped);
if (result != cudaSuccess) {
    printf("Error in cudaHostRegister: %s.\n", cudaGetErrorString(result));
    return -1;
}
I am compiling in Visual Studio 2010 Premium using the nvcc flags compute_11 and sm_11, and everything works correctly on my laptop, which has a Quadro K1000M with compute capability 3.0.
I recently switched to my desktop, where I tried running with a GeForce 8600 GT and a GeForce 9500 GT, both of which have compute capability 1.1.
According to NVIDIA's documentation for cudaHostRegister, cards with compute capability 1.1 and above should allow the use of cudaHostRegisterMapped:
cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer(). This feature is available only on GPUs with compute capability greater than or equal to 1.1.
After some searching, it seemed that cudaHostRegisterMapped may require page-aligned memory. I thought that might be the difference between my 3.0 card and my 1.1 cards, so I masked the address to get a page-aligned address and used the size of a page (4096 bytes) in the size field, as shown below:
unsigned __int64 mem;
unsigned __int64 *pMem = &mem;
unsigned __int64 memAddr = (unsigned __int64)pMem;
cudaError_t result;
pMem = (unsigned __int64 *)(memAddr & 0xFFFFFFFFFFFFF000);
result = cudaHostRegister(pMem, 4096, cudaHostRegisterMapped);
if (result != cudaSuccess) {
    printf("Error in cudaHostRegister: %s.\n", cudaGetErrorString(result));
    return -1;
}
This code also works on my 3.0 card, but fails with the same result as before on my 1.1 cards. The cudaHostRegister function returns with the error cudaErrorInvalidValue, indicating that:
one or more of the parameters passed to the API call is not within an acceptable range of values
I haven't been able to find much more about why this function might fail like this. Thanks for any help anyone can provide.
[Edit]
Based on talonmies' response, I verified that at least one of my cards (the 9500 GT; I didn't run it on the 8600 GT) does support memory mapping according to NVIDIA's deviceQuery executable that comes with the SDK.
Mapped memory is supported on some compute capability 1.1 devices, but not all of them. The MCP79 family of integrated chipsets (so Ion, and 9300M/9400M) do support mapped memory. Older compute capability 1.1 devices like your 8600GT and 9500GT, however, do not support mapped memory.
You can check for this programmatically using the cudaGetDeviceProperties API call; canMapHostMemory will tell you whether a given device supports mapped memory or not.
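A minimal sketch of that check (device index 0 is assumed here):

cudaDeviceProp prop;
cudaError_t err = cudaGetDeviceProperties(&prop, 0 /* device index */);
if (err == cudaSuccess && prop.canMapHostMemory) {
    // Must be set before the CUDA context is created if you plan to map host memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);
} else {
    printf("This device does not support mapped host memory.\n");
}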

How to access CPU's heat sensors?

I am working on software in which I need to access the temperature sensors in the CPU and get control over them.
I don't know much about hardware interfacing; I just know how to interface with the mouse. I have googled a lot but failed to find any relevant information or code.
I really need to add this to my software. Please guide me on how to access and control the sensors using C, C++, or assembly.
Without a specific kernel driver, it's difficult to query the temperature other than through WMI. Here is a piece of C++ code that does it, based on WMI's MSAcpi_ThermalZoneTemperature class:
HRESULT GetCpuTemperature(LPLONG pTemperature)
{
    if (pTemperature == NULL)
        return E_INVALIDARG;
    *pTemperature = -1;
    HRESULT ci = CoInitialize(NULL); // needs comdef.h
    HRESULT hr = CoInitializeSecurity(NULL, -1, NULL, NULL, RPC_C_AUTHN_LEVEL_DEFAULT, RPC_C_IMP_LEVEL_IMPERSONATE, NULL, EOAC_NONE, NULL);
    if (SUCCEEDED(hr))
    {
        IWbemLocator *pLocator; // needs Wbemidl.h & Wbemuuid.lib
        hr = CoCreateInstance(CLSID_WbemAdministrativeLocator, NULL, CLSCTX_INPROC_SERVER, IID_IWbemLocator, (LPVOID*)&pLocator);
        if (SUCCEEDED(hr))
        {
            IWbemServices *pServices;
            BSTR ns = SysAllocString(L"root\\WMI");
            hr = pLocator->ConnectServer(ns, NULL, NULL, NULL, 0, NULL, NULL, &pServices);
            pLocator->Release();
            SysFreeString(ns);
            if (SUCCEEDED(hr))
            {
                BSTR query = SysAllocString(L"SELECT * FROM MSAcpi_ThermalZoneTemperature");
                BSTR wql = SysAllocString(L"WQL");
                IEnumWbemClassObject *pEnum;
                hr = pServices->ExecQuery(wql, query, WBEM_FLAG_RETURN_IMMEDIATELY | WBEM_FLAG_FORWARD_ONLY, NULL, &pEnum);
                SysFreeString(wql);
                SysFreeString(query);
                pServices->Release();
                if (SUCCEEDED(hr))
                {
                    IWbemClassObject *pObject;
                    ULONG returned;
                    hr = pEnum->Next(WBEM_INFINITE, 1, &pObject, &returned);
                    pEnum->Release();
                    if (SUCCEEDED(hr))
                    {
                        BSTR temp = SysAllocString(L"CurrentTemperature");
                        VARIANT v;
                        VariantInit(&v);
                        hr = pObject->Get(temp, 0, &v, NULL, NULL);
                        pObject->Release();
                        SysFreeString(temp);
                        if (SUCCEEDED(hr))
                        {
                            *pTemperature = V_I4(&v);
                        }
                        VariantClear(&v);
                    }
                }
            }
            if (ci == S_OK)
            {
                CoUninitialize();
            }
        }
    }
    return hr;
}
and some test code (note that CurrentTemperature is reported in tenths of a degree Kelvin):
HRESULT GetCpuTemperature(LPLONG pTemperature);

int _tmain(int argc, _TCHAR* argv[])
{
    LONG temp;
    HRESULT hr = GetCpuTemperature(&temp);
    printf("hr=0x%08x temp=%i\n", hr, temp);
}
I assume you are interested in an IA-32 (Intel Architecture, 32-bit) CPU and Microsoft Windows.
The Model Specific Register (MSR) IA32_THERM_STATUS has 7 bits encoding the "Digital Readout (bits 22:16, RO) — Digital temperature reading in 1 degree Celsius relative to the TCC activation temperature." (see "14.5.5.2 Reading the Digital Sensor" in "Intel® 64 and IA-32 Architectures - Software Developer’s Manual - Volume 3 (3A & 3B): System Programming Guide" http://www.intel.com/Assets/PDF/manual/325384.pdf).
So IA32_THERM_STATUS will not give you the "CPU temperature" but some proxy for it.
In order to read the IA32_THERM_STATUS register you use the rdmsr instruction; however, rdmsr cannot be called from user-space code, so you need some kernel-space code (maybe a device driver?).
You can also use the __readmsr intrinsic (see http://msdn.microsoft.com/en-us/library/y55zyfdx(v=VS.100).aspx), which has the same limitation: "This function is only available in kernel mode".
Every CPU core has its own Digital Thermal Sensor (DTS), so some more code is needed to get all the temperatures (perhaps using the affinity mask? see the Win32 API SetThreadAffinityMask).
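As an illustration only, a kernel-mode sketch (e.g. inside a driver routine; IA32_THERM_STATUS is MSR 0x19C, and the digital readout occupies bits 22:16):

// Kernel mode only: __readmsr is exposed through <intrin.h> in driver code.
unsigned __int64 thermStatus = __readmsr(0x19C);                    // IA32_THERM_STATUS
unsigned int readout = (unsigned int)((thermStatus >> 16) & 0x7F);  // bits 22:16
// 'readout' is degrees Celsius below the TCC activation temperature, so an
// absolute temperature would be (assumed TCC activation temperature) - readout.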
I did some tests and actually found a correlation between the IA32_THERM_STATUS DTS readouts and the Prime95 "In-place large FFTs (maximum heat, power consumption, some RAM tested)" test. Prime95 is ftp://mersenne.org/gimps/p95v266.zip
I did not find a formula to get the "CPU temperature" (whatever that may mean) from the DTS readout.
Edit:
Quoting from an interesting post TJunction Max? #THERMTRIP? #PROCHOT? by "fgw" (December 2007):
there is no way to find tjmax of a certain processor in any register.
thus no software can read this value. what various software developers
are doing, is they simply assume a certain tjunction for a certain
processor and hold this information in a table within the program.
besides that, tjmax is not even the correct value they are after. in
fact they are looking for TCC activacion temperature threshold. this
temperature threshold is used to calculate current absolute
coretemperatures from. theoretical you can say: absolute
coretemperature = TCC activacion temperature threshold - DTS i had to
say theoretically because, as stated above, this TCC activacion
temperature threshold cant be read by software and has to be assumed
by the programmer. in most situations (coretemp, everest, ...) they
assume a value of 85C or 100C depending on processor family and
revision. as this TCC activacion temperature threshold is calibrated
during manufacturing individually per processor, it could be 83C for
one processor but may be 87C for the other. taking into account the
way those programms are calculating coretemperatures, you can figure
out at your own, how accurate absolute coretemperatures are! neither
tjmax nor the "most wanted" TCC activacion temperature threshold can
be found in any public intel documents. following some discussions
over on the intel developer forum, intel shows no sign to make this
information available.
You can read it from the MSAcpi_ThermalZoneTemperature class in WMI.
Using WMI from C++ is a bit involved; see the MSDN explanation and examples.
note: changed original unhelpful answer
