What's the advantage of running OpenCL code on a CPU? [closed] - c

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am learning OpenCL programming and am noticing something weird.
Namely, when I list all OpenCL enabled devices on my machine (Macbook Pro), I get the following list:
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Iris Pro
GeForce GT 750M
The first is my CPU, the second is the onboard graphics solution by Intel and the third is my dedicated graphics card.
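(Aside: a device list like this can be produced with a few host API calls. Here is a minimal sketch using the standard entry points - on a Mac the header is <OpenCL/opencl.h> and you link with -framework OpenCL, elsewhere it's <CL/cl.h> and -lOpenCL:)

    // list_devices.c - print every OpenCL device on every platform.
    #include <stdio.h>
    #ifdef __APPLE__
    #include <OpenCL/opencl.h>
    #else
    #include <CL/cl.h>
    #endif

    int main(void)
    {
        cl_platform_id plats[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, plats, &nplat);
        for (cl_uint p = 0; p < nplat; p++) {
            cl_device_id devs[8];
            cl_uint ndev = 0;
            clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 8, devs, &ndev);
            for (cl_uint d = 0; d < ndev; d++) {
                char name[256];
                clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof name, name, NULL);
                printf("%s\n", name);   // e.g. "Iris Pro"
            }
        }
        return 0;
    }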
Research shows that Intel has made their hardware OpenCL compatible so that I can tap into the power of the onboard graphics unit. That would be the Iris Pro.
With that in mind, what is the purpose of the CPU being OpenCL compatible? Is it merely a convenience, so that kernels can fall back to the CPU if no other device is found, or is there some speed advantage to running code as OpenCL kernels instead of as regular (well-threaded C) programs on the CPU?

See https://software.intel.com/sites/default/files/m/d/4/1/d/8/Writing_Optimal_OpenCL_28tm_29_Code_with_Intel_28R_29_OpenCL_SDK.pdf for basic info.
Basically, the Intel OpenCL compiler performs horizontal autovectorization for certain types of kernels. That means that with SSE4 you get 8 threads running in parallel on a single core, in much the same way an Nvidia GPU runs 32 threads in a single 32-wide SIMD unit.
There are two major benefits to this approach. First, what happens if in two years they increase the SSE vector width to 16? Then you instantly get autovectorization across 16 threads when you run on that CPU - no need to recompile your code. Second, it's far easier to write an OpenCL kernel that autovectorizes well than to write ASM or C and coax your compiler into producing equally efficient code.
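As an illustration (my example, not the answer author's): a kernel like the following - straight-line, branch-free, one element per work-item - is exactly the kind of code a CPU OpenCL compiler can vectorize horizontally, packing adjacent work-items into SSE/AVX lanes:

    // saxpy.cl - one work-item per element; no branches or cross-lane
    // dependencies, so adjacent work-items map cleanly onto SIMD lanes.
    __kernel void saxpy(const float a,
                        __global const float *x,
                        __global float *y)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }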

As OpenCL implementations mature, it's possible to achieve good levels of performance portability for your kernels across a wide range of devices. Some recent work in my research group shows that, in some cases, OpenCL codes achieve a similar fraction of hardware peak performance on the CPU and the GPU. On the CPU, the OpenCL kernels were being very effectively auto-vectorised by Intel's OpenCL CPU implementation. On the GPU, efficient code was being generated for HPC and desktop devices from Nvidia (whose OpenCL still works surprisingly well) and AMD.
If you want to develop your OpenCL code anyway in order to exploit the GPU, then you're often getting a fast multi-core+SIMD version "for free" by running the same code on the CPU.
For two recent papers from my group detailing the performance portability results we've achieved across four different real applications with OpenCL, see:
"On the performance portability of structured grid codes on many-core computer architectures", S.N. McIntosh-Smith, M. Boulton, D. Curran and J.R. Price. ISC, Leipzig, June 2014. DOI: 10.1007/978-3-319-07518-1_4
"High Performance in silico Virtual Drug Screening on Many-Core Processors", S. McIntosh-Smith, J. Price, R.B. Sessions, A.A. Ibarra, IJHPCA 2014. DOI: 10.1177/1094342014528252

I have considered this for a while. You can get most of the advantages of OpenCL for the CPU without using OpenCL and without too much difficulty in C++. To do this you need:
Something for multi-threading - I use OpenMP for this
A SIMD library - I use Agner Fog's Vector Class Library (VCL) for this, which covers SSE2 through AVX-512.
A SIMD math library. Once again I use Agner Fog's VCL for this.
A CPU dispatcher. Agner Fog's VCL includes an example of how to do this.
Using the CPU dispatcher you determine at runtime what hardware is available and choose the best code path for it, as sketched below. This provides one of the advantages of OpenCL.
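A minimal sketch of that idea (my example - the function names are invented, and VCL ships a more complete dispatcher; __builtin_cpu_supports is a GCC/Clang builtin):

    // dispatch.c - pick the widest code path the CPU supports at runtime.
    #include <stdio.h>

    // Stand-ins for real SSE2/AVX2 kernels (hypothetical; real versions
    // would use intrinsics or VCL vector types).
    static void saxpy_sse2(int n, float a, const float *x, float *y)
    { for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i]; }

    static void saxpy_avx2(int n, float a, const float *x, float *y)
    { for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i]; }

    typedef void (*saxpy_fn)(int, float, const float *, float *);

    static saxpy_fn select_saxpy(void)
    {
        if (__builtin_cpu_supports("avx2"))
            return saxpy_avx2;   // widest path available on this CPU
        return saxpy_sse2;       // baseline for any x86-64
    }

    int main(void)
    {
        float x[4] = {1, 2, 3, 4}, y[4] = {0};
        select_saxpy()(4, 2.0f, x, y);
        printf("%.1f %.1f\n", y[0], y[3]);   // prints: 2.0 8.0
        return 0;
    }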
This gives you most of the advantages of OpenCL on the CPU without all its disadvantages. You never have to worry that a vendor stops supporting a driver. Nvidia has only minimal support for OpenCL - including several-year-old bugs it will likely never fix (on which I wasted too much time). Intel only provides Iris Pro OpenCL drivers for Windows. With my suggested method your kernels can use all C++ features, including templates, instead of OpenCL's restricted and extended version of C (though I do like the extensions). You can be sure your code does what you want this way and are not at the whim of some device driver.
The one disadvantage of my suggested method is that you can't just install a new driver and have it optimize for new hardware. However, the VCL already supports AVX-512, so it's already built for hardware that is not out yet and won't be superseded for several years. And in any case, to get the most out of your hardware you will almost certainly have to rewrite your kernel in OpenCL for that hardware - a new driver can only help so much.
More info on the SIMD math library: you could use Intel's expensive closed-source SVML (which is what the Intel OpenCL driver uses - search for svml after you install the Intel OpenCL drivers, and don't confuse the SDK with the drivers). Or you could use AMD's free but closed-source LIBM. However, neither of these works well on the competitor's processors. Agner Fog's VCL works well on both, is open source, and is free.
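For comparison, here is the same multithreading+SIMD idea as a minimal sketch in plain C with OpenMP (my example - it stands in for the answer's C++/VCL approach, trading VCL's explicit vector classes for OpenMP's simd directive):

    // omp_saxpy.c - threads across cores plus SIMD within each core,
    // without OpenCL. Build with e.g.: gcc -O2 -fopenmp -c omp_saxpy.c
    #include <stddef.h>

    void saxpy(size_t n, float a, const float *x, float *y)
    {
        // "parallel for" spreads iterations over threads;
        // "simd" asks the compiler to vectorize each thread's chunk.
        #pragma omp parallel for simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }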

Related

How To Do Multiprocessing in C without any non-standard libraries [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
Say you wanted to write your own version of OpenCL from scratch in C. How would you go about doing it? How does OpenCL accomplish parallel programming "under the hood"? Is it just pthreads?
OpenCL covers much functionality, including a runtime API library, a programming language based on C, a library environment for that language, and likely a loader library for supporting multiple implementations. If you want to look at an open source example of how it could be implemented, Pocl, Clover, Beignet and ROCm exist. At least Pocl's CPU target does indeed use pthreads, but OpenCL is designed to support offloading tasks to coprocessors such as GPUs, as well as using vector operations, so one thread does not necessarily run one work item.
The title does not refer to OpenCL, but does request to use "standard" libraries. The great thing about standards is that there are so many to choose from; for instance, the C standard provides no multithreading and no guarantee of multitasking. Multiprocessing frequently refers to running in multiple processes (in e.g. CPython, this is the only way to get concurrent execution of Python code because of the global interpreter lock). That can be done with the Unix standard function fork. Multithreading can be done using POSIX threads (POSIX.1c standard extension) or OpenMP. Recent versions of OpenMP also support accelerator offloading, which is what OpenCL was designed for. Since OpenMP and OpenCL provide restricted and abstracted environments, they could in principle be implemented on top of many of the others, for instance CUDA.
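A minimal sketch of the fork approach (my example; strictly POSIX, no extra libraries):

    // fork_demo.c - multiprocessing with only fork(2) and wait(2).
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        for (int i = 0; i < 4; i++) {
            pid_t pid = fork();
            if (pid == 0) {                 // child: one slice of the work
                printf("worker %d, pid %d\n", i, (int)getpid());
                _exit(0);                   // don't fall through the loop
            }
        }
        while (wait(NULL) > 0)              // parent: reap all children
            ;
        return 0;
    }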
Implementing parallel execution itself requires hardware knowledge and access, and is typically the domain of the operating system; POSIX threads is often an abstraction layer on this, using e.g. clone on Linux.
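The user-facing side of that abstraction looks like this (my example; on Linux each pthread_create typically ends up in a clone system call underneath):

    // pthread_demo.c - build with: gcc pthread_demo.c -lpthread
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        printf("thread %d\n", *(int *)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[4];
        int ids[4];
        for (int i = 0; i < 4; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);     // wait for each thread
        return 0;
    }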
OpenMP is frequently the easiest way to convert a C program to parallel execution, as it is supported by many compilers; you annotate branching points using pragmas and compile with e.g. -fopenmp for GCC. Such programs will still work as before if compiled without OpenMP.
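For example (my sketch; the loop is invented for illustration), the pragma is the only OpenMP-specific line - without -fopenmp it is ignored and the program runs serially:

    // omp_sum.c - build with: gcc -fopenmp omp_sum.c
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        // Each thread sums a chunk; the reduction combines the results.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / i;
        printf("sum = %f\n", sum);
        return 0;
    }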
First off: OpenCL != parallel processing. That is one of its strengths, but there's a lot more to it.
Focusing on one part of your question:
Say you wanted to write your own version of OpenCL from scratch in C.
For one: get familiar with driver development. Our GPU CL runtime is pretty intimately involved with the drivers. If you want to start from scratch, you're going to need to get very familiar with the PCIe protocols and dig up some memories about toggling pins. This is doable, but it exemplifies "nontrivial."
Multithreading at the CPU level is an entirely different matter that's been documented out the yin-yang. The great thing about using an OS that you didn't have to write yourself is that this is already handled for you.
Is it just pthreads?
How do you think those are implemented? Their functionality is part of the spec, but their implementation is entirely platform-dependent, which you may call "non standard." The underlying implementation of a thread depends on the OS (if there is one, which is not a given), compiler, and a ton of other factors.
This is a great question.

Can the announced Tegra K1 be a contender against x86 and x64 chips in supercomputing applications?

To clarify, can this RISC-based processor (the Tegra K1) be used without significant changes to today's supercomputer programs, and perhaps be a game changer because of its power, size, cost, and energy usage? I know it's going up against some x64 or x86 processors. Can the code used for current supercomputers be easily converted to code that will run well on these mobile chips? Thanks.
Can the code used for current supercomputers be easily converted to code that will run well on these mobile chips?
It depends what you call "supercomputer code". Usually supercomputers run high-level functional code (usually fully compiled code like C++, sometimes VM-dependent code like Java) on top of lower-level code and technologies such as OpenCL or CUDA for accelerators, or MPICH for communication between nodes.
All these technologies have ARM implementations, so the real task is to make the functional code ARM-compatible. This is usually straightforward, as code written in a high-level language is mostly hardware-independent. So the short answer is: yes.
However, what may be more complicated is to scale this code to these new processors.
Tegra K1 is nothing like the GPUs embedded in supercomputers. It has far less memory, runs slightly slower and has only 192 cores.
Its price and power consumption make it possible, however, to build supercomputers with hundreds of them inside.
So code which has been written for traditional supercomputers (a few high-performance GPUs embedded) will not reach the peak performance of 'new' supercomputers (built with a lot of cheap and weak GPUs). There will be a price to pay to port existing code to these new architectures.
For modern supercomputing needs, you'd need to ask whether a processor can perform well for the energy it consumes. Intel's current architecture together with GPUs fulfills those needs, and the Tegra architecture does not match Intel processors in terms of power-performance.
The question is: should it? Intel keeps proving that ARM is inferior, and the only factor in favour of RISC-based processors is their price, which I highly doubt is a concern when building a supercomputer.

Intel based hardware speed ups for DCT?

We are writing an image processing algorithm targeting some Intel hardware. Generally we prefer generic C implementations, but we have identified an algorithm that at its core does a ton of Discrete Cosine Transforms (DCTs) and that works extremely well. Unfortunately, our throughput requirements are such that a generic C implementation is about two orders of magnitude too slow. I can get one order of magnitude through some other tricks, so if I can improve my DCTs by about an order of magnitude I have a path towards success.
Is Intel's MMX a way to get hardware acceleration for these DCTs? Are there other Intel-specific libraries and/or hardware features that I can exploit to speed these bad boys up?
Where do I start to look? This is a new job for me, and my first time digging hard into Intel hardware, so any pointers would be most appreciated.
Take a look at Intel's Integrated Performance Primitives (IPP) library. It contains a wealth of routines that are heavily optimized for the Intel architecture, specifically MMX and SSE. Among many other things, IPP contains routines for the DCT (documentation here).
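For context, the generic C baseline being replaced looks something like this naive O(N^2) DCT-II (my sketch, not IPP's API; IPP's routines use SIMD and fast algorithms to compute the same transform far quicker):

    // dct_ref.c - naive O(N^2) DCT-II, the slow generic baseline.
    #include <math.h>

    void dct2(const double *x, double *X, int n)
    {
        const double pi = 3.14159265358979323846;
        for (int k = 0; k < n; k++) {
            double acc = 0.0;
            for (int i = 0; i < n; i++)
                acc += x[i] * cos(pi * (i + 0.5) * k / n);
            X[k] = acc;                 // unnormalized DCT-II coefficient
        }
    }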

Loading Code onto GPU (Intel Sandy Bridge)

My question is not about GPGPU. I understand GPGPU pretty decently and that is not what I am looking for. Intel's Sandy Bridge supposedly has some features that allow you to perform computations directly on the GPU.
Is that really true?
The code I am planning to write is going to be in inline assembly (in C). Are there assembly instructions that instead of executing on the CPU push stuff out to the GPU?
Some related documentation :
http://intellinuxgraphics.org/documentation.html
http://intellinuxgraphics.org/documentation/SNB/IHD_OS_Vol4_Part2.pdf
The PDF has the instruction set.
I don't believe that the instruction set detailed in the PDF you linked can be directly used from "user space". It's what the GPU driver on your OS may* use to implement higher-level interfaces like OpenGL and DirectX.
For what it's worth, the Sandy Bridge GPU is pretty weak. It doesn't support OpenCL**, a standard GPGPU API that is supported by ATI / nVidia. I'd recommend that you program to that API (on hardware that supports it), as it's far more portable (and easier to use!) than the bare-metal interface that you're looking at.
*: It's possible, although unlikely, that there's a different interface than what's described in that PDF which is used in Intel's closed-source drivers.
**: Not the same as OpenGL, although it was designed by the same group.
Answering your first question: No it is not true.
Let me quote from the resources you have linked:
The Graphics Processing Unit is controlled by the CPU through a direct interface of memory-mapped IO registers, and indirectly by parsing commands that the CPU has placed in memory. (Chapter 2.2 of the SB GPU manual)
So there is no direct execution of GPU code in the CPU context.
For your second question: "pushing stuff out to the GPU" is done with the mov instruction. The target is a memory-mapped IO register, and the source is the data you want to write. You might need to insert sfence or similar instructions to make sure no weak memory reordering happens.
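Roughly, in C it looks like this (my sketch; the register offset and mapped base are hypothetical - real values come from mapping the device's BAR and the programmer's reference manual):

    // mmio_write.c - sketch of writing a memory-mapped IO register.
    #include <stdint.h>
    #include <xmmintrin.h>   // _mm_sfence

    #define REG_OFFSET 0x1000u   // hypothetical register offset

    static void mmio_write32(volatile uint8_t *base, uint32_t val)
    {
        _mm_sfence();   // fence earlier stores against weak reordering
        // the volatile store compiles to a plain mov to the mapped address
        *(volatile uint32_t *)(base + REG_OFFSET) = val;
    }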

Developing a non-x86 Operating system [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I have to choose a thesis topic soon and I was considering implementing an operating system for an architecture that is not x86 (I'm leaning towards ARM or AVR). The reason I am avoiding x86 is that I would like to gain some experience with embedded platforms, and I (possibly incorrectly) believe that the task may be easier when carried out on a smaller scale. Does anyone have any pointers to websites or resources with examples of this? I have read through most if not all of the OSDev questions on Stack Overflow, and I am also aware of AVR Freaks and OSDev. Additionally, if anyone has experience in this area and wants to offer advice regarding approach or platform, it would be much appreciated.
Thanks
Developing an (RT)OS is not a trivial task. It is very educational, though. My first advice is to start hardware-independent. The PC is a good starting point, as it comes with plenty of I/O possibilities and good debugging. If you create a kind-of-virtual-machine application, you can create something with simple platform capabilities (console output and some buttons/indicators are a good start), and you can use files, for instance, to output timing (schedules). If you start on 'bare metal' you'll have to start from scratch, and debugging on a LED (on/off/blinking) is very hard and time consuming. My second advice is to define your scope early: is it the scheduler, the communication mechanisms or the file systems you're interested in? Doing all of them can easily turn into a lifelong project.
Samek, Miro, Practical UML Statecharts in C/C++ contains some interesting sections on a microkernel. It's one of my favorite books.
Advanced PIC Microcontroller Projects in C: From USB to RTOS with the PIC 18F Series seems to cover some of your interests; I haven't read it yet, though. Operating Systems: Internals and Design Principles may also bring good insights. It covers all aspects from the scheduler to the network stack. Good luck!
Seems like you should get a copy of Jean Labrosse's book MicroC/OS.
It looks like he may have just updated it too.
http://micrium.com/page/press_room/news/id:40
http://micrium.com/page/home
This is a well-documented book describing the inner workings of an RTOS written in C and ported to many embedded processors. You could also run it on x86 and then cross-compile to another processor.
Contiki might be a good thing to research. It's very small, runs on microcontrollers, and is open source. It has a heavy bias towards networking and communications, but perhaps you can skip those parts and focus on the kernel.
If you choose ARM, pick up a copy of the ARM System Developer's Guide (Sloss, Symes, Wright). Link to Amazon
Chapter 11 discusses the implementation of a simple embedded operating system, with great explanations and sample code.
ARM and AVR are chalk and cheese - you've scoped this very wide!
You could produce a very different and more sophisticated OS for ARM than AVR (unless you are talking about AVR32 perhaps - which is a completely different architecture?).
AVR would be far more constraining, to the point that the task may be too trivial for the scope of your thesis. Even specifying ARM does not narrow it down much; low-end ARM parts have small on-chip memories, no MMU and simple peripherals, while higher-end parts have an MMU, data/instruction caches, often a GPU, sometimes an FPU, hardware Java bytecode execution, and many other complex peripherals. The term 'ARM' covers ARM7, ARM9, ARM11, Cortex-M3, Cortex-A8, plus a number of architectures intended for use in ASICs and FPGAs - so you need to narrow it down a bit, perhaps?
If you choose ARM, take a look at these resources, especially the Insider's Guides from Hitex and "Building bare-metal ARM with GNU"; they will help you get your board up and provide a starting point for your OS.
Silly as it may sound, I was recently interested in the Arduino platform to learn some hacking tricks with the help of more experienced friends. There was also this thread from a guy interested in writing an OS for it (although that was not his primary intention).
I think the Arduino is very basic and straightforward as an educational tool for such endeavors. It may be worth checking out to see if it fits the bill.
The first thing I recommend is to narrow your thesis topic considerably. OSs are ubiquitous, well researched and developed. What novel idea do you hope to pursue?
That said, AvrX is a very small microkernel that I've used professionally on AVR microcontrollers. It is written in assembly. One person started to port it to C, but hasn't finished the port. Finalizing the port to C and/or making a C port for the AVR32 architecture would be valuable.
An OS should not be tightly coupled to any processor, so ARM vs. x86 doesn't matter.
It would become a much bigger topic if we started discussing whether ARM is embedded and x86 is not. Anyway, there are many, many places where x86 processors are used for embedded software development.
I guess most of the kernel code will be plain C. There are many free OSes already available, for example embedded Linux, the free version of ITRON, Minix, etc. It will be a daunting task.
On the other hand, what you can try is to port embedded Linux to platforms on which it does not yet work. That would be really useful to the world.
An RTOS is almost never architecture-specific. Look at any RTOS architecture available on the net and you will notice that a CPU/hardware abstraction layer abstracts out the CPU, while the board-specific portions (which deal with peripherals such as com ports, timers etc.) are abstracted by a board support package.
To begin with, get an understanding of how multi-threading works in an RTOS, and try implementing simple context switch code for the CPU of your choice; this will involve code for creating a thread context, saving a context, and restoring a saved context. This code will form the basis of your hardware abstraction layer, and the initial development can easily be done with a software simulator for the selected CPU.
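As a user-space illustration of those three pieces (my sketch, using POSIX ucontext, which is deprecated but still available on Linux; a real RTOS would do the equivalent register save/restore in assembly):

    // ctx_demo.c - create, save and restore a thread context.
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;
    static char task_stack[16384];

    static void task(void)
    {
        printf("in task\n");
        swapcontext(&task_ctx, &main_ctx);   // save task, restore main
        printf("task resumed\n");
    }

    int main(void)
    {
        getcontext(&task_ctx);               // create a thread context...
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;        // ...that returns here when done
        makecontext(&task_ctx, task, 0);

        swapcontext(&main_ctx, &task_ctx);   // save main, run task
        printf("back in main\n");
        swapcontext(&main_ctx, &task_ctx);   // resume task where it stopped
        return 0;
    }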
I agree with the poster who suggested reading the book uC/OS-II by Jean Labrosse. Samples of context switch code, especially for x86, should be just a Google search away!
http://www.amazon.com/Operating-Systems-Design-Implementation-3rd/dp/0131429388
Pretty solid stuff.
