GLSL instructions limit for different chipsets - mobile

I'm porting an engineering application to all major mobile platforms.
It is important to confirm the VS/FS instruction limits for many of the shaders I'm going to write.
I'm looking for the VS/FS instruction limits for the following chipsets:
Nvidia Tegra 2 -
Adreno 205/220 -
PowerVR SGX series -

Here's a comparison of capabilities by shader version: http://en.wikipedia.org/wiki/High-level_shader_language#Pixel_shader_comparison
This is for HLSL, but the same numbers should apply to the equivalent GLSL versions. I know this doesn't answer for the specific chipsets you mentioned, but you should be able to cross-reference their supported shader versions against these capabilities.
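One more thing you can do at runtime, regardless of chipset: OpenGL ES 2.0 doesn't expose instruction counts as queryable limits, but it does expose related shader resource limits, which can help sanity-check a target device. A minimal sketch, assuming a current EGL/GL ES 2.0 context (the helper name is just illustrative):

    #include <GLES2/gl2.h>
    #include <stdio.h>

    // Print the shader resource limits that GLES 2.0 does expose.
    void print_shader_limits(void) {
        GLint vs_uniforms = 0, fs_uniforms = 0, varyings = 0;
        glGetIntegerv(GL_MAX_VERTEX_UNIFORM_VECTORS, &vs_uniforms);
        glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_VECTORS, &fs_uniforms);
        glGetIntegerv(GL_MAX_VARYING_VECTORS, &varyings);
        printf("max vec4 uniforms: VS=%d FS=%d, max varyings=%d\n",
               vs_uniforms, fs_uniforms, varyings);
    }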

What's the advantage of running OpenCL code on a CPU?

I am learning OpenCL programming and am noticing something weird.
Namely, when I list all OpenCL enabled devices on my machine (Macbook Pro), I get the following list:
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Iris Pro
GeForce GT 750M
The first is my CPU, the second is the onboard graphics solution by Intel and the third is my dedicated graphics card.
Research shows that Intel has made their hardware OpenCL compatible so that I can tap into the power of the onboard graphics unit. That would be the Iris Pro.
With that in mind, what is the purpose of the CPU being OpenCL compatible? Is it merely for convenience, so that kernels can fall back to the CPU if no other device is found, or is there a speed advantage to running code as OpenCL kernels instead of as regular (well-threaded C) programs on the CPU?
See https://software.intel.com/sites/default/files/m/d/4/1/d/8/Writing_Optimal_OpenCL_28tm_29_Code_with_Intel_28R_29_OpenCL_SDK.pdf for basic info.
Basically, the Intel OpenCL compiler performs horizontal autovectorization for certain types of kernels. That means that with an 8-wide vector unit (AVX) you get 8 work-items running in lockstep on a single core, in much the same way that an Nvidia GPU runs 32 threads in a single 32-wide SIMD unit.
There are two major benefits to this approach. First, what happens if in two years the vector width is increased to 16? Then you instantly get autovectorization across 16 work-items when you run on that CPU, with no need to recompile your code. Second, it's far easier to write an OpenCL kernel that autovectorizes well than to write it in assembly, or in C while coaxing your compiler into producing efficient code.
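To make that second point concrete, here is a minimal sketch (my example, not from the Intel document) of the kind of kernel a CPU OpenCL compiler can vectorize horizontally: one work-item per element, no divergent branching, neighbouring work-items touching neighbouring memory.

    // Hypothetical SAXPY kernel, embedded as a C++ raw string for a host program.
    // Each work-item handles one element, so the CPU compiler can pack 8 (AVX)
    // consecutive work-items into one SIMD register.
    static const char* kSaxpySrc = R"CLC(
    __kernel void saxpy(__global const float* x,
                        __global float* y,
                        const float a)
    {
        size_t i = get_global_id(0);   // neighbouring work-items -> contiguous loads/stores
        y[i] = a * x[i] + y[i];
    }
    )CLC";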
As OpenCL implementations mature, it's possible to achieve good levels of performance portability for your kernels across a wide range of devices. Some recent work in my research group shows that, in some cases, OpenCL codes achieve a similar fraction of hardware peak performance on the CPU and the GPU. On the CPU, the OpenCL kernels were being very effectively auto-vectorised by Intel's OpenCL CPU implementation. On the GPU, efficient code was being generated for HPC and desktop devices from Nvidia (whose OpenCL support still works surprisingly well) and AMD.
If you want to develop your OpenCL code anyway in order to exploit the GPU, then you often get a fast multi-core + SIMD version "for free" by running the same code on the CPU.
For two recent papers from my group detailing the performance portability results we've achieved across four different real applications with OpenCL, see:
"On the performance portability of structured grid codes on many-core computer architectures", S.N. McIntosh-Smith, M. Boulton, D. Curran and J.R. Price. ISC, Leipzig, June 2014. DOI: 10.1007/978-3-319-07518-1_4
"High Performance in silico Virtual Drug Screening on Many-Core Processors", S. McIntosh-Smith, J. Price, R.B. Sessions, A.A. Ibarra, IJHPCA 2014. DOI: 10.1177/1094342014528252
I have considered this for a while. You can get most of the advantages of OpenCL for the CPU without using OpenCL and without too much difficulty in C++. To do this you need:
Something for multi-threading - I use OpenMP for this
A SIMD library - I use Agner Fog's Vector Class Library (VCL) for this, which covers SSE2 through AVX-512.
A SIMD math library. Once again I use Agner Fog's VCL for this.
A CPU dispatcher. Agner Fog's VCL has an example of how to do this.
Using the CPU dispatcher you determine what hardware is available and choose the best code path based on the hardware. This provides one of the advantages of OpenCL.
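A minimal sketch of how the OpenMP and VCL pieces combine (the header names are VCL's; the function is just illustrative, and it assumes n is a multiple of 8 - VCL's load_partial/store_partial handle the tail otherwise). Compile with something like -O2 -fopenmp -mavx:

    #include <omp.h>
    #include "vectorclass.h"      // Agner Fog's Vector Class Library
    #include "vectormath_exp.h"   // VCL's vectorized exp/log/pow

    // OpenMP splits the loop across cores; Vec8f processes 8 floats per iteration.
    void scaled_exp(const float* in, float* out, int n, float scale) {
        #pragma omp parallel for
        for (int i = 0; i < n; i += 8) {
            Vec8f x;
            x.load(in + i);                   // unaligned load of 8 floats
            Vec8f y = Vec8f(scale) * exp(x);  // VCL's SIMD exp
            y.store(out + i);
        }
    }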
This gives you most of the advantages of OpenCL on the CPU without its disadvantages. You never have to worry about a vendor dropping driver support. Nvidia provides only minimal support for OpenCL - including several-year-old bugs it will likely never fix (which I wasted too much time on). Intel only ships Iris Pro OpenCL drivers for Windows. With my suggested method your kernels can use all C++ features, including templates, instead of OpenCL's restricted and extended version of C (though I do like the extensions). You can be sure your code does what you want this way and are not at the whim of some device driver.
The one disadvantage of my suggested method is that you can't just install a new driver and have it optimize for new hardware. However, the VCL already supports AVX-512, so it's already built for hardware that is not out yet and won't be superseded for several years. In any case, to get the most out of your hardware you will almost certainly have to rewrite your kernel in OpenCL for that hardware anyway - a new driver can only help so much.
More info on the SIMD math library: you could use Intel's expensive, closed-source SVML for this (which is what the Intel OpenCL driver uses - search for svml after you install the Intel OpenCL drivers, and don't confuse the SDK with the drivers). Or you could use AMD's free but closed-source LIBM. However, neither of these works well on the competitor's processors. Agner Fog's VCL works well on both vendors' processors, is open source, and is free.

OpenCL: which SDK is best?

I am a beginner in OpenCL programming. My PC runs Windows 8.1 and has both Intel graphics and an AMD Radeon 7670. When I searched for an OpenCL SDK and sample hello-world programs to download, I found that there are separate SDKs, with sample programs in entirely different formats. I have to use C, not C++. Can anyone suggest which SDK I should install? Please help.
At the lowest level, the various OpenCL SDKs are the same; they all include cl.h from the Khronos website. Once you've included that header you can write to the OpenCL API, and then you need to link to OpenCL.lib, which is also supplied in the SDK. At runtime, your application will load the OpenCL.dll that your GPU vendor has installed in /Windows/System32.
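A minimal sketch of that flow, using only the C API from cl.h (this works against any vendor's SDK; link with OpenCL.lib on Windows or -lOpenCL elsewhere):

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; ++p) {
            char name[256];
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
            printf("Platform %u: %s\n", p, name);

            cl_device_id devices[8];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);
            for (cl_uint d = 0; d < num_devices; ++d) {
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                printf("  Device %u: %s\n", d, name);
            }
        }
        return 0;
    }

The same binary lists whichever vendors' runtimes are installed - on your machine that should mean both the Intel and the AMD platforms.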
Alternatively, you can include cl.hpp and use the C++ wrapper, but since you said you're a C programmer, and because most of the books use the C API, stick with cl.h. I think this might account for the "programs in entirely different formats" observation you made, which is why I bring it up here.
The benefit of one SDK over another typically is for profiling and debugging. The AMD SDK, for example, includes APP Profiler (or now CodeXL) which will help you figure out how to make your kernels faster. NVIDIA supplies Parallel Nsight for the same purpose, and Intel also has performance tools.
So you might choose your SDK based on the hardware in your machine, but understand that once you've coded to the OpenCL API, your application can run on other GPUs from other vendors -- that is the benefit of OpenCL. You should even be able to get samples from one vendor to execute on hardware from another.
One thing to be careful of is versions: If you code to an OpenCL 1.2 SDK you might not run on OpenCL 1.1 hardware.
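A hedged sketch of that check (the helper name is mine): query what the device actually implements before calling 1.2-only APIs, regardless of which SDK headers you compiled against.

    #include <CL/cl.h>
    #include <string.h>

    // Returns non-zero if the device reports OpenCL 1.2 or later.
    int device_supports_1_2(cl_device_id device) {
        char version[128] = {0};   // e.g. "OpenCL 1.1 CUDA 4.2.1"
        clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, NULL);
        return strncmp(version, "OpenCL 1.2", 10) >= 0;   // crude lexicographic check
    }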
For me, the best thing about OpenCL is that you do not need an SDK at all, because it abstracts the different vendor implementations behind a common interface (see the answer in this thread: Do I really need an OpenCL SDK?).

Can I use a GPU for GPGPU on any system?

I wish to use the GPU of a system for GPGPU. The machine is remote, I don't have administrative rights, and I don't know anything about its drivers. What I do know is that it has a Matrox VGA card. Can I use it for GPGPU with C code and the gcc compiler, or do I need some kind of driver? Or can I only use OpenGL and twist the logic to suit my purpose?
There is no easy way to do this using OpenGL. It would be straightforward if you knew the graphics card supported GPGPU functionality, i.e. CUDA, OpenCL, or AMD Stream; then you could use one of these APIs to write a program that uses the GPU for computation. For this you will need the corresponding SDKs. But even with these APIs, it is non-trivial to use a GPU for complex calculations.
A few Matrox video cards support OpenGL and/or DirectX, so you might get away with what you want to do through shaders written in OpenGL/GLSL or DirectX/HLSL.
Check the specification of your video card.
Warning: these cards are not known to have particularly good GPUs.
Using the OpenCL/CUDA/Stream capabilities of a Graphics card requires drivers that expose the functionality. Aside from that, older cards (like say the ATI X800 series) do not have the required hardware to efficiently do what GPGPU requires, and thus are unusable for such purposes.
I doubt Matrox VGA cards have any support for GPGPU whatsoever.

Smartphone 3D Capabilities Database

Is there a website or downloadable document that contains information about the 3D capabilities (fill rate, features, shader units, etc.) of the 3D hardware used in many of today's smartphones, such as the iPhone 3G, iPhone 4, the more popular Android devices, Windows Phone 7, etc.?
If a resource like this existed that sure would have saved me a lot of time! Back when I first started my current job, my first task was to make a spreadsheet of information about the 3D capabilities of all the smartphones out there. It was waaaay harder than I expected, because the various device manufacturers are surprisingly cagey about specific info other than what's in their marketing brochures. I didn't even get every device out there (just a solid sampling of the best known ones) nor did I investigate them to quite the depth you're asking for.
In general, however, I found a few common patterns and some particularly useful resources that, taken together, told me most of what I needed to know.
First off, pretty much all the smartphones out there use either a PowerVR chip or a Qualcomm chip (formerly/alternatively known as "Adreno"; it's a long story, read the Wikipedia article) for the GPU. For example, all iPhones use a PowerVR GPU (different generations though) and Microsoft has mandated that all WinPhone 7 devices use the exact same Snapdragon chipset from Qualcomm. Motorola Droids use PowerVR, HTC Android phones use Qualcomm, etc.
Second, I relied heavily on the sites GLBenchmark, PDAdb.net, and good ole Wikipedia. For example, going to the "Results" tab on GLBenchmark brings up a list of all the smartphones they've tested; go to the iPhone 3GS results, and then to the "GL Environment" tab:
http://www.glbenchmark.com/phonedetails.jsp?benchmark=glpro20&D=Apple+iPhone+3G+S&testgroup=gl
Oh hey look, it has a PowerVR SGX 535

How to activate nVidia cards programmatically on new MacBookPros for CUDA programming?

The new MacBook Pros come with two graphics adapters, the Intel HD Graphics and the NVIDIA GeForce GT 330M. OS X switches back and forth between them depending on the workload, detection of an external monitor, or activation of Rosetta.
I want to get my feet wet with CUDA programming, and unfortunately the CUDA SDK doesn't seem to take care of this back-and-forth switching. When Intel is active, no CUDA device gets detected, and when the NVidia card is active, it gets detected. So my current work-around is to use the little tool gfxCardStatus (http://codykrieger.com/gfxCardStatus/) to force the card on or off, just as I need it, but that's not satisfactory.
Does anybody here know what the Apple-blessed, Apple-recommended way is to (1) detect the presence of a CUDA card, (2) to activate this card when present?
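For context, by "detect" I mean the standard runtime device query, which is what fails when the Intel GPU is active. A minimal sketch (assumes the CUDA toolkit is installed and you link against cudart; it only reports whether a CUDA device is currently visible, it doesn't force the discrete GPU on):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess || count == 0) {
            std::printf("No CUDA device visible (%s)\n", cudaGetErrorString(err));
            return 1;
        }
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // describe the first device
        std::printf("Found %d CUDA device(s), first: %s\n", count, prop.name);
        return 0;
    }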
Well, supposedly Mac OS X should switch back and forth when needed, but apparently it doesn't consider CUDA.
In Snow Leopard Apple introduced OpenCL, which is meant to let any application program the GPU; that, rather than CUDA, is probably Apple's recommended way of achieving what you want.
I am testing CUDA and OpenCL on the Nvidia platform. All my applications (I have to write them with both the CUDA and OpenCL frameworks) achieve the same performance (measured in MFLOPS).
BUT: if you use local memory optimizations tuned for Nvidia, then there are problems running the same application on an ATI GPU. So this is not really cross-platform :(
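To illustrate what a local memory optimization looks like (a hedged sketch of my own, not code from the application): a work-group reduction that stages data in __local memory, with the group size hard-coded to one vendor's sweet spot (e.g. 32, Nvidia's warp width). It is exactly these hard-coded sizes and the local memory budget that tend to behave differently on an ATI GPU.

    // OpenCL kernel embedded as a C++ raw string; launch with local size 32 and
    // 32*sizeof(float) bytes of __local memory per work-group.
    static const char* kPartialSumSrc = R"CLC(
    __kernel void partial_sum(__global const float* in,
                              __global float* out,
                              __local float* tile)
    {
        size_t lid = get_local_id(0);
        tile[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction within the work-group, entirely in local memory.
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s) tile[lid] += tile[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0) out[get_group_id(0)] = tile[0];
    }
    )CLC";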
