Arm Mali T-624 GPU arithmetic pipeline depth kernel

I am doing research on an Arm Mali T-624 board and I want to discover how many stages the GPU arithmetic pipeline has, so I am running the following kernel:
__kernel void arithmetic_pipeline_depth(__global int *list)
{
    for (int j = 1000000; j != 0; j--) {}
}
I am running this kernel with 1 to 512 work-groups while keeping the number of work-items per group fixed at 1.
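A minimal sketch of the host-side sweep (assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and the kernel argument already set; the names queue and kernel are illustrative):

for (size_t groups = 1; groups <= 512; groups++) {
    cl_event evt;
    size_t global = groups, local = 1;          /* 1 work-item per work-group */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t0, t1;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
    printf("%zu work-groups: %.3f ms\n", groups, (t1 - t0) / 1e6);
    clReleaseEvent(evt);
}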
The results of this experiment suggest that the arithmetic pipeline is about 38 stages long.
Do you think it is correct?

Related

ARM CMSIS DSP throughput for average of N float samples

I wrote a simple C code to compute the average of N floats stored in an array. For large N I got a throughput of 10.5 clock cycles per float.
arm_mean_f32() is actually poorer in performance.
Isn't this too many CCs/float?
The 3 operations
load-from-memory
accumulation-of-loaded-values
increment of pointer
can happen in parallel.
Does ARM Cortex M4F do this?
The project was run on a custom board with a Freescale K24 processor, which has an ARM Cortex-M4F.
The ARM implementation is very traditional; you can check it on GitHub. They just do loop unrolling to reduce loop overhead, accumulate the sum of 4 samples per loop iteration, and finally divide by the number of samples. I tried it on an M4F and got 5.3 cycles per float.
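A minimal sketch of that 4-way unrolled accumulation (illustrative only, not the actual CMSIS-DSP source; blockSize is assumed to be a multiple of 4 here):

#include "arm_math.h"   /* float32_t */

float32_t mean_unrolled(const float32_t *src, uint32_t blockSize)
{
    float32_t sum = 0.0f;
    uint32_t blkCnt = blockSize >> 2u;   /* process four samples per iteration */

    while (blkCnt > 0u) {
        sum += *src++;
        sum += *src++;
        sum += *src++;
        sum += *src++;
        blkCnt--;
    }
    return sum / (float32_t)blockSize;   /* finally divide by the sample count */
}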
Here is the code I used:
#include "arm_math.h"
#define MAX_BLOCKSIZE 32
float32_t src_buf_f32[MAX_BLOCKSIZE] =
{
-0.4325648115282207, -1.6655843782380970, 0.1253323064748307,
0.2876764203585489, -1.1464713506814637, 1.1909154656429988,
1.1891642016521031, -0.0376332765933176, 0.3272923614086541,
0.1746391428209245, -0.1867085776814394, 0.7257905482933027,
-0.5883165430141887, 2.1831858181971011, -0.1363958830865957,
0.1139313135208096, 1.0667682113591888, 0.0592814605236053,
-0.0956484054836690, -0.8323494636500225, 0.2944108163926404,
-1.3361818579378040, 0.7143245518189522, 1.6235620644462707,
-0.6917757017022868, 0.8579966728282626, 1.2540014216025324,
-1.5937295764474768, -1.4409644319010200, 0.5711476236581780,
-0.3998855777153632, 0.6899973754643451
};
float32_t result_f32;
int main(void)
{
    arm_mean_f32(src_buf_f32, MAX_BLOCKSIZE, &result_f32);
    return 0;
}
I think this is about the best performance you can get with floating point. Your poorer result may be due to measuring the number of cycles incorrectly, or to your particular silicon. You can also try increasing the compiler optimization level.
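If the cycle measurement is the concern, one common approach on the Cortex-M4 is the DWT cycle counter. A minimal sketch (measure_mean_cycles is a hypothetical helper; the CMSIS device header for your part is assumed to declare DWT and CoreDebug):

#include "arm_math.h"   /* arm_mean_f32(), float32_t */

uint32_t measure_mean_cycles(float32_t *buf, uint32_t n, float32_t *mean)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the DWT unit   */
    DWT->CYCCNT = 0u;                                 /* reset the cycle count */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting        */

    arm_mean_f32(buf, n, mean);

    return DWT->CYCCNT;                               /* divide by n for cycles/float */
}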

CPU runs faster than GPU (OpenCL code)

I wrote a code in OpenCL to find the first 5000 prime numbers. Here's that code:
__kernel void dataParallel(__global int* A)
{
    A[0]=2;
    A[1]=3;
    A[2]=5;
    int pnp;    // pnp = probable next prime
    int pprime; // previous prime
    int i,j;
    for(i=3;i<5000;i++)
    {
        j=0;
        pprime=A[i-1];
        pnp=pprime+2;
        while((j<i) && A[j]<=sqrt((float)pnp))
        {
            if(pnp%A[j]==0)
            {
                pnp+=2;
                j=0;
            }
            j++;
        }
        A[i]=pnp;
    }
}
Then I found out the execution time of this kernel code using OpenCL profiling. Here's the code:
cl_event event;//link an event when launch a kernel
ret=clEnqueueTask(cmdqueue,kernel,0, NULL, &event);
clWaitForEvents(1, &event);//make sure kernel has finished
clFinish(cmdqueue);//make sure all enqueued tasks finished
//get the profiling data and calculate the kernel execution time
cl_ulong time_start, time_end;
double total_time;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
//total_time = (cl_double)(time_end - time_start)*(cl_double)(1e-06);
printf("OpenCl Execution time is: %10.5f[ms] \n",(time_end - time_start)/1000000.0);
I ran these codes on various devices and this is what I got:
Platform:Intel(R) OpenCL
Device:Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
OpenCl Execution time is: 3.54796[ms]
Platform:AMD Accelerated Parallel Processing
Device:Pitcairn (AMD FirePro W7000 GPU)
OpenCl Execution time is: 194.18133[ms]
Platform:AMD Accelerated Parallel Processing
Device:Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
OpenCl Execution time is: 3.58488[ms]
Platform:NVIDIA CUDA
Device:Tesla C2075
OpenCl Execution time is: 125.26886[ms]
But aren't GPUs supposed to be faster than CPUs? Or, is there anything wrong with my code/implementation?
Please explain this behaviour.
You are calling clEnqueueTask(), so basically you are running a single "thread" (one work-item) on the GPU. A GPU will never beat a CPU in single-thread performance.
You need to restructure your code so that each prime calculation is handled by its own work-item, and then run 5000+ work-items (ideally millions). Then the GPU will beat the CPU, simply because it runs all of them in parallel, which the CPU cannot do.
In order to use multiple work items, call your kernel with clEnqueueNDRangeKernel()
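For instance (a sketch reusing the names from the host code above; the kernel itself also has to be rewritten so that each work-item does independent work):

size_t global_work_size = 5000;   /* one work-item per candidate, for illustration */
ret = clEnqueueNDRangeKernel(cmdqueue, kernel, 1, NULL,
                             &global_work_size, NULL, 0, NULL, &event);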
The provided code is a sequential algorithm that relies on previous values.
If you are running it with a global_work_size > 1, you are just performing the same calculation over and over.
The OpenCL implementation should compute the primes below N sequentially, then test the numbers in [N+1, N*N] in parallel for divisibility by any of those primes, filling a sieve array with 0 if the number is not prime and 1 if it is.
E.g. (not my code, someone's homework, and I did not check whether it really works).
If you need more than N^2 elements, calculate a prefix sum of the sieve array (an exclusive scan). The AMD APP SDK contains a sample of this operation.
This will give you offsets of the prime numbers for copying into prime number array and you'll be able to populate it:
__kernel void scatter(__global const uint* numbers, __global const uint* sieve_prefix_sum, __global const uint* sieve, uint offset, __global uint* prime_numbers)
{
    if (sieve[get_global_id(0)])
        prime_numbers[offset + sieve_prefix_sum[get_global_id(0)]] = numbers[get_global_id(0)];
}
This algorithm works like a tree - you compute the primes up to N sequentially, then evaluate K blocks in the [N+1, N*N] range, then you repeat and grow the next set of branches for [N^2, N^4], etc.
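A minimal sketch of the parallel divisibility-test step described above (all names are illustrative; base_primes holds the sequentially computed primes below N, and range_start is the first candidate of the block):

__kernel void sieve_test(__global const uint *base_primes,
                         const uint num_base_primes,
                         const uint range_start,
                         __global uint *sieve)
{
    uint candidate = range_start + get_global_id(0);
    uint is_prime = 1u;

    for (uint k = 0; k < num_base_primes; k++) {
        uint p = base_primes[k];
        if ((ulong)p * p > candidate)
            break;                       /* no divisor up to sqrt(candidate) */
        if (candidate % p == 0u) {
            is_prime = 0u;               /* divisible by a base prime */
            break;
        }
    }
    sieve[get_global_id(0)] = is_prime;  /* 1 if prime, 0 otherwise */
}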

Time measurement for getting speedup of OpenCL code on Intel HD Graphics vs C host code

I'm new to OpenCL and want to compare the performance gain of OpenCL kernels over C code.
Can someone please elaborate on which of these two methods is better/more correct for profiling OpenCL code when comparing performance against reference C code?
Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a getTime function):
ret |= clFinish(command_queue); //Empty the queue
getTime(&begin);
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL); //Profiling Disabled.
ret |= clFinish(command_queue);
getTime(&end);
g_NDRangePureExecTimeSec = elapsed_time(&begin, &end); //Performs: (end-begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
Using events profiling:
ret = clEnqueueMarker(command_queue, &evt1);
//Empty the Queue
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
ret |= clWaitForEvents(1, &evt1);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_long), &begin, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_long), &end, NULL);
g_NDRangePureExecTimeSec = (cl_double)(end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE); //nSec to Sec
ret |= clReleaseEvent(evt1);
Furthermore, I'm not using a dedicated graphics card; I'm running the following piece of OpenCL code on Intel HD 4600 integrated graphics:
__kernel void filter_rows(__global float *ip_img,\
__global float *op_img, \
int width, int height, \
int pitch,int N, \
__constant float *W)
{
__private int i=get_global_id(0);
__private int j=get_global_id(1);
__private int k;
__private float a;
__private int image_offset = N*pitch +N;
__private int curr_pix = j*pitch + i +image_offset;
// apply filter
a = ip_img[curr_pix-8] * W[0 ];
a += ip_img[curr_pix-7] * W[1 ];
a += ip_img[curr_pix-6] * W[2 ];
a += ip_img[curr_pix-5] * W[3 ];
a += ip_img[curr_pix-4] * W[4 ];
a += ip_img[curr_pix-3] * W[5 ];
a += ip_img[curr_pix-2] * W[6 ];
a += ip_img[curr_pix-1] * W[7 ];
a += ip_img[curr_pix-0] * W[8 ];
a += ip_img[curr_pix+1] * W[9 ];
a += ip_img[curr_pix+2] * W[10];
a += ip_img[curr_pix+3] * W[11];
a += ip_img[curr_pix+4] * W[12];
a += ip_img[curr_pix+5] * W[13];
a += ip_img[curr_pix+6] * W[14];
a += ip_img[curr_pix+7] * W[15];
a += ip_img[curr_pix+8] * W[16];
// write output
op_img[curr_pix] = (float)a;
}
There is similar code for column-wise processing. I'm observing a gain (OpenCL vs. optimized, vectorized reference C) of around 11x using method 1 and around 16x using method 2.
However I've noticed people claiming gains in the order of 200-300x, when using dedicated graphics cards.
So my questions are:
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform the Intel HD graphics?
Can I map the CUDA warp and thread concepts to Intel HD graphics (i.e. the number of threads executing in parallel)?
I'm observing gain around 11x using method 1 and around 16x using method 2.
This looks suspicious. You are using high-resolution counters in both cases. I think your input size is too small and generates high run-to-run variation. Event-based measurement is slightly more accurate, as it does not include some OS and application overhead in the measurement. However, the difference is very small. But when your kernel duration is very small, the difference between measurement methodologies does count.
What magnitude of gain can I expect, if I run the same code in dedicated graphics card. Will it be similar order or graphics card
will outperform Intel HD graphics?
It depends very much on the card's capabilities. While Intel HD Graphics is a good card for office work, movies and some games, it cannot compare to a high-end dedicated graphics card. Consider that such a card has a much higher power envelope, a much larger die area and far more computing resources. Dedicated cards are expected to show greater speedups. Your card has around 600 GFLOPS peak performance, while a discrete card can reach 3000 GFLOPS, so you could roughly expect your card to be 5 times slower than a discrete one. However, pay attention to what people are comparing when they claim 300x speedups. If they compare against an old-generation CPU, they might be right, but a new-generation i7 CPU can really close the gap.
Can i map WARP and thread concept from CUDA to Intel HD graphics (i.e. Number of threads executing in parallel)?
Intel HD Graphics does not have warps. Warps are closely tied to CUDA hardware: basically, a warp is the same instruction, dispatched by a warp scheduler, executing on 32 CUDA cores. However, OpenCL is very similar to CUDA, so you can launch a high number of threads that will execute in parallel on your graphics card's compute units. When programming your integrated card, it is best to forget about warps and instead know how many compute units your card has. Your code will run on several threads in parallel across those compute units; in other words, it will look very similar to CUDA code, but it will be parallelized according to the compute units available in the integrated card. Each compute unit can then parallelize execution in a SIMD fashion, for example. Note that the optimization techniques for CUDA are different from the optimization techniques for Intel HD Graphics.
You can't directly compare performance across different vendors; a basic comparison and expectation can be made from the number of parallel threads running multiplied by their frequency.
You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EU), each EU runs 7 hardware threads, each thread is capable of executing SIMD8, SIMD16 or SIMD32 instructions, each SIMD lane corresponding to one work item (WI) in OpenCL speak.
SIMD16 is typical for simple kernels, like the one you are trying to optimize, so we are talking about 20*7*16=2240 work items executing in parallel. Keep in mind that each work item is capable of processing vector data types, e.g. float4, so you should definitely try rewriting your kernel to take advantage of them. I hope this also helps you compare with NVidia's offerings.
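As an example, a minimal sketch of a float4 variant of the row filter above (a hypothetical rewrite, assuming the image width is a multiple of 4 and the x dimension of the global size is width/4):

__kernel void filter_rows_vec4(__global const float *ip_img,
                               __global float *op_img,
                               int width, int height,
                               int pitch, int N,
                               __constant float *W)
{
    int i = get_global_id(0) * 4;                /* first of 4 adjacent output columns */
    int j = get_global_id(1);
    int image_offset = N * pitch + N;
    int curr_pix = j * pitch + i + image_offset;

    float4 a = (float4)(0.0f);
    for (int k = -8; k <= 8; k++)                /* same 17-tap filter as above */
        a += vload4(0, ip_img + curr_pix + k) * W[k + 8];

    vstore4(a, 0, op_img + curr_pix);            /* 4 output pixels per work-item */
}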

MKL Performance on Intel Phi

I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:
double doModelFit(int model, ...) {
    ...
    while( !done ) {
        cblas_dgemm(...);
        cblas_dgemm(...);
        ...
        dgesv(...);
        ...
    }
    return result;
}

int main(int argc, char **argv) {
    ...
    c_start = 1; c_stop = nmodel;
    for(int c=c_start; c<c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
    }
}
Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):
int main(int argc, char **argv) {
    ...
    int numthreads = omp_get_max_threads();
    int c;
    #pragma omp parallel for private(c)
    for(int t=0; t<numthreads; t++) {
        // assuming nmodel divisible by numthreads...
        c_start = t*nmodel/numthreads+1;
        c_stop = (t+1)*nmodel/numthreads;
        for(c=c_start; c<c_stop; c++) {
            ...
            result = doModelFit(c, ...);
            ...
        }
    }
}
When I run version 1 on the host machine, it takes ~11 seconds and VTune reports poor parallelization with most of the time spent idle. Version 2 on the host machine takes ~5 seconds and VTune reports great parallelization (near 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run on the command prompt on mic0. When I use VTune to profile it:
Version 1 takes the same roughly 30 seconds, and the hotspot analysis shows that most time is spent in __kmp_wait_sleep and __kmp_static_yield. Out of 7710s CPU time, 5804s are spent in Spin Time.
Version 2 takes fooooorrrreevvvver... I kill it after running a couple minutes in VTune. The hotspot analysis shows that of 25254s of CPU time, 21585s are spent in [vmlinux].
Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default for OMP_NUM_THREADS and set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm certain I'm making rookie mistakes.
Thanks,
Andrew
The most probable reason for this behavior, given that most of the time is spent in the OS (vmlinux), is over-subscription caused by the nested OpenMP parallel regions inside the MKL implementations of cblas_dgemm() and dgesv(). E.g. see this example.
This version is supported and explained by Jim Dempsey at the Intel forum.
What about using the sequential MKL library? If you link MKL with the sequential option, it doesn't spawn OpenMP threads inside MKL itself. I guess you may get better results than you do now.
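A minimal sketch of that idea (doModelFit and nmodel are the names from the question; restricting the threaded MKL to one thread at run time is shown here as an alternative to the sequential link line):

#include <mkl.h>   /* mkl_set_num_threads() */

double doModelFit(int model, ...);   /* from the question above */

void fit_all_models(int nmodel)
{
    mkl_set_num_threads(1);          /* or link against the sequential MKL library */
    #pragma omp parallel for
    for (int c = 1; c < nmodel; c++)
        doModelFit(c);               /* outer OpenMP parallelism over models only */
}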

OpenCL code runs faster on MBP than on NVIDIA GTX 480

I have come across a strange problem. I'm implementing some linear algebra in OpenCL, only matrix multiplications so far, and have been testing this on my laptop. The code is really simple:
__kernel void matrix_mult(__global float* a,
                          __global float* b,
                          __global float* c,
                          const int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
I test the hardware by running the code 100 times like this:
clock_t begin = clock();
const unsigned int repeats = 100;
for (int i = 0; i != repeats; i++) {
    runCL(a, b, results, N, N*N);
}
clock_t end = clock();
On my MBP the matrix multiplications take about 1.2 ms on matrices of size 512*512, while the same code takes about 3 ms when running on a GTX 480 Linux box. This bothers me, since I wouldn't expect the expensive GTX card to be slower than the laptop.
As far as I can see, either my code is 'wrong' or I'm timing it in some wrong way.
I tried using the event-based timing from the OpenCL spec, and this gave somewhat more realistic results.
cl_event event = {0};
cl_int err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL, global_work_size, NULL, 0, NULL, &event);
assert(err == CL_SUCCESS);
err = clWaitForEvents(1, &event);
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
double executionTimeInMilliseconds = (end - start) * 1.0e-6f;
std::cout << "execution time in millis : " << executionTimeInMilliseconds << std::endl;
Now the GT330M does the operation in 46 ms and the GTX480 does it in 2.5 ms. This raises another really interesting question: with PROFILING turned on, the GT 330M becomes about 30 times slower, which sort of makes sense, but the GTX480 keeps up the same performance. Can anyone explain why this is?
In timing the original problem, what you're seeing here is that with this naive code, the better specs of the GTX480 are actually hurting you.
The code sample, a first pass at a matrix multiply, is completely dominated by memory bandwidth; each thread is accessing a different element of B, which can't be coalesced because of the stride.
The GTX480 has a 3x wider (384-bit) and 2x faster (1840 MHz) memory bus than the GT330M (128-bit, 800 MHz). Nominally, that gives a peak bandwidth advantage of 177.4 GB/s vs 25.6 GB/s, and since this is memory-bandwidth dominated, you might think that would win. However, because of the non-coalesced reads and the wider memory bus, the b-array accesses are only using 32 bits of each 384-bit memory access, and in the 330M case, only 32 bits out of each 128-bit access. So the effective memory bandwidths for the b access are 14.8 GB/s and 6.4 GB/s; now there's only a factor of 2 difference in total memory bandwidth rather than 7 or so, and much of the advantage of the faster card is being squandered. In addition, that memory bandwidth has to be divided among 10x as many cores, so the latency for each core to get its access and do the calculation is longer. I suspect that if you used larger matrix sizes, you could hide more of the latency and get closer to the best-possible 2x speedup rather than the 2.5x slowdown you're seeing.
The ultimate solution here is to use a more memory-friendly matrix multiplication algorithm as a benchmark.
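For example, a minimal sketch of a tiled multiply that stages blocks of a and b in local memory (TILE and the work-group shape are assumptions, and N is assumed to be a multiple of TILE; this is an untuned illustration, not a drop-in replacement):

#define TILE 16   /* requires a TILE x TILE work-group size */

__kernel void matrix_mult_tiled(__global const float* a,
                                __global const float* b,
                                __global float* c,
                                const int N)
{
    __local float a_tile[TILE][TILE];
    __local float b_tile[TILE][TILE];

    int row  = get_global_id(1);
    int col  = get_global_id(0);
    int lrow = get_local_id(1);
    int lcol = get_local_id(0);

    float sum = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        /* cooperative, coalesced loads of one tile of a and one tile of b */
        a_tile[lrow][lcol] = a[row * N + (t + lcol)];
        b_tile[lrow][lcol] = b[(t + lrow) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; k++)
            sum += a_tile[lrow][k] * b_tile[k][lcol];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    c[row * N + col] = sum;
}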
As for the profiling results you're seeing, I have no idea. Perhaps the 330M doesn't have as good hardware support for profiling, so it has to be implemented in software? Since the GTX numbers are about the same either way, I'd just use the simpler timing approach for now, which, since you're not using asynchronous kernels or transfers, should be fine.
I think you're pushing the limits on the timer resolution for Nvidia. Try clGetDeviceInfo() with CL_DEVICE_PROFILING_TIMER_RESOLUTION to check it. With those tiny times I wouldn't really conclude anything.
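A minimal sketch of that query (device_id stands for the cl_device_id being profiled; the returned value is the profiling timer resolution in nanoseconds):

size_t timer_resolution_ns = 0;
clGetDeviceInfo(device_id, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                sizeof(timer_resolution_ns), &timer_resolution_ns, NULL);
printf("profiling timer resolution: %zu ns\n", timer_resolution_ns);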
A few ms could be the difference between initialization routines for each code path, especially when both testing systems have different hardware.
I recommend starting by testing a larger set which requires at least several seconds on both the laptop and the nVidia card.
