ARM CMSIS DSP throughput for average of N float samples

I wrote simple C code to compute the average of N floats stored in an array, and I measured about 10.5 clock cycles per float as throughput for large N.
arm_mean_f32() actually performs worse.
Isn't this too many cycles per float?
The 3 operations
load-from-memory
accumulation-of-loaded-values
increment of pointer
can happen in parallel.
Does the ARM Cortex-M4F do this?
The project was run on a custom board with a Freescale K24 processor, which has an ARM Cortex-M4F.

The ARM implementation is very traditional; you can check it on GitHub. They just do loop unrolling to reduce the loop overhead, accumulate the sum of 4 samples per loop iteration, and finally divide by the number of samples. I tried it on an M4F and got 5.3 cycles per float.
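Roughly, the unrolled approach looks like this (a sketch of the idea only, not the verbatim CMSIS-DSP source; float32_t and uint32_t come from arm_math.h):

/* Sketch of the unrolled mean: accumulate 4 samples per iteration,
   handle the 0..3 leftover samples, then divide once at the end. */
void mean_f32_sketch(const float32_t *src, uint32_t n, float32_t *result)
{
    float32_t sum = 0.0f;
    uint32_t blk = n >> 2;            /* number of groups of 4 samples */

    while (blk > 0u) {
        sum += src[0] + src[1] + src[2] + src[3];
        src += 4;
        blk--;
    }

    blk = n & 3u;                     /* 0..3 remaining samples */
    while (blk > 0u) {
        sum += *src++;
        blk--;
    }

    *result = sum / (float32_t)n;
}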
Here is the code I used:
#include "arm_math.h"
#define MAX_BLOCKSIZE 32
float32_t src_buf_f32[MAX_BLOCKSIZE] =
{
-0.4325648115282207, -1.6655843782380970, 0.1253323064748307,
0.2876764203585489, -1.1464713506814637, 1.1909154656429988,
1.1891642016521031, -0.0376332765933176, 0.3272923614086541,
0.1746391428209245, -0.1867085776814394, 0.7257905482933027,
-0.5883165430141887, 2.1831858181971011, -0.1363958830865957,
0.1139313135208096, 1.0667682113591888, 0.0592814605236053,
-0.0956484054836690, -0.8323494636500225, 0.2944108163926404,
-1.3361818579378040, 0.7143245518189522, 1.6235620644462707,
-0.6917757017022868, 0.8579966728282626, 1.2540014216025324,
-1.5937295764474768, -1.4409644319010200, 0.5711476236581780,
-0.3998855777153632, 0.6899973754643451
};
float32_t result_f32;
int main(void)
{
    arm_mean_f32(src_buf_f32, MAX_BLOCKSIZE, &result_f32);
    return 0;
}
I think this is close to the best performance you can get with floating point. Your poor result may be because you are measuring the number of cycles incorrectly, or because of your silicon. You can also try increasing the compiler optimization level.
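If you suspect the cycle measurement itself, the usual way to count cycles on a Cortex-M4 is the DWT cycle counter. A minimal sketch, assuming your CMSIS device header (e.g. the one for the K24) is included so that DWT and CoreDebug are defined:

static inline void cyccnt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the trace/debug block */
    DWT->CYCCNT = 0u;                                 /* reset the counter */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting core cycles */
}

/* usage:
       cyccnt_init();
       uint32_t c0 = DWT->CYCCNT;
       arm_mean_f32(src_buf_f32, MAX_BLOCKSIZE, &result_f32);
       uint32_t cycles = DWT->CYCCNT - c0;  // subtract the overhead of an empty measurement
*/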

Related

How can I effectively time the execution of a function that's only a few cycles long?

I'm trying to do some comparisons on different methods for calculating dot products using SSE Intrinsics, but since the methods are only a few cycles long, I have to run the instructions trillions of times for it to take more than a tiny fraction of a second. The only problem with that is that gcc with the -O3 flag is "optimizing" my main method into an infinite loop.
My code is
#include <immintrin.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <inttypes.h>
#define NORMAL 0
struct _Vec3 {
    float x;
    float y;
    float z;
    float w;
};
typedef struct _Vec3 Vec3;

__m128 singleDot(__m128 a, __m128 b) {
    return _mm_dp_ps(a, b, 0b00001111);
}

int main(int argc, char** argv) {
    for (uint16_t j = 0; j < (1L << 16); j++) {
        for (uint64_t i = 0; i < (1L << 62); i++) {
            Vec3 a = {i, i + 0.5, i + 1, 0.0};
            Vec3 b = {i, i - 0.5, i - 1, 0.0};
#if NORMAL
            float ans = normalDot(a, b); // naive implementation
#else
            // float _c[4] = {a.x, a.y, a.z, 0.0};
            // float _d[4] = {b.x, b.y, b.z, 0.0};
            __m128 c = _mm_load_ps((float*)&a);
            __m128 d = _mm_load_ps((float*)&b);
            __m128 ans = singleDot(c, d);
#endif
        }
    }
}
but when I compile with gcc -std=c11 -march=native -O3 main.c and run objdump -d, it turns main into
0000000000400400 <main>:
400400: eb fe jmp 400400 <main>
is there an alternative for timing different approaches?
That's because this:
for (uint16_t j = 0; j < (1L << 16); j++) {
is an infinite loop -- the maximum value for a uint16_t is 65535 (2^16 - 1), after which it will wrap back to 0. So the test will always be true.
Even after fixing the uint16_t instead of uint64_t typo that makes your loop infinite, the actual work would still be optimized away because nothing uses the result.
You can use Google Benchmark's DoNotOptimize to stop your unused ans result from being optimized away, i.e. functions like "Escape" and "Clobber" that this Q&A is asking about. That works in GCC, and that question links to a relevant YouTube video from a clang developer's CppCon talk.
Another worse way is to assign the result to a volatile variable. But keep in mind that common-subexpression elimination can still optimize away earlier parts of the calculation, whether you use volatile or an inline-asm macro to make sure the compiler materializes the actual final result somewhere. Micro-benchmarking is hard. You need the compiler to do exactly the amount of work that would happen in the real use-case, but not more.
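For reference, a minimal sketch of the "Escape"/"Clobber" idea for GCC/clang (the same effect as benchmark::DoNotOptimize / ClobberMemory; the names escape/clobber are just illustrative):

static inline void escape(void *p) {
    __asm__ volatile("" : : "g"(p) : "memory");   /* pretend to read/write *p */
}
static inline void clobber(void) {
    __asm__ volatile("" : : : "memory");          /* pretend to touch all of memory */
}

/* inside the benchmark loop:
       __m128 ans = singleDot(c, d);
       escape(&ans);   // forces the compiler to materialize ans instead of deleting the work
*/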
See Idiomatic way of performance evaluation? for that and more.
Keep in mind exactly what you're measuring here.
Probably a bunch of loop overhead, and probably store-forwarding stalls, depending on whether the compiler vectorizes those initializers or not. But even if it does, conversion of integer to FP and 2x SIMD FP additions are comparable in cost to a dpps in terms of throughput. (Which is what you're measuring, not latency; the difference matters a lot on CPUs with out-of-order execution, depending on the context of your real use case.)
Performance is not 1-dimensional at the scale of a couple instructions. Slapping a repeat loop around some work can measure the throughput or latency, depending on whether you make the input dependent on the previous output (a loop-carried dependency chain). But if your work ends up bound on front-end throughput, then loop overhead is an important part. Plus you might end up with effects due to how the machine code for your loop lines up with 32-byte boundaries for the uop cache.
For something this short and simple, static analysis is usually good: count uops for the front-end and ports in the back-end, and analyze latency. See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? LLVM-MCA can do this for you, and so can IACA. You can also measure as part of your real loop that uses dot products.
See also RDTSCP in NASM always returns the same value for some discussion of what you can measure about a single instruction.
I have to run the instructions trillions of times for it to take more than a tiny fraction of a second
Current x86 CPUs can loop at best one iteration per clock cycle for a tiny loop. It's impossible to write a loop that runs faster than that. 4 billion iterations (in asm) will take at least a whole second on a 4GHz CPU.
Of course an optimizing C compiler could unroll your loop and be doing as many source iterations as it wants per asm jump.

Slower SSE performance on large array sizes

I am new to SSE programming so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below.
int ssum(const int *d, unsigned int len)
{
    static const unsigned int BLOCKSIZE = 4;
    unsigned int i, remainder;
    int output;
    __m128i xmm0, accumulator;
    __m128i* src;

    remainder = len % BLOCKSIZE;
    src = (__m128i*)d;
    accumulator = _mm_loadu_si128(src);
    output = 0;

    for (i = BLOCKSIZE; i < len - remainder; i += BLOCKSIZE) {
        xmm0 = _mm_loadu_si128(++src);
        accumulator = _mm_add_epi32(accumulator, xmm0);
    }

    accumulator = _mm_add_epi32(accumulator, _mm_srli_si128(accumulator, 8));
    accumulator = _mm_add_epi32(accumulator, _mm_srli_si128(accumulator, 4));
    output = _mm_cvtsi128_si32(accumulator);

    for (i = len - remainder; i < len; i++) {
        output += d[i];
    }
    return output;
}
As you can see, it is a fairly straightforward implementation where I sum the array 4 elements at a time using the extended XMM registers and then clean up at the end by adding up the remaining elements.
I then compared the performance of this SIMD implementation against just a plain for loop. The result of this experiment is available here:
SIMD vs. for-loop
As you can see, in comparison to a for loop, this implementation does indeed show about a ~60% speedup for input sizes (meaning the length of the array) up to about 5M elements. However, for larger input sizes the performance, relative to the for loop, takes a dramatic dive and produces only about a 20% speedup.
I am at a loss to explain this dramatic decrease in performance. I am more or less stepping linearly through memory, so the effect of cache misses and page faults should be about the same for both implementations. What am I missing here? Is there any way we can flatten that curve out? Any thoughts would be greatly appreciated.
For large input, the data is outside the cache, and the code is memory bound.
For small input, the data is inside the cache (i.e. L1 / L2 / L3 cache), and the code is computation bound.
I assume you didn't try to flush the cache before the performance measurement.
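If you want a "cold cache" measurement, a hedged sketch of such a flush using SSE2's _mm_clflush (d and len are the same arguments that ssum takes; 64-byte cache lines are assumed):

#include <emmintrin.h>

static void flush_from_cache(const int *d, unsigned int len)
{
    for (unsigned int i = 0; i < len; i += 16)           /* 16 ints = one 64-byte line */
        _mm_clflush((const char *)d + i * sizeof(int));
    _mm_mfence();                                        /* make sure the flushes complete */
}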
The cache memory is inside the CPU, and the bandwidth between the cache memory and the ALU (or SSE) units is very high (high bandwidth - less time transferring data).
Your highest level cache (i.e. L3) size is about 4MB to 8MB (depending on your CPU model).
Larger amounts of data must be located in the DDR SDRAM, which is external RAM (outside the CPU).
The CPU is connected to the DDR SDRAM by a memory bus, which has much lower bandwidth than the cache memory.
Example:
Assume your external RAM type is Dual Channel DDR3 SDRAM 1600.
The maximum theoretical bandwidth between external RAM and CPU is about 25GB/Sec.
Reading 100MBytes of data (at 25GB/S) from the RAM to the CPU takes about 100e6 / 25e9 = 4msec.
From my experience the utilized bandwidth is about half of theoretical bandwidth, so the reading time is about 8msec.
The computation time is shorter:
Assume each iteration of your loop takes about 2 CPU clocks (just an example).
Each iteration process 16 bytes of data.
Total CPU clocks for processing 100MB takes about (100e6 / 16)*2 = 12500000 clks.
Assume CPU frequency is 3GHz.
Total SSE processing time is about 12500000 / 3e9 = 4.2msec.
As you can see, reading the data from external RAM takes twice as long as the SSE computation.
Since the data transfer and the computation occur in parallel, the total time is the maximum of 4.2msec and 8msec (i.e. 8msec).
Let's assume the loop without SSE takes twice as much computation time, so without SSE the computation time is about 8.4msec.
In the above example, the total improvement from using SSE is about 0.4msec.
Note: The selected numbers are just for example purposes.
Benchmarks:
I did some benchmarks on my system.
I am using Windows 10 and Visual Studio 2010.
Benchmark test: summing 100MBytes of data (summing 25*1024^2 32-bit integers).
CPU
Intel Core i5 3550 (Ivy Bridge).
CPU Base frequency is 3.3GHz.
Actual Core Speed during the test: 3.6GHz (Turbo boost is enabled).
L1 data cache size: 32KBytes.
L2 cache size: 256KBytes (single core L2 cache size).
L3 cache size: 6MBytes.
Memory:
8GB DDR3 Dual channel.
RAM Frequency: 666MHz (equivalent to 1333MHz without DDR).
Memory theoretical maximum bandwidth: (128*1333/8) / 1024 = 20.8GBytes/Sec.
Sum 100MB as large chunk with SSE (data in external RAM).
Processing time: 6.22msec
Sum 1KB 100 times with SSE (data inside cache).
Processing time: 3.86msec
Sum 100MB as large chunk without SSE (data in external RAM).
Processing time: 8.1msec
Sum 1KB 100 times without SSE (data inside cache).
Processing time: 4.73msec
Utilized memory bandwidth: 100/6.22 = 16GB/Sec (dividing data size by time).
Average clocks per iteration with SSE (data in cache): (3.6e9*3.86e-3)/(25/4*1024^2) = 2.1 clks/iteration (dividing total CPU clocks by number of iterations).

OpenMP for beginners

I just got started with OpenMP; I wrote a little C code in order to check if what I have studied is correct. However, I ran into some trouble; here is the main.c code:
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
    float msec_kernel;
    const int N = 1000000;
    int i, a[N];

    clock_t start = clock(), diff;
    #pragma omp parallel for private(i)
    for (i = 1; i <= N; i++){
        a[i] = 2 * i;
    }
    diff = clock() - start;

    msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
    printf("Kernel Time: %e s\n", msec_kernel * 1e-03);
    printf("a[N] = %d\n", a[N]);
    return 0;
}
My goal is to see how long it takes the PC to do this operation using 1 and 2 CPUs; in order to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however I don't get the right execution time; my results in fact are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
corresponding to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't automatically mean that you're going to get better performance.
OpenMP has to manage the data distribution among your cores, which takes time as well. Especially for a very basic operation such as the single multiplication you are doing, the performance of a sequential (single core) program will be better.
Second, by going through every element of your array only once and not doing anything else, you make no use of the cache memory, and most certainly not of the shared cache between CPUs.
So you should start reading about general algorithm performance. Making use of multiple cores through shared cache is, in my opinion, the essence.
Today's computers have reached a stage where the CPU is much faster than a memory allocation, read or write. This means that when using multiple cores, you'll only see a benefit if you exploit things like shared cache, because the data distribution, initialization of the threads, and managing them take time as well. To really see a performance speedup (see the link; speedup is an essential term in parallel computing) you should program an algorithm with a heavy accent on computation, not on memory; this has to do with locality (another important term).
So if you want to experience a big performance boost from using multiple cores, test it on a matrix-matrix multiplication with big matrices, such as 10'000*10'000. Plot some graphs of input size (matrix size) vs. time and matrix size vs. GFLOPS, and compare the multicore version with the sequential one.
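A minimal sketch of such a benchmark (MATRIX_N is a placeholder size you would vary for the plots; timing is done with omp_get_wtime()):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define MATRIX_N 1024   /* grow this toward 10'000 once it works */

int main(void)
{
    double *A = malloc((size_t)MATRIX_N * MATRIX_N * sizeof *A);
    double *B = malloc((size_t)MATRIX_N * MATRIX_N * sizeof *B);
    double *C = calloc((size_t)MATRIX_N * MATRIX_N, sizeof *C);
    for (long n = 0; n < (long)MATRIX_N * MATRIX_N; n++) { A[n] = 1.0; B[n] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < MATRIX_N; i++)
        for (int k = 0; k < MATRIX_N; k++)          /* i-k-j order for better locality */
            for (int j = 0; j < MATRIX_N; j++)
                C[(long)i*MATRIX_N + j] += A[(long)i*MATRIX_N + k] * B[(long)k*MATRIX_N + j];
    double t1 = omp_get_wtime();

    printf("C[0] = %f, time = %f s, %.2f GFLOP/s\n",
           C[0], t1 - t0, 2.0*MATRIX_N*MATRIX_N*MATRIX_N / (t1 - t0) / 1e9);
    free(A); free(B); free(C);
    return 0;
}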
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix-multiplication has a locality of O(n).
Hope this helps :-)
I suggest setting the number of cores/threads within the code itself, either directly at the #pragma line with #pragma omp parallel for num_threads(2), or using the omp_set_num_threads function: omp_set_num_threads(2);
Further, when doing time/performance analysis it is really important to always run the program multiple times and then take the mean of all the runtimes, or something like that. Running the respective programs only once will not give you a meaningful reading of the time used. Always run multiple times in a row, and don't forget to also vary the input data.
I suggest writing a test.c file which calls your actual function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for (int i = 0; i < executiontimes; i++) {
    function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;   /* in clock ticks; divide by CLOCKS_PER_SEC for seconds */
Improve this test algorithm by adding some rand() for your values, seeding them with srand(), etc. If you have more questions on the subject or about my answer, leave a comment and I'll try to explain further.
The function clock() returns elapsed CPU time, which includes ticks from all threads on all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total CPU time will always be longer than the serial time.
If you want the real time (wall-clock time), use the OpenMP runtime library function omp_get_wtime(), defined in omp.h. It is portable across platforms and should be the preferred way to do wall timing.
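For the loop in the question, a minimal sketch of the change (a drop-in replacement for the clock() lines in main, reusing the same i, N and a):

double t_start = omp_get_wtime();
#pragma omp parallel for private(i)
for (i = 1; i <= N; i++){
    a[i] = 2 * i;
}
double t_end = omp_get_wtime();
printf("Kernel Time: %e s\n", t_end - t_start);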
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);

Time measurement for getting speedup of OpenCL code on Intel HD Graphics vs C host code

I'm new to OpenCL and want to compare the performance gain between C code and OpenCL kernels.
Can someone please elaborate on which of these 2 methods is better/more correct for profiling OpenCL code when comparing performance against C reference code:
Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a getTime function):
ret |= clFinish(command_queue); //Empty the queue
getTime(&begin);
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL); //Profiling Disabled.
ret |= clFinish(command_queue);
getTime(&end);
g_NDRangePureExecTimeSec = elapsed_time(&begin, &end); //Performs: (end-begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
Using event profiling:
ret = clEnqueueMarker(command_queue, &evt1);
//Empty the Queue
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
ret |= clWaitForEvents(1, &evt1);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_long), &begin, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_long), &end, NULL);
g_NDRangePureExecTimeSec = (cl_double)(end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE); //nSec to Sec
ret |= clReleaseEvent(evt1);
Furthermore, I'm not using a dedicated graphics card; I'm running the following piece of OpenCL code on Intel HD 4600 integrated graphics:
__kernel void filter_rows(__global float *ip_img,
                          __global float *op_img,
                          int width, int height,
                          int pitch, int N,
                          __constant float *W)
{
    __private int i = get_global_id(0);
    __private int j = get_global_id(1);
    __private int k;
    __private float a;
    __private int image_offset = N*pitch + N;
    __private int curr_pix = j*pitch + i + image_offset;

    // apply filter
    a  = ip_img[curr_pix-8] * W[0 ];
    a += ip_img[curr_pix-7] * W[1 ];
    a += ip_img[curr_pix-6] * W[2 ];
    a += ip_img[curr_pix-5] * W[3 ];
    a += ip_img[curr_pix-4] * W[4 ];
    a += ip_img[curr_pix-3] * W[5 ];
    a += ip_img[curr_pix-2] * W[6 ];
    a += ip_img[curr_pix-1] * W[7 ];
    a += ip_img[curr_pix-0] * W[8 ];
    a += ip_img[curr_pix+1] * W[9 ];
    a += ip_img[curr_pix+2] * W[10];
    a += ip_img[curr_pix+3] * W[11];
    a += ip_img[curr_pix+4] * W[12];
    a += ip_img[curr_pix+5] * W[13];
    a += ip_img[curr_pix+6] * W[14];
    a += ip_img[curr_pix+7] * W[15];
    a += ip_img[curr_pix+8] * W[16];

    // write output
    op_img[curr_pix] = (float)a;
}
And similar code for column-wise processing. I'm observing a gain (OpenCL vs. optimized vectorized C reference) of around 11x using method 1 and around 16x using method 2.
However I've noticed people claiming gains on the order of 200-300x when using dedicated graphics cards.
So my questions are:
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform Intel HD Graphics?
Can I map the WARP and thread concept from CUDA to Intel HD Graphics (i.e. the number of threads executing in parallel)?
I'm observing gain around 11x using method 1 and around 16x using method 2.
This looks suspicious. You are using high-resolution counters in both cases. I think that your input size is too small and generates high run-to-run variation. The event-based measurement is slightly more accurate, as it does not include some OS + application overhead in the measurement. However, the difference is very small. But when your kernel duration is very small, the difference between measurement methodologies ... counts.
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card
outperform Intel HD Graphics?
That depends very much on the card's capabilities. While Intel HD Graphics is a good card for office, movies and some games, it cannot compare to a high-end dedicated graphics card. Consider that such a card has a much higher power envelope, a much larger die area and far more computing resources. Dedicated cards are expected to show greater speedups. Your card has around 600 GFLOPS peak performance, while a discrete card can reach 3000 GFLOPS. So you could roughly expect your card to be 5 times slower than a discrete one. However, pay attention to what people are comparing when claiming 300x speedups. If they compare with an old-generation CPU, they might be right, but a new-generation i7 CPU can really close the gap.
Can I map the WARP and thread concept from CUDA to Intel HD Graphics (i.e. the number of threads executing in parallel)?
Intel HD graphics does not have warps. The warps are closely tied to CUDA hardware. Basically a warp is the same instruction, dispatched by a warp scheduler that executes on 32 CUDA Cores. However OpenCL is very similar to CUDA so you can launch a high number of threads, that will execute in parallel on your graphics card compute units. But when programming on your integrated card, best is to forget about warps and know how many compute units your card has. Your code will run on several threads in parallel on your compute units. In other words, your code will look very similar to the CUDA code but it will be parallelized depending on the available compute units in the integrated card. Each compute unit can then parallelize execution in a SIMD fashion for example. But the optimization techniques for CUDA are different from the optimization techniques for programming Intel HD graphics.
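A minimal host-side sketch of querying that (it assumes device is the cl_device_id your context was created with):

cl_uint compute_units = 0;
size_t  max_wg_size   = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);
printf("compute units: %u, max work-group size: %zu\n",
       compute_units, max_wg_size);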
You can't directly compare performance across different vendors; a basic comparison and expectation can be made using the number of parallel threads running multiplied by their frequency.
You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EU), each EU runs 7 hardware threads, each thread is capable of executing SIMD8, SIMD16 or SIMD32 instructions, each SIMD lane corresponding to one work item (WI) in OpenCL speak.
SIMD16 is typical for simple kernels, like the one you are trying to optimize, so we are talking about 20*7*16 = 2240 work items executing in parallel. Keep in mind that each work item can process vector data types, e.g. float4, so you should definitely try rewriting your kernel to take advantage of them. I hope this also helps you compare with NVIDIA's offerings.
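For illustration, a hedged sketch of what a float4 variant of your row filter could look like (untested on your setup; each work item now produces 4 output pixels, so the global size in dimension 0 shrinks by 4 and the row width is assumed to be a multiple of 4):

__kernel void filter_rows_float4(__global const float *ip_img,
                                 __global float *op_img,
                                 int width, int height,
                                 int pitch, int N,
                                 __constant float *W)
{
    int i = get_global_id(0) * 4;               // 4 output pixels per work item
    int j = get_global_id(1);
    int image_offset = N*pitch + N;
    int curr_pix = j*pitch + i + image_offset;

    float4 a = (float4)(0.0f);
    for (int k = -8; k <= 8; k++) {
        // vload4 reads 4 neighbouring pixels at once; the scalar tap W[k+8] is broadcast
        a += vload4(0, ip_img + curr_pix + k) * W[k + 8];
    }
    vstore4(a, 0, op_img + curr_pix);
}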

OpenCL code runs faster on MBP than on NVIDIA GTX 480

I have come across a strange problem. I'm implementing some linear algebra, only matrix multiplications so far, in OpenCL, and have been testing this on my laptop. The code is really simple:
__kernel void matrix_mult(__global float* a,
                          __global float* b,
                          __global float* c,
                          const int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
I test the hardware by running the code 100 times like this:
clock_t begin = clock();
const unsigned int repeats = 100;
for (int i = 0; i != repeats; i++) {
    runCL(a, b, results, N, N*N);
}
clock_t end = clock();
On my MBP the matrix multiplication takes about 1.2 ms on matrices of size 512*512, while the same code takes about 3 ms when running on a GTX 480 Linux box. This bothers me, since I would expect the expensive GTX card to be faster than the laptop, not slower.
As far as I can see, either my code is 'wrong' or I'm timing it in some wrong way.
I tried using the event-based timing system in the OpenCL spec, which gave somewhat more realistic results.
cl_event event = {0};
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL, global_work_size, NULL, 0, NULL, &event);
assert(err == CL_SUCCESS);
cl_int err = clWaitForEvents (1,&event);
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
double executionTimeInMilliseconds = (end - start) * 1.0e-6f;
std::cout << "execution time in milis : " << executionTimeInMilliseconds << std::endl;
Now the GT330M does the operation in 46 ms and the GTX480 does it in 2.5 ms. This leads to another really interesting question: with PROFILING turned on, the GT 330M becomes about 30 times slower, which sort of makes sense, but the GTX480 keeps up the same performance. Can anyone explain why this is?
In timing the original problem, what you're seeing is that with this naive code, the better specs of the GTX480 are actually hurting you.
The code sample, a first pass at a matrix multiply, is completely dominated by memory bandwidth; each thread is accessing a different element of B, which can't be coalesced because of the stride.
The GTX480 has a 3x wider (384-bit) and 2x faster (1840 MHz) memory bus than the GT330M (128-bit, 800 MHz). Nominally, that gives a peak bandwidth advantage of 177.4 GB/s vs 25.6 GB/s, and since this is memory-bandwidth dominated, you might think that would win. However, because of the non-coalesced reads and the wider memory bus, the b-array accesses are only using 32 bits of each 384-bit memory access, and in the 330M case, only 32 bits out of each 128-bit access. So the effective memory bandwidths for the b accesses are 14.8 GB/s and 6.4 GB/s; now there's only a factor of 2 difference in total memory bandwidth rather than 7 or so, and much of the advantage of the faster card is being squandered. In addition, that memory bandwidth has to be divided among 10x as many cores, so the latency for each core to get its data and do the calculation is longer. I suspect that if you used larger matrix sizes, you could hide more of the latency and get closer to the best-possible 2x speedup rather than the 2.5x slowdown you're seeing.
The ultimate solution here is to use a more memory-friendly matrix multiplication algorithm as a benchmark.
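For reference, a hedged sketch of a local-memory tiled multiply (TILE is a hypothetical tile size; it assumes N is a multiple of TILE and that the local work size is set to TILE x TILE):

#define TILE 16

__kernel void matrix_mult_tiled(__global const float* a,
                                __global const float* b,
                                __global float* c,
                                const int N)
{
    __local float as[TILE][TILE];
    __local float bs[TILE][TILE];

    int col = get_global_id(0);
    int row = get_global_id(1);
    int lx  = get_local_id(0);
    int ly  = get_local_id(1);

    float sum = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        // each work item loads one element of the A tile and one of the B tile
        as[ly][lx] = a[row*N + (t + lx)];
        bs[ly][lx] = b[(t + ly)*N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; k++)
            sum += as[ly][k] * bs[k][lx];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    c[row*N + col] = sum;
}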
The profiling results you're seeing, though, I have no idea about. Perhaps the 330M doesn't have as good hardware support for profiling, so things have to be implemented in software? Since the GTX numbers are about the same either way, I'd just use the simpler timing approach for now, which, since you're not using asynchronous kernels or transfers, should be fine.
I think you're pushing the limits on the timer resolution for Nvidia. Try clGetDeviceInfo() with CL_DEVICE_PROFILING_TIMER_RESOLUTION to check it. With those tiny times I wouldn't really conclude anything.
A few ms could be the difference between initialization routines for each code path, especially when both testing systems have different hardware.
I recommend starting by testing a larger set which requires at least several seconds on both the laptop and the nVidia card.
