OpenCL code runs faster on MBP than on NVIDIA GTX 480

I have come across a strange problem. I'm implementing some linear algebra, only matrix multiplications so far, in OpenCL, and have been testing it on my laptop. The code is really simple:
__kernel void matrix_mult(__global float* a,
                          __global float* b,
                          __global float* c,
                          const int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
I test the hardware by running the code 100 times like this:
clock_t begin = clock();
const unsigned int repeats = 100;
for (int i = 0; i != repeats; i++) {
    runCL(a, b, results, N, N*N);
}
clock_t end = clock();
On my MBP a matrix multiplication takes about 1.2 ms on matrices of size 512*512, while the same code takes about 3 ms on a GTX 480 Linux box. This bothers me, since I would expect the expensive GTX card to be faster than the laptop, not slower.
As far as I can see, either my code is 'wrong' or I'm timing it in some wrong way.
I tried using the event-based timing system in the OpenCL spec, and this gave somewhat more realistic results.
cl_event event = {0};
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL, global_work_size, NULL, 0, NULL, &event);
assert(err == CL_SUCCESS);
err = clWaitForEvents(1, &event);

cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);

double executionTimeInMilliseconds = (end - start) * 1.0e-6;
std::cout << "execution time in millis : " << executionTimeInMilliseconds << std::endl;
Now the GT 330M does the operation in about 46 ms and the GTX 480 does it in 2.5 ms. This raises another really interesting question: with profiling turned on, the GT 330M becomes about 30 times slower, which sort of makes sense, but the GTX 480 keeps up the same performance. Can anyone explain why this is?

Regarding the original timing problem: what you're seeing here is that, with this naive code, the better specs of the GTX 480 are actually hurting you.
The code sample, a first pass at a matrix multiply, is completely dominated by memory bandwidth; each thread is accessing a different element of B, and those accesses can't be coalesced because of the stride.
The GTX 480 has a 3x wider (384-bit) and roughly 2x faster (1840 MHz) memory bus than the GT 330M (128-bit, 800 MHz). Nominally, that gives a peak bandwidth advantage of 177.4 GB/s vs 25.6 GB/s, and since this is memory-bandwidth dominated, you might think that would win. However, because of the non-coalesced reads and the wider memory bus, the b-array accesses are only using 32 bits of each 384-bit memory transaction, and in the 330M case only 32 bits out of each 128-bit transaction. So the effective memory bandwidths for the b accesses are 14.8 GB/s and 6.4 GB/s; now there's only a factor of about 2 difference in effective bandwidth rather than 7 or so, and much of the advantage of the faster card is being squandered. In addition, that memory bandwidth has to be divided among 10x as many cores, so the latency for each core to get its data and do the calculation is longer. I suspect that if you used larger matrix sizes, you could hide more of that latency and get closer to the best-possible ~2x speedup rather than the 2.5x slowdown you're seeing.
The ultimate solution here is to use a more memory-friendly matrix multiplication algorithm as a benchmark.
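For example, a tiled version that stages blocks of A and B in local memory makes the B reads coalesced and reuses each loaded element TILE times. This is only a minimal sketch, assuming N is a multiple of TILE and that the host passes a local work size of {TILE, TILE}; it is not the asker's code:

#define TILE 16

__kernel void matrix_mult_tiled(__global const float* a,
                                __global const float* b,
                                __global float* c,
                                const int N)
{
    __local float a_tile[TILE][TILE];
    __local float b_tile[TILE][TILE];

    int row  = get_global_id(1);
    int col  = get_global_id(0);
    int lrow = get_local_id(1);
    int lcol = get_local_id(0);

    float sum = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        // Adjacent work-items load adjacent addresses, so these reads coalesce.
        a_tile[lrow][lcol] = a[row * N + (t + lcol)];
        b_tile[lrow][lcol] = b[(t + lrow) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; k++) {
            sum += a_tile[lrow][k] * b_tile[k][lcol];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    c[row * N + col] = sum;
}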
The profiling results you're seeing, though, I have no idea about. Perhaps the 330M doesn't have as good hardware support for profiling, so it has to be implemented in software? Since the GTX numbers are about the same either way, I'd just use the simpler timing approach for now, which, since you're not using asynchronous kernels or transfers, should be fine.

I think you're pushing the limits of the timer resolution for NVIDIA. Try clGetDeviceInfo() with CL_DEVICE_PROFILING_TIMER_RESOLUTION to check it. With times that tiny I wouldn't really conclude anything.
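A minimal sketch of that query, assuming `device` is a valid cl_device_id you already obtained and that the OpenCL and stdio headers are included:

size_t timer_resolution_ns = 0;  /* resolution reported in nanoseconds */
cl_int err = clGetDeviceInfo(device,
                             CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                             sizeof(timer_resolution_ns),
                             &timer_resolution_ns, NULL);
if (err == CL_SUCCESS)
    printf("Profiling timer resolution: %zu ns\n", timer_resolution_ns);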

A few ms could be the difference between the initialization routines for each code path, especially when the two test systems have different hardware.
I recommend starting with a larger problem that takes at least several seconds on both the laptop and the NVIDIA card.

Related

Sum reduction with parallel algorithm - Bad performances compared to CPU version

I have written a small piece of code that does a sum reduction of a 1D array. I am comparing a sequential CPU version and an OpenCL version.
The code is available at link1
The kernel code is available at link2
and if you want to compile: link3 for the Makefile
My issue is the poor performance of the GPU version:
For vector sizes below 1.024 * 10^9 elements (i.e. with 1024, 10240, 102400, 1024000, 10240000, and 102400000 elements), the runtime of the GPU version is higher (only slightly, but still higher) than that of the CPU version.
As you can see, I have taken 2^n values in order to have a number of work-items compatible with the size of a work-group.
Concerning the number of work-groups, I have taken:
// Number of work-groups
int nWorkGroups = size/local_item_size;
But for a high number of work-items, I wonder whether the value of nWorkGroups is suitable (for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 work-groups; isn't this too much?).
I tried to modify local_item_size in the range [64, 128, 256, 512, 1024], but the performance remains bad for all of these values.
I get a good benefit only for size = 1.024 * 10^9 elements; here are the runtimes:
Size of the vector
1024000000
Problem size = 1024000000
GPU Parallel Reduction : Wall Clock = 20 second 977511 micro
Final Sum Sequential = 5.2428800006710899200e+17
Sequential Reduction : Wall Clock = 337 second 459777 micro
From your experience, why do I get such bad performance? I thought the advantage would be more significant compared to the CPU version.
Maybe someone can spot a major mistake in the source code because, at the moment, I can't manage to solve this issue.
Thanks
Well I can tell you some reasons:
You don't need to write the reduction buffer. You can directly clear it in GPU memory using clEnqueueFillBuffer() or a helper kernel.
ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0,
local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);
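A hedged sketch of the fill-buffer alternative (clEnqueueFillBuffer requires OpenCL 1.2; variable names follow the snippet above):

const double zero = 0.0;
ret = clEnqueueFillBuffer(command_queue, reductionBuffer,
                          &zero, sizeof(zero),                 /* pattern, pattern size */
                          0, local_item_size * sizeof(double), /* offset, bytes to fill */
                          0, NULL, NULL);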
Don't use blocking calls, except for the last read; otherwise you are wasting some time there.
You are doing the last reduction on the CPU. Iterative processing through the kernel can help.
If your kernel only reduces 128 elements per pass, your 10^9 elements only get down to 8*10^6, and the CPU does the rest. If you add the data copy on top of that, it becomes completely not worth it.
However, if you run 3 passes at 512 elements per pass, you read back from the GPU just 10^9/512^3 ≈ 8 values. So the only bottlenecks would be the first copy to the GPU and the kernel launches.
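A rough host-side sketch of that iterative scheme; the names (reduce_kernel, bufA, bufB, elemsPerGroup) are illustrative and not taken from the linked code:

size_t n = problem_size;                 /* number of elements still to reduce */
size_t local = (size_t)local_item_size;  /* work-group size */
cl_mem in = bufA, out = bufB;            /* ping-pong buffers on the device */
while (n > 1) {
    size_t groups = (n + elemsPerGroup - 1) / elemsPerGroup;
    size_t global = groups * local;
    cl_uint count = (cl_uint)n;
    clSetKernelArg(reduce_kernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(reduce_kernel, 1, sizeof(cl_mem), &out);
    clSetKernelArg(reduce_kernel, 2, sizeof(cl_uint), &count);
    clEnqueueNDRangeKernel(command_queue, reduce_kernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);
    cl_mem tmp = in; in = out; out = tmp;   /* output becomes next input */
    n = groups;
}
double final_sum = 0.0;
clEnqueueReadBuffer(command_queue, in, CL_TRUE, 0, sizeof(double),
                    &final_sum, 0, NULL, NULL);  /* the only blocking call */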

OpenMP for beginners

I just got started with OpenMP; I wrote a little C program in order to check whether what I have studied is correct. However, I ran into some trouble; here is the main.c code:
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
    float msec_kernel;
    const int N = 1000000;
    int i, a[N];
    clock_t start = clock(), diff;
    #pragma omp parallel for private(i)
    for (i = 1; i <= N; i++){
        a[i] = 2 * i;
    }
    diff = clock() - start;
    msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
    printf("Kernel Time: %e s\n", msec_kernel*1e-03);
    printf("a[N] = %d\n", a[N]);
    return 0;
}
My goal is to see how long it takes the PC to do such an operation using 1 and 2 CPUs. In order to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however, I don't get the right execution times. In fact my results are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
Both corresponding to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?
First of all, using multiple cores doesn't automatically mean that you're going to get better performance.
OpenMP has to manage the data distribution among your cores, which takes time as well. Especially for a very basic operation such as the single multiplication you are doing, a sequential (single-core) program will perform better.
Second, by going through every element of your array only once and not doing anything else, you make no use of the cache and certainly not of the cache shared between CPUs.
So you should start reading about general algorithm performance. In my opinion, making use of the shared cache is the essence of benefiting from multiple cores.
Today's computers have come to a stage where the CPU is much faster than a memory allocation, read, or write. This means that when using multiple cores, you'll only see a benefit if you exploit things like shared cache, because the data distribution, initialization of the threads, and managing them take time as well. To really see a performance speedup (see the link; an essential term in parallel computing) you should program an algorithm with a heavy accent on computation rather than memory; this has to do with locality (another important term).
So if you want to experience a big performance boost by using multiple cores, test it on a matrix-matrix multiplication on big matrices such as 10'000*10'000. And plot some graphs of input size (matrix size) versus time and matrix size versus GFLOPS, comparing the multicore version with the sequential one.
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix multiplication does O(n^3) work on O(n^2) data, so each element can be reused O(n) times; that reuse is its locality.
Hope this helps :-)
I suggest setting the number of cores/threads within the code itself, either directly on the #pragma line (#pragma omp parallel for num_threads(2)) or with the omp_set_num_threads function (omp_set_num_threads(2);).
Further, when doing time/performance analysis it is really important to run the program multiple times and then take the mean of all the runtimes, or something like that. Running the respective programs only once will not give you a meaningful reading of the time used. Always run them multiple times in a row, and don't forget to also vary the input data.
I suggest writing a test.c file, which calls your actual function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for (int i = 0; i < executiontimes; i++) {
    function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test algorithm, add some rand() values, seed them with srand(), etc. If you have more questions on the subject or about my answer, leave a comment and I'll try to explain further by adding more detail.
The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total CPU time will always be longer than the serial time.
If you want the real time (wall-clock time), try the OpenMP runtime library function omp_get_wtime() defined in omp.h. It is portable across platforms and should be the preferred way to do wall timing.
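For instance, a minimal sketch of timing the loop from the question with omp_get_wtime(), reusing the N, a, and omp.h include already in main.c:

double t0 = omp_get_wtime();
#pragma omp parallel for
for (int i = 0; i < N; i++) {   /* note: iterate 0..N-1 to stay within bounds */
    a[i] = 2 * i;
}
double t1 = omp_get_wtime();
printf("Kernel wall time: %e s\n", t1 - t0);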
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);

Do virtual cores contribute to performance when parallelizing a matrix multiplication?

I have an O(n^3) matrix multiplication function in C.
void matrixMultiplication(int N, double **A, double **B, double **C, int threadCount) {
    int i = 0, j = 0, k = 0, tid;

    #pragma omp parallel num_threads(4) shared(N, A, B, C, threadCount) private(i, j, k, tid)
    {
        tid = omp_get_thread_num();
        #pragma omp for
        for (i = 1; i < N; i++)
        {
            printf("Thread %d starting row %d\n", tid, i);
            for (j = 0; j < N; j++)
            {
                for (k = 0; k < N; k++)
                {
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
                }
            }
        }
    }
    return;
}
I am using OpenMP to parallelize this function by splitting up the multiplications. I am performing this computation on square matrices of size N = 3000 with a 1.8 GHz Intel Core i5 processor.
This processor has two physical cores and two virtual cores. I noticed the following performance for my computation:
1 thread: 526.06 s
2 threads: 264.531 s
3 threads: 285.195 s
4 threads: 279.914 s
I had expected my gains to continue until the number of threads equaled four. However, this obviously did not occur.
Why did this happen? Is it because the performance of a core is equal to the sum of its physical and virtual cores?
Using more than one hardware thread per core can help or hurt, depending on circumstances.
It can help if one hardware thread stalls because of a cache miss, and the other hardware thread can keep going and keep the ALU busy.
It can hurt if each hardware thread forces evictions of data needed by the other thread. That is, the threads destructively interfere with each other.
One way to address the problem is to write the kernel in a way such that each thread needs only half the cache. For example, blocked matrix multiplication can be used to minimize the cache footprint of a matrix multiplication.
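As a hedged illustration of that idea (the block size BS is a tunable assumption, and N is assumed to be a multiple of BS to keep the sketch short; this is not the asker's code):

#define BS 64   /* block size: two BS x BS double blocks should fit comfortably in cache */

/* Blocked (tiled) matrix multiplication: C += A * B, all N x N. */
void matmul_blocked(int N, double **A, double **B, double **C)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            /* each thread owns the C block at (ii, jj), so there are no races */
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += aik * B[k][j];
                    }
}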
Another way is to write the algorithm in a way such that both threads operate on the same data at the same time, so they help each other bring data into cache (constructive interference). This approach is admittedly hard to do with OpenMP unless the implementation has good support for nested parallelism.
I guess that the bottleneck is the memory (or L3 CPU cache) bandwidth. Arithmetic is quite cheap these days.
If you can afford it, try to benchmark the same code with the same data on a more powerful processor (e.g. some socket 2011 i7).
Remember that on today's processors, a cache miss lasts as long as several hundred instructions (or cycles): RAM is very slow w.r.t. cache or CPU.
BTW, if you have a GPGPU you could play with OpenCL.
Also, it is probable that linear algebra packages like LAPACK, an optimized BLAS, or some other numerical library are more efficient than your naive matrix multiplication.
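As a hedged sketch of what that could look like with a CBLAS interface (assuming the matrices are stored as contiguous row-major arrays rather than arrays of row pointers, and that an implementation such as OpenBLAS or MKL is linked):

#include <cblas.h>

/* C = A * B + C, all N x N, row-major, contiguous storage */
void matmul_blas(int N, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A, N,   /* alpha, A, lda */
                     B, N,   /* B, ldb        */
                1.0, C, N);  /* beta, C, ldc  */
}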
You could also consider using __builtin_prefetch (see this)
BTW, numerical computation is hard. I am not an expert at all, but I have met people who have worked on it for dozens of years (often after a PhD in the field).

Time measurement for getting speedup of OpenCL code on Intel HD Graphics vs C host code

I'm new to OpenCL and want to compare the performance gain between C code and OpenCL kernels.
Can someone please explain which of these two methods is better/correct for profiling OpenCL code when comparing performance with the reference C code:
Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a getTime function):
ret |= clFinish(command_queue); //Empty the queue
getTime(&begin);
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL); //Profiling Disabled.
ret |= clFinish(command_queue);
getTime(&end);
g_NDRangePureExecTimeSec = elapsed_time(&begin, &end); //Performs: (end-begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
Using events profiling:
ret = clEnqueueMarker(command_queue, &evt1);
//Empty the Queue
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
ret |= clWaitForEvents(1, &evt1);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_long), &begin, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_long), &end, NULL);
g_NDRangePureExecTimeSec = (cl_double)(end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE); //nSec to Sec
ret |= clReleaseEvent(evt1);
Furthermore, I'm not using a dedicated graphics card; I'm utilizing Intel HD 4600 integrated graphics for the following piece of OpenCL code:
__kernel void filter_rows(__global float *ip_img,\
__global float *op_img, \
int width, int height, \
int pitch,int N, \
__constant float *W)
{
__private int i=get_global_id(0);
__private int j=get_global_id(1);
__private int k;
__private float a;
__private int image_offset = N*pitch +N;
__private int curr_pix = j*pitch + i +image_offset;
// apply filter
a = ip_img[curr_pix-8] * W[0 ];
a += ip_img[curr_pix-7] * W[1 ];
a += ip_img[curr_pix-6] * W[2 ];
a += ip_img[curr_pix-5] * W[3 ];
a += ip_img[curr_pix-4] * W[4 ];
a += ip_img[curr_pix-3] * W[5 ];
a += ip_img[curr_pix-2] * W[6 ];
a += ip_img[curr_pix-1] * W[7 ];
a += ip_img[curr_pix-0] * W[8 ];
a += ip_img[curr_pix+1] * W[9 ];
a += ip_img[curr_pix+2] * W[10];
a += ip_img[curr_pix+3] * W[11];
a += ip_img[curr_pix+4] * W[12];
a += ip_img[curr_pix+5] * W[13];
a += ip_img[curr_pix+6] * W[14];
a += ip_img[curr_pix+7] * W[15];
a += ip_img[curr_pix+8] * W[16];
// write output
op_img[curr_pix] = (float)a;
}
And similar code for column-wise processing. I'm observing a gain (OpenCL vs. optimized, vectorized C reference) of around 11x using method 1 and around 16x using method 2.
However, I've noticed people claiming gains on the order of 200-300x when using dedicated graphics cards.
So my questions are:
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform Intel HD graphics?
Can I map the warp and thread concepts from CUDA to Intel HD graphics (i.e. the number of threads executing in parallel)?
I'm observing gain around 11x using method 1 and around 16x using method 2.
This looks suspicious. You are using high-resolution counters in both cases. I think your input size is too small and generates high run-to-run variation. The event-based measurement is slightly more accurate, as it does not include some OS + application overhead in the measurement. However, the difference is very small. But in the case where your kernel duration is very small, the difference between measurement methodologies ... counts.
What magnitude of gain can I expect, if I run the same code in dedicated graphics card. Will it be similar order or graphics card
will outperform Intel HD graphics?
It depends very much on the card's capabilities. While Intel HD Graphics is a good card for office work, movies, and some games, it cannot compare to a high-end dedicated graphics card. Consider that such a card has a much higher power envelope, a much larger die area, and far more computing resources. Dedicated cards are therefore expected to show greater speedups. Your card has around 600 GFLOPS peak performance, while a discrete card can reach 3000 GFLOPS, so you could roughly expect your card to be 5 times slower than a discrete one. However, pay attention to what people are comparing when they claim 300x speedups. If they compare against an old-generation CPU, they might be right, but a new-generation i7 CPU can really close the gap.
Can i map WARP and thread concept from CUDA to Intel HD graphics (i.e. Number of threads executing in parallel)?
Intel HD graphics does not have warps. Warps are closely tied to CUDA hardware: basically, a warp is the same instruction, dispatched by a warp scheduler, executing on 32 CUDA cores. However, OpenCL is very similar to CUDA, so you can launch a high number of threads that will execute in parallel on your graphics card's compute units. But when programming your integrated card, it is best to forget about warps and to know how many compute units your card has. Your code will run on several threads in parallel on your compute units; in other words, it will look very similar to CUDA code, but it will be parallelized depending on the compute units available in the integrated card. Each compute unit can then parallelize execution in a SIMD fashion, for example. Just keep in mind that the optimization techniques for CUDA are different from the optimization techniques for programming Intel HD graphics.
You can't directly compare performance across vendors; a basic comparison and expectation can be made using the number of parallel threads running multiplied by their frequency.
You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EUs), each EU runs 7 hardware threads, and each thread is capable of executing SIMD8, SIMD16, or SIMD32 instructions, with each SIMD lane corresponding to one work item (WI) in OpenCL terms.
SIMD16 is typical for simple kernels, like the one you are trying to optimize, so we are talking about 20*7*16=2240 work items executing in parallel. Keep in mind that each work item is capable of processing vector data types, e.g. float4, so you should definitely try rewriting your kernel to take advantage of them. I hope this also helps you compare with NVidia's offerings.
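As a hedged sketch of what using vector types could look like for the row filter above, letting each work item produce four adjacent output pixels (this assumes the host shrinks the global size in x by 4 and that rows are padded so the vload4/vstore4 accesses stay in bounds; it is untested illustration, not the asker's code):

__kernel void filter_rows_vec4(__global const float *ip_img,
                               __global float *op_img,
                               int width, int height,
                               int pitch, int N,
                               __constant float *W)
{
    int i = get_global_id(0) * 4;            /* first of four output columns */
    int j = get_global_id(1);
    int image_offset = N * pitch + N;
    int curr_pix = j * pitch + i + image_offset;

    float4 a = (float4)(0.0f);
    for (int k = -8; k <= 8; k++) {
        /* vload4 reads four consecutive pixels starting at curr_pix + k */
        a += vload4(0, ip_img + curr_pix + k) * W[k + 8];
    }
    vstore4(a, 0, op_img + curr_pix);
}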

Why doesn't this code scale linearly?

I wrote this SOR solver code. Don't bother too much about what this algorithm does; it is not the concern here. But just for the sake of completeness: it may solve a linear system of equations, depending on how well conditioned the system is.
I run it with an ill-conditioned 2097152-row sparse matrix (which never converges), with at most 7 non-zero columns per row.
Translating: the outer do-while loop will perform 10000 iterations (the value I pass as max_iters), the middle for loop will perform 2097152 iterations, split in chunks of work_line divided among the OpenMP threads, and the innermost for loop will have 7 iterations, except in very few cases (less than 1%) where it can be fewer.
There is a data dependency among the threads in the values of the sol array. Each iteration of the middle for loop updates one element but reads up to 6 other elements of the array. Since SOR is not an exact algorithm, a read can see either the previous or the current value at that position (if you are familiar with solvers, this is a Gauss-Seidel that tolerates Jacobi behavior in some places for the sake of parallelism).
typedef struct {
    size_t size;
    unsigned int *col_buffer;
    unsigned int *row_jumper;
    real *elements;
} Mat;

int work_line;

// Assumes there are no null elements on main diagonal
unsigned int solve(const Mat* matrix, const real *rhs, real *sol, real sor_omega, unsigned int max_iters, real tolerance)
{
    real *coefs = matrix->elements;
    unsigned int *cols = matrix->col_buffer;
    unsigned int *rows = matrix->row_jumper;
    int size = matrix->size;
    real compl_omega = 1.0 - sor_omega;
    unsigned int count = 0;
    bool done;
    do {
        done = true;
        #pragma omp parallel shared(done)
        {
            bool tdone = true;
            #pragma omp for nowait schedule(dynamic, work_line)
            for(int i = 0; i < size; ++i) {
                real new_val = rhs[i];
                real diagonal;
                real residual;
                unsigned int end = rows[i+1];
                for(int j = rows[i]; j < end; ++j) {
                    unsigned int col = cols[j];
                    if(col != i) {
                        real tmp;
                        #pragma omp atomic read
                        tmp = sol[col];
                        new_val -= coefs[j] * tmp;
                    } else {
                        diagonal = coefs[j];
                    }
                }
                residual = fabs(new_val - diagonal * sol[i]);
                if(residual > tolerance) {
                    tdone = false;
                }
                new_val = sor_omega * new_val / diagonal + compl_omega * sol[i];
                #pragma omp atomic write
                sol[i] = new_val;
            }
            #pragma omp atomic update
            done &= tdone;
        }
    } while(++count < max_iters && !done);
    return count;
}
As you can see, there is no lock inside the parallel region, so, from what they always teach us, this is the kind of 100% parallel problem. That is not what I see in practice.
All my tests were run on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2 processors, 10 cores each, with hyper-threading enabled, summing up to 40 logical cores.
On my first set of runs, work_line was fixed at 2048, and the number of threads varied from 1 to 40 (40 runs in total). This is the graph with the execution time of each run (seconds x number of threads):
The surprise was the logarithmic curve, so I thought that since the work line was so large, the shared caches were not very well used, so I dug up the virtual file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size, which told me this processor's L1 cache synchronizes updates in groups of 64 bytes (8 doubles in the sol array). So I set work_line to 8:
Then I thought 8 was too low to avoid NUMA stalls and set work_line to 16:
While running the above, I thought "Who am I to predict what work_line is good? Let's just see...", and scheduled runs with every work_line from 8 to 2048, in steps of 8 (i.e. every multiple of the cache line, from 1 to 256 lines). The results for 20 and 40 threads (seconds x size of the split of the middle for loop, divided among the threads):
I believe the cases with low work_line suffer badly from cache synchronization, while bigger work_line offers no benefit beyond a certain number of threads (I assume because the memory pathway is the bottleneck). It is very sad that a problem that seems 100% parallel presents such bad behavior on a real machine. So, before I become convinced that multi-core systems are a very well-sold lie, I am asking you here first:
How can I make this code scale linearly to the number of cores? What am I missing? Is there something in the problem that makes it not as good as it seems at first?
Update
Following the suggestions, I tested with both static and dynamic scheduling, but removing the atomic read/write on the sol array. For reference, the blue and orange lines are the same as in the previous graph (just up to work_line = 248). The yellow and green lines are the new ones. For what I could see: static makes a significant difference for low work_line, but after 96 the benefits of dynamic outweigh its overhead, making it faster. The atomic operations make no difference at all.
The sparse matrix vector multiplication is memory bound (see here) and it could be shown with a simple roofline model. Memory bound problems benefit from higher memory bandwidth of multisocket NUMA systems but only if the data initialisation is done in such a way that the data is distributed among the two NUMA domains. I have some reasons to believe that you are loading the matrix in serial and therefore all its memory is allocated on a single NUMA node. In that case you won't benefit from the double memory bandwidth available on a dual-socket system and it really doesn't matter if you use schedule(dynamic) or schedule(static). What you could do is enable memory interleaving NUMA policy in order to have the memory allocation spread among both NUMA nodes. Thus each thread would end up with 50% local memory access and 50% remote memory access instead of having all threads on the second CPU being hit by 100% remote memory access. The easiest way to enable the policy is by using numactl:
$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...
OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
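For reference, if the scheduling were static, the alternative to interleaving would be "first touch" placement: initialize the data from a parallel loop with the same schedule as the compute loop so each page lands on the NUMA node of the thread that will use it. A hedged sketch (not the asker's code; with schedule(dynamic) the numactl interleave approach above is the more practical option):

/* First-touch initialization: each page is physically allocated on the NUMA
 * node of the thread that writes it first, so touching sol (and, ideally, the
 * matrix arrays) with the same static schedule as the solver loop spreads the
 * data across both sockets. */
#pragma omp parallel for schedule(static)
for (int i = 0; i < size; ++i) {
    sol[i] = 0.0;
}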
I would also like to point out that this:
done = true;
#pragma omp parallel shared(done)
{
bool tdone = true;
// ...
#pragma omp atomic update
done &= tdone;
}
is probably a not very efficient re-implementation of:
done = true;
#pragma omp parallel reduction(&:done)
{
// ...
if(residual > tolerance) {
done = false;
}
// ...
}
There won't be a notable performance difference between the two implementations because of the amount of work done in the inner loop, but still, it is not a good idea to reimplement existing OpenMP primitives for the sake of portability and readability.
Try running the IPCM (Intel Performance Counter Monitor). You can watch memory bandwidth and see if it maxes out with more cores. My gut feeling is that you are memory-bandwidth limited.
As a quick back-of-the-envelope calculation, I find that uncached read bandwidth is about 10 GB/s on a Xeon. If your clock is 2.5 GHz, that's one 32-bit word per clock cycle. Your inner loop is basically just a multiply-add operation whose cycles you can count on one hand, plus a few cycles for the loop overhead. It doesn't surprise me that after 10 threads you don't get any performance gain.
Your inner loop has an omp atomic read, and your middle loop has an omp atomic write to a location that could be the same one read by one of those reads. OpenMP is obligated to ensure that atomic writes and reads of the same location are serialized, so in fact it probably does need to introduce a lock, even though there isn't an explicit one.
It might even need to lock the whole sol array unless it can somehow figure out which reads might conflict with which writes, and really, OpenMP implementations aren't necessarily all that smart.
No code scales absolutely linearly, but rest assured that there are many codes that do scale much closer to linearly than yours does.
I suspect you are having caching issues. When one thread updates a value in the sol array, it invalidates the copies of that cache line in the other CPUs' caches. This forces those caches to be updated, which then leads to the CPUs stalling.
Even if you don't have an explicit mutex lock in your code, you have one shared resource between your processes: the memory and its bus. You don't see this in your code because it is the hardware that takes care of handling all the different requests from the CPUs, but nevertheless, it is a shared resource.
So, whenever one of your processes writes to memory, that memory location will have to be reloaded from main memory by all other processes that use it, and they all have to use the same memory bus to do so. The memory bus saturates, and you have no more performance gain from additional CPU cores that only serve to worsen the situation.
