On my dual-core machine, Node.js runs faster than an equivalent program written in C.
Is Node so well optimized that it really is more efficient, or is there something wrong with my C program that makes it slower?
Node.js code:
var Parallel = require("paralleljs");

function slow(n){
    var i = 0;
    while(++i < n * n){}
    return i;
}

var p = new Parallel([20001, 32311, 42222]);
p.map(slow).then(function(data){
    console.log("Done!" + data.toString());
});
C code:
#include <stdio.h>
#include <pthread.h>

struct thread_s {
    long int n;
    long int r;
};

void *slow(void *p){
    struct thread_s *t = (struct thread_s*)p;
    long int i = 0;
    while(++i < t->n * t->n){}
    t->r = i;
    pthread_exit(0);
}

struct thread_s arr[] = {{20001, 0}, {32311, 0}, {42222, 0}};

int main(){
    pthread_t t[3];
    for(int c = 0; c < 3; c++){
        pthread_create(&t[c], NULL, slow, &arr[c]);
    }
    for(int c = 0; c < 3; c++){
        pthread_join(t[c], NULL);
    }
    printf("Done! %ld %ld %ld\n", arr[0].r, arr[1].r, arr[2].r);
    return 0;
}
You are benchmarking a toy program, which is not a good way to compare compilers. Also, the loop you are timing has no side effects: all it does is set i to n * n, so a good optimizer should remove the loop entirely. Are you running unoptimized builds?
Try to compute something real that approximates the workload you will later run in production. If your code will be numerics-heavy, you could benchmark a naive matrix multiplication, for example, as sketched below.
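A minimal sketch of such a benchmark in C (the size N and the clock()-based timing are arbitrary choices here, not something from the question):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512  /* arbitrary size, large enough to dominate startup cost */

static double a[N][N], b[N][N], c[N][N];

int main(void) {
    /* fill inputs with values the compiler cannot predict */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = rand() / (double)RAND_MAX;
            b[i][j] = rand() / (double)RAND_MAX;
        }

    clock_t start = clock();
    /* naive triple loop: c = a * b */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    clock_t stop = clock();

    /* print a result so the work cannot be optimized away */
    printf("c[0][0] = %f, time = %.3f s\n",
           c[0][0], (double)(stop - start) / CLOCKS_PER_SEC);
    return 0;
}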
All basic operations (+, -, Math.xx, etc.) are handled by the V8 engine, which compiles them down to native code much as a C compiler would. So you should see pretty much the same results for C and Node.js in these kinds of scenarios.
I have also tried C#/.NET vs Node.js on Fibonacci of 45. The first time I ran it, the C# version was 5 times slower, which seemed really strange; after a moment I realized this was because I had run the C# app in debug mode.
Switching to release brought them very close (20 s for Node, 22 s for C#); the remaining gap is probably just measurement inconsistency.
In any case, this is just a matter of percent.
The question currently lacks details on the benchmark, so it is impossible to say anything definitive about it. However, a general comparison between V8 running JavaScript and a binary compiled from C source is possible.
V8 is pretty darn good at JIT compilation, so while there is JIT-compilation overhead, the JIT compensates for the dynamic nature of JavaScript; for simple integer operations in a loop there's no reason for JIT-compiled code to be slower.
Another consideration is startup time. If you load node.js first and feed it JavaScript from an interactive prompt, the script's startup time is minimal even with JIT, especially compared to a dynamically linked binary which needs to resolve symbols and so on. A small statically linked binary will start very fast, and will have done a lot of processing by the time a new node.js process has even started and begun looking for some JavaScript to execute. You need to be careful how you handle this in benchmarks, or the results will be meaningless; one option is sketched below.
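One way to keep startup out of the measurement is to take timestamps inside the program itself, around the work only. A minimal sketch in C (the busy-loop is a placeholder workload; on the Node.js side, process.hrtime() plays the same role):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* start after process startup */

    volatile long i = 0;                   /* placeholder workload; volatile
                                              keeps the loop from being
                                              optimized away */
    while (++i < 100000000L) {}

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("work took %.3f s (startup excluded)\n", secs);
    return 0;
}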
import numpy as np
array = np.random.rand(16384)
array *= 3
The Python code above multiplies each element of the array by 3.
On my laptop, this code took 5 ms.
Below is what I tried in C:
#include <stdlib.h>

double array[16384];   /* elided in the original: 16384 elements, filled elsewhere */

int main(void) {
    for (int i = 0; i < 16384; ++i)
        array[i] *= 3;
    return 0;
}
The compile command was
gcc -O2 main.cpp
and it takes almost 30 ms. Is there any way I can reduce the processing time?
P.S. It was my fault: I confused the units of the timestamp values.
The C code is actually faster than numpy. Sorry for this question.
This sounds pretty unbelievable. For reference, I wrote a trivial (but complete) program that does roughly what you seem to be describing. I used C++ so I could use its chrono library to get (much) more precise timing than C's clock provides, but I wouldn't expect that to affect the speed at all.
#include <iostream>
#include <chrono>

#define SIZE (16384)

float array[SIZE];

int main() {
    using namespace std::chrono;

    for (int i = 0; i < SIZE; i++) {
        array[i] = i;
    }

    auto start = high_resolution_clock::now();
    for (int i = 0; i < SIZE; i++) {
        array[i] *= 3.0;
    }
    auto stop = high_resolution_clock::now();

    std::cout << duration_cast<microseconds>(stop - start).count() << '\n';

    // Consume the results so the timed loop cannot be optimized away.
    long total = 0;
    for (int i = 0; i < SIZE; i++) {
        total += (long)array[i];
    }
    std::cout << "Ignore: " << total << "\n";
}
On my machine (2.8 GHz Haswell, so probably slower than whatever you're running) this shows a time of 7 or 8 microseconds, so around 600-700 times as fast as you're getting from Python.
Adding the compiler flag to use AVX 2 instructions reduces that to 4 microseconds, or a little more than 1000 times as fast (warning: AMD processors generally don't get as much of a speed boost from using AVX 2, but if you have a reasonably new AMD processor I'd expect it to be faster than this anyway).
Bottom line: the speed you're reporting for your C code only seems to make sense if you're running it on some sort of slow microcontroller, or maybe a really old desktop system, though it would have to be quite old to run nearly as slowly as you're reporting. My immediate guess is that even a 386 would be faster than that.
When/if you have something that takes enough time to justify it, you can also use OpenMP to run a loop like this in multiple threads (a sketch follows below). I tried that, but in this case the overhead of starting up and synchronizing the threads is (quite a bit) more than running in parallel can gain, so it's a net loss.
Compiler: VS 2019 (Microsoft (R) C/C++ Optimizing Compiler Version 19.27.28919.3 for x64).
Flags: /O2b2 /GL (and part of the time, /arch:AVX2)
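For reference, the OpenMP variant mentioned above needs only one pragma on the timed loop. A sketch (it assumes compiling with /openmp or -fopenmp, and reuses SIZE and array from the program above):

// OpenMP version of the timed loop; for 16384 elements the thread
// startup/synchronization overhead exceeds the parallel gain.
#pragma omp parallel for
for (int i = 0; i < SIZE; i++) {
    array[i] *= 3.0;
}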
I would like to implement OpenMP to parallelize my code. I am starting from a very basic example to understand how it works, but I am missing something...
So, my example looks like this, without parallelization:
int main() {
    ...
    for (i = 0; i < n-1; i++) {
        u[i+1] = (1+h)*u[i]; // Euler
        v[i+1] = v[i]/(1-h); // implicit Euler
    }
    ...
    return 0;
}
I omitted some parts in the "..." because they are not relevant. It works, and if I print the u[] and v[] arrays to a file, I get the expected results.
Now, if I try to parallelize it just by adding:
#include <omp.h>

int main() {
    ...
    omp_set_num_threads(2);
    #pragma omp parallel for
    for (i = 0; i < n-1; i++) {
        u[i+1] = (1+h)*u[i]; // Euler
        v[i+1] = v[i]/(1-h); // implicit Euler
    }
    ...
    return 0;
}
The code compiles and the program runs, BUT the u[] and v[] arrays are half full of zeros.
If I set omp_set_num_threads( 4 ), I get three quarters of zeros.
If I set omp_set_num_threads( 1 ), I get the expected result.
So it looks like only the first thread executes, and not the others...
What am I doing wrong?
OpenMP assumes that each iteration of a loop is independent of the others. When you write this:
for (i = 0; i < n-1; i++) {
    u[i+1] = (1+h)*u[i]; // Euler
    v[i+1] = v[i]/(1-h); // implicit Euler
}
Iteration i of the loop writes the values that iteration i+1 reads. Meanwhile, iteration i+1 might be running at the same time.
Unless you can make the iterations independent, this isn't a good use-case for parallelism.
And, if you think about what Euler's method does, it should be obvious that the code you're working on cannot be parallelized in this way. Euler's method calculates the state of a system at time t+1 based on information at time t. Since you cannot know the state at t+1 without first knowing the state at t, there's no way to parallelize across the iterations of Euler's method.
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
is equivalent to
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
therefore you can parallelize your code like this:
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    u[i] = pow((1+h), i)*u[0];
    v[i] = v[0]*pow(1.0/(1-h), i);
}
If you want to mitigate the cost of the pow function, you can call it once per thread rather than once per iteration, like this (since the number of threads nt << n):
#pragma omp parallel
{
    int nt = omp_get_num_threads();
    int t  = omp_get_thread_num();
    int s  = (t+0)*n/nt;   // first index of this thread's chunk
    int f  = (t+1)*n/nt;   // one past the last index of the chunk
    // seed the first element of the chunk directly with pow ...
    u[s] = pow((1+h), s)*u[0];
    v[s] = v[0]*pow(1.0/(1-h), s);
    // ... then fill the rest of the chunk with the cheap recurrence
    for(int i=s; i<f-1; i++) {
        u[i+1] = (1+h)*u[i];
        v[i+1] = v[i]/(1-h);
    }
}
You can also write your own pow(double, int) function optimized for integer powers.
Note that the relationship I used is not in fact 100% equivalent because floating point arithmetic is not associative. That's not usually a problem but it's something one should be aware of.
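As for writing your own pow(double, int): a minimal sketch using exponentiation by squaring (assuming a non-negative exponent):

// pow specialized for non-negative integer exponents:
// exponentiation by squaring, O(log exp) multiplications.
double powi(double base, int exp) {
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1)
            result *= base;   // fold in the current bit of the exponent
        base *= base;         // square for the next bit
        exp >>= 1;
    }
    return result;
}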
Before parallelizing your code you must identify its concurrency, i.e. the set of tasks that are logically happening at the same time and then figure out a way to make them actually happen in parallel.
As mentioned above, this is not a good example to apply parallelism to, because there is no concurrency in its nature. Attempting to use parallelism like that will lead to wrong results, due to so-called race conditions.
If you just want to learn how OpenMP works, try to come up with examples where you can clearly identify conceptually independent tasks. One of the simplest I can think of is computing the area under a curve by means of integration, as sketched below.
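A minimal sketch of that integration example (midpoint rule for f(x) = x*x on [0, 1]; the function and interval are arbitrary choices). Every iteration works on its own sub-interval, and the partial sums are combined with a reduction clause:

#include <stdio.h>

int main(void) {
    const int n = 100000000;
    const double h = 1.0 / n;
    double sum = 0.0;

    // Iterations are independent; OpenMP keeps per-thread partial
    // sums and merges them once at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = (i + 0.5) * h;   // midpoint of the i-th sub-interval
        sum += x * x;
    }

    printf("integral ~= %.9f (exact: 1/3)\n", sum * h);
    return 0;
}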
Welcome to the parallel ( or "just"-concurrent ) plurality of computing realities.
Why?
Any non-sequential schedule of processing the loop will have problems with hidden (not correctly handled) breaches of data-access and data-value integrity in time.
A pure-[SERIAL] flow of processing is free from such dangers: the strictly serialized steps, executed one after another, impose an order in which there is no chance to "touch" the same memory location twice or more times at the same time.
This "peace of mind" is inadvertently lost once a process goes into a "just"-[CONCURRENT] or true-[PARALLEL] processing mode.
Suddenly the order of operations becomes almost random (in the "just"-[CONCURRENT] case) or collapses into a single "immediate" moment with no meaningful order at all (in the true-[PARALLEL] case): like a robot with six degrees of freedom arriving at each trajectory point by driving all six axes in parallel, not one after another in a pure-[SERIAL] manner, and not some-now-some-later as in the "just"-[CONCURRENT] fashion, where the arm's 3D trajectory would become hardly predictable and mutual collisions frequent on a car assembly line.
Solution:
Use either a defensive tool, called atomic operations, or a principled approach: design a (b)locking-free algorithm where possible, or explicitly signal and coordinate reads and writes (sure, at a cost in extra time and degraded performance). This warrants that values will not get damaged into inconsistent digital trash: all "old" writes get safely through before any "next" read goes ahead to grab the "right" value, which the code above did not ensure. An atomic-update sketch follows below.
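For illustration only, this is what the atomic-operations tool looks like in OpenMP (a sketch of a shared-sum update; it does not rescue the Euler loop above, whose problem is ordering, not a single contended location):

double total = 0.0;
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    #pragma omp atomic      // protects just this one read-modify-write
    total += u[i];          // safe, but every update is serialized
}

In practice a reduction(+:total) clause would be cheaper here, as it keeps per-thread partial sums and merges them only once at the end.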
Epilogue:
Using a tool like OpenMP on a problem where it cannot bring any advantage will cost time and degrade performance: all the tool-related overheads still have to be paid, while there is literally zero net effect from parallelism in cases where the algorithm does not allow any parallelism to be enjoyed, so one finally pays way more than one gets.
A good place to learn about OpenMP best practices is the material from Lawrence Livermore National Laboratory (indeed very competent) and similar publications on using OpenMP.
I'm trying to figure out how to structure the main loop code for a numerical simulation in such a way that the compiler generates nicely vectorized instructions in a compact way.
The problem is most easily explained by a C pseudocode, but I also have a Fortran version which is affected by the same kind of issue. Consider the following loop where lots_of_code_* are some complicated expressions which produces a fair number of machine instructions.
void process(const double *in_arr, double *out_arr, int len)
{
    for (int i = 0; i < len; i++)
    {
        const double a = lots_of_code_a(i, in_arr);
        const double b = lots_of_code_b(i, in_arr);
        ...
        const double z = lots_of_code_z(i, in_arr);
        out_arr[i] = final_expr(a, b, ..., z);
    }
}
When compiled with an AVX target the Intel compiler generates code which goes like
process:
    AVX_loop
        AVX_code_a
        AVX_code_b
        ...
        AVX_code_z
        AVX_final_expr
    ...
    SSE_loop
        SSE_instructions
    ...
    scalar_loop
        scalar_instructions
    ...
The resulting binary is already quite sizable. My actual calculation loop, though, looks more like the following:
void process(const double *in_arr1, ..., const double *in_arr30,
             double *out_arr1, ..., double *out_arr30,
             int len)
{
    for (int i = 0; i < len; i++)
    {
        const double a1  = lots_of_code_a(i, in_arr1);
        ...
        const double a30 = lots_of_code_a(i, in_arr30);
        const double b1  = lots_of_code_b(i, in_arr1);
        ...
        const double b30 = lots_of_code_b(i, in_arr30);
        ...
        const double z1  = lots_of_code_z(i, in_arr1);
        ...
        const double z30 = lots_of_code_z(i, in_arr30);
        out_arr1[i]  = final_expr1(a1, ..., z1);
        ...
        out_arr30[i] = final_expr30(a30, ..., z30);
    }
}
This results in a very large binary indeed (400KB for the Fortran version, 800KB for C99). If I now define lots_of_code_* as functions, then each function gets turned into non-vectorized code. Whenever the compiler decides to inline a function it does vectorize it, but seems to also duplicate the code each time as well.
In my mind, the ideal code should look like:
AVX_lots_of_code_a:
    AVX_code_a
AVX_lots_of_code_b:
    AVX_code_b
...
AVX_lots_of_code_z:
    AVX_code_z
SSE_lots_of_code_a:
    SSE_code_a
...
scalar_lots_of_code_a:
    scalar_code_a
...
...
process:
    AVX_loop
        call AVX_lots_of_code_a
        call AVX_lots_of_code_a
        ...
    SSE_loop
        call SSE_lots_of_code_a
        call SSE_lots_of_code_a
        ...
    scalar_loop
        call scalar_lots_of_code_a
        call scalar_lots_of_code_a
        ...
This clearly results in much smaller code which is still just as well optimized as the fully-inlined version. With luck it might even fit in L1.
Obviously I could write this myself using intrinsics or whatever, but is it possible to get the compiler to automatically vectorize in the way described above through "normal" source code?
I understand that the compiler will probably never generate separate symbols for each vectorized version of the functions, but I thought it could still just inline each function once inside process and use internal jumps to repeat the same code block, rather than duplicating code for each input array.
A formal answer to questions like yours:
Consider using OpenMP 4.0 SIMD-enabled (I didn't say inlined) functions, or equivalent proprietary mechanisms. Available in the Intel Compiler or fresh GCC 4.9.
See more details here: https://software.intel.com/en-us/node/522650
Example:
// Invoke this function from a vectorized loop
#pragma omp declare simd
int vfun(int x, int y)
{
    return x*x + y*y;
}
This gives you the ability to vectorize a loop containing function calls without inlining them, and as a result without huge code generation. (I didn't really explore your code snippet in detail; instead I answered the question you asked in textual form.) A sketch of the calling side follows.
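A sketch of the calling side, with hypothetical arrays a, b, c and length n (the declare simd directive above lets the compiler emit a vector variant of vfun, which this loop can then call lane-by-lane):

// The loop vectorizes across the call to vfun without inlining it.
#pragma omp simd
for (int i = 0; i < n; i++)
    c[i] = vfun(a[i], b[i]);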
The immediate problem that comes to mind is the lack of restrict on the input/output pointers. The input is const, though, so it's probably not too much of a problem unless you have multiple output pointers.
Other than that, I recommend -fassociative-math or whatever the ICC equivalent is. Structurally, you seem to iterate over the array, doing multiple independent operations on it that are only munged together at the very end. Strict FP compliance might kill you on the array operations. Finally, there's probably no way this will get vectorized if you need more intermediate results than vector_registers - input_arrays.
Edit: I think I see your problem now. You call the same function on different data, and want each result stored independently, right? The problem is that the same function always writes to the same output register, so subsequent, vectorized calls would clobber earlier results. The solutions could be:
- A stack of results (either in memory or like the old x87 FPU stack) that gets pushed every time. If in memory, it is slow; if x87, it's not vectorized. Bad idea.
- Effectively multiple functions to write into different registers. Code duplication. Bad idea.
- Rotating registers, like on the Itanium. You don't have an Itanium? You're not alone.
It's possible that this can't be easily vectorized on current architectures. Sorry.
Edit, you're apparently fine with going to memory:
void function1(double const *restrict inarr1, double const *restrict inarr2,
               double *restrict outarr, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        double intermediateres[NUMFUNCS];
        double *rescursor = intermediateres;
        *rescursor++ = mungefunc1(inarr1[i]);
        *rescursor++ = mungefunc1(inarr2[i]);
        *rescursor++ = mungefunc2(inarr1[i]);
        *rescursor++ = mungefunc2(inarr2[i]);
        ...
        outarr[i] = finalmunge(intermediateres[0], ..., intermediateres[NUMFUNCS-1]);
    }
}
This might be vectorizable. I don't think it'll be all that fast, going at memory speed, but you never know till you benchmark.
If you move the lots_of_code blocks into separate compilation units without the for loop, they will probably not vectorize. Unless the compiler has a motive for vectorization, it will not vectorize the code, because vectorization might lead to longer latencies in the pipelines. To get around that, split the loop into 30 loops and put each one of them in a separate compilation unit, like this:
for (int i = 0; i < len; i++)
{
    lots_of_code_a(i, in_arr1);
}
I'm doing a proof-of-concept descrypt brute-forcer, and have the single-threaded version working nicely at around 190k hashes/s on a single core of an i7-860 CPU.
I am now trying to make a multithreaded version of this program (my first time playing with threads, so I'm hoping that I'm doing something wrong here).
I first attempted to use crypt directly; this was fast but resulted in mangled hashes, as the threads were contending for the crypt function's shared state.
Wrapping the call in a mutex lock/unlock helped, but this reduced the speed of the program to just a few percent above the single-threaded version.
I then managed to google up crypt_r, which is advertised as thread-safe.
I modified both the single-threaded version (still with a single thread) and the multithreaded version to use crypt_r instead of crypt, and performance dropped to around 3.6k h/s in the single-threaded version and around 7.7k h/s in the multithreaded version, with two cores at 99.9% utilization.
So the question is, should it be this slow?
The problem was that the function I was calling for each hash also contained the code initializing the crypt_data struct that crypt_r requires.
The solution was to move the initialization out of the function that is called for every hash.
Simplified example, the incorrect way:
for (int i = 0; i < 200000; i++)
{
    struct crypt_data data;
    data.initialized = 0;   /* struct set up again on every iteration: slow */
    char* hash = crypt_r("password", "sa", &data);
    printf("%i %s", i, hash);
}
The correct way:
struct crypt_data data;
data.initialized = 0;       /* struct set up once, reused for every hash */
for (int i = 0; i < 200000; i++)
{
    char* hash = crypt_r("password", "sa", &data);
    printf("%i %s", i, hash);
}
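Applied to the multithreaded brute-forcer, the same rule means one crypt_data per thread, initialized once before the loop. A minimal sketch (the worker function is hypothetical; only crypt_r and struct crypt_data come from the question):

#define _GNU_SOURCE
#include <crypt.h>
#include <stddef.h>

/* Each worker thread owns its own crypt_data, so no mutex is needed
   and the expensive initialization happens once per thread. */
void *worker(void *arg)
{
    (void)arg;
    struct crypt_data data;
    data.initialized = 0;                 /* once per thread, not per hash */
    for (int i = 0; i < 200000; i++) {
        char *hash = crypt_r("password", "sa", &data);
        (void)hash;                       /* compare against targets here */
    }
    return NULL;
}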
I am a little confused about creating a many_plan by calling fftwf_plan_many_dft_r2c() and executing it with OpenMP. What I am trying to achieve here is to see whether explicitly using OpenMP and organizing the FFTW data myself can work together. (I know I "should" use the multithreaded version of FFTW, but I failed to get the expected speedup from it.)
My code looks like this:
/* I ignore some helper APIs */
#define N 1024*1024   // N is the total size of the 1d fft

fftwf_plan p;
float *in;
fftwf_complex *out;

omp_set_num_threads(threadNum);   // suppose threadNum is 2 here
in = fftwf_alloc_real(2*(N/2+1));
std::fill(in, in+2*(N/2+1), 1.1f);   // just try with arbitrary real numbers
out = (fftwf_complex *)&in[0];       // for in-place transformation

/* Problems start from here */
int n[] = {N/threadNum};   // according to the manual, n is the size of each "howmany" transformation
p = fftwf_plan_many_dft_r2c(1, n, threadNum, in, NULL, 1, 1, out, NULL, 1, 1, FFTW_ESTIMATE);

#pragma omp parallel for
for (int i = 0; i < threadNum; i++)
{
    fftwf_execute(p);
    // fftwf_execute_dft_r2c(p, in+i*N/threadNum, out+i*N/threadNum);
}
What I got is like this:
If I use fftwf_execute(p), the program executes successfully, but the result does not seem correct. (I compared the result against a version not using many_plan and OpenMP.)
If I use fftwf_execute_dft_r2c(), I got segmentation fault.
Can somebody help me here? How should I partition the data across multiple threads? Or is the approach not correct in the first place?
Thank you in advance.
Do you properly allocate memory for out? Does this:
out = (fftwf_complex *)&in[0]; // for in-place transformation
do the same as this:
out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfOutputColumns);
You are trying to access p inside your parallel block without specifically telling OpenMP how to use it. It should be:
#pragma omp parallel for shared(p)
If you are going to split the work up for n threads, I would think you'd explicitly want to tell OpenMP to use n threads:
#pragma omp parallel for shared(p) num_threads(n)
Does this code work without multithreading? If you remove the for loop and the OpenMP call and execute fftwf_execute(p) just once, does it work?
I don't know much about FFTW's many-plans, but it seems like p is really many plans, not one single plan. So when you "execute" p, you are executing all the plans at once, right? You don't really need to execute p iteratively.
I'm still learning about OpenMP + FFTW, so I could be wrong on these.
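For what it's worth, here is a sketch of one way the partitioning could be made consistent (an untested assumption on my part, not something from the question): plan a single r2c transform of length chunk = N/threadNum, allocate the output separately instead of aliasing the input, and mind that the input offset counts floats while the output offset counts fftwf_complex elements.

// Sketch: one plan for a length-chunk transform, applied per chunk
// via the new-array execute function. Mixing float and complex
// strides in the offsets is a likely source of the segfault above.
int chunk = N / threadNum;
float *in2 = fftwf_alloc_real(N);
fftwf_complex *out2 = fftwf_alloc_complex(threadNum * (chunk/2 + 1));
fftwf_plan p2 = fftwf_plan_dft_r2c_1d(chunk, in2, out2, FFTW_ESTIMATE);

#pragma omp parallel for
for (int i = 0; i < threadNum; i++) {
    // New-array execute requires the same alignment as the arrays the
    // plan was created with; fftwf_alloc_* guarantees that here.
    fftwf_execute_dft_r2c(p2, in2 + i*chunk, out2 + i*(chunk/2 + 1));
}

Bear in mind that threadNum transforms of length N/threadNum are not mathematically the same as one transform of length N, which would also explain why the many_plan results differ from the single-plan version.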