I'm doing a proof-of-concept descrypt brute-forcer, and I have the single-threaded version working nicely at around 190k hashes/s on a single core of an i7-860 CPU.
I am now trying to make a multithreaded version of this program (my first time playing with threads, so I'm hoping that I'm doing something wrong here).
I first attempted to use crypt directly; this was fast but resulted in mangled hashes, as the threads were contending for the crypt function.
Wrapping the call in a mutex lock/unlock fixed the mangling, but it reduced the program's speed to just a few percent above the single-threaded version.
I then managed to google up crypt_r, which is advertised as thread-safe.
I modified both the single-threaded version (still with a single thread)
and the multithreaded version to use it instead of crypt. Performance dropped to around 3.6k h/s in the single-threaded version and around 7.7k h/s in the multithreaded version with two cores at 99.9% utilization.
So the question is, should it be this slow?
The problem was that the function I called to run crypt_r also contained the code that initializes the struct crypt_r requires, so the struct was re-initialized on every hash.
The solution was to move the initialization out of the function that is called for each hash.
Simplified example, incorrect way:

for (int i = 0; i < 200000; i++)
{
    struct crypt_data data;
    data.initialized = 0;   /* forces crypt_r to rebuild its tables on every call */
    char* hash = crypt_r("password", "sa", &data);
    printf("%i %s\n", i, hash);
}
Correct way:

struct crypt_data data;
data.initialized = 0;       /* initialized once, reused for every call */
for (int i = 0; i < 200000; i++)
{
    char* hash = crypt_r("password", "sa", &data);
    printf("%i %s\n", i, hash);
}
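The same principle carries over to the multithreaded version: give each thread its own struct crypt_data, initialized once before its hashing loop. A minimal sketch of what that could look like (assuming glibc's crypt_r; the hash_worker name, thread count, and fixed salt are just illustration; link with -lcrypt):

#define _GNU_SOURCE
#include <crypt.h>
#include <pthread.h>
#include <stdio.h>

static void *hash_worker(void *unused)
{
    (void)unused;
    struct crypt_data data;   /* per-thread state, never shared */
    data.initialized = 0;     /* crypt_r sets itself up on the first call only */
    for (int i = 0; i < 200000; i++)
    {
        char *hash = crypt_r("password", "sa", &data);
        if (i == 0)
            printf("%s\n", hash);   /* print one sample hash per thread */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, hash_worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}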
SITUATION
I want to see the advantage of using pthreads. If I'm not wrong, threads allow me to execute parts of a program in parallel.
So here is what I'm trying to accomplish: I want to make a program that takes a number (say n) and outputs the sum of [0..n].
Code:
#include <stdio.h>

#define MAX 1000000000

int
main() {
    long long n = 0;
    for (long long i = 1; i < MAX; ++i)
        n += i;
    printf("\nn: %lld\n", n);
    return 0;
}
time: 0m2.723s
To my understanding, I could simply take that number MAX, divide it by 2, and let two threads do the job.
Code:
#include <stdio.h>
#include <pthread.h>

#define MAX 1000000000
#define MAX_THREADS 2
#define STRIDE (MAX / MAX_THREADS)

typedef struct {
    long long off;
    long long res;
} arg_t;

void*
callback(void *args) {
    arg_t *arg = (arg_t*)args;
    for (long long i = arg->off; i < arg->off + STRIDE; ++i)
        arg->res += i;
    pthread_exit(0);
}

int
main() {
    pthread_t threads[MAX_THREADS];
    arg_t results[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; ++i) {
        results[i].off = i * STRIDE;
        results[i].res = 0;
        pthread_create(&threads[i], NULL, callback, (void*)&results[i]);
    }
    for (int i = 0; i < MAX_THREADS; ++i)
        pthread_join(threads[i], NULL);
    long long result = results[0].res;
    for (int i = 1; i < MAX_THREADS; ++i)
        result += results[i].res;
    printf("\nn: %lld\n", result);
    return 0;
}
time: 0m8.530s
PROBLEM
The version with pthreads runs slower. Logically this version should run faster, but maybe the creation of threads is more expensive than the work they save.
Can someone suggest a solution or show what I'm doing/understanding wrong here?
Your problem is cache thrashing combined with a lack of optimization (I bet you're compiling with optimization turned off).
The naive (-O0) code for

for (long long i = arg->off; i < arg->off + STRIDE; ++i)
    arg->res += i;

will access the memory of *arg on every iteration. With your results array defined the way it is, that memory sits right next to the memory of the other thread's arg, so the two threads fight over the same cache line (false sharing), which makes caching very ineffective.
If you compile with -O1, the loop should use a register instead and only write to memory at the end; then you should get better performance with threads. (Higher optimization levels on gcc seem to optimize the loop out completely.)
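You can also get the same effect by hand at any optimization level: accumulate into a local variable and write to *arg only once. A sketch of the same callback rewritten that way (same arg_t and STRIDE as above):

void*
callback(void *args) {
    arg_t *arg = (arg_t*)args;
    long long sum = 0;          /* local: stays in a register, not in the shared array */
    for (long long i = arg->off; i < arg->off + STRIDE; ++i)
        sum += i;
    arg->res = sum;             /* single write to shared memory at the end */
    pthread_exit(0);
}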
Another (better) option is to align arg_t on a cache line:

typedef struct {
    _Alignas(64) /* typical cache line size */ long long off;
    long long res;
} arg_t;
Then you should get better performance with threads regardless of whether or not you turn optimization on.
Good cache utilization is generally very important in multithreaded programming (and Ulrich Drepper has much to say on that topic in his well-known What Every Programmer Should Know About Memory).
Creating a whole bunch of threads is very unlikely to be quicker than simply adding numbers. The CPU can add an awfully large number of integers in the time it takes the kernel to set up and tear down a thread. To see the benefit of multithreading, you really need each thread to be doing a significant task -- significant compared to the overhead in creating the thread, anyway. Alternatively, you need to keep a pool of threads running, and assign them work according to some allocation strategy.
Multi-threading works best when an application consists of tasks that are somewhat independent, that would otherwise be waiting on one another to complete. It isn't a magic way to get more throughput.
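To make the thread-pool idea above concrete, here is a minimal sketch (all names hypothetical): a fixed set of workers is created once and pulls job IDs from a shared counter under a mutex, so the thread-creation overhead is paid once per worker, not once per job:

#include <pthread.h>
#include <stdio.h>

#define NJOBS 16
#define NWORKERS 4

static long long results[NJOBS];
static int next_job = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        int job = (next_job < NJOBS) ? next_job++ : -1;   /* claim the next job */
        pthread_mutex_unlock(&lock);
        if (job < 0)
            return NULL;            /* queue drained; worker exits */
        long long sum = 0;          /* a deliberately large chunk of work */
        for (long long i = 0; i < 10000000LL; ++i)
            sum += i;
        results[job] = sum;
    }
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    printf("job 0: %lld\n", results[0]);
    return 0;
}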
On my dual-core machine, Node.js runs faster than an equivalent program written in C.
Is node so well optimized that it actually is more efficient, or is there something wrong with my C program that makes it slower?
Node.js code:
var Parallel = require("paralleljs");

function slow(n){
    var i = 0;
    while(++i < n * n){}
    return i;
}

var p = new Parallel([20001, 32311, 42222]);
p.map(slow).then(function(data){
    console.log("Done!" + data.toString());
});
C code:

#include <stdio.h>
#include <pthread.h>

typedef struct thread_s {
    long int n;
    long int r;
} thread_s;

void *slow(void *p){
    thread_s *t = (thread_s*)p;
    long int i = 0;
    while(++i < t->n * t->n){}
    t->r = i;
    pthread_exit(0);
}

thread_s arr[] = {{20001, 0}, {32311, 0}, {42222, 0}};

int main(){
    pthread_t t[3];
    for(int c = 0; c < 3; c++){
        pthread_create(&t[c], NULL, slow, &arr[c]);
    }
    for(int c = 0; c < 3; c++){
        pthread_join(t[c], NULL);
    }
    printf("Done! %ld %ld %ld\n", arr[0].r, arr[1].r, arr[2].r);
    return 0;
}
You are benchmarking a toy program, which is not a good way to compare compilers. Also, the loops you are timing have no side effects; all they do is set i to n * n, so they should be optimized out entirely. Are you running unoptimized?
Try to compute something real that approximates the workload you will later run in production. If your code will be numerics-heavy, you could benchmark a naive matrix multiplication, for example.
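For instance, a naive matrix multiplication whose result is actually consumed cannot be optimized away, which makes it a more honest micro-benchmark. A sketch (size and values chosen arbitrarily):

#include <stdio.h>

#define N 512

int main(void)
{
    static double a[N][N], b[N][N], c[N][N];   /* static: zero-initialized, off the stack */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
        }
    /* naive O(N^3) multiply */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
    double checksum = 0;   /* consuming c forces the compiler to keep the work */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            checksum += c[i][j];
    printf("checksum: %f\n", checksum);
    return 0;
}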
All basic operations (+, -, Math.xx, etc.) are mapped to the V8 engine, which executes them essentially as a C program would. So you should see pretty much the same results for C vs. Node.js in this kind of scenario.
I also tried C#/.NET vs. Node on Fibonacci of 45. The first time I ran it, C# was five times slower, which seemed really strange; after a moment I realized it was because I had run the C# app in debug mode.
Switching to a release build made them very close (20 s for Node, 22 s for C#), and the remaining gap is probably just measurement noise.
In any case, the difference is a matter of percent.
The question currently lacks details on the benchmark, so it is impossible to say anything definitive about it. However, a general comparison between V8 running JavaScript and a binary program compiled from C source is possible.
V8 is pretty darn good at JIT compilation, so while there is the overhead of the JIT compilation itself, this compensates for the dynamic nature of JavaScript; for simple integer operations in a loop, there is no reason for JIT-compiled code to be slower.
Another consideration is startup time. If you load node.js first and load the JavaScript from an interactive prompt, the startup time of the script is minimal even with JIT, especially compared to a dynamically linked binary which needs to resolve symbols, etc. A small statically linked binary, on the other hand, will start very fast and will have done a lot of processing by the time a fresh node.js process has even started and begun looking for JavaScript to execute. You need to be careful how you handle this in benchmarks, or the results will be meaningless.
I'm just trying to get my head around multithreading environments, specifically how you would implement a cooperative one in C (on an AVR, but out of interest I would like to keep this general).
My problem comes with the thread switch itself: I'm pretty sure I could write this in assembler, flushing all the registers to a stack and then saving the PC so I can return to it later.
How would one pull something like this off in C? I have been told it can do "everything".
I realize this is quite a general question, so any links with information on this topic would be greatly appreciated.
Thanks
You can do this with setjmp/longjmp on most systems -- here is some code I've used in the past for task switching:
void task_switch(Task *to, int exit)
{
    int tmp;
    int task_errno;     /* save space for errno */

    task_errno = errno;
    if (!(tmp = setjmp(current_task->env))) {
        tmp = exit ? (int)current_task : 1;
        current_task = to;
        longjmp(to->env, tmp);
    }
    if (exit) {
        /* if we get here, the stack pointer is pointing into an
        ** already freed block! */
        abort();
    }
    if (tmp != 1)
        free((void *)tmp);
    errno = task_errno;
}
This depends on sizeof(int) == sizeof(void *) in order to pass a pointer as the argument to setjmp/longjmp, but that could be avoided by using handles (indexes into a global array of all task structures) instead of raw pointers here, or by using a static pointer.
Of course, the tricky part is setting up the jmp_buf objects for newly created tasks, each with its own stack. You can use a signal handler with sigaltstack for that:
static void (*tfn)(void *);
static void *tfn_arg;
static stack_t old_ss;
static int old_sm;
static struct sigaction old_sa;

Task *current_task = 0;
static Task *parent_task;
static int task_count;

static void newtask(int sig)
{
    int sm;
    void (*fn)(void *);
    void *fn_arg;

    (void)sig;
    task_count++;
    sigaltstack(&old_ss, 0);
    sigaction(SIGUSR1, &old_sa, 0);
    sm = old_sm;
    fn = tfn;
    fn_arg = tfn_arg;
    task_switch(parent_task, 0);
    sigsetmask(sm);
    (*fn)(fn_arg);
    abort();
}
Task *task_start(int ssize, void (*_tfn)(void *), void *_arg)
{
    Task *volatile new;
    stack_t t_ss;
    struct sigaction t_sa;

    old_sm = sigsetmask(~sigmask(SIGUSR1));
    if (!current_task) task_init();
    tfn = _tfn;
    tfn_arg = _arg;
    new = malloc(sizeof(Task) + ssize + ALIGN);
    new->next = 0;
    new->task_data = 0;
    t_ss.ss_sp = (void *)(new + 1);
    t_ss.ss_size = ssize;
    t_ss.ss_flags = 0;
    if ((unsigned long)t_ss.ss_sp & (ALIGN-1))
        t_ss.ss_sp = (void *)(((unsigned long)t_ss.ss_sp + ALIGN) & ~(ALIGN-1));
    t_sa.sa_handler = newtask;
    t_sa.sa_mask = ~sigmask(SIGUSR1);
    t_sa.sa_flags = SA_ONSTACK|SA_RESETHAND;
    sigaltstack(&t_ss, &old_ss);
    sigaction(SIGUSR1, &t_sa, &old_sa);
    parent_task = current_task;
    if (!setjmp(current_task->env)) {
        current_task = new;
        kill(getpid(), SIGUSR1);
    }
    sigaltstack(&old_ss, 0);
    sigaction(SIGUSR1, &old_sa, 0);
    sigsetmask(old_sm);
    return new;
}
If you wanted to keep it pure C, I think you might be able to use setjmp and longjmp, but I've never tried it myself, and I imagine there are probably some platforms on which this wouldn't work (e.g. certain registers or other settings not being saved). The only other alternative would be to write it in assembly.
As mentioned, setjmp/longjmp are standard C and are available even in the libc of 8-bit AVRs. They do exactly what you said you'd do in assembler: save the processor context. But keep in mind that the intended purpose of those functions is just to jump backwards in the flow of control; switching between tasks is an abuse of them. It does work anyway, and it looks like this is even frequently used in a variety of user-level thread libraries, like GNU Pth. But it is still an abuse of the intended purpose, and it requires care.
As Chris Dodd said, you still need to provide a stack for each new task. He used sigaltstack() and other signal-related functions, but those do not exist in standard C, only in Unix-like environments; the AVR libc, for example, does not provide them. As an alternative, you can try reserving part of your existing stack (by declaring a big local array, or by using alloca()) for use as the stack of the new thread. Just keep in mind that the main/scheduler thread keeps using its own stack, each thread uses its own, and all of them grow and shrink as stacks usually do, so they need room to do so without interfering with each other.
And since we're already mentioning Unix-like, non-standard-C mechanisms, there is also makecontext()/swapcontext() and family, which are more powerful but harder to find than setjmp()/longjmp(). The names say it all, really: the context functions let you manage full process contexts (stacks included), while the jmp functions only let you jump around; you have to hack the rest.
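On a POSIX system (so not on a bare AVR), a minimal sketch of the context approach could look like this, with two contexts ping-ponging via swapcontext():

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;

static void task_fn(void)
{
    puts("task: first slice");
    swapcontext(&task_ctx, &main_ctx);   /* yield back to main */
    puts("task: second slice");
}                                        /* returning here resumes uc_link */

int main(void)
{
    char *stack = malloc(64 * 1024);     /* each task needs its own stack */
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = 64 * 1024;
    task_ctx.uc_link = &main_ctx;        /* where to go when task_fn returns */
    makecontext(&task_ctx, task_fn, 0);
    swapcontext(&main_ctx, &task_ctx);   /* run the task until it yields */
    puts("main: between slices");
    swapcontext(&main_ctx, &task_ctx);   /* resume the task */
    puts("main: done");
    free(stack);
    return 0;
}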
For the AVR anyway, given that you probably won't have an OS to help nor much memory to blindly reserve, you'd probably be better off using assembler for the switching and stack initialization.
In my experience if people start writing schedulers it isn't too long before they start wanting things like network stacks, memory allocation and file systems too. It's almost never worth going down that route; you end up spending more time writing your own operating system than you're spending on your actual application.
At the first whiff of your project heading that way, it's almost always worth putting in the effort to use an existing OS (Linux, VxWorks, etc.). Of course, that might mean you run into problems if the CPU isn't up to it. An AVR isn't exactly a whole lot of CPU, and fitting an existing OS onto it ranges from tricky to mostly impossible for the major OSes, though there are some tiny ones (some open source; see http://en.wikipedia.org/wiki/List_of_real-time_operating_systems).
So at the commencement of a project you should carefully consider how you might wish to evolve it going into the future. This might influence your choice of CPU now to save having to do hideous things in software later.
I'm trying to parallelize a ray tracer in C, but the execution time is not dropping as the number of threads increases. The code I have so far is:
main2 (the thread function):
float **result = malloc(width * sizeof(float*));
int count = 0;
for (int px = 0; px < width; ++px)
{
    ...
    for (int py = 0; py < height; ++py)
    {
        ...
        float *scaled_color = malloc(3 * sizeof(float));
        scaled_color[0] = ...
        scaled_color[1] = ...
        scaled_color[2] = ...
        result[count] = scaled_color;
        count++;
        ...
    }
}
...
return (void *) result;
main:
pthread_t threads[nthreads];
for (i = 0; i < nthreads; i++)
{
    pthread_create(&threads[i], NULL, main2, &i);
}
float **result_handler;
for (i = 0; i < nthreads; i++)
{
    pthread_join(threads[i], (void **) &result_handler);
    int count = 0;
    for (j = 0; j < width; j++)
    {
        for (k = 0; k < height; k++)
        {
            float *scaled_color = result_handler[count];
            count++;
            printf...
        }
        printf("\n");
    }
}
main2 returns a float ** so that the picture can be printed in order in the main function. Does anyone know why the execution time is not dropping (e.g., it runs longer with 8 threads than with 4, when it should be the other way around)?
It's not enough to add threads; you need to actually split the task as well. It looks like you're doing the same job in every thread, so with n threads you just get n copies of the result.
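For example, a common split for a ray tracer is to give each thread a disjoint band of rows. A self-contained sketch (trace_pixel and all sizes are stand-ins for the real rendering code):

#include <pthread.h>
#include <stdio.h>

#define WIDTH 640
#define HEIGHT 480
#define NTHREADS 4

static float framebuffer[WIDTH * HEIGHT * 3];   /* shared, but writes are disjoint */

/* stand-in for the real per-pixel shading work */
static void trace_pixel(int px, int py)
{
    float *out = &framebuffer[3 * (py * WIDTH + px)];
    out[0] = out[1] = out[2] = (float)(px + py) / (WIDTH + HEIGHT);
}

typedef struct { int row_start, row_end; } band_t;

static void *render_band(void *p)
{
    band_t *band = (band_t *)p;
    for (int py = band->row_start; py < band->row_end; ++py)
        for (int px = 0; px < WIDTH; ++px)
            trace_pixel(px, py);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    band_t args[NTHREADS];   /* one argument struct per thread, not a shared &i */
    for (int i = 0; i < NTHREADS; i++) {
        args[i].row_start = i * HEIGHT / NTHREADS;
        args[i].row_end = (i + 1) * HEIGHT / NTHREADS;
        pthread_create(&threads[i], NULL, render_band, &args[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("center: %f\n", framebuffer[3 * ((HEIGHT / 2) * WIDTH + WIDTH / 2)]);
    return 0;
}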
Parallelizing programs and algorithms is usually nontrivial and doesn't come without some investment.
I don't think that working directly with threads is the right tool for you. Try looking into OpenMP; it is much more high-level.
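With OpenMP, the pixel loops from the sketch above could be parallelized with a single pragma instead of manual thread management (compile with -fopenmp on gcc; this fragment reuses the hypothetical trace_pixel from earlier):

/* inside main, replacing the manual pthread_create/join logic */
#pragma omp parallel for schedule(dynamic)
for (int py = 0; py < HEIGHT; ++py)
    for (int px = 0; px < WIDTH; ++px)
        trace_pixel(px, py);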
Two things are working against you here. (1) Unless your threads can run on more than one core, you can't expect a speedup in the first place; with a single core, that core has the same amount of work to do whether you parallelize the code or not. (2) Even with multiple cores, parallel performance is exquisitely sensitive to the ratio of computation done on-core to the amount of communication necessary between cores. With pthread_join() inside the loop, you're incurring a lot of that 'stop and wait for the other guy' kind of performance hit.
I'm taking a class on C, and running into a segmentation fault. From what I understand, seg faults occur when you access memory that hasn't been allocated, or otherwise out of bounds. Of course, all I'm trying to do is initialize an array (though a rather large one at that).
Am I simply misunderstanding how to traverse a 2D array? Misplacing a bound is exactly what would cause a seg fault; am I wrong to use a nested for-loop for this?
The professor provided the clock functions, so I'm hoping that's not the problem. I'm running this code in Cygwin; could that be the problem? Source code follows. Using the C99 standard as well.
To be perfectly clear: I am looking for help understanding (and eventually fixing) the reason my code produces a seg fault.
#include <stdio.h>
#include <time.h>

int main(void){
    //first define the array and two doubles to count elapsed seconds.
    double rowMajor, colMajor;
    rowMajor = colMajor = 0;
    int majorArray[1000][1000] = {0};

    clock_t start, end;
    start=clock();
    //set it up to perform the test 10 times.
    for(int k = 0; k<10; k++)
    {
        //first we do row major
        for(int i = 0; i < 1000; i++)
        {
            for(int j = 0; j<1000; j++)
            {
                majorArray[i][j] = 314;
            }
        }
        end=clock();
        rowMajor += (end-start)/(double)CLOCKS_PER_SEC;
        //at this point, we've only done rowMajor, so elapsed = rowMajor
        start=clock();
        //now we do column major
        for(int i = 0; i < 1000; i++)
        {
            for(int j = 0; j<1000; j++)
            {
                majorArray[j][i] = 314;
            }
        }
        end=clock();
        colMajor += (end-start)/(double)CLOCKS_PER_SEC;
    }
    //now that we've done the calculations 10 times, we can compare the values.
    printf("Row major took %f seconds\n", rowMajor);
    printf("Column major took %f seconds\n", colMajor);
    if(rowMajor>colMajor)
    {
        printf("Row major is faster\n");
    }
    else
    {
        printf("Column major is faster\n");
    }
    return 0;
}
Your program works correctly on my computer (x86-64/Linux) so I suspect you're running into a system-specific limit on the size of the call stack. I don't know how much stack you get on Cygwin, but your array is 4,000,000 bytes (with 32-bit int) - that could easily be too big.
Try moving the declaration of majorArray out of main (put it right after the #includes) -- then it will be a global variable, which comes from a different allocation pool that can be much bigger.
By the way, this comparison is backwards:
if(rowMajor>colMajor)
{
    printf("Row major is faster\n");
}
else
{
    printf("Column major is faster\n");
}
Also, to do a test like this you really ought to repeat the process for many different array sizes and shapes.
You are trying to grab 1000 * 1000 * sizeof(int) bytes on the stack. This is more than your OS allows for stack growth. If on any Unix, check ulimit -a for the max stack size of the process.
As a rule of thumb, allocate big structures on the heap with malloc(3), or use static arrays, outside the scope of any function.
In this case, you can replace the declaration of majorArray with (adding #include <stdlib.h> for calloc):

int (*majorArray)[1000] = calloc(1000, sizeof *majorArray);
I was unable to find any error in your code, so I compiled and ran it, and it worked as expected.
You have, however, a semantic error in your code:

start=clock();
//set it up to perform the test 10 times.
for(int k = 0; k<10; k++)
{

Should be:

//set it up to perform the test 10 times.
for(int k = 0; k<10; k++)
{
    start=clock();
Also, the condition at the end should be changed to its inverse:
if(rowMajor<colMajor)
Finally, to avoid the problem of the OS-specific stack size that others mentioned, you should define your matrix outside main():
#include <stdio.h>
#include <time.h>

int majorArray[1000][1000];

int main(void){
    //two doubles to count elapsed seconds.
    double rowMajor, colMajor;
    rowMajor = colMajor = 0;
This code runs fine for me under Linux and I can't see anything obviously wrong with it. You can try to debug it with gdb. Compile it like this:

gcc -g -o testcode test.c

then run

gdb ./testcode

and in gdb type run.
If it crashes, type where and gdb will tell you where the crash occurred; then you know which line the error is in.
The program works perfectly when compiled by gcc and run on Linux, so Cygwin may very well be your problem here.
If it runs correctly elsewhere, you're most likely trying to grab more stack space than the OS allows. You're allocating 4 MB on the stack (a million ints), which is far too much to allocate safely there. malloc() and free() are your best bets here.