I recently tried to parallelize my C code using OpenMP. It worked until the moment I called the MPSolve library to find the roots of a polynomial: when the code reaches that task it creates multiple threads outside my control.
I know that MPSolve uses a multithreaded algorithm based on the pthread library to increase performance, but I would like to know whether the number of threads can be controlled, because I intend to run this code on a cluster and I do not want to consume every available thread.
I expect to find out whether it is possible to limit the number of threads that MPSolve creates and, if so, how to do it.
Related
I have a hybrid MPI+OpenMP code written in C. Before I run the code, I set the number of threads I want to use in the Linux bash environment:
#!/bin/sh
export OMP_NUM_THREADS=1
However, if I want to use the maximum number of threads available on this computer, how can I set that in the Linux bash environment like the above?
Typically, OpenMP implementations are smart enough to provide a good default for the number of threads. For instance:
If you run an OpenMP program without any setting of OMP_NUM_THREADS and there's no MPI at all, OpenMP will usually determine how many cores there are and use all of them.
If there's MPI and the MPI processes are bound to a subset of the machine through process pinning, e.g., one rank per processor socket, OpenMP will inherit this subset and the implementation will automatically only use that subset.
In these and most other cases, you do not need to set OMP_NUM_THREADS at all.
If you can share more details about what you are trying to achieve, one can provide a more detailed answer about what you need to do.
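As a quick check, here is a minimal sketch (assuming a compiler with OpenMP support, e.g. gcc -fopenmp; the file name check_threads.c is just an example) that reports how many threads the runtime actually picks:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Without OMP_NUM_THREADS set, most implementations default to one
     * thread per core that the process is allowed to run on. */
    printf("max threads the runtime will use: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        /* Only one thread prints the size of the team. */
        #pragma omp single
        printf("threads in this parallel region: %d\n", omp_get_num_threads());
    }
    return 0;
}

Running it as OMP_NUM_THREADS=1 ./a.out versus plain ./a.out shows the difference between an explicit setting and the implementation default.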
If I search for counting the number of threads an application has, all the answers involve external programs like top. I want to count the threads within the application itself.
I can't add code at the point of thread creation because it happens inside an immutable library.
I can't read /proc.
It's a C/pthreads program running on a few different Unices.
If you can't read /proc you are a bit in trouble, unless your program communicates with another program that reads /proc for it.
If you don't want to read /proc because of portability concerns, you might use a library that abstracts it a bit, like libproc does.
You could write a tiny wrapper for pthread_create that counts created threads and link against that wrapper after you linked against the immutable library.
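A minimal sketch of such a wrapper, assuming a dynamically linked pthread library and a dlsym/RTLD_NEXT-capable platform (the accessor name my_thread_count is made up for this example); build it as a shared object and load it with LD_PRELOAD, or adapt it to ld's --wrap option:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdatomic.h>

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

static atomic_int thread_count = 1;   /* count the main thread */

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg)
{
    static create_fn real_create;
    if (!real_create)
        real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");

    int rc = real_create(thread, attr, start_routine, arg);
    if (rc == 0)
        atomic_fetch_add(&thread_count, 1);   /* one more thread created */
    return rc;
}

/* Hypothetical accessor your own code can call. */
int my_thread_count(void)
{
    return atomic_load(&thread_count);
}

Note that this counts threads ever created, not threads currently alive; tracking live threads would need a similar hook around thread exit.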
Use top -H. But chances are, if you can't read /proc, top won't work anyway. If that's the case, there is no easy way, and it would depend on your specific system.
I'm learning OpenCL in order to implement a relatively complex image processing algorithm which includes several subroutines that should be implemented as kernels.
The implementation is intended to run on a Mali T-6xx GPU.
I read the "OpenCL Programming by Example" book and the "Optimizing OpenCL kernels on the Mali-T600 GPUs" document.
In the book examples they use some global size of work items and each work item processes several pixels in for loops.
In the document the kernels are written without loops as in there is a single execution per work item in the kernel.
Since the maximum number of work items that the Mali T-600 GPUs can have in flight at once is 256 (and that's for simple kernels), and there are clearly more pixels than that to process in most images, my understanding is that the kernel without loops will keep spawning work-item threads as fast as possible until the whole global work size has executed the kernel, and the global size might simply be the number of pixels in the image. Is that right? Is it a kind of thread-spawning loop in itself?
On the other hand, in the book the global work size is smaller than the number of pixels to process, but the kernel has loops that make each work item process several pixels while executing the kernel code.
So I want to know which is the proper way to write image processing kernels, or any OpenCL kernels for that matter, and in what situations one way might be better than the other, assuming I understood both ways correctly...
Is that right? Is it a kind of thread-spawning loop in itself?
Yes.
So I want to know which is the proper way to write image processing kernels, or any OpenCL kernels for that matter, and in what situations one way might be better than the other
I suspect there isn't a "right" answer in general - there are multiple hardware vendors and multiple drivers - so I suspect the "best" approach will vary from vendor to vendor.
For Mali in particular the thread spawning is all handled by hardware, so it will in general be faster than explicit loops in the kernel code, which take instructions to process.
There is normally some advantage to at least some vectorization - e.g. processing vec4 or vec8 vectors of pixels per work item rather than just 1 - as the Mali-T600/700/800 GPU cores use a vector arithmetic architecture.
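As an illustration of that last point, here is a hypothetical OpenCL C kernel sketch in which each work item handles a vec4 of pixels instead of a single one (the kernel name and the flat uchar4 buffer layout are assumptions for the example, with the pixel count a multiple of 4):

__kernel void brighten_vec4(__global uchar4 *pixels, uchar offset)
{
    /* One work item per group of four pixels. */
    size_t i = get_global_id(0);

    /* Saturating add across all four pixel values at once. */
    pixels[i] = add_sat(pixels[i], (uchar4)(offset));
}

Here the global work size would be the pixel count divided by 4, the hardware still handles spawning the work items, and the loop-free structure is kept while each thread does vector-width work.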
On Linux, I want to build an autoclicker that can be enabled and disabled when a key is pressed. Obviously two things should be running in parallel (the clicker itself, and the enable/disable function).
What are the cons and pros of each implementation:
Using a thread which will handle the autoclicking function and another main thread (for the enable/disable etc...)
Or using the select syscall to wait for keyboard input?
Using select is better for performance, especially when you could have potentially hundreds of simultaneous operations. However, it can be difficult to write the code correctly, and the style of coding is very different from traditional single-threaded programming. For example, you need to avoid calling any blocking functions, as one could block your entire application.
Most people find using threads simpler because the majority of the code resembles ordinary single threaded code. The only difficult part is in the few places where you need interthread communication, via mutexes or other synchronization mechanisms.
In your specific case it seems that you will only need a small number of threads, so I'd go for the simpler programming model using threads.
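A minimal sketch of that two-thread model (do_click is a placeholder for the real click injection, and a newline on stdin stands in for the toggle key; compile with -pthread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool enabled;           /* shared on/off flag */

static void do_click(void)
{
    /* Placeholder: inject a mouse click here (e.g. via your input layer). */
}

static void *clicker(void *arg)
{
    (void)arg;
    for (;;) {
        if (atomic_load(&enabled))
            do_click();
        usleep(50000);                /* 50 ms between clicks */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, clicker, NULL);

    /* Main thread: each line of input toggles the autoclicker. */
    for (int c; (c = getchar()) != EOF; )
        if (c == '\n')
            atomic_store(&enabled, !atomic_load(&enabled));

    return 0;
}

An atomic flag is enough here because only one value is shared; if the two threads had to exchange more state, a mutex-protected structure would be the usual next step.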
Given the amount of work you're doing, it probably doesn't matter.
For high-performance applications there is a difference. Those need to handle several thousand connections simultaneously; with a thread-per-connection model, each new connection is handed off to a new thread. Creating several thousand threads is expensive, so select is used for efficiency. In practice, mechanisms such as kqueue or epoll are used for optimal event handling.
I say it doesn't matter, because you're likely only going to create the thread once and have exactly two threads running for the lifetime of the application.