We have a C application running on an i.MX53 (Linux-based OS) that has to compute an FFT; for this purpose we have adopted the FFTW library. For performance reasons we decided to separate the FFT from the main application, using a separate thread. The overall application seems to work, but after a while we get a segmentation fault at fftwf_execute. I am sure of this because without this single instruction we get no segmentation faults. We have made several attempts but the problem persists. Here is part of the thread function:
void* vGestDiag_ThreadFFT( void* unused )
{
    Int32U idx = 0, idxI = 0, idxJ = 0, idxZ = 0, idxK = 0;
    Flo32 lfBufferAccm_chn01[LEN_BUFFER_SAMPLES];
    Flo32 lfBufferAccm_chn02[LEN_BUFFER_SAMPLES];
    Flo64 dblBufferFFT[LEN_BUFFER_SAMPLES];
    Int32U ulCntUtilSample = 0;
    float *in;
    fftwf_complex *out;
    fftwf_plan plan;
    /* other variables.... */

    /* Init */
    memset(lfBufferAccm_chn01, 0x00, LEN_BUFFER_SAMPLES*sizeof(Flo32));
    memset(lfBufferAccm_chn02, 0x00, LEN_BUFFER_SAMPLES*sizeof(Flo32));
    memset(dblBufferFFT, 0x00, LEN_BUFFER_SAMPLES*sizeof(Flo64));
    /* other local memsets .... */

    /* Inputs */
    pthread_mutex_lock(&lockIN);
    ulCntUtilSample = wulCntUtilSample;
    /* other inputs.... */
    for (idxJ = 0; idxJ < ulCntUtilSample; idxJ++)
    {
        boBuffCirc_ReadBuffer(&wulBufferAcc01, &ulTmpValue);
        lfBufferAccm_chn01[idxJ] = (Flo32)((((Flo32)ulTmpValue - ACC_Q)/ACC_M) * ACC_U) * wlfSensAcc;
        boBuffCirc_ReadBuffer(&wulBufferAcc02, &ulTmpValue);
        lfBufferAccm_chn02[idxJ] = (Flo32)((((Flo32)ulTmpValue - ACC_Q)/ACC_M) * ACC_U) * wlfSensAcc;
    }
    pthread_mutex_unlock(&lockIN);

    /* --------- Plan FFT ------------------------- */
    in = (float*) fftwf_malloc(sizeof(float) * ulCntUtilSample);
    out = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * ulCntUtilSample);
    fftwf_plan_with_nthreads(1);
    plan = fftwf_plan_dft_r2c_1d(ulCntUtilSample, in, out, FFTW_ESTIMATE);

    for (idxI = 0; idxI <= 1; idxI++)
    {
        switch(idxI)
        {
            case 0:
                memcpy(in, lfBufferAccm_chn01, ulCntUtilSample*sizeof(float));
                break;
            case 1:
                memcpy(in, lfBufferAccm_chn02, ulCntUtilSample*sizeof(float));
                break;
            default:
                break;
        }

        /* --------- FFT ------------------------- */
        /* EXEC */
        fftwf_execute(plan);

        /* Complex -> Real */
        for (idxZ = 0; idxZ < ulCntUtilSample; idxZ++)
        {
            dblBufferFFT[idxZ] = cabs(out[idxZ]);
        }
        /* --------- End FFT ------------------------- */

        /* Post-Processing FFT */
        /* post-processing and outputs in module static variables, within mutex */
    }

    /* DEL plan */
    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);

    /* exit */
    pthread_exit(NULL);
}
Variables starting with 'w' are module static variables; LEN_BUFFER_SAMPLES is oversized with respect to the number of samples.
Thanks everyone for helping!!
This is my first project involving threads, and my coworkers don't have much experience with them either, so we hadn't considered several issues. The first one was the difference between DETACHED and JOINABLE: our application does not have to wait for the thread to complete; indeed this is what led us to use a thread in the first place (the main was waiting a long time for the FFT to complete, which was not necessary). The previous version of my SW used the default, joinable, so the resources allocated to the thread were never freed. The second point we had not considered is that creating a thread requires additional resources. In the previous version of my SW a thread was created each time the FFT had to be computed; in our application this happens about every 10 seconds or more, so it seemed not critical; moreover, the sooner the FFT begins, the less data is used for the FFT (it is a long story, but it can be summarized like that); finally, the use of mutexes and the logic of the algorithm apparently protected the shared resources (it was quite improbable that a shared resource was used by the thread and the main at the same time). Unfortunately, creating a new thread on every cycle eventually saturated the memory (we are working not on a PC but on a simple microprocessor....), and this was the origin of the segmentation faults, or so I believe.
I have solved the problem in this way: the FFT thread is created at the start of the application, so only a single thread is created, alongside the main. Inside the thread an infinite loop periodically (at the moment 4 times faster than the main loop) checks a shared variable that indicates a request for an FFT: if it is set, the thread reads and copies locally the shared memory that contains the FFT inputs and other required parameters. At the end of the processing, the outputs are saved in shared memory and another flag is set to indicate to the main loop that an FFT result is available. Each shared datum is accessed (read and write) under a mutex, both in the thread and in the main, and only data actually used by the thread are protected by mutexes. The thread is created detached, without any kills or exits, because it has to live "forever" together with the main. Yesterday this SW ran for several hours without problems (I stopped it manually), including under stress conditions.
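For anyone with the same problem, here is a minimal sketch of the pattern described above (all names, the flag protocol and the poll period are illustrative, not the actual application code):

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

#define POLL_PERIOD_US 250000u          /* example: poll 4 times per second */

static pthread_mutex_t lockShared = PTHREAD_MUTEX_INITIALIZER;
static bool wboFftRequested = false;    /* set by the main loop */
static bool wboFftReady     = false;    /* set by the FFT thread */
/* shared input/output buffers live here too, guarded by lockShared */

static void *vFftWorker(void *unused)
{
    (void)unused;
    for (;;)                            /* lives "forever", like the main */
    {
        bool boDoFft = false;

        pthread_mutex_lock(&lockShared);
        if (wboFftRequested)
        {
            wboFftRequested = false;
            /* copy shared inputs into thread-local buffers here */
            boDoFft = true;
        }
        pthread_mutex_unlock(&lockShared);

        if (boDoFft)
        {
            /* run the FFT on the local copies, then publish the results */
            pthread_mutex_lock(&lockShared);
            /* copy results into the shared output buffers here */
            wboFftReady = true;
            pthread_mutex_unlock(&lockShared);
        }

        usleep(POLL_PERIOD_US);
    }
    return NULL;                        /* never reached */
}

/* created once at startup, detached so its resources are reclaimed automatically:
 *   pthread_attr_t attr;
 *   pthread_attr_init(&attr);
 *   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
 *   pthread_create(&tid, &attr, vFftWorker, NULL);
 */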
I have a GPU code that, at each iteration, decides whether the iteration can be offloaded to the accelerator. OpenACC turned out to be the best tool:
void module(struct my_aos *aos, int n_aos){
    int criteria = /* check both that n_aos is large enough and that aos[:n_aos] will fit the GPU */
    #pragma acc data copy(aos[0:n_aos]) if(criteria)
    #pragma acc parallel loop if(criteria)
    for(int i = 0; i < n_aos; i++){
        /* work on my_aos */
    }
}
How can I decide in advance whether aos[0:n_aos] will fit into GPU memory? Is there an openacc_get_free_device_memory() sort of function?
Otherwise, how can I start a device copy and fall back to a CPU-only run in case of an out-of-memory failure?
See section "3.2.6 acc get property" section of the OpenACC standard. In particular the "acc_property_free_memory" property.
https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.1-final.pdf
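A sketch of how that query could drive the criteria check (assumes an OpenACC 3.0+ runtime; MIN_OFFLOAD_SIZE and the factor-of-two headroom are illustrative choices, not part of the spec):

#include <openacc.h>

void module(struct my_aos *aos, int n_aos)
{
    /* query free memory on the current device (OpenACC 3.0, sec. 3.2.6) */
    acc_device_t devtype = acc_get_device_type();
    int devnum = acc_get_device_num(devtype);
    size_t free_bytes = acc_get_property(devnum, devtype, acc_property_free_memory);

    /* offload only if the work is big enough and the data fits with headroom */
    int criteria = n_aos > MIN_OFFLOAD_SIZE
                && (size_t)n_aos * sizeof(struct my_aos) < free_bytes / 2;

    #pragma acc data copy(aos[0:n_aos]) if(criteria)
    #pragma acc parallel loop if(criteria)
    for (int i = 0; i < n_aos; i++) {
        /* work on my_aos */
    }
}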
I have a strange situation in C/Visual Studio on a Windows 7 platform. A problem occurs from time to time, and I spent a lot of time finding it. The problem is within a third-party library, for which I have the complete code. There, a thread is created (the printLog statements are mine):
...
plafParams->eventThreadFlag = 2;

printLog("before CreateThread");

if (plafParams->hReadThread_p = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE) plafPortReadThread,
                                             (void *) dlmsInstance, 0, &plafParams->portReadThreadID))
{
    printLog("after CreateThread: OK");
    plafParams->eventThreadFlag = 3;
}
else
{
    unsigned int lasterr = GetLastError();
    printLog("error CreateThread, last error:%x", lasterr);
    /* Could not create the read thread. */
    ...
    ...
    return FAILURE;
}

printLog("SUCCESS");
...
...
The thread function is:
void *plafPortReadThread(DLMS_GLOBALS *dlmsInstance)
{
    PLAF_PARAMS *plafParams;
    plafParams = (PLAF_PARAMS *)(dlmsInstance->plafParams);

    printLog("start - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);

    while ((plafParams->eventThreadFlag != 1) && (plafParams->eventThreadFlag != 3))
    {
        if (plafParams->eventThreadFlag == 0)
        {
            printLog("exit 1 - plafPortReadThread, plafParams->eventThreadFlag=%x", plafParams->eventThreadFlag);
            CloseHandle(plafParams->hReadThread_p);
            plafFree((void **)&plafParams);
            ExitThread(0);
            break;
        }
    }

    printLog("start - plafPortReadThread, proceed=%d", proceed);
    ...
Now, when the flag is set before the while loop is started within the thread, everything works OK:
SUCCESS
start - plafPortReadThread, plafParams->eventThreadFlag=3
But sometimes the thread is quick enough that the while loop starts before the flag is actually set by the creating code.
The output is then:
start - plafPortReadThread, plafParams->eventThreadFlag=2
SUCCESS
Most surprisingly, the while loop doesn't exit, even after the flag has been set to 3.
It seems that the compiler "optimizes" the flag and assumes that it cannot be changed from outside.
What could be the problem? I'm really surprised. Or is there something else I have overlooked completely? I know that the code is not very elegant and that such things would better be done with semaphores or signals. But it is not my code and I want to change as little as possible.
After removing the whole while condition it works as expected.
Should I change the struct or its fields to volatile? Everybody says that volatile is useless these days and not needed anymore, except where a memory location is changed by peripherals...
Prior to C11 this is totally platform-dependent, because the effect you are observing is due to the memory model used by your platform. This is different from a compiler optimization, as synchronization points between threads require the compiler to insert barrier instructions, instead of, e.g., turning a read into a constant. For C11, section 7.17.3 specifies the different memory models. So your value is not optimized out statically; thread A just never reads the value thread B wrote, but still has its own local value.

In practice many projects don't use C11 yet, and thus you will likely have to check the documentation of your platform. Note that in many cases you don't have to modify the type of the flag variable (in case you can't). Most memory models specify synchronization points that also forbid reordering of certain instructions, i.e. in:
int x = 3;
_Atomic int a = 1;
x = 5;
a = 2;
the compiler will often have to ensure that x has the value 3 when a has the value 1, and that when a is assigned the value 2, x will have the value 5. volatile does not participate in this relationship (in the C/C++11 models; this is often confused because volatile does participate in Java's happens-before), and is mostly useless, unless your writes should never be optimized out because they have side effects the compiler can't understand, such as an LED blinking:
volatile int x = 1; // some special location - blink then clear
x = 1; // blink then clear
x = 1; // blink then clear
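Applied to the flag above, a C11 version would look roughly like this (a sketch only; with a pre-C11 compiler such as older Visual Studio versions you would need the platform's Interlocked* functions or memory barriers instead):

#include <stdatomic.h>

typedef struct {
    atomic_int eventThreadFlag;   /* shared between the creator and the thread */
    /* ... other fields as in PLAF_PARAMS ... */
} PLAF_PARAMS;

/* creator: */
atomic_store(&plafParams->eventThreadFlag, 3);

/* thread: every load now synchronizes with the store above,
   so the loop is guaranteed to see the update */
while (atomic_load(&plafParams->eventThreadFlag) != 1 &&
       atomic_load(&plafParams->eventThreadFlag) != 3)
{
    /* spin (better: wait on an event or condition variable) */
}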
I recently started working with OpenMP to do some 'research' for a project at university. I have a rectangular and evenly spaced grid on which I'm solving a partial differential equation with an iterative scheme. So I basically have two for-loops (one each in the x- and y-direction of the grid) wrapped by a while-loop for the iterations.
Now I want to investigate different parallelization schemes for this. The first (obvious) approach was a spatial parallelization of the for-loops.
That works fine.
The approach I have problems with is a trickier idea. Each thread calculates all grid points. The first thread starts solving the equation at the first grid row (y=0). When it's finished, the thread goes on with the next row (y=1) and so on. At the same time, thread #2 can already start at y=0, because all the necessary information is already available. I just need to do a kind of manual synchronization between the threads so they can't overtake each other.
Therefore I used an array called check. It contains the thread ID that is currently allowed to work on each grid row. When the upcoming row is not 'ready' (the value in check[j] is not correct), the thread spins in an empty while-loop until it is.
Things will get clearer with an MWE:
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main()
{
    // initialize variables
    int iter = 0;            // iteration step counter
    int check[100] = { 0 };  // initialize all rows for thread #0

    #pragma omp parallel num_threads(2)
    {
        int ID, num_threads, nextID;
        double u[100 * 300] = { 0 };

        // get parallelization info
        ID = omp_get_thread_num();
        num_threads = omp_get_num_threads();

        // determine next valid id
        if (ID == num_threads - 1) nextID = 0;
        else nextID = ID + 1;

        // iteration loop until abort criteria (HERE: SIMPLIFIED) are valid
        while (iter < 1000)
        {
            // rows (j=0 and j=99 are boundary conditions and don't have to be calculated)
            for (int j = 1; j < (100 - 1); j++)
            {
                // manual synchronization: wait until previous thread completed enough rows
                while (check[j + 1] != ID)
                {
                    //printf("Thread #%d is waiting!\n", ID);
                }

                // gridpoints in row j
                for (int i = 1; i < (300 - 1); i++)
                {
                    // solve PDE on gridpoint
                    // replaced by random operation to consume time
                    double ignore = pow(8.39804, 10.02938) - pow(12.72036, 5.00983);
                }

                // update of check array in atomic to avoid race condition
                #pragma omp atomic write
                check[j] = nextID;
            } // for j

            #pragma omp atomic write
            check[100 - 1] = nextID;

            #pragma omp atomic
            iter++;

            #pragma omp single
            {
                printf("Iteration step: %d\n\n", iter);
            }
        } // while
    } // omp parallel
} // main
The thing is, this MWE actually works on my machine. But when I copy it into my project, it doesn't. Additionally, the outcome is always different: it stops either after the first iteration or after the third.
Another weird thing: when I remove the slashes of the comment in the inner while-loop, it works! The output contains some
"Thread #1 is waiting!"
but that's reasonable. To me it looks like I somehow created a race condition, but I don't know where.
Does somebody have an idea what the problem could be? Or a hint on how to realize this kind of synchronization?
I think you are mixing up atomicity and memory consistency. The OpenMP standard actually describes it very nicely in
1.4 Memory Model (emphasis mine):

The OpenMP API provides a relaxed-consistency, shared-memory model. All OpenMP threads have access to a place to store and to retrieve variables, called the memory. In addition, each thread is allowed to have its own temporary view of the memory. The temporary view of memory for each thread is not a required part of the OpenMP memory model, but can represent any kind of intervening structure, such as machine registers, cache, or other local storage, between the thread and the memory. The temporary view of memory allows the thread to cache variables and thereby to avoid going to memory for every reference to a variable.

1.4.3 The Flush Operation

The memory model has relaxed-consistency because a thread’s temporary view of memory is not required to be consistent with memory at all times. A value written to a variable can remain in the thread’s temporary view until it is forced to memory at a later time. Likewise, a read from a variable may retrieve the value from the thread’s temporary view, unless it is forced to read from memory. The OpenMP flush operation enforces consistency between the temporary view and memory.
To avoid that, you should also make the read of check[] atomic and specify the seq_cst clause on your atomic constructs. This clause forces an implicit flush of the operation. (It is called a sequentially consistent atomic construct.)
int c;
// manual synchronization: wait until previous thread completed enough rows
do
{
    #pragma omp atomic read seq_cst
    c = check[j + 1];
} while (c != ID);
Disclaimer: I can't really try the code right now.
Further notes:
I think the iter stop criterion is bogus the way you use it, but I guess that's irrelevant given that it is not your actual criterion.
I assume this variant will perform worse than the spatial decomposition. You lose a lot of data locality, especially on NUMA systems. But of course it is fine to try and measure.
There seems to be a discrepancy between your code (using check[j + 1]) and your description "At the same time thread #2 can already start at y=0".
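For completeness, the matching writes in your code would then become (same disclaimer: untested):

#pragma omp atomic write seq_cst
check[j] = nextID;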
I'm writing a small program that uses a certain percentage of CPU. The basic strategy is to continuously check the CPU usage and make the process sleep if the usage is higher than the given value.
Moreover, since I'm using macOS (no /proc/stat like Linux, no PerformanceCounter like C#), I have to execute the top command in another thread and get the CPU usage from it.
The problem is that I keep getting very high CPU usage even when I give a small value as the argument. And after several experiments, it seems to be caused by the field shared between the threads.
Here are my code (code 1) and experiments:
(code 2) Initially I thought it was the shell commands making the usage very high, so I commented out the infinite loop in run(), leaving only getCpuUsage() running. However, the CPU usage was nearly zero.
(code 3) Then I wrote another run() function, independent of cpuUsage, which is intended to use 50% of the CPU. It works well! I think the only difference between code 1 and code 3 is the use of cpuUsage. So I'm wondering: does sharing a field between threads use a lot of CPU?
code 1
const char CPU_COMMAND[] = "top -stats cpu -l 1 -n 0| grep CPU\\ usage | cut -c 12-15";
int cpuUsage; // shared field that stores the cpu usage

// thread that continuously checks CPU usage
// and stores it in cpuUsage
void getCpuUsage() {
    char usage[3];
    FILE *out;
    while (1) {
        out = popen(CPU_COMMAND, "r");
        if (fgets(usage, 3, out) != NULL) {
            cpuUsage = atof(usage);
        } else {
            cpuUsage = 0;
        }
        pclose(out);
    }
}

// keep the CPU usage under ratio
void run(int ratio) {
    pthread_t id;
    int ret = pthread_create(&id, NULL, (void *)getCpuUsage, NULL);
    if (ret != 0) printf("thread error!");
    while (1) {
        // if current cpu usage is higher than ratio, make it sleep
        if (cpuUsage > ratio) {
            usleep(10);
        }
    }
    pthread_join(id, NULL);
}
code 2
// keep the CPU usage under ratio
void run(int ratio) {
    pthread_t id;
    int ret = pthread_create(&id, NULL, (void *)getCpuUsage, NULL);
    if (ret != 0) printf("thread error!");
    /*
    while (1) {
        // if current cpu usage is higher than ratio, make it sleep
        if (cpuUsage > ratio) {
            usleep(10);
        }
    }
    */
    pthread_join(id, NULL);
}
code 3
void run() {
    const clock_t busyTime = 10;
    const clock_t idleTime = busyTime;
    while (1) {
        clock_t startTime = clock();
        while (clock() - startTime <= busyTime);
        usleep(idleTime);
    }
}
Does a shared field in a multithreaded C program use a lot of CPU?
Yes, constant reads and writes of a shared memory location by multiple threads on multiple CPUs cause the cache line to constantly move between those CPUs (cache-line bouncing). IMO it's the single most important reason for poor scalability in naive "parallel" applications.
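A toy demonstration of the effect (not the OP's code; layout and loop counts are arbitrary): two threads incrementing adjacent fields crawl compared to the variant where each field has its own cache line:

#include <pthread.h>

struct shared {
    long a;                    /* thread 1 increments this */
    /* char pad[64];  <- uncomment to give each field its own cache line */
    long b;                    /* thread 2 increments this */
} s;

static void *worker(void *p)
{
    long *field = p;
    for (long i = 0; i < 100000000L; i++)
        (*field)++;            /* each increment drags the cache line across CPUs */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, &s.a);
    pthread_create(&t2, NULL, worker, &s.b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}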
OK. Code 1 creates a thread which, as fast as possible, does a popen. So this thread uses up all the CPU time; the other thread (the main thread) does the usleeps, but not the popen-ing thread...
Code 2 also starts this CPU-hungry thread, then waits for it to finish (join), which will never happen.
Code 3 runs for a while, then sleeps the same amount, so it should use around 50%.
So basically what you should do (if you really want to use top for that purpose) is call it, then sleep for, say, 1 second or 100 ms, and see whether the main loop in code 1 adjusts:
while (1) {
    usleep(100 * 1000);   /* sample ~10 times per second instead of spinning */
    out = popen(CPU_COMMAND, "r");
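With the rest of the body from code 1 filled in, the sampling thread would look roughly like this (a sketch; the popen failure check is an addition):

void getCpuUsage() {
    char usage[3];
    FILE *out;
    while (1) {
        usleep(100 * 1000);           /* sample ~10 times per second */
        out = popen(CPU_COMMAND, "r");
        if (out == NULL)              /* popen can fail; skip this sample */
            continue;
        if (fgets(usage, sizeof usage, out) != NULL)
            cpuUsage = atof(usage);
        else
            cpuUsage = 0;
        pclose(out);
    }
}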
Okay, so I've got some C code to perform a mathematical operation which could take pretty much any length of time (depending on the operands supplied to it, of course). I was wondering if there is a way to register some kind of method which will be called every n seconds and which can analyse the state of the operation, i.e. what iteration it is currently at, possibly using a hardware timer interrupt or something?
The reason I ask is that I know the common way to implement this is to keep track of the current iteration in a variable, say an integer called progress, and have an IF statement like this in the code:

if ((progress % 10000) == 0)
    printf("Currently at iteration %d\n", progress);

but I believe that a mod operation takes a relatively long time to execute, so the idea of having it inside a loop which will run many, many times scares me, from an optimisation point of view.
So I get the feeling that having an external way of signalling a progress print would be nice and efficient. Are there any great ways to do this, or is the simple 'mod check' the best (in terms of optimisation)?
I'd go with the mod check, but maybe with subtractions instead :-)
icount = 0;
progress = 10000;

/* ... */

if (--progress == 0) {
    progress = 10000;
    printf("Currently at iteration %d0000\n", ++icount);
}

/* ... */
While mod operations are usually slow, the compiler and the CPU's branch predictor should handle this really well, mis-predicting only once every 10,000 ifs and burning one mod operation and ~20 cycles (for the mis-prediction) on it, which is fine. So you are trying to optimize away one mod operation every 10,000 iterations. Of course this assumes you are running it on a modern, typical CPU, and not some embedded system with unknown specs. It should even be faster than having a counter variable.
Suggestion: test it with and without the timing code, and figure out a complex solution only if there is really a problem.
"Premature optimisation is the root of all evil." -Knuth
mod is about the same speed as division; on most CPUs these days that means about 5-10 cycles... in other words hardly anything, slower than multiply/add/subtract, but not enough to really worry about.
However, you are right to want to avoid sitting in a loop spinning if you're doing work in another thread or something like that. If you're on a unixish system there's timer_create(), or on Linux the much easier to use timerfd_create() (see the sketch below).
But for single-threaded code, just putting that if in is enough.
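A sketch of the timerfd_create() route (Linux-specific; read() blocks until the timer fires, so this loop would live in a monitoring thread; error handling omitted):

#include <sys/timerfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    struct itimerspec its = {
        .it_value    = { .tv_sec = 5 },   /* first tick after 5 seconds */
        .it_interval = { .tv_sec = 5 },   /* then every 5 seconds */
    };
    timerfd_settime(tfd, 0, &its, NULL);

    for (;;) {
        uint64_t expirations;
        read(tfd, &expirations, sizeof expirations);  /* blocks until the timer fires */
        printf("still computing...\n");               /* or report a progress counter here */
    }
}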
Use setitimer to raise SIGALRM signals at regular intervals.
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

struct itimerval interval;

void handler( int x ) {
    write( STDOUT_FILENO, ".", 1 ); /* Defined in POSIX, not in C */
}

int main() {
    signal( SIGALRM, &handler );

    interval.it_value.tv_sec = 5;    /* display after 5 seconds */
    interval.it_interval.tv_sec = 5; /* then display every 5 seconds */
    setitimer( ITIMER_REAL, &interval, NULL );

    /* do computations */

    interval.it_value.tv_sec = 0;    /* a zero it_value disarms the timer: */
    interval.it_interval.tv_sec = 0; /* don't display progress any more */
    setitimer( ITIMER_REAL, &interval, NULL );

    printf( "\n" ); /* done with the dots! */
}
Note: only a smattering of functions are OK to call inside handler, namely the async-signal-safe functions listed in the POSIX specification. If you want to communicate anything for a fancier printout, do it through a sig_atomic_t variable (a sketch follows).
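For example (illustrative names; the handler sticks to write(), which is async-signal-safe, since printf is not guaranteed to be):

volatile sig_atomic_t progress = 0;   /* updated by the computation loop */

void handler( int x ) {
    /* format the number by hand: only async-signal-safe calls allowed here */
    char buf[32];
    int n = 0;
    long p = progress;
    do { buf[n++] = '0' + p % 10; } while ((p /= 10) > 0);
    while (n > 0) write( STDOUT_FILENO, &buf[--n], 1 );
    write( STDOUT_FILENO, "\n", 1 );
}

/* inside the computation loop: progress = current_iteration; */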
You could have a global variable for the iteration count, which you could monitor from an external thread:

while (1) {
    printf("iteration %d\n", iteration);
    sleep(1);
}

You may need to watch out for data races, though.