Skip to end of time quanta - c

Is it possible to skip to the end of a process's allocated time-quantum? I have a program that works in parallel on a piece of shared memory, and then all of the processes need to wait for the others to finish and sync up before the next step. Each process will do a maximum of one iteration more than any other, so any timing differences are minimal.
A microsecond sleep almost works, but I'm pretty sure that even usleep(1) would take longer than I'd like (right now I can do 5000 iterations in about 1.5 seconds, so that would add around 20ms to a test).
Some kind of busy wait seems like a bad idea, though it's what I might end up going with.
What I would really like is something along the lines of
while (*everyoneDone != 0) {
    // give up the rest of this time quantum
}
It doesn't need to be realtime, it just needs to be fast. Any ideas?
Note that this will be run on a multiprocessor machine, because if there's only one core to work with, the existing single-threaded version is going to perform better.

Don't do that; active waiting is almost always a bad idea in an application context. Use pthread_barrier_t; it is exactly the tool designed for your purpose.
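For illustration, a minimal sketch of what that could look like with threads; the worker count, the barrier name step_barrier, and the loop body are placeholders, not the asker's code:

#include <pthread.h>

#define NUM_WORKERS 4                    /* illustrative worker count */

static pthread_barrier_t step_barrier;   /* hypothetical name */

static void *worker(void *arg)
{
    (void)arg;
    for (int step = 0; step < 5000; step++) {
        /* ... do this worker's share of the shared-memory work ... */

        /* Block until all NUM_WORKERS workers reach this point, then
           everyone is released into the next step together. */
        pthread_barrier_wait(&step_barrier);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_WORKERS];

    pthread_barrier_init(&step_barrier, NULL, NUM_WORKERS);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&step_barrier);
    return 0;
}

Since the question actually uses separate processes, the barrier would instead live in the shared memory segment and be initialized with a pthread_barrierattr_t whose process-shared attribute is set to PTHREAD_PROCESS_SHARED.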

You don't state what OS you're working with, but if it's POSIX then sched_yield() might be what you're looking for.
Really though, you'd almost certainly be better off using a proper synchronisation primitive, like a semaphore.
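As a rough sketch of the first suggestion, the polling loop from the question could yield instead of spinning at full speed (everyoneDone is the asker's flag):

#include <sched.h>

/* Sketch: wait for the shared flag from the question to clear, giving up
   the rest of the time slice on every pass instead of burning it. */
static void wait_for_others(volatile int *everyoneDone)
{
    while (*everyoneDone != 0)
        sched_yield();   /* let another runnable process/thread have the CPU */
}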

Related

Use an external function to take a mutex within a task

I was wondering if I can take a mutex within a task but by calling an external function.
Here is my code below:
void TakeMutexDelay50(SemaphoreHandle_t mutex)
{
    while (xSemaphoreTake(mutex, 10) == pdFALSE)
    {
        vTaskDelay(50);
    }
}

bool ContinueTaskCopy()
{
    TakeMutexDelay50(ContinueTask_mutex);
    bool Copy = ContinueTask;
    xSemaphoreGive(ContinueTask_mutex);
    return Copy;
}
Basically, my task calls the function ContinueTaskCopy(). Would this be good practice?
The code above will work, but if you are not doing anything in the while loop while waiting for the mutex, you could just set the timeout to portMAX_DELAY and avoid all the context switches every 50 ticks.
To answer your doubts - yes, this code will work. From the technical point of view, the RTOS code itself doesn't really care about the function which takes or releases the mutex, it's the task that executes the code that matters (or more specifically the context, as we also include interrupts).
In fact, a while ago (some FreeRTOS 7 version, I think?) they introduced an additional check in the function releasing the mutex, which compares the task releasing the mutex to the task that holds it. If it's not the same one, it fails an assert, which is effectively an endless loop stopping your task from continuing so that you can notice and fix the issue (there are extensive comments around the asserts to help you diagnose it). It's done this way because mutexes are used to guard resources - think SD card access, display access and similar - and taking a mutex from one task and releasing it from another goes against this whole idea, or at least points to smelly code. If you need to do something like this, you likely wanted a semaphore instead.
Now as for the second part of your question - whether it's good practice - I'd say it depends on how complicated you make it, but generally I consider it questionable. Typically the code operating on a mutex looks something like this:
if (xSemaphoreTake(mutex, waitTime) == pdTRUE)
{
    doResourceOperation();
    xSemaphoreGive(mutex);
}
It's simple and easy to understand, as that's how most people are used to writing code like this. It avoids a whole class of problems with the mutex that may arise if you start complicating the code that takes and releases it ("Why isn't it released?", "Who holds this mutex?", "Am I in a deadlock?"). These kinds of problems tend to be hard to debug and sometimes even hard to fix, because the fix may involve rearranging parts of the code.
To give some general advice: keep it simple. Weird things being done around a mutex often mean there is something questionable going on - possibly some nasty deadlocks or race conditions. As in your example: instead of retrying the mutex every 50 ticks forever until it succeeds, just block by passing portMAX_DELAY as the delay time to xSemaphoreTake, and put it inside the same function that uses the resource and releases the mutex.
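As an illustrative rewrite of the asker's ContinueTaskCopy() along those lines, reusing the names from the question (ContinueTask_mutex, ContinueTask) and assuming the usual FreeRTOS.h/semphr.h includes:

bool ContinueTaskCopy(void)
{
    bool Copy = false;

    if (xSemaphoreTake(ContinueTask_mutex, portMAX_DELAY) == pdTRUE)
    {
        Copy = ContinueTask;                /* access the guarded resource  */
        xSemaphoreGive(ContinueTask_mutex); /* release in the same function */
    }
    return Copy;
}

Take, use and give now all live in one function, and the task simply blocks until the mutex is available instead of waking up every 50 ticks.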

About Dijkstra omp

Recently I downloaded the source code of an OpenMP Dijkstra implementation from the internet.
But I found that the parallel time is always larger than when it is run with one thread (whether I use two, four, or eight threads).
Since I'm new to OpenMP, I really want to figure out what is happening.
This is due to the overhead of setting up the threads. The execution time of the work itself is theoretically the same, but the system has to set up the threads that manage the work (even if there's only one). For little work, or for only one thread, this overhead makes your time-to-solution slower than the serial time-to-solution.
Alternatively, if you see the time increasing dramatically as you increase the thread count, you could be using only one core on your computer while asking it to run 2, 4, 8, etc. threads.
Finally, it's possible that the way you're implementing Dijkstra's method is largely serial, but without looking at your code it's hard to say.
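Since the downloaded code isn't shown, the following is only a generic sketch of the loop such implementations usually parallelize (dist, visited and graph_row are illustrative names); it shows where the per-iteration overhead comes from:

#include <omp.h>

/* Relaxation step of Dijkstra on one adjacency-matrix row. Each call forks
   and joins an OpenMP team, and the outer algorithm calls this once per
   vertex, so for small graphs that fork/join overhead easily exceeds the
   O(n) of useful work per iteration. */
void relax_neighbors(int n, int u, const int *graph_row,
                     int *dist, const int *visited)
{
    const int du = dist[u];   /* read once outside the parallel region */

    #pragma omp parallel for
    for (int v = 0; v < n; v++) {
        if (!visited[v] && graph_row[v] > 0 && du + graph_row[v] < dist[v])
            dist[v] = du + graph_row[v];   /* disjoint writes: one v per thread */
    }
}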

libuv logging best practice?

My program has a std::stringstream that is periodically flushed (with a timer) to a log file. The flushing and timer are on the default run loop.
Other parts of the app just append to that std::stringstream and the timer takes care of the rest. I do limit the size of the stringstream (to 1mb) so I drop messages if the stream is "full".
I'm just wondering, is this best practice for:
Performance? Is it OK to handle this IO on the main thread? Can I do better?
Critical errors? The problem could be within my usage of libuv, which could mean that libuv-based logging would be borked too?
How does node.js handle logging?
I think that question is more complex than one might think at first sight.
One part of a good response has little to do with libuv and much to do with your concrete needs and tradeoffs. While, for instance, some buffering (i.e. less frequent write syscalls) is good, it also introduces a problem that might (or might not) hit you hard in the area of logging. Reason: buffered data dies along with the application if it crashes, and that goes against the very purpose of logging.
As for libuv and performance, my personal experience is that
a) one wants to find a good balance between writing info out and buffering. In your case my gut feeling is that you are buffering too much and should probably write out more frequently.
b) one wants to think carefully about performance, both in terms of whether it's really critical at all, and in terms of details, the latter being of increasing importance under heavy server load. When you serve some 100 connections it's probably irrelevant, but if you serve tens or hundreds of thousands of connections, it might be too costly to use convenience functions like fprintf.
Concrete example: in a highly loaded situation you might want to get the wall time once at startup along with the (then) current value of a monotonic timer (which is very cheap). Any time information can then be relative to that start value (a simple subtraction). Writing it out works like this: preformatted start wall time plus monotonic diff (e.g. "03:52:41 +123456 ms").
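As a rough sketch of that idea using libuv's uv_hrtime() (the helper names and format string are illustrative; in a hot path you would append to your log buffer rather than call printf):

#include <uv.h>
#include <stdio.h>
#include <time.h>

static char     start_wall[32];   /* wall time, formatted once at startup */
static uint64_t start_ns;         /* monotonic reference point            */

void log_time_init(void)
{
    time_t now = time(NULL);
    strftime(start_wall, sizeof start_wall, "%H:%M:%S", localtime(&now));
    start_ns = uv_hrtime();                     /* monotonic, nanoseconds */
}

void log_line(const char *msg)
{
    uint64_t diff_ms = (uv_hrtime() - start_ns) / 1000000u;
    /* e.g. "03:52:41 +123456 ms something happened" */
    printf("%s +%llu ms %s\n", start_wall, (unsigned long long)diff_ms, msg);
}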
Another point with your scenario is that a modern OS will virtually always provide excellent buffering, so it usually doesn't make a lot of sense to buffer too much yourself.
All in all I'd suggest using a buffer of about 16K or 32K and writing it out more frequently. If (and only if) your scenario is high-performance/heavy-load, you may want to avoid convenient but expensive functions.
As far as libuv is concerned, I wouldn't worry. Depending on your OS and the libuv version, file I/O (as opposed to socket I/O) may indeed be pseudo-asynchronous (faked through threading), but my experience is that libuv is not the problem; rather, your large buffer may be a problem, as it might well be written out in multiple chunks.
Regarding your timer-based approach, you might want to have a look at the libuv idle mechanism, and also take care of the problem of a full buffer. Simply throwing away logging info seems unacceptable to me; after all, you're not logging for the fun of it but because that info is presumably important (if it weren't, you wouldn't have a problem in the first place; the solution would simply be less logging).
Finally, I'd like to make a more general remark: the secret here is balance, not optimized performance of single details. You want to keep the whole system nicely balanced rather than, for instance, optimizing by using large buffers, which in the end just pushes the problem to another level rather than solving it.
I like to think of this kind of problem as being like moving, say, a company headquarters: the issue isn't having the fastest truck, but having all of them be reasonably fast - in other words, a well-balanced approach.
Honestly there are better options than logging by hand. If you are programming an application it's often faster, in both development and execution time, to use a library.
If you are programming to learn, then I'd advise taking a look at spdlog (the fastest approach) and g3log, which claims to have the best worst-case behaviour.
In my experience, std::stringstream is not fast enough to be part of a logging system.

Making process survive failure in its thread

I'm writing an app that has many independent threads. Since I'm doing quite low-level, dangerous stuff there, threads may fail (SIGSEGV, SIGBUS, SIGFPE), but they should not kill the whole process. Is there a proper way to do this?
Currently I intercept the aforementioned signals and call pthread_exit(NULL) in their signal handler. It seems to work, but since pthread_exit is not an async-signal-safe function, I'm a bit concerned about this solution.
I know that splitting this app into multiple processes would solve the problem, but in this case it's not a feasible option.
EDIT: I'm aware of all the Bad Thingsā„¢ that can happen (I'm experienced in low-level system and kernel programming) due to ignoring SIGSEGV/SIGBUS/SIGFPE, so please try to answer my particular question instead of giving me lessons about reliability.
The PROPER way to do this is to let the whole process die, and start another one. You don't explain WHY this isn't appropriate, but in essence, that's the only way that is completely safe against various nasty corner cases (which may or may not apply in your situation).
I'm not aware of any method that is 100% safe that doesn't involve letting the whole process die. (Note also that sometimes just the act of continuing from these sorts of errors is "undefined behaviour" - it doesn't mean that you are definitely going to fall over, just that it MAY be a problem.)
It's of course possible that someone knows of some clever trick that works, but I'm pretty certain that the only 100% guaranteed method is to kill the entire process.
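To illustrate that approach (purely a sketch, not the asker's architecture): a small supervisor fork()s the real worker and restarts it whenever it dies from a signal; run_worker() is a placeholder for the actual application code.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

static void run_worker(void)
{
    /* ... the actual (crash-prone) application code ... */
    exit(0);
}

int main(void)
{
    for (;;) {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); break; }
        if (pid == 0) {                    /* child: do the real work        */
            run_worker();
            _exit(0);
        }
        int status;
        waitpid(pid, &status, 0);          /* supervisor: wait for the child */
        if (WIFEXITED(status))
            break;                         /* clean exit: stop restarting    */
        fprintf(stderr, "worker died (signal %d), restarting\n",
                WIFSIGNALED(status) ? WTERMSIG(status) : 0);
    }
    return 0;
}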
Low-latency code design involves a careful "be aware of the system you run on" type of coding and deployment. That means, for example, that standard IPC mechanisms (say, using SysV msgsnd/msgget to pass messages between processes, or pthread_cond_wait/pthread_cond_signal on the PThreads side) as well as ordinary locking primitives (adaptive mutexes) are to be considered rather slow ... because they involve something that takes thousands of CPU cycles ... namely, context switches.
Instead, use "hot-hot" handoff mechanisms such as the disruptor pattern - both producers as well as consumers spin in tight loops permanently polling a single or at worst a small number of atomically-updated memory locations that say where the next item-to-be-processed is found and/or to mark a processed item complete. Bind all producers / consumers to separate CPU cores so that they will never context switch.
In this type of use case, whether you use separate threads (and get the memory sharing implicitly by virtue of all threads sharing the same address space) or separate processes (and get the memory sharing explicitly by using shared memory for the data-to-be-processed as well as the queue-management "metadata") makes very little difference, because TLBs and data caches are "always hot" (you never context switch).
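As an illustration only (a single-producer/single-consumer handoff, not a full disruptor), the spinning-on-atomics idea might look like this with C11 atomics; all names and sizes are placeholders:

#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 1024                        /* power of two, illustrative */

typedef struct {
    uint64_t         slots[RING_SIZE];        /* payload, e.g. buffer indices  */
    _Atomic uint64_t head;                    /* next slot the producer writes */
    _Atomic uint64_t tail;                    /* next slot the consumer reads  */
} spsc_ring;

static void produce(spsc_ring *r, uint64_t item)
{
    uint64_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* spin while the ring is full; both sides are pinned to their own cores */
    while (h - atomic_load_explicit(&r->tail, memory_order_acquire) >= RING_SIZE)
        ;
    r->slots[h % RING_SIZE] = item;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
}

static uint64_t consume(spsc_ring *r)
{
    uint64_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* spin until the producer has published something */
    while (atomic_load_explicit(&r->head, memory_order_acquire) == t)
        ;
    uint64_t item = r->slots[t % RING_SIZE];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return item;
}

No syscalls and no blocking primitives appear on the hot path, which is exactly why the latency stays in the nanosecond range as long as each side owns its core.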
If your "processors" are unstable and/or have no guaranteed completion time, you need to add a "reaper" mechanism anyway to deal with failed / timed out messages, but such garbage collection mechanisms necessarily introduce jitter (latency spikes). That's because you need a system call to determine whether a specific thread or process has exited, and system call latency is a few micros even in best case.
From my point of view, you're trying to mix oil and water here; you're required to use library code not specifically written for use in low-latency deployments / library code not under your control, combined with the requirement to do message dispatch with nanosec latencies. There is no way to make e.g. pthread_cond_signal() give you nsec latency because it must do a system call to wake the target up, and that takes longer.
If your "handler code" relies on the "rich" environment, and a huge amount of "state" is shared between these and the main program ... it sounds a bit like saying "I need to make a steam-driven airplane break the sound barrier"...

Can't get any speedup from parallelizing Quicksort using Pthreads

I'm using Pthreads to create a new thread for each partition after the list is split into the right and left halves (less than and greater than the pivot). I do this recursively until I reach the maximum number of allowed threads.
When I use printfs to follow what goes on in the program, I clearly see that each thread is doing its delegated work in parallel. However, using a single process is always the fastest. As soon as I try to use more threads, the time it takes to finish almost doubles, and keeps increasing with the number of threads.
I am allowed to use up to 16 processors on the server I am running it on.
The algorithm goes like this:
Split array into right and left by comparing the elements to the pivot.
Start a new thread for the right and left, and wait until the threads join back.
If there are more available threads, they can create more recursively.
Each thread waits for its children to join.
Everything makes sense to me, and sorting works perfectly well, but more threads makes it slow down immensely.
I tried setting a minimum number of elements per partition for a thread to be started (e.g. 50000).
I tried an approach where when a thread is done, it allows another thread to be started, which leads to hundreds of threads starting and finishing throughout. I think the overhead was way too much. So I got rid of that, and if a thread was done executing, no new thread was created. I got a little more speedup but still a lot slower than a single process.
The code I used is below.
http://pastebin.com/UaGsjcq2
Does anybody have any clue as to what I could be doing wrong?
Starting a thread has a fair amount of overhead. You'd probably be better off creating a threadpool with some fixed number of threads, along with a thread-safe queue to queue up jobs for the threads to do. The threads wait for an item in the queue, process that item, then wait for another item. If you want to do things really correctly, this should be a priority queue, with the ordering based on the size of the partition (so you always sort the smallest partitions first, to help keep the queue size from getting excessive).
This at least reduces the overhead of starting the threads quite a bit -- but that still doesn't guarantee you'll get better performance than a single-threaded version. In particular, a quick-sort involves little enough work on the CPU itself that it's probably almost completely bound by the bandwidth to memory. Processing more than one partition at a time may hurt cache locality to the point that you lose speed in any case.
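To illustrate the idea (not the asker's code), here is a minimal sketch of a fixed pool of pthreads pulling [lo, hi] ranges from a shared work list; it uses a plain LIFO rather than the size-ordered priority queue suggested above, has no overflow check on the list, and all thresholds and names are arbitrary:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4        /* fixed pool size, illustrative       */
#define CUTOFF   10000    /* below this, sort the range serially */
#define MAXTASKS 4096     /* simple fixed-capacity work list     */

typedef struct { int lo, hi; } range;

static int            *data;                 /* array being sorted  */
static range           tasks[MAXTASKS];      /* LIFO work list      */
static int             ntasks;               /* ranges waiting      */
static int             pending;              /* waiting + in flight */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static void push_range(int lo, int hi)       /* call with the lock held */
{
    tasks[ntasks].lo = lo;
    tasks[ntasks].hi = hi;
    ntasks++;
    pending++;
    pthread_cond_signal(&cond);
}

static int partition(int lo, int hi)         /* Lomuto, pivot = data[hi] */
{
    int pivot = data[hi], i = lo;
    for (int j = lo; j < hi; j++)
        if (data[j] < pivot) { int t = data[i]; data[i] = data[j]; data[j] = t; i++; }
    int t = data[i]; data[i] = data[hi]; data[hi] = t;
    return i;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (ntasks == 0 && pending > 0)
            pthread_cond_wait(&cond, &lock);
        if (ntasks == 0 && pending == 0) {   /* everything is sorted */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        range r = tasks[--ntasks];           /* still counted in 'pending' */
        pthread_mutex_unlock(&lock);

        if (r.hi - r.lo < CUTOFF) {
            /* small range: finish it serially, no new tasks */
            qsort(data + r.lo, (size_t)(r.hi - r.lo + 1), sizeof(int), cmp_int);
            pthread_mutex_lock(&lock);
            pending--;
        } else {
            int p = partition(r.lo, r.hi);
            pthread_mutex_lock(&lock);
            pending--;                       /* replaced by its halves */
            if (r.lo < p - 1) push_range(r.lo, p - 1);
            if (p + 1 < r.hi) push_range(p + 1, r.hi);
        }
        if (pending == 0)
            pthread_cond_broadcast(&cond);   /* wake idle workers so they exit */
        pthread_mutex_unlock(&lock);
    }
}

int main(void)
{
    enum { N = 1000000 };
    data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++) data[i] = rand();

    pthread_mutex_lock(&lock);
    push_range(0, N - 1);
    pthread_mutex_unlock(&lock);

    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);

    printf("first %d last %d\n", data[0], data[N - 1]);
    free(data);
    return 0;
}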
My first guess would be that creating, destroying, and especially syncing your threads is going to eat up any possible gain you might receive, depending on just how many elements you are sorting. I'd actually guess that it would take quite a long while to make up the overhead, and that it probably won't ever be made up.
Because of the way you have structured your sort, you have one thread waiting for another, waiting for another... so you aren't really getting all that much parallelism to begin with. You'd be better off using a more linear sort, perhaps something like a radix sort, that splits the data up more evenly among the threads. That still has one thread waiting for others a lot, but at least the threads get to do more work in the meantime. Still, I don't think threads are going to help much even with this.
I just had a quick look at your code, and I have a remark:
Why are you using a lock?
If I understand correctly, what you are doing is something like:
quickSort(array)
{
    left, right = partition(array);
    newThread(quickSort(left));
    newThread(quickSort(right));
}
You shouldn't need a lock. Normally each call to quicksort does not access the other part of the array, so no sharing is involved.
Unless each thread is running on a separate processor or core, they will not truly run concurrently, and the context-switch time will be significant. The number of threads should be restricted to the number of available execution units, and even then you have to trust that the OS will distribute them to separate processors/cores, which it may not do if they are also being used for other processes.
Also you should use a static thread pool rather than creating and destroying threads dynamically. Creating/destroying a thread includes allocating/releasing a stack from the heap, which is non-deterministic and potentially time-consuming.
Finally, are the 16 processors on the server real or virtual? And are they exclusively allocated to your process?
