How do you conserve memory when receiving messages from POSIX queues? - c

It seems that when using POSIX queues in a multiprocess / multithreaded environment, there is no thread-safe way to dequeue a message into a buffer that is smaller than the queue's maximum message size (mq_msgsize).
Are there any standard solutions to this problem? Or is it even a problem?
I am well aware that there are other really great libraries to do this, but I just wanted to include a completely standard solution for users if they don't want to deal with dependencies.
FYI, I am trying to queue up potentially hundreds of megabytes per message and have a pool of processes with multiple threads each dequeuing the messages for processing.

The POSIX queue interface, as you note, does not allow you to query the size of a message.
In effect, therefore, all messages may be at the maximum size as configured by the queue definition, and you have to assume that a simplistic implementation might well make use of that for ease of record-keeping.
Given that you are dealing with multi-megabyte messages, as you say, putting those messages into the queue is unlikely to be a good solution (unfortunately).
If your message rate is low (to some definition of low) and you actually do have a reasonable upper bound, then just go ahead and try it out.
Barring that, your next best bet would be to use the queue as a work-order queue, and not as a work-item queue. Your work items would have to be stored differently, in files perhaps. Then in the queue you have a nice short filename, pointing to the location of the work-item to take care of.
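A minimal sketch of the work-order idea, assuming the producer has already written the payload to a file and only the path travels through the queue. The names order_queue_open, order_send, order_receive and the ORDER_MAX and queue-depth values are made up for illustration; only the mq_* calls are the standard interface.

```c
#include <fcntl.h>
#include <mqueue.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define ORDER_MAX 256  /* assumed cap on path length, including the NUL */

/* Create (or open) a queue whose messages are short path strings. */
mqd_t order_queue_open(const char *name)
{
    struct mq_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.mq_maxmsg = 8;            /* assumed depth; system limits apply */
    attr.mq_msgsize = ORDER_MAX;   /* every receive buffer is this small */
    return mq_open(name, O_CREAT | O_RDWR, 0600, &attr);
}

/* Producer: the payload already lives in the file at `path`,
 * so only the path (with its NUL) goes into the queue. */
int order_send(mqd_t q, const char *path)
{
    size_t len = strlen(path) + 1;
    if (len > ORDER_MAX)
        return -1;
    return mq_send(q, path, len, 0);
}

/* Consumer: returns the message length, or -1 on error. */
ssize_t order_receive(mqd_t q, char path[ORDER_MAX])
{
    return mq_receive(q, path, ORDER_MAX, NULL);
}
```

Each worker then opens the file named in the order and reads (or mmaps) the payload itself, so the per-thread receive buffer stays at ORDER_MAX bytes no matter how large the work item is.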
Good Luck

I think there is no really good way to do this. Here is an idea, but I think you will find it performs badly due to lock contention:
Have one static buffer whose size equals the maximum message size. Because the buffer is shared, your dequeue procedure must now look like this:
1) lock semaphore
2) dequeue into static buffer
3) figure out the real size of the message
4) copy from static buffer to a thread-local buffer that is the actual size of the message
5) unlock semaphore
There is the overhead of having the static buffer, but depending on the distribution of the sizes of your messages, you are still likely to see a reduction in total memory usage. However, now you have to deal with contention for the static buffer, which is likely to be high, especially when a few large messages arrive in a row. If very large messages are rare, then this might not be a terrible solution.
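The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the lock is a pthread mutex (enough to serialize the threads of one process, which is all that share a static buffer), MAX_MSG matches the queue's mq_msgsize, and receive_exact is a made-up name. The "real size" is simply the return value of mq_receive.

```c
#include <mqueue.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define MAX_MSG 8192  /* assumed to equal the queue's mq_msgsize */

static char shared_buf[MAX_MSG];
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

/* Dequeue one message and return a heap copy of exactly its length.
 * The caller frees the result; NULL means mq_receive or malloc failed. */
char *receive_exact(mqd_t q, size_t *len_out)
{
    char *copy = NULL;

    pthread_mutex_lock(&shared_lock);
    /* mq_receive blocks while holding the lock; the other threads of this
     * process would be waiting for a message anyway. */
    ssize_t n = mq_receive(q, shared_buf, sizeof shared_buf, NULL);
    if (n >= 0) {
        copy = malloc(n > 0 ? (size_t)n : 1);  /* avoid malloc(0) */
        if (copy != NULL) {
            memcpy(copy, shared_buf, (size_t)n);
            *len_out = (size_t)n;
        }
    }
    pthread_mutex_unlock(&shared_lock);
    return copy;
}
```

The copy made under the lock is exactly where the contention described above comes from: the larger the message, the longer every other thread waits.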


Can I get socket buffer remainder size?

I have a real-time system, so I am using a non-blocking socket to send my data.
But sometimes the socket buffer is full,
so the return value of the send function is less than my data length.
If I save the returned length and re-send the rest, how is that different from a blocking socket?
So can I get the remaining free space in the socket buffer? I could check it first:
if there is enough room I call send, otherwise I skip the send.
Thank you, all.
Well, there is a difference between blocking and non-blocking: if you experience a short write, you don't block. That's the whole point of non-blocking. It gives you an opportunity to do something more pressing while waiting for some buffer space to free up.
Your concern seems to be the repeated attempts to write a full message, that is, a form of polling. But checking the free bytes in the buffer is the same thing; you are just substituting a call that queries the available space for the call to write. You really don't gain anything efficiency-wise.
The commonplace solution to this is to use something like select or poll that monitors the socket descriptor for the ability to write (at least some) bytes. This allows you to stop polling and foist some of the work off on the kernel, which monitors the space availability for you.
That said, if you really want to check how much space is available, there are usually workarounds that tend to be somewhat platform-specific, mostly ioctl calls with various platform-specific parameters like FIONWRITE, SIOCOUTQ, etc. You would need to investigate exactly what your platform provides. But, again, it is better to consider whether this is really something you need in the first place.
If the asynchronous send fails with EWOULDBLOCK/EAGAIN, no data is sent. You could then try to send something else, or wait until the buffer is free again.
Also see - a related issue is discussed there.

Circular buffer Vs. Lock free stack to implement a Free List

As I have been writing some multi-threaded code for fun, I came up with the following situation:
a thread claims a single resource unit from a memory pool, it processes it and sends a pointer to this data to another thread for further operation using a circular buffer (1R / 1W case).
The latter must inform the former thread whenever it is done with the data it received, so that the memory can be recycled.
I wonder whether it is better - performance-wise - to implement this "Freelist" as another circular buffer - holding the addresses of free resources - or choose the lock-free stack way (implementing DCAS on x86-64).
Generally speaking, what could be the pros and the cons of the two different approaches ?
Just in case: there is a difference between lock-free and wait-free. Lock-free means that no locks are used and that some thread always makes progress, but an individual thread can still spin, retrying without making progress itself. Wait-free means that every thread makes progress in a bounded number of steps, with no locking or busy-spinning.
With one reader and one writer, a lock-free and wait-free FIFO circular buffer is trivial to implement.
I hear that a LIFO stack can also be made wait-free, but I am not so sure about a FIFO list. And it sounds like you need a queue here rather than a stack.
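For reference, a one-reader/one-writer ring might look like the sketch below, written with C11 atomics. The names spsc_ring, ring_push and ring_pop and the RING_SLOTS value are invented for this example. Neither operation loops, so each call finishes in a bounded number of steps.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8  /* assumed capacity; must be a power of two */

struct spsc_ring {
    void *slot[RING_SLOTS];
    _Atomic size_t head;  /* next slot to read; written only by the consumer */
    _Atomic size_t tail;  /* next slot to write; written only by the producer */
};

/* Producer side. Returns false when the ring is full. */
bool ring_push(struct spsc_ring *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return false;
    r->slot[tail % RING_SLOTS] = item;
    /* Release: the slot write becomes visible before the new tail does. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool ring_pop(struct spsc_ring *r, void **item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *item = r->slot[head % RING_SLOTS];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

For the free list in the question, a second ring of this kind running in the opposite direction (consumer pushes freed pointers, producer pops them) would keep both paths single-reader/single-writer.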
The main difference is the circular buffer will be bounded, while the stack will not.
It's hard to make a performance judgement on things like this without testing. On the one hand, the circular buffer is backed by a contiguous array. If the reader and writer indices remain "near" each other, you'll have each thread constantly invalidating a shared cache line.
On the other hand, with a stack you can have contention for the top-of-stack pointer, resulting in threads sometimes spinning in the CAS loop.
My guess would be that the best choice is workload-dependent.

Better to lock on a shared resource, or have a thread to fulfill requests?

I have a shared memory pool from which many different threads may request an allocation. Requesting an allocation from this will occur a LOT in every thread; however, the number of threads is likely to be small, often with only 1 thread running. I am not sure which of the following ways to handle this is better.
Ultimately I may need to implement both and see which produces more favorable results... I also fear that even thinking of #2 may be premature optimization at this point as I don't actually have the code that uses this shared resource written yet. But the problem is so darn interesting that it continues to distract me from the other work.
1) Create a mutex and have a thread attempt to lock it before obtaining the allocation, then unlocking it.
2) Have each thread register a request slot. When it needs an allocation, it puts the request in the slot, then blocks (while (result == NULL) { usleep() }) waiting for the request slot to have a result. A single thread continuously iterates over the request slots, making the allocations and assigning them to the result in the request slot.
Number 1 is the simple solution, but a single thread could potentially hog the lock if the timing is right. The second is more complex, but ensures fairness among threads when pulling from the resource. However it still blocks the requesting threads, and if there are many threads the iteration could burn cycles without doing any actual allocations until it finds a request to fulfill.
NOTE: C on Linux using pthreads
Solution 2 is bogus. It's an ugly hack and it does not ensure memory synchronization.
I would say go with solution 1, but I'm a little bit skeptical of the fact that you mentioned "memory pool" to begin with. Are you just trying to allocate memory, or is there some other resource you're managing (e.g. slots in some special kind of memory, memory-mapped file, textures in video memory, etc.)?
If you are just allocating memory, then you're completely right to be worried about premature optimization. The whole problem is premature optimization, and the system malloc will do as well as or better than your memory pool will do. (Or if your code will be running on one of the few systems with a pathologically broken malloc like some video game consoles, just drop in a replacement only on those known-broken systems.)
If you really do have a special resource you need to manage, start with solution 1 and see how it works. If you have problems, you might find you can improve it with a condition variable where the resource manager notifies you when a slot can be allocated, but I really doubt this will be necessary.
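As a starting point for solution 1, a fixed-block pool behind a single pthread mutex could look like the sketch below. The pool size, block size and the names pool_alloc and pool_release are placeholders chosen for the example, not anything from the question.

```c
#include <pthread.h>
#include <stddef.h>

#define POOL_BLOCKS 64   /* assumed number of blocks */
#define BLOCK_SIZE  256  /* assumed block size in bytes */

union block {
    union block *next;           /* used while the block is on the free list */
    char payload[BLOCK_SIZE];    /* used while the block is handed out */
    max_align_t align;           /* keep blocks suitably aligned */
};

static union block pool_storage[POOL_BLOCKS];
static union block *pool_free;
static int pool_ready;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns a block, or NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    pthread_mutex_lock(&pool_lock);
    if (!pool_ready) {
        /* Lazily thread every block onto the free list. */
        for (size_t i = 0; i < POOL_BLOCKS; i++) {
            pool_storage[i].next = pool_free;
            pool_free = &pool_storage[i];
        }
        pool_ready = 1;
    }
    union block *b = pool_free;
    if (b != NULL)
        pool_free = b->next;
    pthread_mutex_unlock(&pool_lock);
    return b;
}

/* Returns a block obtained from pool_alloc() to the pool. */
void pool_release(void *p)
{
    union block *b = p;
    pthread_mutex_lock(&pool_lock);
    b->next = pool_free;
    pool_free = b;
    pthread_mutex_unlock(&pool_lock);
}
```

The critical section is a couple of pointer assignments, so with only a handful of threads the lock is held very briefly; measuring this version first gives a baseline before trying anything more elaborate.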

Are repeated recv() calls expensive?

I have a question about a situation that I face quite often. From time to time I have to implement various TCP-based protocols. Most of them define variable-length data packets that begin with a common header ([packet ID, length, payload] or something really similar). Obviously, there can be two approaches to reading these packets:
Read header (since header length is usually fixed), extract the payload length, read the payload
Read all available data and store it in a buffer; parse the buffer afterwards
Obviously, the first approach is simple, but requires two calls to read() (or probably more). The second one is slightly more complicated, but requires fewer calls.
The question is: does the first approach affect the performance badly enough to worry about it?
Yes, system calls are generally expensive compared to memory copies. IMHO this is particularly true on the x86 architecture, and arguably on RISC machines (ARM, MIPS, ...).
To be honest, unless you must handle hundreds or thousands of requests per second, you will hardly notice the difference.
Depending on what exactly the protocol is, a hybrid approach could be the best. When the protocol uses a lot of small packets and fewer big ones, you can read the header and a partial amount of data. When it is a small packet, you win by avoiding a large memcpy; when the packet is big, you win by issuing a second syscall only for that case.
If your application is a server capable of handling multiple clients simultaneously and non-blocking sockets are used to handle multiple clients in one thread, you have little choice but to only ever issue one recv() syscall when a socket becomes ready for read.
The reason is that if you keep calling recv() in a loop and the client sends a large volume of data, your recv() loop may keep the thread from doing anything else for a long time. E.g., recv() reads some amount of data from the socket, determines that there is now a complete message in the buffer and forwards that message to the callback. The callback processes the message somehow and returns. If you call recv() once more, there can be more messages that have arrived while the callback was processing the previous message. This leads to a busy recv() loop on one socket preventing the thread from processing any other pending events.
This issue is exacerbated if the socket read buffer in your application is smaller than the kernel socket receive buffer. In other words, the whole contents of the kernel receive buffer can not be read in one recv() call. Anecdotal evidence is that I hit this issue on a busy production system when there was a 16Kb user-space buffer for a 2Mb kernel socket receive buffer. A client sending many messages in succession would block the thread in that recv() loop for minutes because more messages would arrive when the just read messages were being processed, leading to disruption of the service.
In such event-driven architectures it is best to have the user-space read buffer equal to the size of the kernel socket receive buffer (or the maximum message size, whichever is bigger), so that all the data available in the kernel buffer can be read in one recv() call. This works by doing one recv() call, processing all complete messages in the user-space read buffer and then returning control to the event loop. This way a connection with a lot of incoming data does not block the thread from processing other events and connections; rather, the thread round-robins over all connections with incoming data available.
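A rough sketch of that pattern follows: exactly one recv() per readiness event, then every complete message in the buffer is dispatched before control returns to the event loop. The framing (a 1-byte length followed by the payload), the buffer size, and the names conn, conn_on_readable and tally_cb are all assumptions made up for the example.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define RBUF_SIZE 65536  /* assumed; ideally >= the kernel receive buffer */

struct conn {
    int fd;
    size_t used;                  /* bytes currently buffered */
    unsigned char buf[RBUF_SIZE];
};

typedef void (*msg_cb)(const unsigned char *payload, size_t len, void *ctx);

/* Example callback for illustration: counts messages and payload bytes. */
struct tally { size_t msgs, bytes; };
void tally_cb(const unsigned char *payload, size_t len, void *ctx)
{
    struct tally *t = ctx;
    (void)payload;
    t->msgs++;
    t->bytes += len;
}

/* Call once per readiness event: one recv(), then drain the buffer.
 * Returns bytes read, 0 on orderly shutdown, -1 on error (check errno). */
ssize_t conn_on_readable(struct conn *c, msg_cb cb, void *ctx)
{
    ssize_t n = recv(c->fd, c->buf + c->used, sizeof c->buf - c->used, 0);
    if (n <= 0)
        return n;
    c->used += (size_t)n;

    /* Hypothetical framing: buf[off] is the payload length. */
    size_t off = 0;
    while (c->used - off >= 1 && c->used - off - 1 >= c->buf[off]) {
        cb(c->buf + off + 1, c->buf[off], ctx);
        off += 1 + (size_t)c->buf[off];
    }
    /* Keep any partial message at the front for the next event. */
    memmove(c->buf, c->buf + off, c->used - off);
    c->used -= off;
    return n;
}
```

Because the function never loops on recv(), a connection that keeps receiving data gets one turn per pass of the event loop, like every other connection.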
The best way to get your answer is to measure. The strace program is decent for the purpose of measuring system call times. Using it adds a lot of overhead in itself, but if you merely compare the cost of one recv for this purpose versus the cost of two, it should be reasonably meaningful. Use the -tt option to get times. Or you can use the -c option to get an overview of time spent separated by which syscall it was spent on.
A better way to measure, albeit with more of a learning curve, is oprofile.
Also note that if you do decide buffering is worthwhile, you may be able to use fdopen and the stdio functions to take care of it for you. This is extremely easy and will work well if you're only dealing with a single connection or if you have a thread/process per connection, but won't work at all if you want to use a select/poll-based model.
Note that you generally have to "read all the available data into a buffer and process it afterwards" anyway, to account for the (unlikely, but possible) scenario where a recv() call returns only part of your header - so you might as well go the whole hog and use option 2.
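If you stay with option 1 on a blocking socket, the partial-read case is usually handled with a small loop like the one sketched here (recv_exact is a name chosen for this example). It would be called once for the fixed-size header and once more for the payload length taken from that header.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read exactly `len` bytes from a blocking socket.
 * Returns 0 on success, -1 on error or if the peer closes early. */
int recv_exact(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n == 0)
            return -1;            /* connection closed mid-message */
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted by a signal: retry */
            return -1;
        }
        p += n;                   /* partial read: ask for the rest */
        len -= (size_t)n;
    }
    return 0;
}
```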
Yes, depending upon the scenario, the read/recv calls may be expensive. For example, if you are issuing a huge number of recv() calls to read very small amounts of data at short intervals, it would be a performance hit. In such a scenario you could issue a recv() with a reasonably large buffer, let's say 4k, and then parse that 4k buffer. It may contain multiple header+data combos. By reading the header first you can find the data and its length. And to avoid copying the data into a new buffer, you can just use the offset where the actual data starts, and store that pointer.
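The parse-in-place idea could be sketched like this. The header layout (2-byte packet ID and 2-byte payload length, both big-endian) is an assumption for the example, since the real layout depends on the protocol, and parse_frame is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN 4  /* assumed header: 2-byte ID + 2-byte length, big-endian */

struct frame {
    uint16_t id;
    uint16_t len;
    const unsigned char *payload;  /* points into the receive buffer, no copy */
};

/* Try to parse one frame from the start of buf[0..avail).
 * Returns the number of bytes consumed, or 0 if the frame is incomplete. */
size_t parse_frame(const unsigned char *buf, size_t avail, struct frame *out)
{
    if (avail < HDR_LEN)
        return 0;                               /* header not complete yet */

    uint16_t id  = (uint16_t)((buf[0] << 8) | buf[1]);
    uint16_t len = (uint16_t)((buf[2] << 8) | buf[3]);
    if (avail - HDR_LEN < len)
        return 0;                               /* payload not complete yet */

    out->id = id;
    out->len = len;
    out->payload = buf + HDR_LEN;
    return HDR_LEN + (size_t)len;
}
```

The caller advances an offset by the returned count until it gets 0, then moves the leftover bytes to the front of the buffer before the next recv(). The payload pointers stay valid only until that move.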

Queues implementation benchmark

I'm starting development of a series of image processing algorithms, some of them with intensive use of queues. Do you guys know a good benchmark for those data structures?
To narrow the scope, I'm using C mostly, but I can use C++, the STL and any library.
I've got a few hits on data structure libraries, such as GLib and C-Generic-Library, and of course the containers of STL. Also, if any of you developed/know a faster queue than those, please advise :)
Also, the queue will see lots of enqueue and dequeue operations, so it had better have a smart way to manage memory.
For a single threaded application you can often get around having to use any type of queue at all simply by processing the next item as it comes in, but there are many applications where this isn't the case (queuing up data for output, for instance).
Without the need to lock the queue (no other threads to worry about) a simple circular buffer is going to be hard to beat for performance. If for some reason the queue needs to grow after creation this is a little bit more difficult, but you shouldn't have a hard time finding a circular buffer queue implementation (or building your own). If either inserting or extracting are done in a signal handler (or interrupt service routine) then you may actually need to protect the read and/or write position indexes, but if you know your target well you may be able to determine that this is not the case (when in doubt protect, though). Protection would be by either temporarily blocking the signals or interrupts that could put things in your queue. (You would really need to block this if you were to need to resize the queue)
If whatever you are putting in the queue has to be dynamically allocated anyway, then you might want to just tack on a pointer and turn the thing into a list node. A singly linked list where the list master holds a pointer to the first node and a pointer to the last node is sufficient. Extract from the head and insert at the tail. Here protecting the inserts and extractions from race conditions is pretty much independent, and you only need to worry about things when the length of the list is very low. If you truly do have a single-threaded application then you don't have to worry about it at all.
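A single-threaded sketch of that linked-list queue is shown below. The struct layout and the int payload are placeholders; in practice the next pointer would be tacked onto whatever object is already being allocated.

```c
#include <stddef.h>

struct node {
    struct node *next;
    int value;           /* placeholder payload */
};

struct queue {
    struct node *head;   /* extract here */
    struct node *tail;   /* insert here */
};

/* O(1) insert at the tail. The caller owns the node's memory. */
void queue_push(struct queue *q, struct node *n)
{
    n->next = NULL;
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;     /* queue was empty */
    q->tail = n;
}

/* O(1) extract from the head. Returns NULL when the queue is empty. */
struct node *queue_pop(struct queue *q)
{
    struct node *n = q->head;
    if (n != NULL) {
        q->head = n->next;
        if (q->head == NULL)
            q->tail = NULL;   /* removed the last node */
    }
    return n;
}
```

The two branches that touch both head and tail are the "very low length" cases mentioned above: they are the only places where an insert and an extract can interfere with each other.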
I don't have any actual benchmarks and can't make any suggestions about any library implementations, but both methods are O(1) for both insert and extract. The first is more cache (and memory pager) friendly unless your queue size is much larger than it needs to be. The second method is less cache friendly since each member of the queue can be in a different area of RAM.
Hope this helps you evaluate or create your own queue.
