libuv logging best practice? - libuv

I've got a std::stringstream in my program that is periodically flushed (with a timer) to a log file. The flushing and timer are on the default run loop.
Other parts of the app just append to that std::stringstream and the timer takes care of the rest. I do limit the size of the stringstream (to 1 MB), so I drop messages if the stream is "full".
I'm just wondering, is this best practice for:
performance? Is being on the main thread OK to handle this IO? Can I do better?
Critical errors? The problem could be within my usage of libuv, which could mean that libuv-based logging would be borked too?
How does node.js handle logging?

I think that question is more complex than one might think at first sight.
One part of a good response has little to do with libuv and much to do with your concrete needs and tradeoffs. While, for instance, some buffering (i.e. less frequent write syscalls) is good, it also introduces a problem that might (or might not) hit you hard in the area of logging. Reason: buffered data dies along with the application if the application dies. That, however, goes against the very logic of logging.
As for libuv and performance, my personal experience is that
a) one wants to find a good balance between writing info out and buffering. In your case my gut feeling is that you are buffering too much and that you should probably write out more frequently.
b) one wants to think carefully about performance, both in terms of whether it's really critical at all, and in terms of details. The latter is of increasing importance under heavy server load. When you serve some 100 connections it's probably irrelevant, but if you serve tens or hundreds of thousands of connections it might be too costly to use convenience functions like fprintf.
Concrete example: in a highly loaded situation you might want to get the wall time once at startup, along with the (then) current value of a monotonic timer (which is very cheap to read). Any time information can then be relative to that start value (a simple subtraction), and writing it out becomes: preformatted start wall time plus monotonic diff (e.g. "03:52:41 +123456 ms").
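As a rough sketch of that scheme (POSIX assumed, names illustrative): capture the wall-clock time once at startup, then log only a cheap monotonic offset from it.

/* Sketch of the timestamp scheme described above: capture the wall clock
 * once at startup, then log only a cheap monotonic offset from it.
 * POSIX clock_gettime() assumed; names are illustrative. */
#include <stdio.h>
#include <time.h>

static char            start_wall[16];   /* preformatted "HH:MM:SS" */
static struct timespec start_mono;

void log_time_init(void)
{
    time_t now = time(NULL);
    struct tm tm_now;
    localtime_r(&now, &tm_now);
    strftime(start_wall, sizeof start_wall, "%H:%M:%S", &tm_now);
    clock_gettime(CLOCK_MONOTONIC, &start_mono);
}

/* Writes e.g. "03:52:41 +123456 ms" into buf. */
int log_time_format(char *buf, size_t len)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    long long ms = (now.tv_sec - start_mono.tv_sec) * 1000LL
                 + (now.tv_nsec - start_mono.tv_nsec) / 1000000LL;
    return snprintf(buf, len, "%s +%lld ms", start_wall, ms);
}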
Another point with your scenario is that a modern OS will virtually always provide excellent buffering, so it usually doesn't make a lot of sense to buffer too much yourself.
All in all I'd suggest using a buffer of about 16 KB or 32 KB and writing it out more frequently. If (and only if) your scenario is high performance/heavy load, you may want to avoid convenient but expensive functions.
As far as libuv is concerned I wouldn't worry. Depending on your OS and the libuv version, file operations (as opposed to socket operations) may indeed be pseudo-asynchronous (emulated through a thread pool), but my experience is that libuv is not the problem; rather, for instance, your large buffer may be a problem, as it might well be written out in multiple chunks.
Regarding your timer-based approach, you might want to have a look at the libuv idle mechanism, and also take care of the problem of a full buffer. Simply throwing away logging info seems unacceptable to me; after all, you're not logging for the fun of it but because that info is presumably important (if it weren't, you wouldn't have a problem in the first place; the solution would then be simple: log less).
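A minimal sketch of what such a setup might look like in plain C with libuv (untested, names illustrative, error handling omitted): a repeating timer flushes a modest buffer, and crossing a high-water mark triggers an immediate flush instead of dropping messages.

/* Sketch of the suggestion above: a modest log buffer flushed by a
 * repeating uv_timer, plus an immediate flush when appends cross a
 * high-water mark instead of dropping messages. Illustrative only. */
#include <fcntl.h>
#include <string.h>
#include <uv.h>

#define LOG_BUF_SIZE   (32 * 1024)
#define LOG_HIGH_WATER (LOG_BUF_SIZE * 3 / 4)

static char       log_buf[LOG_BUF_SIZE];
static size_t     log_len;
static uv_file    log_fd;
static uv_timer_t log_timer;

static void log_flush(void)
{
    if (log_len == 0)
        return;
    uv_fs_t req;
    uv_buf_t buf = uv_buf_init(log_buf, (unsigned int)log_len);
    /* NULL callback => synchronous write; an async write would need the
     * buffer to stay untouched until its callback runs. */
    uv_fs_write(uv_default_loop(), &req, log_fd, &buf, 1, -1, NULL);
    uv_fs_req_cleanup(&req);
    log_len = 0;
}

static void on_flush_timer(uv_timer_t *handle)
{
    (void)handle;
    log_flush();
}

void log_append(const char *msg, size_t len)
{
    if (len > LOG_BUF_SIZE)
        len = LOG_BUF_SIZE;            /* sketch: truncate oversized messages */
    if (len > LOG_BUF_SIZE - log_len)
        log_flush();                   /* flush now rather than drop */
    memcpy(log_buf + log_len, msg, len);
    log_len += len;
    if (log_len >= LOG_HIGH_WATER)
        log_flush();
}

void log_init(const char *path)
{
    uv_fs_t req;
    log_fd = uv_fs_open(uv_default_loop(), &req, path,
                        O_WRONLY | O_CREAT | O_APPEND, 0644, NULL);
    uv_fs_req_cleanup(&req);
    uv_timer_init(uv_default_loop(), &log_timer);
    uv_timer_start(&log_timer, on_flush_timer, 500, 500);
}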
Finally, I'd like to make a more general remark: the secret here is balance, not optimized performance of single details. You want to keep the whole system nicely balanced rather than, for instance, optimizing by using large buffers, which in the end just pushes the problem to another level rather than solving it.
I like to think of this problem field the way I think of moving, say, a company headquarters: the issue isn't having the fastest truck but having all of them be reasonably fast; in other words, a well-balanced approach.

Honestly there are better options than logging by hand. If you are programming an application it's often faster, in both development and execution time, to use a library.
If you are programming to learn, then I'd advise taking a look at spdlog (the fastest approach) and g3log, which claims to have the best worst case.
In my experience, std::stringstream is not fast enough to be part of a logging system.

Related

How to choose for multithreading - c

I have to write a client-server program in C where the server can use n threads that work simultaneously to manage client requests.
To do this I use a listener socket that puts the FD of each new connection request into a list, and the threads take one from it when they are able to.
I know that I can also use a pipe for communication between threads.
Is the socket the best way? Why or why not?
Sorry for my bad English.
To communicate between threads you can use sockets as well as shared memory.
For multithreading there are many libraries available on GitHub; one that I have used is the one below.
https://github.com/snikulov/prog_posix_threads/blob/master/workq.c
I tried and tested the same approach you describe, and it works perfectly.
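The core of that library - and of the setup described in the question - is a mutex/condition-variable-protected queue of file descriptors. A minimal sketch of the idea (not the linked library's API; names illustrative, error handling and shutdown omitted):

/* The listener thread pushes new connection FDs after accept();
 * worker threads pop them when they are free. */
#include <pthread.h>

#define FD_QUEUE_CAP 128

typedef struct {
    int             fds[FD_QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
} fd_queue;

void fd_queue_init(fd_queue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Called by the listener thread. Returns -1 if the queue is full. */
int fd_queue_push(fd_queue *q, int fd)
{
    pthread_mutex_lock(&q->lock);
    if (q->count == FD_QUEUE_CAP) {
        pthread_mutex_unlock(&q->lock);
        return -1;
    }
    q->fds[q->tail] = fd;
    q->tail = (q->tail + 1) % FD_QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Called by worker threads; blocks until a connection is available. */
int fd_queue_pop(fd_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int fd = q->fds[q->head];
    q->head = (q->head + 1) % FD_QUEUE_CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return fd;
}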
There's one very nice resource related to socket multiplexing which I think you should stop and read after reading this answer. That resource is entitled The C10K problem, and it details numerous solutions to the problem people faced in the year 2000, of handling 10000 clients.
Of those solutions, multithreading is not the primary one. Indeed, multithreading as an optimisation should be one of your last resorts, as that optimisation will interfere with the instruments you use to diagnose other optimisations.
In general, here is how you should perform optimisations, in order to provide guaranteed justifications:
Use a profiler to determine the most significant bottlenecks (in your single-threaded program).
Perform your optimisation upon one of the more significant bottlenecks.
Use the profiler again, with the same set of data, to verify that your optimisation worked correctly.
You can repeat these steps ad infinitum until you decide the improvements are no longer tangible (meaning, good luck observing the differences between before and after). Following these steps will provide you with data you can show your employer, if he/she asks you what you've been doing for the last hour, so make sure you save the output of your profiler at each iteration.
Optimisations are per-machine; what this means is that an optimisation for your machine might actually be slower on another machine. For example, you may use a buffer of 4096 bytes for your machine, while the cache lines for another machine might indicate that 512 bytes is a better idea.
Hence, ideally, we should design programs and modules in such a way that their resource use is minimal and can easily be scaled up, substituted and/or otherwise adjusted for other machines. This can be difficult, as it means in the buffer example above you might start off with a buffer of one byte; you'd most likely need to study finite state machines to achieve that, and using buffers of one byte might not always be technically feasible (i.e. when dealing with fields that are guaranteed to be a certain width, you should use that width as your minimum limit and scale up from there). The reward is code that is ultra-portable and ultra-optimisable in all situations.
Keep in mind that extra threads use extra resources; we tend to assume that the stack space reserved for a thread can grow to 1MB, so 10000 sockets occupying 10000 threads (in a thread-per-socket model) would occupy about 10GB of memory! Yikes! The minimal resources method suggests that we should start off with one thread, and scale up from there, using a multithreading profiler to measure performance like in the three steps above.
I think you'll find, though, that for anything purely socket-driven, you likely won't need more than one thread, even for 10000 clients, if you study the C10K problem or use some library which has been engineered based on those findings (see your comments for one such suggestion). We're not talking about masses of number crunching, here; we're talking about socket operations, which the kernel likely processes using a single core, and so you can likely match that single core with a single thread, and avoid any context switching or thread synchronisation troubles/overheads incurred by multithreading.

Making process survive failure in its thread

I'm writing an app that has many independent threads. Since I'm doing quite low-level, dangerous stuff there, threads may fail (SIGSEGV, SIGBUS, SIGFPE), but they should not kill the whole process. Is there a way to do this properly?
Currently I intercept the aforementioned signals and call pthread_exit(NULL) in their signal handlers. It seems to work, but since pthread_exit is not an async-signal-safe function I'm a bit concerned about this solution.
I know that splitting this app into multiple processes would solve the problem, but in this case it's not a feasible option.
EDIT: I'm aware of all the Bad Things™ that can happen (I'm experienced in low-level system and kernel programming) due to ignoring SIGSEGV/SIGBUS/SIGFPE, so please try to answer my particular question instead of giving me lessons about reliability.
The PROPER way to do this is to let the whole process die, and start another one. You don't explain WHY this isn't appropriate, but in essence, that's the only way that is completely safe against various nasty corner cases (which may or may not apply in your situation).
I'm not aware of any method that is 100% safe that doesn't involve letting the whole process die. (Note also that sometimes just the act of continuing from these sorts of errors is "undefined behaviour" - it doesn't mean that you are definitely going to fall over, just that it MAY be a problem.)
It's of course possible that someone knows of some clever trick that works, but I'm pretty certain that the only 100% guaranteed method is to kill the entire process.
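If restarting at the process level is acceptable, the usual shape is a small supervisor that forks the real worker and restarts it when it dies on a signal. A rough POSIX sketch (run_worker is a hypothetical stand-in for the actual application):

/* Tiny supervisor: fork the worker and restart it if it is killed by a
 * signal such as SIGSEGV. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

extern int run_worker(void);   /* hypothetical: the real application */

int main(void)
{
    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return 1;                      /* fork failed */
        if (pid == 0)
            exit(run_worker());            /* child: do the dangerous work */

        int status;
        waitpid(pid, &status, 0);          /* parent: wait for the child */
        if (WIFEXITED(status))
            return WEXITSTATUS(status);    /* clean exit: stop supervising */
        if (WIFSIGNALED(status))
            fprintf(stderr, "worker died on signal %d, restarting\n",
                    WTERMSIG(status));
    }
}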
Low-latency code design involves a careful "be aware of the system you run on" type of coding and deployment. That means, for example, that standard IPC mechanisms (say, using SysV msgsnd/msgget to pass messages between processes, or pthread_cond_wait/pthread_cond_signal on the PThreads side) as well as ordinary locking primitives (adaptive mutexes) are to be considered rather slow ... because they involve something that takes thousands of CPU cycles ... namely, context switches.
Instead, use "hot-hot" handoff mechanisms such as the disruptor pattern - both producers as well as consumers spin in tight loops permanently polling a single or at worst a small number of atomically-updated memory locations that say where the next item-to-be-processed is found and/or to mark a processed item complete. Bind all producers / consumers to separate CPU cores so that they will never context switch.
In this type of use case, whether you use separate threads (and get the memory sharing implicitly by virtue of all threads sharing the same address space) or separate processes (and get the memory sharing explicitly by using shared memory for the data-to-be-processed as well as the queue management "metadata") makes very little difference, because the TLBs and data caches are "always hot" (you never context switch).
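As a stripped-down illustration of such a hot-hot handoff, here is a single-producer/single-consumer ring where both sides spin on atomically updated indices instead of blocking (C11 atomics; a sketch, not production code):

/* Single-producer/single-consumer ring: both sides spin instead of
 * blocking. Capacity must be a power of two; names are illustrative. */
#include <stdatomic.h>
#include <stdint.h>

#define RING_CAP 1024                      /* power of two */

typedef struct {
    _Atomic uint64_t head;                 /* next slot the consumer reads  */
    _Atomic uint64_t tail;                 /* next slot the producer writes */
    void            *slots[RING_CAP];
} spsc_ring;

/* Producer: spins until a slot is free, then publishes the item. */
void ring_push(spsc_ring *r, void *item)
{
    uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    while (tail - atomic_load_explicit(&r->head, memory_order_acquire)
           >= RING_CAP)
        ;                                  /* spin: ring is full */
    r->slots[tail & (RING_CAP - 1)] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
}

/* Consumer: spins until an item is available, then takes it. */
void *ring_pop(spsc_ring *r)
{
    uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    while (head == atomic_load_explicit(&r->tail, memory_order_acquire))
        ;                                  /* spin: ring is empty */
    void *item = r->slots[head & (RING_CAP - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return item;
}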
If your "processors" are unstable and/or have no guaranteed completion time, you need to add a "reaper" mechanism anyway to deal with failed / timed out messages, but such garbage collection mechanisms necessarily introduce jitter (latency spikes). That's because you need a system call to determine whether a specific thread or process has exited, and system call latency is a few micros even in best case.
From my point of view, you're trying to mix oil and water here; you're required to use library code not specifically written for use in low-latency deployments / library code not under your control, combined with the requirement to do message dispatch with nanosecond latencies. There is no way to make e.g. pthread_cond_signal() give you nanosecond latency because it must do a system call to wake the target up, and that takes longer.
If your "handler code" relies on the "rich" environment, and a huge amount of "state" is shared between these and the main program ... it sounds a bit like saying "I need to make a steam-driven airplane break the sound barrier"...

Pushing code towards kernel or user space, for performance reasons?

Originally I thought that to make code faster it would be better to try to reduce the transitions between kernel and user space, by pushing more of the code to run in the kernel. However, I have read in a few forums like SO that the opposite is actually done: more of the code is pushed into user space. Why is this? It seems counter-intuitive: putting more of the code into user space still requires kernel-user transitions, whereas putting the code in the kernel doesn't require kernel-user transitions.
In case anyone asks- I am thinking about an application processing packet data.
EDIT
More details: I am thinking about when packet data arrives. I want to rewrite the network stack, cut out code which isn't applicable to my packet processing, and have zero copy: putting the packet data somewhere the user program can access it as quickly as possible.
The kernel is a time-sensitive area; it's where your ISRs, time-tick routines, and hardware critical sections reside. Because of this, the objective is to keep kernel code small and tight: get in, get your work done, and get out.
In your case you're getting packets from the network, that's a hardware dependent task (you need to get data from the lower network layers), so get your data, clear the buffers, and send it via a DMA transfer to user space; then do your processing in user space.
In my experience, the performance gained by executing your code in the kernel will not outweigh the performance lost overall by executing more code in the kernel.
If you expect your code to go into the official kernel release, "shuffling user mode parts of it into the kernel" is probably a bad idea as a rule.
Of course, if you can prove that doing so is the BEST (subjective, I know) way to achieve better performance, and the cost is acceptable (in terms of extra code in the kernel -> more maintenance burden on the kernel, bigger kernel -> more complaints about the kernel being "too big", etc.), then by all means follow that route.
But in general, it's probably better to approach this by doing more work in user-mode, and make the kernel mode task smaller, if that is at all an alternative. Without knowing exactly what you are doing in the kernel and what you are doing in usermode, it's hard to say for sure what you should/shouldn't do. But for example batching up a dozen "items" into a block that is ONE request for the kernel to do something is a better option than calling the kernel a dozen times.
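On Linux, for packet work specifically, recvmmsg() is one concrete form of that batching: a single system call can drain a whole batch of datagrams, so the user/kernel transition cost is paid once per batch rather than once per packet. A short sketch (Linux-specific, illustrative):

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH    12
#define PKT_SIZE 2048

/* Receives up to BATCH datagrams with one system call. Returns how many
 * were received; msgs[i].msg_len would give each packet's length. */
int read_packet_batch(int sock, char bufs[BATCH][PKT_SIZE])
{
    struct mmsghdr msgs[BATCH];
    struct iovec   iovs[BATCH];
    memset(msgs, 0, sizeof msgs);

    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base           = bufs[i];
        iovs[i].iov_len            = PKT_SIZE;
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    return recvmmsg(sock, msgs, BATCH, 0, NULL);
}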
In response to your edit describing what you are doing:
Would it not be better to pass a user-mode memory region to receive the data, and then just copy into that when the packet arrives? Assuming "all memory is equal" [if it isn't, you have problems with "in place use" anyway], this should work just as well, with less time spent in the kernel.
Transitions from user-mode to kernel-mode take some time and resources, so keeping the code in only one of the modes may increase performance.
As mentioned: in your case probably the best option you have is to fetch the data as fast as possible and make it available in user-land right away and do the processing in user-land... moving all the processing to kernel-level seems to me unnecessary... Unless you have a good reason to do so... with no further information it seems to me you have no reason to believe you'll do it faster in kernel-mode than user-mode, all you could spare is a mode transition now and then, which shouldn't be relevant.

Select in socket programming

Is there any use in using the select() function?
From my (small) experience I tend to believe that threads are enough.
So I wonder, is select() just a didactic tool for people who don't yet know threads?
Consider the following example. You have a moderately busy web server with something like 100K connections. You're not using select or anything like it, so you have one thread per connection, implying 100K threads, which quickly becomes a problem.
Even if you tweak your system until it allows such a monstrosity, most of the threads will just wait on a socket. Wouldn't it be better if there was a mechanism to notify you when a socket becomes interesting ?
Put another way, threading and select-like mechanisms are complementary. You just can't use threads to replace the simple thing select does: monitoring file descriptors.
Single-threaded polling is by far simpler to use, implement and (most importantly) understand. Concurrent programming adds a huge intellectual cost to your project: Synchronising data is tricky and error-prone, locking introduces many opportunities for bugs, lock-free data structures cause performance hits, and the program flow becomes hard to visualize mentally (or "serialize" perhaps).
By contrast, single-threaded polling (maybe with epoll/kqueue rather than select) gives you generally very good performance (depending of course on what exactly you're doing in response to data) while remaining straightforward.
In Linux in particular, you can have timerfds, eventfds, signalfds and inotify fds, as well as nested epoll fds, all sitting together in your polling set, giving you a very uniform way of dealing with all sorts of "asynchronous" events. If you eventually need more performance, you have a single point of parallelism by running several pollers concurrently, and much of the data synchronisation is done for you by the kernel, which promises that only one single thread receives a successful poll in the event of readiness.
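To give a flavour of that "everything is a file descriptor" style, here is a rough Linux sketch of a single thread epoll-waiting on both a listening socket and a periodic timerfd (error handling omitted; names illustrative):

#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/timerfd.h>
#include <unistd.h>

void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);

    /* A 1-second periodic timer, delivered as a readable fd. */
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    struct itimerspec its = { .it_interval = { 1, 0 }, .it_value = { 1, 0 } };
    timerfd_settime(tfd, 0, &its, NULL);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
    ev.data.fd = tfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);
        for (int i = 0; i < n; i++) {
            if (ready[i].data.fd == tfd) {
                uint64_t expirations;
                read(tfd, &expirations, sizeof expirations);
                /* periodic housekeeping goes here */
            } else if (ready[i].data.fd == listen_fd) {
                int client = accept(listen_fd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = client;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
            } else {
                /* data ready on a client socket: read and handle it */
            }
        }
    }
}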

Skip to end of time quanta

Is it possible to skip to the end of a process's allocated time-quantum? I have a program that works in parallel on a piece of shared memory, and then all of the processes need to wait for the others to finish and sync up before the next step. Each process will do a maximum of one iteration more than any other, so any timing differences are minimal.
microsleep almost works, but I'm pretty sure that even usleep(1) would take longer than I'd like (as of now I can do 5000 times through in about 1.5 seconds, so that would add around 20ms to a test).
Some kind of busy wait seems like a bad idea, though it's what I might end up going with.
What I would really like is something along the lines of
while (*everyoneDone != 0) {
    // give up rest of this time-quantum
}
It doesn't need to be realtime, it just needs to be fast. Any ideas?
Note that this will be run on a multiproc machine, because if there's only one core to work with, the existing single-threaded version is going to perform better.
Don't do that; active waiting is almost always a bad idea in an application context. Use pthread_barrier_t, which is exactly the tool intended for your purpose.
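For reference, the barrier-based version of the loop in the question might look roughly like this (thread-based sketch with illustrative names; for separate processes, as in the question, the barrier would need to live in shared memory and be initialised with the PTHREAD_PROCESS_SHARED attribute):

#include <pthread.h>

#define NUM_WORKERS 4

static pthread_barrier_t step_barrier;

extern void do_iteration(int worker_id, int step);  /* hypothetical work */

void *worker(void *arg)
{
    int id = *(int *)arg;
    for (int step = 0; step < 5000; step++) {
        do_iteration(id, step);
        /* Blocks in the kernel instead of spinning; all workers are
         * released together once the last one arrives. */
        pthread_barrier_wait(&step_barrier);
    }
    return NULL;
}

void setup(void)
{
    pthread_barrier_init(&step_barrier, NULL, NUM_WORKERS);
}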
You don't state what OS you're working with, but if it's POSIX then sched_yield() might be what you're looking for.
Really though, you'd almost certainly be better off using a proper synchronisation primitive, like a semaphore.
