I'm trying to write an application that connects to multiple IPs/ports. The problem is that the number of IPs is not known in advance: one department may use it to connect to 2 IPs and another may connect to 8, so it has to be configurable at runtime. I'm thinking of using threads or fork inside a loop, but I'm not sure which one is better for the job. I hope someone can guide me here. I'm using C under Linux.
For example
one can run it like this: a.out ip1 port1 ip2 port2 ip3 port3
and another can run it like this: a.out ip1 port1
Thanks
I see four design choices here, each with advantages and disadvantages. Your choice will largely depend on what exactly your application does.
1 process / socket (fork): This has the advantage that a fatal error in one process (e.g., SEGFAULT) will not affect other processes. Disadvantages include the fact that the approach is more resource hungry and that processes are more difficult to coordinate (e.g., if you want to do dynamic load balancing).
1 thread / socket (pthreads): This has the advantage that it is pretty light and threads are easy to coordinate, since they share a common memory space. Disadvantages include the fact that an error in one thread may take your whole application down.
Finite-State Machine: You could use a single thread in a single process that polls all your sockets and then takes the right (non-blocking) action for each one, i.e., read, write or close. I have heard this is very fast on a single processor; however, it does not take advantage of multi-core architectures and is somewhat more difficult to program.
Hybrid: pick any of the three above and combine them. See for example the Apache server.
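Whichever of these designs you pick, the first step is the same: turn the command line into a list of endpoints whose length is only known at runtime. Here is a minimal sketch of that step, assuming IPv4 addresses and the `a.out ip1 port1 ip2 port2 ...` layout from the question. The name `parse_endpoints` is made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert "ip1 port1 ip2 port2 ..." into socket
 * addresses. Returns the number of endpoints parsed, or -1 on bad input. */
int parse_endpoints(int argc, char **argv, struct sockaddr_in *out, int max)
{
    int n = 0;

    assert(argv != NULL && out != NULL);

    /* argv[0] is the program name, so a valid command line has odd argc. */
    if (argc < 3 || (argc - 1) % 2 != 0)
        return -1;

    for (int i = 1; i + 1 < argc; i += 2) {
        char *end = NULL;
        long port;

        if (n >= max)
            return -1;

        errno = 0;
        port = strtol(argv[i + 1], &end, 10);
        if (errno != 0 || end == argv[i + 1] || *end != '\0' ||
            port < 1 || port > 65535)
            return -1;

        memset(&out[n], 0, sizeof out[n]);
        out[n].sin_family = AF_INET;
        out[n].sin_port = htons((uint16_t)port);
        if (inet_pton(AF_INET, argv[i], &out[n].sin_addr) != 1)
            return -1;
        n++;
    }
    return n;
}
```

Each parsed entry can then be handed to a forked process, a thread, or a single poll loop, so the parsing does not commit you to any one of the designs above.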
Related
My C program has two threads, both of which interact with two external interfaces. There's too much code for one source file, so I'm splitting it in two. What is the right split?
One thread, MtoD, takes a message off an IPC message queue, processes it, and then sends commands to the driver of a physical interface. The other thread, DtoM, receives interrupts from that driver, processes the input, and then posts the results in a message to an IPC queue.
The obvious ways to split the code in two are:
by thread: two source files, MtoD.c and DtoM.c, each holding all the functions of a single thread - but both files will have to deal with both of the interfaces
by interface: two source files, M.c and D.c, each doing all the business related to a certain external interface - but the threads run through both files.
My concerns are
code maintenance. Doing it by thread makes it easier to follow the logic of a thread (no switching between files). But someone who'd write this object-oriented would probably wrap the interface to the IPC queues in one class, which would be in one file, and the driver interface in another, in the other file.
performance. If you have object files M.o and D.o, each will have just one external library to deal with - but they have to call into each other during execution of a thread. Does that incur any overhead (if the linker has made them into one binary)? If you have MtoD.o and DtoM.o, you could declare most functions as static, which might enable some more compiler optimizations. But would they both need links with the external libraries?
Which way is optimal?
That's an interesting one, and you will probably get BOTH options recommended, simply because both have advantages and disadvantages, and it depends very much on how one values them.
OK, a third option: one thread? If I understand you correctly, you connect an interface to an IPC queue, so what if one thread reacts to input on either side and sends it out the other side? I don't think you lose much response time this way, if any, and you have it all in one place. If the source is too big, you can look at which classes you can naturally separate, rather than separating by thread or by interface.
I have to write a client-server program in C in which the server can use n threads that work simultaneously to handle client requests.
To do this, I use a socket with a listener that puts the new FD (of each new connection request) in a list, and the threads take one from the list when they are able to.
I know that I can also use a pipe for communication between threads.
Is the socket the best way? Why or why not?
Sorry for my bad English
To communicate between threads, you can use sockets as well as shared memory.
There are many multithreading libraries available on GitHub; one that I have used is the one below.
https://github.com/snikulov/prog_posix_threads/blob/master/workq.c
I tried and tested it in the same way you describe, and it works perfectly!
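Since threads share memory, the "list of FDs" from the question can be a small in-process queue protected by a mutex and a condition variable. The sketch below is not taken from the linked workq.c; the names `fd_queue`, `fdq_push` and `fdq_pop` and the fixed capacity are assumptions made for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>

#define FDQ_CAP 64   /* illustrative capacity, not a recommendation */

/* Hypothetical bounded FIFO of file descriptors shared by all threads. */
struct fd_queue {
    int fds[FDQ_CAP];
    int head;                  /* index of the oldest entry */
    int count;                 /* number of queued entries  */
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
};

void fdq_init(struct fd_queue *q)
{
    q->head = 0;
    q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Called by the listener after accept(). Returns -1 if the queue is full. */
int fdq_push(struct fd_queue *q, int fd)
{
    int rc = -1;

    pthread_mutex_lock(&q->lock);
    if (q->count < FDQ_CAP) {
        q->fds[(q->head + q->count) % FDQ_CAP] = fd;
        q->count++;
        rc = 0;
        pthread_cond_signal(&q->not_empty);   /* wake one waiting worker */
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Called by worker threads. Blocks until a descriptor is available. */
int fdq_pop(struct fd_queue *q)
{
    int fd;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    fd = q->fds[q->head];
    q->head = (q->head + 1) % FDQ_CAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return fd;
}
```

The listener thread would call `fdq_push()` with each accepted descriptor and every worker would loop on `fdq_pop()`. A pipe or socket pair can do the same job, but it costs a system call per hand-off.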
There's one very nice resource related to socket multiplexing which I think you should stop and read after reading this answer. That resource is entitled The C10K problem, and it details numerous solutions to the problem people faced in the year 2000, of handling 10000 clients.
Of those solutions, multithreading is not the primary one. Indeed, multithreading as an optimisation should be one of your last resorts, as that optimisation will interfere with the instruments you use to diagnose other optimisations.
In general, here is how you should perform optimisations, in order to provide guaranteed justifications:
Use a profiler to determine the most significant bottlenecks (in your single-threaded program).
Perform your optimisation upon one of the more significant bottlenecks.
Use the profiler again, with the same set of data, to verify that your optimisation worked correctly.
You can repeat these steps ad infinitum until you decide the improvements are no longer tangible (meaning, good luck observing the differences between before and after). Following these steps will provide you with data you can show your employer, if he/she asks you what you've been doing for the last hour, so make sure you save the output of your profiler at each iteration.
Optimisations are per-machine; what this means is that an optimisation for your machine might actually be slower on another machine. For example, you may use a buffer of 4096 bytes for your machine, while the cache lines for another machine might indicate that 512 bytes is a better idea.
Hence, ideally, we should design programs and modules in such a way that their resources are minimal and can easily be scaled up, substituted and/or otherwise adjusted for other machines. This can be difficult, as it means that in the buffer example above you might start off with a buffer of one byte; you'd most likely need to study finite state machines to achieve that, and using buffers of one byte might not always be technically feasible (i.e. when dealing with fields that are guaranteed to be a certain width; you should use that width as your minimum limit, and scale up from there). The reward is a design that is ultra-portable and ultra-optimisable in all situations.
Keep in mind that extra threads use extra resources; we tend to assume that the stack space reserved for a thread can grow to 1MB, so 10000 sockets occupying 10000 threads (in a thread-per-socket model) would occupy about 10GB of memory! Yikes! The minimal resources method suggests that we should start off with one thread, and scale up from there, using a multithreading profiler to measure performance like in the three steps above.
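To make that estimate concrete, here is the arithmetic as a tiny helper. The 1 MB per-thread figure is the assumption from the paragraph above, not a measured value, and the function name is made up for illustration.

```c
#include <assert.h>

/* Worst-case address space reserved for thread stacks, in bytes.
 * Both inputs are assumptions supplied by the caller. */
unsigned long long stack_reservation_bytes(unsigned long long threads,
                                           unsigned long long stack_bytes)
{
    return threads * stack_bytes;
}
```

With 10000 threads and 1 MiB each, `stack_reservation_bytes(10000, 1024ULL * 1024)` gives 10485760000 bytes, which is the "about 10GB" above. This is reserved address space; how much of it becomes resident memory depends on how deep each stack actually grows.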
I think you'll find, though, that for anything purely socket-driven, you likely won't need more than one thread, even for 10000 clients, if you study the C10K problem or use some library which has been engineered based on those findings (see your comments for one such suggestion). We're not talking about masses of number crunching, here; we're talking about socket operations, which the kernel likely processes using a single core, and so you can likely match that single core with a single thread, and avoid any context switching or thread synchronisation troubles/overheads incurred by multithreading.
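As a rough illustration of the one-thread approach, here is a minimal sketch built on Linux epoll, one of the mechanisms the C10K page discusses. The helper names `watch_fd` and `service_ready` are invented for this example, and the "processing" is reduced to counting bytes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Register a descriptor for "readable" notifications. */
int watch_fd(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* One iteration of the event loop: wait for activity, then read once from
 * each ready descriptor. Returns the total bytes read, or -1 on error. */
long service_ready(int epfd, int timeout_ms)
{
    struct epoll_event events[64];
    char buf[4096];
    long total = 0;
    int n = epoll_wait(epfd, events, 64, timeout_ms);

    if (n < 0)
        return -1;

    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        ssize_t got = read(fd, buf, sizeof buf);

        if (got > 0) {
            total += got;          /* a real server would parse buf here */
        } else if (got == 0) {     /* peer closed: stop watching this fd */
            epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
            close(fd);
        }
    }
    return total;
}
```

A server would create the epoll instance once with `epoll_create1(0)`, call `watch_fd()` for every accepted socket, and then call `service_ready()` in a loop from its single thread.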
Imagine that we have a client which keeps sending lots of double values.
Now we are trying to write a server which can receive and process the data from the client.
Here are the facts:
The server can receive a double in a very short time.
There is a function on the server that processes a double, and it needs more than 3 minutes to process a single double.
We need the server to process 1000 doubles from the client as fast as possible.
My idea is as follows:
Use a thread pool to create many threads, where each thread processes one double.
All of these are in Linux.
My question:
For now my server is just one process that contains multiple threads. I'm wondering whether it would be faster if I used fork().
I think using only fork() without multithreading would be a bad idea, but what if I create two processes, each containing multiple threads? Could this be faster?
Btw I have read:
What is the difference between fork and thread?
Forking vs Threading
To a certain degree, this very much depends on the underlying hardware. It also depends on memory constraints, IO throughput, ...
Example: if your CPU has 4 cores, and each one is able to run two threads (and not much else is going on on that system); then you probably would prefer to have a solution with 4 processes; each one running two threads!
Or, when working with fork(), you would fork() 4 times; but within each of the forked processes, you should be distributing your work to two threads.
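A minimal sketch of the "fork N times" half of that idea, with the number of processes passed in rather than hard-coded. The names `run_worker_processes` and `worker_main` are hypothetical, and the threads inside each worker are left out.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the real per-process work (which could start threads). */
static int worker_main(int index)
{
    (void)index;
    return 0;
}

/* Fork up to nproc workers and wait for all of them.
 * Returns how many workers exited with status 0. */
int run_worker_processes(int nproc)
{
    int started = 0;
    int ok = 0;

    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();

        if (pid < 0)
            break;                      /* could not fork any more */
        if (pid == 0)
            _exit(worker_main(i));      /* child never returns here */
        started++;
    }

    for (int i = 0; i < started; i++) {
        int status;

        if (wait(&status) > 0 && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Calling `run_worker_processes(4)` matches the 4-core example; because the count is a parameter, it can come from a command-line option or a configuration file instead.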
Long story short, what you really want to do is: to not lock yourself into some corner. You want to create a service (as said, you are building a server, not a client) that has a sound and reasonable design.
And given your requirements, you want to build that application in a way that allows you to configure how many processes and threads it will use. Then you start profiling (meaning: you measure what is going on); maybe you run experiments to find the optimum for a given piece of hardware / OS stack.
EDIT: I feel tempted to say: welcome to the real world. You are facing the requirement to meet precise "performance goals" for your product. Without such goals, programmer life is pretty easy: most of the time, one just sits down, puts together a reasonable product, and given the power of today's hardware, "things are good enough".
But if things are not good enough, then there is only one way: you have to learn about all the things that play a role here, starting with questions like "which system calls in my OS can I use to get the correct number of cores/threads?"
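On Linux with glibc, one such call is `sysconf(_SC_NPROCESSORS_ONLN)`, which reports the number of processors currently online. Note that it is a widely available extension rather than part of POSIX. The sketch below wraps it; `worker_count` is a made-up helper showing one possible policy (use the configured value if there is one, otherwise one worker per CPU).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <unistd.h>

/* Number of processors currently online, never less than 1. */
long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);

    return n > 0 ? n : 1;
}

/* Hypothetical policy: honour an explicit setting, else one worker per CPU. */
long worker_count(long requested)
{
    return requested > 0 ? requested : online_cpus();
}
```

This gives only a starting point; the profiling described above decides whether that count is actually the optimum.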
In other words: the days in which you "got away" without knowing about the exact capacity of the hardware you are using ... are over. If you intend to "play this game"; then there are no detours: you will have to learn the rules!
Finally: the most important thing here is not about processes versus threads. You have to understand that you need to grasp the whole picture here. It doesn't help if you tune your client for maximum CPU performance ... to then find that network or IO issues cause 10x of "loss" compared to what you gained by looking at CPU only. In other words: you have to look at all the pieces in your system; and then you need to measure to understand where you have bottlenecks. And then you decide the actions to take!
One good read on this is "Release It" by Michael Nygard. Of course, his book is mainly about patterns in the Java world, but he does a great job of explaining what "performance" really means.
Forking as such is way slower than kicking off a thread. A thread is much more lightweight (traditionally, although processes have caught up in recent years) than a full OS process, not only in terms of CPU requirements, but also with regard to memory footprint and general OS overhead.
As you are thinking about a pre-arranged pool of threads or processes, setup time would not count for much during the runtime of your program, so you need to look into the cost of inter-process communication, which is (locally) generally cheaper between threads than between processes (threads do not need to go through the OS to exchange data, only for synchronisation, and in some cases you can even get away without that). Unfortunately, you do not state whether there is any need for IPC between worker threads.
Summed up: I cannot see any advantage of using fork(), at least not with regards to efficiency.
My application creates a thread per connection. The application runs under a non-zero user ID, and sometimes the number of threads exceeds the default limit of 1024. I want to raise this limit, and I have a few options:
run as root [a very bad idea that also compromises security, so I am dropping it]
run as an unprivileged user, use setcap to grant the CAP_SYS_RESOURCE capability, and then add the following code to my program:
#include <sys/resource.h>
struct rlimit rlp; /* will be initialized later with nprocs (the maximum number of desired threads) */
setrlimit(RLIMIT_NPROC, &rlp);
/* RLIMIT_NPROC
 * The maximum number of processes (or, more precisely on Linux, threads) that can
 * be created for the real user ID of the
 * calling process. Upon encountering this limit, fork(2) fails with the error
 * EAGAIN. */
The other option is editing /etc/security/limits.conf, where I can simply add entries for the development user, e.g.
#devuser hard nproc 20000
#devuser soft nproc 10000
where 10k is enough. Being a little reluctant to change the source code, should I proceed with the last option? I am also curious to know which approach is more robust and standard.
Seeking your opinions, and thank you in advance :)
PS: What will happen if a single process runs more than 1k threads? Of course, I also have 32GB of RAM.
First, I believe you are wrong in having nearly a thousand threads. Threads are quite costly, and it is usually not reasonable to have so many of them. I would suggest having a few dozen threads at most (unless you run on a very costly super-computer).
You could have some event loop around a multiplexing syscall like poll(2). Then a single thread can deal with many thousands of connections. Read about the C10K problem and epoll. Consider using some event libraries like libevent or libev etc...
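The core of such an event loop is a single poll(2) call over an array of descriptors. Here is a minimal sketch; the function name `first_readable` and the cap of 64 descriptors are assumptions made to keep the example short.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_WATCHED 64   /* illustrative cap for this sketch */

/* Wait until one of the descriptors is readable.
 * Returns its index in fds, or -1 on timeout or error. */
int first_readable(const int *fds, int nfds, int timeout_ms)
{
    struct pollfd pfds[MAX_WATCHED];

    if (nfds < 1 || nfds > MAX_WATCHED)
        return -1;

    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
        return -1;

    for (int i = 0; i < nfds; i++)
        if (pfds[i].revents & POLLIN)
            return i;
    return -1;
}
```

A real loop would service every ready descriptor per wake-up rather than just the first, and libraries such as libevent or libev take care of that bookkeeping for you.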
You could start your application as root (perhaps by using setuid techniques), set up the required resources (in particular, opening privileged TCP/IP ports), and change the user with setreuid(2).
Read Advanced Linux Programming...
You could also wrap your application in a tiny setuid C program which increases the limits using setrlimit(2), changes the user with setreuid, and finally runs your real program with execve(2).
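The limit-raising step of such a wrapper might look like the sketch below. It only touches the soft limit and clamps the request to the current hard limit, which an unprivileged process is allowed to do. Raising the hard limit itself is what needs root or CAP_SYS_RESOURCE. The function name is made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/resource.h>

/* Set the soft RLIMIT_NPROC to `wanted`, clamped to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the system call). */
int set_nproc_soft_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;

    /* Never ask for more than the hard limit allows. */
    if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max)
        wanted = rl.rlim_max;

    rl.rlim_cur = wanted;
    return setrlimit(RLIMIT_NPROC, &rl);
}
```

In a wrapper, this call would come before dropping privileges and calling execve(2), since resource limits are preserved across exec.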
I work on Linux for ARM processor for cable modem. There is a tool that I have written that sends/storms customized UDP packets using raw sockets. I form the packet from scratch so that we have the flexibility to play with different options. This tool is mainly for stress testing routers.
I actually have multiple interfaces created. Each interface will obtain IP addresses using DHCP. This is done in order to make the modem behave as virtual customer premises equipment (vcpe).
When the system comes up, I start whichever processes have been requested. Every process that I start continuously sends packets, so process 0 sends packets using interface 0, and so on. Each of these processes allows configuration at run time (changes to UDP parameters and other options). That's the reason I decided to have separate processes.
I start these processes using fork and exec from the provisioning processes of the modem.
The problem now is that each process takes up a lot of memory. Starting just 3 such processes, causes the system to crash and reboot.
I have tried the following:
I have always assumed that pushing more code into shared libraries would help. To my surprise, moving many functions into a shared library and keeping minimal code in the processes made no difference. I also removed all arrays and made them use the heap, but that made no difference either. Maybe this is because the processes run continuously, so it makes no difference whether the memory is stack or heap? I suspect that the process from which I call fork is huge, and that this is why the processes I create end up huge. I am not sure how else to go about it. Say process A is huge, and I start process B by fork and exec; B inherits A's memory area. If instead A starts C, which in turn starts B, that will not help either, as C still inherits from A? I used vfork as an alternative, which did not help either. I do wonder why.
I would appreciate it if someone could give me tips to help reduce the memory used by each independent child process.
Given this is a test tool, then the most efficient thing to do is to add more memory to the testing machine.
Failing that:
How are you measuring memory usage? Some methods don't get accurate results.
Check that you don't have any memory leaks, e.g. with Valgrind on Linux x86.
You could try running the different testers in a single process, as different threads, or even multiplexed in a single thread, since the network should be the limiting factor?
exec() will shrink the process's memory size, as the new program gets a fresh memory map.
If you can't add physical memory, then maybe you can add swap, maybe just for testing?
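To illustrate the fork-then-exec point from the list above: the child should call exec as the very first thing, so that the copy of the parent's address space exists only for a moment. A minimal sketch follows; `spawn_and_wait` is a hypothetical name, and a long-running sender would not be waited for like this.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, immediately exec argv[0] in the child, and wait for it.
 * Returns the child's exit status, 127 if exec failed, -1 on other errors. */
int spawn_and_wait(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: do nothing else before exec, so the parent's memory
         * image is replaced as early as possible. */
        execv(argv[0], argv);
        _exit(127);             /* only reached if execv failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, with `char *args[] = {"/bin/sh", "-c", "exit 7", NULL};` the call `spawn_and_wait(args)` returns 7.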
Not technically answering your question, but providing a couple of alternative solutions:
If you are using Linux have you considered using pktgen? It is a flexible tool for sending UDP packets from kernel as fast as the interface allows. This is much faster than a userspace tool.
Oh, and a shameless plug: I have made a multi-threaded network testing tool which could be used to spam the network with UDP packets. It can operate in multi-process mode (by using fork) or multi-thread mode (by using pthreads). The pthreads mode might use less RAM, so it might be better for you to use. If anything, it might be worth looking at the source, as I've spent many years improving this code, and it's been able to generate enough packets to saturate a 10 Gbps interface.
What could be happening is that the fork call in process A requires a significant amount of RAM + swap (if any). Thus, when you call fork() from this process, the kernel must reserve enough RAM and swap for the child process to have its own copy (copy-on-write, actually) of the parent process's writable private memory, namely its stack and heap. When you call exec() from the child process, that memory is no longer needed and your child process can have its own, smaller private working set.
So, the first thing to make sure of is that you don't have more than one process at a time in the state between fork() and exec(). It is during this state that the child process must have a duplicate of its parent process's virtual memory space.
Second, try using the overcommit settings which will allow the kernel to reserve more memory than actually exists. These are /proc/sys/vm/overcommit*. You can get away with using overcommit because your child processes only need the extra VM space until they call exec, and shouldn't actually touch the duplicated address space of the parent process.
Third, in your parent process you can allocate the largest blocks using shared memory, rather than the stack or heap, which are private. Thus, when you fork, those shared memory regions will be shared with the child process rather than duplicated copy-on-write.
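A minimal sketch of that third suggestion, using an anonymous shared mapping (`MAP_SHARED | MAP_ANONYMOUS`, available on Linux). The helper names are invented for this example. The round trip shows that the child's write lands in the same pages the parent reads, which is what distinguishes a shared mapping from a private copy-on-write one.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate memory that stays shared across fork() instead of being
 * duplicated copy-on-write. Returns NULL on failure. */
void *alloc_shared(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Demonstration: the child writes `value` and the parent reads it back.
 * Returns the value seen by the parent, or -1 on failure. */
int shared_roundtrip(int value)
{
    int *cell = alloc_shared(sizeof *cell);
    int seen = -1;
    int status;
    pid_t pid;

    if (cell == NULL)
        return -1;
    *cell = 0;

    pid = fork();
    if (pid == 0) {
        *cell = value;          /* visible to the parent: same pages */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid)
        seen = *cell;

    munmap(cell, sizeof *cell);
    return seen;
}
```

In the parent process from the question, the large buffers would be obtained from something like `alloc_shared()` instead of malloc or stack arrays.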