I've got a C client listening to Tibco RV (using 8.4.0). The source pumps out messages on PREFIX1.* and PREFIX2.* pretty frequently (can be several times per second).
I have six threads, each listening to a particular SUFFIX, e.g. PREFIX1.SUFFIX_A and PREFIX2.SUFFIX_A. So each thread has a listener and its own queue for those two subjects. I've set a queue size limit of 1000, dropping the oldest 200 if we hit that (though I never see more than about 40 in the queue at busy times).
After running fine for many hours, each day the program suddenly stops receiving data. The source continues to publish, but I no longer dispatch events from any queue. I don't understand what could have caused this (aside from deleting the listeners).
What might have caused the listening to stop? Alternatively, given that the system is high frequency, how can this be investigated? Can I tell whether a listener is still active via the C interface? I couldn't see anything in the API for that.
Thanks for any help,
-Dave
It looks like the problem was that the machine had only a partial install of RV; in particular, the package we had for that machine did not include the rv daemon. Having re-read the docs, I'm actually a bit confused about how we managed to get network data at all, but it seems that without the daemon networking works until the first minor network problem and then stops entirely, whereas with the daemon we recover from network errors.
So the fix for this case was simply to install the full package and ensure the daemon runs constantly. Now the problem appears to have disappeared.
Once in a while my server's accept calls just stop working properly.
There is a much deeper story behind this: I'm being flooded with SYN and SYN/ACK packets, my network router goes haywire, and accept keeps returning ECONNABORTED. I've already tried to debug and fix this specific attack, but without success. By now I've given up and am instead looking for a more generic server recovery solution.
Anyway, I figured out that simply "restarting" the server socket by closing it and calling socket again helps. That is very simple in theory, but in practice I'm facing a huge challenge here, because (a) the server is quite complex by now and (b) I don't know exactly when I should restart the server socket.
My setup is one accept-thread that calls accept and feeds epoll, and one listener-thread that waits for epoll read/write etc. events and feeds the queue of a thread pool.
I have not found any literature that guides one through restarting the server socket.
Particularly:
When do I actually restart the server socket? I mean, I do not really know whether an ECONNABORTED return value from accept means just an aborted connection or that the accept/file descriptor is going bananas.
How does closing the server socket affect epoll and connected clients? Should I close the server socket immediately, or rather allow a grace period so that all clients can finish first?
Or is it even best to have two alternating server sockets, so that if one goes bananas I just try the other one?
I am assuming that the things you say in your question are all true and accurate, even though some of them seem like they may be misdiagnosed. Unfortunately, you didn't really explain how you reached the conclusions presented, so I really can't do much other than assume they're true.
For example, you don't explain how or why you concluded that closing and calling socket again will help. From just the information you gave, I would strongly suspect the opposite is true. But again, without knowing the evidence and rationale that led you to that conclusion, all I can do is assume it's true, even though my instinct and experience say it's wrong.
When do I actually restart the server socket? I mean, I do not really know whether an ECONNABORTED return value from accept means just an aborted connection or that the accept/file descriptor is going bananas.
If it really is the case that accepting connections recovers faster with a restart than without one, and you really can't get any connections through, keep track of the time of the last successful connection and the number of failures since then. If, for example, you've gone 120 seconds or more without a successful connection and had at least four failed connections since the last successful one, then close and re-open. You may need to tune those parameters.
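A minimal sketch of that bookkeeping in plain C; the thresholds and the reopen_listener() helper are hypothetical and would need tuning to your setup:

    #include <sys/socket.h>
    #include <time.h>

    /* Assumed helper (not shown): closes *listen_fd and creates, binds and
     * listens on a fresh socket, storing the new fd back into *listen_fd. */
    void reopen_listener(int *listen_fd);

    /* Bookkeeping for the restart heuristic described above. */
    struct accept_stats {
        time_t last_success;   /* time of the last successful accept() */
        int    failures;       /* failed accepts since that success    */
    };

    /* Returns the accepted fd, or -1 on failure; reopens the listener when
     * the (tunable) thresholds are hit: at least 4 failures and at least
     * 120 seconds without a single successful accept.                     */
    int accept_with_restart(int *listen_fd, struct accept_stats *st)
    {
        int fd = accept(*listen_fd, NULL, NULL);
        if (fd >= 0) {
            st->last_success = time(NULL);
            st->failures = 0;
            return fd;
        }

        st->failures++;
        if (st->failures >= 4 && time(NULL) - st->last_success >= 120) {
            reopen_listener(listen_fd);
            st->last_success = time(NULL);
            st->failures = 0;
        }
        return -1;
    }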
How does closing the server socket affect epoll and connected clients?
It has no effect on them unless you're using epoll on the server socket itself. In that case, make sure to remove it from the set before closing it.
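For example, something along these lines, assuming epfd is your epoll instance and srv_fd the listening socket:

    #include <sys/epoll.h>
    #include <unistd.h>

    /* Deregister the listening socket from epoll before closing it, so
     * epoll never reports events for a dead descriptor.                 */
    void close_listener(int epfd, int srv_fd)
    {
        epoll_ctl(epfd, EPOLL_CTL_DEL, srv_fd, NULL);
        close(srv_fd);
    }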
Should I close the server socket immediately, or rather allow a grace period so that all clients can finish first?
I would suggest "draining" the socket by calling accept without blocking until it returns EWOULDBLOCK. Then you can close it. If you get any legitimate connections in that process, don't close it since it's obviously still working.
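A rough sketch of that draining, assuming the listening socket has been switched to non-blocking mode first (e.g. with fcntl and O_NONBLOCK); handle_client() stands in for whatever you normally do with a new connection:

    #include <errno.h>
    #include <sys/socket.h>

    /* Assumed hand-off to your normal connection handling. */
    void handle_client(int fd);

    /* Drain the backlog of a non-blocking listening socket. Returns 1 if a
     * legitimate connection arrived (so don't close the listener after all),
     * 0 if the backlog emptied without one.                                 */
    int drain_listener(int srv_fd)
    {
        int got_one = 0;

        for (;;) {
            int fd = accept(srv_fd, NULL, NULL);
            if (fd >= 0) {
                got_one = 1;
                handle_client(fd);
                continue;
            }
            if (errno == EWOULDBLOCK || errno == EAGAIN)
                break;              /* backlog drained */
            if (errno != ECONNABORTED && errno != EINTR)
                break;              /* a real error: stop draining */
            /* ECONNABORTED / EINTR: just keep draining */
        }
        return got_one;
    }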
A client that tries to connect in the window between your close and your call to listen on the new socket might get an error. But if clients are getting errors anyway, that should be acceptable.
Or is it even best to have two alternating server sockets, so that if one goes bananas I just try the other one?
A long time ago, port DoS attacks were common because built-in defenses to things like SYN-bombs weren't as good as they are now. In those days, it was common for a server to support several different ports and for clients to try the ports in rotation. This is why IRC servers often accepted connections on ranges of ports such as 6660-6669. That meant an attacker had to do ten times as much work to make all the ports unusable. These days, it's pretty rare for an attack to take out a specific inbound port so the practice has largely gone away. But if you are facing an attack that can take out specific listening ports, it might make sense to open more listening ports.
Or you could work harder to understand the attack and figure out why you are having a problem that virtually nobody else is having.
I am using IBM WebSphere MQ to set up a durable subscription for Pub/Sub. I am using their C APIs. I have set up a subscription name and have MQSO_RESUME in my options.
When I set a wait interval for my subscriber and I properly close my subscriber, it works fine and restarts fine.
But if I force-crash my subscriber (Ctrl-C) and then try to reopen it, MQSUB fails with reason code 2429, which is MQRC_SUBSCRIPTION_IN_USE.
I use MQWI_UNLIMITED as the WaitInterval in my MQGET, and MQGMO_WAIT | MQGMO_NO_SYNCPOINT | MQGMO_CONVERT as my MQGET options.
This error pops up only when the topic has no pending messages for that subscription. If there are pending messages that the subscription can resume, then it resumes, but it misses the first message published on that topic.
I tried changing the heartbeat interval to 2 seconds and that didn't fix it.
How do I prevent this?
This happens because the queue manager has not yet detected that your application has lost its connection. You can see this by issuing the following MQSC command:
DISPLAY CONN(*) TYPE(ALL) ALL WHERE(APPLTYPE EQ USER)
and you will see your application still listed as connected. As soon as the queue manager notices that your process has gone, you will be able to resume the subscription again. You don't say whether your connection is a locally bound connection or a client connection, but there are some tricks to help speed up the detection of lost connections, depending on the type of connection.
You say that when you are able to resume, you don't get the first message. This is because you are retrieving messages with MQGMO_NO_SYNCPOINT, so the message you are not getting had already been removed from the queue and was on its way down the socket to the client application at the moment you forcibly crashed it; that message is now gone. If you use MQGMO_SYNCPOINT (and MQCMIT), you will not have that issue.
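A minimal sketch of that pattern, assuming Hconn and Hobj have already been obtained via MQCONN and MQSUB with MQSO_RESUME; the buffer size and the process_message() call are purely illustrative:

    #include <cmqc.h>   /* IBM MQ C API definitions */

    /* Hypothetical stand-in for whatever you do with a message. */
    void process_message(const char *buf, MQLONG len);

    /* Get one message under syncpoint and commit only after it has been
     * processed; if the application crashes before MQCMIT, the message is
     * rolled back onto the queue instead of being lost.                   */
    void get_one_under_syncpoint(MQHCONN Hconn, MQHOBJ Hobj)
    {
        MQMD   md  = {MQMD_DEFAULT};
        MQGMO  gmo = {MQGMO_DEFAULT};
        MQLONG compCode, reason, dataLen;
        char   buffer[4096];

        gmo.Options      = MQGMO_WAIT | MQGMO_SYNCPOINT | MQGMO_CONVERT;
        gmo.WaitInterval = MQWI_UNLIMITED;

        /* In a get loop you would also reset md.MsgId and md.CorrelId
         * to MQMI_NONE / MQCI_NONE before each MQGET.                  */
        MQGET(Hconn, Hobj, &md, &gmo, sizeof(buffer), buffer, &dataLen,
              &compCode, &reason);

        if (compCode == MQCC_OK) {
            process_message(buffer, dataLen);
            MQCMIT(Hconn, &compCode, &reason);
        }
    }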
You say that you don't see the problem when there are still messages on the queue to be processed, only when the queue is empty. I suspect the difference here is whether your application is in an MQGET wait or processing a message when you forcibly crash it. Clearly, when there are no messages left on the queue, the use of MQWI_UNLIMITED guarantees that you are in the MQGET wait, but when processing messages you probably spend more time outside the MQGET than in it.
You mention tuning down the heartbeat interval to try to reduce that time frame; this was a good idea, but you said it didn't work. Please remember that you have to change it at both ends of the channel, or you will still be using the default of 5 minutes.
I am currently experimenting with building an HTTP server. The server is multi-threaded, with one listening thread using select(...) and four worker threads managed by a thread pool. I'm currently handling around 14k-16k requests per second with a document length of 70 bytes and a response time of 6-10 ms, on a Core i3 330M. But this is without keep-alive, and I immediately close every socket I serve once the work is done.
EDIT: The worker threads process 'jobs' that are dispatched when activity on a socket is detected, i.e. they service requests. After a 'job' is completed, if there are no more 'jobs' we sleep until more 'jobs' get dispatched; if some are already available, we start processing one of those.
My problems started when I began trying to implement keep-alive support. With keep-alive activated I only manage 1.5k-2.2k requests per second with 100 open sockets. This number grows to around 12k with 1000 open sockets. In both cases the response time is somewhere around 60-90 ms. I find this quite odd, since my current assumptions say that the request rate should go up, not down, and the response time should hopefully go down, but definitely not up.
I've tried several different strategies for fixing the low performance:
1. Call select(...)/pselect(...) with a timeout value so that we can rebuild our FD_SET structure and listen to any additional sockets that arrived after we blocked, and service any detected socket activity.
(Aside from the low performance, there's also the problem of sockets being closed while we're blocking, resulting in select(...)/pselect(...) reporting a bad file descriptor.)
2. Have one listening thread that only accepts new connections, and one keep-alive thread that is notified via a pipe of any new sockets that arrived after we blocked and of any new socket activity, and that rebuilds the FD_SET.
(same additional problem here as in '1.').
3. select(...)/pselect(...) with a timeout; when new work is to be done, detach the linked-list entry for the socket that has activity, and add it back when the request has been serviced. Rebuilding the FD_SET will hopefully be faster. This way we also avoid trying to listen on any bad file descriptors.
4. Combined (2.) and (3.).
-. Probably a few more, but they escape me at the moment.
The keep-alive sockets are stored in a simple linked list whose add/remove methods are protected by a pthread_mutex lock; the function responsible for rebuilding the FD_SET also takes this lock.
I suspect that the constant locking/unlocking of the mutex is the main culprit here. I've tried to profile the problem, but neither gprof nor google-perftools has been very cooperative, either introducing extreme instability or plainly refusing to gather any data at all (this could be me not knowing how to use the tools properly, though). But removing the locks risks putting the linked list in an inconsistent state and would probably crash the program or put it into an infinite loop.
I've also suspected the select(...)/pselect(...) timeout when I've used it, but I'm pretty confident that this was not the problem since the low performance is maintained even without it.
I'm at a loss as to how I should handle keep-alive sockets, and I'm therefore wondering whether you have any suggestions on how to fix the low performance or on alternate methods I could use to support keep-alive sockets.
If you need any more information to answer my question properly, don't hesitate to ask; I'll do my best to provide it and update the question accordingly.
Try to get rid of select completely. You can find some kind of event notification on every popular platform: kqueue/kevent on FreeBSD, epoll on Linux, etc. This way you do not need to rebuild the FD_SET and can add/remove watched fds at any time.
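For example, a bare-bones epoll loop on Linux might look like this; handle_io() is a placeholder for your job dispatch:

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_EVENTS 64

    /* Hypothetical stand-in for dispatching a ready socket to the pool. */
    void handle_io(int fd);

    /* Minimal epoll loop: no FD_SET rebuilding, and keep-alive sockets can
     * be added or removed at any time with epoll_ctl().                   */
    void event_loop(int listen_fd)
    {
        struct epoll_event ev, events[MAX_EVENTS];
        int epfd = epoll_create1(0);   /* use epoll_create() on old kernels */
        if (epfd < 0) { perror("epoll_create1"); exit(1); }

        ev.events  = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    int client = accept(listen_fd, NULL, NULL);
                    if (client < 0)
                        continue;
                    ev.events  = EPOLLIN;
                    ev.data.fd = client;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
                } else {
                    handle_io(fd);   /* hand the job to the worker threads */
                }
            }
        }
    }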
The time increase will be more visible when the client uses your socket for more than one request. If you are merely opening and closing while still telling the client to keep the connection alive, then you have the same scenario as you did without keep-alive, but now with the added overhead of the sockets sticking around.
If, however, you are reusing the socket from the same client for multiple requests, then you will lose the TCP connection-setup overhead and gain performance that way.
Make sure your client is using keep-alive properly, and look for a better way to get notification of a socket's state and data, perhaps a poll device or queuing of the requests.
http://www.techrepublic.com/article/using-the-select-and-poll-methods/1044098
This page has a patch for Linux that provides a poll device. With some understanding of how it works, you could use the same technique in your application rather than rely on a device that may not be installed.
There are many alternatives:
Use processes instead of threads, and pass file descriptors via Unix sockets (see the sketch after this list).
Maintain per-thread lists of sockets. You could even accept() directly on the worker threads.
etc...
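As an illustration of the first alternative, here is a rough sketch of passing an accepted socket to a worker process over a Unix-domain channel (e.g. one end of a socketpair()) using SCM_RIGHTS; error handling is trimmed:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send an accepted socket fd to a worker over a Unix-domain socket. */
    int send_fd(int chan, int fd_to_send)
    {
        struct msghdr msg = {0};
        struct iovec  iov;
        char dummy = 'x';                     /* must send at least one byte */
        char cbuf[CMSG_SPACE(sizeof(int))];

        iov.iov_base = &dummy;
        iov.iov_len  = 1;
        msg.msg_iov    = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control    = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type  = SCM_RIGHTS;
        cm->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &fd_to_send, sizeof(int));

        return sendmsg(chan, &msg, 0) < 0 ? -1 : 0;
    }

    /* Receive the fd on the worker side; returns -1 on error. */
    int recv_fd(int chan)
    {
        struct msghdr msg = {0};
        struct iovec  iov;
        char dummy;
        char cbuf[CMSG_SPACE(sizeof(int))];
        int fd = -1;

        iov.iov_base = &dummy;
        iov.iov_len  = 1;
        msg.msg_iov    = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control    = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        if (recvmsg(chan, &msg, 0) <= 0)
            return -1;

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        if (cm && cm->cmsg_level == SOL_SOCKET && cm->cmsg_type == SCM_RIGHTS)
            memcpy(&fd, CMSG_DATA(cm), sizeof(int));
        return fd;
    }

The worker that receives the descriptor can then treat it exactly like a socket it accepted itself.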
Are your test clients reusing the sockets? Are they correctly handling keep alive?
I could see the case where you make the minimum possible change to your benchmarking code by just passing the keep-alive header, but then don't change the client code, so the socket is still closed at the client end once the response payload is received.
This would incur all the costs of keep-alive with none of the benefits.
What you are trying to do has been done before. Consider reading about the Leader-Follower network server pattern, http://www.kircher-schwanninger.de/michael/publications/lf.pdf
I have a TCP server process written in C, running on CentOS 5.5. It acts as a TCP server for external clients and also does some IPC communication with other processes in the system, using Unix domain sockets it has established. It's not a multi-threaded process; it does one task at a time. There's an epoll_wait() I use to listen for requests on either the TCP socket or any of the IPC sockets it has established with internal processes. When epoll_wait() returns, I process the request, whichever socket it came from, and then go back into epoll_wait().
I have a TCP client that connects to this process from outside (not IPC). It connects successfully, sends a request message, and gets a response back. I've put this in an infinite loop just to test its robustness, etc.
After a while, the TCP server stops responding to requests coming from the TCP client. The TCP client connects successfully and sends a request message, but it doesn't get any response message back from the TCP server.
So I reckon the TCP server is stuck somewhere else, trying to do something, and has not returned to epoll_wait() to process other incoming requests. I've tried to figure it out using logs, but that's not helping me understand where exactly the process is stuck.
So I wanted to use a debugger that can give me some information (a function name would be great) as to what the process is doing. Setting breakpoints is overwhelming, because the TCP server process has tons of files and functions.
I'm trying to use DDD on CentOS 5.5 to figure out what's going on. I attach to the process successfully, then I click the "Step", "Stepi" or "Next" button, but nothing happens.
By the way, when I use Eclipse for debugging and attach to this process (or any process), I always get "__kernel_vsyscall()". Does this mean the program breaks by default at whatever it's doing? If that's the case, how do I come out of the __kernel_vsyscall() call to continue within my program? If I press F8 it comes out, but then I don't know where it was, since I lose the stack trace. Like I said earlier, since I can't figure out where it was, I don't know where to put a breakpoint.
In summary, I want to figure out where my process is stuck or what it's doing, and try to debug from that point on.
How do I go about this?
Thanks
Amit
1) Attaching to a C process can often cause problems in itself; is there any way for you to start the process in the debugger?
2) The step functions of DDD can only be used after you've set a breakpoint and the program has stopped on a statement. From reading your question, I'm not sure you've done that. You may not want to set many breakpoints, but is setting one or two in critical sections of code possible?
In summary, what I wanted to accomplish was to find where my program is stuck when it hangs. I figured it out, and it was so simple: create a configuration in Eclipse under "Debug Configurations -> C/C++ Attach to Application".
Let the process run normally from the shell (preferably with a terminal attached). When it hangs, open Eclipse, click the debug icon, and run the configured debug launch. It'll ask you to attach to a process; look for your process name and attach to it.
Now just look at the entire stack trace: you'll see some of your own function calls mixed with kernel function calls. That tells you where the program is stuck.
I've got a strange issue with a server accepting TCP connections. Even though there are normally some processes waiting, at some volume of connections it hangs.
Long version:
The server is written in Perl and binds a $srv socket with the reuse flag and listen == 5. Afterwards, it forks into 10 processes with a loop of $clt=$srv->accept(); do_processing($clt); $clt->shutdown(2);
The client, written in C, is also very simple: it sends some lines, then receives all the lines available and does a shutdown(sockfd, 2). There's nothing async going on, and at the end both the send and receive queues are empty (as reported by netstat).
Connections last only ~20 ms. All clients behave the same way, are the same implementation, etc. Now let's say I'm accepting X connections from client 1 and another X from client 2. The processes still report that they're idle all the time. If I add another X connections from client 3, suddenly the server processes start hanging just after accepting. The first blocking thing they do after accept() is while (<$clt>) ... - but they don't get any data (already on the first try). Suddenly all 10 processes are in this state and do not stop waiting. Under strace, the server processes seem to hang on read(), which makes sense.
There are loads of connections in TIME_WAIT state belonging to that server (~100 when the problem starts to manifest), but this might be a red herring.
What could be happening here?
After some more analysis: it turned out that the client was at fault, not closing previous connections properly before trying the next one. The servers at the beginning of the load-balancing list were left holding stale connections.
This probably isn't the solution to your problem, but it might solve a problem you'll experience in the future: don't forget to close() the sockets when you're done! shutdown() will disconnect the stream, but it'll still eat a file descriptor.
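In other words, the per-connection teardown on each side should look something like this:

    #include <sys/socket.h>
    #include <unistd.h>

    /* Tear down a finished connection: shutdown() ends the stream,
     * close() releases the file descriptor (shutdown alone does not). */
    static void finish_connection(int sockfd)
    {
        shutdown(sockfd, SHUT_RDWR);   /* same as shutdown(sockfd, 2) */
        close(sockfd);
    }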
Since you said strace shows the processes stuck in read(), your problem seems to be that the client isn't sending the data you expect it to be sending. You should either fix your client, or add an alarm() to your server processes so that they can survive dead clients.
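A sketch of the alarm() approach, assuming a blocking read and that simply giving up on a silent client is acceptable; timed_read() is a hypothetical wrapper:

    #include <errno.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void on_alarm(int sig) { (void)sig; /* just interrupt the read */ }

    /* Read with a timeout: if the client sends nothing for `secs` seconds,
     * the blocking read() is interrupted and returns -1 with errno == EINTR,
     * so the server process can give up on the dead client.                 */
    ssize_t timed_read(int fd, void *buf, size_t len, unsigned secs)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_alarm;    /* deliberately no SA_RESTART, so the
                                        read really is interrupted          */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGALRM, &sa, NULL);

        alarm(secs);
        ssize_t n = read(fd, buf, len);
        int saved_errno = errno;
        alarm(0);                    /* cancel any pending alarm */
        errno = saved_errno;

        return n;                    /* -1 with errno == EINTR => timed out */
    }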
Does it surge, then pause for a long time (circa two minutes or so), and then surge again? If so, you may not have your system's max open files limit set high enough.
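If that turns out to be the case, you can check and raise the limit from the server process itself (or with ulimit -n in the shell before starting it); a rough sketch:

    #include <stdio.h>
    #include <sys/resource.h>

    /* Check and, if needed, raise the per-process open-file limit,
     * e.g. ensure_fd_limit(4096). Raising the soft limit above the
     * hard limit requires root privileges.                          */
    void ensure_fd_limit(rlim_t wanted)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            return;

        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        if (rl.rlim_cur < wanted) {
            rl.rlim_cur = wanted < rl.rlim_max ? wanted : rl.rlim_max;
            setrlimit(RLIMIT_NOFILE, &rl);
        }
    }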