NodeJS - socket hang up error - angularjs

I am sending more than 50 requests to a server using node.js. However, after 20-30 requests I get a socket hang up error.
Error --
Error: socket hang up
at createHangUpError (http.js:1472:15)
at Socket.socketOnEnd [as onend] (http.js:1568:23)
at Socket.g (events.js:180:16)
at Socket.EventEmitter.emit (events.js:117:20)
at _stream_readable.js:920:16
at process._tickCallback (node.js:415:13)

Yeah, it looks like your backend server is hanging up the socket, either due to a timeout or to capacity limits. Can you throttle the requests you are sending? Using a library like async (for example its eachLimit/parallelLimit helpers) makes it easy to throttle the connections.

TL;DR:
This could be a problem caused by Node firing its GC and freezing all operations, which results in unpredictable occasional "socket hang up" errors.
I experienced a similar situation. I start a bunch of server calls wrapped in promises that run in parallel, and then have a loop with a sleep that checks periodically for their completion. I observe the following pattern:
The calls complete promptly for some time
then there is a slow-down
then eventually some "socket hang up" errors
then server calls continue promptly
This pattern repeats continuously. The DataDog stats on the server I call do not show variations in latency that could be a primary cause, so my conclusion is that it is something on the Node app side.
If the problem is due to GC, that is bad news, because GC is usually referred to as "magic" :). You can't predict when it fires or how long it pauses everything.
HTH

Related

Apache Flink Stateful Functions forwarding the same message to N functions

I'm trying to send incoming messages to multiple stateful functions, but I couldn't fully understand how to do it. For the sake of clarity, let's say one of my stateful functions receives some integers and sends them to a couple of remote functions. These remote functions add the integers to their state values and save the result as the new state.
When one of these 2 remote functions fails, the other should continue to work the same way.
When the failed function recovers, it should process the messages it could not process during the failure.
I thought about sending them one after the other as below, but I don't think it will work:
context.send(RemoteFuncType1,someID,someInteger);
context.send(RemoteFuncType2,someID,someInteger);
...
How can I do this in a fault-tolerant way?
If possible, how does it work in the background?
The way you are suggesting to do it is the correct way!
StateFun delivers the messages to the remote functions in a consistent manner. If one of the functions is experiencing a short downtime, StateFun retries sending the message until either:
it is successfully delivered (with backoff), or
a maximum retry timeout is reached. When the timeout is reached, the whole StateFun job is rewound to a previously consistent checkpoint.
Since StateFun manages both message delivery and the state of the functions (remote functions included), it makes sure that a consistent state and message are delivered to each function.
In your example: the second remote function would receive someInteger with whatever state it had before, once recovered.
To get a deeper understanding of how checkpointing works in Flink and how it enables exactly once processing I’d recommend the following:
https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html

Tibco RV C client stops receiving messages

I've got a C client listening to Tibco RV (using 8.4.0). The source pumps out messages on PREFIX1.* and PREFIX2.* pretty frequently (can be several times per second).
I have six threads, each listening to a particular SUFFIX, eg PREFIX1.SUFFIX_A and PREFIX2.SUFFIX_A. So each thread has a listener and its own queue for these two messages. I've got a queue size limit of 1000, dropping the oldest 200 if we hit that (but never have more than about 40 in the queue at busy times).
After running fine for many hours, each day the program suddenly stops receiving data. The source continues to publish, but I no longer dispatch events from any queue. I don't understand what could have caused this (aside from deleting the listeners).
What might have caused the listening to stop? Alternatively, given that the system is high-frequency, how can this be investigated? Can I tell whether a listener is still active via the C interface? I couldn't see anything in the API for that.
Thanks for any help,
-Dave
It looks like the problem was that the machine had only a partial install of RV. In particular, there was no RV daemon in the package we had for that machine. Re-reading the docs, I'm actually a bit confused about how we managed to receive network data at all, but it seems that without a daemon networking works until the first minor network problem and then stops entirely, whereas with the daemon we recover from network errors.
So the fix for this case was simply to install the full package and ensure the daemon runs constantly. Now the problem appears to have disappeared.

Reconnecting with hiredis

I'm trying to reconnect to the Redis server on disconnect.
I'm using redisAsyncConnect and I've set up a callback on disconnect. In the callback I try to reconnect with the same command I use at the very start of the program to establish the connection, but it's not working; I can't seem to reconnect.
Can anyone help me out with an example?
Managing Redis (re)connections asynchronously is a bit tricky when an event loop is used.
Here is an example implementing a small zset polling daemon connecting to a list of Redis instances, which is resilient to disconnection events. The ae event loop is used (it is the one used by Redis itself).
http://gist.github.com/4149768
Check the following functions:
connectCallback
disconnectCallback
checkConnections
reconnectIfNeeded
The main daemon loop does its activity only when the connection is available. Once per second, a timer-initiated callback checks whether some connections have to be reestablished. We have found this mechanism quite reliable.
Note: error management is crude in this example for brevity's sake. Real production code should manage errors in a more graceful way.
One tricky point when dealing with multiple asynchronous connections is that no user-defined contextual data is passed as a parameter to the corresponding callbacks. Cleaning up the data associated with a connection after a disconnection event can therefore be a bit difficult.
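As a rough illustration of the mechanism described above (a simplified sketch only, not the code from the gist: the host/port are placeholders and the helper names merely mirror the gist's structure), assuming the hiredis async API and the ae adapter that ships with hiredis:

/* Sketch: hiredis async connection with automatic reconnection.
 * Assumes hiredis (async API + adapters/ae.h) and the ae event loop
 * from the Redis sources. Error handling is minimal for brevity. */
#include <hiredis/async.h>
#include <hiredis/adapters/ae.h>

static aeEventLoop *loop;
static redisAsyncContext *ctx;
static int connected = 0;

static void connectCallback(const redisAsyncContext *c, int status)
{
    if (status == REDIS_OK)
        connected = 1;
}

static void disconnectCallback(const redisAsyncContext *c, int status)
{
    /* hiredis frees the context after this callback: just mark it lost */
    connected = 0;
    ctx = NULL;
}

static void reconnectIfNeeded(void)
{
    if (connected || ctx != NULL)
        return;
    ctx = redisAsyncConnect("127.0.0.1", 6379);   /* placeholder address */
    if (ctx->err) {                               /* immediate failure: retry next tick */
        redisAsyncFree(ctx);
        ctx = NULL;
        return;
    }
    redisAeAttach(loop, ctx);
    redisAsyncSetConnectCallback(ctx, connectCallback);
    redisAsyncSetDisconnectCallback(ctx, disconnectCallback);
}

static int checkConnections(aeEventLoop *el, long long id, void *data)
{
    reconnectIfNeeded();
    return 1000;   /* run this timer again in one second */
}

int main(void)
{
    loop = aeCreateEventLoop(64);
    reconnectIfNeeded();
    aeCreateTimeEvent(loop, 1000, checkConnections, NULL, NULL);
    aeMain(loop);
    return 0;
}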

Azure StorageClient Transient Connection testing - hanging

I am testing my WPF application connecting to Azure Blob Storage to download a bunch of images using TPL (tasks).
It is expected that in Live environment, there will be highly transient connection to the internet at deployed locations.
I have set Retry Policy and time-out in BlobRequestOptions as below:
//Note the values here are for test purposes only
//CloudRetryPolicy is a custom method returning adequate Retry Policy
// i.e. retry 3 times, wait 2 seconds between retries
blobClient.RetryPolicy = CloudRetryPolicy(3, new TimeSpan(0, 0, 2));
BlobRequestOptions bro = new BlobRequestOptions() { Timeout = TimeSpan.FromSeconds(20) };
blob.DownloadToFile(LocalPath, bro);
The above statements are in a background task that works as expected, and I have appropriate exception handling in the background task and in the continuation task.
In order to test the exception handling and my recovery code, I am simulating an internet disconnection by pulling out the network cable. I have hooked a method up to the System.Net.NetworkChange.NetworkAvailabilityChanged event on the UI thread, and I can detect connection/disconnection as expected and update the UI accordingly.
My problem is: if I pull the network cable while a file is being downloaded (via blob.DownloadToFile), the background thread just hangs. It does not time out, does not crash, does not throw an exception, nothing! As I write this, I have been waiting ~30 minutes and no response/processing has happened in relation to the background task.
If I pull the network cable before the download starts, execution proceeds as expected, i.e. I can see retries happening, exceptions raised and propagated, and so on.
Has anyone experienced similar behaviour? Any tips/suggestions to overcome this behaviour/problem?
By the way, I am aware that I can cancel the download task on detecting the loss of network connectivity, but I do not want to do this, because connectivity can be restored within the time-out duration and the download process can then continue from where it was interrupted. I have tested this auto-resumption and it works nicely.
Below is a rough indication of my code structure (not syntactically correct, just a flow indication)
btnClick()
{
    declare background_task
    attach continuewith_task to background_task
    start background_task
}

background_task()
{
    try
    {
        ... connection setup ...
        blob.DownloadToFile(LocalPath, bro);
    }
    catch (exception ex)
    {
        ... exception handling ...
        // in case of connectivity loss while download is in progress
        // this block is not getting executed
        // debugger just sits idle without a current statement
    }
}

continuewith_task()
{
    check if antecedent task is faulted
    {
        ... do recovery work ...
        // this is working as expected if connectivity is lost
        // before download starts
        // this task does not get called if connectivity is lost
        // while file transfer is taking place
    }
    else
    {
        ... further processing ...
    }
}
Avkash is correct, I believe. Also, to be clear, you will basically never see that network-removed error, so there is not much point in testing for it. You will see a ton of connection-rejected errors, conflicts, missing resources, read-only accounts, throttles, access-denied errors, even DNS resolution failures, depending on how you are handling storage accounts. You should test for those.
That being said, I would suggest you do not use the RetryPolicy at all with blob or table storage. Most of the errors you will actually encounter are not retryable to begin with (e.g. 404, 409, 403, etc.). When you have a retry policy in place, by default it will actually retry 4 more times over the next 2 minutes; there is no point in retrying bad credentials, for instance.
You are far better off simply handling the error and retrying selectively yourself (timeouts and throttling are about the only cases where a retry makes sense here).
Your problem is mainly caused by the fact that the Azure storage client library uses file-streaming classes underneath, which is why the API hang is not directly related to the Windows Azure Blob client library itself. If you call the file-streaming APIs directly over the network, you will see exactly the same behavior when the network cable is suddenly removed; disconnecting the network gracefully produces different behavior.
If you search the internet you will find that the streaming classes do not detect network loss, which is why, in your code, you should check for the network-disconnect event and then stop the background streaming thread yourself.

Problem supporting keep-alive sockets on a home-grown http server

I am currently experimenting with building an HTTP server. The server is multi-threaded: one listening thread using select(...) and four worker threads managed by a thread pool. I'm currently managing around 14k-16k requests per second with a document length of 70 bytes and a response time of 6-10 ms, on a Core i3 330M. But this is without keep-alive, and I immediately close every socket I serve once the work is done.
EDIT: The worker threads process 'jobs' that are dispatched when activity is detected on a socket, i.e. they service requests. After a 'job' is completed, we either sleep until more 'jobs' are dispatched or, if some are already available, start processing one of them.
My problems started when I began trying to implement keep-alive support. With keep-alive activated I only manage 1.5k-2.2k requests per second with 100 open sockets. This number grows to around 12k with 1000 open sockets. In both cases the response time is somewhere around 60-90 ms. I find this quite odd, since my assumption was that requests per second should go up, not down, and response time should hopefully go down, but definitely not up.
I've tried several different strategies for fixing the low performance:
1. Call select(...)/pselect(...) with a timeout value so that we can rebuild our FD_SET structure and listen to any additional sockets that arrived after we blocked, and service any detected socket activity.
(aside from the low performance, there's also the problem of sockets being closed while we're blocking, resulting in select(...)/pselect(...) reporting bad file descriptor.)
2. Have one listening thread that only accepts new connections and one keep-alive thread that is notified via a pipe of any new sockets that arrived after we blocked and of any new socket activity, and that rebuilds the FD_SET.
(same additional problem here as in '1.').
3. select(...)/pselect(...) with a timeout, when new work is to be done, detach the linked-list entry for the socket that has activity, and add it back when the request has been serviced. Rebuilding the FD_SET will hopefully be faster. This way we also avoid trying to listen to any bad file descriptors.
4. Combined (2.) and (3.).
-. Probably a few more, but they escape me at the moment.
The keep-alive sockets are stored in a simple linked list whose add/remove methods are protected by a pthread_mutex lock; the function responsible for rebuilding the FD_SET also takes this lock.
I suspect that the constant locking/unlocking of the mutex is the main culprit here. I've tried to profile the problem, but neither gprof nor google-perftools has been very cooperative, either introducing extreme instability or plainly refusing to gather any data at all (this could be me not knowing how to use the tools properly, though). But removing the locks risks putting the linked list in an inconsistent state and would probably crash the program or put it into an infinite loop.
I've also suspected the select(...)/pselect(...) timeout when I've used it, but I'm pretty confident that this was not the problem since the low performance is maintained even without it.
I'm at a loss as to how I should handle keep-alive sockets, and I'm therefore wondering if you have any suggestions on how to fix the low performance, or on any alternate methods I could use to support keep-alive sockets.
If you need any more information to answer my question properly, don't hesitate to ask; I shall do my best to provide it and update the question accordingly.
Try to get rid of select completely. You can find some kind of event notification on every popular platform: kqueue/kevent on FreeBSD, epoll on Linux, etc. This way you do not need to rebuild the FD_SET and can add or remove watched fds at any time.
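For illustration, here is a minimal epoll sketch (Linux-specific, listening socket assumed to exist, error handling omitted) showing how watched sockets are registered individually instead of rebuilding an FD_SET on every pass:

/* Sketch: event loop based on epoll instead of select/FD_SET. */
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev, events[MAX_EVENTS];

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* register once */

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int client = accept(listen_fd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = client;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);   /* start watching */
            } else {
                /* hand the request to a worker; when the connection closes:
                 * epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL); */
            }
        }
    }
}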
The time increase will be more visible when the client uses your socket for more than one request. If you are merely opening and closing connections, yet still telling the client to keep them alive, then you have the same scenario as without keep-alive, but now with the added overhead of the sockets sticking around.
If, however, the same client reuses its socket for multiple requests, then you save the TCP connection-setup overhead and gain performance that way.
Make sure your client is using keep-alive properly. You probably also want a better way to get notified of socket state and data, perhaps a poll device or queuing the requests.
http://www.techrepublic.com/article/using-the-select-and-poll-methods/1044098
That page describes a patch for Linux that adds a poll device. With some understanding of how it works, you can use the same technique in your application rather than rely on a device that may not be installed.
There are many alternatives:
Use processes instead of threads, and pass file descriptors via Unix sockets.
Maintain per-thread lists of sockets. You could even accept() directly on the worker threads (see the sketch after this list).
etc...
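For the second alternative, here is a rough sketch (hypothetical setup: a blocking listening socket on port 8080 and four workers) of accept()ing directly on the worker threads, each thread keeping its own private set of keep-alive sockets so no shared, mutex-protected list is needed:

/* Sketch: worker threads accept() on a shared listening socket. */
#include <pthread.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define NUM_WORKERS 4

static int listen_fd;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int client = accept(listen_fd, NULL, NULL);   /* accept() is thread-safe */
        if (client < 0)
            continue;
        /* ... add 'client' to this thread's private keep-alive set and
         * service requests on it until the peer closes or times out ... */
        close(client);
    }
    return NULL;
}

int main(void)
{
    struct sockaddr_in addr;
    pthread_t tid[NUM_WORKERS];

    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                      /* placeholder port */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 128);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}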
Are your test clients reusing the sockets? Are they correctly handling keep-alive?
I could see a case where you make the minimum possible change to your benchmarking code by just passing the keep-alive header, but do not change the client code, so the socket is still closed at the client end once the payload is received.
That would incur all the costs of keep-alive with none of the benefits.
What you are trying to do has been done before. Consider reading about the Leader-Follower network server pattern, http://www.kircher-schwanninger.de/michael/publications/lf.pdf
