Intermittent connection timeouts to Solr server using SolrNet - solr

I have a production webserver hosting a search, and another machine which hosts the Solr search server (on a subnet which is in the same room, so no network problems). All is fine >90% of the time, but I consistently get a small number of The operation has timed out errors.
I've increased the timeout in the SolrNet init to 30 seconds (!)
SolrNet.Startup.Init<SolrDataObject>(
new SolrNet.Impl.SolrConnection(
System.Configuration.ConfigurationManager.AppSettings["URL"]
) {Timeout = 30000}
);
...but all that happened is I started getting this message instead of Unable to connect to the remote server which I was seeing before. It seems to have made no difference to the amount of timeout errors.
I can see nothing in any log (believe me I've looked!) and clearly my configuration is correct because it works most of the time. Anyone any ideas how I can find more information on this problem?
EDIT:
I have now increased the number of HttpRequest connections from 2 to 'a large number' (I see up to 10 connections) - but this has had no discernible effect on this problem.
The firewall is set to allow ANY connections between the two machines.
We've also checked the hardware with our server host and there are no problems on the connections, according to them.
EDIT 2:
We're still seeing this issue.
We're now logging the timeouts and they're mostly just over 30s - which is the SolrNet layer's timeout; some are 20s, though - which is the Tomcat default timeout period - which suggests it's something in the actual connection between the machines.
Not sure where to go from here, though - they're on a VLAN and we're specifically using the VLAN address - response time from pings is ALWAYS <1ms.

Without more information, I can only guess a few possible reasons:
You're fetching tons of documents per query, and it times out while transferring data.
You're hitting the ServicePoint.ConnectionLimit. If so, just increase this value. See also How can I programmatically remove the 2 connection limit in WebClient
You have some very facet-heavy requests or misusing Solr (e.g. not using filter queries). Check the qtime in the response. See the Solr performance wiki for more details.

Try setting this in .net.
ServicePointManager.Expect100Continue = false;
or this
ServicePointManager.SetTcpKeepAlive(true, 200000, 200000); - this sends requests to the server to keep the connection alive.

Related

Connection tab of Google Cloud SQL instance taking forever to load the console interface

I want to access my cloud database from my computer but the connection tab cannot load to finish so that I can enter my IPv6 address. This is the second time am experiencing this issue and my network is strong enough. It's now been 20 minutes, but still the three dots are just indicating progress that never ends.
The first time it happened I had to leave my computer and go for a walk. This really frustrates me since it's in production and rapid updates should not be delayed.
How can I fix this?
POSSIBLE CAUSE:
It happens after I re-open Mysql-workbench and it fails reason being my IPv6 has been changed possibly by my Internet Service Provider (ISP) (I dont know of other possible reasons). After Mysql-workbench fails, I go to the console to update the new one but this problem occurs.
I think Cloud SQL security (don't know exact name) is treating this a malicious access attempt hence initiating this weird delay for immediate subsequent access. If so, then this is purely impractical for b/s since my computer does not tell me that my IPv6 has changed, besides, that normal regular IPv6 updates can't be treated as malicious lest developers continue to suffer this issue.
EDIT: This time it finished loading after approximately 50 minutes.
Have you considered using the Cloud SQL proxy to connect to your instance instead of white-listing an IP? White-listing an IP can be insecure since it provides anyone on your network access, and inconvenient (as you have discovered) because if your IP changes you lose access.
The proxy uses a service account to provide authenticated access to your instance, so it will work regardless of your IP (as long as your service account has the correct permissions). Check out these instructions for a guide on starting it up.
(As a side note, it's a difficult problem to tell why your connectivity tab is failing to get load. It might be a browser add on or even a networking failure in your local network that is interfering. You can check the browser dev console to see if any errors appear)

Chrome network Timing , how to improve Content Download

I was checking for XHR calls timing in Chrome DevTools to improve slow requests but I found out that 99% of the response time is wasted on content download even though the content size is less than 5 KB and the application is running on localhost(Working on my local machine so no Network issues).
But when replaying the call using Replay XHR menu, the Content download period drops dramatically from 2.13 s to 2.11 ms(as shown in the screen shots below). Data is not cached at browser level.
Example of Call Timing
Same Example Replayed
Can someone explain why the content download timing is slow and how to improve it?
The Application is an ASP.NET mvc 5 solution combined with angularJS.
The Web Server Details:
- Windows Server 2012 R2
- IIS 8
Thank you in advance for your support!
I can't conclusively tell you the cause of this, but I can offer some variables that you can investigate, which might help you figure out what's going on.
Caching
I know you said that the data is not getting cached at the browser level, but I'd suggest checking that again. Because the fact that the initial request takes 2s, and then the repeat request only takes 2ms really does sound like caching.
How to check:
Go to Network panel.
Look at Size column for the request. If you see from memory or from disk cache, it was served from the cache.
Slow development server or machine
My initial thought was that you're doing more work on your development machine than it can handle. Maybe the server requires more resources than your machine can handle. Maybe you have a lot of other programs running and your memory / CPU is getting maxed.
How to check:
Run your app on a more powerful server and see if the pattern persists.
Frontend app is doing too much work
I'm not sure this last one actually makes sense, but it's worth a check. Perhaps your Angular app is doing a crazy amount of JS work during the initial request, and it's maxing out your CPU. So the entire browser is stalling when you make the initial request.
How to check:
Go to Performance panel.
Start recording.
Do the action that causes your app to make the initial request.
Stop recording.
Check the CPU chart. If it's completely maxed out, then your app is indeed doing a bunch of work.
Please leave a comment and let me know if any of these helped.
I have also been investigating this issue on Chrome (currently 91.0.4472.164) as the content download times appear to be vastly different based on the context of the download. When going directly to a resource or attempting to update rendered content as the result of a web call, the content download time can take up to 10x the duration when made from other client applications or when simply saving the data off as a variable in Chrome.
I created a quick, hacky Spring Boot web application that demonstrates the problem that I have made public on github: https://github.com/zielinskin/h2-training-simple
The steps in the readme should hopefully be sufficient to demonstrate the vast performance differences.
I believe that Chrome will need to resolve this performance issue as it has nothing to do with the webserver or ui framework being used.
The "Content Download" includes both the time taken to download the content and also the time for the server to upload the content. You can test out the following cases to see what is the cause. Usually it is a combination of all them.
Case 1: server delay
Assume running server and client on localhost with 0 network delay, and small data.
time0 client receives a response with header content-length = 20
time5 server > client: 10 bytes of data
time5 client receives data
Case 2: network delay
Use hard-coded dummy data to speed up server
time0 client receives a response with header content-length = 20
time0 server > client: 10 bytes of data
time5 client receives data
Case 3: client is too busy
Isolate the query by trying something like curl google.com -v in terminal to access the URL directly. You can use Chrome Dev tool and Firefox Dev tools to copy the request as shown below.

Nagios conditional checks

Currently I am monitoring my target windows hosts for a bunch of services (CPU, memory, disks, ssl certs, http etc). I'm using nsclient as the client that the nagios server will talk to.
My problem is that I deploy to those hosts three times every 24 hours. The deployment process requires the hosts to reboot. Whenever my hosts reboot I get nagios alerts for each service. This means a large volume of alerts, which makes it difficult to identify real issues.
Ideally I'd like to this:
If the host is down, don't send any alerts for the rest of the services
If the host is rebooting, this means that nsclient is not accessible. I want to only receive one alert (e.g CPU is not accessible) and mute everything else for a few minutes, so the host can finish booting and nsclient becomes available.
Implementing this would have me getting one email per host for each deployment. This is much better than everything turning red and me getting flooded with alerts that aren't worth checking (since they're only getting sent because the nagios client -nsclient- is not available during the reboot).
Got to love using a windows stack...
There are several ways to handle this.
If your deploys happens at the same time everyday:
1. you could modify your active time period to exclude those times (or)
2. schedule down time for your host via the Nagios GUI
If your deployments happen at different/random times, things become a bit harder to work-around:
1. when nrpe or nsclient is not reachable, Nagios will often throw an 'UNKNOWN' alert for the check. If you remove the 'u' option for the following entries:
host_notification_options [d,u,r,f,s,n]
service_notification_options [w,u,c,r,f,s,n]
That would prevent the 'UNKNOWN's from sending notifications. (or)
2. dynamically modify active checking of the impacted checks, by 'turning them off' before you start the deployment, and then 'turning them on' after the deployment. This can be automated using the Nagios 'external commands file'.
Jim Black's answer would work, or if you want to go even more in depth you can define dependencies with service notification escalation as described in the documentation below.
Escalating the alerts would mean that you could define: CPU/ssl etc check fail -> check host down -> Notifiy/don't notify.
Nagios Service Escalation (3.0)

OData HTTP400 Timeout Error

This is one of the most bizarre problems I've come across since I started using OData for my mobile apps. The OData server I've developed is backed by SQL Express 2008 and this combination has been installed on 50 different servers and/or PCs over the last 15 months. All 50 servers have been running stable with consistent function for large amounts of data.
A couple of days ago one of my clients contacted me indicating that my client app (running on iOS7) was having an odd error come up when POSTing data to their server. The error had an HTTP code of 400 and the error text is "The operation couldn't be completed. (Timeout error 400.)". My first question is: why is a timeout error coming back with a 400 code? Generally when I get timeouts (due to firewall, etc) they're in the 100x range. There is no indication in the event logs on the server of ANY problems occurring. My own logs (stored in the SQL database) show no error (which is odd because I'm using the generic exception catching method in my OData service to log any problems). I haven't got to the step of adding logging of all requests as yet.
The error is only being raised when posting one particular set of data. All other posts from the device are functioning perfectly. I got the client to re-install the app (deleting all data) and then to download the data set that was causing the error. The download worked fine. We began making changes to the data to replicate what the data looked like when the error occurred in incremental changes, posting the change to the server and observing the result. Most of the incremental changes work fine but certain combinations cause the error to occur. One of the increments involves a large volume of changes and that posts fine, but subsequent alteration of any of the objects (sometimes altering as little as 6 characters in a text field) cause the error to occur. And yet in some circumstances altering objects that have already been posted to the server works without a problem.
I wiped the service components from the server and undertook a fresh install. I shifted TCP ports in case 443 had another listener causing problems. I reset the server. None of these change the behaviour of the error.
My last ditch solution is to completely re-install IIS and .NET Framework but I'd obviously like to avoid this as it's not my server... The server is overseas from my current location so debugging isn't really an option. Hoping someone has an idea as to what I can do diagnostically to try and determine the source of this bizarre 'gremlin'.
Have you tried a more thorough traffic analysis using a tool like Fiddler? The "timeout" error does indeed seem odd and what stood from you post was that your server was "overseas". Could there be something with the "times" that are being used/generated, e.g. server time, local time, etc?
Just to confirm, the "same" exact set of data always fails? Can you replicate this via a remote debugger or via localhost? If so, can you turn on "verbose errors"?

App engine channel successfully created but unusable

I have been getting an intermittent problem using the App Engine channel API. For the most, maybe 90% of the time, everything works fine. But the remaining 10% of the time I get a channel that is unusable. Having looked at this code for months, I strongly believe that this problem is not due to a logical error. By unusable channel I mean that even though the client connects to it successfully, the server is not able to message it. Most of the operations involved on the client and server complete successfully:
On the server, I create a channel with a new client id unique to the session
The client fetches the corresponding token and connects to it
On the client, onOpen() is called on the channel socket
The one thing that doesn't succeed is the calling of /_ah/channel/connected for these defective channels. I've tried dozens of possible workarounds without success. Right now I deal with the problem by gracefully retrying till I succeed, but it would be really nice for it to work without these tricks.
I havent seen any code but from what you are saying could it be related to
Intermittent error code 400, description “” on client connecting to channel
I am using a kind of brute force loop messaging to all client sockets (even if they have been closed, its a bit redundant but the overhead seems low ) and haven't picked up any problems yet (I also havent tested it that well either)
It seems that they fixed a leak in the channels APi in the last release 1.8.2: https://code.google.com/p/googleappengine/issues/detail?id=9283
https://code.google.com/p/googleappengine/wiki/SdkForJavaReleaseNotes

Resources