OData HTTP400 Timeout Error - sql-server

This is one of the most bizarre problems I've come across since I started using OData for my mobile apps. The OData server I've developed is backed by SQL Express 2008, and this combination has been installed on 50 different servers and/or PCs over the last 15 months. All 50 servers have been running stably and functioning consistently with large amounts of data.
A couple of days ago one of my clients contacted me indicating that my client app (running on iOS7) was having an odd error come up when POSTing data to their server. The error had an HTTP code of 400 and the error text was "The operation couldn't be completed. (Timeout error 400.)". My first question is: why is a timeout error coming back with a 400 code? Generally when I get timeouts (due to a firewall, etc.) they're in the 100x range. There is no indication in the event logs on the server of ANY problems occurring. My own logs (stored in the SQL database) show no error, which is odd because I'm using the generic exception catching method in my OData service to log any problems. I haven't yet got to the step of adding logging of all requests.
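When I do get to that step, the plan is roughly the following, a minimal sketch assuming the service is a WCF Data Services DataService<T> (the service, entity, and LogRequest names below are placeholders for my own types and SQL logging helper):

using System.Data.Services;

public class MyODataService : DataService<MyEntities>
{
    // Called for every incoming request, so a failing POST gets recorded even
    // if the generic exception handler never fires.
    protected override void OnStartProcessingRequest(ProcessRequestArgs args)
    {
        LogRequest(args.OperationContext.RequestMethod,
                   args.OperationContext.AbsoluteRequestUri.ToString());
        base.OnStartProcessingRequest(args);
    }

    // Placeholder: in the real service this would write to the SQL log table.
    private static void LogRequest(string method, string uri)
    {
        System.Diagnostics.Trace.WriteLine(method + " " + uri);
    }
}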
The error is only being raised when posting one particular set of data. All other posts from the device are functioning perfectly. I got the client to re-install the app (deleting all data) and then download the data set that was causing the error. The download worked fine. We then made incremental changes to the data to recreate the state it was in when the error occurred, posting each change to the server and observing the result. Most of the incremental changes work fine, but certain combinations trigger the error. One of the increments involves a large volume of changes and that posts fine, but subsequent alteration of any of the objects (sometimes altering as little as 6 characters in a text field) causes the error to occur. And yet in some circumstances altering objects that have already been posted to the server works without a problem.
I wiped the service components from the server and did a fresh install. I shifted TCP ports in case 443 had another listener causing problems. I restarted the server. None of these changed the behaviour of the error.
My last-ditch solution is to completely re-install IIS and the .NET Framework, but I'd obviously like to avoid this as it's not my server... The server is overseas from my current location, so on-site debugging isn't really an option. I'm hoping someone has an idea as to what I can do diagnostically to try and determine the source of this bizarre 'gremlin'.

Have you tried a more thorough traffic analysis using a tool like Fiddler? The "timeout" error does indeed seem odd, and what stood out from your post was that your server is "overseas". Could there be something with the "times" that are being used/generated, e.g. server time vs. local time?
Just to confirm: does the exact same set of data always fail? Can you replicate this via a remote debugger or via localhost? If so, can you turn on "verbose errors"?
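If it is WCF Data Services under the hood (which your SQL Express + IIS setup suggests), verbose errors are usually switched on in the service initialization; a rough sketch, with placeholder service/entity names:

using System.Data.Services;
using System.Data.Services.Common;

public class MyODataService : DataService<MyEntities>
{
    public static void InitializeService(DataServiceConfiguration config)
    {
        // Return full error details to the client instead of the generic
        // error payload, which should make the HTTP 400 self-explanatory.
        config.UseVerboseErrors = true;
        config.DataServiceBehavior.MaxProtocolVersion = DataServiceProtocolVersion.V2;
        config.SetEntitySetAccessRule("*", EntitySetRights.All);
    }
}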

Related

Invalid JWT token happens randomly when using Snowflake+DBT

We've been using Snowflake+DBT with key pair authentication for a long time now, and we've never had any issues.
Recently, we started getting random connection errors on some models:
250001 (08001): Failed to connect to DB: account.region.snowflakecomputing.com:443. JWT token is invalid.
Most of the models will work, but some of them might fail. This can happen to any model, and it's never the same one -- sometimes it's the very first one, sometimes it's the very last one, and sometimes it's a bunch of them. It happens in runs with lots of models, or in runs with a single model.
It's very inconsistent, and there doesn't seem to be any kind of pattern to it. Sometimes it works, and sometimes it doesn't.
We've also tried with 1, 4, or 8 threads, and it happens regardless.
Obviously, there's nothing wrong with the credentials or configurations — otherwise, nothing would run at all. So I assume there must be something wrong with how DBT is handling the connection(s).
Interestingly, the errors only happen locally (so far). We haven't seen them in DBT Cloud runs.
DBT version is 0.20.2 in both cases. We tried other versions (0.21.0, 0.20.0 & 0.19.1), and the issue persists. I don't know why we're only encountering this now, as we've used these other versions previously without any issues.
It's similar to this question, except in our case it doesn't happen consistently at all. We tried connecting "without a region" (using Snowflake Organizations), but it doesn't make any difference:
250001 (08001): Failed to connect to DB: organization-account.snowflakecomputing.com:443. JWT token is invalid.
Is there anything we can do to resolve this?
EDIT: When the error happens, the model hangs for 60 seconds until the error appears.
EDIT 2: I think the error might have started happening when we started using the DBT-provided Docker images. Not sure exactly what might be wrong with them, but we'll try going back to our own custom images and see if that works.
This has since stopped happening. 🤷‍♂️
Possibly because of this change from Snowflake:
"An improvement has been added to Snowflake's Cloud Services layer to relax the validity restrictions."
Also, we are now using different Docker containers, so that might have something to do with it as well:
First, we switched to xemuliam/dbt, from Docker Hub.
But since that is no longer maintained, we are now using the official DBT Docker images from GitHub (which were not available at the time the question was posted):
dbt-core
dbt-postgres
dbt-snowflake

Network access was interrupted

The Access database just needs to be open, and it will usually crash within 20-40 minutes, resulting in the following error message:
Your network access was interrupted. To continue, close the database, and then open it again.
More details:
The database is split, with the back end and front end on a server. The computers are then connected to the server via LAN (ethernet).
Although there are multiple computers connected to the server, the database only has one user at a time.
The database has been fine for almost a year, until this week where this error has started occurring.
We never have connectivity issues with the server.
I have seen several answers saying it is:
the database's fault, as it is starting to corrupt
the server's fault, as it is broken, dropping my connection briefly
Microsoft's fault, and they should patch it
I am hoping this is a problem with the database itself, as I am not responsible for the server.
Does anyone have a definitive solution?
I have recently experienced the same problem, and it all started when I moved my DB to an external disk. The same DB was working just fine on the local disk, or on the previous external disk. So I am guessing it is just a bug that has to do with the drive letter changing or something like that.
The problem sounds like an unstable LAN connection OR changes to the LAN environment (e.g. new hardware or changes to admin settings) causing increased latency.
If you have forms in the FE bound to BE tables, the latency can cause the connection to be severed, resulting in the error you see.
I'm not a network admin but the main culprits I've seen are:
Users connecting to the network over a VPN on an unstable connection (cell phones, crappy Wi-Fi, or just bad ISP service).
Network admins capping persistent connections to a share causing disconnects.
Unstable network hardware or bad hardware configuration.
"Switching" between wired and wireless LAN connections.
I don't think the issue is the database itself, other than having forms bound to a BE database, which is more of a fundamental design problem than anything else.
Good luck!
I use Access 2010. I had the same issue but solved it in the following way.
On the External Data ribbon, go to the Import & Link group and click on Linked Table Manager.
Click on Select All.
Click on OK to refresh the links.
In cases where the path of the back-end database file has changed, browse to the new location and select the new path. This will also refresh the links and should solve the problem. It did for me.
You wrote, "The database has been fine for almost a year, until this week where this error has started occurring."
Clearly something has recently changed for this to be happening, and without narrowing the field of possibilities it's anyone's guess. However, in my experience Jet DB crashes when two or more users are accessing and editing the same record(s) at the same time. So, if you've recently added new users, this is a possibility.
Note: Jet is a file-server DB, not a client-server DB, which means the app was probably designed for a specific number of front-end users. Without knowing more, I would start there.
I resolved my issue when I figured out that I had an offline directory set up and the sync was having an issue. I turned off the sync, tested it, and the error went away.

Linq-To-Sql and MARS woes - A severe error occurred on the current command. The results, if any, should be discarded

We have built a website based on the design of the Kigg project on CodePlex:
http://kigg.codeplex.com/releases/view/28200
Basically, the code uses the repository pattern, with a repository implementation based on Linq-To-Sql. Full source code can be found at the link above.
The site has been running for some time now and just about a year ago we started to get errors like:
There is already an open DataReader associated with this Command which must be closed first.
ExecuteNonQuery requires an open and available Connection. The connection's current state is closed.
These are the closest error examples I can find based on my memory. These errors started to occur when the site traffic started to pick up. After banging my head against the wall, I assumed that the problem was inherent in Linq-To-Sql and in how we were using the same connection to call multiple commands in a single web request.
Eventually, I discovered MARS (Multiple Active Result Sets) and added it to the data context's connection string and, like magic, all of my errors went away.
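For anyone hitting the same errors, enabling MARS was just a connection string change; a minimal sketch of what that looks like for a Linq-To-Sql data context (the server, database, and context names here are placeholders):

// Placeholder server/database/context names; the important part is the
// MultipleActiveResultSets=True keyword in the connection string.
var connectionString =
    "Data Source=MyServer;Initial Catalog=MyDatabase;Integrated Security=True;" +
    "MultipleActiveResultSets=True";

using (var db = new StoryDataContext(connectionString))
{
    // multiple commands can now have open result sets on this one connection
}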
Now, fast forward about 1 year and the site traffic has increased tremendously. Every week or so, I will get an error in SQL Server that reads:
A severe error occurred on the current command. The results, if any, should be discarded
Immediately after this error, I receive hundreds to thousands of InvalidCastException errors in the error logs. Basically, this error shows up for each and every call to the Linq-To-Sql data context. Only after I restart the web server do these errors clear up.
I read a post on the Microsoft Support site that described my problem (minus the InvalidCastException errors) and stated that if I'm going to use MARS I should also use Asynchronous Processing=True. I tried this, but it did not solve my problem either.
Not really sure where to go from here. Hopefully someone here has seen and solved this problem before.
I have the same issue. Once the errors start, I have to restart the IIS Application Pool to fix.
I have not been able to reproduce the bug in dev despite trying many different scenarios involving multi-threading, leaving connections open, etc etc.
One possible lead I do have is that amongst the errors in the server Event Log is an OutOfMemoryException for the Application Pool. Perhaps this is the underlying cause of the spurious SQL Datareader errors (a memory leak elsewhere). Although again I haven't been able to reproduce this in dev.
Obviously if you are using a 64 bit OS then this is probably not the cause in your case.
So after much refactoring and re-architecting, we figured out that the problem all along was MARS (Multiple Active Result Sets) itself. We're not sure why or what happens exactly, but MARS somehow gets result sets mixed up and doesn't recover until the web app is restarted.
We removed MARS and the errors stopped.
If I remember correctly, we added MARS to solve a problem where the connection/command was already closed by LinqToSql and we tried to access an object graph that hadn't been loaded. Without MARS we'd get an error, but with MARS it seemed not to care. This is really a great example of us not really understanding what the heck we were doing, and we learned some valuable (and expensive) lessons from it.
Hope this helps others who have experienced this.
Thanks to all who have contributed their comments and answers.
I understand you figured out the solution.
The following is not a direct solution to the problem, but it may be worth a look for others:
What does "A severe error occurred on the current command. The results, if any, should be discarded." SQL Azure error mean?
http://social.msdn.microsoft.com/Forums/en-US/bbe589f8-e0eb-402e-b374-dbc74a089afc/severe-error-in-current-command-during-datareaderread

Intermittent connection timeouts to Solr server using SolrNet

I have a production webserver hosting a search, and another machine which hosts the Solr search server (on a subnet which is in the same room, so no network problems). All is fine >90% of the time, but I consistently get a small number of The operation has timed out errors.
I've increased the timeout in the SolrNet init to 30 seconds (!)
SolrNet.Startup.Init<SolrDataObject>(
    new SolrNet.Impl.SolrConnection(
        System.Configuration.ConfigurationManager.AppSettings["URL"]
    ) { Timeout = 30000 }
);
...but all that happened is that I started getting this message instead of the Unable to connect to the remote server error I was seeing before. It seems to have made no difference to the number of timeout errors.
I can see nothing in any log (believe me, I've looked!) and clearly my configuration is correct because it works most of the time. Does anyone have any ideas on how I can find more information on this problem?
EDIT:
I have now increased the number of HttpRequest connections from 2 to 'a large number' (I see up to 10 connections) - but this has had no discernible effect on this problem.
The firewall is set to allow ANY connections between the two machines.
We've also checked the hardware with our server host and there are no problems on the connections, according to them.
EDIT 2:
We're still seeing this issue.
We're now logging the timeouts and they're mostly just over 30s, which is the SolrNet layer's timeout; some are 20s though, which is the Tomcat default timeout period. This suggests it's something in the actual connection between the machines.
Not sure where to go from here, though - they're on a VLAN and we're specifically using the VLAN address - response time from pings is ALWAYS <1ms.
Without more information, I can only guess a few possible reasons:
You're fetching tons of documents per query, and it times out while transferring data.
You're hitting the ServicePoint.ConnectionLimit. If so, just increase this value (see the sketch after this list). See also How can I programmatically remove the 2 connection limit in WebClient
You have some very facet-heavy requests or misusing Solr (e.g. not using filter queries). Check the qtime in the response. See the Solr performance wiki for more details.
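If the connection limit turns out to be the culprit, raising it is a one-line change at application start; a minimal sketch (the value 100 is an arbitrary example, and Application_Start in Global.asax is just one convenient place for it):

// In Global.asax.cs
using System.Net;

public class Global : System.Web.HttpApplication
{
    protected void Application_Start()
    {
        // Allow more concurrent outgoing HTTP connections per host than the
        // default of 2; 100 is an arbitrary example value.
        ServicePointManager.DefaultConnectionLimit = 100;
    }
}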
Try setting this in .NET:
ServicePointManager.Expect100Continue = false;
or this:
ServicePointManager.SetTcpKeepAlive(true, 200000, 200000);
The latter enables TCP keep-alives, so packets are periodically sent to the server to keep the connection alive.

Site update, testing was fine, after deployment, again fine, once user load increases, FAIL?

We are using ASP.NET MVC with LINQ to SQL. We added some features and tested them all to perfection on our QA box. We are using Windows Server 2003 and SQL Server 2005. So when we pushed out changes to the Live web server we also used Red Gate SQL Compare to push new database changes to the LIVE database. We tested again between the few of us, no problems. Time for bed.
The morning comes and users are starting to hit the app, and BOOM. We have no idea why this would happen, as we were not doing anything in the new code that we had not done before. However, we did notice during the SQL Compare sync that the names of all the foreign keys were different between the two databases (not the IDs in the tables), e.g. FK_AssetAsset_A0EB67 versus FK_AssetAsset_B67EF8 (I don't remember the exact trailing characters from the SQL Compare). We are not sure why, but that is another variable in this problem.
Strangely once this was all pushed out we could then replicate the errors on QA, but not before everything was pushed to LIVE.
QA and LIVE databases are on the same SQL Server, but the apps are on different instances of Windows Server 2003.
Errors generated:
Index was outside the bounds of the array.
Invalid attempt to call FieldCount when reader is closed.
Server failed to resume the transaction.
There is already an open DataReader associated with this Command which must be closed first.
A transport-level error has occurred when sending the request to the server.
A transport-level error has occurred when receiving results from the server.
Invalid attempt to call Read when reader is closed.
Invalid attempt to call MetaData when reader is closed.
Count must be positive and count must refer to a location within the string/array/collection. Parameter name: count
ExecuteReader requires an open and available Connection. The connection's current state is connecting.
Anyone have any idea what the heck could have happened?
EDIT: Since we were able to replicate the errors all of a sudden on QA, it might not be a user load issue... Needless to say we all feel really screwed here.
Concurrency always brings bugs out of the woodwork. I'd recommend you check for objects that could be shared among requests (such as static members and singletons) and refactor your code so that as little as possible is shared.
As far as specifics go, for the error "There is already an open DataReader associated with this Command which must be closed first," you may want to try adding MultipleActiveResultSets=True to your connection strings.
It sounds like you're crossing the streams a bit and trying to share DataContexts across requests. My suggestion would be to wire in a dependency injection framework that creates a new instance of the dependency for each request.
I use Castle's IoC and wire it into the controller factory so that when it sees a dependency on a repository it creates a new instance of that repository for each request. If you go this route, let me know and I can shoot you a few more resources.
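A rough sketch of that wiring (not the exact code; assumes ASP.NET MVC 2+ and Castle Windsor, and ISearchRepository/SearchRepository are placeholder names for whatever repository the controllers depend on):

using System;
using System.Web.Mvc;
using System.Web.Routing;
using Castle.Windsor;

public class WindsorControllerFactory : DefaultControllerFactory
{
    private readonly IWindsorContainer container;

    public WindsorControllerFactory(IWindsorContainer container)
    {
        this.container = container;
    }

    protected override IController GetControllerInstance(RequestContext requestContext, Type controllerType)
    {
        if (controllerType == null)
            return base.GetControllerInstance(requestContext, controllerType);

        // Windsor builds the controller and injects a fresh repository for this request.
        return (IController)container.Resolve(controllerType);
    }

    public override void ReleaseController(IController controller)
    {
        container.Release(controller);
    }
}

// In Application_Start (Windsor 3.x registration syntax; the per-web-request
// lifestyle also needs the PerWebRequestLifestyleModule registered in web.config):
// var container = new WindsorContainer();
// container.Register(
//     Classes.FromThisAssembly().BasedOn<IController>().LifestyleTransient(),
//     Component.For<ISearchRepository>().ImplementedBy<SearchRepository>().LifestylePerWebRequest());
// ControllerBuilder.Current.SetControllerFactory(new WindsorControllerFactory(container));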
