The application used by a group of 100+ users was made with VB6 and RDO. A replacement is coming, but the old one is still maintained. Users moved to a different building across the street and problems began. My opinion regarding the problem has been bandwidth, but I've had to argue with others who say it's database. Users regularly experience network slowness using the application, but also workstation tasks in general. The application moves large audio files and indexes them on occasion as well as others. Occasionally the database becomes hung. We have many top end, robust SQL Servers, so it is not a server problem. What I figured out is, a transaction is begun on a connection, but fails to complete properly because of a communication error. Updates from other connections become blocked, they continue stacking up, and users are down half a day. What I've begun doing the moment I'm told of a problem, after verifying the database is hung, is set the database to single user then back to multiuser to clear connections. They must all restart their applications. Today I found out there is a bandwidth limit at their new location which they regularly max out. I think in the old location there was a big pipe servicing many people, but now they are on a small pipe servicing a small number of people, which is also less tolerant of momentary high bandwidth demands.
What I want to know is exactly what happens to packets, both coming and going, when a bandwidth limit is reached. Also I want to know what happens in SQL Server communication. Do some packets get dropped? Do they start arriving more out of sequence? Do timing problems occur?
I plan to start controlling such things as file moves through the application. But I also want to know what configurations are usually present on network nodes regarding transient high demand.
This is a very broad question. Networking is very key (especially in Availability Groups or any sort of mirroring set up) to good performance. When transactions complete on the SQL server, they are then placed in the output buffer. The app then needs to 'pick up' that data, clear it's output buffer and continue on. I think (without knowing your configuration) that your apps aren't able to complete the round trip because the network pipe is inundated with requests, so the apps can't get what they need to successfully finish and close out. This causes havoc as the network can't keep up with what the apps and SQL server are trying to do. Then you have a 200 car pileup on a 1 lane highway.
Hindsight being what it is, there should have been extensive testing on the network capacity before everyone moved across the street. Clearly, that didn't happen so you are kind of left to do what you can with what you have. If the company can't get a stable networking connection, the situation may be out of your control. If you're the DBA, I highly recommend you speak to your higher ups and explain to them the consequences of the reduced network capacity. Often times, showing the consequences of inaction can lead to action.
Out of curiosity, is there any way you can analyze what waits are happening when the pileup happens? I'm thinking it will be something along the lines of ASYNC_NETWORK_IO which is usually indicative that SQL is waiting on the app to come back and pick up it's data.
Related
I have a web service sitting on IIS that has been quite happy for months but now I'm getting timeouts and I don't know how to diagnose what the problem is.
The client sends up basic information in a 'heartbeat' message to IIS which then updates this in a SQL database (on a different server). There are 250 clients in the wild, all sending up their heartbeat every 5 minutes ... so there's only 250 rows in the table, with appropriate indexing on the column being used for the update.
Ordinarily it only takes 50-100ms to do the update, but since last week you can see that the response time in the IIS log has increased and I'm also getting timeouts too.
Nothing has changed with the setup so I don't know what I'm looking for to determine the reason. The error I get back is:
System.ServiceModel.FaultException: An error occurred while updating
the entries. See the inner exception for details.An error occurred
while updating the entries. See the inner exception for
details.Execution Timeout Expired. The timeout period elapsed prior to
completion of the operation or the server is not responding. The
statement has been terminated.The wait operation timed out
Any advice on where to start looking? I did enable the failed request log trace in IIS but I don't know what it all means if I'm perfectly honest. The difference between a successful requiest and a failed one is that the request log stops after the 'AspNetStart' entry.
Thanks!
Mark
There are lots of reasons a service can gradually or suddenly become slow. Poor code structure can lead to things like memory leaks on the server, small enough they don't really show up or cause problems during testing, but when run over weeks/months start to stack up. Unauthorized requests could be targeting your server if this is a public-facing service, or has a link to public-facing services.
Things to look at:
Does this happen at certain times of the day or throughout the day?
Is this a load issue that starts occurring when multiple users are sending updates concurrently? 250 users isn't a lot. Has the # of users grown over the last few months or has it been relatively stable since the start?
What is the memory and CPU usage looking like on the Web server(s) and DB server?
This is the first clue to check to see if either server is under considerable load. From there you can investigate why it might be under load or if it possibly needs a bit more grunt to deal with the load. Look at the running processes. If these servers are managed by an IT department or such some culprits can include things like Virus Scanners hogging resources. (I.e. policy changes in the last few months have lead to additional load on the servers)
What recovery model is your database set up for?
What is the size of your Tx Log (.mdx file)
Do you have a regular scheduled database backup and index maintenance?
This is one that new projects tend to forget. An empty database is small and has no Tx Log history being recorded, but as it runs over time that Tx Log grows silently in the background, especially with Full recovery. Larger Tx Logs can lead to slower performance over time especially if the log file needs to be enlarged. A good thing to check is whether the log file is set to grow by a # of bytes or percentage. Percentage is I believe the default but this can cause exponential "grow" time/space issues so it's better to set it to a fixed size per grow. You'll want regular backups being done that allow the Tx Log to reset. Ideally don't shrink the file if the Log size between backups stays consistent.
How many records across all tables are being inserted or updated in a given day?
This is important to build a picture of how much the database will be tracking through the day between backups. You may have 250 clients, but every heartbeat is potentially updating a row and inserting others.
What are you using for PKs for inserted records? (Ints vs. UUIDs) If using UUIDs are you using NEWSEQUENTIALID() or NEWID()/Guid.New()?
GUIDs can be a time bomb for indexing if done poorly. A GUID combined with NEWID() or Guid.New() will lead to considerable index fragmentation when inserting rows. Provided the GUIDs are not visible to clients you should use NEWSEQUENTIALID(). If IDs are set via code then there are implementations you can find to generate sequential GUIDs. (It's a matter of re-arranging the parts that make up the GUID) Regular index maintenance is a requirement for using UUID columns in indexed fields.
Are you using Dependency Injection in your web service?
What is the lifetime scope of the DbContexts performing the updates?
This is a potential time bomb for web servers if the lifetime scope for a DbContext is set up incorrectly. You want a DbContext to be alive for no longer than it is needed. At a maximum the lifetime scope should be set to PerRequest. A DbContext set up for Singleton for instance would be tracking entities across requests. The more entities a DbContext is tracking, the slower read and update operations become. This would be a possible culprit if the web server memory usage is climbing.
Are you running an SQL Profiler?
In a test environment with nothing else touching the database, running scenarios through the application with an SQL Profiler can reveal potential issues such as unexpected queries being kicked off due to things like lazy loading. For one operation you might expect one or a small number of queries to be run, only to find dozens or even hundreds. Multiply this across concurrent requests and you have a recipe for the database server to say "Just sit down and wait, dammit!" :) Any queries you don't expect based on the code that is running should be investigated for either eager loading relationships or implementing projection. (Recommended for best performance)
Do the web servers get restarted periodically?
For some tricky to debug issues and memory leaks, sometimes the easiest "fix" is to schedule regular restarts of the web server. It's a hack, but compared to the considerable cost of trying to track down memory leaks or fix up inefficient code that slows down over time, it is a cheap and effective fix. (At least while you do research options to address the issues and optimize the code)
That should give you a start into things to check with the service & database.
I have inherited a legacy Delphi application that uses ADO to connect to SQL Server.
The application has a notion of a "Global Connection" -- that is a single connection that it opens at the start, and then keeps open all throughout the running of the application (which can be days, weeks, or longer....)
So my question is this: Should I keep this way of doing things or should I switch to a "connect-query-disconnect" mode of doing things? Does it matter?
Switching would be a non-trivial task, but I'll do it if it means better performance, data management, etc.
Well, it depends on what you're expecting to get out of it, and what kind of application it is.
There's nothing in particular wrong with using a single long-running connection, as long as the application can gracefully handle disconnections and recover or log/notify when it can't reconnect.
The problem with a connect-query-disconnect setup is that you're adding the overhead of connecting and disconnecting on every query. That's going to slow things down, and in an interactive GUI application users may notice the additional overhead. You also have to make sure that authorization is transparently handled if it isn't already.
At the same time, there may be interactive performance gains to be had if you can push all the queries off onto background threads and asynchronously update the GUI. If contention appears because the queries are serialized, you can migrate to a connection-pool system fairly readily as well and improve things even more. This has a fairly high complexity cost to it though, so now you're looking to balancing what the gains are compared to the work involved.
Right now, my ultimate response is "if it ain't broke, don't fix it." Changes along the lines you propose are a lot of work -- how much do the users of this application stand to gain? Are there other problems to solve that might benefit them more?
Edit: Okay, so it's broke. Well, slow at least, which is all the same to me. If you've ruled out problems with the SQL Server itself, and the queries are performing as fast as they can (i.e. DB schema is sane, the right indexes are available, queries aren't completely braindead, server has enough RAM and fast enough I/O, network isn't flaky, etc.), then yes, it's time to find ways to improve the performance of the app itself.
Simply moving to a connect-query-disconnect is going to make things worse, and the more queries you're issuing the bigger the drop off is going to be. It sounds like you're going to need to rearchitect the app so that you can run fewer queries, run them in the background, cache more aggressively on the client, or some combination of all 3.
Don't forget the making the clients perform better means that server side performance gets more important since it's probably going to be handling a higher load if clients start making multiple connections and issuing multiple queries in parallel.
As mr Frazier told before - the one global connection is not bad per se.
If you intend to change, first detect WHAT is the problem. Let's see some scenarios:
1
Some screens(IOW: an set of 1..n forms to operate in a business entity) are slow. Possible causes:
insuficient filtering resulting in a pletora of records being pulled from database without necessity.
the number of records are ok, but takes too much to render it. Solution: faster controls or intelligent rendering (ex.: Virtual list views)
too much queries each time you open an screen. Possible solutions: use TClientDatasets (or any in-memory dataset) to hold infrequently modified lookup tables. An more sophisticated cache for more extensive tables or opening those datasets in other threads can improve response times.
Scrolling on datasets with controls bound can be slow (just to remember, because those little details can be easily forgotten).
2
Whole app simply slows down. Checklist:
Network cards are ok? An few net cards mal-functioning can wreak havoc even on good structured networks as they create unnecessary noise on the line.
[MSSQL DBA HAT ON] The next on the line of attack is SQL Server. Ask the DBA to trace blocks and deadlocks. Register slow queries and work on them speed up. This relate directly to #1.1 and #1.3
Detect if some naive developer have done SELECT inside transactions. In read committed isolation, it's just overhead, as it'll create more network traffic. Open the query, retrieve the data and close the dataset.
Review the database schema, if you can.
Are any data-bound operations on a bulk of records (let's say, remarking the price of some/majority/all products) being done on the app? Make an SP or refactor the operation on an query, it'll be much faster and will reduce the load of the entire server.
Extensive operations on a group of records? Learn how to do that operations at once on the server instead of one-by-one record. See an examination of most used alternatives on the MSSQL MVP Erland Sommarskog's article on array and list on MSSQL.
Beware of queries with WHERE like : WHERE SomeFunction(table1.blabla) = #SomeParam . Most of time, that ones will not use an index causing to read the entire table to select the desired data. If is a big table.... Indexing on a persisted computed columns can make miracles...[MSSQL HAT OFF]
That's what I can think of without a little more detail... ;-)
Database connections are time consuming resources to create and the rule of thumb should be create as little as possible and reuse as much as possible. That's why some other technologies have database connection pools, which are typically established at application/service startup and then kept as long as possible and shared among threads.
From your comment, the application has performances issues, but it's difficult without more details to make any recommendation.
Should try to nail down what is slow - are all queries slow or just some specific ones?
If just some specific ones is there some correlation.
My 2 cents.
I have a very limited experience of database programming and my applications that access databases are simple ones :). Until now :(. I need to create a medium-size desktop application (it's called rich client?) that will use a database on the network to share data between multiple users. Most probably i will use C# and MSSQL/MySQL/SQLite.
I have performed a few drive tests and discovered that on low quality networks database access is not so smooth. In one company's LAN it's a lot of data transferred over network and servers are at constant load, so it's a common situation that a simple INSERT or SELECT SQL query will take 1-2 minutes or even fail with timeout / network error.
Is it any best practices to handle such situations? Of course i can split my app into GUI thread and DB thread so network problems will not lead to frozen GUI. But what to do with lots of network errors? Displaying them to user too often will be not very good :(. I'm thinking about automatic creating local copy of a database on each computer my app is running: first updating local database and synchronize it in background, simple retrying on network errors. This will allow an app to function event if network has great lags / problems.
Any hints and buzzwords what can i look into? Maybe it's some best practices already available that i don't know :)
Sorry this is prob not the answer you are looking for but you mention that a simple insert / update could take 1-2 minutes or even fail with timeout / network error.
This to me sounds like there may be another problem rather than the network itself. If your working on a corporate network there would have to be insane levels of traffic for this sort of behavior. I would do everything in your power to look at improving the network before proceeding. Can you post the result of a ping to the db box?
If your going to architect your application around this type of network it will significantly alter the end product and even possibly result in a poor quality product for other clients.
Depending upon the nature of the application maybe look at implementing an async persistence queue and caching data on startup or even embedding a copy of the db into your application.
Even though async behaviour/queues/caching/copying the database to each local instance etc will help solve the symptoms, the problem will still remain. If the network really is that bad then I'd address it with their I.T. department, or the project manager and build some performance requirement from their side of things into the contract.
With a distributed application, where you have lots of clients and one main server, should you:
Make the clients dumb and the server smart: clients are fast and non-invasive. Business rules are needed in only 1 place
Make the clients smart and the server dumb: take as much load as possible off of the server
Additional info:
Clients collect tons of data about the computer they are on. The server must analyze all of this info to determine the health of these computers
The owners of the client computers are temperamental and will shut down the clients if the client starts to consume too many resources (thus negating the purpose of the distributed app in helping diagnose problems)
You should do as much client-side processing as possible. This will enable your application to scale better than doing processing server-side. To solve your temperamental user problem, you could look into making your client processes run at a very low priority so there's no noticeable decrease in performance on the part of the user.
In a client-server setting, if you care about security, you should always program on the assumption that the client may have been compromised. Even if it hasn't, there is always the risk of somebody using an old version of the client, using a competing or modified version of the client, or just of the net connection being a bit screwy.
So while you do as much work on the client as possible, processing and marshalling information into the right form, the server then needs to do a thorough sanity check on anything the client gives it.
So the answer I guess is "both".
The server must analyze all of this
info to determine the health of these
computers
That is probably the biggest clue so far explaning what your application is kinda about. Are you able to provide a more elaborate briefing on what this application is seeking to achieve in this distributed environment? We do not even know if the client-side processing is disk I/O or processor intensive. How you design the solution is dependent on the nature of what needs to be done to help the users/business accomplish their jobs and objectives.
A discussion about Singletons in PHP has me thinking about this issue more and more. Most people instruct that you shouldn't make a bunch of DB connections in one request, and I'm just curious as to what your reasoning is. My first thought is the expense to your script of making that many requests to the DB, but then I counter myself with the question: wouldn't multiple connections make concurrent querying more efficient?
How about some answers (with evidence, folks) from some people in the know?
Database connections are a limited resource. Some DBs have a very low connection limit, and wasting connections is a major problem. By consuming many connections, you may be blocking others for using the database.
Additionally, throwing a ton of extra connections at the DB doesn't help anything unless there are resources on the DB server sitting idle. If you've got 8 cores and only one is being used to satisfy a query, then sure, making another connection might help. More likely, though, you are already using all the available cores. You're also likely hitting the same harddrive for every DB request, and adding additional lock contention.
If your DB has anything resembling high utilization, adding extra connections won't help. That'd be like spawning extra threads in an application with the blind hope that the extra concurrency will make processing faster. It might in some certain circumstances, but in other cases it'll just slow you down as you thrash the hard drive, waste time task-switching, and introduce synchronization overhead.
It is the cost of setting up the connection, transferring the data and then tearing it down. It will eat up your performance.
Evidence is harder to come by but consider the following...
Let's say it takes x microseconds to make a connection.
Now you want to make several requests and get data back and forth. Let's say that the difference in transport time is negligable between one connection and many (just ofr the sake of argument).
Now let's say it takes y microseconds to close the connection.
Opening one connection will take x+y microseconds of overhead. Opening many will take n * (x+y). That will delay your execution.
Setting up a DB connection is usually quite heavy. A lot of things are going on backstage (DNS resolution/TCP connection/Handshake/Authentication/Actual Query).
I've had an issue once with some weird DNS configuration that made every TCP connection took a few seconds before going up. My login procedure (because of a complex architecture) took 3 different DB connections to complete. With that issue, it was taking forever to log-in. We then refactored the code to make it go through one connection only.
We access Informix from .NET and use multiple connections. Unless we're starting a transaction on each connection, it often is handled in the connection pool. I know that's very brand-specific, but most(?) database systems' cilent access will pool connections to the best of its ability.
As an aside, we did have a problem with connection count because of cross-database connections. Informix supports synonyms, so we synonymed the common offenders and the multiple connections were handled server-side, saving a lot in transfer time, connection creation overhead, and (the real crux of our situtation) license fees.
I would assume that it is because your requests are not being sent asynchronously, since your requests are done iteratively on the server, blocking each time, you have to pay for the overhead of creating a connection each time, when you only have to do it once...
In Flex, all web service calls are automatically called asynchronously, so you it is common to see multiple connections, or queued up requests on the same connection.
Asynchronous requests mitigate the connection cost through faster request / response time...because you cannot easily achieve this in PHP without out some threading, then the performance hit is greater then simply reusing the same connection.
that's my 2 cents...