I have a somewhat reproducible error which I am no longer willing to bear, so I hope some of you might know a better workaround.
I have several larger cubes (around 10-50 GB) which I process daily. Processing only the partitions (about 20% of them) takes about 1 h when I use an XMLA script, i.e. processing dimensions and measures in parallel mode.
This works in only 2 out of 5 runs.
So I have a procedure which detects the crash and, if it happened, starts serial processing instead, which runs about 2-5 times slower but at least works every time.
The error codes are very generic and not very helpful:
Operation canceled; HY008
Communication link failure; 08S01; TCP Provider: An existing connection was forcibly closed by the remote host.
And since it works every time in serial mode, I know that there are no errors in principle.
I am using
Microsoft SQL Server 2014 - 12.0.2000.8 (X64)
Enterprise Edition: Windows NT 6.3 (Build 9600:)
Please share any ideas on how to solve (or work around) this.
I am thankful for every new insight or idea you might have into this behaviour.
It looks like a concurrency issue and/or heavy SQL Server workload (e.g. caused by the effectively unlimited threads spawned by the SSAS server). I suggest setting explicit maximum values for the relevant parameters:
ThreadPool\Process\MaxThreads = 4*cores
DB Number of Connections = 2*cores-1 (based on my own practice). You can change this right before and after the processing task if you need a large number of connections outside of processing time.
You could also experiment with affinity masks, but tuning the previous parameters should be enough.
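If the failures are load-related, you can also cap parallelism inside the processing batch itself rather than relying only on the server-wide settings. A minimal sketch of such an XMLA batch - the database, cube, measure-group and partition IDs are hypothetical placeholders:

    <Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <!-- Process at most 4 objects concurrently instead of letting the server decide -->
      <Parallel MaxParallel="4">
        <Process>
          <Object>
            <DatabaseID>MyOlapDb</DatabaseID>
            <CubeID>MyCube</CubeID>
            <MeasureGroupID>MyMeasureGroup</MeasureGroupID>
            <PartitionID>MyPartition_2015</PartitionID>
          </Object>
          <Type>ProcessData</Type>
        </Process>
      </Parallel>
    </Batch>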
This article
http://phoebix.com/2014/07/01/what-you-need-to-know-about-ssas-processor-affinity/
and this book
http://msdn.microsoft.com/en-us/library/hh226085.aspx
describe the whole technique in detail.
UPDATE:
There is also a small possibility of wrong TCP settings, described here: http://blogs.msdn.com/b/cindygross/archive/2009/10/22/sql-server-and-tcp-chimney.aspx
But this may cause failures even in serial processing.
SQL Server 2008 Enterprise SP4 (10.0.6547.0) x64
Running on Windows 2012 R2, patched current.
A VM running on Cisco UCS blades under VMware ESXi 6.0 Update 3 plus patches.
A Nimble CS700 SAN for the storage.
This is a large OLTP server with 12 vCPUs. Normal CPU usage hovers around 6-11%.
What happens is that, without warning, the IO stall times go through the roof (1,000-2,000 ms) and most queries stop returning results. Adam Machanic's sp_whoisactive shows dozens of active queries. CPU is at 90+%.
SAN shows almost zero activity and all other VMs on the same SAN are operating optimally.
We see massive blocking as the stalled processes hold locks, with some timing out and sleeping while the locks remain held on the SPID. Killing the SPIDs in question provides temporary relief, but seconds later we are right back where we started.
The only thing that provides relief is a reboot of the server.
Management is rightly demanding an actual root cause. When this happened last summer, with visibility at the CEO level, we engaged Microsoft support, who were dumbfounded and offered no actual root cause.
What I can't do is upgrade the SQL Server. The machine hosts a packaged application, and the package publisher refuses to support their software if we implement any newer SQL Server version. I desperately want to go to 2014/2016/2017, and I feel that it would solve this problem and others.
In any event, I searched the bug reports and did not see anything that matched.
Has anyone run into this issue? If so, did you suss out a root cause? My gut feeling is that there is a bug in SQL 2008, Windows 2012 R2, or how they interact, but I don't want to write that into the RCA without some corroboration.
I would appreciate any pointers.
Here is my approach:
1.) Try to eliminate storage issues. We once had a storage issue (SAN) and the root cause turned out to be a faulty HBA. You can further check whether your storage is performing within acceptable limits.
You should start with the counters below and see if they stay under 15 ms:
Avg. Disk sec/Read - the average time, in seconds, of a read of data from the disk.
Avg. Disk sec/Write - the average time, in seconds, of a write of data to the disk.
There is more info here: https://www.mssqltips.com/sqlservertip/2460/perfmon-counters-to-identify-sql-server-disk-bottlenecks/
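As a cross-check without Perfmon, here is a sketch against the virtual file stats DMV (2005+; the numbers are cumulative since instance startup, so compare two samples for current rates):

    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
    ORDER BY avg_read_ms DESC;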
2.) Once you have eliminated storage issues, check whether SQL Server is the only process causing IO spikes or whether other applications contribute as well. You can use Resource Monitor to find this out.
3.) If you have reached this point, SQL Server may be the culprit. Go through the steps below in the same sequence and see whether the problem persists after each step.
Remember that high IO can be caused by:
Stale stats and missing indexes: you might not be updating stats regularly, or some types of queries might need more frequent index rebuilds/stats updates.
Gather the queries causing high IO and try tuning them; observe the number of reads they do and try adding indexes to minimize it.
Also check memory pressure: sometimes high memory usage causes the buffer pool to be flushed, and queries then go to disk. You can look at a counter called PLE (Page Life Expectancy) and see what value is good for your environment - a query for it is sketched below.
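A sketch for reading PLE straight from the DMVs instead of Perfmon:

    -- Current Page Life Expectancy, in seconds
    SELECT [object_name], cntr_value AS page_life_expectancy_sec
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page life expectancy'
      AND [object_name] LIKE '%Buffer Manager%';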
Further research pointed to VMware. The machine was allocated 304 GB of RAM, 264 GB of which was assigned to SQL Server. However, the underlying host was overcommitted on RAM by a large amount. We suspect thrashing as Page Life Expectancy drops and other VMs also demand real RAM.
Thanks
John.
We are testing JDBC drivers from jTDS and Microsoft, and we are suffering from unwanted pauses in query execution. Our application opens many ResultSets and fetches only a few rows from each. Each query selects about 100k rows, but we fetch only about 50 (which is enough to fill a page). The problem is that every query after the first contains a pause of about 2 s, during which the driver loads all rows from the previous ResultSet into temporary storage (memory or disk) so they can be traversed later. Because we have about 6 queries in the worst scenarios, there will be a pause of about 10 s, which makes the application unresponsive to the user. The MSSQL version is 2005.
To remove such pauses, we've tried to enable MARS (Multiple Active Result Sets) via connection string parameters of Microsoft JDBC driver (due to lack of documentation, we tried everything that is listed on https://sites.google.com/site/sqlconnect/sql2005strings). Example of connection string:
jdbc:sqlserver://TESTDBMACHINE;instanceName=S2005;databaseName=SampleDB;MarsConn=yes
But none of them solves the problem. The Microsoft JDBC driver seems to accept anything in the connection string - if you replace MarsConn=yes with PleaseBeFast=yes, the driver ignores the parameter and doesn't even log the fact. I don't know whether MARS is a client-only feature that just caches rows from a previously active result set, or a server feature. I don't even know how to detect, from the server side, whether a given connection is using MARS. If you can comment on this, it will be welcome.
Another solution for the pause was to use scrollable (bidirectional) result sets. This removes the pause, but it makes fetching 80% slower and consumes more network bandwidth. We are now considering implementing a JDBC connection wrapper that keeps a pool of actual connections and automatically issues queries on distinct "ResultSet-free" connections. But this is somewhat cumbersome, because we need to keep a link between each connection and its active ResultSet. It would also consume more connections from the server and may cause trouble for DBAs. And this solution doesn't help when there is an active transaction, in which case all queries must be issued on the same connection.
Do you know some parameter, configuration, specific API, link or trick that can remove the pause from the second and subsequent query executions?
Fix your SQL queries! Why select 100k rows when you only use the first 50 or so? Use TOP 100 or something like that! There is no reason the application should be filtering 100k rows; that is the job of the database.
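A sketch of the idea (table and columns are hypothetical):

    -- Ask the server for one page's worth of rows
    -- instead of selecting 100k and reading 50.
    SELECT TOP (50) OrderID, CustomerID, OrderDate
    FROM dbo.Orders
    ORDER BY OrderDate DESC;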
Far more important than your client woes is what happens on the server side. Since you issue queries and then stop reading the results, the server has to suspend the query in the middle of execution because the network buffers are full and it has no room to write the results into. A query suspended in the middle of execution consumes a lot of resources: memory, locks and, most importantly, a worker thread (there are very few of these lying around).
Issue queries for only the data you need, consume all the data, and free the connection and, more importantly, the server resources. If your queries are too complex, go back to the drawing board and redesign your data model to answer, efficiently, the queries you're requesting of it. Right now you are totally barking up the wrong tree: you're simply asking how to make a bad situation worse.
We've created an ODBC data source using the SQL Server driver (Native Client 10 - sqlncli10.dll). This ODBC data source was configured to enable MARS (under the key HKEY_CURRENT_USER\Software\ODBC\ODBC.INI\<datasource>, the value of MARS_Connection must be Yes). Then we used Sun's JDBC-ODBC bridge, and voilà! The pauses were gone, and surprisingly, fetch time became faster than with the jTDS and Microsoft pure Java JDBC drivers!
According to http://msdn.microsoft.com/en-us/library/ms345109(SQL.90).aspx, MARS is a server-aided feature. The pure Java drivers (from jTDS and Microsoft) don't seem to support server-based MARS (at least we couldn't enable it after many configuration changes). Because most of our user base on MSSQL Server runs on Windows (no surprise), we are about to make the switch from jTDS to the JDBC-ODBC bridge. Both the Native Client ODBC driver and the JDBC-ODBC bridge seem to be mature, full-featured and up-to-date solutions, so I guess there should be no problems. If you know of any, please comment!
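If you want to confirm from the server side that MARS is actually in use, here is a sketch against the SQL 2005 DMVs - logical MARS sessions show up with net_transport = 'Session' and a parent_connection_id pointing at the physical connection:

    SELECT session_id, connection_id, parent_connection_id, net_transport
    FROM sys.dm_exec_connections
    WHERE net_transport = 'Session';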
Linux-based users will still use jTDS. Since we now know that MARS is a server-aided feature, we'll file feature requests with jTDS and Microsoft to support MARS in their pure Java JDBC drivers.
For a while now we've been seeing anecdotal slowness on our newly minted (VMware-based) SQL Server 2005 database servers. Recently the problem has come to a head and I've started looking for the root cause of the issue.
Here's the weird part: on the stored procedure that I'm using as a performance test case, I get a 30x difference in execution speed depending on which DB server I run it on. This is using the same database (mdf) and log (ldf) files, detached, copied, and reattached from the slow server to the fast one. This doesn't appear to be a (virtualized) hardware issue: the slow server has 4x the CPU capacity and 2x the memory of the fast one.
As best as I can tell, the problem lies in the environment/configuration of the servers (either operating system or SQL Server installation). However, I've checked a bunch of variables (SQL Server config options, running services, disk fragmentation) and found nothing that has made a difference in testing.
What things should I be looking at? What tools can I use to investigate why this is happening?
Blindly checking variables and settings won't get you very far. You need to approach this methodically.
Are the two procedures executed the same way? Namely, is the plan different? A quick check is to SET STATISTICS IO ON and run the two cases. Is the number of logical reads the same? Is the number of physical reads the same? Is the number of writes the same? Differences in logical reads or writes would indicate a different plan. Differences in physical reads (while logical reads are similar) indicate cache and memory problems. If the plans are different, you need to further investigate what is different in the actual execution plan. Does one plan use a different degree of parallelism? Does one use different join types? Different access paths?
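For example (the procedure name is a placeholder for your test case):

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    EXEC dbo.usp_SlowProcedure;  -- run the same call on both servers

    -- The Messages tab then reports logical reads, physical reads,
    -- CPU time and elapsed time to compare between the two machines.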
If the plans are similar yet the execution is still different, and you cannot blame the IO subsystem, then you need to check contention. Use SET STATISTICS TIME ON and compare the elapsed time and worker time in the two cases. Similar worker time but different elapsed time indicates that there is more waiting in one case. Use the wait_type and wait_resource info in sys.dm_exec_requests to identify the cause of contention.
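A sketch for that last step, run from a second session while the slow execution is in flight:

    SELECT session_id, status, command, wait_type, wait_time, wait_resource
    FROM sys.dm_exec_requests
    WHERE session_id > 50;   -- filter out system sessions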
The methodology of investigation is discussed in more detail in the Waits and Queues whitepaper.
Run SQL Server Profiler to gather information about running processes within SQL Server. This is probably the best start. This will give you a good idea of the things that are consuming a lot of resources.
If you still have issues after Indexing / Rebuilding Indexes, or rewriting queries, then the next step would be to run PerfMon.
We are having trouble with a Java web application running within Tomcat 6 that uses JDBC to connect to a SQL Server database.
After a few requests, the application server dies, and in the log files we find exceptions related to database connection failures.
We are not using any connection pooling right now, and we are using the standard JDBC-ODBC bridge driver to connect to SQL Server.
Should we consider using connection pooling to eliminate the problem?
Also, should we change our driver to something like jTDS?
That is the correct behavior if you are not closing your JDBC connections.
You have to call the close() method of each JDBC resource when you are finished using it and the other JDBC resources you obtained with it.
That goes for Connection, Statement/PreparedStatement/CallableStatement, ResultSet, etc.
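A minimal sketch of that discipline using try-with-resources (Java 7+); the server name, credentials and query are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class JdbcCleanupExample {
        private static final String URL =
            "jdbc:sqlserver://localhost;databaseName=SampleDB"; // placeholder

        public static void main(String[] args) throws SQLException {
            // Resources are closed in reverse order of creation,
            // even if an exception is thrown mid-loop.
            try (Connection con = DriverManager.getConnection(URL, "user", "password");
                 PreparedStatement ps = con.prepareStatement("SELECT name FROM sys.tables");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            } // nothing is left holding server resources here
        }
    }

(On Java 6 and earlier, the equivalent is nested try/finally blocks closing each resource in reverse order.)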
If you fail to do that, you are hoarding potentially huge and likely very limited resources on the SQL server, for starters.
Eventually, connections will not be granted, and attempts to execute queries and return results will fail or hang.
You could also notice your INSERT/UPDATE/DELETE statements hanging if you fail to commit() or rollback() at the conclusion of each transaction, when you have not set the autoCommit property to true.
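A sketch of that transaction discipline (assumes an open java.sql.Connection and a hypothetical accounts table; imports from java.sql as above):

    static void transferFunds(Connection con) throws SQLException {
        con.setAutoCommit(false);
        try (Statement stmt = con.createStatement()) {
            stmt.executeUpdate("UPDATE accounts SET balance = balance - 100 WHERE id = 1");
            stmt.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 2");
            con.commit();    // ends the transaction and releases its locks
        } catch (SQLException e) {
            con.rollback();  // never leave the transaction dangling
            throw e;
        }
    }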
What I have seen is that if you apply the rigor mentioned above to your JDBC client code, then JDBC and your SQL server will work wonderfully smoothly. If you write crap, then everything will behave like crap.
Many people write JDBC calls expecting "something" else to release each thing by calling close() because that is boring and the application and server do not immediately fail when they leave that out.
That is true, but those programmers have written their programs to play "99 bottles of beer on the wall" with their server(s).
The resources will become exhausted and requests will tend to result in one or more of the following happening: connection requests fail immediately, SQL statements fail immediately or hang forever or until some godawful lengthy transaction timeout timer expires, etc.
Therefore, the quickest way to solve these types of SQL problems is not to blame the SQL server, the application server, the web container, JDBC drivers, or the disappointing lack of artificial intelligence embedded in the Java garbage collector.
The quickest way to solve them is to shoot the guy who wrote the JDBC calls in your application that talk to your SQL server with a Nerf dart. When he says, "What did you do that for...?!" Just point to this post and tell him to read it. (Remember not to shoot for the eyes, things in his hands, stuff that might be dangerous/fragile, etc.)
As for connection pooling solving your problems... no. Sorry, connection pools simply speed up the call to get a connection in your application by handing it a pre-allocated, perhaps recycled connection.
The tooth fairy puts money under your pillow, the Easter bunny puts eggs & candy under your bushes, and Santa Claus puts gifts under your tree. But, sorry to shatter your illusions - the SQL server and JDBC driver do not close everything because you "forgot" to close all the stuff you allocated yourself.
I would definitely give jTDS a try. I've used it in the past with Tomcat 5.5 with no problems. It seems like a relatively quick, low impact change to make as a debugging step. I think you'll find it faster and more stable. It also has the advantage of being open source.
In the long term, I think you'll want to look into connection pooling for performance reasons. When you do, I recommend having a look at c3p0. I think it's more flexible than the built in pooling options for Tomcat and I generally prefer "out of container" solutions so that it's less painful to switch containers in the future.
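For what it's worth, here is a minimal c3p0 setup sketch - the jTDS driver class and URL format are real, but the host, database and credentials are placeholders:

    import java.sql.Connection;
    import com.mchange.v2.c3p0.ComboPooledDataSource;

    public class PoolExample {
        public static void main(String[] args) throws Exception {
            ComboPooledDataSource pool = new ComboPooledDataSource();
            pool.setDriverClass("net.sourceforge.jtds.jdbc.Driver");        // jTDS
            pool.setJdbcUrl("jdbc:jtds:sqlserver://dbhost:1433/SampleDB");  // placeholder
            pool.setUser("user");
            pool.setPassword("password");
            pool.setMaxPoolSize(20);

            try (Connection con = pool.getConnection()) {
                // use the connection; close() just returns it to the pool
            }
            pool.close();
        }
    }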
It's hard to tell really because you've provided so little information on the actual failure:
"After a few requests, the application server dies and in the log files we find exceptions related to database connection failures."
Can you tell us:
- exactly what the error is that you're seeing
- a small example of the code where you connect and service one of your requests
- whether it fails after a consistent number of transactions, or seemingly at random
I have written a lot of database-related Java code (pretty much all my code is database-related), and have used the MS driver, the jTDS driver, and the one from jNetDirect.
I'm sure if you provide us more details we can help you out.
I'm currently experiencing some problems on my DotNetNuke SQL Server 2005 Express site on Win2k8 Server. It runs smoothly most of the time. However, occasionally (on the order of once or twice an hour) it runs very slowly indeed - from a user perspective it's almost as if there's a deadlock of some description when this occurs.
To try to work out what the problem is I've run SQL Profiler against the SQL Express database.
Looking at the results, some specific questions I have are:
The SQL trace shows an Audit Logon and Audit Logoff for every RPC:Completed - does this mean Connection Pooling isn't working?
When I look in Performance Monitor at ".NET CLR Data", then none of the "SQL client" counters have any instances - is this just a SQL Express lack-of-functionality problem or does it suggest I have something misconfigured?
The queries running when the slowness occurs don't seem unusual - they run fast at other times. What other perfmon counters or other trace/log files can you suggest as useful tools for my further investigation?
Jumping straight to Profiler is probably the wrong first step. First, try checking the Perfmon stats on the server. I've got a tutorial online here:
http://www.brentozar.com/perfmon
Start capturing those metrics, and then after it's experienced one of those slowdowns, stop the collection. Look at the performance metrics around that time, and the bottleneck will show up. If you want to send me the csv output from Perfmon at brento#brentozar.com I can give you some insight as to what's going on.
You might still need to run Profiler afterwards, but I'd rule out the OS and hardware first. Also, just a thought - have you checked the server's System and Application event logs to make sure nothing's happening during those times? I've seen instances where, say, the antivirus client downloads new patches too often, and does a light scan after each update.
My spidey sense tells me that you may have SQL Server blocking issues. Read this article to help you monitor blocking on your server and check whether it's the cause.
If you think the issues may be performance-related and want to see what your hardware bottleneck is, then you should gather some CPU, disk and memory stats using perfmon and then correlate them with your profiler trace to see whether the slow response is related.
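A quick sketch for spotting blocking chains while the slowdown is happening (SQL 2005 DMVs):

    SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time,
           t.text AS running_sql
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0;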
No.
Nothing wrong with that... it shows that you're not using the .NET functionality embedded in SQL Server.
You can check http://www.xsqlsoftware.com/Product/xSQL_Profiler.aspx for a more detailed analysis of the profiler trace. It has reports that show the top queries by time or CPU (not one single query, but the sum of all executions of a single query).
Some other things to check:
- Make sure your data files or log files are not auto-extending.
- Make sure your anti-virus is set to ignore your SQL data and log files.
- When looking at the profiler output, be sure to check the queries that finished just prior to your targets; they could have been blocking.
- Make sure you've turned off Auto-close on the database; re-opening after closing takes some time.
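To verify the first and last of those points quickly, a sketch against the catalog views:

    -- Databases with auto-close or auto-shrink enabled
    SELECT name, is_auto_close_on, is_auto_shrink_on
    FROM sys.databases;

    -- File auto-growth settings; frequent small grows stall queries
    SELECT DB_NAME(database_id) AS db, name, growth, is_percent_growth
    FROM sys.master_files;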