I'm using SQL Server 2008 R2, trying to reverse-engineer an opaque application and duplicate some of its operations, so that I can automate some massive data loads.
I figured it should be easy to do -- just go into SQL Server Profiler, start a trace, do the GUI operation, and look at the results of the trace. My problem is that the filters aren't working as I'd expect. In particular, the "Writes" column often shows "0", even on statements that are clearly making changes to the database, such as INSERT queries. This makes it impossible to set a Writes >= 1 filter, as I'd like to do.
I have verified that this is exactly what's happening by setting up an all-inclusive trace, and running the app. I have checked the table beforehand, run the operation, and checked the table afterward, and it's definitely making a change to the table. I've looked through the trace, and there's not a single line that shows any non-zero number in the "Writes" column, including the line showing the INSERT query. The query is nothing special... Just something like
exec sp_executesql
N'INSERT INTO my_table([a], [b], [c])
values(@newA, @newB, @newC)',
N'@newA int,@newB int,@newC int', @newA=1, @newB=2, @newC=3
(if there's an error in the above, it's my typo here -- the statement is definitely inserting a record in the table)
I'm sure the key to this behavior is in the description of the "Writes" column: "Number of physical disk writes performed by the server on behalf of the event." Perhaps the server is caching the write, so it happens outside of the Profiler's purview. I don't know, and perhaps it's not important.
Is there a way to reliably find and log all statements that change the database?
Have you tried a server-side trace? It also documents reads and writes which, if I'm reading you correctly, is what you want to capture.
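Rather than filtering on the Writes column (it counts physical writes, and the pages an INSERT touches are usually flushed to disk later by a checkpoint or the lazy writer, so 0 is normal), you can capture every RPC:Completed and SQL:BatchCompleted event and filter the captured text afterward. Here is a minimal sketch of a server-side trace; the file path is a placeholder, and the event/column numbers are the standard trace IDs:
DECLARE @TraceID int, @maxfilesize bigint = 50, @on bit = 1;
-- Option 2 = TRACE_FILE_ROLLOVER; the .trc extension is added automatically.
EXEC sp_trace_create @TraceID OUTPUT, 2, N'C:\Traces\write_hunt', @maxfilesize, NULL;
-- Events: 10 = RPC:Completed, 12 = SQL:BatchCompleted.
-- Columns: 1 = TextData, 11 = LoginName, 12 = SPID, 14 = StartTime, 17 = Writes.
EXEC sp_trace_setevent @TraceID, 10, 1,  @on;
EXEC sp_trace_setevent @TraceID, 10, 11, @on;
EXEC sp_trace_setevent @TraceID, 10, 12, @on;
EXEC sp_trace_setevent @TraceID, 10, 14, @on;
EXEC sp_trace_setevent @TraceID, 10, 17, @on;
EXEC sp_trace_setevent @TraceID, 12, 1,  @on;
EXEC sp_trace_setevent @TraceID, 12, 11, @on;
EXEC sp_trace_setevent @TraceID, 12, 12, @on;
EXEC sp_trace_setevent @TraceID, 12, 14, @on;
EXEC sp_trace_setevent @TraceID, 12, 17, @on;
EXEC sp_trace_setstatus @TraceID, 1;  -- start the trace
SELECT @TraceID AS TraceID;           -- keep this id: status 0 stops the trace, status 2 deletes it
Afterwards, load the file with SELECT * FROM fn_trace_gettable(N'C:\Traces\write_hunt.trc', DEFAULT) and filter TextData for INSERT/UPDATE/DELETE instead of relying on the Writes counter.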
I'm issuing a fairly simple update of a single varchar column against a remote linked server - like this:
UPDATE Hydrogen.CRM.dbo.Customers
SET EyeColor = 'Blue'
WHERE CustomerID = 619
And that works fine when it is written as an ad-hoc query.
Parameterized queries bad
When we do what we're supposed to do, and have our SqlCommand issue it as a parameterized query, the SQL ends up being: (not strictly true, but close enough)
EXEC sp_executesql N'UPDATE [Hydrogen].[CRM].[dbo].[Customers]
SET [EyeColor] = @P1
WHERE [CustomerID] = @P5',
N'@P1 varchar(4),@P5 bigint',
'Blue',619
This parameterized form of the query ends up performing a remote scan against the linked server:
It creates a cursor on the linked server, and takes about 35 seconds to pull back 1.2M rows to the local server through a series of hundreds of sp_cursorfetch calls, each pulling down a few thousand rows.
Why, in the world, would the local SQL Server optimizer ever decide to pull back all 1.2M rows to the local server in order to update anything? And even if it was going to decide to pull back rows to the local server, why in the world would it do it using a cursor?
It only fails on varchar columns. If I try updating an INT column, it works fine. But this column is varchar - and it fails.
I also tried parameterizing the column as nvarchar, and it's still bad.
Every answer I've seen is actually a question:
"is the collation the same?"
"What if you change the column type?"
"Have you tried OPENQUERY?"
"Does the login have sysadmin role on the linked server?"
I already have my workaround: parameterized queries bad - use ad-hoc queries.
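(For what it's worth, here is a sketch of another workaround I have not verified against this setup: keep the parameter but push the whole statement to the remote side with a pass-through EXEC ... AT, which assumes the linked server has the RPC Out option enabled.)
-- Executed entirely on the linked server; ? is the pass-through parameter placeholder.
EXEC ('UPDATE CRM.dbo.Customers SET EyeColor = ? WHERE CustomerID = ?', 'Blue', 619) AT Hydrogen;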
I was hoping for an explanation of the thing that makes no sense. And hopefully if we have an explanation we can fix it - rather than workaround it.
Of course I can't reproduce it anywhere except the customer's live environment. So it is going to require knowledge of SQL Server to come up with an explanation of what's happening.
Bonus Reading
Stackoverflow: Remote Query is slow when using variables vs literal
Stackoverflow: Slow query when connecting to linked server
https://dba.stackexchange.com/q/36893/2758
Stackoverflow: Parameter in linked-server query is converted from varchar to nvarchar, causing index scan and bad performance
Performance Issues when Updating Data with a SQL Server Linked Server
Update statements causing lots of calls to sp_cursorfetch?
Remote Scan on Linked Server - Fast SELECT/Slow UPDATE
I have some code that writes Scrapy-scraped data to a SQL Server db. The data items consist of some basic hotel data (name, address, rating, ...) and a list of rooms with associated data (price, occupancy, etc.). There can be multiple celery threads and multiple servers running this code, simultaneously writing different items to the db. I am encountering deadlock errors like:
[Failure instance: Traceback: <class 'pyodbc.ProgrammingError'>:
('42000', '[42000] [FreeTDS][SQL Server]Transaction (Process ID 62)
was deadlocked on lock resources with another process and has been
chosen as the deadlock victim. Rerun the transaction. (1205) (SQLParamData)')
The code that actually does the insert/update schematically looks like this:
1) Check if the hotel exists in the hotels table; if it does, update it, else insert it as new.
Get the hotel id either way. This is done by `curs.execute(...)`.
2) Python loop over the hotel rooms scraped. For each room, check if the room exists
in the rooms table (which is foreign keyed to the hotels table).
If not, insert it using the hotel id to reference the hotels table row;
else update it. These upserts are done using `curs.execute(...)`.
It is a bit more complicated than this in practice, but this illustrates that the Python code is using multiple curs.executes before and during the loop.
If, instead of upserting the data in the above manner, I generate one big SQL command that does the same thing (checks for the hotel, upserts it, records the id in a temporary variable, then for each room checks whether it exists and upserts against the hotel id variable, etc.), and do only a single curs.execute(...) in the Python code, then I no longer see deadlock errors.
However, I don't really understand why this makes a difference, and I'm also not entirely sure it is safe to run big SQL blocks with multiple SELECTs, INSERTs and UPDATEs in a single pyodbc curs.execute. As I understand it, pyodbc is supposed to handle only single statements; however, it does seem to work, and I see my tables populated with no deadlock errors.
Nevertheless, it seems impossible to get any output if I do a big command like this. I tried declaring a variable @output_string and recording various things to it (whether we had to insert or update the hotel, for example) before finally doing SELECT @output_string as outputstring, but doing a fetch after the execute in pyodbc always fails with
<class 'pyodbc.ProgrammingError'>: No results. Previous SQL was not a query.
Experiments within the shell suggest pyodbc ignores everything after the first statement:
In [11]: curs.execute("SELECT 'HELLO'; SELECT 'BYE';")
Out[11]: <pyodbc.Cursor at 0x7fc52c044a50>
In [12]: curs.fetchall()
Out[12]: [('HELLO', )]
So if the first statement is not a query you get that error:
In [13]: curs.execute("PRINT 'HELLO'; SELECT 'BYE';")
Out[13]: <pyodbc.Cursor at 0x7fc52c044a50>
In [14]: curs.fetchall()
---------------------------------------------------------------------------
ProgrammingError Traceback (most recent call last)
<ipython-input-14-ad813e4432e9> in <module>()
----> 1 curs.fetchall()
ProgrammingError: No results. Previous SQL was not a query.
Nevertheless, except for the inability to fetch my @output_string, my real "big query", consisting of multiple selects, updates and inserts, actually works and populates multiple tables in the db.
On the other hand, if I try something like
curs.execute('INSERT INTO testX (entid, thecol) VALUES (4, 5); INSERT INTO testX (entid, thecol) VALUES (5, 6); SELECT * FROM testX; '
...: )
I see that both rows were inserted into the table testX, even though a subsequent curs.fetchall() fails with the "Previous SQL was not a query." error, so it seems that pyodbc's execute does execute everything, not just the first statement.
If I can trust this, then my main problem is how to get some output for logging.
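One thing that may help, assuming pyodbc really does run the whole batch: each INSERT/UPDATE in the batch produces a row-count "result" that the driver has to step past (via curs.nextset()) before it reaches the final SELECT. Starting the batch with SET NOCOUNT ON suppresses those intermediate results, so the final SELECT becomes the first result set and fetchall() works. A sketch of the shape of such a batch, with hypothetical table and column names:
SET NOCOUNT ON;                           -- suppress per-statement rowcount results
-- In practice these would be parameters bound from pyodbc; declared here so the sketch stands alone.
DECLARE @hotel_key nvarchar(100) = N'hotel-123',
        @name nvarchar(200) = N'Some Hotel',
        @output_string nvarchar(max) = N'',
        @hotel_id int;

SELECT @hotel_id = hotel_id FROM dbo.hotels WHERE hotel_key = @hotel_key;

IF @hotel_id IS NULL
BEGIN
    INSERT INTO dbo.hotels (hotel_key, name) VALUES (@hotel_key, @name);
    SET @hotel_id = SCOPE_IDENTITY();     -- assumes hotel_id is an IDENTITY column
    SET @output_string = @output_string + N'inserted hotel; ';
END
ELSE
BEGIN
    UPDATE dbo.hotels SET name = @name WHERE hotel_id = @hotel_id;
    SET @output_string = @output_string + N'updated hotel; ';
END

-- ...the per-room upserts go here in the same style, appending to @output_string...

SELECT @output_string AS outputstring;    -- now the first result set the driver sees
Alternatively, leave the batch as it is and call curs.nextset() in a loop until you reach a result set that has rows.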
EDIT Setting autocommit=True in the dbargs seems to prevent the deadlock errors, even with the multiple curs.executes. But why does this fix it?
Setting autocommit=True in the dbargs seems to prevent the deadlock errors, even with the multiple curs.executes. But why does this fix it?
When establishing a connection, pyodbc defaults to autocommit=False in accordance with the Python DB-API spec. Therefore when the first SQL statement is executed, ODBC begins a database transaction that remains in effect until the Python code does a .commit() or a .rollback() on the connection.
The default transaction isolation level in SQL Server is "Read Committed". Unless the database is configured to support SNAPSHOT isolation by default, a write operation within a transaction under Read Committed isolation will place transaction-scoped locks on the rows that were updated. Under conditions of high concurrency, deadlocks can occur if multiple processes generate conflicting locks. If those processes use long-lived transactions that generate a large number of such locks then the chances of a deadlock are greater.
Setting autocommit=True will avoid the deadlocks because each individual SQL statement will be automatically committed, thus ending the transaction (which was automatically started when that statement began executing) and releasing any locks on the updated rows.
So, to help avoid deadlocks you can consider a few different strategies:
continue to use autocommit=True, or
have your Python code explicitly .commit() more often, or
use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED to "loosen up" the transaction isolation level so reads no longer block on (or deadlock with) the locks held by write operations, or
configure the database to use SNAPSHOT isolation, which avoids most lock contention by using row versioning but makes SQL Server (and tempdb) work a bit harder (a sketch of the database-level settings follows below).
You will need to do some homework to determine the best strategy for your particular usage case.
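If the SNAPSHOT route sounds appealing, the switches live at the database level. A minimal sketch, with the database name being a placeholder:
-- Allow sessions to opt in with SET TRANSACTION ISOLATION LEVEL SNAPSHOT.
ALTER DATABASE HotelScrape SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Or make plain READ COMMITTED use row versioning so readers stop blocking on writers.
-- This needs exclusive access to the database; ROLLBACK IMMEDIATE kicks other sessions out.
ALTER DATABASE HotelScrape SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;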
About 5 times a year, one of our most critical tables has a specific column where all the values are replaced with NULL. We have run log explorers against this, and we cannot see any login/hostname populated with the update; we can just see that the records were changed. We have searched all of our sprocs, functions, etc. for any UPDATE statement that touches this table, on all databases on our server. The table does have a foreign key constraint on this column. It is an integer value that is established during an update, but the update is identity-key specific. There is also an index on this field. Any suggestions on what could be causing this, outside of a T-SQL UPDATE statement?
I would start by denying any client-side dynamic SQL if at all possible. It is much easier to audit stored procedures to make sure they execute the correct SQL, including a proper WHERE clause. Unless your SQL Server is terribly broken, the only way data is updated is because of the SQL you are running against it.
All stored procs, scripts, etc. should be audited before being allowed to run.
If you don't have the mojo to enforce no dynamic client SQL, add application logging that captures each client SQL statement before it is executed. Personally, I would have the logging routine throw an exception (after logging it) when a WHERE clause is missing, but at a minimum, you should be able to figure out where data gets blown out next time by reviewing the log. Make sure your log captures enough information that you can trace it back to the exact source. Assign a unique "name" to each possible dynamic SQL statement executed, e.g., assign a 3-char code to each program and then number each possible call 1..nn within the program, so you can tell which call blew up your data at "abc123" as well as the exact SQL that was defective.
ADDED COMMENT
Thought of this later. You might be able to add or modify the UPDATE trigger on the table to look at the number of rows updated and prevent the update if that number exceeds a threshold that makes sense for you. I did a little searching and found that someone has already written an article on this, as in this snippet:
CREATE TRIGGER [Purchasing].[uPreventWholeUpdate]
ON [Purchasing].[VendorContact]
FOR UPDATE AS
BEGIN
DECLARE @Count int
SET @Count = @@ROWCOUNT;
IF @Count >= (SELECT SUM(row_count)
              FROM sys.dm_db_partition_stats
              WHERE object_id = OBJECT_ID('Purchasing.VendorContact')
              AND index_id = 1)
BEGIN
RAISERROR('Cannot update all rows',16,1)
ROLLBACK TRANSACTION
RETURN;
END
END
Though this is not really the right fix, if you log this appropriately, I bet you can figure out what tried to screw up your data and fix it.
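One hedged way to make the trigger leave that log behind (the dbo.WholeUpdateAttempts table and its columns are hypothetical): do the audit insert after the ROLLBACK, so rolling back the offending UPDATE doesn't also wipe out the log row.
-- Hypothetical audit table for rejected mass updates.
CREATE TABLE dbo.WholeUpdateAttempts (
    AttemptTime  datetime      NOT NULL DEFAULT GETDATE(),
    LoginName    sysname       NOT NULL,
    HostName     nvarchar(128) NULL,
    AppName      nvarchar(128) NULL,
    RowsAffected int           NOT NULL
);
Then, inside the IF block of the trigger, in place of the original three statements:
RAISERROR('Cannot update all rows', 16, 1);
ROLLBACK TRANSACTION;   -- undoes the offending UPDATE; the trigger keeps running

-- This insert runs after the rollback, so it is not undone along with the UPDATE.
INSERT INTO dbo.WholeUpdateAttempts (LoginName, HostName, AppName, RowsAffected)
VALUES (SUSER_SNAME(), HOST_NAME(), APP_NAME(), @Count);
RETURN;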
Best of luck
A transaction log explorer should be able to show you who executed the command, when, and exactly what the command looked like.
Which log explorer do you use? If you are using ApexSQL Log you need to enable connection monitor feature in order to capture additional login details.
This might be like using a sledgehammer to drive in a thumb tack, but have you considered using SQL Server Auditing (provided you are using SQL Server Enterprise 2008 or greater)?
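A rough sketch of what that audit setup could look like; every name and path below is a placeholder:
USE master;
CREATE SERVER AUDIT Audit_CriticalTable
    TO FILE (FILEPATH = N'C:\AuditLogs\');
ALTER SERVER AUDIT Audit_CriticalTable WITH (STATE = ON);

USE YourDb;
CREATE DATABASE AUDIT SPECIFICATION Audit_CriticalTable_Updates
    FOR SERVER AUDIT Audit_CriticalTable
    ADD (UPDATE ON OBJECT::dbo.YourCriticalTable BY public)
    WITH (STATE = ON);

-- Later, read back who touched the table and with what statement:
SELECT event_time, server_principal_name, session_id, statement
FROM sys.fn_get_audit_file(N'C:\AuditLogs\*', DEFAULT, DEFAULT);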
Using SQL Server Management Studio.
How can I test the performance of a large select (say 600k rows) without the results window impacting my test? All things being equal it doesn't really matter, since the two queries will both be outputting to the same place. But I'd like to speed up my testing cycles and I'm thinking that the output settings of SQL Server Management Studio are getting in my way. Output to text is what I'm using currently, but I'm hoping for a better alternative.
I think this is impacting my numbers because the database is on my local box.
Edit: Had a question about doing WHERE 1=0 here (thinking that the join would happen but no output), but I tested it and it didn't work -- not a valid indicator of query performance.
You could do SET ROWCOUNT 1 before your query. I'm not sure it's exactly what you want, but it will avoid having to wait for lots of data to be returned and therefore give you accurate calculation costs.
However, if you add Client Statistics to your query, one of the numbers is "Wait time on server replies", which will give you the server calculation time, not including the time it takes to transfer the data over the network.
You can SET STATISTICS TIME ON to get a measurement of the time spent on the server, and you can use Query / Include Client Statistics (Shift+Alt+S) in SSMS to get detailed information about the client-side time usage. Note that SQL queries don't run to completion and then return the result to the client when finished; instead, they return results as they run, and will even suspend execution if the communication channel is full.
The only context in which a query completely ignores sending the result packets back to the client is activation. But then the time to return the output to the client should also be considered when you measure your performance. Are you sure your own client will be any faster than SSMS?
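For reference, a minimal way to wrap the query under test with those measurements (the table and column names are placeholders):
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT col1, col2
FROM dbo.BigTable;        -- the 600k-row query under test

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- CPU time, elapsed time, and logical reads show up on the Messages tab.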
SET ROWCOUNT 1 will stop processing after the first row is returned, which means that unless the plan happens to have a blocking operator, the results will be useless.
Taking a trivial example
SELECT * FROM TableX
The cost of this query in practice will heavily depend on the number of rows in TableX.
Using SET ROWCOUNT 1 won't show any of that. Irrespective of whether TableX has 1 row or 1 billion rows, it will stop executing after the first row is returned.
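To make that concrete (TableX and SomeUnindexedCol are hypothetical):
SET ROWCOUNT 1;

SELECT * FROM TableX;                            -- returns one row almost immediately, whatever the table size
SELECT * FROM TableX ORDER BY SomeUnindexedCol;  -- blocking sort: the full scan-and-sort cost is still paid

SET ROWCOUNT 0;                                  -- reset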
I often assign the SELECT results to variables to be able to look at things like logical reads without being slowed down by SSMS displaying the results.
SET STATISTICS IO ON
DECLARE @name nvarchar(35),
        @type nchar(3)
SELECT @name = name,
       @type = type
FROM master..spt_values
There is a related Connect Item request Provide "Discard results at server" option in SSMS and/or TSQL
The best thing you can do is to check the Query Execution Plan (press Ctrl+L) for the actual query. That will give you the best guesstimate for performance available.
I'd think that the WHERE 1=0 clause is definitely being evaluated on the SQL Server side, and not in Management Studio. No results would be returned.
Is your DB engine on the same machine that you're running Mgmt Studio on?
You could:
Output to Text or
Output to File.
Close the Query Results pane.
That'd just move the cycles spent on drawing the grid in Mgmt Studio. Perhaps Results to Text would be more performant on the whole. Hiding the pane would save the cycles Mgmt Studio spends drawing the data, but the data is still being returned to Mgmt Studio, so it really isn't saving a lot of cycles.
How can you test performance of your query if you don't output the results? Speeding up the testing is pointless if the testing doesn't tell you anything about how the query is going to perform. Do you really want to find out this dog of a query takes ten minutes to return data after you push it to prod?
And of course it's going to take some time to return 600,000 records. It will in your user interface as well; it will probably take longer there than in your query window, because the info has to go across the network.
There are a lot of more correct answers here, but I assume the real question is the one I just asked myself when I stumbled upon this question:
I have a query A and a query B on the same test data. Which is faster? And I want to check quick and dirty. For me the answer is temp tables (the overhead of creating a temp table here is easy to ignore). This is to be done on a perf/testing/dev server only!
Query A:
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS -- clear the buffer cache so both queries start cold
SELECT * INTO #temp1 FROM ...
Query B
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SELECT * INTO #temp2 FROM ...
I have a query
select * into NewTab from OpenQuery(rmtServer, 'select c1, c2 from rmtTab')
When I look at the execution plan, it tells me that it performs a 'Table Spool/Eager Spool' that 'stores the data in a temporary table to optimize rewinds'
Now I don't anticipate any rewinds. If there is a crash of some sort, I can just drop newTab and start over.
Is there any way I can stop it from storing the data in a temporary table?
It's probably the openquery causing it.
There is no information on how many rows there will be, no statistics, nothing, so SQL Server will simply spool the results so it can evaluate the later bits, I suspect. That's the basic idea.
I'd suggest separating the creation and the fill of NewTab.
By the way, rewind is not rollback. Rewind has nothing to do with transaction safety; it's SQL Server anticipating reuse of the rows. Which is correct, because the OPENQUERY is a black box.
Look near the bottom of this Simple Talk article for rewinds. You have a "Remote query".
Edit
Based on something I found only last week, look at sp_tableoption.
When used with the OPENROWSET bulk rowset provider to import data into a table without indexes, TABLOCK enables multiple clients to concurrently load data into the target table with optimized logging and locking.
Try TABLOCK on your fill. We had some fun with a client developer using .NET SqlBulkCopy and getting very bad performance.
Also this, from Kalen Delaney:
It's not intuitive.
Create NewTab first, and then do an INSERT INTO ... FROM OPENQUERY.
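A sketch of that shape, with the column names and types as placeholders for whatever rmtTab actually returns, and the TABLOCK hint from the earlier answer added to the fill:
-- Create the target table up front so the plan isn't built around a black-box rowset.
CREATE TABLE dbo.NewTab (c1 int NULL, c2 varchar(50) NULL);

-- Fill it separately; TABLOCK can enable bulk-style, minimally logged inserts.
INSERT INTO dbo.NewTab WITH (TABLOCK) (c1, c2)
SELECT c1, c2
FROM OPENQUERY(rmtServer, 'select c1, c2 from rmtTab');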