SQL Server Using TableDiff on large tables

We have a process which uses SQL Server's amazing tableDiff via:
Microsoft SQL Server\100\COM\Tablediff.exe
It's SQL Server 2008 R2. It connects from one instance to another identical instance. It works very well!
I have a situation where a job on a table that now has 10,767,594 records takes 2.5 hours to complete, and that table is the only one in the job. How can I improve this?
The process is triggered by a Windows Scheduled Task, which calls a .bat file containing the recommended code, which has no issues. We have a couple of these in place and have had for some time. It's just the one job that deals with the big table from instance to instance that is taking too long.
I have realised that the source table does have an index but the destination table does not. I will put an index on this table, what else can I do?
Does table diff run better with indexes?
Is there a way to use table diff more effectively?
E.g. if I capture the lastProcessedID can I run tableDiff next time for all records where id > lastProcessedID?
Any advice would be great. Thank you in advance
EDITED:
MY SOLUTION - This was a very big surprise. As I mentioned above, the 10 million+ record table was identical on the source and destination except for 2 indexes (on the source). After waiting until out of hours, since this is an internal production server, I applied the indexes to the destination. I then ran the tableDiff job, which had not been changed at all, and it completed in under 2 minutes. 2.5 hours to 2 mins!
I have accepted the answer below because it was very helpful. I did go down the Merge Replication path; however, after setting up replication and publishing, I found out that the production instance could not be a subscriber because replication was not ticked on install. As Jason says, it's a reasonable amount of research, learning and setting up. Since I am not a DBA and had not looked at this before, it was a worthwhile experience.

The performance issue is because the remote queries pull every record from each place to do the comparison to generate the output. Indexes can help slightly to make the pull a little faster from each location, but it's not likely to be significant.
An incremental approach is definitely better. I don't believe tablediff directly supports comparing two queries. If it did, you could do something like EXCEPT or INTERSECT to do the comparisons. If you're trying to keep these databases in sync, why not consider other solutions, like log shipping, mirroring, SSIS, replication, clustering, etc.?
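To make the incremental idea concrete, here is a minimal sketch of an EXCEPT comparison restricted to rows above a saved lastProcessedID. It is not tablediff and not T-SQL: it uses Python's built-in sqlite3 with two tables in one in-memory database as a stand-in for the two instances, and the names source_table, dest_table, payload and diff_new_rows are all made up for illustration. It also assumes the id column only ever increases, so changes or deletes to older rows would not be noticed.

```python
import sqlite3


def diff_new_rows(conn, last_processed_id):
    """Return source rows above the watermark that are missing or different in the destination."""
    query = """
        SELECT id, payload FROM source_table WHERE id > ?
        EXCEPT
        SELECT id, payload FROM dest_table WHERE id > ?
        ORDER BY id
    """
    return conn.execute(query, (last_processed_id, last_processed_id)).fetchall()


# Hypothetical stand-in data: both "instances" live in one in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE dest_table (id INTEGER PRIMARY KEY, payload TEXT);
""")
conn.executemany("INSERT INTO source_table VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
conn.executemany("INSERT INTO dest_table VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "stale")])

# Only rows with id > 2 are compared; rows 1 and 2 are never read.
missing = diff_new_rows(conn, 2)
```

The comparison only touches rows above the watermark, which is where the saving would come from on a 10 million row table. On real SQL Server instances the two sides would have to be reachable from one query (for example through a linked server), which is an assumption this sketch does not cover.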

Related

What are the best practices for auto index recommendations in SQL

I am reviewing a SQL Server 2008 R2 instance with 30+ databases with the goal of moving to SQL Server 2014. In reviewing this I found a SQL job that a previous employee implemented. The job utilizes a set of scripts from this article, https://www.sqlservercentral.com/forums/topic/indexing-views-1, to automatically create and drop all recommended indexes every half hour, 24/7. When this was implemented the databases were roughly 40 GB, but they have since grown to over 1 TB as we are a highly transactional company, with one of the databases running our primary ERP/ordering system. From everything I understand about indexing, this seems like a terrible idea, as it could be creating and dropping indexes on very large tables. Is this a good practice, or am I missing something?

Found this in a related post
"How to use it? Run AutoIndex.sql to install the SPs and sql agent
job. Upon every 30 minutes, the sql agent job will run the auto create
index and auto drop index scripts to make recommendations. Same
recommendation will not be stored multiple times, instead we just bump
up the count and change the latest recommendation time. You can view
the recommendations using the simple commands in the
viewrecommendations.sql. Look for the recommendations in the
recommendation table that have high counts, which means they have been
repetitively recommended thus are more valuable. You can also look at
the initial recommendation time and the last recommendation time to
get a sense of the freshness and the time range this recommendation is
valid for. After you made a decision to implement a recommendation,
simply run execute_recommendation with the recommendation id and the
recommendation will be implemented automatically." Thank U Snehal
According to that user, the proc you have in your system should just aggregate index recommendations over time and allow you to see which indexes are constantly being recommended.
I believe the important distinction here is that SQL Server doesn't log how many times it suggests a particular index, so you may get a suggested index based on a one-off query, which probably isn't something you want to implement. Instead, you run this for a period, see what's being recommended frequently, and create those indexes.
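The "bump up the count and change the latest recommendation time" behaviour described in the quote can be sketched roughly as follows. This is plain Python, not the AutoIndex scripts themselves; the Recommendation class, the record and frequent helpers, and the (table, columns) key are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Recommendation:
    count: int
    first_seen: datetime
    last_seen: datetime


def record(store, index_key, seen_at):
    """Store a recommendation once; repeats only bump the count and the last-seen time."""
    rec = store.get(index_key)
    if rec is None:
        store[index_key] = Recommendation(1, seen_at, seen_at)
    else:
        rec.count += 1
        rec.last_seen = seen_at


def frequent(store, min_count):
    """Return keys recommended at least min_count times, most frequent first."""
    keys = [key for key, rec in store.items() if rec.count >= min_count]
    return sorted(keys, key=lambda key: -store[key].count)


# Hypothetical recommendations collected over three half-hour runs.
store = {}
record(store, ("Orders", "CustomerId"), datetime(2020, 1, 1, 9, 0))
record(store, ("Orders", "CustomerId"), datetime(2020, 1, 1, 9, 30))
record(store, ("Orders", "CustomerId"), datetime(2020, 1, 1, 10, 0))
record(store, ("Invoices", "DueDate"), datetime(2020, 1, 1, 9, 30))

worth_reviewing = frequent(store, 2)
```

The one-off recommendation stays in the store with a count of 1 and is simply filtered out when reviewing, which matches the idea of only acting on indexes that keep being suggested.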

Two separate instances of SQL Server running a different explain plan

Here's one I need help from the SQL administrators out there. I have two separate SQL Server instances on Amazon EC2. One is our staging environment, and the other is our production environment, but they are configured exactly the same way (spawned from the same image).
We had a database that we copied from staging to our production environment last week. The way we copy a db to production is we take a backup of it on our staging site, and restore the backup in production. Anyways, we found that in production, one particular complex query was timing out after an hour, but that exact query in our staging environment completed in 10 minutes.
The explain plans on both were almost the same, except that on one server it was doing a PK scan on a large table (8M rows), and on the other server it was doing an index seek. We're assuming this was the difference. So one server was doing a lot of disk IO, and the other was not.
So my question is, what are the reasons that one installation of SQL server would decide to use an index, while another one ignores it--assuming same versions of SQL server, and same data set? Even better, what are the best ways to find out why SQL is ignoring an index?

SQL Server uses statistics to determine the query execution plan.
Normally, they should be the same on the same datasets, but there is a chance of outdated statistics on one of the machines.
Use sp_updatestats to update statistics on both machines.
Also, I'm not familiar with Amazon EC2, but there may be a chance that the machines running the two instances have different numbers of CPUs installed (or made available for use by SQL Server). This is also taken into account by the optimizer.

Parameter Sniffing?
An SP will use the query plan that was deemed most appropriate based on the parameters passed to it when it was executed (and so compiled) for the first time.
Restoring a database wipes the plan cache; if the SP on the copy of the database was run with parameters that favored an index seek, then that's what will subsequently be used.
You can check this by sp_recompile'ing both and running them again with identical parameters.

This was our mistake.
After much digging, we found that one of our devs had added a couple of additional indexes to the production db after the transfer. This was a case where the additional indexes actually caused the query optimizer to pick a less efficient route in the production environment.
Removing those additional indexes appeared to have addressed the performance issue for the particular query, and both explain plans are now the same.

sql server replication algorithm

Anyone know how the underlying replication model in sql server works? Do they essentially depend on UTC datetime values to determine if something is new, or do they keep a table of all the changes (like a table of tableID+rowid that have changed)?
I am building my own "replication" system and was planning on using the dates to know what to replicate. Then I started wondering what would happen if the date got off in the computer for some reason. The obvious choice is to keep a log of the changes as you go and, once you replicate those changes, remove them from the log of changes. But that's a lot of extra work, instead of just checking dates.
I figure if sql server replication works by just checking the dates, then that should be good enough for me.
Any wisdom here?
thanks

As a transaction occurs in SQL Server, it is written to the transaction log along with information pertinent to the transaction.
SQL Server replication uses this transaction log to determine which transactions have not yet been processed and to move them to the subscriber. There is a lot more going on under the hood to keep track of the intersection between transactions, publications, subscriptions, etc. but I will leave that to MSDN documentation about SQL Server replication http://msdn.microsoft.com/en-us/library/ms151198.aspx
Moving on to your point about building your own replication system:
Do not build your own replication system. There are too many complications involved, and they will cost you many days of work. You will be much better off using the items that are shipped with SQL Server.
SQL Server replication methods are pretty impressive out of the box.
If you outline what causes you to think in terms of building your own replication system, we can help you figure out how to use existing items to provision what you need.
Also, read up as much as you can here to get an idea of what it can do for you http://msdn.microsoft.com/en-us/library/ms151198.aspx

SQL Server has a LogReader job that is aptly named. Replication reads the transaction log and applies appropriate transactions to the subscribing databases.
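As a rough illustration of why a change log is more robust than comparing dates, here is a toy model in Python. It is not how SQL Server implements replication internally; the Publisher and Subscriber classes and their methods are hypothetical. The point is only that the subscriber tracks the last sequence number it applied, so a wrong system clock cannot cause changes to be skipped or replayed.

```python
class Publisher:
    """Keeps the data plus an append-only log of every change, in order."""

    def __init__(self):
        self.rows = {}
        self.log = []  # entries are (sequence_number, operation, key, value)

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((len(self.log) + 1, "upsert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.log.append((len(self.log) + 1, "delete", key, None))


class Subscriber:
    """Applies only the log entries it has not seen yet."""

    def __init__(self):
        self.rows = {}
        self.last_applied = 0  # sequence number of the last entry applied

    def sync(self, publisher):
        # Sequence numbers start at 1, so slicing at last_applied yields the unseen entries.
        pending = publisher.log[self.last_applied:]
        for sequence_number, operation, key, value in pending:
            if operation == "upsert":
                self.rows[key] = value
            else:
                self.rows.pop(key, None)
            self.last_applied = sequence_number
        return len(pending)


pub = Publisher()
sub = Subscriber()
pub.write("a", 1)
pub.write("b", 2)
first_batch = sub.sync(pub)   # applies both writes
pub.delete("a")
second_batch = sub.sync(pub)  # applies only the delete
```

Note that deletes are replicated naturally here, which a "rows changed since date X" query cannot do unless deleted rows are kept around somewhere.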

For one thing SQLServer (and it's not the only one) supports multiple replication algorithms.
You can find here details about the ones implemented in SQLServer 2008. Read first the X Replication Overview then follow the How X Replication works for more details.

SQL Server vs. Access insert performance, in particular when using GUID

I'm interested to know how I could improve the performance of SQL Server with sequential GUIDs when Access 2007 is the front end to SQL Server 2008 (please note it's the only context I'm interested in).
I have made some tests and gotten some fairly surprising results, in particular from SQL Server when using sequential GUIDs: the insert performance degrades very quickly, and it doesn't seem right to me that it degrades so quickly.
Basically the test is as follows:
From the Access front-end, using VBA only, insert 100,000 records in batches of 1000, sequentially.
I tried it with both an Identity and a sequential GUID as the PK.
I tried it with both SQL Server 2008 Standard (no special tweaking, just a default install) and an Access 2007 database as the back-end. All tables were linked back to the front-end.
Some of the results (more, with raw data available on my blog entry about the test):
It's clear that, as the database grows, the insert performance is reduced but SQL Server isn't performing very well at all here.
http://blog.nkadesign.com/wp-content/uploads/2009/04/chart02.png
Expanded view of the results for SQL Server:
http://blog.nkadesign.com/wp-content/uploads/2009/04/chart03.png
Edit 13APR2009
I've found an issue with my server configuration and I updated the tests on my blog.
Thanks to all for your replies, they helped me a lot.

There are two things at play here. First, it's important to point out that SQL Server doesn't necessarily work very well, for a specific use case, out of the box. It is a professional product designed to be tuned by a person who knows what they're doing.
By comparison, Access is designed to work very well for most use cases without any configuration. The downside of this trade-off is covered in the second point:
SQL Server is designed for scalability. Notice how Access severely degrades with only 100,000 records. It would probably drop very steeply below SQL's line before a million. By comparison, SQL server holds almost perfectly steady, with the variation stabilizing after about 45,000 records and will continue to hold at many millions.
Edit: I think there may also be something else at play here that we're not seeing. I thought your SQL numbers looked bad, so I ran a test of my own. On my 3.6 GHz desktop running Windows Vista with 2 GB of RAM, inserts with sequential GUID on SQL Server performed:
Average of 1382 inserts per second at 0 records
Average of 1426 inserts per second at 500k records
Averaging 1609.6 inserts per second from 0 to 500k with an average floor of 992 inserts/sec and an average ceiling of 1989 inserts/sec.
So accounting for the normal variance incurred by running this on an in-use desktop, I'd say SQL Server inserts basically scale linearly from 0 records to half a million. On a dedicated, tuned server I'd expect even more consistency (not to mention far better performance):
Excel chart, inserts per second http://img24.imageshack.us/img24/9485/insertspersecond.jpg

My question is whether your test setup represents the reality of your application or not. In short, are you testing the right thing?
Is your app going to be appending large numbers of records one at a time?
Or is it going to be appending batches of records based on a SQL SELECT?
If the latter, you might look at trying to do it all server-side, particularly if the source table(s) in the SELECT are on the server. It's important to realize that with ODBC, a batch append is going to be sent to the SQL Server as a single insert for every single row (very similar to the recordset-based approach in your test code). If you move the same process entirely server-side, it can be done as a batch operation.
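The difference between one insert per row and a single server-side batch can be sketched like this. It is Python with the built-in sqlite3 module standing in for the Access/ODBC/SQL Server setup, so there are no real network round trips and no timings are claimed; the table names and the two copy functions are hypothetical. What it shows is only the number of statements issued for the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE target_rowwise (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE target_setbased (id INTEGER PRIMARY KEY, name TEXT);
""")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(i, f"row {i}") for i in range(1, 1001)])


def copy_row_by_row(conn):
    """Client-driven append: fetch every row, then send one INSERT per row."""
    statements = 0
    for row in conn.execute("SELECT id, name FROM staging").fetchall():
        conn.execute("INSERT INTO target_rowwise VALUES (?, ?)", row)
        statements += 1
    return statements


def copy_set_based(conn):
    """Server-side append: one INSERT ... SELECT handles all rows."""
    conn.execute("INSERT INTO target_setbased SELECT id, name FROM staging")
    return 1


rowwise_statements = copy_row_by_row(conn)
setbased_statements = copy_set_based(conn)
rowwise_count = conn.execute("SELECT COUNT(*) FROM target_rowwise").fetchone()[0]
setbased_count = conn.execute("SELECT COUNT(*) FROM target_setbased").fetchone()[0]
```

Both targets end up with the same rows; over a real connection each of the row-wise statements would be a separate round trip, which is the cost the answer above is pointing at.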
Also, you should test again using ADO instead of DAO. It may optimize the operation completely differently.
Last of all, someone brought to my attention just this past week this fascinating article by Andy Baron:
Optimizing Microsoft Office Access Applications Linked to SQL Server
I'm still absorbing the contents of that very useful article, and it discusses several issues in regard to non-GUID-specific topics that may help you optimize your process for maximum efficiency.

You realize at least part of the decreasing performance is the log filling up, and that a GUID is what, 12 bytes longer than an int (16 bytes versus 4)?
But I'm not quibbling; it's good to see someone taking actual metrics rather than just handwaving. Modded up.

Where are you getting the data from?
Does it change the numbers if you use the Access Export menu options rather than record-at-a-time-in-a-loop?
VBA is really sensitive to the connection parameters too, and there are lots of options that aren't necessarily intuitive.
If an identity column is acceptable, why are you even considering a sequential GUID (which is something of a tacked-on facility in MSSQL, last I checked)?
EDIT:
Looking at your code and briefly reviewing the Recordset docs on MSDN, I see you may be able to use more efficient parameters. For example, you use dbSeeChanges and dbOpenDynaset, which are appropriate if you are trying to allow for other users messing with the same rows (or need to get back the inserted IDENTITY value, or probably GUID), but I don't think you need those. In essence, after every INSERT or UPDATE, you're reading the record back from the database into VBA. I'd read through those connection config settings carefully, and I bet you'll come up with something a lot more satisfactory.

The last time I saw something like that (really slow insertion with GUID PK) was because of the log-file filling up. Insertion performance was dropping like a stone, pretty fast (no hard measurement, just looking at live traces, but it sure looked like it was kinda logarithmic). This was pre-loading of historical data.
Moved over to identity PK, took care of actually cleaning up the log file, and everything went much better afterwards (a couple of hours where the first version took several hours and was not finished).
Also, just a thought: are there any transactions involved? Maybe SQL Server transactions create a big performance hit that Access does not have (given that Access is not really geared towards concurrent access).

SQL Server Maintenance Suggestions?

I run an online photography community and it seems that the site slows to a crawl on database access, sometimes hitting timeouts.
I consider myself to be fairly competent at writing SQL queries and designing tables, but am by no means a DBA... hence the problem.
Some background:
1. My site and SQL server are running on a remote host. I update the ASP.NET code from Visual Studio and the SQL via SQL Server Mgmt. Studio Express. I do not have physical access to the server.
2. All my stored procs (I think I got them all) are wrapped in transactions.
3. The main table is only 9400 records at this time. I add 12 new records to this table nightly.
4. There is a view on this main table that brings together data from several other tables into a single view.
5. Secondary tables are smaller records, but more of them. 70,000 in one, 115,000 in another. These are comments and ratings records for the items in #3.
6. Indexes are on the most needed fields. And I set them to Auto Recompute Statistics on the big tables.
When the site grinds to a halt, if I run code to clear the transaction log, update statistics, rebuild the main view, as well as rebuild the stored procedure to get the comments, the speed returns. I have to do this manually however.
Sadly, my users get frustrated at these issues and their participation dwindles.
So my question is... in a remote environment, what is the best way to set up and schedule a maintenance plan to keep my SQL db running at its peak?

My gut says you are doing something wrong. It sounds a bit like those stories you hear where some system cannot stay up unless you reboot the server nightly :-)
Something is wrong with your queries, the number of rows you have is almost always irrelevant to performance and your database is very small anyway. I'm not too familiar with SQL server, but I imagine it has some pretty sweet query analysis tools. I also imagine it has a way of logging slow queries.
It really sounds like you have a missing index. Sure, you might think you've added the right indexes, but until you verify they are being used, it doesn't matter. Maybe you think you have the right ones, but your queries suggest otherwise.
First, figure out how to log your queries. Odds are very good you've got a killer in there doing some sequential scan that an index would fix.
Second, you might have a bunch of small queries that are killing it instead. For example, you might have some "User" object that hits the database every time you look up a username from a user_id. Look for spots where you are querying the database a hundred times and replace them with a cache, even if that "cache" is nothing more than a private variable that gets wiped at the end of a request.
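A per-request cache of the kind described can be as small as the sketch below. It is Python rather than ASP.NET, and UserLookup and fake_db_fetch are hypothetical names; fake_db_fetch stands in for whatever query actually fetches the username, and just records each call so the number of database hits is visible.

```python
class UserLookup:
    """Remembers usernames for the lifetime of one request."""

    def __init__(self, fetch_username):
        self._fetch_username = fetch_username  # function that really hits the database
        self._cache = {}                       # thrown away with the object at end of request

    def username(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._fetch_username(user_id)
        return self._cache[user_id]


queries = []  # every simulated database hit is recorded here


def fake_db_fetch(user_id):
    queries.append(user_id)
    return f"user{user_id}"


# One "request" that renders five comments written by two distinct users.
lookup = UserLookup(fake_db_fetch)
names = [lookup.username(user_id) for user_id in [7, 7, 9, 7, 9]]
```

Five lookups turn into two database hits. Creating a fresh UserLookup per request keeps the cache from going stale across requests, at the price of not sharing results between them.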
Bottom line is, I really doubt it is something mis-configured in SQL Server. I mean, if you had to reboot your server every night because the system ground to a halt, would you blame the system or your code? Same deal here... learn the tools provided by SQL Server, I bet they are pretty slick :-)
That all said, once you accept you are doing something wrong, enjoy the process. Nothing, to me, is more fun than optimizing slow database queries. It is simply amazing that you can take a query with a 10 second runtime and turn it into one with a 50ms runtime with a single, well-placed index.

You do not need to set up your maintenance tasks as a maintenance plan.
Simply create a stored procedure that carries out the maintenance tasks you wish to perform: index rebuilds, statistics updates, etc.
Then create a job that calls your stored procedure/s. The job can be configured to run on your desired schedule.
To create a job, use the procedure sp_add_job.
To create a schedule use the procedure sp_add_schedule.
I hope what I have detailed is clear and understandable but feel free to drop me a line if you need further assistance.
Cheers, John
