SQL Server transactional replication for very large tables - sql-server

I have set up transactional replication between two SQL Servers on different ends of a relatively slow VPN connection. The setup is your standard "load snapshot immediately" kind of thing where the first thing it does after initializing the subscription is to drop and recreate all tables on the subscriber side and then start doing a BCP of all the data. The problem is that there are a few tables with several million rows in them, and the process either a) takes a REALLY long time or b) just flat out fails. The messages I keep getting when I look in Replication Monitor are:
The process is running and is waiting for a response from the server.
Query timeout expired
Initializing
It then tries to restart the bulk loading process (skipping any BCP files that it has already loaded).
I am currently stuck where it just keeps doing this over and over again. It's been running for a couple days now.
My questions are:
Is there something I could do to improve this situation given that the network connection is so slow? Maybe some setting or something? I don't mind waiting a long time as long as the process doesn't keep timing out.
Is there a better way to do this? Perhaps make a backup, zip it, copy it over and then restore? If so, how would the replication process know where to pick up when it starts applying the transactions, since updates will be occurring between the time I make the backup and get it restored and running on the other side.

Yes.
You can apply the initial snapshot manually.
It's been a while for me, but the link (into BOL) has alternatives to setting up the subscriber.
Edit: From BOL How-tos, Initialize a Transactional Subscriber from a Backup

In SQL 2005, you have a "compact snapshot" option, that allow you to reduce the total size of the snapshot. When applied over a network, snapshot items "travel" compacted to the suscriber, where they are then expanded.
I think you can easily figure the potential speed gain by comparing sizes of standard and compacted snapshots.
By the way, there is a (quite) similar question here for merge replication, but I think that at the snapshot level there is no difference.

Related

Using database snapshot vs snapshot issolation level transaction

I maintain an an MVC application which incorporates some long running batch processes for sending newsletters, generating reports etc.
I had previously encountered a lot of issues with deadlocks, where one of these long running queries might be holding a lock on a row which then needs to be updated by another process.
The solution I originally came up with, was to have a scheduled task, which creates database snapshots, like so...
CREATE DATABASE MyDatabase_snapshot_[yyyyMMddHHmmss]... AS SNAPSHOT OF MyDatabase
My application then has some logic which will find the latest available snapshot, and use this for the readonly connection for the long-running processes, or anywhere else where a read-only connection was required.
The current setup is perfectly functional, and reliable. However being dependent on that scheduled task doesn't make me happy. I can imagine, at some stage in the future, if someone else is looking after this project, this could be an easy source of confusing issues. If the database was moved to another server, for example, and the snapshot creation scheduled task wasn't setup correctly.
I've since realised I could achieve a similar result by using snapshot transaction issolation, and avoid all the extra complexity of managing the creation and cleanup of the database snapshots.
However I'm now wondering whether there may be any performance drawbacks for doing this using transactions vs continuing to use the static snapshots.
Consider the following scenario.
The system periodically sends personalised job lists to approximately 20K subscribers. For each of these subscribers it does database lookups to create the matching jobs list.
What is has been doing, is looping through the full recipient list, and for each one...
Open a connection to the snapshot db
Run the query to find matching jobs
Close the snapshot db connection
If instead, it does the following...
Open the database connection to the normal database
(non-snapshot)
Create a snapshot issolated transaction
Run the query to find matching jobs
Close the transaction
Close the database connection
Does this actually translate to more work for the database server?
Specifically I'm wondering about what's involved at step #2.
Removing complexity from the application is a good thing, but not at the expense of performance. Particularly since this particular process is already quite server intensive, and takes quite a long time to run.

sql server replication algorithm

Anyone know how the underlying replication model in sql server works? Do they essentially depend on UTC datetime values to determine if something is new or do they keep a table of all the changes (like a table of tableID+rowid that have changed).
I am building my own "replication" system and was planning on using the dates to know what to replicate. Then I started wondering what would happen if the date got off in the computer for some reason. The obvious choice is to keep a log of the changes as you go and once you replicate those changes, you remove from the log of changes. But thats a lot of extra work, instead of just checking dates.
I figure if sql server replication works by just checking the dates, then that should be good enough for me.
Any wisdom here?
thanks
As a transaction occurs in SQL Server, it is written to the transaction log along with information pertinent to the transaction.
SQL Server replication uses this transaction log to determine which transactions have not yet been processed and to move them to the subscriber. There is a lot more going on under the hood to keep track of the intersection between transactions, publications, subscriptions, etc. but I will leave that to MSDN documentation about SQL Server replication http://msdn.microsoft.com/en-us/library/ms151198.aspx
Moving on to your point about building your own replication system:
Do not build your own replication system. There are too many complications involved that will cause you to spend many many days working. You will be much better off using the items that are shipped with SQL Server.
SQL Server replication methods are pretty impressive out of the box.
If you outline what causes you to think in terms of building your own replication system, we can help you figure out how to use existing items to provision what you need.
Also, read up as much as you can here to get an idea of what it can do for you http://msdn.microsoft.com/en-us/library/ms151198.aspx
SQL Server has a LogReader job that is aptly named. Replication reads the transaction log and applies appropriate transactions to the subscribing databases.
For one thing SQLServer (and it's not the only one) supports multiple replication algorithms.
You can find here details about the ones implemented in SQLServer 2008. Read first the X Replication Overview then follow the How X Replication works for more details.

Is it possible to have secondary server available read-only in a log shipping scenario?

I am looking into using log shipping in a SQL Server 2005 environment. The idea was to set up frequent log shipping to a secondary server. The intent: Use the secondary server to serve report queries, thereby offloading the primary db server.
I came across this on a sqlservercentral forum thread:
When you create the log shipping you have 2 choices. You can configure restore log operation to be done with norecovery or with standby option. If you use the norecovery option, you can not issue select statements on it. If instead of norecovery you use the standby option, you can run select queries on the database.
Bear in mind with the standby option when log file restores occur users will be kicked out without warning by the restore process. Acutely when you configure the log shipping with standby option, you can also select between 2 choices – kill all processes in the secondary database and perform log restore or don’t perform log restore if the database is being used. Of course if you select the second option, the restore operation might never run if someone opens a connection to the database and doesn’t close it, so it is better to use the first option.
So my questions are:
Is the above true? Can you really not use log shipping in the way I intend?
If it is true, could someone explain why you cannot execute SELECT statements to a database while the transaction log is being restored?
EDIT:
First question is duplicate of this serverfault question. But I still would like the second question answered: Why is it not possible to execute SELECT statements while the transaction log is being restored?
could someone explain why you cannot
execute SELECT statements to a
database while the transaction log is
being restored?
Short answer is that RESTORE statement takes an exclusive lock on the database being restored.
For writes, I hope there is no need for me to explain why they are incompatible with a restore. Why does it not allow reads either? First of all, there is no way to know if a session that has a lock on a database is going to do a read or a write. But even if it would be possible, restore (log or backup) is an operation that updates directly the data pages in the database. Since these updates go straight to the physical location (the page) and do not follow the logical hierarchy (metadata-partition-page-row), they would not honor possible intent locks from other data readers, and thus have the possibility to change structures as they are read. A SELECT table scan following the page next-prev pointers would be thrown into disarray, resulting in a corrupted read.
Well yes and no.
You can do exactly what you wish to do, in that you may offload reporting workloads to a secondary server by configuring Log Shipping to a read only copy of a database. I have set this type of architecture up on a number of occasions previously and it works very well indeed.
The caveat is that in order to perform a restore of a Transaction Log Backup file there must be no other connections to the database in question. Hence the two choices being, when the restore process runs it will either fail, thereby prioritising user connections, or it will succeed by disconnecting all user connection in order to perform the restore.
Dependent on your restore frequency this is not necessarily a problem. You simply educate your users to the fact that, say every hour at 10 past the hour, there is a possibility that your report may fail. If this happens simply re-run the report.
EDIT: You may also want to evaluate alternative architeciture solutions to your business need. For example, Transactional Replication or Database Mirroring with a Database Snapshot
If you have enterprise version, you can use database mirroring + snapshot to create read-only copy of the database, available for reporting, etc. Mirroring uses "continuous" log shipping "under the hood". It is frequently used in scenario you have described.
Yes it's true.
I think the following happens:
While the transaction log is being restored, the database is locked, as large portions of it are being updated.
This is for performance reasons more then anything else.
I can see two options:
Use database mirroring.
Schedule the log shipping to only occur when the reporting system is not in use.
Slight confusion in that, the norecovery flag on the restore means your database is not going to be brought out of a recovery state and into an online state - that is why the select statements will not work - the database is offline. The no-recovery flag is there to allow you to restore multiple log files in a row (in a DR type scenario) without bringing the database back online.
If you did not want to log ship / have the disadvantages you could swap to a one way transactional replication, but the overhead / set-up will be more complex overall.
Would peer-to-peer replication work. Then you can run queries on one instance and so save the load on the original instance.

How do I ensure SQL Server replication is running?

I have two SQL Server 2005 instances that are geographically separated. Important databases are replicated from the primary location to the secondary using transactional replication.
I'm looking for a way that I can monitor this replication and be alerted immediately if it fails.
We've had occasions in the past where the network connection between the two instances has gone down for a period of time. Because replication couldn't occur and we didn't know, the transaction log blew out and filled the disk causing an outage on the primary database as well.
My google searching some time ago led to us monitoring the MSrepl_errors table and alerting when there were any entries but this simply doesn't work. The last time replication failed (last night hence the question), errors only hit that table when it was restarted.
Does anyone else monitor replication and how do you do it?
Just a little bit of extra information:
It seems that last night the problem was that the Log Reader Agent died and didn't start up again. I believe this agent is responsible for reading the transaction log and putting records in the distribution database so they can be replicated on the secondary site.
As this agent runs inside SQL Server, we can't simply make sure a process is running in Windows.
We have emails sent to us for Merge Replication failures. I have not used Transactional Replication but I imagine you can set up similar alerts.
The easiest way is to set it up through Replication Monitor.
Go to Replication Monitor and select a particular publication. Then select the Warnings and Agents tab and then configure the particular alert you want to use. In our case it is Replication: Agent Failure.
For this alert, we have the Response set up to Execute a Job that sends an email. The job can also do some work to include details of what failed, etc.
This works well enough for alerting us to the problem so that we can fix it right away.
You could run a regular check that data changes are taking place, though this could be complex depending on your application.
If you have some form of audit train table that is very regularly updated (i.e. our main product has a base audit table that lists all actions that result in data being updated or deleted) then you could query that table on both servers and make sure the result you get back is the same. Something like:
SELECT CHECKSUM_AGG(*)
FROM audit_base
WHERE action_timestamp BETWEEN <time1> AND BETWEEN <time2>
where and are round values to allow for different delays in contacting the databases. For instance, if you are checking at ten past the hour you might check items from the start the last hour to the start of this hour. You now have two small values that you can transmit somewhere and compare. If they are different then something has most likely gone wrong in the replication process - have what-ever pocess does the check/comparison send you a mail and an SMS so you know to check and fix any problem that needs attention.
By using SELECT CHECKSUM_AGG(*) the amount of data for each table is very very small so the bandwidth use of the checks will be insignificant. You just need to make sure your checks are not too expensive in the load that apply to the servers, and that you don't check data that might be part of open replication transactions so might be expected to be different at that moment (hence checking the audit trail a few minutes back in time instead of now in my example) otherwise you'll get too many false alarms.
Depending on your database structure the above might be impractical. For tables that are not insert-only (no updates or deletes) within the timeframe of your check (like an audit-trail as above), working out what can safely be compared while avoiding false alarms is likely to be both complex and expensive if not actually impossible to do reliably.
You could manufacture a rolling insert-only table if you do not already have one, by having a small table (containing just an indexed timestamp column) to which you add one row regularly - this data serves no purpose other than to exist so you can check updates to the table are getting replicated. You can delete data older than your checking window, so the table shouldn't grow large. Only testing one table does not prove that all the other tables are replicating (or any other tables for that matter), but finding an error in this one table would be a good "canery" check (if this table isn't updating in the replica, then the others probably aren't either).
This sort of check has the advantage of being independent of the replication process - you are not waiting for the replication process to record exceptions in logs, you are instead proactively testing some of the actual data.

SQL Server Merge Replication Schedule

We're replicating a database between London and Hong Kong using SQL Server 2005 Merge replication. The replication is set to synchronise every one minute and it works just fine. There is however the option to set the synchronisation to be "Continuous". Is there any real difference between replication every one minute and continuously?
The only reason for us doing every one minute rather than continuous in the first place was that it recovered better if the line went down for a few minutes, but this experience was all from SQL Server 2000 so it might not be applicable any more...
We have been trying the continuous replication solution on SQL SERVER 2005 and it appeared to be less efficient than a scheduled solution: as your process is continuous, you will not get all the info related to your passed replications (how many replications failed, how long did the process take, why was the process stopped, how many records were updated, how many database structure modifications were replicated to suscribers, and so on), making the replication follow-up a lot more difficult.
We have also been experiencing troubles while modifying database structure (ALTER TABLE instructions) and/or making bulk updates on one of the databases with continuous replication going on.
Keep you "every minute" synchro as it is and just forget about this "continuous" option.

Resources