Implementing a failover heartbeat mechanism in SQL Server

This may sound like a strange, long-winded question... I'm implementing a failover mechanism.
It is written as a C# .NET 4.5 console application using SQL Server.
My requirements, in order of priority:
1) Only one instance of the program can be running at a time (NEVER two).
2) If one instance fails/hangs, start the other instance (on a separate VM) as soon as possible, but ensure the original one stops (to meet requirement #1 above).
What I did:
Before updating the database during normal operations, the program will:
BEGIN TRANSACTION
Check to ensure the database's MasterName table shows it is still master,
Modify Database
COMMIT TRANSACTION
The procedure above implements requirement #1 by ensuring that the program is still the Master before modifying the database.
To implement requirement #2, the active program issues a heartbeat to the db and the program which is not the master (the slave) checks these timestamps.
If the Slave notices the timestamp has not been updated in 2X seconds, it will write its own name in the database MasterName table and take over.
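For illustration, the heartbeat and takeover described above could be sketched as follows. This is only a sketch under assumptions that are not in the post: Python's standard sqlite3 module stands in for C# and SQL Server, and the table, column, and function names (master_name, last_beat, beat, try_take_over) are hypothetical. The takeover is written as one conditional UPDATE, so the staleness check and the claim happen in the same statement.

```python
import sqlite3

def beat(conn, me, now):
    """Refresh the heartbeat; returns False if `me` is no longer the master."""
    cur = conn.execute(
        "UPDATE master_name SET last_beat = ? WHERE name = ?", (now, me)
    )
    conn.commit()
    return cur.rowcount == 1

def try_take_over(conn, me, now, timeout):
    """Claim mastership only if the stored heartbeat is older than `timeout`."""
    cur = conn.execute(
        "UPDATE master_name SET name = ?, last_beat = ? WHERE last_beat < ?",
        (me, now, now - timeout),
    )
    conn.commit()
    return cur.rowcount == 1

# In-memory stand-in database with ServerA as the current master (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master_name (name TEXT, last_beat REAL)")
conn.execute("INSERT INTO master_name VALUES ('ServerA', 100.0)")
conn.commit()

too_early = try_take_over(conn, "ServerB", now=105.0, timeout=10.0)  # heartbeat is 5 s old
took_over = try_take_over(conn, "ServerB", now=120.0, timeout=10.0)  # heartbeat is 20 s old
old_master_ok = beat(conn, "ServerA", now=121.0)  # ServerA's next heartbeat after the takeover
```

A False result from beat is the old master's cue to stop. Note that this only covers the heartbeat itself, not the window between checking and modifying that the question goes on to describe.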
Here's the rub. In the above flow, if the Master (ServerA) hangs for a minute (say Windows is indexing or something) after it has checked that it is still Master, and in that time the Slave (ServerB) takes over, ServerA won't know that ServerB took over, because it is in the middle of its transaction, which will then complete.
I could "Lock some table".
Ok, say I lock:
LOCK MasterNameTable
BEGIN TRANSACTION
Check to ensure the database's MasterName table shows it is still master,
Modify Database
COMMIT TRANSACTION
UNLOCK MasterNameTable
Now, if ServerA crashes, or hangs for a really long time, ServerB can't take over since it is locked out.
This feels like a catch-22. I have to believe this has been implemented millions of times in the past and there probably is some standard out there. I'd appreciate any pointers.
EDIT 1: Providing more details:
This is a .NET 4.5 Console Application. I have two instances of it running on two separate VMs so that if one fails, the other can "take over".
This is an algorithmic stock trading application so it would be bad to have two running at the same time (duplicate trading) and it would be bad if there were a timespan when it was not running (missing opportunities and missing ways to "get out")... All failover mechanisms I have researched to-date involve a timespan when both "could" be running at the same time (however short) so this is the only solution I have found to-date that prevents this.
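One commonly described way to narrow the check-then-write gap is a fencing token: every takeover increments a counter, and each write is accepted only if the writer still holds the current counter value. The sketch below is an illustration of that idea, not something from the post. It uses Python's sqlite3 as a stand-in for SQL Server, and the names (epoch, orders, guarded_insert) are hypothetical. In SQLite a single statement is atomic with respect to other writers; whether the equivalent single statement is safe in SQL Server depends on the isolation level and lock hints in use, which would need to be verified separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master_name (name TEXT, epoch INTEGER);
    CREATE TABLE orders (symbol TEXT, qty INTEGER, placed_by TEXT);
    INSERT INTO master_name VALUES ('ServerA', 1);
""")

def guarded_insert(conn, me, my_epoch, symbol, qty):
    """Insert the order only if `me` is still master with the epoch it remembers."""
    cur = conn.execute(
        "INSERT INTO orders (symbol, qty, placed_by) "
        "SELECT ?, ?, ? WHERE EXISTS "
        "(SELECT 1 FROM master_name WHERE name = ? AND epoch = ?)",
        (symbol, qty, me, me, my_epoch),
    )
    conn.commit()
    return cur.rowcount == 1

# ServerA writes while it is still master (epoch 1).
first_ok = guarded_insert(conn, "ServerA", 1, "XYZ", 100)

# ServerB takes over while ServerA is hung; the epoch moves to 2.
conn.execute("UPDATE master_name SET name = 'ServerB', epoch = epoch + 1")
conn.commit()

# ServerA wakes up and retries with its stale epoch; the write is rejected.
late_ok = guarded_insert(conn, "ServerA", 1, "XYZ", 100)
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The check and the modification are one statement here, so there is no point at which the old master has "already checked" but not yet written.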

Related

SQL Process without Task State blocks other processes

I'm having an issue with processes that lock my SQL Server even though they appear to be finished.
The blocking processes are four simple SELECT GETDATE() commands that just don't finish, for some reason unknown to me. The SQL Server Profiler doesn't really show any activity except for repeating the SELECT GETDATE() every four minutes. - It is possible that the same connection sent an UPDLOCK request before that.
I mostly want to find an explanation for that behaviour. I can't really influence those blocking requests. - As you can see, they are sent by an external company.
The suspended process is also called from within the Business Central Server, but is circumventing the standard interpretation layer to optimize performance. - To do this, it's calling the SQL .Net Class to execute the SQL query directly.
If I kill the process, the server throws an error and the whole execution falls apart.
P.S. For the people here who work with BC and think this is not a good idea:
This code won't run in day-to-day business. It's just for migrating data during upgrades from NAV to BC. We built a tool that allows us to map fields between C/AL and AL solutions and generate AL extensions based on those mappings. These extensions grab the data from a copy of the original DB and write them directly into their destination files. We need the SQL commands because some of our customers have accumulated so much data over the years that an upgrade would otherwise take more than a week if the data were processed in AL.

Using database snapshot vs snapshot isolation level transaction

I maintain an MVC application which incorporates some long-running batch processes for sending newsletters, generating reports, etc.
I had previously encountered a lot of issues with deadlocks, where one of these long running queries might be holding a lock on a row which then needs to be updated by another process.
The solution I originally came up with, was to have a scheduled task, which creates database snapshots, like so...
CREATE DATABASE MyDatabase_snapshot_[yyyyMMddHHmmss]... AS SNAPSHOT OF MyDatabase
My application then has some logic which will find the latest available snapshot, and use this for the readonly connection for the long-running processes, or anywhere else where a read-only connection was required.
The current setup is perfectly functional, and reliable. However being dependent on that scheduled task doesn't make me happy. I can imagine, at some stage in the future, if someone else is looking after this project, this could be an easy source of confusing issues. If the database was moved to another server, for example, and the snapshot creation scheduled task wasn't setup correctly.
I've since realised I could achieve a similar result by using snapshot transaction isolation, and avoid all the extra complexity of managing the creation and cleanup of the database snapshots.
However I'm now wondering whether there may be any performance drawbacks for doing this using transactions vs continuing to use the static snapshots.
Consider the following scenario.
The system periodically sends personalised job lists to approximately 20K subscribers. For each of these subscribers it does database lookups to create the matching jobs list.
What it has been doing is looping through the full recipient list, and for each one...
1) Open a connection to the snapshot db
2) Run the query to find matching jobs
3) Close the snapshot db connection
If instead, it does the following...
1) Open the database connection to the normal (non-snapshot) database
2) Create a snapshot isolated transaction
3) Run the query to find matching jobs
4) Close the transaction
5) Close the database connection
Does this actually translate to more work for the database server?
Specifically I'm wondering about what's involved at step #2.
Removing complexity from the application is a good thing, but not at the expense of performance. Particularly since this particular process is already quite server intensive, and takes quite a long time to run.
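As background for what a snapshot isolated transaction involves: SQL Server implements snapshot isolation with row versioning, where writers keep older row versions around and a reader picks the version that was committed when its transaction started. The toy model below only illustrates that general idea in plain Python. It is not SQL Server's implementation, and the class and method names are hypothetical.

```python
class VersionedStore:
    """Toy row-versioning store: each key keeps every committed version."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical commit counter

    def write(self, key, value):
        # A writer never overwrites in place; it appends a newer version.
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))

    def begin_snapshot(self):
        # A snapshot reader only remembers "now"; nothing is copied.
        return self._clock

    def read(self, key, snapshot_ts):
        # Walk back from the newest version to the one visible to the snapshot.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = VersionedStore()
store.write("job:1", "open")
snapshot = store.begin_snapshot()   # long-running reader starts here
store.write("job:1", "closed")      # concurrent update, not blocked by the reader

old_view = store.read("job:1", snapshot)                # what the long-running reader sees
new_view = store.read("job:1", store.begin_snapshot())  # what a fresh reader sees
```

In this model starting the snapshot is cheap; the cost shows up as extra stored versions on the write side and as version lookups on reads. How that trades off against static database snapshots on a real server would need measuring.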

How does SQL Server insert data in parallel between applications?

I have two applications.
One inserts data into the database continuously, as if in an infinite loop.
What will happen when the second application inserts data into the same database and table?
Does it wait until the other application finishes inserting? If so, what handles this?
Or will it say it is busy?
Or will the code throw an exception?
SQL Server accepts many connections at the same time (client libraries typically keep a connection pool as well), which means that more than one connection to the database can be open at any particular time, and that's where the easy bit ends.
If, for example, you were to connect to the database from two applications at the same time and insert data into different tables from each application, then the two could happily happen at the same time without issue.
If, however, those applications wanted to do something like edit the same row, then there's an issue with "locking" ...
Essentially, any operation on a SQL database requires "acquiring a lock" on a row, a page, or a table. Which one depends on the operation and the configuration of the server, so it's hard to say what might happen in your case.
So the simple answer is:
Yes, SQL can make stuff happen (like inserts) at the same time, but with some caveats.
And the long answer ...
requires in-depth knowledge of locking and of your database and server configuration.

How do I ensure SQL Server replication is running?

I have two SQL Server 2005 instances that are geographically separated. Important databases are replicated from the primary location to the secondary using transactional replication.
I'm looking for a way that I can monitor this replication and be alerted immediately if it fails.
We've had occasions in the past where the network connection between the two instances has gone down for a period of time. Because replication couldn't occur and we didn't know, the transaction log blew out and filled the disk causing an outage on the primary database as well.
My Google searching some time ago led to us monitoring the MSrepl_errors table and alerting when there were any entries, but this simply doesn't work. The last time replication failed (last night, hence the question), errors only hit that table when it was restarted.
Does anyone else monitor replication and how do you do it?
Just a little bit of extra information:
It seems that last night the problem was that the Log Reader Agent died and didn't start up again. I believe this agent is responsible for reading the transaction log and putting records in the distribution database so they can be replicated on the secondary site.
As this agent runs inside SQL Server, we can't simply make sure a process is running in Windows.
We have emails sent to us for Merge Replication failures. I have not used Transactional Replication but I imagine you can set up similar alerts.
The easiest way is to set it up through Replication Monitor.
Go to Replication Monitor and select a particular publication. Then select the Warnings and Agents tab and then configure the particular alert you want to use. In our case it is Replication: Agent Failure.
For this alert, we have the Response set up to Execute a Job that sends an email. The job can also do some work to include details of what failed, etc.
This works well enough for alerting us to the problem so that we can fix it right away.
You could run a regular check that data changes are taking place, though this could be complex depending on your application.
If you have some form of audit trail table that is very regularly updated (e.g. our main product has a base audit table that lists all actions that result in data being updated or deleted), then you could query that table on both servers and make sure the result you get back is the same. Something like:
SELECT CHECKSUM_AGG(CHECKSUM(*))
FROM audit_base
WHERE action_timestamp BETWEEN <time1> AND <time2>
where <time1> and <time2> are round values to allow for different delays in contacting the databases. For instance, if you are checking at ten past the hour you might check items from the start of the last hour to the start of this hour. You now have two small values that you can transmit somewhere and compare. If they are different then something has most likely gone wrong in the replication process - have whatever process does the check/comparison send you a mail and an SMS so you know to check and fix any problem that needs attention.
By using SELECT CHECKSUM_AGG(CHECKSUM(*)) the amount of data for each table is very, very small, so the bandwidth use of the checks will be insignificant. You just need to make sure your checks are not too expensive in the load they apply to the servers, and that you don't check data that might be part of open replication transactions and so might be expected to be different at that moment (hence checking the audit trail a few minutes back in time instead of now in my example), otherwise you'll get too many false alarms.
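The comparison step could look roughly like the sketch below. It is an illustration under assumptions: Python's sqlite3 with two in-memory databases stands in for the primary and the replica, the audit_base columns and sample rows are made up, and an XOR of per-row CRC32 values plays the role of CHECKSUM_AGG (order-independent, but a weak checksum). The window is half-open, start inclusive and end exclusive, so a row on an hour boundary is counted in one window only.

```python
import sqlite3
import zlib

def make_db(rows):
    """Build a stand-in database holding the given audit rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audit_base "
        "(id INTEGER PRIMARY KEY, action TEXT, action_timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO audit_base VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def window_checksum(conn, start, end):
    """Order-independent checksum of the rows with start <= timestamp < end."""
    rows = conn.execute(
        "SELECT id, action, action_timestamp FROM audit_base "
        "WHERE action_timestamp >= ? AND action_timestamp < ?",
        (start, end),
    )
    agg = 0
    for row in rows:
        agg ^= zlib.crc32(repr(row).encode("utf-8"))
    return agg

rows = [(1, "update", 3600), (2, "delete", 5000), (3, "update", 7100), (4, "update", 7300)]
primary = make_db(rows)
replica = make_db(rows[:2])  # rows 3 and 4 have not been replicated yet

# Compare the closed hour [3600, 7200); row 4 falls outside it and is ignored.
in_sync_before = window_checksum(primary, 3600, 7200) == window_checksum(replica, 3600, 7200)

# Row 3 arrives on the replica; the same window now matches.
replica.execute("INSERT INTO audit_base VALUES (3, 'update', 7100)")
replica.commit()
in_sync_after = window_checksum(primary, 3600, 7200) == window_checksum(replica, 3600, 7200)
```

Only the two integers need to travel between sites, which is what keeps the check cheap on a slow link.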
Depending on your database structure the above might be impractical. For tables that are not insert-only (no updates or deletes) within the timeframe of your check (like an audit-trail as above), working out what can safely be compared while avoiding false alarms is likely to be both complex and expensive if not actually impossible to do reliably.
You could manufacture a rolling insert-only table if you do not already have one, by having a small table (containing just an indexed timestamp column) to which you add one row regularly - this data serves no purpose other than to exist so you can check updates to the table are getting replicated. You can delete data older than your checking window, so the table shouldn't grow large. Only testing one table does not prove that all the other tables are replicating (or any other tables for that matter), but finding an error in this one table would be a good "canary" check (if this table isn't updating in the replica, then the others probably aren't either).
This sort of check has the advantage of being independent of the replication process - you are not waiting for the replication process to record exceptions in logs, you are instead proactively testing some of the actual data.
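The rolling insert-only "canary" table described above might be sketched like this. Again this is only an illustration: sqlite3 stands in for both servers, the replica is filled by hand rather than by replication, and the table name repl_canary, the one-hour retention, and the five-minute alert threshold are all assumptions.

```python
import sqlite3

SCHEMA = "CREATE TABLE repl_canary (beat_at INTEGER)"

def write_canary(primary, now, keep_seconds=3600):
    """Add one canary row on the primary and trim rows older than the window."""
    primary.execute("INSERT INTO repl_canary (beat_at) VALUES (?)", (now,))
    primary.execute("DELETE FROM repl_canary WHERE beat_at < ?", (now - keep_seconds,))
    primary.commit()

def replica_lag(replica, now):
    """Seconds since the newest canary row seen on the replica (None if empty)."""
    latest = replica.execute("SELECT MAX(beat_at) FROM repl_canary").fetchone()[0]
    return None if latest is None else now - latest

primary = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
primary.execute(SCHEMA)
replica.execute(SCHEMA)

# The last canary row that reached the replica was written at t=1000.
replica.execute("INSERT INTO repl_canary VALUES (1000)")
replica.commit()

# The primary keeps writing canaries, but replication has stalled.
for t in (1000, 1300, 1600, 1900):
    write_canary(primary, t)

lag = replica_lag(replica, now=1900)
should_alert = lag is None or lag > 300  # assumed threshold: 5 minutes
```

The check reads only the replica, so it keeps working even when the link to the primary is the thing that has failed.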

SQL Server transactional replication for very large tables

I have set up transactional replication between two SQL Servers on different ends of a relatively slow VPN connection. The setup is your standard "load snapshot immediately" kind of thing where the first thing it does after initializing the subscription is to drop and recreate all tables on the subscriber side and then start doing a BCP of all the data. The problem is that there are a few tables with several million rows in them, and the process either a) takes a REALLY long time or b) just flat out fails. The messages I keep getting when I look in Replication Monitor are:
The process is running and is waiting for a response from the server.
Query timeout expired
Initializing
It then tries to restart the bulk loading process (skipping any BCP files that it has already loaded).
I am currently stuck where it just keeps doing this over and over again. It's been running for a couple days now.
My questions are:
Is there something I could do to improve this situation given that the network connection is so slow? Maybe some setting or something? I don't mind waiting a long time as long as the process doesn't keep timing out.
Is there a better way to do this? Perhaps make a backup, zip it, copy it over and then restore? If so, how would the replication process know where to pick up when it starts applying the transactions, since updates will be occurring between the time I make the backup and get it restored and running on the other side?
Yes.
You can apply the initial snapshot manually.
It's been a while for me, but the link (into BOL) has alternatives to setting up the subscriber.
Edit: From BOL How-tos, Initialize a Transactional Subscriber from a Backup
In SQL 2005, you have a "compact snapshot" option that allows you to reduce the total size of the snapshot. When applied over a network, snapshot items "travel" compacted to the subscriber, where they are then expanded.
I think you can easily figure the potential speed gain by comparing sizes of standard and compacted snapshots.
By the way, there is a (quite) similar question here for merge replication, but I think that at the snapshot level there is no difference.
