Solr intermittent replication failure with no error

I have an issue with Solr replication.
I have one master and two slaves.
Every so often the replication fails on one of the slaves.
There is no error in the log file, and I have updated the logging settings to record ALL for replication.
The file replication.properties is not updated on the slave that is failing (but it is updated on the other slave), which suggests that the replication did not start. According to the UI, however, replication took place and "Next Run" is counting down to the next replication; at the same time, the replication worked for the other slave. Both slaves have a connection to the master.
The command "replication?command=details" displays different index versions between master and slave.
If I use the "Replicate now" button to force the replication, it works fine and the next occurrence is also fine, but after a few hours/days it starts to fail again on either of the slaves.
How can I investigate this issue further?
Thank you

Adding extra CPU and increasing RAM helped with this issue; since the upgrade, the replication has been working fine.
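For anyone who needs to catch this before resorting to a hardware upgrade, a minimal monitoring sketch is shown below, assuming the legacy master/slave ReplicationHandler mounted at /replication (host and core names are placeholders): it polls the index version on the master and each slave and flags a mismatch, which is exactly the silent symptom described above.

```python
import requests

MASTER = "http://master:8983/solr/mycore"   # placeholder URLs
SLAVES = ["http://slave1:8983/solr/mycore",
          "http://slave2:8983/solr/mycore"]

def index_version(base_url):
    # The replication handler reports the current index version as JSON.
    resp = requests.get(f"{base_url}/replication",
                        params={"command": "indexversion", "wt": "json"})
    resp.raise_for_status()
    return resp.json()["indexversion"]

master_version = index_version(MASTER)
for slave in SLAVES:
    if index_version(slave) != master_version:
        print(f"{slave} is out of sync (master version {master_version})")
```

Run from cron at, say, twice the polling interval; alerting only on repeated mismatches distinguishes a slave that silently skipped a cycle from one that is merely mid-replication.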

Related

Pausing Transactional Replication

Scenario:
I'm working with a customer that has a live database. On a separate server, they have a copy of this database and they have transactional replication setup, which runs constantly. I have an SSIS package that runs on the copy of the database for up to an hour to export data to a reporting database.
When I've tested the package with replication enabled, it occasionally fails as it reads from various tables at different points of the execution. The problem is that if some data is read at an early stage, which subsequently gets deleted/inserted, other related records that are read later on effectively become orphaned and cause lookup failures. Whilst I have various safeguards to combat this, it's difficult to cater for every case as not all records have dates that I can use to limit data.
Plan:
I have been looking at pausing the replication job, so that the package can run with static data and then re-enable it once the package has run. Once the replication is enabled again, all of the transactions from the live database that were generated during the package execution should then be applied to the copy.
Problem:
I've done some reading around the various Replication Agents used for transactional replication, but I'm not entirely sure what the minimum requirement is for pausing the replication.
At the moment I'm looking at pausing the Distribution Agent and the Log Reader Agent to achieve what I want to do. The question is, do I need to pause both agent jobs or can I pause one or the other so that the transactions build up and are applied once the agent is enabled?
I'm not sure whether some of this depends on specific configurations or setup, but I can provide further information if required, so please comment.
"...I'm not entirely sure what the minimum requirement is for pausing the replication."
Replication works like this:
The Log Reader Agent reads the transaction log on the publisher, inserts those records into the distribution database, and marks those log records as inactive (so the transaction log space can be reused). The Distribution Agent then reads those records from the distribution database and applies them to the subscriber database.
When you want to stop/pause replication, you can stop:
1. the Log Reader Agent (right-click the job and stop), or
2. the Distribution Agent (right-click the job and stop), or
3. both.
"The question is, do I need to pause both agent jobs or can I pause one or the other so that the transactions build up and are applied once the agent is enabled?"
If you pause only the Distribution Agent (which is what I would do), the Log Reader Agent will keep doing its job, and there will be no impact on log reusability on the publisher.
There are caveats, though: if the replication latency crosses the maximum retention limit, you will need to reinitialize replication. That limit is typically large, e.g. 24 hours.
You can also use the link below to monitor replication after it has been re-enabled:
https://www.brentozar.com/archive/2014/07/monitoring-sql-server-transactional-replication/
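If you prefer to script the pause instead of right-clicking in SSMS, a minimal sketch follows, using Python with pyodbc against msdb (the Distribution Agent job name is a placeholder; on a real distributor it is usually composed of the publisher, publication, and subscriber names):

```python
import pyodbc

DISTRIBUTOR = ("Driver={ODBC Driver 17 for SQL Server};"
               "Server=DISTRIBUTOR;Database=msdb;Trusted_Connection=yes;")
DIST_AGENT_JOB = "MYPUB-MyPublication-MYSUB-1"  # placeholder job name

def set_distribution_agent(running):
    """Stop the Distribution Agent before the long job, restart it after."""
    proc = "msdb.dbo.sp_start_job" if running else "msdb.dbo.sp_stop_job"
    with pyodbc.connect(DISTRIBUTOR, autocommit=True) as conn:
        conn.execute(f"EXEC {proc} @job_name = ?", DIST_AGENT_JOB)

set_distribution_agent(False)  # transactions queue up in the distribution DB
# ... run the SSIS package against static data ...
set_distribution_agent(True)   # queued transactions are applied to the subscriber
```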

Implementing a failover heartbeat mechanism in SQL Server

This may sound like a strange, long-winded question... I'm implementing a failover mechanism.
This is written in C# .NET 4.5 Console application using SQL Server.
My requirements in order of priority:
Only one instance of the program can be running at a time (NEVER two),
If one instance fails/hangs, start the other instance (on a separate VM) as soon as possible but ensure the original one stops (to meet requirement #1 above).
What I did:
Before updating the database during normal operations, the program will:
BEGIN TRANSACTION
Check to ensure the database's MasterName table shows it is still master,
Modify Database
COMMIT TRANSACTION
The procedure above implements requirement #1 by ensuring that the program is still the master before modifying the database.
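As a sketch of that flow (Python with pyodbc for brevity; the C# SqlClient equivalent is straightforward, and the table/column names are assumptions): holding UPDLOCK/HOLDLOCK on the MasterName row until COMMIT makes the check and the modification atomic with respect to a takeover UPDATE.

```python
import pyodbc

CONN = ("Driver={ODBC Driver 17 for SQL Server};"
        "Server=DBHOST;Database=Trading;Trusted_Connection=yes;")
MY_NAME = "ServerA"

def modify_if_master():
    conn = pyodbc.connect(CONN)  # autocommit off: everything below is one transaction
    try:
        cur = conn.cursor()
        # UPDLOCK + HOLDLOCK keeps the row locked until COMMIT/ROLLBACK, so a
        # takeover UPDATE from the other instance cannot interleave with us.
        cur.execute("SELECT MasterName FROM MasterNameTable WITH (UPDLOCK, HOLDLOCK)")
        row = cur.fetchone()
        if row is None or row.MasterName != MY_NAME:
            conn.rollback()
            return False  # no longer master: skip the modification entirely
        # stand-in for the real database modification:
        cur.execute("UPDATE AppState SET LastActionAt = SYSUTCDATETIME()")
        conn.commit()
        return True
    finally:
        conn.close()
```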
To implement requirement #2, the active program issues a heartbeat to the db and the program which is not the master (the slave) checks these timestamps.
If the Slave notices the timestamp has not been updated in 2X seconds, it will write its own name in the database MasterName table and take over.
Here's the rub: In the above flow,
1) If the master (ServerA) hangs for a minute (say, Windows is indexing or something) after it has checked that it is still master, and in that time ServerB takes over, ServerA won't know that ServerB took over: it is in the middle of its TRANSACTION and will complete it.
I could "Lock some table".
Ok, say I lock:
LOCK MasterNameTable
BEGIN TRANSACTION
Check to ensure the database's MasterName table shows it is still master,
Modify Database
COMMIT TRANSACTION
UNLOCK MasterNameTable
Now, if ServerA crashes, or hangs for a really long time, ServerB can't take over since it is locked out.
This feels like a catch-22. I have to believe this has been implemented millions of times in the past and there probably is some standard out there. I'd appreciate any pointers.
EDIT 1: Providing more details:
This is a .NET 4.5 Console Application. I have two instances of it running on two separate VMs so that if one fails, the other can "take over".
This is an algorithmic stock trading application, so it would be bad to have two running at the same time (duplicate trading) and it would be bad if there were a timespan when it was not running (missing opportunities and missing ways to "get out"). All failover mechanisms I have researched to date involve a timespan when both "could" be running at the same time (however short), so this is the only solution I have found so far that prevents this.
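One common pattern for the slave side is sketched below, under the same assumed schema plus a LastHeartbeat column: treat the heartbeat as a lease and do the staleness check and the name swap in a single atomic UPDATE, so two slaves cannot both win, while a lock timeout keeps the slave from blocking forever behind a hung master. Combined with the check-under-lock transaction above, a ServerA that resumes after a long hang fails its next mastership check instead of trading twice.

```python
import pyodbc

CONN = ("Driver={ODBC Driver 17 for SQL Server};"
        "Server=DBHOST;Database=Trading;Trusted_Connection=yes;")
MY_NAME = "ServerB"
LEASE_SECONDS = 30  # 2X the heartbeat interval, per the scheme above

def try_take_over():
    with pyodbc.connect(CONN, autocommit=True) as conn:
        cur = conn.cursor()
        # Don't wait forever behind a hung master; a timeout raises an
        # error and the slave simply retries on its next poll.
        cur.execute("SET LOCK_TIMEOUT 5000")
        # Atomic compare-and-swap: succeeds only if the lease has expired.
        cur.execute(
            """
            UPDATE MasterNameTable
               SET MasterName = ?, LastHeartbeat = SYSUTCDATETIME()
             WHERE DATEDIFF(second, LastHeartbeat, SYSUTCDATETIME()) > ?
            """,
            MY_NAME, LEASE_SECONDS)
        return cur.rowcount == 1
```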

How to handle solr replication when master goes down

I have a Solr setup which is configured for master and slave. The indexing happens on the master, and the slave replicates the index from the master at a two-minute interval, so there is a delay of two minutes in getting data from master to slave. Let's assume the master was indexing some data at 10:42 but went down at 10:43 due to a hardware issue. The data indexed at 10:42 was supposed to replicate to the slave by 10:44 (given the two-minute interval). Since the master is no longer available, how do I identify the last data indexed on the Solr master server? Is there a way to track indexing activity in the Solr log?
Thanks in Advance
Solr does log the indexing operations if you have the Solr log level set to INFO. Any commit/add will show up in the log, so you can check the log for when the last addition was made. Depending on the setup, it might be hard to get at the log when the server is down, though.
You can reduce the time between replications to get more real time replication, or use SolrCloud instead (which should distribute the documents as they're being indexed).
There are also API endpoints (see which connections the Admin interface makes when browsing to the 'replication' status page) for getting the replication status, but those wouldn't help you if the server is gone.
In general, if the server isn't available, you'll have a hard time telling when it was last indexed. You can work around a few of the issues by storing the indexing time outside of Solr from the indexing task, for example updating a value in memcache or MySQL every time you send something to be indexed from your application.
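A sketch of that last suggestion, assuming a MySQL table created as `solr_index_log (id INT PRIMARY KEY, last_indexed DATETIME)` (names and credentials are placeholders):

```python
import datetime
import mysql.connector  # pip install mysql-connector-python

def record_index_time(conn):
    # Overwrite the single bookkeeping row after every successful add/commit to Solr.
    cur = conn.cursor()
    cur.execute("REPLACE INTO solr_index_log (id, last_indexed) VALUES (1, %s)",
                (datetime.datetime.utcnow(),))
    conn.commit()

conn = mysql.connector.connect(host="dbhost", user="app",
                               password="secret", database="ops")
# call this right after each batch is sent to Solr:
record_index_time(conn)
```

If the master dies, you can then compare this timestamp with the slave's last replication time to bound how much data was lost.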

Solr Replication Starts periodically

Under what conditions does Solr start replicating from scratch? We have noticed that in our master-slave setup, Solr periodically starts replicating the entire index from the beginning.
We have not made any changes to the schema or config files; in spite of that, full replication gets triggered. How can this be avoided?
Regards,
Ayush

SQL Server Merge Replication Problems

I have merge replication set up on a CRM system. Sales reps' data merges when they connect to the network (I think when SQL Server detects the notebooks are connected), and then they take the laptops away and merge again when they come back (there are about 6 laptops in total merging via 1 server).
This system seems fine when initially set up, but it almost grinds to a halt after about a month, with the merge job taking nearly 2 hours to run per user; the server is not struggling in any way.
If I delete the whole publication and recreate all the subscriptions it seems to work fine until about another month passes, then I am back to the same problem.
The database is poorly designed with a lack of primary keys/indexes etc, but the largest table only has about 3000 rows in it.
Does anyone know why this might be happening and if there is a risk of losing data when deleting and recreating the publication?
The problem was the metadata created by SQL Server replication: there is an overnight job that empties and refills a 3000-row table, which causes replication to replicate all of those rows each day.
The subscriptions were set to never expire, which means the old metadata was never being deleted by SQL Server.
I have now set the subscription period to 7 days, in the hope that it will clean up the metadata after this period. I did some testing and proved that changes were not lost if a subscription expired, but any updates on the server took priority over the client.
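For reference, a sketch of that retention change as a script rather than through the SSMS publication properties dialog (publication, server, and database names are placeholders; retention is expressed in days by default):

```python
import pyodbc

PUBLISHER = ("Driver={ODBC Driver 17 for SQL Server};"
             "Server=PUBLISHER;Database=CRM;Trusted_Connection=yes;")

with pyodbc.connect(PUBLISHER, autocommit=True) as conn:
    # Run in the publication database; expired subscriptions' metadata
    # becomes eligible for cleanup after the retention period.
    conn.execute(
        "EXEC sp_changemergepublication @publication = ?, "
        "@property = N'retention', @value = N'7'",
        "CRM_Publication")
```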
I encountered "Waiting 60 second(s) before polling for further changes" recently on 2008 R2.
Replication Monitor showed the replication as "In progress", but only step 1 (Initialization) and step 2 (Schema changes and bulk inserts) were performed.
I was puzzled as to why the other steps were not being executed.
The reason was simple: it seems that merge replication requires the TCP/IP (and possibly Named Pipes) protocol to be enabled.
No errors were reported.
Probably a similar problem (some sort of connection problem) became apparent in Ryan Stephens's case.
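To check which protocol your connections actually end up using (the protocols themselves are enabled in SQL Server Configuration Manager), a quick sketch:

```python
import pyodbc

CONN = ("Driver={ODBC Driver 17 for SQL Server};"
        "Server=DBHOST;Database=master;Trusted_Connection=yes;")

with pyodbc.connect(CONN) as conn:
    row = conn.execute(
        "SELECT net_transport FROM sys.dm_exec_connections "
        "WHERE session_id = @@SPID").fetchone()
    print(row.net_transport)  # e.g. 'TCP' or 'Named pipe'
```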
