Postgres RDS read replica takes too long to catch up

We are currently running Postgres RDS on AWS with two read replicas; the instances are db.m4.4xlarge. Yesterday we created a new read replica of the same size to help with the load, but it has been in the status "Streaming replication has stopped." for over a day, and the replica lag only goes up.
Previously, when we created read replicas, they went through a similar state after creation but caught up within a few hours. Is there a better way to get the real status of a read replica? And why does the lag keep growing even though we have stopped writing to the master Postgres instance?
This is not the first time we have run into this problem, and the lack of a real status/state is really annoying.
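For reference, running something like the following on the replica itself only shows the replay lag, not whether the replica is stuck or still catching up:

-- on the read replica: time since the last replayed transaction
select now() - pg_last_xact_replay_timestamp() as replay_lag;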
Thanks for the help.

Related

Restarting due to SPID stuck in RUNNING, KILLED/ROLLBACK status

I'm a new "accidental" DBA and I'm currently trying to resolve a lockup caused by a trigger I created on a production database supporting a front end application.
I created a trigger, and then I decided I'd be best off creating a job to do the work instead, so I tried to delete the trigger in Object Explorer. The delete failed with the message:
An exception occurred while executing a Transact-SQL statement or batch.
Lock request time out period exceeded.
I then tried to drop it manually and it failed at 0%, 0 s left to go. I checked for the longest-running transaction and then tried to kill the process in Activity Monitor. Since then the process has been stuck with Task State: RUNNING and Command: KILLED/ROLLBACK. After some googling, it sounds like I have two options.
Option 1: Restart DTC on the SQL server.... didn't work, still stuck.
Option 2: Restart the SQL service. Uh-oh.
This is the first time I've ever had to do anything like this and I'm pretty nervous being the only SQL guy in the office. Can anyone let me know the potential implications of restarting the service, in terms of data loss and impact on front-end users? Am I better off waiting until after business hours to restart?
Thanks, and apologies if I've asked this question badly, first time for everything.
Cheers
Wait. It's rolling back and has to finish the rollback. Don't restart SQL Server; that will just result in the rollback continuing after the restart, possibly with the database offline.
If this is a production system and you do bounce the database, all users of your user interface will get weird and wonderful errors. Unless your application can handle it, your users will have a bad experience and then you will start getting phone calls from the boss....
As a side note, check for locking/blocking processes. The message in the question, "Lock request time out period exceeded", suggests that locking/blocking is happening.
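If you want to see whether the rollback is actually making progress and who is blocking whom, something along these lines usually helps (the 57 below is just a placeholder for your SPID):

-- rough progress report for the killed session
kill 57 with statusonly;

-- sessions that are blocking others, plus the killed/rollback command itself
select session_id, blocking_session_id, command, percent_complete, wait_type
from sys.dm_exec_requests
where blocking_session_id <> 0 or command like 'KILLED%';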

SymmetricDS taking too long on DataExtractorService

I'm using SymmetricDS to migrate data from a SQL Server database to a MySQL database. My tests moving the data of a database that was not in use worked correctly, though they took about 48 hours to migrate all the existing data. I configured dead triggers to move the existing data and regular triggers to move newly added data.
When moving to a live database that is in use, the data is migrated too slowly. In the log file I keep getting the message:
[corp-000] - DataExtractorService - Reached one sync byte threshold after 1 batches at 105391240 bytes. Data will continue to be synchronized on the next sync
I have about 180 tables, and I have created 15 channels for the dead triggers and 6 channels for the regular triggers. In the configuration file I have:
job.routing.period.time.ms=2000
job.push.period.time.ms=5000
job.pull.period.time.ms=5000
I have no foreign key configuration, so there won't be an issue with that. What I would like to know is how to make this process faster. Should I reduce the number of channels?
I do not know what the issue could be, since the first test I ran went very well. Is there a reason why the threshold is not being cleared?
Any help will be appreciated.
Thanks.
How large are your tables? How much memory does the SymmetricDS instance have?
I've used SymmetricDS for a while, and without having done any profiling on it I believe that reloading large databases went quicker once I increased available memory (I usually run it in a Tomcat container).
That being said, SymmetricDS is far from being as quick as some other tools when it comes to the initial replication.
Have you had a look at the tmp folder? Can you see any progress in the size of the files that SymmetricDS temporarily writes locally before sending a batch off to the remote side? Have you tried turning on more fine-grained logging to get more details? What about database timeouts: could it be that the extraction queries run too long and the database just cuts them off?
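You can also watch progress from the database side. Roughly something like this against the SymmetricDS runtime tables on the source should show whether batches are moving at all (table and column names may differ slightly between versions):

-- outgoing batches per channel and state
select channel_id, status, count(*) as batches
from sym_outgoing_batch
group by channel_id, status
order by channel_id, status;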

Failover strategy for database application

I've got a database application that reads and writes and holds a local cache. If the application server fails, a backup server should take over.
The primary and backup applications can only run one at a time, because of the local cache and a low isolation level on the database.
As far as my knowledge of distributed communication goes, it is impossible for the two servers alone to always agree on which one is allowed to run exclusively.
Can I somehow solve this conflict by using the database as a third entity? I think this is a fairly typical problem, and there might not be a 100% safe method, but I would be happy to know how other people recommend solving such issues, or whether there is a best practice for this.
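What I had in mind is something like a single lease row that the active server keeps renewing, and the standby only takes over once the lease has gone stale. This is just a sketch, all names are made up, and the exact date arithmetic depends on the database:

-- one row shared by both servers, seeded once
create table app_lease (
    id         int primary key,
    owner      varchar(64) not null,
    renewed_at timestamp   not null
);
insert into app_lease values (1, 'server-a', current_timestamp);

-- claim or renew: only succeeds if we already own the lease or it has expired
update app_lease
set owner = 'server-a', renewed_at = current_timestamp
where id = 1
  and (owner = 'server-a' or renewed_at < current_timestamp - interval '5' minute);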
It's okay if both applications are down for 30 minutes or so, but that is not enough time to get people out of bed and let them figure out what the problem is.
Can you set up a third server that monitors both application servers for health? This server could then decide appropriately when one of the servers appears to be gone and instruct the hot standby to start processing.
If I get the picture right, your backup server constantly polls the primary server for data updates. It wouldn't be hard to check whether the poll fails, retry it 30 s later up to 3 times, and on the third failure dynamically update the DNS entry for the database server to reflect the change in active server. Both Windows DNS and BIND accept dynamic updates, signed and unsigned.

Is writing server log files to a database a good idea?

After reading an article about the subject from O'Reilly, I wanted to ask Stack Overflow for their thoughts on the matter.
Write locally to disk, then batch insert to the database periodically (e.g. at log rollover time). Do that in a separate, low-priority process. More efficient and more robust...
(Make sure that your database log table contains a column for "which machine the log event came from" by the way - very handy!)
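A rough sketch of the kind of table I mean (column names are just an example):

create table logs (
    logged_at timestamp     not null,
    host      varchar(64)   not null,   -- which machine the log event came from
    log_name  varchar(32)   not null,
    log_level varchar(16)   not null,
    contextId varchar(64)   null,       -- ties related entries together
    message   varchar(4000) not null
);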
I'd say no, given that a fairly large percentage of server errors involve problems communicating with databases. If the database were on another machine, network connectivity would be another source of errors that couldn't be logged.
If I were to log server errors to a database, it would be critical to have a backup logger that wrote locally (to an event log or file or something) in case the database was unreachable.
Log to DB if you can and it doesn't slow down your DB :)
It's way, way faster to find anything in the DB than in log files, especially if you think ahead about what you will need. Logging to the DB lets you query the log table like this:
select * from logs
where log_name = 'wcf' and log_level = 'error'
Then, after you find the error, you can see the whole path that led to it:
select * from logs
where contextId = 'what you get from previous select' order by timestamp
How will you get this info if you log it in text files?
Edit:
As Jon Skeet suggested, this answer would be better if I stated that one should consider making logging to the DB asynchronous. So I state it :) I just didn't need it myself. For an example of how to do it, you can check "Ultra-Fast ASP.NET" by Richard Kiessig.
If the database is a production database, this is a horrible idea.
You will have issues with backups, replication, and recovery: more storage for the DB itself, for replicas (if any), and for backups; more time to set up and restore replication; more time to verify backups; more time to recover the DB from backups.
It probably isn't a bad idea if you want the logs in a database, but I would say not to follow the article's advice if you have a lot of log entries. The main issue is that I've seen file systems struggle to keep up with logs coming from a busy site, let alone a database. If you really want to do this, I would look at loading the log files into the database after they are first written to disk.
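As a sketch of that last approach, with the path, table, and columns made up and using the SQL Server flavour of bulk loading (the exact command depends on your database):

-- staging table matching the file layout
create table log_staging (logged_at datetime, log_level varchar(16), message varchar(4000));

-- load a rotated log file, then move rows into the main log table
bulk insert log_staging
from 'D:\logs\app-20090601.log'
with (fieldterminator = '\t', rowterminator = '\n');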
Think about a properly set up database that uses RAM for reads and writes. This is much faster than writing to disk and avoids the disk I/O bottleneck you see when serving large numbers of clients, where threads start blocking because the OS tells them to wait on currently executing threads that are using all available I/O handles.
I don't have any benchmarks to back this up, although my latest application uses database logging. It will have a failsafe, as mentioned in one of these responses: if the database connection cannot be created, create a local database (H2, maybe?) and write to that. A periodic check of database connectivity can then re-establish the connection, dump the local database, and push it to the remote database.
This could be done during off hours if you do not have a high-availability site.
Sometime in the future I hope to develop benchmarks to demonstrate my point.
Good Luck!

SQL Server transactional replication for very large tables

I have set up transactional replication between two SQL Servers on different ends of a relatively slow VPN connection. The setup is your standard "load snapshot immediately" kind of thing where the first thing it does after initializing the subscription is to drop and recreate all tables on the subscriber side and then start doing a BCP of all the data. The problem is that there are a few tables with several million rows in them, and the process either a) takes a REALLY long time or b) just flat out fails. The messages I keep getting when I look in Replication Monitor are:
The process is running and is waiting for a response from the server.
Query timeout expired
Initializing
It then tries to restart the bulk loading process (skipping any BCP files that it has already loaded).
I am currently stuck where it just keeps doing this over and over again. It's been running for a couple days now.
My questions are:
Is there something I could do to improve this situation given that the network connection is so slow? Maybe some setting or something? I don't mind waiting a long time as long as the process doesn't keep timing out.
Is there a better way to do this? Perhaps make a backup, zip it, copy it over, and then restore it? If so, how would the replication process know where to pick up when it starts applying the transactions, since updates will keep occurring between the time I make the backup and the time I get it restored and running on the other side?
Yes.
You can apply the initial snapshot manually.
It's been a while for me, but the link (into BOL) covers alternatives for setting up the subscriber.
Edit: From BOL How-tos, Initialize a Transactional Subscriber from a Backup
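The gist, if memory serves, is that you restore the backup on the subscriber and then create the subscription with the "initialize with backup" sync type, roughly like this (all names are placeholders):

-- run at the publisher, after restoring the backup on the subscriber
exec sp_addsubscription
    @publication = N'MyPublication',
    @subscriber = N'SubscriberServer',
    @destination_db = N'MyDatabase',
    @sync_type = N'initialize with backup',
    @backupdevicetype = 'disk',
    @backupdevicename = N'\\share\backups\MyDatabase.bak';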
In SQL Server 2005 you have a "compressed snapshot" option that allows you to reduce the total size of the snapshot. When applied over a network, snapshot items travel compressed to the subscriber, where they are then expanded.
I think you can easily gauge the potential speed gain by comparing the sizes of the standard and compressed snapshots.
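If I remember correctly, it is a publication property, something along these lines (note that compressed snapshots have to be written to an alternate snapshot folder rather than the default one):

exec sp_changepublication
    @publication = N'MyPublication',
    @property = N'compress_snapshot',
    @value = N'true';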
By the way, there is a (quite) similar question here for merge replication, but I think that at the snapshot level there is no difference.
