I'm an experienced SQL Server DBA, but I just can't figure this out, and I'm wondering if anyone has experienced the same.
I have a 1 TB database running on an m4.10xlarge AWS instance. I have Full/Diff/Trans backups that happen on various schedules to a separate drive that's an "sc1" AWS Volume Type -- basically the cheap cold HDD storage type.
As an additional failsafe, I started experimenting with copying the files on this backup drive to S3. This is where my trouble and this great mystery began. I'm using the AWS CLI "sync" command to transfer files from the sc1 volume to S3, using a simple command like:
aws s3 sync "j:\sqlbackups" s3://****bucketname****/sqlbackups/
Because of the size of the backup files, an initial sync can take a few hours. Whenever I have this command running, at some point SQL Server starts grinding down to being ridiculously slow. Queries time out. Connections can't be established. CPU/memory usage remains the same -- nothing out of the ordinary there. The only thing that's out of the ordinary is that the read throughput on the drive containing the data file spikes:
https://www.dropbox.com/s/avqcyug700jdjzw/Screenshot%202019-12-08%2015.07.45.png?dl=0
BUT there's no relation between this drive (the one with the SQL Server data file) and the sc1 backup drive, and the aws sync command shouldn't touch the drive with the data file.
It's not just SQL Server that slows down -- the whole Windows server that it's on slows down. Launching Chrome takes 30 seconds instead of 1. Reading an event in Event Viewer takes 60 seconds instead of <1 second.
Once this odd slowdown happens, I kill the aws sync command, but the problem still remains. I have to stop/restart the SQL Server service for the problem to go away.
I'm so confused as to how this could be happening. Any ideas?
EDIT 12/8/19 5:39PM EST
I've realized that it's not just the drive containing the data file whose throughput goes off the charts during an S3 transfer -- it's all the drives on the server. This led me to read about how EBS storage works, and I understand that EBS storage is network-attached storage and therefore uses network bandwidth.
This makes me think that the transfer of multi-TB files to S3 is saturating my network bandwidth, making my C/D/E drives perform really poorly. That would explain why every part of the Windows server, not just SQL Server, slows down during an S3 transfer.
What doesn't make sense is that, as an m4.10xlarge instance, this server has a 10 Gbps connection, which means more than 1 gigabyte per second of bandwidth. My S3 transfers peak at about 150 megabytes per second. Plus, the m4.10xlarge instance type is EBS-optimized, meaning there's dedicated bandwidth for the EBS volumes that shouldn't conflict with the bandwidth being used for the S3 transfer.
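One way to test the bandwidth-contention theory is to cap the transfer rate so the copy to S3 can't crowd out EBS traffic. The AWS CLI exposes an s3 max_bandwidth setting (aws configure set default.s3.max_bandwidth ...), and the same idea in Python with boto3 looks roughly like the sketch below; the 50 MB/s cap, file name, and bucket are placeholders, not recommendations.

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Throttle the upload so it can't monopolize instance/EBS bandwidth.
    config = TransferConfig(
        max_bandwidth=50 * 1024 * 1024,  # ~50 MB/s cap (bytes per second)
        max_concurrency=2,               # fewer parallel parts = gentler reads on the sc1 volume
    )

    s3.upload_file(
        r"J:\sqlbackups\FULL_backup.bak",   # hypothetical backup file
        "bucketname",                       # placeholder bucket
        "sqlbackups/FULL_backup.bak",
        Config=config,
    )

If throttled transfers stop triggering the slowdown, that points strongly at instance-level network/EBS contention rather than anything SQL Server is doing.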
Related
I have a single container hosting an MSSQL instance. I'm streaming a ~90 MB gzipped CSV from S3 via a Lambda and writing a large number of inserts in a transaction.
So far the Lambda has taken 5 minutes, but I have larger files than this and will likely exceed the 15-minute Lambda timeout.
I'm thinking that the mounted EFS drive inside the Docker container will be an I/O bottleneck. Has anyone run into this before? I'm also unsure what the outcome would be if ECS started a second container, with both MSSQL instances using the same data folder.
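For what it's worth, here's a rough sketch of the streaming-plus-batched-inserts pattern described above, assuming a Python Lambda with pyodbc available; the bucket, key, connection string, and table/columns are all hypothetical, and it commits per batch rather than in one giant transaction, which is a deliberate variation to keep round trips and the transaction log in check.

    import csv
    import gzip
    import io

    import boto3
    import pyodbc  # assumes the ODBC driver is bundled in the Lambda image/layer


    def handler(event, context):
        s3 = boto3.client("s3")
        obj = s3.get_object(Bucket="my-bucket", Key="incoming/rows.csv.gz")  # placeholders

        # Decompress the S3 body as a stream instead of loading the whole file into memory.
        text = io.TextIOWrapper(gzip.GzipFile(fileobj=obj["Body"]), encoding="utf-8")
        reader = csv.reader(text)

        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=mssql-host;DATABASE=Staging;UID=user;PWD=secret"  # placeholder DSN
        )
        cur = conn.cursor()
        cur.fast_executemany = True  # send whole parameter arrays per round trip

        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= 10_000:
                cur.executemany(
                    "INSERT INTO dbo.Staging (col1, col2, col3) VALUES (?, ?, ?)", batch
                )
                conn.commit()  # commit per batch instead of one huge transaction
                batch.clear()
        if batch:
            cur.executemany(
                "INSERT INTO dbo.Staging (col1, col2, col3) VALUES (?, ?, ?)", batch
            )
            conn.commit()
        conn.close()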
According to this article and several others I've found:
Performance best practices for SQL Server
It is considered a best practice to disable caching on the premium storage disks used for the SQL Server log files. However, I can't find anywhere that explains why.
Does anyone have some insight?
Let me add that the reason I see disabling read-only cache on the log drive as an issue is that it forces you to set up two separate Storage Pools inside the VM, which makes upgrading/downgrading VMs inside Azure more problematic and considerably less performant.
For example, say you start with a DS V13, which has a 16-drive limit, but only about 6 of those drives can be maxed out before you're throttled (25,000 IOPS). Since best practice says read-only cache for data and no cache for logs, you give 8 of those drives to the data and 8 to the log.
Now the server needs to be upgraded, so you upgrade it to a DS V14. Now you can max out 12 drives before being throttled (50,000 IOPS). However, your data drive's Storage Spaces column count is only 8, which caps it at 40,000 IOPS (roughly 8 columns × 5,000 IOPS per premium disk), so you're not using the VM's full IOPS potential.
However, if you start with a DS V13 and assign all 16 of those drives to a single Storage Pool, then put both the log and data on it, you can upgrade/downgrade all the way up to a DS V15 without any concern about failing to use your full IOPS potential.
Another way to put it: if you create a single Storage Pool for all 16 drives, you have considerably more flexibility in upgrading/downgrading the VM; if you have to create two Storage Pools, you do not.
We recommend configuring the “None” cache setting on premium storage disks hosting the log files. Log files have primarily write-heavy operations, and they do not benefit from the ReadOnly cache. Two reasons:
Legitimate cache contents will get evicted by useless data from the log.
Log writes will also consume cache bandwidth/IOPS. If the cache is enabled on a disk (ReadOnly or ReadWrite), every write on that disk also writes that data into the cache, and every read also accesses/puts the data in the cache. Thus every I/O will hit the cache if the cache is on.
Thanks,
Aung
Log files are used as part of recovery and can help restore a database to a point in time. Having corrupt data in a log file after a power outage or hard reboot is not good with MSSQL. See the articles from Microsoft below; they relate to older versions of SQL Server, but the purpose of the log file has not changed.
Information about using disk drive caches with SQL Server that every database administrator should know
https://support.microsoft.com/en-us/kb/234656
Description of caching disk controllers in SQL Server
https://support.microsoft.com/en-us/kb/86903
I have a Microsoft Access .accdb database on a company server. If someone opens the database over the network, and runs a query, where does the query run? Does it:
run on the server (as it should, and as I thought it did), and only the results are passed over to the client through the slow network connection
or run on the client, which means the full 1.5 GB database is loaded over the network to the client's machine, where the query runs, and produces the result
If it is the latter (which would be truly horrible and baffling), is there a way around this? The weak link is always the network; can I have queries run at the server somehow?
(Reason for asking is the database is unbelievably slow when used over network.)
The query is processed on the client, but that does not mean that the entire 1.5 GB database needs to be pulled over the network before a particular query can be processed. Even a given table will not necessarily be retrieved in its entirety if the query can use indexes to determine the relevant rows in that table.
For more information, see the answers to the related questions:
ODBC access over network to *.mdb
C# program querying an Access database in a network folder takes longer than querying a local copy
It is the latter: the 1.5 GB database is loaded over the network.
The "server" in your case is a server only in the sense that it serves the file, it is not a database engine.
You're in a bad spot:
The good thing about Access is that it's easy for people who are not developers to create forms, reports, and the like. The bad part is everything else about it. In particular, two things:
People wind up using it for small projects that grow and grow and grow, and wind up in your shoes.
It sucks for multiple users, and it really sucks over a network when it gets big
I always convert them to a web-based app with SQL Server or something, but I'm a developer. That costs money to do, but that's what happens when you use a tool that does not scale.
I'm using PostgreSQL v9.1 for my organization. The database is hosted in Amazon Web Services (an EC2 instance) behind a Django web framework, which reads and writes data in the database. The problem is how to back up this database periodically in a specified format (see Requirements).
Requirements:
A standby server is available for backup purposes.
The master-db is to be backed up every hour. On the hour, the db is quickly backed up in its entirety and then copied to the slave as a file-system archive.
Along with hourly backups, I need to perform a daily backup of the database at midnight and a weekly backup at midnight every Sunday.
Weekly backups will be the final backups of the db, and all of them will be saved; only the daily backups from the last week and the hourly backups from the last day will be kept.
But I have the following constraints too.
Live data comes into the server every day (roughly one insertion every 2 seconds).
The database now hosts critical customer data, which means it cannot be taken offline.
Usually data stops coming into the db at night, but there's a good chance that data will arrive in the master-db on some nights, and I have no way to stop the insertions (customer data would be lost).
If I use traditional backup mechanisms/software (for example, barman), I have to configure archive mode in postgresql.conf and authenticate users in pg_hba.conf, which implies a server restart to turn it on, which again stops the incoming data for some minutes. This is not permitted (see the constraint above).
Is there a clever way to backup the master-db for my needs? Is there a tool which can automate this job for me?
This is a very crucial requirement, as data began arriving in the master-db a few days ago and I need to make sure there's a replica of the master-db on some standby server at all times.
Use EBS snapshots
If, and only if, your entire database including pg_xlog, data, pg_clog, etc. is on a single EBS volume, you can use EBS snapshots to do what you describe, because they are (or claim to be) atomic. You can't do this if you stripe across multiple EBS volumes.
The general idea is:
Take an EBS snapshot via the EBS APIs, using the command-line AWS tools or a scripting interface like the wonderful boto Python library (a sketch follows below).
Once the snapshot completes, use AWS API commands to create a volume from it, attach the volume to your instance (or preferably to a separate instance), and then mount it.
On that volume you will find a read-only copy of your database from the point in time you took the snapshot, as if your server had crashed at that moment. PostgreSQL is crash-safe, so that's fine (unless you did something really stupid like set fsync=off in postgresql.conf). Copy the entire database structure to your final backup, e.g. archive it to S3 or whatever.
Unmount, detach, and delete the volume created from the snapshot.
This is a terribly inefficient way to do what you want, but it will work.
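As a rough illustration of the snapshot/attach steps above, here's a minimal sketch using boto3 (the modern flavor of the boto library mentioned earlier); the region, volume ID, instance ID, availability zone, and device name are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Snapshot the single volume that holds the entire PostgreSQL data directory.
    snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                               Description="hourly pg backup")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Create a volume from the snapshot and attach it, preferably to a separate instance.
    vol = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                            AvailabilityZone="us-east-1a")
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",
                      Device="/dev/sdf")

    # ...mount it, archive the data directory to S3, unmount...

    # Then detach and delete the temporary volume.
    ec2.detach_volume(VolumeId=vol["VolumeId"])
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.delete_volume(VolumeId=vol["VolumeId"])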
It is vitally important that you regularly test your backups by restoring them to a temporary server and making sure they're accessible and contain the expected information. Automate this, then check manually anyway.
Can't use EBS snapshots?
If your volume is mapped via LVM, you can do the same thing at the LVM level in your Linux system. This works for the lvm-on-md-on-striped-ebs configuration. You use LVM snapshots instead of EBS snapshots, and you can only do it on the main machine, but it's otherwise the same.
You can only do this if your entire DB is on one file system.
No LVM, can't use EBS?
You're going to have to restart the database. You do not need to restart it to change pg_hba.conf; a simple reload (pg_ctl reload, or SIGHUP the postmaster) is sufficient. But you do indeed have to restart to change the archive mode.
This is one of the many reasons why backups are not an optional extra, they're part of the setup you should be doing before you go live.
If you don't change the archive mode, you can't use PITR, pg_basebackup, WAL archiving, pgbarman, etc. You can use database dumps, and only database dumps.
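Until the restart can be scheduled, dumps are the only option. A minimal sketch of an hourly dump shipped to S3, assuming pg_dump is on the PATH and using placeholder database and bucket names (cron or another scheduler would drive the hourly/daily/weekly cadence, and retention can be handled separately):

    import subprocess
    from datetime import datetime, timezone

    import boto3

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/masterdb_{stamp}.dump"

    # -Fc = custom format: compressed, and selectively restorable with pg_restore.
    subprocess.run(["pg_dump", "-Fc", "-f", dump_path, "masterdb"], check=True)

    # Ship the dump off the instance.
    boto3.client("s3").upload_file(dump_path, "my-backup-bucket",
                                   f"hourly/masterdb_{stamp}.dump")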
So you've got to find a time to restart. Sorry. If your client applications aren't entirely stupid (i.e. they can handle waiting on a blocked tcp/ip connection), here's how I'd try to do it after doing lots of testing on a replica of my production setup:
Set up a PgBouncer instance
Start directing new connections to the PgBouncer instead of the main server
Once all connections are via PgBouncer, change postgresql.conf to set the desired archive mode. Make any other desired restart-only changes at the same time; see the configuration documentation for restart-only parameters.
Wait until there are no active connections (one way to check is sketched after this list)
SIGSTOP pgbouncer, so it doesn't respond to new connection attempts
Check again and make sure nobody made a connection in the interim. If they did, SIGCONT pgbouncer, wait for it to finish, and repeat.
Restart PostgreSQL
Make sure I can connect manually with psql
SIGCONT pgbouncer
I'd rather explicitly set pgbouncer to a "hold all connections" mode, but I'm not sure it has one, and don't have time to look into it right now. I'm not at all certain that SIGSTOPing pgbouncer will achieve the desired effect, either; you must experiment on a replica of your production setup to ensure that this is the case.
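For the "wait until there are no active connections" step, a minimal check might look like the sketch below, assuming psycopg2 and a superuser login (the connection details are placeholders). Note that on 9.1 the pg_stat_activity PID column is procpid; it was renamed to pid in 9.2.

    import time

    import psycopg2


    def wait_for_idle(dsn="host=127.0.0.1 dbname=postgres user=postgres"):
        """Poll until no backend other than our own is connected."""
        while True:
            conn = psycopg2.connect(dsn)
            cur = conn.cursor()
            # PostgreSQL 9.1 column name; use "pid" instead on 9.2+.
            cur.execute("SELECT count(*) FROM pg_stat_activity "
                        "WHERE procpid <> pg_backend_pid();")
            (others,) = cur.fetchone()
            conn.close()
            if others == 0:
                return
            time.sleep(5)  # wait for connections to drain through PgBouncer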
Once you've restarted
Use WAL archiving and PITR, plus periodic pg_dump backups for extra assurance.
See:
WAL-E
PgBarman
... and of course, the backup chapter of the user manual, which explains your options in detail. Pay particular attention to the "SQL Dump" and "Continuous Archiving and Point-in-Time Recovery (PITR)" chapters.
PgBarman automates the PITR option for you, including scheduling, and supports hooks for storing WAL and base backups in S3 instead of local storage. Alternately, WAL-E is a bit less automated, but it is pre-integrated with S3. You can implement your retention policies with S3 or via barman.
(Remember that you can use retention policies in S3 to shove old backups into Glacier, too).
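For instance, a lifecycle rule pushed with boto3 might look like this sketch; the bucket name, prefix, and day counts are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Age backups under the "hourly/" prefix into Glacier after 30 days,
    # and expire them entirely after a year.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-backup-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "age-out-backups",
                    "Filter": {"Prefix": "hourly/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                    "Expiration": {"Days": 365},
                }
            ]
        },
    )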
Reducing future pain
Outages happen.
Outages of single-machine setups on something as unreliable as Amazon EC2 happen a lot.
You must get failover and replication in place. This means that you must restart the server. If you do not do this, you will eventually have a major outage, and it will happen at the worst possible time. Get your HA setup sorted out now, not later, it's only going to get harder.
You should also ensure that your client applications can buffer writes without losing them. Relying on a remote database on an Internet host to be available all the time is stupid, and again, it will bite you unless you fix it.
I'm looking for a little advice.
I have some SQL Server tables I need to move to local Access databases for some local production tasks - once per "job" setup, with 400 jobs this quarter, across a dozen users...
A little background:
I am currently using a DSN-less approach to avoid distribution issues
I can create temporary LINKS to the remote tables and run "make table" queries to populate the local tables, then drop the remote tables. Works as expected.
Performance here in US is decent - 10-15 seconds for ~40K records. Our India teams are seeing >5-10 minutes for the same datasets. Their internet connection is decent, not great and a variable I cannot control.
I am wondering if MS Access is adding some overhead here that can be avoided by a more direct approach: i.e., letting the server do all/most of the heavy lifting vs Access?
I've tinkered with various combinations, with no clear improvement or success:
Parameterized stored procedures from Access
SQL Passthru queries from Access
ADO vs DAO
Any suggestions, or an overall approach to suggest? How about moving data as XML?
Note: I have Access 7, 10, 13 users.
Thanks!
It's not entirely clear, but if the MS Access database performing the dump is local and the SQL Server database is remote, across the internet, you are bound to bump into the physical limitations of the connection.
ODBC drivers are not meant to be used for data access beyond a LAN, there is too much latency.
When Access queries data, it doesn't open a stream; it fetches a block of it, waits for the data to be downloaded, then requests another batch. This is OK on a LAN but quickly degrades over long distances, especially when you consider that communication between the US and India probably has around 200 ms of latency that you can't do much about. With a chatty protocol that latency adds up very quickly (a few hundred block fetches at 200 ms each is already more than a minute of pure waiting), and all of this sits on top of a connection whose bandwidth is very likely far below what you would get on a LAN.
The better solution would be to perform the dump locally and then transmit the resulting Access file after it has been compacted and maybe zipped (using 7z for instance for better compression). This would most likely result in very small files that would be easy to move around in a few seconds.
The process could easily be automated. The easiest approach is maybe to perform this dump automatically every day and make it available on an FTP server or an internal website, ready for download.
You can also make it available on demand, maybe through an app running on a server and exposed through RemoteApp using RDP services on a Windows 2008 server, or simply through a website or a shell.
You could also have a simple Windows service on your SQL Server machine that listens for requests from a remote client installed on the local machines everywhere; it would produce the dump and send it to the client, which would then unpack it and replace the previously downloaded database.
Plenty of solutions for this, even though they would probably require some amount of work to automate reliably.
One final note: if you automate the data dump from SQL Server to Access, avoid using Access in an automated way. It's hard to debug and quite easy to break. Use an export tool instead that doesn't rely on having Access installed.
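As one illustration of such a server-side export (not a specific tool the answer has in mind), here's a short Python sketch with pyodbc, using placeholder connection details and a hypothetical table; the resulting gzipped CSV can then be shipped over FTP or HTTP and imported on the Access side:

    import csv
    import gzip

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=Production;Trusted_Connection=yes"  # placeholders
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM dbo.JobData")  # hypothetical table

    with gzip.open("jobdata.csv.gz", "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        while True:
            rows = cur.fetchmany(5000)   # stream in chunks to keep memory flat
            if not rows:
                break
            writer.writerows(rows)

    conn.close()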
Renaud and all, thanks for taking the time to provide your responses. As you note, performance across the internet is the bottleneck. The fetching of blocks of data (vs. a contiguous download) is exactly what I was hoping to avoid via an alternate approach.
Our workflow is evolving to better leverage both sides of the clock: User1 in the US completes their day's efforts in the local DB and then sends JUST their updates back to the server (based on timestamps). User2 in India, who also has a local copy of the same DB, grabs just the updated records off the server at the start of his day. So, pretty efficient for day-to-day stuff.
The primary issue is the initial download of the local DB tables from the server (a huge multi-year DB) for the current "job" - this should happen just once at the start of the effort (a ~1-week-long process). This is the piece that takes 5-10 minutes for India to accomplish.
We currently do move the DB back and forth via FTP - DAILY. It is used as a SINGLE shared DB and is a bit LARGE due to temp tables. I was hoping my new timestamp-based push/pull of just the daily changes would have been an overall plus. It seems to be, but the initial download hurdle remains.