Database Backup best practices - database

I am working in a production environment where we process XML files daily. Our database is quite big, and we take a daily backup. I learned that MarkLogic adds changes to the previous backup to create the new backup.
I want to confirm whether this is the best way to keep daily backups, or whether there is a better way to do it. Also, is there any limit to the process that I am following? My database is around 350 GB and growing daily, so I am looking for a faster and easier solution.

This question is fairly open-ended: there is no single "best way". MarkLogic supports full online backups, and journal archiving for continuous incremental backup. The docs at http://docs.marklogic.com/guide/admin/backup_restore discuss these options.
Instead of a full daily backup, you might consider a full weekly backup plus journal archiving. As you start a new week, you can do whatever you like with the data from the previous week: retain it, delete it, move it onto cheaper storage, etc.
As MarkLogic databases go, 350 GB is not so large. However, at that point you should have already configured multiple forests: see http://docs.marklogic.com/guide/cluster/scalability#id_96443 for guidelines. Assuming you have multiple CPU cores, storing the content in a proportional number of forests will improve performance throughout the system. That includes backup, because multiple forests will back up in parallel, though of course the disk may still be the bottleneck. If storage is the bottleneck, separating the I/O for forests and backup is advisable.
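To make the parallelism point concrete, here is a rough back-of-the-envelope sketch in Python. The function name and all throughput figures are hypothetical inputs chosen for illustration, not MarkLogic measurements. It assumes each forest streams at a fixed rate and that the backup disk caps the combined rate.

```python
def estimated_backup_hours(total_gb, forests, per_forest_mb_s, disk_mb_s):
    """Rough backup duration when forests are backed up in parallel.

    All rates are hypothetical inputs, not measured MarkLogic figures.
    """
    # Parallel streams add up, but never beyond what the backup disk can absorb.
    effective_mb_s = min(forests * per_forest_mb_s, disk_mb_s)
    total_mb = total_gb * 1024
    return total_mb / effective_mb_s / 3600


for forests in (1, 2, 4, 8):
    hours = estimated_backup_hours(350, forests, per_forest_mb_s=50, disk_mb_s=200)
    print(f"{forests} forest(s): about {hours:.2f} h")
```

With these made-up numbers the estimate stops improving once the forests together saturate the disk, which is the case where separating forest and backup I/O starts to matter.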
If having multiple forests is a new idea, you might also be interested in https://github.com/mblakele/task-rebalancer

Related

Are CoW snapshots the solution to safely pull data from critical OLTP databases for reporting?

Our IT team copies data from mission-critical SQL Server OLTP databases in what seems to be a naive way - basically just INSERT INTO ... SELECT * every night. We use this copied data database for reporting. This is unsatisfactory for various reasons but we're told it is the only way because uncontrolled user query execution could compromise OLTP performance & data integrity. I want an improvement that still addresses their concerns.
Copy-on-write snapshots are the best solution I've read about (we don't need up-to-the-minute data for reporting), but please comment on the following:
The snapshot's sparse files should be placed on a separate physical drive (so that snapshot reads/writes can occur without limiting disk throughput for OLTP tasks).
There should be a single NTFS filesystem spanning all physical disks (on a hunch that would work better than putting the online database and its snapshots on logically separated volumes).
Create the filesystem with the /L:enable flag (so it works better with large sparse files).
Avoid multiple snapshots (since original data would have to be copied to each one).
We could use a single snapshot MyDB_LatestSnapshot that could be deleted and very quickly re-created every day, or even throughout the day (so long as kicking users running reports off it is acceptable).
Since the database snapshots will always be recent, most data will not have changed and so it will still have to be retrieved from the same drive as the online OLTP database, so increased resource (CPU/RAM) use is inevitable. Won't a long-running reporting query that pulls years of historical data (including data that hasn't changed and therefore doesn't exist in the snapshot) block writes just as if it were running against the online database?
Is there any way to tell SQL Server to prioritize resources for the needs of the OLTP database?
I've found examples of how original rows are copied from the online database when they're updated, but how do snapshots handle structural changes in the new database, like new/altered tables, indexes, etc.?
Can snapshots have different user permissions versus the online database (so that users can read from the snapshot, but not the online database)?
The OLTP system runs core banking applications, so I understand utmost caution is justified, but I can't believe the current approach is best practice in 2022.

How can I back up Google Datastore for efficient restoration?

Google Datastore has a backup utility. But it is far too slow for an operational database, taking hours to run a backup or restoration of a few dozen GB. Also, Google recommends disabling Cloud Datastore writes during backup, which again is impossible for an operational database.
How can I back up my Datastore so that if there is data corruption, I can rapidly restore, losing at most a few minutes of transactions?
It seems that this is an essential part of any full-strength database system.
(Other databases provide this with
append-only storage or
periodic backups augmented with a differential backup or transaction log or
realtime mirroring, though that doesn't handle the case of data corruption from a bug that writes to the database.)
The backup utility starts MapReduce jobs in the background to parallelize the backup/restore and perform it faster. However, it seems to shard the entities by namespace. If your data is on one or few namespaces the process can be very slow.
You could implement your own parallel backup/restore mechanism by using a tool like AppEngine MapReduce or Cloud Dataflow to make it faster.
You aren't going to get a "few minutes" latency with an "eventually consistent" nosql datastore like this in order to protect yourself from yourself with a naive backup of everything. You really should invest in good testing to make sure that no bugs like this exist in the first place.
Your only real solution is to use immutable data and versioning, since this is how all the other nosql systems do it as well. The datastore key system that exists already works well for this and is extremely fast.
Never update anything and you never corrupt anything or lose anything, and you can roll back individual records to previous revisions very quickly.
Archive the older revisions to the cloud bucket store at some point to save money on storage, or delete them after a fixed number of revisions.
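As a minimal sketch of the immutable-data idea, here is an in-memory Python illustration. It is not the Datastore API; the class and method names are made up for the example. Every write appends a new revision, a rollback re-appends an old one, and pruning hands back the old revisions so they could be archived elsewhere.

```python
import time
from collections import defaultdict


class VersionedStore:
    """In-memory illustration of append-only, versioned records (hypothetical)."""

    def __init__(self):
        # entity id -> list of (revision number, timestamp, value), oldest first
        self._revisions = defaultdict(list)

    def put(self, entity_id, value):
        """Write a new revision instead of overwriting the previous one."""
        history = self._revisions[entity_id]
        revision = history[-1][0] + 1 if history else 1
        history.append((revision, time.time(), value))
        return revision

    def get(self, entity_id, revision=None):
        """Return the latest value, or the value stored at a given revision."""
        history = self._revisions[entity_id]
        if not history:
            raise KeyError(entity_id)
        if revision is None:
            return history[-1][2]
        for rev, _, value in history:
            if rev == revision:
                return value
        raise KeyError((entity_id, revision))

    def roll_back(self, entity_id, revision):
        """Restore an old revision by appending it again as the newest one."""
        return self.put(entity_id, self.get(entity_id, revision))

    def prune(self, entity_id, keep):
        """Drop all but the newest `keep` revisions and return the dropped ones,
        so the caller can archive them to cheaper storage."""
        if keep < 1:
            raise ValueError("keep must be at least 1")
        history = self._revisions[entity_id]
        cut = max(len(history) - keep, 0)
        dropped = history[:cut]
        self._revisions[entity_id] = history[cut:]
        return dropped
```

Note that the rollback itself is just another append, so even a mistaken rollback can be undone the same way.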

Maintenance window and recovery for a large database

One of our teams is developing a database that will be somewhat large (~500GB) and grow from there (I know 500 Gigs may seem small to many of you, but it will be one of the larger databases in our shop). One of the issues they are grappling with is backing up and restoring the database. Basically, the database will have several "data" tables and one table used for storing images / documents. We need to accomplish the following:
Be able to quickly back up and restore only the data tables (sans images) to our test server for debugging and testing purposes.
In the event of a catastrophic database failure, restore the data tables only to get most of the application up and running ASAP. Then, restore the images table when possible.
Back up the database within the allotted nightly time window (a few hours).
My questions are:
Is it possible to accomplish the first two goals while still having the images stored in the same database? If so, would we use filegroups, filestream, or something else?
How do other shops back up their databases in a reasonable time window while maintaining high availability? Do you replicate to a second server and back up from there?
We have dealt with similar issues. We are a $2.5B solar manufacturing company and disaster recovery is critical for us, as well as keeping our databases backed up. Our main database is our plant floor production database. We decided to strip this database to the absolutely essential data needed to maintain production, and move other data off into its own database. This has allowed us high availability and reasonable backup/restore times.
In your case, is it really necessary to store images in the same database as your other data? I suspect it's not, and is just a case of making some issues easier to deal with. I think separate file groups would also help your problem. But you might want to seriously reconsider whether everything needs to be in a single DB.

Backup SQL Server while minimizing bandwidth

I want to implement an automated backup system for my site's SQL Server 2005 DB that will back up nightly to Amazon's S3 service. But since S3 charges both for space and bandwidth used, I would like to minimize the size of the files that I transfer in. What is the best way to achieve this?
I should clarify that I'm not really talking so much about compression, which is pretty straightforward, but concerning backup strategies like whether to do differential backups all the time, whether I need to copy transaction logs, etc.
Differential backups will be smaller than full backups, of course. However, you should consider the restoration side as well. You'll need your last full backup as well as the most recent differential taken after it to perform the restore, which can add up to a lot of bandwidth/transfer time for a restore. One option is to perform a full backup weekly and do differentials daily (or a similar type of schedule).
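To put rough numbers on that trade-off, here is a small Python sketch. The function names and sizes are hypothetical. It assumes each differential is cumulative since the last full backup (as SQL Server differentials are) and that every day changes a fixed amount of previously untouched data, which is an upper-bound simplification.

```python
def weekly_upload_gb(full_gb, daily_change_gb, strategy):
    """Approximate GB uploaded over one 7-day cycle.

    Assumes a differential taken k days after the full is about
    k * daily_change_gb, capped at the size of a full backup.
    """
    if strategy == "daily_full":
        return 7 * full_gb
    if strategy == "weekly_full_daily_diff":
        diffs = [min(k * daily_change_gb, full_gb) for k in range(1, 7)]
        return full_gb + sum(diffs)
    raise ValueError(f"unknown strategy: {strategy}")


def restore_download_gb(full_gb, daily_change_gb, days_since_full):
    """GB to download for a restore: the full plus the latest differential."""
    if days_since_full == 0:
        return full_gb
    return full_gb + min(days_since_full * daily_change_gb, full_gb)


print(weekly_upload_gb(10, 0.5, "daily_full"))
print(weekly_upload_gb(10, 0.5, "weekly_full_daily_diff"))
print(restore_download_gb(10, 0.5, days_since_full=6))
```

Plugging in your own full size and daily change rate shows how quickly the differentials grow toward the size of a full backup, which is the point where a shorter full-backup cycle starts to pay off.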
As for transaction logs, it depends on what granularity you're looking for in restoring your data. If restoring to the last full or differential backup is sufficient, then you don't need to worry about taking transaction log backups. If that's not the case, then transaction log backups will be necessary.
Either use a commercial product to compress the backups, like Red Gate Backup Pro, or just zip-compress them after you're done.
Write a batch script or PowerShell script that will find the file(s) created in the past day and zip them up. Then FTP or whatever you have to do.
A powershell example that I just came across.
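The find-and-zip step described above can also be sketched in Python, as an alternative to a batch or PowerShell script. The function name, the *.bak pattern, and the 24-hour window are assumptions for illustration. It uses the file modification time, since creation time is not portable across platforms.

```python
import time
import zipfile
from pathlib import Path


def zip_recent_backups(backup_dir, archive_path, max_age_hours=24, pattern="*.bak"):
    """Zip files in backup_dir that were modified within the last max_age_hours.

    Returns the names of the files added to the archive.
    """
    cutoff = time.time() - max_age_hours * 3600
    recent = sorted(
        p for p in Path(backup_dir).glob(pattern)
        if p.is_file() and p.stat().st_mtime >= cutoff
    )
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in recent:
            # Store only the file name so the archive contains no absolute paths.
            archive.write(path, arcname=path.name)
    return [p.name for p in recent]
```

The upload step is left out on purpose, since it depends on how you move files to S3.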

DB2 Online Database Backup

I currently have a 200+ GB database that uses the DB2 built-in backup to do a daily backup (and hopefully not a restore - lol). But since that backup now takes more than 2.5 hours to complete, I am looking into a third-party backup and restore utility. The version is 8.2 FP 14, but I will be moving soon to 9.1, and I also have some 9.5 databases to back up and restore. What are the best tools that you have used for this purpose?
Thanks!
One thing that will help is going to DB2 version 9 and turning on compression. The size of the backup will then decrease (by up to 70-80% on table level), which should shorten the backup time. Of course, if your database is continuously growing you'll soon run into problems again, but then data archiving might be the thing for you.
Before looking at third party tools, which I doubt would help too much, I would consider a few optimizations.
1) Have you used REORG on your tables and indexes? This would compact the information and minimize the amount of pages used;
2) If you can, back up to multiple disks at the same time. This can easily be achieved by running db2 backup db mydb to /mnt/disk1, /mnt/disk2, /mnt/disk3 ...
3) DB2 should do a good job at fine tuning itself, but you can always experiment with the WITH num_buffers BUFFERS, BUFFER buffer-size and PARALLELISM n options. But again, usually DB2 does a better job on its own;
4) Consider performing daily incremental backups, and a full backup once on Saturdays or Sundays;
5) UTIL_IMPACT_PRIORITY and UTIL_IMPACT_LIM let you throttle the backup process so that it doesn't affect your regular workload too much. This is useful if your main concern is not the time per se, but rather the performance of your data server while you back up;
6) DB2 9's data compression can truly do wonders when it comes to reducing the dimensions of the data that needs to be backed up. I have seen very impressive results and would highly recommend it if you can migrate to version 9.1 or, even better, 9.5.
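As a rough illustration of how the options from points 2 to 5 fit together, here is a Python helper that only assembles a BACKUP DATABASE command string and does not run it. The helper and its parameter names are hypothetical, and the clause order reflects my understanding of the DB2 9 command syntax, so check it against the Command Reference for your version before relying on it.

```python
def build_db2_backup_command(database, targets, online=True, incremental=False,
                             num_buffers=None, buffer_pages=None,
                             parallelism=None, util_impact_priority=None):
    """Assemble a DB2 BACKUP DATABASE command line as a string (not executed)."""
    if not targets:
        raise ValueError("at least one target directory is required")
    parts = ["BACKUP DATABASE", database]
    if online:
        parts.append("ONLINE")
    if incremental:
        parts.append("INCREMENTAL")
    # Several targets let DB2 write to all of them at the same time.
    parts.append("TO " + ", ".join(targets))
    if num_buffers is not None:
        parts.append(f"WITH {num_buffers} BUFFERS")
    if buffer_pages is not None:
        parts.append(f"BUFFER {buffer_pages}")
    if parallelism is not None:
        parts.append(f"PARALLELISM {parallelism}")
    if util_impact_priority is not None:
        parts.append(f"UTIL_IMPACT_PRIORITY {util_impact_priority}")
    return 'db2 "' + " ".join(parts) + '"'


print(build_db2_backup_command("mydb", ["/mnt/disk1", "/mnt/disk2"], parallelism=2))
```

Leaving the buffer and parallelism arguments unset keeps DB2's own tuning in charge, which is usually the better starting point according to point 3.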
There really are only two ways to make backup, and more important recovery, run faster:
1. backup less data and/or
2. have a bigger pipe to the backup media
I think you got a lot of suggestions on how to reduce the amount of data that you back up. Basically, you should be creating a backup strategy that relies on relatively infrequent full backups and much more frequent backups of changed (since last full backup) data. I encourage you to take a look at the "Configure Automatic Maintenance" wizard in the DB2 Control Center. It will help you with creating automatic backups and with other utilities like REORG that Antonio suggested. Things like compression obviously can help, as the amount of data is much lower. However, not all DB2 editions offer compression. For example, DB2 Express-C does not. Frankly, doing compression on a 200GB database may not be worthwhile anyway, and that is precisely why free DBMS like DB2 Express-C don't offer compression.
As far as opening a bigger pipe for your backup, you first have to decide whether you are going to back up to disk or to tape. There is a big difference in speed (obviously disk is a lot faster). Second, DB2 can parallelize backups. So, if you have multiple devices to back up to, it will back up to all of them at the same time, i.e. your elapsed time will be a lot less, depending on how many devices you have to throw at the problem. Again, DB2 Control Center can help you set it up.
Try High Performance Unload (HPU). This was a standalone product from Infotel and is now available as part of the Optim data studio; see the posting here: https://www.ibm.com/developerworks/mydeveloperworks/blogs/idm/date/200910?lang=en
It's not a "third-party" product but anyone that I have ever seen using DB2 is using Tivoli Storage Manager to store their database backups.
Most shops will set up archive logging to TSM so you only have to take the "big" backup every week or so.
Since it's also an IBM product you won't have to worry about it working with all the different flavors of DB2 that you have.
The downside is it's an IBM product. :) Not sure if that ($) makes a difference to you.
I doubt that you can speed things up using another backup tool. As Mike mentions, you can add TSM to the stack, but that will hardly make the backup run any faster.
If I were you, I'd look into where backup files are stored: Are they using the same disk spindles as the database itself? If so, see if you can store the backup files on a storage area which isn't contended for access during your backup window.
And consider using incremental backups for daily backups, and then a long full backup on Saturdays.
(I assume that you are already running online backups, so that your data aren't unavailable during backup.)
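A schedule like that (incrementals on weekdays, a full on Saturdays) can be sketched in a few lines of Python. The function names are made up for the example, and the sketch only decides which kind of backup is due on a given date; it does not call DB2.

```python
from datetime import date, timedelta

SATURDAY = 5  # date.weekday() counts Monday as 0, so Saturday is 5


def backup_type_for(day, full_weekday=SATURDAY):
    """Return 'full' on the weekly full-backup day, otherwise 'incremental'."""
    return "full" if day.weekday() == full_weekday else "incremental"


def most_recent_full(day, full_weekday=SATURDAY):
    """Date of the full backup that an incremental taken on `day` builds on."""
    days_back = (day.weekday() - full_weekday) % 7
    return day - timedelta(days=days_back)


today = date.today()
print(today, backup_type_for(today), "based on full from", most_recent_full(today))
```

Knowing which full backup each incremental depends on also tells you which files must be kept together for a restore.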
A third party backup package probably won't help your speed much. Making sure that you are not doing full backups every 2 hours is probably the first step.
After that, look at where you are writing your backup to. Is it a local drive, instead of a network drive? Are the spindles used for anything else? Backups don't involve a lot of seek activity, but do involve a lot of big writes, so you probably want to avoid RAID 5 and go for large stripe sizes to help maximize throughput.
Naturally, you have to do full backups sooner or later, but hopefully you can find a window when load is light and you can live with a longer time period between backups. Do your full backup during a 4-6 hour period when the normal incrementals are off and then do incrementals based off of that the rest of the time.
Also, until you get your backup copied to a completely separate system you really aren't backed up. You'll have to experiment to figure out if you're better off compressing it before, during or after sending.
