Relative newcomer to SQL and Postgres here, so this is a fairly open question about backing up daily data from a stream. Specific commands/scripts would be appreciated if it's simple; otherwise I'm happy to be pointed to more specific articles/tutorials on how to implement what needs to be done.
Situation
I'm logging various data streams from some external servers, amounting to a few GB per day, every day. I want to store this data on larger hard drives, which will then be used to pull information from for analysis at a later date.
Hardware
x1 SSD (128GB) (OS + application)
x2 HDD (4TB each) (storage, 2nd drive for redundancy)
What needs to be done
The current plan is to have the SSD hold a temporary database containing the day's logged data. When server load is low (early morning), dump the entire temporary database into two separate backup instances, one on each of the two storage disks. The motivation for the temporary database is to reduce the load on the hard drives; furthermore, the daily data is small enough that it can be copied over to the storage drives before server load picks up.
Questions
Is this an acceptable method?
Is it better/safer to just push data directly to one of the storage drives, treat that as the primary database, and automate a scheduled backup from that drive to the 2nd storage drive?
What specific commands would be required to do this while ensuring data integrity (i.e. new data will still be being logged while a backup is in progress)?
At a later date, when budget allows, the hardware will be upgraded, but the above is what's in place for now.
thanks!
First rule when building a backup system - do the simplest thing that works for you.
Running pg_dump will ensure data integrity: it takes a consistent snapshot, so rows logged while the dump is running simply won't be included in it. You will want to note what the last item backed up is, to make sure you don't delete anything newer than that. After deleting the data you may well want to run a CLUSTER or VACUUM FULL on the affected tables, if you can afford the locking.
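For the nightly dump itself, something like the following is the simplest thing that works. This is only a sketch: the database name stream_logs, the mount points /mnt/backup1 and /mnt/backup2, and the script path are assumptions you would adapt to your setup.

    #!/bin/sh
    # Nightly dump sketch. pg_dump takes a consistent snapshot, so logging can
    # continue while it runs; new rows simply land in the next day's dump.
    set -e
    STAMP=$(date +%Y%m%d)
    pg_dump -Fc -f /mnt/backup1/stream_logs_$STAMP.dump stream_logs
    cp /mnt/backup1/stream_logs_$STAMP.dump /mnt/backup2/

Run it from cron in the early-morning window, e.g. 0 4 * * * /usr/local/bin/nightly_dump.sh (path illustrative).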
Another option would be to have an empty template database and do something like the following (a rough SQL sketch follows the list):
Halt application + disconnect
Rename database from "current_db" to "old_db"
CREATE DATABASE current_db TEMPLATE my_template_db
Copy over any other bits you need (sequence numbers etc)
Reconnect the application
Dump old_db + copy backups to other disks.
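In PostgreSQL terms, steps 2-4 look roughly like this. Run the first two statements while connected to a maintenance database such as postgres (you cannot rename a database you are connected to, and neither statement works while other sessions are attached); the sequence name log_id_seq is a made-up example.

    ALTER DATABASE current_db RENAME TO old_db;
    CREATE DATABASE current_db TEMPLATE my_template_db;
    -- then, connected to the new current_db, carry over any state you need, e.g.:
    SELECT setval('log_id_seq', 12345);  -- value read from old_db beforehand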
If what you actually want is two separate live databases, one small and quick and one larger for long-running queries, then investigate tablespaces. Create two tablespaces: the default on the big disks and the "small" one on your SSD. Put your small DB on the SSD. Then you can just copy from one table to another with foreign data wrappers (FDW), or dump/restore, etc.
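A minimal tablespace sketch, assuming mount points /mnt/bigdisk and /mnt/ssd (the directories must already exist and be owned by the postgres OS user; all names here are illustrative):

    CREATE TABLESPACE big_ts  LOCATION '/mnt/bigdisk/pgdata';
    CREATE TABLESPACE fast_ts LOCATION '/mnt/ssd/pgdata';
    CREATE DATABASE live_db    WITH TABLESPACE = fast_ts;  -- small, quick DB on the SSD
    CREATE DATABASE archive_db WITH TABLESPACE = big_ts;   -- bulk storage on the big disks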
Our IT team copies data from mission-critical SQL Server OLTP databases in what seems to be a naive way: basically just INSERT INTO ... SELECT * every night. We use this copied database for reporting. This is unsatisfactory for various reasons, but we're told it is the only way, because uncontrolled user query execution could compromise OLTP performance and data integrity. I want an improvement that still addresses their concerns.
Copy-on-write snapshots are the best solution I've read about (we don't need up-to-the-minute data for reporting), but please comment on the following:
The snapshot's sparse files should be placed on a separate physical drive (so that snapshot reads/writes can occur without limiting disk throughput for OLTP tasks).
There should be a single NTFS filesystem spanning all physical disks (on a hunch that this would work better than putting the online database and its snapshots on logically separated volumes).
Create the filesystem with the /L:enable flag (so it works better with large sparse files).
Avoid multiple snapshots (since original data would have to be copied to each one).
We could use a single snapshot MyDB_LatestSnapshot that could be deleted and very quickly re-created every day, or even throughout the day (so long as kicking users running reports off it is acceptable); a sketch of that recreate step follows this list.
Since the database snapshots will always be recent, most data will not have changed, so it will still have to be retrieved from the same drive as the online OLTP database; some increased resource (CPU/RAM) use is therefore inevitable. Won't a long-running reporting query that pulls years of historical data (including data that hasn't changed and therefore doesn't exist in the snapshot) block writes just as if it were running against the online database?
Is there any way to tell SQL Server to prioritize resources for the needs of the OLTP database?
I've found examples of how original rows are copied into the snapshot when they're updated in the online database, but how do snapshots handle structural changes made to the source database afterwards, like new/altered tables, indexes, etc.?
Can snapshots have different user permissions versus the online database (so that users can read from the snapshot, but not the online database)?
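For concreteness, the recreate step I have in mind is roughly the following (assuming MyDB has a single data file with the logical name MyDB_Data; the sparse-file path is illustrative):

    IF DB_ID('MyDB_LatestSnapshot') IS NOT NULL
        DROP DATABASE MyDB_LatestSnapshot;  -- existing reporting sessions must be closed first
    CREATE DATABASE MyDB_LatestSnapshot
        ON ( NAME = MyDB_Data, FILENAME = 'E:\Snapshots\MyDB_LatestSnapshot.ss' )
        AS SNAPSHOT OF MyDB;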
The OLTP system runs core banking applications, so I understand utmost caution is justified, but I can't believe the current approach is best practice in 2022.
I need a bit of help with the following.
Note: in the following scenario, I do not have access to the application's source code, therefore I can only make changes at the database level.
Our database uses a table, dbo.[BLOB], to store all kinds of files and documents. The table uses the IMAGE (yeah, obsolete) data type. Since this particular table is growing quite fast, I was thinking of implementing some archiving feature.
My idea is to move all files older than X months to a second database, and then somehow link from the dbo.[BLOB] table to the external/archiving database.
Is this even possible? The goal is to reduce the database size, in order to improve backup and query performance.
Any ideas and hints much appreciated.
Thanks.
Fabian
There are 2 features to help you with backup speed and database size in this case:
FILESTREAM allows you to store BLOBs as files on the file system instead of inside the database data file. The data is still part of the database and is included in regular backups, but you get a smaller data file along with faster access to the documents: it is much faster to stream a file from the filesystem than from a blob column. Additionally, FILESTREAM allows files bigger than 2 GB.
Partitioning splits the table into smaller chunks at the physical level. You don't need to touch application code to change where particular rows are stored physically: you can decide which data needs to be accessed fast and put it on the SSD, while the rest can land on slower archive storage. You can also back up the current partition more frequently and the archive partitions less often.
Prior to SQL Server 2016 SP1, partitioning was available in Enterprise edition only; as of SQL Server 2016 SP1 it is available in all editions.
In your case you should most likely go with FILESTREAM first.
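A minimal FILESTREAM sketch, assuming FILESTREAM is already enabled at the instance level and using illustrative names/paths. (Existing IMAGE data would have to be copied into the new VARBINARY(MAX) FILESTREAM column; the column type cannot simply be altered in place.)

    ALTER DATABASE MyDb ADD FILEGROUP DocumentsFS CONTAINS FILESTREAM;
    ALTER DATABASE MyDb ADD FILE
        ( NAME = DocumentsFS_File, FILENAME = 'D:\FSData\Documents' )
        TO FILEGROUP DocumentsFS;
    CREATE TABLE dbo.BlobArchive
    (
        BlobId  UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        Content VARBINARY(MAX) FILESTREAM NULL
    );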
Without modifying the application you can do, basically, nothing. You may try to see whether changing the column type will be tolerated by the application (very unlikely, 99.99% it will break the app) and try to use FILESTREAM, but even if you succeed it does not give much benefit (the backup size will be the same, for example).
A second thing you can try is to replace the table with a view, using INSTEAD OF triggers for modifications. This is still very likely to break the application (let's say 99.98%). The goal would be a distributed partitioned view (or cross-database partitioned view) which presents to the application a unified view of the 'cold' and 'hot' data. It is complex and error-prone, but it will reduce the size of the backups (as long as data is moved from hot to cold and the cold data is immutable, requiring few backups).
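The rough shape of such a view, with entirely hypothetical table and column names (the existing dbo.[BLOB] table would first be renamed and its rows split between a 'hot' table and a 'cold' table in the archive database, each carrying a CHECK constraint on the partitioning column):

    CREATE VIEW dbo.[BLOB]
    AS
    SELECT BlobId, CreatedOn, Content FROM dbo.BLOB_hot
    UNION ALL
    SELECT BlobId, CreatedOn, Content FROM ArchiveDb.dbo.BLOB_cold;
    -- plus INSTEAD OF INSERT/UPDATE/DELETE triggers on the view to route
    -- modifications to the correct base table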
The goal is to reduce the database size, in order to improve backup and query performance.
To reduce the backup size, as I explained above, you can do, basically, nothing. Performance, though, is something you need to investigate and address appropriately, based on your findings. Saying that the database is slow 'because of BLOBs' is hand-waving.
I need to install SQL Server for a production environment. There are only two drives in the system: one with 120 GB and another with 50 GB. How should I choose which drives hold the user database data files, the log files, and the tempdb files?
Your question is too broad to have a simple answer.
Take these points into consideration:
What is the size of user data database?
What is expected growth of user data database?
Do you have a lot of queries with #-tables? (tempdb strain)
What is the expected transaction count per second/minute?
Use SQLIO to measure the speed of your drives.
Do you have load tests ready? (Run them, watch Resource Monitor, and check the disk queues.)
Which recovery model are you using? (This drives the growth of your log files.)
Which backup strategy are you planning?
It is equally possible that:
you don't have to worry at all, with all DBs left in the default location
you need faster hardware
Which is the better practice for storing a file: store the file directly in the database, or store just the location of that file?
Avoid storing files in your database. Most don't deal with them well.
It depends. You need to consider several things.
If you have a mickey mouse freeware database, meaning that it does not handle blobs appropriately (reads the blobs on every SELECT; does not store the blobs in a physical structure separate from the row; very slow with blobs; etc.):
keep the files outside, store only the location
manually deal with the syncing of row.location vs the file system
If you have an enterprise SQL platform, it is no problem at all to keep the blobs inside the database. In fact, retrieval is faster. These platforms do not read the blobs on every SELECT; the blobs are stored in a physical structure separate from the rows. The one extra read to fetch the blob, when the SELECT requests it, is not a "performance problem".
The PAGESIZE in genuine SQL databases can be set as 2k; 4k; 8k; or 16k.
2k is perfect for OLTP (small rows, small Transactions: you do not want to move 8K on every IO operation)
larger sizes are relevant based on how much OLAP you cater for
in your case, the average size of the files
there will be some waste in the unused portion of the last page, per row/blob.
The disadvantage of keeping the blobs in the database is, your database backups will be significantly larger.
Some enterprise databases (e.g. SAP/Sybase) recognise that a page has not changed and exclude it from the incremental backups
others have no incremental database backups.
The advantage of keeping the blobs in the database is:
data and referential integrity. You will not have the problem of rows that are out of sync with the blobs
the blobs are included in the backup: otherwise, upon a restore, the task of syncing the restored database with the restored files is a major problem.
I completed an assignment last year, where the customer had 130GB of data in the db, and 700GB of documents stored outside the db. After ten years of problems, they bit the bullet, and moved the documents into the db.
Guess what, what was supposed to be a simple job (long but simple, because the references were supposed to be absolutely correct), ended up being massive, because there were so many (a) duplicates, and (b) invalid references.
The resulting database was 630GB, there were 100GB of dupes. 2K pagesize.
Responses to Comments
Slash or Backslash
Easy.
In the database, store slash only.
You need a way of identifying the target system, and an IsWindoze indicator. It should be higher up in the table hierarchy, not at the level where the Filename is located.
Whenever you report or display the Filename column, if IsWindoze, change the slashes to backslashes.
You will have a similar problem with the DriveLetter and colon D:, which Unix does not have. Allow it only if IsWindoze.
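For illustration only (hypothetical StoredFile/TargetSystem tables and columns), the display-time conversion is a single expression:

    SELECT CASE WHEN ts.IsWindoze = 1
                THEN REPLACE(sf.Filename, '/', '\')
                ELSE sf.Filename
           END AS DisplayFilename
      FROM StoredFile   sf
      JOIN TargetSystem ts ON ts.SystemId = sf.SystemId;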
Late answer: it depends on your engine.
A page size of 2K hasn't been used for SQL Server since the 1990s. Oracle defaults to 8K and SQL Server is fixed at 8K. Only Sybase, AFAIK, is still in the last century.
SQL Server now offers FILESTREAM which combines the best of both worlds, as Oracle has done for longer with BFILE
SQL Server and Oracle offer on-disk and backup compression (a one-line example is shown below)
I'm sure PostgreSQL at least offers similar features.
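For instance, backup compression on SQL Server is just a flag on the backup command (database name and path are illustrative):

    BACKUP DATABASE MyDb
        TO DISK = 'D:\Backups\MyDb.bak'
        WITH COMPRESSION, INIT;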
Note: this is mainly to offer alternatives to PerformanceDBA's FUD
The preferred method is to store the file in the filesystem and store the location of the file in the database. The reasoning for this has to do with how databases physically allocate space on disk (usually in 8K or 16K chunks). Dropping large files in there forces the database to use different mechanisms to store them (SQL Server calls this row-overflow data). Typically these kinds of pages are located outside the normal table, so every logical read of a row results in two physical reads on disk. Needless to say, this isn't good for performance.
One of our teams is developing a database that will be somewhat large (~500GB) and grow from there (I know 500 Gigs may seem small to many of you, but it will be one of the larger databases in our shop). One of the issues they are grappling with is backing up and restoring the database. Basically, the database will have several "data" tables and one table used for storing images / documents. We need to accomplish the following:
Be able to quickly backup and restore only the data tables (sans images) to our test server for debugging and testing purposes.
In the event of a catastrophic database failure, restore the data tables only to get most of the application up and running ASAP. Then, restore the images table when possible.
Backup the database within the allotted nightly time window (a few hours).
My questions are:
Is it possible to accomplish the first two goals while still having the images stored in the same database? If so, would we use filegroups, filestream, or something else?
How do other shops backup their databases in a reasonable time window while maintaining high availability? Do you replicate to a second server and backup from there?
We have dealt with similar issues. We are a $2.5B solar manufacturing company and disaster recovery is critical for us, as well as keeping our databases backed up. Our main database is our plant floor production database. We decided to strip this database to the absolutely essential data needed to maintain production, and move other data off into its own database. This has allowed us high availability and reasonable backup/restore times.
In your case, is it really necessary to store the images in the same database as your other data? I suspect it's not, and that it's just a case of making some issues easier to deal with. I think separate filegroups would also help your problem, but you might want to seriously reconsider whether everything needs to be in a single DB.
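If you do keep everything in one database, the filegroup route could look roughly like this (all names and paths are illustrative): put the images table on its own filegroup so the data tables can be backed up on their own and, with piecemeal restore, brought back online first.

    ALTER DATABASE AppDb ADD FILEGROUP Images;
    ALTER DATABASE AppDb ADD FILE
        ( NAME = AppDb_Images, FILENAME = 'E:\Data\AppDb_Images.ndf' )
        TO FILEGROUP Images;
    -- create or rebuild the images table ON [Images], then back up filegroups separately:
    BACKUP DATABASE AppDb FILEGROUP = 'PRIMARY' TO DISK = 'F:\Backups\AppDb_data.bak';
    BACKUP DATABASE AppDb FILEGROUP = 'Images'  TO DISK = 'F:\Backups\AppDb_images.bak';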