Postgres daily data dump and hydration clogs disk space? - database

I take a daily dump from my production environment by doing:
pg_dump <database name> > dump_<date>.sql
then I transfer this over to staging and import it into the staging DB by first dropping the existing schema (and everything in it):
drop schema public cascade;
create schema public;
and then doing the following:
psql <database name> < dump_<date>.sql
However, the staging DB seems to be getting unusually bigger every day. At this point, even after I drop the tables and data, 150 GB of space is still taken up in the DB folder.
It feels like something like logs or metadata is clogging the folders.
What's the proper way to do this? Or is there a good way to clean up this extra data, other than deleting the DB and re-initializing it every time?
Thanks!

There is a better way, a much much better way.
https://www.postgresql.org/docs/9.5/static/high-availability.html
Database servers can work together to allow a second server to take
over quickly if the primary server fails (high availability), or to
allow several computers to serve the same data (load balancing).
Ideally, database servers could work together seamlessly. Web servers
serving static web pages can be combined quite easily by merely
load-balancing web requests to multiple machines. In fact, read-only
database servers can be combined relatively easily too. Unfortunately,
most database servers have a read/write mix of requests, and
read/write servers are much harder to combine. This is because though
read-only data needs to be placed on each server only once, a write to
any server has to be propagated to all servers so that future read
requests to those servers return consistent results.
Now, when you read the documentation it seems very intimidating at first. In reality, though, all you need to do is take one base backup of the entire cluster and enable WAL archiving in postgresql.conf; then you can copy the WAL archive files daily, weekly, or monthly to another server.
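As a rough illustration of what that involves, a minimal sketch might look like the following (the archive path and backup directory are assumptions, not from the original answer; on 9.5 and older use wal_level = archive or hot_standby instead of replica):

# in postgresql.conf: keep WAL and copy each completed segment to an archive directory
wal_level = replica
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'

# one-time base backup of the whole cluster, taken from the shell
pg_basebackup -D /backups/base -Ft -z -P

# afterwards, shipping the contents of /mnt/wal_archive to the other server
# (daily, weekly, or monthly) is enough to roll the base backup forward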

Related

Automating Postgresql database backup from small SSD to multiple harddrives daily

Relative newcomer to SQL and Postgres here, so this is a fairly open question about backing up daily data from a stream. Specific commands/scripts would be appreciated if it's simple; otherwise I'm happy to be directed to more specific articles/tutorials on how to implement what needs to be done.
Situation
I'm logging various data streams from some external servers, amounting to a few GB per day. I want to be able to store this data on larger hard drives, which will then be used to pull information from for analysis at a later date.
Hardware
x1 SSD (128GB) (OS + application)
x2 HDD (4TB each) (storage, 2nd drive for redundancy)
What needs to be done
The current plan is to have the SSD store a temporary database consisting of the daily logged data. When server load is low (early morning), dump the entire temporary database onto two separate backup instances, one on each of the two storage disks. The motivation for storing a temp DB is to reduce the load on the hard drives. Furthermore, the daily data is small enough that it can be copied over to the storage drives before server load picks up.
Questions
Is this an acceptable method?
Is it better/safer to just push data directly to one of the storage drives, consider that the primary database, and automate a scheduled backup from that drive to the 2nd storage drive?
What specific commands would be required to do this while ensuring data integrity (i.e. new data will still be logged while a backup is in progress)?
At a later date, when budget allows, the hardware will be upgraded, but the above is what's in place for now.
thanks!
First rule when building a backup system - do the simplest thing that works for you.
Running pg_dump will ensure data integrity. You will want to pay attention to which item was backed up last, to make sure you don't delete anything newer than that. After deleting the data you may well want to run a CLUSTER or VACUUM FULL on the affected tables, if you can afford the locking.
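A minimal sketch of that nightly cycle, assuming a database called streamdb, a table called readings with a logged_at timestamp column, and backup directories on each HDD (all of these names are hypothetical):

# compressed dump to the first HDD, mirrored to the second
pg_dump -Fc streamdb -f /mnt/hdd1/backups/streamdb_$(date +%F).dump
cp /mnt/hdd1/backups/streamdb_$(date +%F).dump /mnt/hdd2/backups/

# prune only rows no newer than the last row known to be in the dump,
# then reclaim the space (VACUUM FULL takes an exclusive lock on the table)
cutoff='2024-01-01 03:00'
psql streamdb -c "DELETE FROM readings WHERE logged_at <= '$cutoff';"
psql streamdb -c "VACUUM FULL readings;"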
Another option would be to have an empty template database and do something like the following (a sketch in SQL follows the list):
Halt application + disconnect
Rename database from "current_db" to "old_db"
CREATE DATABASE current_db TEMPLATE my_template_db
Copy over any other bits you need (sequence numbers etc)
Reconnect the application
Dump old_db + copy backups to other disks.
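A hedged sketch of those steps (database names follow the list above; the sequence name and dump path are assumptions). Note that ALTER DATABASE ... RENAME only works once nothing is connected to the database, which is why the application is halted first:

-- with the application stopped and disconnected
ALTER DATABASE current_db RENAME TO old_db;
CREATE DATABASE current_db TEMPLATE my_template_db;
-- copy over anything stateful you still need, e.g. a sequence value:
-- SELECT setval('readings_id_seq', <last value taken from old_db>);
-- reconnect the application, then dump old_db and copy it to the other disks:
-- pg_dump -Fc old_db -f /mnt/hdd1/backups/old_db_$(date +%F).dump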
If what you actually want is two separate live databases, one small and quick and one larger for long-running queries, then investigate tablespaces. Create two tablespaces: the default on the big disks and a "small" one on your SSD. Put your small DB on the SSD. Then you can just copy from one table to another with foreign data wrappers (FDW), dump/restore, etc.
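A minimal sketch of that tablespace layout (paths and names are assumptions; the directories must already exist and be owned by the postgres OS user):

-- one tablespace per class of storage
CREATE TABLESPACE ssd_fast LOCATION '/ssd/pgdata';
CREATE TABLESPACE hdd_bulk LOCATION '/mnt/hdd1/pgdata';

-- the quick logging DB lives on the SSD, the archive DB on the big disks
CREATE DATABASE quick_db TABLESPACE ssd_fast;
CREATE DATABASE archive_db TABLESPACE hdd_bulk;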

Storing binary files in sql server

I'm writing an mvc/sql server application that needs to associate documents (word, pdf, excel, etc) with records in the database (supporting sql server 2005). The consensus is it's best to keep the files in the file system and only save a path/reference to the file in the database. However, in my scenario, an audit trail is extremely important. We already have a framework in place to record audit information whenever a change is made in the system so it would be nice to use the database to store documents as well. If the documents were stored in their own table with a FK to the related record would performance become an issue? I'm aware of the potential problems with backups/restores but would db performance start to degrade at some point if the document tables became very large? If it makes any difference I would never expect this system to need to service anywhere near 100 concurrent requests, maybe tens of requests.
Storing the files as BLOBs in the database will increase the size of the DB and will definitely affect the backups, as you already know.
Another consideration is whether the DB and application servers are the same machine.
If they are not, the application server has to request and fetch the data from the DB server, and then send it on from the application server to the client.
If the file sizes are too large, I would say go for the file system and save the file paths in the DB.
Otherwise you can keep the files as BLOBs in the DB; it will definitely be more secure, as well as safer from viruses, etc.
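A hedged sketch of the separate-documents-table approach the question describes (table and column names are hypothetical; VARBINARY(MAX) is available from SQL Server 2005 onward):

CREATE TABLE dbo.RecordDocuments (
    DocumentId  INT IDENTITY(1,1) PRIMARY KEY,
    RecordId    INT NOT NULL REFERENCES dbo.Records (RecordId), -- FK to the audited record
    FileName    NVARCHAR(260) NOT NULL,
    ContentType NVARCHAR(100) NOT NULL,
    Content     VARBINARY(MAX) NOT NULL,                        -- the document itself
    UploadedAt  DATETIME NOT NULL DEFAULT GETDATE()
);

Keeping the BLOB column in its own table means queries against the parent records never touch the document pages unless they explicitly join to this table.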

What is the DBA function/tool called for keeping many remote DBs synchronized with a main DB

I am a front-end developer being asked to fulfil some DBA tasks. Uncharted waters.
My client has 10 remote (off network) data collection terminals hosting a PostgreSQL application. My task is to take the .backup or .sql files those terminals generate and add them to the main DB. The schema for all of these DBs will match. But the merge operation will lead to many duplicates. I am looking for a tool that can add a backup file to an existing DB, filter out duplicates, and provide a report on the merge.
Is there a term for this kind of operation in the DBA domain?
Is this function normally built into basic DB admin suites (e.g. pgAdmin III), are enterprise-level tools required, or is this something that can be done on the command-line easily enough?
Update
PostgreSQL articles on DB replication here and glossary.
You can't "merge a bunch of tables" but you could use Slony to replicate child tables (i.e. one partition per location) back to a master db.
This is not an out of the box solution but with something like Bucardo or Slony it can be done, albeit with a fair bit of work and added maintenance.
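If a simpler per-table merge is acceptable instead of full replication, one possible sketch (schema, table, and key names are all assumptions, and ON CONFLICT requires PostgreSQL 9.5+) is to restore each terminal's backup into a staging schema and then insert only the rows the main tables don't already have, reporting the counts:

-- assumes the terminal's data was restored into schema terminal01
-- and that readings has a unique key identifying duplicates
WITH ins AS (
    INSERT INTO public.readings
    SELECT * FROM terminal01.readings
    ON CONFLICT DO NOTHING
    RETURNING 1
)
SELECT (SELECT count(*) FROM terminal01.readings) AS rows_in_backup,
       (SELECT count(*) FROM ins)                 AS rows_inserted;

The difference between the two counts is the number of duplicates that were filtered out.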

export a large db with terabytes of data

What's the best way to dump a large (terabytes) DB? Are there faster/more efficient ways besides mysqldump? The dump is intended to be zipped, unzipped, and then reimported into another MySQL DB on another server.
If it's possible for you to stop the database server, the best way is probably for you to:
Stop the database
Do a file copy of the files (including appropriate transaction logs, etc) to a new file system.
Restart the database.
Then move the copied files to the new server and bring up the database on top of the files. It's a bit complicated to do this, but it's by far the fastest way.
I used to be a DBA for a terabyte+ database in MySQL and this is one of the ways we'd do nightly backups of the database. mysqldump would've never worked for data that large. We'd stop the database each night and file copy the underlying files.
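A rough sketch of that cold-copy approach (the service name, data directory, and backup paths are assumptions and vary by installation):

# stop MySQL so the data files on disk are consistent
sudo systemctl stop mysql
# copy the data directory, including InnoDB files and logs, to the backup file system
sudo rsync -a /var/lib/mysql/ /mnt/backup/mysql-$(date +%F)/
# bring the server back up
sudo systemctl start mysql
# later: place the copy in the new server's data directory and start MySQL on top of it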
Since your intent seems to be having two copies of the DB, why not set up replication to do this?
That will ensure that both copies of the DB remain in an identical state (in terms of data anyway).
And, if you want a snapshot to be exported, you can (see the sketch after this list):
wait for a quiet time.
disable replication.
back up the slave copy.
re-enable replication.
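A hedged sketch of that snapshot cycle on the replica (assumes classic MySQL master/slave replication and a database named mydb):

-- on the slave, pause replication
STOP SLAVE;
-- take the dump from the shell while the slave is frozen:
--   mysqldump mydb | gzip > /mnt/backup/mydb.sql.gz
-- then resume replication; the slave catches up from the master's binary log
START SLAVE;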

create data patch for database (synchronize databases)

There are 2 databases: "temp" and "production". Each night the production database should be "synchronized" so that it has exactly the same data as "temp". The databases are several GB in size, and just copying all the data is not an option. But the changes are usually quite small: ~100 rows added, ~1000 rows updated, and some removed, about 5-50 MB per day.
I was thinking maybe there is some tool (preferably free) that could go through both databases and create a patch that could be applied to the "production" database, or, as an option, just "synchronize" both databases. And it should be quite fast. In other words, something like rsync for data in databases.
If there is a solution for a specific database (MySQL, H2, DB2, etc.), that will also be fine.
PS: the structure is guaranteed to be the same, so this question is only about transferring data.
Finally I found a way to do it in Kettle (PDI):
http://wiki.pentaho.com/display/EAI/Synchronize+after+merge
Only one con: I need to create such a transformation for each table separately.
Why not set up database replication from your temp database to your production database, where the temp database acts as the master and production acts as a slave? Here is a link for setting up replication in MySQL. MSSQL supports database replication as well; Google should turn up many tutorials.
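A minimal sketch of the master/slave arrangement being suggested (server IDs, host name, and credentials are assumptions; exact steps vary by MySQL version):

# on the master (temp) server, in my.cnf
[mysqld]
server-id = 1
log_bin   = mysql-bin

# on the slave (production) server, in my.cnf
[mysqld]
server-id = 2

-- then, on the slave, point it at the master and start replicating
CHANGE MASTER TO MASTER_HOST='temp-host', MASTER_USER='repl',
    MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
START SLAVE;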
