I was reading http://www.h2database.com/html/advanced.html#durability_problems and I found:
Some databases claim they can guarantee durability, but such claims are wrong. A durability test was run against H2, HSQLDB, PostgreSQL, and Derby. All of those databases sometimes lose committed transactions. The test is included in the H2 download, see org.h2.test.poweroff.Test
It also says:
Where losing transactions is not acceptable, a laptop or UPS (uninterruptible power supply) should be used.
So is there any database that is truly durable? The document mentions the fsync() command and says that most hard drives do not obey fsync(). It also says there is no reliable way to flush hard drive buffers.
So, is there a time after which a committed transaction becomes durable, so that we can buy a UPS that provides at least that much backup power?
Also, is there a way to know that a committed transaction has become durable? Suppose we don't buy a UPS; once we know a transaction is durable, we could show the success message.
The problem depends on whether or not you can instruct the HDD/SSD to commit transactions to durable media. If the mass storage device does not have the facility to flush to durable media, then no data storage system on top of it can be said to be truly durable.
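To make that concrete, here is a minimal Java sketch (the file name and record format are made up; this is not H2's actual test) of the only thing a database can do: write the commit record and then ask the OS to flush it with FileChannel.force(), which corresponds to fsync(). Even after force() returns, durability still depends on the drive honoring the flush rather than acknowledging from its volatile write cache, which is exactly the problem the H2 documentation describes.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DurableAppend {
    public static void main(String[] args) throws IOException {
        Path log = Path.of("commit.log"); // hypothetical commit log file

        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {

            ByteBuffer record = ByteBuffer.wrap(
                    "COMMIT txn-42\n".getBytes(StandardCharsets.UTF_8));
            ch.write(record);   // the data may still sit in the OS page cache here

            // Equivalent to fsync(): ask the OS to push the data (and metadata)
            // down to the storage device. If the drive's volatile write cache
            // acknowledges before the data is on durable media, durability is
            // still not guaranteed.
            ch.force(true);
        }
        // Only after force() returns is it reasonable to report "committed".
    }
}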
There are plenty of NAS devices with a built-in UPS, however, and these seem to fit the requirement for durable media: if the database on a separate server commits data to that device and does a checkpoint, then the commits are flushed to the media. So long as the media survives a power outage, you can say it's durable. The UPS on the NAS should be capable of issuing a controlled shutdown to its associated disk pack, guaranteeing permanence.
Alternatively, you could use something like SQL Azure, which writes commits to multiple (3) separate database storage instances on different servers. Although we have no idea whether those writes ever reach permanent storage media, it doesn't actually matter: the measurement of durability is read-repeatability, and this seems to meet that requirement.
Related
I was looking into the concept of in-memory databases. Articles about it say:
An in-memory database system is a database management system that stores data entirely in main memory.
and they discuss advantages and disadvantages of this concept.
My question is: if these database management systems store data entirely in main memory, does all the data vanish after a power failure?
Or are there ways to protect the data?
Most in-memory database systems offer persistence, at least as an option. This is implemented through transaction logging. On normal shutdown, an in-memory database image is saved. When it is next re-opened, the previously saved image is loaded and, thereafter, every transaction committed to the in-memory database is also appended to a transaction log file. If the system terminates abnormally, the database can be recovered by re-loading the original database image and replaying the transactions from the transaction log file.
The database is still all in-memory, and therefore there must be enough available system memory to store the entire database, which makes it different from a persistent database for which only a portion is cached in memory. Therefore, the unpredictability of a cache-hit or cache-miss is eliminated.
Appending the transaction to the log file can usually be done synchronously or asynchronously, which will have very different performance characteristics. Asynchronous transaction logging still risks losing committed transactions if they were not flushed from the file system buffers and the system is shut down unexpectedly (e.g. a kernel panic).
In-memory database transaction logging is guaranteed to only ever incur one file I/O to append the transaction to the log file. It doesn't matter if the transaction is large or small, it's still just one write to the persistent media. Further, the writes are always sequential (always appending to the log file), so even on spinning media the performance hit is as small as it can be.
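As a rough illustration of that (not any particular vendor's implementation), here is a Java sketch of an append-only transaction log where the caller chooses synchronous or asynchronous durability; the class name, file name, and record format are made up.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical append-only transaction log for an in-memory database.
public class TransactionLog implements AutoCloseable {
    private final FileChannel channel;
    private final boolean synchronous; // true = flush on every commit

    public TransactionLog(Path file, boolean synchronous) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
        this.synchronous = synchronous;
    }

    // One commit = one sequential append, regardless of transaction size.
    public void commit(String serializedTransaction) throws IOException {
        channel.write(ByteBuffer.wrap(
                (serializedTransaction + "\n").getBytes(StandardCharsets.UTF_8)));
        if (synchronous) {
            channel.force(false); // flush now; slower, but survives a crash
        }
        // Asynchronous mode returns immediately; a crash can lose the
        // transactions still sitting in the file system buffers.
    }

    @Override
    public void close() throws IOException {
        channel.force(false);
        channel.close();
    }
}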
Different media will have greater or lesser impact on performance. HDD will have the greatest, followed by SSD, then memory-tier FLASH (e.g. FusionIO PCIExpress cards) and the least impact coming from NVDIMM memory.
NVDIMM memory can be used to store the in-memory database, or to store the transaction log for recovery. Maximum NVDIMM memory size is less than conventional memory size (and more expensive), but if your in-memory database is some gigabytes in size, this option can retain 100% of the performance of an in-memory database while also providing the same persistence as a conventional database on persistent media.
There are performance comparisons of an in-memory database with transaction logging to HDD, SSD and FusionIO in this whitepaper: http://www.automation.com/pdf_articles/mcobject/McObject_Fast_Durable_Data_Management.pdf
And with NVDIMM in this paper: http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf
The papers were written by us (McObject), but are vendor-neutral.
Could anybody tell me more about the difference between physical replication and logical replication in PostgreSQL?
TL;DR: Logical replication sends row-by-row changes, physical replication sends disk block changes. Logical replication is better for some tasks, physical replication for others.
Note that in PostgreSQL 12 (current at time of update) logical replication is stable and reliable, but quite limited. Use physical replication if you are asking this question.
Streaming replication can be logical replication. It's all a bit complicated.
WAL-shipping vs streaming
There are two main ways to send data from master to replica in PostgreSQL:
WAL-shipping or continuous archiving, where individual write-ahead log files are copied from pg_xlog (pg_wal in PostgreSQL 10+) by the archive_command running on the master to some other location. A restore_command configured in the replica's recovery.conf runs on the replica to fetch the archives so the replica can replay the WAL.
This is what's used for point-in-time recovery (PITR), which is used as a method of continuous backup.
No direct network connection is required to the master server. Replication can have long delays, especially without an archive_timeout set. WAL shipping cannot be used for synchronous replication.
streaming replication, where each change is sent to one or more replica servers directly over a TCP/IP connection as it happens. The replicas must have a direct network connection to the master, configured in their recovery.conf's primary_conninfo option.
Streaming replication has little or no delay so long as the replica is fast enough to keep up. It can be used for synchronous replication. You cannot use streaming replication for PITR, so it's not much use for continuous backup. If you drop a table on the master, oops, it's dropped on the replicas too.
Thus, the two methods have different purposes. However, both of them transport physical WAL archives from primary to replica; they differ only in the timing, and whether the WAL segments get archived somewhere else along the way.
You can and usually should combine the two methods, using streaming replication usually, but with archive_command enabled. Then on the replica, set a restore_command to allow the replica to fall back to restore from WAL archives if there are direct connectivity issues between primary and replica.
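If you run that combined setup, it is worth checking that archive_command is actually succeeding. Here is a small JDBC sketch (the connection details are placeholders, and it assumes the PostgreSQL JDBC driver is on the classpath) that reads the pg_stat_archiver view on the primary:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckArchiver {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; point it at the primary.
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://primary.example.com:5432/postgres", "postgres", "secret");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT archived_count, last_archived_wal, " +
                 "       failed_count, last_failed_wal " +
                 "FROM pg_stat_archiver")) {
            if (rs.next()) {
                // A growing failed_count means archive_command is failing and
                // WAL will pile up on the primary.
                System.out.printf("archived=%d (last %s), failed=%d (last %s)%n",
                        rs.getLong("archived_count"), rs.getString("last_archived_wal"),
                        rs.getLong("failed_count"), rs.getString("last_failed_wal"));
            }
        }
    }
}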
Asynchronous vs synchronous streaming
On top of that, there's synchronous and asynchronous streaming replication:
In asynchronous streaming replication the replica(s) are allowed to fall behind the master in time when the master is faster/busier. If the master crashes you might lose data that wasn't replicated yet.
If the asynchronous replica falls too far behind the master, the master might throw away WAL the replica still needs if wal_keep_segments (renamed wal_keep_size in PostgreSQL 13) is set too low and no replication slot is used, meaning you have to re-create the replica from scratch. Or the master's pg_wal (pg_xlog before PostgreSQL 10) might fill up and stop the master from working until disk space is freed, if wal_keep_segments is set too high or a replication slot retains WAL for a replica that has fallen far behind.
In synchronous replication the master doesn't finish committing until a replica has confirmed it received the transaction (how far the replica must have gotten, receive, flush, or apply, depends on the synchronous_commit setting). You never lose data if the master crashes and you have to fail over to a replica. The master will never throw away data the replica needs or fill up its WAL and run out of disk space because of replica delays. In exchange it can cause the master to slow down or even stop working if replicas have problems, and it always has some performance impact on the master due to network latency.
When there are multiple replicas, by default only one acts as the synchronous standby at a time, although newer versions let you require confirmation from more than one. See synchronous_standby_names.
You can't have synchronous log shipping.
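As an aside, even when synchronous replication is configured cluster-wide, PostgreSQL lets you relax the guarantee for individual transactions via synchronous_commit. A hedged JDBC sketch (the connection details and the audit_log table are made up) for low-value writes that shouldn't wait on the standby:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RelaxedCommit {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; assumes synchronous_standby_names is
        // set on the primary, so commits normally wait for the standby.
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://primary.example.com:5432/appdb", "app", "secret")) {
            c.setAutoCommit(false);
            try (Statement s = c.createStatement()) {
                // 'local' means: flush to the primary's WAL, but don't wait for
                // the synchronous standby. Applies to this transaction only.
                s.execute("SET LOCAL synchronous_commit = 'local'");
                s.executeUpdate("INSERT INTO audit_log(msg) VALUES ('low-value event')");
            }
            c.commit();
        }
    }
}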
You can actually combine log shipping and asynchronous replication to protect against having to recreate a replica if it falls too far behind, without risking affecting the master. This is an ideal configuration for many deployments, combined with monitoring how far the replica is behind the master to ensure it's within acceptable disaster recovery limits.
Logical vs physical
On top of that we have logical vs physical streaming replication, as introduced in PostgreSQL 9.4:
In physical streaming replication changes are sent at nearly disk block level, like "at offset 14 of disk page 18 of relation 12311, wrote tuple with hex value 0x2342beef1222....".
Physical replication sends everything: the contents of every database in the PostgreSQL install, all tables in every database. It sends index entries, it sends the whole new table data when you VACUUM FULL, it sends data for transactions that rolled back, etc. So it generates a lot of "noise" and sends a lot of excess data. It also requires the replica to be completely identical, so you cannot write anything on the replica, not even temporary or unlogged tables. Long-running queries on the replica can conflict with replay, so they either delay replication or get cancelled.
In exchange, it's simple and efficient to apply the changes on the replica, and the replica is reliably exactly the same as the master. DDL is replicated transparently, just like everything else, so it requires no special handling. It can also stream big transactions as they happen, so there is little delay between commit on the master and commit on the replica even for big changes.
Physical replication is mature, well tested, and widely adopted.
logical streaming replication, new in 9.4, sends changes at a higher level, and much more selectively.
It replicates only one database at a time. It sends only row changes and only for committed transactions, and it doesn't have to send vacuum data, index changes, etc. It can selectively send data only for some tables within a database. This makes logical replication much more bandwidth-efficient.
Operating at a higher level also means that you can do transactions on the replica databases. You can create temporary and unlogged tables. Even normal tables, if you want. You can use foreign data wrappers, views, create functions, whatever you like. There's no need to cancel queries if they run too long either.
Logical replication can also be used to build multi-master replication in PostgreSQL, which is not possible using physical replication.
In exchange, though, it can't (currently) stream big transactions as they happen. It has to wait until they commit. So there can be a long delay between a big transaction committing on the master and being applied to the replica.
It replays transactions strictly in commit order, so small fast transactions can get stuck behind a big transaction and be delayed quite a while.
DDL isn't handled automatically. You have to keep the table definitions in sync between master and replica yourself, or the application using logical replication has to have its own facilities to do this. It can be complicated to get this right.
The apply process itself is more complicated than "write some bytes where I'm told to" as well. It also takes more resources on the replica than physical replication does.
Logical replication is newer, less mature, and less widely adopted than physical replication, and it is not as easy to use.
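For reference, the built-in logical replication added in PostgreSQL 10 is driven by publications and subscriptions. The following is only a sketch with made-up names and connection strings; it assumes wal_level = logical on the publisher and that the subscriber already has a matching table definition (remember, DDL is not replicated):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LogicalReplicationSetup {
    public static void main(String[] args) throws Exception {
        // Publisher side (requires wal_level = logical in postgresql.conf).
        try (Connection pub = DriverManager.getConnection(
                "jdbc:postgresql://primary.example.com:5432/appdb", "postgres", "secret");
             Statement s = pub.createStatement()) {
            // Publish row changes for one table only (made-up table name).
            s.execute("CREATE PUBLICATION orders_pub FOR TABLE orders");
        }

        // Subscriber side: a separate database that must already contain a
        // matching 'orders' table definition.
        try (Connection sub = DriverManager.getConnection(
                "jdbc:postgresql://replica.example.com:5432/reportdb", "postgres", "secret");
             Statement s = sub.createStatement()) {
            s.execute("CREATE SUBSCRIPTION orders_sub " +
                      "CONNECTION 'host=primary.example.com dbname=appdb user=repl password=secret' " +
                      "PUBLICATION orders_pub");
        }
    }
}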
Too many options, tell me what to do
Phew. Complicated, huh? And I haven't even got into the details of delayed replication, slots, max_wal_size, timelines, how promotion works, Postgres-XL, BDR and multimaster, etc.
So what should you do?
There's no single right answer. Otherwise PostgreSQL would only support that one way. But there are a few common use cases:
For backup and disaster recovery use pgbarman to make base backups and retain WAL for you, providing easy to manage continuous backup. You should still take periodic pg_dump backups as extra insurance.
For high availability with zero data loss risk use streaming synchronous replication.
For high availability with low data loss risk and better performance you should use asynchronous streaming replication. Either have WAL archiving enabled for fallback or use a replication slot. Monitor how far the replica is behind the master using external tools like Icinga.
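As a starting point for that monitoring, here is a small JDBC sketch (connection details are placeholders) that you run against the replica to estimate replay lag:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReplicaLag {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; connect to the *replica*, not the master.
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://replica.example.com:5432/postgres", "monitor", "secret");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 // Approximate replay lag in seconds. Note: this number also
                 // grows if the master is simply idle and sends no new WAL.
                 "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0) AS lag_seconds, " +
                 "       pg_is_in_recovery() AS is_replica")) {
            if (rs.next()) {
                System.out.printf("is_replica=%b, replay lag ~ %.1f seconds%n",
                        rs.getBoolean("is_replica"), rs.getDouble("lag_seconds"));
            }
        }
    }
}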
References
continuous archiving and PITR
high availability, load balancing and replication
replication settings
recovery.conf
pgbarman
repmgr
wiki: replication, clustering and connection pooling
I would like to know how replication works in a distributed database. It would be nice if this could be explained in a thorough, yet easy to understand way.
It would also be nice if you could make a comparison between distributed transactions and distributed replication.
Single point of failure
The database server is a central part of an enterprise system, and, if it goes down, service availability might get compromised.
If the database server is running on a single server, then we have a single point of failure. Any hardware issue (e.g., disk drive failure) or software malfunction (e.g., driver problems, malfunctioning updates) will render the system unavailable.
Limited resources
If there is a single database server node, then vertical scaling is the only option when it comes to accommodating a higher traffic load. Vertical scaling, or scaling up, means buying more powerful hardware, which provides more resources (e.g., CPU, Memory, I/O) to serve the incoming client transactions.
Up to a certain hardware configuration, vertical scaling can be a viable and simple solution to scale a database system. The problem is that the price-performance ratio is not linear, so after a certain threshold, you get diminishing returns from vertical scaling.
Another problem with vertical scaling is that, in order to upgrade the server, the database service needs to be stopped. So, during the hardware upgrade, the application will not be available, which can impact underlying business operations.
Database Replication
To overcome the aforementioned issues associated with having a single database server node, we can set up multiple database server nodes. The more nodes, the more resources we will have to process incoming traffic.
Also, if a database server node is down, the system can still process requests as long as there are spare database nodes to connect to. For this reason, upgrading the hardware or software of a given database server node can be done without affecting the overall system availability.
The challenge of having multiple nodes is data consistency. If all nodes are in-sync at any given time, the system is Linearizable, which is the strongest guarantee when it comes to data consistency across multiple registers.
The process of synchronizing data across all database nodes is called replication, and there are multiple strategies that we can use.
Single-Primary Database Replication
The Single-Primary Replication scheme works as follows.
The primary node, also known as the Master node, is the one accepting writes while the replica nodes can only process read-only transactions. Having a single source of truth allows us to avoid data conflicts.
To keep the replicas in sync, the primary node must provide the list of changes made by all committed transactions.
Relational database systems have a Redo Log, which contains all data changes that were successfully committed.
PostgreSQL uses the WAL (Write-Ahead Log) records to ensure transaction Durability and for Streaming Replication.
Because the storage engine is separated from the MySQL server, MySQL uses a separate Binary Log for replication. The Redo Log is generated by the InnoDB storage engine, and its goal is to provide transaction Durability while the Binary Log is created by the MySQL Server, and it stores the logical logging records, as opposed to physical logging created by the Redo Log.
By applying the same changes recorded in the WAL or Binary Log entries, the replica node can stay in-sync with the primary node.
Horizontal scaling
The Single-Primary Replication provides horizontal scalability for read-only transactions. If the number of read-only transactions increases, we can create more replica nodes to accommodate the incoming traffic.
This is what horizontal scaling, or scaling out, is all about. Unlike vertical scaling, which requires buying more powerful hardware, horizontal scaling can be achieved using commodity hardware.
On the other hand, read-write transactions can only be scaled up (vertical scaling) as there is a single primary node.
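On the application side, single-primary replication typically shows up as read/write routing: read-write transactions go to the primary, read-only transactions can be spread across the replicas. A minimal JDBC sketch with made-up hostnames, credentials, and table name:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal read/write routing sketch; hostnames and credentials are placeholders.
public class ReadWriteRouter {
    private static final String PRIMARY_URL = "jdbc:postgresql://primary.example.com:5432/appdb";
    private static final String REPLICA_URL = "jdbc:postgresql://replica1.example.com:5432/appdb";

    // Read-write transactions must go to the single primary node.
    static Connection writeConnection() throws Exception {
        return DriverManager.getConnection(PRIMARY_URL, "app", "secret");
    }

    // Read-only transactions can be spread across replica nodes (scale out).
    static Connection readConnection() throws Exception {
        Connection c = DriverManager.getConnection(REPLICA_URL, "app", "secret");
        c.setReadOnly(true); // guard against accidental writes on the replica
        return c;
    }

    public static void main(String[] args) throws Exception {
        try (Connection write = writeConnection(); Statement s = write.createStatement()) {
            s.executeUpdate("INSERT INTO page_views(url) VALUES ('/home')"); // hypothetical table
        }
        try (Connection read = readConnection(); Statement s = read.createStatement();
             ResultSet rs = s.executeQuery("SELECT count(*) FROM page_views")) {
            if (rs.next()) {
                // With asynchronous replication this count may lag the primary.
                System.out.println("views: " + rs.getLong(1));
            }
        }
    }
}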
I would recommend initially spending time reviewing the MySQL Docs on Replication. It's a good example of database replication. They are here:
http://dev.mysql.com/doc/refman/5.5/en/replication.html
Covering the entire scope of your question seems like too much for one question.
If you have some specific questions, please feel free to post them. Thanks!
Clustrix is a distributed database with a shared-nothing architecture that supports both distributed transactions and replication. There is technical documentation available that describes data distribution, the distributed evaluation model, and built-in fault tolerance, as well as an overview of the architecture.
As a MySQL replacement, Clustrix implements MySQL's replication policy and produces binlogs in the MySQL format, which are serialized so that Clustrix can act as either a Master or Slave to MySQL.
We are telling our client to put the SQL Server database file (mdf) on a different physical drive than the transaction log file (ldf). The tech company (hired by our client) wanted to put the transaction log on a slower (cheaper) drive than the database drive, because with transaction logs you are just sequentially writing to the log file.
I told them that I thought the transaction log drive (actually a RAID configuration) needed to be fast as well, because every data-changing call to the database needs to be saved there, as well as to the database itself.
After saying that though, I realized I was not entirely sure about that. Does the speed of the transaction log drive make a significant difference in performance... if the drive with the database is fast?
The speed of the log drive is the most critical factor for a write-intensive database. No update can complete faster than the log can be written, so the drive must sustain your maximum update rate at its spikes. And all updates generate log. Database file (MDF/NDF) updates can afford slower write rates because of two factors:
data updates are written out lazily and flushed on checkpoint. This means that an update spike can be amortized over the average drive throughput
multiple updates can accumulate on a single page and thus need only a single write
So you are right that the log throughput is critical.
But at the same time, log writes have a specific pattern: they are sequential, since the log is always appended at the end. All mechanical drives have much higher throughput, for both reads and writes, for sequential operations, since they involve less physical movement of the disk heads. So what your ops guys say is also true: a slower drive can in fact offer sufficient throughput.
But all these come with some big warnings:
the slower drive (or RAID combination) must truly offer high sequential throughput
the drive must see log writes from one and only one database, and nothing else. Any other operation that could interfere with the current disk head position will damage your write throughput and result in slower database performance
the log must be write-only, never read. Keep in mind that certain components need to read from the log, and they will move the disk heads to other positions to read back the previously written log:
transactional replication
database mirroring
log backup
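One way to check whether the log drive is actually the bottleneck is to look at the average write latency on the log files. A hedged JDBC sketch using the sys.dm_io_virtual_file_stats DMV (server name and credentials are placeholders; it assumes the Microsoft JDBC driver is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LogWriteLatency {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; adjust host, credentials, and encryption.
        try (Connection c = DriverManager.getConnection(
                "jdbc:sqlserver://dbserver.example.com:1433;databaseName=master;encrypt=false",
                "sa", "secret");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT DB_NAME(vfs.database_id) AS db, mf.physical_name, " +
                 "       vfs.num_of_writes, " +
                 "       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms " +
                 "FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs " +
                 "JOIN sys.master_files mf " +
                 "  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id " +
                 "WHERE mf.type_desc = 'LOG'")) {
            while (rs.next()) {
                // A sustained average write latency of more than a few ms on the
                // log file suggests the log drive is holding commits back.
                System.out.printf("%s %s: %d writes, avg %s ms/write%n",
                        rs.getString("db"), rs.getString("physical_name"),
                        rs.getLong("num_of_writes"), rs.getString("avg_write_ms"));
            }
        }
    }
}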
In simplistic terms, if you are talking about an OLTP database, your throughput is determined by the speed of your writes to the transaction log. Once this performance ceiling is hit, all other dependent actions must wait for the commit to the log to complete.
This is a VERY simplistic take on the internals of the Transaction Log, to which entire books are dedicated, but the rudimentary point remains.
Now if the storage system you are working with can provide the IOPS that you require to support both your Transaction Log and Database data files together then a shared drive/LUN would provide adequately for your needs.
To provide you with a specific recommended course of action I would need to know more about your database workload and the performance you require your database server to deliver.
Get your hands on the title SQL Server 2008 Internals to get a thorough look into the internals of the SQL Server transaction log, it's one of the best SQL Server titles out there and it will pay for itself in minutes from the value you gain from reading.
Well, the transaction log is the main structure that provides ACID; it can be a big bottleneck for performance, and if you do backups regularly its required space has an upper limit. So I would put it on a safe, fast drive with just enough space plus a bit of margin.
The transaction log should be on the fastest drives: if the write to the log can complete, the rest of the transaction can be done in memory and hit disk later.
I am tasked with setting up disaster recovery for one of our systems. The primary server is in FL and the secondary is in Germany. The application is a global application within my company.
I am not sure if I should use log shipping or mirroring. What I have read is that mirroring will have an adverse effect on the performance of my application. Is this true? Does it mean that any time a user modifies or saves a record, it will take longer to get a positive response?
Thanks
Mirroring can have different performance impacts depending on the operating mode you choose. If you are mirroring, you can have three operating modes: High Safety with automatic failover (High Availability), High Safety without automatic failover (High Protection), and High Performance.
Basically, these amount to synchronous and asynchronous mirroring. With High Protection your application will be waiting for the mirroring to finish before considering the transaction complete. In High Performance mode your application will not wait for the mirroring to have been committed. In fact, it is not guaranteed at any point in time that all the most recent transactions will have been saved in the mirror's transaction log.
One of the main factors to consider with mirroring will be the round trip time of your network. Higher latency will impact more heavily on your performance. You will need to weigh the performance cost against your specific recovery (and failover) requirements.
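To see which mode a mirrored database is actually running in, and whether the mirror is caught up, you can query sys.database_mirroring. A small JDBC sketch with placeholder connection details (it assumes the Microsoft JDBC driver is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MirroringStatus {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; run this against the principal server.
        try (Connection c = DriverManager.getConnection(
                "jdbc:sqlserver://principal.example.com:1433;databaseName=master;encrypt=false",
                "sa", "secret");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT DB_NAME(database_id) AS db, mirroring_role_desc, " +
                 "       mirroring_state_desc, mirroring_safety_level_desc " +
                 "FROM sys.database_mirroring " +
                 "WHERE mirroring_guid IS NOT NULL")) {
            while (rs.next()) {
                // Safety level FULL = synchronous (high safety), OFF = asynchronous
                // (high performance); state SYNCHRONIZED means the mirror is caught up.
                System.out.printf("%s: role=%s state=%s safety=%s%n",
                        rs.getString("db"), rs.getString("mirroring_role_desc"),
                        rs.getString("mirroring_state_desc"),
                        rs.getString("mirroring_safety_level_desc"));
            }
        }
    }
}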
If you haven't already, you should read Database Mirroring in SQL Server 2005 and Database Mirroring Best Practices and Performance Considerations.
Mirroring (in synchronous mode) would keep both the primary and DR environments in sync and thus eliminate the possibility of data loss. However, as you noted, this has an adverse effect on performance, but it may be necessary in situations that cannot tolerate any data loss (e.g. financial applications). Shipping logs and applying them to the standby database at the DR site doesn't have the same impact on user response time, but opens up a small window during which data loss could potentially occur.
Mirroring (in high-safety mode) operates synchronously: the commit waits until the log is hardened on the mirror. It is usually deployed over a good network connection (LAN).
Log shipping operates asynchronously: it does not wait for the log to be restored on the secondary. It is usually deployed over MPLS/VPN or a slower network.
So for your objective, you should use log shipping.