Does the ext3 file system have undo logging? If not, why not? - filesystems

I learned from online articles that disk-based database systems have both undo and redo logging, because the "steal, no-force" buffer management policy demands it. The "steal" policy means the database system may write uncommitted data to disk, which requires undo logging to roll those changes back; the "no-force" policy means the database system does not have to flush in-memory data to disk at commit time, because recovery can replay the redo log for any committed transaction.
Does the ext3 journaling file system do something similar for atomicity and durability? I learned from several online articles that the ext3 file system only has redo logs. If that's true, why does ext3 not have undo logs?
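(For intuition, here is a minimal redo-only journal sketched in Python. It is not how ext3/JBD is actually implemented, and the file names and record format are invented, but it shows why a journal that only writes committed changes to their final location never needs an undo log: recovery simply replays committed records and ignores incomplete ones.)

    import json, os

    # Toy redo-only journal: changes are appended to the journal and fsync'd
    # before they are ever written to their final location, so an interrupted
    # transaction leaves nothing on disk that would have to be undone.

    JOURNAL = "journal.log"   # invented file names
    DATA = "data.blocks"

    def apply_updates(updates):
        """Write blocks in place. updates: dict of offset -> hex string."""
        mode = "r+b" if os.path.exists(DATA) else "w+b"
        with open(DATA, mode) as d:
            for offset, data in updates.items():
                d.seek(offset)
                d.write(bytes.fromhex(data))
            d.flush()
            os.fsync(d.fileno())

    def commit(updates):
        # 1. Log the changes plus a commit marker, and flush the journal.
        with open(JOURNAL, "a") as j:
            for offset, data in updates.items():
                j.write(json.dumps({"op": "write", "offset": offset, "data": data}) + "\n")
            j.write(json.dumps({"op": "commit"}) + "\n")
            j.flush()
            os.fsync(j.fileno())      # durable once the commit record is on disk
        # 2. Only now write the blocks in place (checkpointing).
        apply_updates(updates)

    def recover():
        """Replay fully committed transactions; incomplete ones are ignored."""
        if not os.path.exists(JOURNAL):
            return
        pending = {}
        with open(JOURNAL) as j:
            for line in j:
                rec = json.loads(line)
                if rec["op"] == "write":
                    pending[rec["offset"]] = rec["data"]
                else:                 # commit marker
                    apply_updates(pending)
                    pending = {}

    # Example: commit({4096: b"hello".hex()}); running recover() after a crash
    # ignores any transaction whose commit marker never reached the journal.

Undo logging is only needed when uncommitted changes are allowed to overwrite the on-disk copy (the "steal" policy), which is exactly what this scheme avoids.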

Related

Doesn't batch writing WAL files in databases negate the purpose of WAL files?

I am reading about databases and I can't understand one thing about WAL files. They exist to make sure transactions are reliable and recoverable; however, to improve performance, WAL files are apparently written in batches instead of immediately. This seems contradictory to me and appears to negate the purpose of WAL files. What happens if there's a crash between WAL commits? How does this differ from not having the WAL at all and simply fsync'ing the database itself periodically?
I don't have much of an idea here and just looked for information about this, as it seems interesting to me.
If someone finds my explanation incorrect, please correct me. What I understand at this point is that the WAL records are written before the commit; once it is confirmed that the transaction's data is in the WAL, the transaction is confirmed.
What is done in batches is moving this WAL data into the heap and indexes, the real tables.
Write-Ahead Logging (WAL) is a standard method for ensuring data integrity. A detailed description can be found in most (if not all) books about transaction processing. Briefly, WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage. If we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using the log: any changes that have not been applied to the data pages can be redone from the log records. (This is roll-forward recovery, also known as REDO.)
https://www.postgresql.org/docs/current/wal-intro.html
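(One way to resolve the apparent contradiction is group commit: the WAL records of several concurrent transactions are flushed with a single fsync, but no transaction is acknowledged as committed until the flush covering its records has completed. A rough Python sketch, with the class and all names invented for illustration:)

    import os, threading

    # Toy group commit: transactions append their log records under a lock,
    # and one fsync covers every record appended so far, but commit() only
    # returns once the caller's own records are known to be on disk.

    class GroupCommitLog:
        def __init__(self, path="wal.log"):
            self.f = open(path, "ab")
            self.lock = threading.Lock()
            self.appended = 0   # bytes appended to the log
            self.durable = 0    # bytes known to be flushed to disk

        def commit(self, records: bytes) -> None:
            """Returns only after this transaction's records are durable."""
            with self.lock:
                self.f.write(records)
                self.appended += len(records)
                my_pos = self.appended
            with self.lock:
                if self.durable < my_pos:
                    target = self.appended        # flush the whole group
                    self.f.flush()
                    os.fsync(self.f.fileno())
                    self.durable = max(self.durable, target)
                # else: another transaction's fsync already covered us
            # Only now may the engine report "committed" to the client.

So a crash between flushes can only lose transactions that were never acknowledged; anything the client was told had committed is already in the WAL on disk.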

Are there any databases that are really durable?

I was reading http://www.h2database.com/html/advanced.html#durability_problems and I found
Some databases claim they can guarantee durability, but such claims are wrong. A durability test was run against H2, HSQLDB, PostgreSQL, and Derby. All of those databases sometimes lose committed transactions. The test is included in the H2 download, see org.h2.test.poweroff.Test
Also it says
Where losing transactions is not acceptable, a laptop or UPS (uninterruptible power supply) should be used.
So, is there any database that is durable? The document talks about the fsync() command and says that most hard drives do not obey fsync(). It also says there is no reliable way to flush hard drive buffers.
So, is there a time after which a committed transaction becomes durable, so that we could buy a UPS that provides at least that much backup power?
Also, is there a way to know that a committed transaction is durable? Suppose we don't buy a UPS; if we could tell that a transaction is durable, we could show a success message only then.
The problem depends on whether or not you can instruct the HDD/SSD to commit transactions to durable media. If the mass storage device does not have the facility to flush to durable media, then no data storage system on top of it can be said to be truly durable.
There are plenty of NAS devices with a built-in UPS, however, and these seem to fit the requirement for durable media: if the database on a separate server commits data to that device and performs a checkpoint, then the commits are flushed to the media. As long as the media survives a power outage, you can say it's durable. The UPS on the NAS should be capable of issuing a controlled shutdown to its associated disk pack, guaranteeing permanence.
Alternatively, you could use something like SQL Azure, which writes commits to multiple (3) separate database storage instances on different servers. Although we have no idea whether those writes ever reach permanent storage media, it doesn't actually matter: the measure of durability is read-repeatability, and this seems to meet that requirement.
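(As a concrete illustration of the fsync caveat discussed in this question, this is roughly what a "durable write" looks like from the application side. The function name is invented; whether the data really reaches stable media still depends on the OS and the drive honoring the flush, which is exactly the point the H2 documentation makes.)

    import os

    def durable_append(path: str, payload: bytes) -> None:
        """Append payload and ask the storage stack to put it on stable media."""
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, payload)
            os.fsync(fd)          # ask the OS to flush file data to the device
            # On macOS, fsync may stop at the drive's write cache; F_FULLFSYNC
            # asks the drive itself to flush (only available there).
            try:
                import fcntl
                fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
            except (ImportError, AttributeError, OSError):
                pass              # not available on this platform
        finally:
            os.close(fd)

A success message should only be shown to the user after such a flush has returned, and even then only if the drive and OS are trusted to honor it.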

Possible to checkpoint a WAL file during a transaction?

We are performing quite large transactions on a SQLite database, which causes the WAL file to grow extremely large (sometimes up to 1 GB for large transactions). Is there a way to checkpoint the WAL file while in the middle of a transaction? When I try calling sqlite3_wal_checkpoint() or executing the WAL checkpoint PRAGMA statement, both return SQLITE_BUSY.
Not really. This is the whole point of transactions: the WAL (or journal file) holds data that becomes official only once it is successfully committed. Until that happens, if anything goes wrong (a program crash, computer reboot, etc.), the WAL or journal file allows the uncommitted work to be safely rolled back (undone). Moving only part of this uncommitted transaction would defeat the purpose.
Note that the SQLite documentation defines checkpointing as moving transactions from the WAL file back into the database. In other words, checkpointing moves one or more complete transactions from the WAL, but never part of a huge uncommitted transaction.
There are a few possible solutions to your problem (a code sketch follows the list):
Avoid huge transactions - commit in smaller chunks if you can. Of course, this is not always possible, depending on your application.
Use the old journaling mode with PRAGMA journal_mode=DELETE. It is slightly slower than the new WAL mode (PRAGMA journal_mode=WAL), but in my experience it tends to create much smaller journal files, and they get deleted when the transaction successfully commits. For example, Android 4.x still uses the old journaling mode; it tends to work faster on flash storage and does not create huge temporary or journal files.
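(A sketch of the first suggestion using Python's sqlite3 module: committing a bulk load in chunks so that a passive checkpoint can run between transactions and keep the WAL from growing without bound. The database file, table, and batch size are invented for illustration.)

    import sqlite3

    conn = sqlite3.connect("bulk_load.db")
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT)")

    BATCH = 10_000
    batch = []
    for i in range(100_000):
        batch.append(("item-%d" % i,))
        if len(batch) == BATCH:
            conn.executemany("INSERT INTO items (name) VALUES (?)", batch)
            conn.commit()                          # transaction boundary
            # With no transaction open, the checkpoint can actually make progress.
            busy, log_frames, ckpt_frames = conn.execute(
                "PRAGMA wal_checkpoint(PASSIVE)").fetchone()
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO items (name) VALUES (?)", batch)
        conn.commit()
    conn.close()

The second suggestion amounts to replacing the journal_mode pragma above with PRAGMA journal_mode=DELETE, after which there is no WAL file at all, only a rollback journal that is removed when each transaction commits.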

Log-Shipping: Why would you choose No Recovery mode?

When configuring log shipping for SQL Server, you can choose for the secondary database to be in No Recovery mode or Standby mode. No Recovery means you have no access to the database while log shipping is going on. Standby gives you read-only access and, if you select the option to disconnect users whenever a restore is about to happen, would appear not to interfere with the log shipping process. This looks to me like an extra benefit of Standby mode, but as far as I can see the documentation mentions no adverse effects.
I'm therefore wondering why anyone would choose to use No Recovery mode. The only plausible reasons I can think of are if Standby mode caused a significant performance degradation (but there's no mention of anything like that in the docs), or if there is some security requirement to actively prevent anyone from seeing the contents of the secondary database (which would seem rare/unlikely).
Can anyone enlighten me what the advantage of choosing No Recovery mode is supposed to be?
When you use NORECOVERY mode, no access will be given to the target database, so the database does not have to care about uncommitted transactions. The log can just be restored "as is" and left in that state.
When you use STANDBY mode, the database restores as NORECOVERY, then analyzes and rolls back all uncommitted transactions in the log. It can then give read only access to users. When the next log is restored, the database disconnects all users and rolls the uncommitted transactions from the last log forward again before restoring.
As you can see, STANDBY has potentially large extra overhead at restore, depending on your transaction volume.
More details in this article at My World of SQL.
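(For reference, this is roughly what the two restore options look like when driven from a script, sketched here in Python with pyodbc; the DSN, database name, backup paths, and undo-file path are all invented. RESTORE ... WITH NORECOVERY just applies the log, while RESTORE ... WITH STANDBY = '<undo file>' additionally performs the undo/redo work described above.)

    import pyodbc

    # Assumption: an ODBC DSN named "secondary" points at the secondary server
    # and the login is allowed to restore. RESTORE cannot run inside a
    # transaction, so the connection uses autocommit.
    conn = pyodbc.connect("DSN=secondary;Trusted_Connection=yes", autocommit=True)
    cur = conn.cursor()

    # No Recovery: the log is laid down as-is, the database stays inaccessible,
    # and no work is spent on uncommitted transactions.
    cur.execute(
        "RESTORE LOG MyDatabase "
        "FROM DISK = N'\\\\backupshare\\logs\\MyDatabase_1.trn' "
        "WITH NORECOVERY"
    )

    # Standby: after applying the log, uncommitted transactions are rolled back
    # into the undo file so the database becomes readable; they are rolled
    # forward again before the next restore.
    cur.execute(
        "RESTORE LOG MyDatabase "
        "FROM DISK = N'\\\\backupshare\\logs\\MyDatabase_2.trn' "
        "WITH STANDBY = N'D:\\standby\\MyDatabase_undo.tuf'"
    )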

Database durability vs performance

I have studied a lot about how durability is achieved in databases, and if I understand it well, it works like this (simplified):
Client's point of view:
start transaction.
insert into table values...
commit transaction
DB engine point of view:
write transaction start indicator to log file
write changes done by client to log file
write transaction commit indicator to log file
flush log file to HDD (this ensures durability of data)
return 'OK' to client
What I observed:
The client application is single-threaded (one DB connection). I'm able to perform 400 transactions/sec, while a simple test that writes something to a file and then fsyncs that file to the HDD achieves only 150 syncs/sec. If the client were multithreaded or used multiple connections, I could imagine the DB engine grouping transactions and doing one fsync per several transactions, but this is not the case.
My question is whether, for example, MsSQL really synchronizes the log file (fsync, FlushFileBuffers, etc.) on every transaction commit, or whether there is some other kind of magic behind it.
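(For comparison, a sketch of the kind of test described above, in Python: append a small record and fsync after every simulated commit, then count how many complete per second. If the engine really issued one full flush per commit, a single connection could not sustain much more than this rate. The file name and record size are invented.)

    import os, time

    def fsync_rate(path="commit_test.log", seconds=5.0, record=b"x" * 128):
        """Measure how many write+fsync cycles this disk completes per second."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        count = 0
        start = time.monotonic()
        try:
            while time.monotonic() - start < seconds:
                os.write(fd, record)
                os.fsync(fd)       # one flush per simulated commit
                count += 1
        finally:
            os.close(fd)
        return count / seconds

    print("synchronous commits/sec:", round(fsync_rate()))

Comparing this number with the observed transaction rate is what suggests the engine is not doing a full flush of the log for every single commit.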
The short answer is that, for a transaction to be durable, the log file has to be written to stable storage before changes to the database are written to disk.
Stable storage is more complicated than you might think. Disks, for example, are not usually considered to be stable storage. (Not by people who write code for transactional database engines, anyway.)
To see how a particular open-source DBMS writes to stable storage, you'll need to read the source code. The PostgreSQL source code is online (the file is xlog.c). I don't know about the MySQL source.
