Which process writes the data file in PostgreSQL?
And what are the data files in PostgreSQL?
Note: I am performing INSERT/UPDATE/DELETE operations on PostgreSQL 9.5. I want to verify which process commits the changes to disk, i.e. writes the data files, and how the WAL and the data files are used.
The data files of a PostgreSQL database cluster are located under the base subdirectory of the data directory. They are written by three processes:
The background writer process that writes dirty blocks from the buffer back to disk to ensure that there are enough clean blocks.
The checkpointer process that writes all dirty blocks to disk at certain times (checkpoints) to provide a starting point for crash recovery.
The backend process (the process that serves a client connection) only writes data to disk if the background writer cannot keep up and there are not enough free blocks available.
The write-ahead log or WAL, located in pg_xlog, is something entirely different. It is written by the backend process and flushed to disk at COMMIT to ensure that the information necessary to recover the transaction in the case of a crash is safely on disk. The same holds for the commit log, located in pg_clog, which records whether a transaction was committed or rolled back.
Data may be written to the data files before COMMIT, but they only become visible once the transaction is committed.
It may be worth mentioning that not only DML statements cause data blocks to be dirtied:
The autovacuum background process regularly scans tables and indexes and removes entries that are no longer needed (dead row versions).
The first process to read newly written data will look up the commit information in the commit log and write a hint bit to the tuple so that future readers don't have to do that work again.
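If you want to see which of these processes actually wrote the blocks on a running cluster, the statistics view pg_stat_bgwriter breaks buffer writes down by writer. A minimal libpq sketch (the connection string is an assumption, adjust it for your setup):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int main(void)
{
    /* assumed connection string; adjust dbname/user/host for your cluster */
    PGconn *conn = PQconnectdb("dbname=postgres");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    /* buffers_checkpoint: blocks written by the checkpointer
     * buffers_clean:      blocks written by the background writer
     * buffers_backend:    blocks written by the backends themselves */
    PGresult *res = PQexec(conn,
        "SELECT buffers_checkpoint, buffers_clean, buffers_backend "
        "FROM pg_stat_bgwriter");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("checkpointer: %s, bgwriter: %s, backends: %s\n",
               PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1),
               PQgetvalue(res, 0, 2));

    PQclear(res);
    PQfinish(conn);
    return EXIT_SUCCESS;
}

Run it before and after your INSERT/UPDATE/DELETE (and an explicit CHECKPOINT, if you like); the difference in the counters shows which process wrote the blocks.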
I'm trying to implement an atomic version of copy on write. Under certain conditions, I make a copy of the original file before writing.
I implemented something like this pseudocode:
/* write operations */
if (some_condition) {
    /* create a temp file and write the new contents to it */
    rename(srcfile, copied_version);   /* keep the old contents around */
    rename(tmpfile, srcfile);          /* atomically replace srcfile */
}
The problem with this logic: hard links.
I want to transfer the existing hard links from the copied version to the new srcfile.
You can't.
Hard links are one-directional pointers: a directory entry points to the inode, but the inode does not know which directory entries point to it. So you can't modify or remove other hard links that you don't explicitly know about. All you can do is write to the same file data, and that's not atomic.
The same rule applies to both hard links and open file descriptors: you cannot modify the content reached through an unknown hard link without also modifying the content seen by another process that has the same file open.
That effectively prevents you from atomically modifying the file an unknown hard link points to.
If you have control over every process which might modify or access these files (if they are only modified by programs you've written), then you might be able to use flock() to signal to other processes that the file is in use. This won't work if the file is stored on an NFS remote file system, but should generally work otherwise.
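A rough sketch of that flock() cooperation, assuming every writer goes through a helper like this (error handling is simplified):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive advisory lock before touching the file.
 * This only helps if every cooperating process does the same. */
int update_file_locked(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    if (flock(fd, LOCK_EX) < 0) {       /* blocks until no one else holds it */
        close(fd);
        return -1;
    }

    /* ... read, modify and write the file contents here ... */

    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}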
In some cases, file leases can be a solution to the underlying issue – ensuring atomic content updates – but only if each reader and writer opens and closes the file for each snapshot.
Because the traditional copy–update–rename-over sequence has the same limitation (readers have to reopen the file to see the new contents), the file lease solution might also work for the OP.
For details, see man 2 fcntl Leases and Managing signals sections. The process must either have the same owner as the file, or have the CAP_LEASE capability (usually granted to the process via filesystem capabilities). Superuser processes (running as root) have the capability by default.
The idea is that when the process wishes to make "atomic" changes to the file, it acquires a write lease on the file. This only succeeds if no other process has the file open. If another process tries to open the file, the lease holder receives a signal, and has up to lease-break-time (about a minute, typically) to downgrade the lease (or simply close the file); during that time, the opener will block.
Note that there is no way to divert the opener. The situation is that the opener already has a handle to the underlying inode (so access checks and filename resolution have already occurred); it is just that the kernel won't return it to the userspace process before the lease is released or broken.
Your lease owner can, however, create a copy of the current contents to a temporary file, acquiring a write lease on that as well, and then rename it over the target file name. This way, each (set of) opener(s) obtain a handle to the file contents as they were at the time of the opening; if they do any modifications, they will be "private", and not reflected on the original file. Since the underlying inode is no longer referred to by any filename, when they (the last process having it open) close it, the inode is deleted and the storage released back to the file system. The Linux page cache also caches such accesses very well, so in many cases the "temporary copy file" never even hits actual storage media (unless there is memory pressure, i.e. memory needed for non-pagecache purposes).
A pure "atomic modification" does not require any kind of copies or renames, only holding the lease for the duration of the set of writes that must appear atomic for the readers.
Note that taking a write lease will not succeed until no other process has the file open any longer (the fcntl() call fails with EAGAIN, so the writer typically retries), so the time at which such a lease-based atomic update can occur is restricted, and not guaranteed to be always available. (For example, you may have a lazy process that just keeps the file open and occasionally polls it. If you have such processes, this lease-based approach won't work – but neither would the copy–rename-over approach.)
Also, leases work only on local files.
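Here is a rough sketch of the lease handshake described above, assuming a single lease holder that owns the file; the copy-to-temporary-and-rename part is left out, and SIGIO is the default lease-break signal:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t lease_broken = 0;

/* Lease-break handler: somebody tried to open the file; we have up to
 * lease-break-time to finish our writes and release the lease. */
static void on_lease_break(int sig)
{
    (void)sig;
    lease_broken = 1;
}

int atomic_update(const char *path)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_lease_break;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGIO, &sa, NULL);            /* SIGIO is the default break signal */

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Take the write lease; this fails with EAGAIN while any other
     * process still has the file open, so real code would retry. */
    if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0) {
        close(fd);
        return -1;
    }

    /* ... perform the set of writes that must appear atomic to readers,
     *     watching lease_broken to know when to hurry up ... */

    fcntl(fd, F_SETLEASE, F_UNLCK);         /* release (or downgrade) the lease */
    close(fd);
    return 0;
}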
If you need record-based atomicity, just use fcntl-based record locks, and have all readers take a read-lock for the region they want to access atomically, and all writers take a write-lock for the region to be updated, as record-locks are advisory (i.e., do not block reads or writes, only other record locks).
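For the record-lock variant, a write lock on one record might look like this sketch (readers would take F_RDLCK on the same range; the record size and offsets are up to you):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Lock one record for writing, update it, unlock.
 * F_SETLKW blocks until all conflicting record locks are released. */
int update_record(int fd, off_t offset, off_t reclen)
{
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = offset,
        .l_len    = reclen,
    };

    if (fcntl(fd, F_SETLKW, &fl) < 0)       /* wait for the write lock */
        return -1;

    /* ... pwrite() the updated record at 'offset' here ... */

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);                /* release the lock */
    return 0;
}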
The WAL (Write-Ahead Log) technology has been used in many systems.
The mechanism of a WAL is that when a client writes data, the system does two things:
Write a log record to disk and return success to the client
Write the data itself to disk asynchronously (it may sit in a cache or in memory until then)
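To make those two steps concrete, a minimal sketch of the write path (the record format, cache layout and function name are all assumptions):

#include <string.h>
#include <unistd.h>

/* 1. append the log record and fsync it, then acknowledge the client;
 * 2. update the in-memory copy; the data file itself is written later,
 *    asynchronously (for example at a checkpoint). */
int wal_put(int wal_fd, char *cache, size_t off, const void *val, size_t len)
{
    if (write(wal_fd, val, len) != (ssize_t)len)    /* sequential log append */
        return -1;
    if (fsync(wal_fd) < 0)                          /* make it durable */
        return -1;

    memcpy(cache + off, val, len);                  /* cheap in-memory update */
    return 0;                                       /* safe to answer the client */
}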
There are two benefits:
If some exception occurs (e.g. power loss) we can recover the data from the log.
The performance is good because we write data asynchronously and can batch operations
Why not just write the data to disk directly? Make every write go straight to disk: on success you tell the client success, and if the write failed you return a failure response or time out.
In this way, you still have those two benefits.
You do not need to recover anything after a power failure, because every success response returned to the client means the data is really on disk.
Performance should be the same: although we touch the disk frequently, so does the WAL (every successful WAL write also means a successful write to disk).
So what is the advantage of using a WAL?
Performance.
Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again. These writes do not need to be performed, with only the log writes performed for possible recovery.
Log writes can be batched into larger, sequential writes. For busy workloads, delaying a log write and then performing a single write can significantly improve throughput.
This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue: the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are not so important, but avoiding some writes, and using large sequential writes, still helps.
Update:
SSDs also perform better with large sequential writes, but for different reasons. It is not as simple as saying "no seek time or rotational latency, therefore just write randomly". For example, writing large blocks into space the SSD knows is "free" (e.g. via the TRIM command to the drive) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially map updates into different internal block sizes.
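As a toy illustration of batching log writes (buffer size and names are invented), several records can be coalesced into one sequential write and a single fsync:

#include <string.h>
#include <unistd.h>

#define LOG_BUF_SIZE 65536

static char   log_buf[LOG_BUF_SIZE];    /* records waiting to be flushed */
static size_t log_used;

/* Accumulate records in memory; flush them with a single sequential
 * write() and one fsync() when the buffer fills up. */
int log_append(int log_fd, const void *rec, size_t len)
{
    if (len > sizeof log_buf)
        return -1;                      /* oversized records not handled here */

    if (log_used + len > sizeof log_buf) {
        if (write(log_fd, log_buf, log_used) != (ssize_t)log_used)
            return -1;
        if (fsync(log_fd) < 0)
            return -1;
        log_used = 0;
    }

    memcpy(log_buf + log_used, rec, len);
    log_used += len;
    return 0;   /* a caller that needs durability right now would force a flush */
}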
As you note a key contribution of a WAL is durability. After a mutation has been committed to the WAL you can return to the caller, because even if the system crashes the mutation is never lost.
If you write the update directly to disk, there are two options:
write all records to the end of some file
the files are somehow structured
If you go with 1) it is needless to say that the cost of read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM, which uses files that are internally sorted by key. For that reason, "directly writing to disk" means that you possibly have to rewrite every record that comes after the current key. That's too expensive, so instead you
write to the WAL for persistence
update the memtables (in RAM)
Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because that's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you will rewrite your file-structures to the updated state as a result of many writes, which makes each write substantially cheaper.
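As a toy sketch of that write path (integer keys, an array standing in for the balanced tree, and the output file name are all assumptions), writes only touch memory, and a full memtable is flushed as a single sorted, sequential run:

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define MEMTABLE_CAP 1024

static int    memtable[MEMTABLE_CAP];   /* the in-memory write buffer */
static size_t memtable_n;

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Writes are cheap: they never touch the sorted files on disk. */
int memtable_put(int key)
{
    if (memtable_n == MEMTABLE_CAP)
        return -1;                      /* caller flushes when the memtable is full */
    memtable[memtable_n++] = key;
    return 0;
}

/* Flush: sort once, then emit one sequential, sorted run to disk. */
int memtable_flush(void)
{
    qsort(memtable, memtable_n, sizeof memtable[0], cmp_int);

    int fd = open("sstable-0001", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t len = (ssize_t)(memtable_n * sizeof memtable[0]);
    int ok = write(fd, memtable, (size_t)len) == len && fsync(fd) == 0;
    close(fd);
    memtable_n = 0;
    return ok ? 0 : -1;
}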
I have a guess.
Making every write go directly to disk does indeed avoid recovery after a power failure. But the performance question needs to be discussed for two situations.
Situation 1:
All your storage devices are spinning disks. The WAL approach will have better performance, because writing the WAL is a sequential write, while writing the data to its final location is a random write, and random writes are far slower than sequential writes on spinning disks.
Situation 2:
All your devices are SSDs. Then the performance difference may not be that large, because sequential and random writes have almost the same performance on an SSD.
In C on an embedded system (where memory is an issue), I combine multiple inserts into larger transactions to optimize performance.
Intuitively, SQLite must keep the uncommitted changes in a cache somewhere in the limited memory.
Is it possible to have too many inserts between two calls of 'BEGIN TRANSACTION' and 'END TRANSACTION'? Can the cache overflow?
Or does sqlite3 take care of it and commit the transaction itself before an overflow happens?
If the cache can overflow, what is the best strategy for calling BEGIN/END?
Any changes you make are written to the database file. To support rollbacks, the old contents of the changed database pages are saved in the journal file.
When you commit a transaction, the journal file is just deleted; when you roll back a transaction, those pages are written back.
So there is no limit on the size of the data in a transaction, as long as you have enough disk space.
(The cache can help with avoiding some writes, but it works transparently and does not affect the semantics of your code.)
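If you batch purely for performance, a typical pattern with the C API looks roughly like this (the table name and batch size are arbitrary):

#include <sqlite3.h>
#include <stdio.h>

/* Insert 'count' rows, 'batch' rows per transaction.
 * Assumes a table: CREATE TABLE samples(value INTEGER); */
int insert_batched(sqlite3 *db, int count, int batch)
{
    sqlite3_stmt *stmt;
    if (sqlite3_prepare_v2(db, "INSERT INTO samples(value) VALUES (?1)",
                           -1, &stmt, NULL) != SQLITE_OK)
        return -1;

    for (int i = 0; i < count; i++) {
        if (i % batch == 0)
            sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);

        sqlite3_bind_int(stmt, 1, i);
        if (sqlite3_step(stmt) != SQLITE_DONE)
            fprintf(stderr, "insert failed: %s\n", sqlite3_errmsg(db));
        sqlite3_reset(stmt);

        if (i % batch == batch - 1 || i == count - 1)
            sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
    }

    sqlite3_finalize(stmt);
    return 0;
}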
We are developing a program that works on data located in shared memory. The program is latency-sensitive and processes a huge amount of data.
If the program fails, we must return to the last working state FAST.
One way is to read and process data from a transaction log that contains all transactions since the start of the day. But this is not fast at all, considering the size of the transaction log (hundreds of gigabytes).
We are now looking for a way to create snapshots of the data that can be written to disk and read back very fast if the program fails. But snapshot creation must not block program execution, and the data in the snapshot must be consistent.
If we were using local memory for keeping the data instead of shared memory, the solution would be easy:
fork()
write data to disk
Because of copy-on-write on Linux, only changed pages are copied, so this is very fast.
But we are using POSIX shared memory, and shared mappings are not copied on fork().
Is there any way to do it with speed and consistency in mind?
If you can spare enough CPU cycles for a memcpy(), you could:
fork()
lock shared memory
memcpy (shared_mem -> some_buffer)
unlock shared memory
write data to disk taking the time you like
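A sketch of that sequence, assuming the shared segment starts with a process-shared mutex (the layout, size and snapshot file name are all assumptions):

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed layout of the POSIX shared memory segment: a mutex initialized
 * with PTHREAD_PROCESS_SHARED, followed by the actual data. */
struct shared_seg {
    pthread_mutex_t lock;
    char            data[1 << 20];
};

int snapshot(struct shared_seg *seg)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid > 0)
        return 0;       /* parent continues; reap the child with waitpid() later */

    /* child: hold the lock only for the memcpy, then write at leisure */
    char *buf = malloc(sizeof seg->data);
    if (buf == NULL)
        _exit(1);

    pthread_mutex_lock(&seg->lock);
    memcpy(buf, seg->data, sizeof seg->data);
    pthread_mutex_unlock(&seg->lock);

    int fd = open("snapshot.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) {
        write(fd, buf, sizeof seg->data);
        fsync(fd);
        close(fd);
    }
    _exit(0);
}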
In my program, I hold two files open for writing: a content-file, containing chunks of data, and an index-file, containing a map of which chunks of data have been written so far.
I would like to flush them both to disk as efficiently as possible, with the only constraint that the blocks in the content-file must be written before the corresponding blocks in the index-file (naturally).
The catch is that I would like to avoid blocking, i.e. doing an fsync, for both latency and throughput reasons.
Any ideas?
I don't think you can do this easily in a single execution path. You need fsync to have the write to disk guaranteed - and this is going to have to wait for the write.
I suspect it is possible (but not easy) to do this by delegating the writing task to a separate thread or process. Generate the data in your existing program and 'write' it to the second thread/process using any method that looks sensible; this can be non-blocking. The second thread would then write any new data to your content-file, then fsync, then write the index-file, then check for new data again. Key design decisions relate to how you separate the two execution paths, how you communicate between them, and whether you need to report the completed write back to the main program. This could still have latency and throughput issues, but that's part of the cost of choosing to keep the index-file and content-file in sync. At least there would be a chance of getting work done while waiting on the disk.
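A rough shape of that second execution path as a thread; the queue, the chunk structure and all names are assumptions, and the only essential part is the write-content, fsync, write-index ordering:

#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* One queued unit of work: a data chunk plus the index entry describing it. */
struct chunk {
    void         *data;
    size_t        len;
    off_t         content_off;      /* where it goes in the content-file */
    void         *index_entry;
    size_t        index_len;
    off_t         index_off;        /* where it goes in the index-file */
    struct chunk *next;
};

static struct chunk   *queue_head, *queue_tail;     /* simple FIFO */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

/* Called by the main program: non-blocking apart from a short mutex hold. */
void submit_chunk(struct chunk *c)
{
    c->next = NULL;
    pthread_mutex_lock(&queue_lock);
    if (queue_tail)
        queue_tail->next = c;
    else
        queue_head = c;
    queue_tail = c;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* The writer thread: content-file first, fsync, only then the index-file. */
void *writer_thread(void *arg)
{
    int *fds = arg;                 /* fds[0] = content-file, fds[1] = index-file */
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);
        struct chunk *c = queue_head;
        queue_head = c->next;
        if (queue_head == NULL)
            queue_tail = NULL;
        pthread_mutex_unlock(&queue_lock);

        pwrite(fds[0], c->data, c->len, c->content_off);
        fsync(fds[0]);                      /* the data is durable ...           */
        pwrite(fds[1], c->index_entry, c->index_len, c->index_off);
        fsync(fds[1]);                      /* ... before the index points at it */

        free(c->data);                      /* assumed heap-allocated by the producer */
        free(c->index_entry);
        free(c);
    }
    return NULL;
}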
It could be worth looking at the source of any of the transactional databases to see whether this is encapsulated well enough to be useful to you. You could also investigate the sync option when mounting the file system that holds the content-file.