sqlite3/C transaction begin end -- overflow?

In C on an embedded system (where memory is an issue), I am trying to optimize performance by combining multiple inserts into larger transactions.
Intuitively, SQLite must keep the uncommitted data in a cache somewhere in the limited memory.
Is it possible to have too many inserts between two calls of 'BEGIN TRANSACTION' and 'END TRANSACTION'? Can the cache overflow?
Or does sqlite3 take care of it and handle the situation before an overflow happens?
If the cache may overflow, what is the best strategy for calling BEGIN/END?

Any changes you make are written to the database file. To support rollbacks, the old contents of the changed database pages are saved in the journal file.
When you commit a transaction, the journal file is just deleted; when you roll back a transaction, those pages are written back.
So there is no limit on the size of the data in a transaction, as long as you have enough disk space.
(The cache can help with avoiding some writes, but it works transparently and does not affect the semantics of your code.)
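As a rough sketch of that batching pattern in C (the database file name, the table "samples", and the batch size are made-up assumptions, and error handling is kept minimal), it can look like this:

    #include <stdio.h>
    #include <sqlite3.h>

    /* Minimal sketch of batching inserts between BEGIN and END TRANSACTION.
       The table "samples" and the batch size are illustrative assumptions. */
    #define BATCH_SIZE 500

    static int exec_or_report(sqlite3 *db, const char *sql)
    {
        char *errmsg = NULL;
        if (sqlite3_exec(db, sql, NULL, NULL, &errmsg) != SQLITE_OK) {
            fprintf(stderr, "%s\n", errmsg);
            sqlite3_free(errmsg);
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("test.db", &db) != SQLITE_OK)
            return 1;
        exec_or_report(db, "CREATE TABLE IF NOT EXISTS samples(id INTEGER, value INTEGER);");

        char sql[128];
        exec_or_report(db, "BEGIN TRANSACTION;");
        for (int i = 0; i < BATCH_SIZE; i++) {
            snprintf(sql, sizeof sql,
                     "INSERT INTO samples(id, value) VALUES(%d, %d);", i, i * 2);
            if (exec_or_report(db, sql) != 0) {
                exec_or_report(db, "ROLLBACK;");   /* journal restores the old pages */
                sqlite3_close(db);
                return 1;
            }
        }
        /* END TRANSACTION is a synonym for COMMIT; the rollback journal is deleted here. */
        exec_or_report(db, "END TRANSACTION;");

        sqlite3_close(db);
        return 0;
    }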

Related

How could WAL (write-ahead log) have better performance than writing directly to disk?

The WAL (Write-Ahead Log) technology has been used in many systems.
The mechanism of a WAL is that when a client writes data, the system does two things:
Write a log to disk and return to the client
Write the data to disk, cache or memory asynchronously
There are two benefits:
If some exception occurs (e.g. power loss), we can recover the data from the log.
The performance is good because we write data asynchronously and can batch operations.
Why not just write the data to disk directly? You make every write go directly to disk; on success you tell the client success, and if the write fails you return a failure response or a timeout.
In this way, you still have those two benefits.
You do not need to recover anything after a power loss, because every success response returned to the client means the data is really on disk.
Performance should be the same: although we touch the disk frequently, WAL does too (every successful WAL write means it succeeded on disk).
So what is the advantage of using a WAL?
Performance.
Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again. Those data writes never need to be performed; only the log writes are needed for possible recovery.
Log writes can be batched into larger, sequential writes. For busy workloads, delaying a log write and then performing a single write can significantly improve throughput.
This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue. That is the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are not so important, but avoiding some writes, and preferring large sequential writes, still helps.
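To make the point about sequential log writes concrete, here is a hedged sketch of the core WAL idea in C (the file name and the "key=value" record format are made up): every mutation is appended to the log and made durable before success is reported, while the random-access write to the real data location can be deferred and batched:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hedged sketch of the WAL idea: append the mutation to a log file and
       fsync it before acknowledging the write. The file name and record
       format ("key=value\n") are illustrative assumptions. */
    static int wal_append(int log_fd, const char *key, const char *value)
    {
        char record[256];
        int len = snprintf(record, sizeof record, "%s=%s\n", key, value);

        if (write(log_fd, record, (size_t)len) != len)
            return -1;          /* sequential append */
        if (fsync(log_fd) != 0)
            return -1;          /* durable: safe to acknowledge the client */
        return 0;
        /* The update to the main data file (a random write) can now be done
           later, possibly merged with other updates to the same page. */
    }

    int main(void)
    {
        int fd = open("example.wal", O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (wal_append(fd, "user:42", "alice") != 0) perror("wal_append");
        close(fd);
        return 0;
    }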
Update:
SSDs also have better performance with large sequential writes, but for different reasons. It is not as simple as saying "there is no seek time or rotational latency, therefore just write randomly". For example, writing large blocks into space the SSD knows is free (e.g. via the TRIM command) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially map updates into different internal block sizes.
As you note a key contribution of a WAL is durability. After a mutation has been committed to the WAL you can return to the caller, because even if the system crashes the mutation is never lost.
If you write the update directly to disk, there are two options:
write all records to the end of some file
the files are somehow structured
If you go with 1) it is needless to say that the cost of read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM, which uses files that are internally sorted by key. For that reason, "directly writing to disk" means that you possibly have to rewrite every record that comes after the current key. That's too expensive, so instead you
write to the WAL for persistence
update the memtables (in RAM)
Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because that's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you will rewrite your file-structures to the updated state as a result of many writes, which makes each write substantially cheaper.
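Here is a hedged sketch of that write path in C, deliberately simplified: a sorted array stands in for the balanced-tree memtable, the WAL is one appended line per write, and flushing/compaction is left out:

    #include <stdio.h>
    #include <string.h>

    /* Simplified LSM-style write path: append to a WAL for durability, then
       update a sorted in-memory table (a plain array here instead of a
       balanced tree). Names and sizes are illustrative assumptions. */
    #define MAX_ENTRIES 1024

    struct entry { int key; int value; };

    static struct entry memtable[MAX_ENTRIES];
    static int memtable_size = 0;

    static int put(FILE *wal, int key, int value)
    {
        /* 1. Append the mutation to the WAL (a real engine would also fsync). */
        fprintf(wal, "PUT %d %d\n", key, value);
        if (fflush(wal) != 0)
            return -1;

        /* 2. Insert into the sorted memtable (overwrite if the key exists). */
        int i = 0;
        while (i < memtable_size && memtable[i].key < key)
            i++;
        if (i < memtable_size && memtable[i].key == key) {
            memtable[i].value = value;
            return 0;
        }
        if (memtable_size == MAX_ENTRIES)
            return -1;  /* a real engine would flush the memtable to disk here */
        memmove(&memtable[i + 1], &memtable[i],
                (size_t)(memtable_size - i) * sizeof memtable[0]);
        memtable[i] = (struct entry){ key, value };
        memtable_size++;
        return 0;
    }

    int main(void)
    {
        FILE *wal = fopen("lsm.wal", "a");
        if (!wal) { perror("fopen"); return 1; }
        put(wal, 42, 1);
        put(wal, 7, 2);
        put(wal, 42, 3);        /* overwrite of an existing key */
        printf("memtable holds %d keys\n", memtable_size);
        fclose(wal);
        return 0;
    }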
I have a guess.
Making every write go directly to disk does not require recovery after a power loss. But the performance question needs to be discussed for two situations.
situation 1:
All your storage devices are spinning disks. The WAL approach will have better performance, because writing the WAL is a sequential write while writing the data to its place on disk is a random write, and random writes perform far worse than sequential writes on a spinning disk.
situation 2:
All your devices are SSDs. Then the difference may not be as large, because sequential and random writes have almost the same performance on an SSD.

PostgreSQL-Data File

Which process writes the data file in PostgreSQL?
And what are the data files in PostgreSQL?
Note: I am performing insert/update/delete operations on PostgreSQL 9.5. I want to verify which process performs the commit on disk, i.e. writes the data file, and how the WAL and the data file are used.
The data files of a PostgreSQL database cluster are located under the base subdirectory of the data directory. They are written by three processes:
The background writer process that writes dirty blocks from the buffer back to disk to ensure that there are enough clean blocks.
The checkpointer process that writes all dirty blocks to disk at certain times (checkpoints) to provide a starting point for crash recovery.
The backend process (the process that serves a client connection) only writes data to disk if the background writer cannot keep up and there are not enough free blocks available.
The write-ahead log or WAL, located in pg_xlog, is something entirely different. It is written by the backend process immediately before COMMIT to ensure that the information necessary to recover the transaction in the case of a crash is safely written to disk. The same holds for the commit log, located in pg_clog, which contains the information if a transaction was committed or rolled back.
Data may be written to the data file before COMMIT, but they only become visible when the transaction is committed.
It may be worth mentioning that not only DML statements cause data blocks to be dirtied:
The background process “autovacuum” regularly scans tables and indexes and removes unused entries.
The first process to read newly written data will look up the commit information in the commit log and write a hint bit to the tuple so that future readers don't have to do that work again.
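If you want to verify which of those processes actually write buffers on a running 9.5 instance, the pg_stat_bgwriter view breaks the written buffers down by checkpointer, background writer, and backends. A small libpq sketch (the connection string is an assumption) could look like this:

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Hedged sketch: read pg_stat_bgwriter to see how many buffers were
       written by the checkpointer, the background writer and the backends.
       The connection string is an illustrative assumption. */
    int main(void)
    {
        PGconn *conn = PQconnectdb("dbname=postgres");
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        PGresult *res = PQexec(conn,
            "SELECT buffers_checkpoint, buffers_clean, buffers_backend "
            "FROM pg_stat_bgwriter");
        if (PQresultStatus(res) != PGRES_TUPLES_OK) {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        } else {
            printf("written by checkpointer:      %s\n", PQgetvalue(res, 0, 0));
            printf("written by background writer: %s\n", PQgetvalue(res, 0, 1));
            printf("written by backends:          %s\n", PQgetvalue(res, 0, 2));
        }
        PQclear(res);
        PQfinish(conn);
        return 0;
    }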

what is a sequential write and what is a random write

I want to know what exactly a sequential write and a random write are, by definition. An example would be even more helpful. I tried to google it, but did not find much of an explanation.
Thanks
When you write two blocks that are next to each other on disk, you have a sequential write.
When you write two blocks that are located far away from each other on disk, you have random writes.
With a spinning hard disk, the second pattern is much slower (it can be orders of magnitude slower), because the head has to be moved to the new position.
Database technology is (or has been; it may matter less with SSDs) to a large part about optimizing disk access patterns. So what you often see, for example, is trading direct updates of data in their on-disk location (random access) for writes to a transaction log (sequential access). That makes it more complicated and time-consuming to reconstruct the actual value, but it makes for much faster commits (and checkpoints eventually consolidate the logs that build up).
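As a small illustration (block size, file name, and offsets are arbitrary), the two patterns differ only in where each write lands: consecutive offsets versus scattered ones. On a spinning disk the scattered version forces a head movement for nearly every write:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Illustration only: the same amount of data written sequentially and
       "randomly" (scattered offsets). Block size and file name are arbitrary. */
    #define BLOCK   4096
    #define BLOCKS  256

    int main(void)
    {
        char buf[BLOCK];
        memset(buf, 'x', sizeof buf);

        int fd = open("pattern.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Sequential: each block directly follows the previous one. */
        for (int i = 0; i < BLOCKS; i++) {
            lseek(fd, (off_t)i * BLOCK, SEEK_SET);
            write(fd, buf, BLOCK);
        }

        /* Random: blocks are written at scattered offsets. */
        srand(1);
        for (int i = 0; i < BLOCKS; i++) {
            lseek(fd, (off_t)(rand() % BLOCKS) * BLOCK, SEEK_SET);
            write(fd, buf, BLOCK);
        }

        close(fd);
        return 0;
    }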

WAL sequence number infinite?

I am wondering whether database WAL sequence numbers are infinite? I guess most WAL records have a fixed size for the WAL number? Is it a number so big that it just won't reach an end? That might be quite a waste of space? Or have the big DB players invented a better method?
Or do they implement some logic to let the WAL start at 0 again? That might have a heavy impact on many spots in the code...?
EDIT:
Impact: E.g. recovery after a crash relies on the sequence number getting bigger along the timeline. If the sequence could start over, recovery could get confused.
Term WAL sequence number: WAL (Write-Ahead Log, a.k.a. the transaction log that is guaranteed to be on disk before the application layer is told that a transaction was successful). This log has a growing number that is used to keep the database consistent, e.g. during recovery, by checking the WAL sequence number on the pages against the sequence number in the WAL.
I would not assume that every database implements the same strategy.
Speaking only for Oracle, the SCN (system change number) is a 48-bit number so an Oracle database can handle nearly 300 trillion transactions before hitting the limit. Realistically, that will take eons. Even if you could do 1 thousand transactions per second, the SCN wouldn't hit the limit for 300 billion seconds or roughly 9500 years. Now, there are various things that can cause the SCN to increment in addition to just doing transactions (famously the recent issue with hot backups and database links that caused a few users to exceed the database's checks for the reasonability of the SCN) so it won't really take 9500 years to hit the limit. But, realistically, it gives Oracle plenty of time to move to a 64-bit SCN some years down the line, buying everyone a few more centuries of functionality.
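A quick back-of-the-envelope check of that estimate, assuming the same 1,000 transactions per second:

    #include <stdio.h>
    #include <stdint.h>

    /* Back-of-the-envelope check: how long a 48-bit SCN lasts at a fixed
       transaction rate. The rate of 1000 tx/s is the assumption from the text. */
    int main(void)
    {
        uint64_t max_scn = (uint64_t)1 << 48;    /* ~281 trillion values */
        uint64_t rate    = 1000;                 /* transactions per second */
        uint64_t seconds = max_scn / rate;
        uint64_t years   = seconds / (365ULL * 24 * 3600);

        printf("max SCN: %llu\n", (unsigned long long)max_scn);
        printf("seconds: %llu\n", (unsigned long long)seconds);
        printf("years:   %llu\n", (unsigned long long)years);  /* on the order of 9000 years */
        return 0;
    }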
Like SQL Server, DB2 calls that counter a Log Sequence Number (LSN). IBM recently expanded the size of their LSN from six bytes to eight bytes, unsigned. The LSN is an ever-growing pointer that shows where in the log files that specific log record can be found. An eight byte LSN means that a DB2 database can write nearly 16 exbibytes of log records before running out of address space, at which point the contents of the database must be unloaded and copied into a new database.
Postgres WAL numbers technically can overflow, but only after writing 32 Eb of data, as the WAL file pointer is 64-bit.
See What is an LSN: Log Sequence Number. The article describes the structure of the SQL Server LSN (the WAL number) and shows you how to decode one. Since LSNs are of fixed size and they don't roll over, it follows that you can run out of them. It will take a very very long time though.
In PostgreSQL, WAL logs are stored as a set of segment files. "Segment files are given ever-increasing numbers as names, starting at 000000010000000000000000." The number doesn't wrap around.

Database synchronisation

I have a problem in a system I am developing. I have a Python script which first works with virtualization software, and if that operation succeeds, it writes things to a database.
If some exception occurs in the virtualization software then I can manage it, but the real problem occurs if inserting into the database fails. If the insert fails, I will have to revert things in the virtualization software, otherwise the two will get out of sync. The problem is that reverting things in that software is not always possible.
How can I handle this so that I keep the database in sync with that software? Is there any middleware or special application, or some logic in programming?
You want two actions in your system (OP: operation in your virtualization software; WDB: write to database) to be atomic (either both take place, or none). That is a kind of distributed transaction, but your virtualization software does not directly support transactional behaviour (no rollback). If you could make them part of some distributed transactional system, you'd be done (see e.g.), but that is often impossible or impractical. Different strategies to attain pseudo-transactional behaviour depend on the particulars of your scenario. Some examples:
Open TX (transaction in DB)
WDB
OP
If OP succeeded, commit TX, else rollback TX.
Only feasible if what you write to the DB does not depend on the result of OP (improbable).
OP1 (first phase of the operation: you get the results, but do not alter anything)
Open TX
WDB
OP2 (second phase: you modify the virt. software)
Commit TX or rollback
(Steps 4-5 can be switched.) This would be a poor man's "two-phase commit" implementation. Only feasible if you can divide your operation into those two phases.
Open TX
Dummy WDB (write a dummy result to DB)
Rollback TX
OP
WDB
This checks that the DB is operational by doing a dummy write before attempting the real operation and the real write. Feasible, but not foolproof.
OP
WDB
If fail: save data to a raw file, log error, send mail to IT, turn red lights on.
Sounds pathetic... but sometimes it's the only feasible way.
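For illustration, here is a rough C sketch of the second strategy's ordering, using SQLite as the database; the original system is a Python script, but the pattern is language-independent. The functions op_phase1()/op_phase2() and the table vm_state are hypothetical stand-ins:

    #include <stdio.h>
    #include <sqlite3.h>

    /* Rough sketch of the "poor man's two-phase commit" ordering.
       op_phase1()/op_phase2() are hypothetical stand-ins for the calls into
       the virtualization software; the table name "vm_state" is made up. */
    static int op_phase1(void) { /* gather results, change nothing yet */ return 0; }
    static int op_phase2(void) { /* actually modify the virtualization software */ return 0; }

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("state.db", &db) != SQLITE_OK) return 1;
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS vm_state(id INTEGER, info TEXT);",
                     NULL, NULL, NULL);

        if (op_phase1() != 0) {                                   /* 1. OP1 (read-only) */
            sqlite3_close(db);
            return 1;
        }

        sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);             /* 2. open TX */
        int rc = sqlite3_exec(db,                                 /* 3. WDB */
            "INSERT INTO vm_state(id, info) VALUES(1, 'created');",
            NULL, NULL, NULL);

        if (rc == SQLITE_OK && op_phase2() == 0) {                /* 4. OP2 */
            sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);        /* 5. commit */
        } else {
            sqlite3_exec(db, "ROLLBACK;", NULL, NULL, NULL);      /* 5. rollback */
        }

        sqlite3_close(db);
        return 0;
    }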

Resources