Hi guys, please help me.
What is the difference between the wal_level settings logical, hot_standby, and minimal, and what is a WAL segment? Do I need to use big segments or not?
Right now I use wal_segment: 50.
Why, after I insert 5 million rows, does the number of archived segments exceed 50?
The parameter wal_level determines how much information is written to the transaction log (the write-ahead log, or WAL for short).
The settings, in decreasing order of the amount of WAL emitted, are:
For logical replication or logical decoding, you need logical.
To run physical replication with hot standby, you need hot_standby.
To archive WAL files with archive_mode = on, you need archive.
The minimal level logs only the information required for crash recovery.
Note that from PostgreSQL 9.6 on, archive and hot_standby have been deprecated and replaced with the new setting replica.
A WAL segment is one 16 MB transaction log file as seen in pg_xlog or pg_wal.
I guess that by wal_segment you mean the parameter checkpoint_segments (max_wal_size since 9.5).
It does not impose an absolute limit on the number of WAL segments; it determines after how much WAL a checkpoint will be forced. If your archive_command is slow, WAL may pile up.
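To see why the archive can end up with more than 50 segments after a 5-million-row insert, here is a rough back-of-the-envelope calculation in C. The figure of 200 bytes of WAL per row is purely an assumption for illustration; the real amount depends on row width, indexes, and full-page writes.

    /* Rough estimate of how many 16 MB WAL segments a bulk insert can
       generate. The 200 bytes of WAL per row is an assumed figure for
       illustration only. */
    #include <stdio.h>

    int main(void)
    {
        const long long rows          = 5000000;            /* rows inserted            */
        const long long wal_per_row   = 200;                 /* assumed bytes of WAL/row */
        const long long segment_bytes = 16LL * 1024 * 1024;  /* one 16 MB WAL segment    */

        long long total_wal = rows * wal_per_row;
        long long segments  = (total_wal + segment_bytes - 1) / segment_bytes;

        printf("~%lld MB of WAL -> ~%lld segments\n",
               total_wal / (1024 * 1024), segments);
        return 0;
    }

With those assumed numbers the insert alone produces close to 1 GB of WAL, i.e. around 60 segments, which is why the archive can grow well past 50.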
I'm researching distributed file system architectures and designs. Quite a few of the DFSs I've come across have the following architecture:
A namenode or metadata server used to manage the location of data blocks / chunks as well as the hierarchy of the filesystem.
A data node or data server used to store chunks or blocks of data belonging to one or more logical files.
A client that talks to a namenode to find appropriate data nodes to read/write from/to.
Many of these systems have two primary tunable parameters: a block size and a replication factor.
My question is:
Are replication factor and forward error correction like Reed-Solomon erasure coding compatible here? Does it make sense to use both techniques to ensure high availability of data? Or is it enough to use one or the other (and what are the trade-offs)?
Whether you can mix and match plain old replication and erasure codes depends on what the distributed file system in question offers in its feature set, but the two are usually mutually exclusive.
Replication is simple in the sense that the file/object is replicated as a whole to 'n' (the replication factor) data nodes. Writes go to all nodes. Reads can be served from any one of the nodes individually, since each of them hosts the whole file, so you can distribute different reads among multiple nodes. There is no intermediate math involved, and the process is mostly I/O bound. Also, for a given file size, the disk usage is higher (since there are 'n' copies).
Erasure codes are complex in the sense that parts of the file/object are encoded and spread among the 'n' data nodes during writes. Reads need to fetch data from more than one node, decode it, and reconstruct the data. So math is involved, and reads can become CPU bound. Compared to replication, the disk usage is lower, but so is the ability to tolerate faults.
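To make the trade-off concrete, here is a small C sketch comparing the two approaches. The replication factor of 3 and the (10 data + 4 parity) Reed-Solomon layout are illustrative choices only, not values from any particular DFS.

    /* Illustrative comparison of n-way replication versus a Reed-Solomon
       style (k data + m parity) erasure code. The parameters are example
       values, not taken from any particular distributed file system. */
    #include <stdio.h>

    int main(void)
    {
        /* n-way replication: n full copies; any single copy serves a read. */
        int n = 3;
        printf("Replication x%d: %d%% storage overhead, a read touches 1 node\n",
               n, (n - 1) * 100);

        /* Erasure code: k data chunks + m parity chunks; reading a whole
           object needs chunks from k nodes plus a decode step.            */
        int k = 10, m = 4;
        printf("RS(%d,%d): %d%% storage overhead, a read touches %d nodes\n",
               k, m, m * 100 / k, k);

        return 0;
    }

This reflects the points above: erasure coding stores far fewer extra bytes, but every full read fans out to several nodes and pays for the decoding math.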
The WAL (Write-Ahead Log) technology has been used in many systems.
The mechanism of a WAL is that when a client writes data, the system does two things:
Write a log to disk and return to the client
Write the data to disk, cache or memory asynchronously
There are two benefits:
If some failure occurs (e.g. a power loss), we can recover the data from the log.
The performance is good because we write data asynchronously and can batch operations
Why not just write the data to disk directly? Make every write go straight to disk: on success you tell the client it succeeded, and if the write fails you return a failure response or a timeout.
In this way, you still have those two benefits.
You do not need to recover anything after a power loss, because every success response returned to the client means the data is really on disk.
Performance should be the same. Although we touch the disk frequently, the WAL approach does too (every successful WAL write means it has reached the disk).
So what is the advantage of using a WAL?
Performance.
Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again. These writes do not need to be performed, with only the log writes performed for possible recovery.
Log writes can be batched into larger, sequential writes. For busy workloads, delaying log writes slightly so that several can be combined into a single write can significantly improve throughput.
This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue. That is the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are less important, but avoiding some writes and favouring large sequential writes still helps.
Update:
SSDs also have better performance with large sequential writes, but for different reasons. It is not as simple as saying "no seek time or rotational latency, therefore just write randomly". For example, writing large blocks into space the SSD knows is "free" (e.g. via the TRIM command to the drive) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially map updates into different internal block sizes.
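Here is a minimal C sketch of the batching idea (not taken from any real database): several log records are accumulated in memory and flushed with a single write() and fsync(), rather than paying one disk write per record.

    /* Minimal sketch of batched ("group commit") log writes: records are
       accumulated in a buffer and flushed with one write() + fsync().
       Error handling is kept minimal; this is an illustration only. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BATCH_BYTES 4096

    static char   batch[BATCH_BYTES];
    static size_t batch_len = 0;

    /* Append one record to the in-memory batch; flush the whole batch with
       a single sequential write and one fsync once it is full. */
    static void log_append(int fd, const void *rec, size_t len)
    {
        if (batch_len + len > BATCH_BYTES) {
            (void)write(fd, batch, batch_len);   /* one large sequential write */
            fsync(fd);                           /* one fsync for many records */
            batch_len = 0;
        }
        memcpy(batch + batch_len, rec, len);
        batch_len += len;
    }

    int main(void)
    {
        int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return 1;

        for (int i = 0; i < 1000; i++)
            log_append(fd, "record\n", 7);

        if (batch_len > 0) {                     /* flush the final partial batch */
            (void)write(fd, batch, batch_len);
            fsync(fd);
        }
        close(fd);
        return 0;
    }

A real engine would also flush at transaction boundaries before acknowledging a commit; the sketch only batches by size.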
As you note, a key contribution of a WAL is durability. After a mutation has been committed to the WAL, you can return to the caller, because even if the system crashes the mutation is never lost.
If you write the update directly to disk, there are two options:
1) write all records to the end of some file
2) the files are somehow structured
If you go with 1), it is needless to say that the cost of a read is O(mutations), hence pretty much every system uses 2). RocksDB uses an LSM tree, which uses files that are internally sorted by key. For that reason, "directly writing to disk" would mean possibly rewriting every record that comes after the current key. That's too expensive, so instead you
write to the WAL for persistence
update the memtables (in RAM)
Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy, because that's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you rewrite your on-disk structures to reflect many accumulated writes at once, which makes each individual write substantially cheaper.
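As a condensed sketch of that write path (hypothetical code, not RocksDB's actual implementation): each write is appended to the WAL first for durability and then applied to an in-memory sorted table. A fixed-size sorted array stands in for the memtable here.

    /* Sketch of an LSM-style write path: append the mutation to the WAL,
       then update a sorted in-memory "memtable". No data file is touched
       on the write path; this is illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MEMTABLE_CAP 1024

    struct entry { int key; int value; };

    static struct entry memtable[MEMTABLE_CAP];
    static int          memtable_len = 0;

    static void put(int wal_fd, int key, int value)
    {
        /* 1. Persist the mutation in the WAL and force it to disk. */
        char rec[64];
        int  len = snprintf(rec, sizeof rec, "PUT %d %d\n", key, value);
        (void)write(wal_fd, rec, (size_t)len);
        fsync(wal_fd);

        /* 2. Apply it to the sorted in-memory memtable. */
        if (memtable_len >= MEMTABLE_CAP)
            return;                          /* a real engine would flush the
                                                memtable to an SSTable here  */
        int i = memtable_len;
        while (i > 0 && memtable[i - 1].key > key) {
            memtable[i] = memtable[i - 1];   /* shift to keep the array sorted */
            i--;
        }
        memtable[i].key   = key;
        memtable[i].value = value;
        memtable_len++;
    }

    int main(void)
    {
        int wal_fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (wal_fd < 0)
            return 1;

        put(wal_fd, 42, 1);
        put(wal_fd, 7, 2);
        put(wal_fd, 99, 3);

        close(wal_fd);
        return 0;
    }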
I have a guess.
Making every write go to disk directly means nothing needs to be recovered after a power loss. But the performance question needs to be discussed in two situations.
Situation 1:
All your storage devices are spinning disks. The WAL approach will perform better, because WAL writes are sequential while writing the data in place results in random writes, and on a spinning disk random writes perform far worse than sequential writes.
Situation 2:
All your devices are SSDs. Then there may not be much difference in performance, because sequential and random writes perform almost the same on an SSD.
I have an Apache Solr 4.2.1 instance that has 4 cores of total size (625MB + 30MB + 20GB + 300MB) 21 GB.
It runs on a 4 Core CPU, 16GB RAM, 120GB HD, CentOS dedicated machine.
1st core is fully imported once a day.
2nd core is fully imported every two hours.
3rd core is delta imported every two hours.
4th core is fully imported every two hours.
The server also has a decent amount of queries (search for and create, update and delete documents).
Every core has maxDocs: 100 and maxTime: 15000 for autoCommit, and maxTime: 1000 for autoSoftCommit.
The System usage is:
Around 97% of 14.96 GB Physical Memory
0MB Swap Space
Around 94% of 4096 File Descriptor Count
From 60% to 90% of 1.21GB of JVM-Memory.
When I reboot the machine, the File Descriptor Count falls to near 0 and then, over the course of a week or so, steadily climbs back to the value mentioned above.
So, to conclude, my questions are:
Is 94% of 4096 File Descriptor Count normal?
How can I increase the maximum File Descriptor Count?
How can I calculate the theoretical optimal value for the maximum and used File Descriptor Count?
Will the File Descriptor Count reach 100%? If so, will the server crash? Or will it keep itself below 100% and function as it should?
Thanks a lot in advance!
Sure.
ulimit -n <number>. See Increasing ulimit on CentOS.
There really isn't one - as many as needed, depending on a lot of factors, such as your mergefactor (if you have many files, the number of open files will be large as well - this is especially true for the indices that aren't full imports; check the number of files in your data directories and issue an optimize if the same index has become very fragmented and has a large mergefactor), the number of searchers, other software running on the same server, etc.
It could. Yes (or at least it won't function properly, as it won't be able to open any new files). No. In practice you'll start getting errors about being unable to open files, with the message "Too many open files".
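For reference, ulimit -n adjusts the per-process RLIMIT_NOFILE limit. The small C sketch below is purely illustrative of that OS mechanism (you would tune Solr via ulimit or the system limits configuration, not code like this); it shows how a process can read the limit and raise its soft limit up to the hard limit.

    /* Illustration of the limit that `ulimit -n` adjusts: RLIMIT_NOFILE,
       the per-process cap on open file descriptors. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        getrlimit(RLIMIT_NOFILE, &rl);
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        rl.rlim_cur = rl.rlim_max;      /* raise the soft limit to the hard limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            perror("setrlimit");

        return 0;
    }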
So, the issue with the File Descriptor Count (FDC), and to be more precise with the ever-increasing FDC, was that I was committing after every update!
I noticed that Solr wasn't deleting old transaction logs. Thus, after about a week the FDC was maxing out and I was forced to reboot.
I stopped committing after every update and now my Solr stats are:
Around 55% of 14.96 GB Physical Memory
0MB Swap Space
Around 4% of 4096 File Descriptor Count
From 60% to 80% of 1.21GB of JVM-Memory.
Also, the old transaction logs are now deleted by auto commit (soft & hard) and Solr shows no more performance warnings!
So, as pointed very well in this article:
Understanding Transaction Logs, Soft Commit and Commit in SolrCloud
"Be very careful committing from the client! In fact, don’t do it."
In C on an embedded system (where memory is an issue), I am combining multiple inserts into larger transactions to optimize performance.
Intuitively, SQLite must keep the uncommitted transactions in a cache somewhere in the limited memory.
Is it possible to have too many inserts between two calls of 'BEGIN TRANSACTION' and 'END TRANSACTION'? Can the cache overflow?
Or does sqlite3 take care of it and initiate a transaction before an overflow happens?
If the cache may overflow, what is the best strategy for calling BEGIN/END?
Any changes you make are written to the database file. To support rollbacks, the old contents of the changed database pages are saved in the journal file.
When you commit a transaction, the journal file is just deleted; when you roll back a transaction, those pages are written back.
So there is no limit on the size of the data in a transaction, as long as you have enough disk space.
(The cache can help with avoiding some writes, but it works transparently and does not affect the semantics of your code.)
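A minimal C sketch of the batching strategy, using the SQLite C API; the table name, column, and batch size of 500 are made-up example values, and error handling is reduced to the bare minimum.

    /* Batch inserts into explicit transactions: wrap a fixed number of
       inserts in one transaction, commit, and start a new one. */
    #include <stdio.h>
    #include <sqlite3.h>

    #define BATCH_SIZE 500   /* commit after this many inserts (example value) */

    int main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;

        if (sqlite3_open("test.db", &db) != SQLITE_OK)
            return 1;

        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS t(x INTEGER)", 0, 0, 0);
        sqlite3_prepare_v2(db, "INSERT INTO t(x) VALUES (?1)", -1, &stmt, 0);

        sqlite3_exec(db, "BEGIN TRANSACTION", 0, 0, 0);
        for (int i = 0; i < 10000; i++) {
            sqlite3_bind_int(stmt, 1, i);
            sqlite3_step(stmt);
            sqlite3_reset(stmt);

            if ((i + 1) % BATCH_SIZE == 0) {   /* commit and start a new batch */
                sqlite3_exec(db, "END TRANSACTION", 0, 0, 0);
                sqlite3_exec(db, "BEGIN TRANSACTION", 0, 0, 0);
            }
        }
        sqlite3_exec(db, "END TRANSACTION", 0, 0, 0);

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }

Since the journal lives on disk, the batch size is mainly a trade-off between per-commit overhead and how much work you are willing to lose if power is cut mid-batch.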
I am wondering: are database WAL sequences infinite? I guess most WAL records have a fixed-size field for the WAL number? Is that number so big that it will simply never reach its end? That might be quite a waste of space. Or have the big DB players invented a better method?
Or do they implement some logic to let the WAL start at 0 again? That might have a heavy impact on many spots in the code...?
EDIT:
Impact: e.g. recovery after a crash relies on the sequence number increasing along the timeline. If the sequence could start over, recovery could get confused.
Term WAL sequence number: WAL (Write-Ahead Log, a.k.a. the transaction log that is guaranteed to be on disk before the application layer is told that a transaction was successful). This log has a growing number that keeps the database consistent, e.g. during recovery, by checking the WAL sequence number recorded in the pages against the sequence number of the WAL itself.
I would not assume that every database implements the same strategy.
Speaking only for Oracle, the SCN (system change number) is a 48-bit number so an Oracle database can handle nearly 300 trillion transactions before hitting the limit. Realistically, that will take eons. Even if you could do 1 thousand transactions per second, the SCN wouldn't hit the limit for 300 billion seconds or roughly 9500 years. Now, there are various things that can cause the SCN to increment in addition to just doing transactions (famously the recent issue with hot backups and database links that caused a few users to exceed the database's checks for the reasonability of the SCN) so it won't really take 9500 years to hit the limit. But, realistically, it gives Oracle plenty of time to move to a 64-bit SCN some years down the line, buying everyone a few more centuries of functionality.
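The arithmetic behind that estimate, as a tiny C calculation (the rate of 1,000 transactions per second is just the illustrative figure from above):

    /* Back-of-the-envelope arithmetic for exhausting a 48-bit SCN at a
       fixed transaction rate. The rate is the illustrative figure used
       above; real SCN consumption varies. */
    #include <stdio.h>

    int main(void)
    {
        const double scn_count        = 3.0e14;              /* "nearly 300 trillion" (~2^48) */
        const double tx_per_second    = 1000.0;              /* illustrative rate from above  */
        const double seconds_per_year = 365.25 * 24 * 3600;

        double seconds = scn_count / tx_per_second;
        printf("%.0f seconds, i.e. roughly %.0f years\n",
               seconds, seconds / seconds_per_year);
        return 0;
    }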
Like SQL Server, DB2 calls that counter a Log Sequence Number (LSN). IBM recently expanded the size of their LSN from six bytes to eight bytes, unsigned. The LSN is an ever-growing pointer that shows where in the log files that specific log record can be found. An eight byte LSN means that a DB2 database can write nearly 16 exbibytes of log records before running out of address space, at which point the contents of the database must be unloaded and copied into a new database.
Postgres WAL numbers technically can overflow, but only after writing 32 Eb of data, as the WAL file pointer is 64-bit.
See What is an LSN: Log Sequence Number. The article describes the structure of the SQL Server LSN (the WAL number) and shows you how to decode one. Since LSNs are of fixed size and they don't roll over, it follows that you can run out of them. It will take a very very long time though.
In PostgreSQL, WAL logs are stored as a set of segment files. "Segment files are given ever-increasing numbers as names, starting at 000000010000000000000000." The number doesn't wrap around.