What are the differences between LOG_CHECKPOINT_INTERVAL and LOG_CHECKPOINT_TIMEOUT? I need a clear picture of the volume-based interval versus the time-based interval. What are the relations among LOG_CHECKPOINT_TIMEOUT, LOG_CHECKPOINT_INTERVAL and FAST_START_IO_TARGET?
A checkpoint is when the database synchronizes the dirty blocks in the buffer cache with the datafiles. That is, it writes changed data to disk. The two LOG_CHECKPOINT parameters you mention govern how often this activity occurs.
The heart of the matter is: if the checkpoint occurs infrequently it will take longer to recover the database in the event of a crash, because it has to apply lots of data from the redo logs. On the other hand, if the checkpoint occurs too often the database can be tied up as various background processes become a bottleneck.
The difference between the two is that the INTERVAL specifies the maximum number of redo blocks that can exist between checkpoints, while the TIMEOUT specifies the maximum number of seconds between checkpoints. We need to set both parameters to cater for spikes of heavy activity. Note that LOG_CHECKPOINT_INTERVAL is measured in OS blocks, not database blocks.
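As a hedged illustration (the values here are arbitrary; appropriate ones depend entirely on your redo volume and recovery requirements), both parameters can be set so that whichever threshold is reached first triggers the checkpoint:

-- Checkpoint after at most 10,000 OS blocks of redo, and in any case
-- at least every 1,800 seconds, whichever limit is hit first.
ALTER SYSTEM SET LOG_CHECKPOINT_INTERVAL = 10000;
ALTER SYSTEM SET LOG_CHECKPOINT_TIMEOUT = 1800;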
FAST_START_IO_TARGET is a different proposition. It specifies a target for the number of I/Os required to recover the database. The database then manages its checkpoints intelligently to achieve this target. Again, this is a trade-off between recovery times and the amount of background activity, although the impact on normal processing should be less than that of badly set LOG_CHECKPOINT parameters. This parameter is only available with the Enterprise Edition. It was deprecated in 9i in favour of FAST_START_MTTR_TARGET, and Oracle removed it in 10g. There is a view, V$MTTR_TARGET_ADVICE, which provides advice on setting the FAST_START_MTTR_TARGET.
We should set either the FAST_START%TARGET or the LOG_CHECKPOINT_% parameters but not both. Setting the LOG_CHECKPOINT_INTERVAL will override the setting of FAST_START_MTTR_TARGET.
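For illustration (300 seconds is an arbitrary example value), setting the newer parameter and consulting the advisor view looks like this:

-- Aim for roughly five minutes of crash recovery time.
ALTER SYSTEM SET FAST_START_MTTR_TARGET = 300;

-- Estimated I/O cost of alternative MTTR targets.
SELECT * FROM V$MTTR_TARGET_ADVICE;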
My understanding: a read replica database exists to allow read volumes to scale.
So far, so good, lots of copies to read from - ok, that makes sense, share the volume of reads between a bunch of copies.
However, the things I'm reading seem to imply "tada! magic fast copies!". How are the copies faster, as surely they must also be burdened by the same amount of writing as the main db in order that they remain in sync?
How are the copies faster, as surely they must also be burdened by the same amount of writing as the main db in order that they remain in sync?
Good question.
First, the writes to the replicas may be more efficient than the writes to the primary if the replicas are maintained by replaying the Write-Ahead Logs into the secondaries (sometimes called a "physical replica"), instead of replaying the queries into the secondaries (sometimes called a "logical replica"). A physical replica doesn't need to do any query processing to stay in sync, and may not need to read the target database blocks/pages into memory in order to apply the changes, leaving more of the memory and CPU free to process read requests.
Even a logical replica might be able to apply changes more cheaply on the replica, as a query on the primary of the form
update t set status = 'a' where status = 'b'
might get replicated as a series of
update t set status = 'a' where id = ?
saving the replica from having to identify which rows to update.
Second, the secondaries allow the read workload to scale across more physical resources. So total read workload is spread across more servers.
I am trying to understand the impact of multi-partitioned transactions in VoltDB 9.x. I know it is designed for single-partitioned transactions, but I want to know what it will cost me if I can't avoid them.
In summary, my question is whether it is still the case that multi-partitioned transactions in VoltDB always lock the entire cluster, and how the different kinds of multi-partitioned transactions relate to each other with regard to their execution behaviour.
From H-Store-FAQ:
[...] this allows H-Store to support additional optimizations, such as speculative execution and arbitrary multi-partition transactions. For example, in VoltDB every transaction is either single-partition or all-partition. That is, any transaction that needs to touch multiple partitions will cause the VoltDB’s transaction coordinator to lock all partitions in the cluster, even if the transaction only needs to touch data at two partitions. [...] It is likely VoltDB will support these features in the future [...]
The papers The VoltDB Main Memory DBMS and How VoltDB does Transactions claim that there exists at least one split of multi-partitioned transactions in VoltDB: One-Shot-Reads and General-2PC-Transactions.
In the class MpTransactionTaskQueue there is a distinction as to whether a transaction will be routed to the multi-partitioned site (count 1) or to a pool of read-only sites (default count up to 20) of the MPI, and the two kinds cannot be executed interleaved.
So these are my sub-questions:
Are One-Shot-Reads always executed on RO sites?
Do RO sites additionally execute multi-partitioned transactions that are read-only but not one-shot?
If there is at least one write fragment in a multi-partitioned transaction, is it executed on the RW site and committed atomically with 2PC?
In both cases it is possible that a transaction doesn't have to touch all partitions in the cluster. Are uninvolved partitions locked, or can they execute single-partitioned transactions in the meantime (while several One-Shot-Reads or one 2PC transaction are running on other partitions)? If they are locked, how? Do they get a FragmentTaskMessage with an empty or dummy plan fragment, for example?
The class SystemProcedureCatalog defines an "Every-Flag", which is checked in code in addition to the read-only and single-partitioned flags. How is this flag related to One-Shot-Reads or the run-everywhere pattern?
To make things easier for developers, procedures are called the same way regardless of what type they are. Internally there are different types of multi-partition procedures as they provide some optimizations, although there is more to be done and some H-Store projects have done research in these areas.
MP transactions still ultimately involve sending tasks to be done on all the partitions. The one exception you noticed is a special two-partition transaction that is only used in rebalancing data during elastic add or shrink.
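To make the single-partition/multi-partition distinction concrete, here is a hedged sketch in VoltDB DDL (the table and procedures are invented for illustration): a procedure partitioned on the table's partitioning column is routed to exactly one partition, while a procedure without a partitioning key sends tasks to all partitions.

CREATE TABLE votes (
  phone_number BIGINT NOT NULL,
  state VARCHAR(2) NOT NULL
);
-- All rows with the same phone_number live in the same partition.
PARTITION TABLE votes ON COLUMN phone_number;

-- Single-partition: routed to one partition by the phone_number argument.
CREATE PROCEDURE CountVotesForPhone
  PARTITION ON TABLE votes COLUMN phone_number
  AS SELECT COUNT(*) FROM votes WHERE phone_number = ?;

-- Multi-partition: no partitioning key, so it involves every partition.
CREATE PROCEDURE CountAllVotes
  AS SELECT COUNT(*) FROM votes;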
Partitions consist of one or more sites (on separate servers) depending on kfactor. These sites stay in sync without a 2PC by requiring deterministic procedures. The partitions work through the backlog in a queue as fast as the process time (or local execution time) allows. All sites handle both reads and writes.
MP tasks sent to those partition queues have to wait for all the pending items to finish. That is why there is a pool of 20 (by default) threads for MP reads. This allows 20 tasks to be sent out at once, so that the next MP read usually doesn't have to wait for 2 network hops + the max queue wait time + processing time before it can even get queued.
MP reads that are not "single-shot" would be Java procedures with multiple voltExecuteSQL() calls, such as a procedure where subsequent SQL queries depend on the results of prior queries. When these transactions send tasks to the partitions, the partitions have to wait for the max queue wait time + processing time + 2 network hops before they can do the next part of the transaction.
MP writes can also have multiple voltExecuteSQL() calls, plus they have to wait for a final commit signal, so this all delays the progress on the partitions.
There are certainly examples of MP transactions that shouldn't need to involve all of the partitions and could benefit from future optimizations, but it's not as easy as it may seem on a database that has to support durability to disk, k-safety, elastic add and shrink, multi-cluster active-active replication, and many of the other features that have been added to VoltDB over the years since it grew out of the H-Store project.
Disclosure: I work at VoltDB
Can you please tell me which data structure InfluxDB has and which data model InfluxDB uses? Is this a key-value model? I read the full documentation and I didn't catch that.
Thank you in advance!
1. Data model and terminology
An InfluxDB database stores points. A point has four components: a measurement, a tagset, a fieldset, and a timestamp.
The measurement provides a way to associate related points that might have different tagsets or fieldsets. The tagset is a dictionary of key-value pairs to store metadata with a point. The fieldset is a set of typed scalar values—the data being recorded by the point.
The serialization format for points is defined by the line protocol (which includes additional examples and explanations if you’d like to read more detail). An example point from the specification helps to explain the terminology:
temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035
The measurement is temperature.
The tagset is machine=unit42,type=assembly. The keys, machine and type, in the tagset are called tag keys. The values, unit42 and assembly, in the tagset are called tag values.
The fieldset is internal=32,external=100. The keys, internal and external, in the fieldset are called field keys. The values, 32 and 100, in the fieldset are called field values.
Each point is stored within exactly one database within exactly one retention policy. A database is a container for users, retention policies, and points. A retention policy configures how long InfluxDB keeps points (duration), how many copies of those points are stored in the cluster (replication factor), and the time range covered by shard groups (shard group duration). The retention policy makes it easy for users (and efficient for the database) to drop older data that is no longer needed. This is a common pattern in time series applications.
We’ll explain replication factor, shard groups, and shards later when we describe how the write path works in InfluxDB.
There’s one additional term that we need to get started: series. A series is simply a shortcut for saying retention policy + measurement + tagset. All points with the same retention policy, measurement, and tagset are members of the same series.
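For example (assuming the default retention policy), the first two points below belong to the same series, while the third starts a different series because its machine tag value differs:

temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035
temperature,machine=unit42,type=assembly internal=33,external=101 1434055563000000035
temperature,machine=unit43,type=assembly internal=30,external=95 1434055563000000035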
You can refer to the documentation glossary for these terms or others that might be used in this blog post series.
2. Receiving points from clients
Clients POST points (in line protocol format) to InfluxDB’s HTTP /write endpoint. Points can be sent individually; however, for efficiency, most applications send points in batches. A typical batch ranges in size from hundreds to thousands of points. The POST specifies a database and an optional retention policy via query parameters. If the retention policy is not specified, the default retention policy is used. All points in the body will be written to that database and retention policy. Points in a POST body can be from an arbitrary number of series; points in a batch do not have to be from the same measurement or tagset.
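As an illustration (mydb is a hypothetical database name), a batch POSTed to /write?db=mydb carries one line-protocol point per line, and can mix measurements and tagsets freely:

temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035
pressure,machine=unit43,type=stamping psi=101 1434055562000000035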
When the database receives new points, it must (1) make those points durable so that they can be recovered in case of a database or server crash and (2) make the points queryable. This post focuses on the first half, making points durable.
3. Persisting points to storage
To make points durable, each batch is written and fsynced to a write-ahead log (WAL). The WAL is an append-only file that is only read during a database recovery. For space and disk IO efficiency, each batch in the WAL is compressed using snappy compression before being written to disk.
While the WAL format efficiently makes incoming data durable, it is an exceedingly poor format for reading, making it unsuitable for supporting queries. To make new data immediately queryable, incoming points are also written to an in-memory cache. The cache is an in-memory data structure that is optimized for query and insert performance. The cache data structure is a map from series to a time-sorted list of fields.
The WAL makes new points durable. The cache makes new points queryable. If the system crashes or shuts down before the cache is written to TSM files, the cache is rebuilt when the database starts by reading and replaying the batches stored in the WAL.
The combination of WAL and cache works well for incoming data but is insufficient for long-term storage. Since the WAL must be replayed on startup, it is important to constrain it to a reasonable size. The cache is limited to the size of RAM, which is also undesirable for many time series use cases. Consequently, data needs to be organized and written to long-term storage blocks on disk that are size-efficient (so that the database can store a lot of points) and efficient for query.
Time series queries are frequently aggregations over time: scans of points within a bounded time range that are then reduced by a summary function like mean, max, or moving windows. Columnar database storage techniques, where data is organized on disk by column and not by row, fit this query pattern nicely. Additionally, columnar systems compress data exceptionally well, satisfying the need to store data efficiently. There is a lot of literature on column stores; "Columnar-oriented Database Systems" is one such overview.
Time series applications often evict data from storage after a period of time. Many monitoring applications, for example, will store the last month or two of data online to support monitoring queries. It needs to be efficient to remove data from the database if a configured time-to-live expires. Deleting points from columnar storage is expensive, so InfluxDB additionally organizes its columnar format into time-bounded chunks. When the time-to-live expires, the time-bounded file can simply be deleted from the filesystem rather than requiring a large update to persisted data.
Finally, when InfluxDB is run as a clustered system, it replicates data across multiple servers for availability and durability in case of failures.
The optional time-to-live duration, the granularity of time blocks within the time-to-live period, and the number of replicas are configured using an InfluxDB retention policy:
CREATE RETENTION POLICY <retention_policy_name> ON <database_name> DURATION <duration> REPLICATION <n> [SHARD DURATION <duration>] [DEFAULT]
The duration is the optional time to live (if data should not expire, set the duration to INF). SHARD DURATION is the granularity of data within the expiration period. For example, a one-hour shard duration with a 24-hour duration configures the database to store 24 one-hour shards. Each hour, the oldest shard is expired (removed) from the database. Set REPLICATION to configure the replication factor: how many copies of a shard should exist within a cluster.
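For example, a hypothetical policy that keeps one day of data in one-hour shards, with two copies of each shard, and makes it the default for the database mydb:

CREATE RETENTION POLICY one_day ON mydb DURATION 24h REPLICATION 2 SHARD DURATION 1h DEFAULT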
Concretely, the database creates this physical organization of data on disk:
Database directory: /db
  Retention policy directory: /db/rp
    Shard group (time-bounded; logical, not a directory)
      Shard directory: /db/rp/<shard id>
        TSM0001.tsm (data file)
        TSM0002.tsm (data file)
        …
The in-memory cache is flushed to disk in the TSM format. When the flush completes, flushed points are removed from the cache and the corresponding WAL is truncated. (The WAL and cache are also maintained per shard.) The TSM data files store the columnar-organized points. Once written, a TSM file is immutable. A detailed description of the TSM file layout is available in the InfluxDB documentation.
4. Compacting persisted points
The cache is a relatively small amount of data. The TSM columnar format works best when it can store long runs of values for a series in a single block. A longer run produces both better compression and fewer seeks when scanning a field for a query. The TSM format is based heavily on log-structured merge-trees. New (level one) TSM files are generated by cache flushes. These files are later combined (compacted) into level two files. Level two files are further combined into level three files. Additional levels of compaction occur as the files become larger and eventually become cold (the time range they cover is no longer hot for writes). The documentation referenced above offers a detailed description of compaction.
There’s a lot of logic and sophistication in the TSM compaction code. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries.
Refer: https://www.influxdata.com/blog/influxdb-internals-101-part-one/
It is essentially key-value, with the key being time and the value being one or more fields/columns. There can also be optional indexed columns, called tags in InfluxDB, which are optimised for searching in combination with time (which is always required). At least one non-indexed field is required.
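For example (measurement, tag, and field names are invented), a typical InfluxQL query filters on an indexed tag together with the always-present time dimension:

-- Mean of the internal field over 5-minute windows for one machine.
SELECT MEAN(internal) FROM temperature
WHERE machine = 'unit42' AND time > now() - 1h
GROUP BY time(5m)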
See schema design documentation for more details.
Much like Cassandra, in fact, though InfluxDB essentially derives its schema on write, while developers write the schema up front for Cassandra.
Storage-engine-wise it is again very similar to Cassandra, using a variation of the SSTables used in Cassandra, optimised for time series data.
I am not sure if the following influx document was there when you were looking for your answer:
https://docs.influxdata.com/influxdb/v1.5/concepts/key_concepts/
But it really helped me understand the data structure of InfluxDB.
BASE stands for 'Basically Available, Soft state, Eventually consistent'
So, I've come this far: "Basically Available: the system is available, but not necessarily all items in it at any given point in time" and "Eventually Consistent: after a certain time all nodes are consistent, but at any given time this might not be the case" (please correct me if I'm wrong).
But, what is meant exactly by 'Soft State'? I haven't been able to find any decent explanations on the internet yet.
This page (originally here, now available only from the web archive) may help:
[soft state] is information (state) the user put into the system that
will go away if the user doesn't maintain it. Stated another way, the
information will expire unless it is refreshed.
By contrast, the position of a typical simple light-switch is
"hard-state". If you flip it up, it will stay up, possibly forever. It
will only change back to down when you (or some other user) explicitly
comes back to manipulate it.
The BASE acronym is a bit contrived, and most NoSQL stores don't actually require data to be refreshed in this way. There's another explanation suggesting that soft-state means that the system will change state without user intervention due to eventual consistency (but then the soft-state part of the acronym is redundant).
There are some specific usages where state must indeed be refreshed by the user; for example, in the Cassandra NoSQL database, one can give all rows a time-to-live to make them completely soft-state (they will expire unless refreshed), but this is an unusual mode of usage (a transient cache, essentially).
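A sketch in CQL, with invented table and column names:

-- This row is genuinely soft-state: it expires after an hour
-- unless the client rewrites it to refresh the TTL.
INSERT INTO sessions (session_id, session_data)
VALUES (42, 'abc')
USING TTL 3600;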
"Soft-state" might also apply to the gossip protocol within Cassandra; a new node can determine the state of the cluster from the gossip messages it receives, and this cluster state must be constantly refreshed to detect unresponsive nodes.
I was taught in classes that "soft state" means that the state of the system could change over time (even during periods without input), because there may be changes going on due to "eventual consistency". That's why it is called "soft" state.
Some source: link
Soft state means data that is not persisted on the disk, yet in case of failure it could be possible to restore it (e.g. recreate a lower quality image from a high quality one). A good article that addresses this and other interesting issues is Cluster-Based Scalable Network Services
A BASE system gives up on consistency to improve the performance of the database. Hence most of the famous NoSQL databases are more highly available and scalable than ACID-compliant relational databases.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time.
For example, consider two systems A and B. If a user writes data to system A, there will be some delay before the written data is reflected on B, usually within milliseconds (depending on the network speed and the design of the synchronization).
Do I understand correctly that table/row lock hints are being used for pessimistic transaction (TX) isolation models of concurrency ONLY?
In other words, when can table/row lock hints be used while the optimistic TX isolation provided by SQL Server (2005 and higher) is in effect?
When would one need pessimistic TX isolation levels/hints in SQL Server 2005+ if the latter provides built-in optimistic (aka snapshot, aka versioning) concurrency isolation?
I did read that pessimistic options are legacy and are not needed anymore, though I am in doubt.
Also, given the optimistic (aka snapshot, aka versioning) TX isolation levels built into SQL Server 2005+,
when would one need to manually code for optimistic concurrency features?
The last question is inspired by having read:
"Optimistic Concurrency in SQL Server" (September 28, 2007)
describing custom coding to provide versioning in SQL Server.
Optimistic concurrency requires more resources and is more expensive when the conflict occurs.
Two sessions can read and modify the values and the conflict only occurs when they try to apply their changes simultaneously. This means that in case of the concurrent update both values should be stored somewhere (which of course requires resources).
Also, when a conflict occurs, usually the whole transaction should be rolled back or the cursor refetched, which is expensive too.
The pessimistic concurrency model uses locking, thus reducing concurrency but improving performance.
In case of two concurrent tasks, it may be cheaper for the second task to wait for a lock to be released than to spend CPU time and disk I/O on two simultaneous pieces of work, and then yet more on rolling back the less fortunate one and redoing it.
Say, you have a query like this:
UPDATE mytable
SET myvalue = very_complex_function(@range)
WHERE rangeid = @range
, with very_complex_function reading some data from mytable itself. In other words, this query transforms the subset of mytable sharing the same value of rangeid.
Now, when two such queries work on the same range, there are two possible scenarios:
Pessimistic: the first query locks, the second query waits for it. The first query completes in 10 seconds, the second one does too. Total: 20 seconds.
Optimistic: both queries work independently (on the same input). This shares CPU time between them, plus some overhead for switching. They have to keep their intermediate data somewhere, so the data is stored twice (which implies twice the I/O or memory). Let's say both complete almost at the same time, in 15 seconds.
But when it's time to commit the work, the second query will conflict and will have to roll back its changes (say, that takes the same 15 seconds). Then it needs to reread the data and do the work again with the new set of data (10 seconds).
As a result, both queries complete later than with a pessimistic locking: 15 and 40 seconds vs. 10 and 20.
When would one need pessimistic TX isolation levels/hints in SQL Server 2005+ if the latter provides built-in optimistic (aka snapshot, aka versioning) concurrency isolation?
Optimistic isolation levels are, well, optimistic. You should not use them when you expect high contention on your data.
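As a hedged sketch (database, table, and column names are placeholders): snapshot isolation is enabled per database, while pessimistic hints are applied per query where contention is expected:

-- Optimistic: allow transactions to run under SNAPSHOT isolation.
ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Pessimistic alternative for a hot row: take and hold an update lock
-- so no other session can modify it between our read and our write.
SELECT myvalue FROM mytable WITH (UPDLOCK, HOLDLOCK) WHERE rangeid = 1;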
BTW, optimistic isolation (for the read queries) was available in SQL Server 2000 too.
I have a detailed answer here: Developing Modifications that Survive Concurrency
I think there's a bit of confusion over terminology here.
Optimistic locking/optimistic concurrency/... is a programming technique used to avoid the following scenario:
start transaction
read data, setting a "read" lock on it to prevent any deletes/modifications to our data
display data on user's screen
await user input, lock remains active
keep awaiting user input, lock still preventing any writes/modifications
user input never comes (for whatever reason)
transaction times out (and this is usually not very rapid, as the user must be given reasonable time to enter their input).
Optimistic locking replaces this with the following:
start transaction READ
read data, setting a "read" lock on it to prevent any deletes/modifications to our data
end transaction READ, releasing the read lock just set
display data on user's screen
await user input, but data can be modified/deleted meanwhile by other transactions
user input arrives
start transaction WRITE
verify that the data has remained unaltered, raising an exception if it has changed
apply user updates
end transaction WRITE
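A minimal T-SQL sketch of that WRITE transaction, assuming the table carries a row_version column that was read together with the data in the READ transaction (all names are hypothetical):

-- Values the application captured during the READ transaction.
DECLARE @id INT = 1, @new_value INT = 42, @old_version INT = 7;

BEGIN TRANSACTION;
UPDATE mytable
   SET myvalue = @new_value,
       row_version = row_version + 1
 WHERE id = @id
   AND row_version = @old_version;  -- matches 0 rows if the data changed meanwhile

IF @@ROWCOUNT = 0
    ROLLBACK TRANSACTION;           -- conflict: refetch and retry the user update
ELSE
    COMMIT TRANSACTION;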
So the single "user transaction" to go fetch some data, and change and update them, consists of two distinct "database transactions". What is usually called "isolation levels" applies to those database transactions. The "optimistic locking" that you refer to applies to the "user transaction".
The matter is further complicated in that, broadly speaking, two completely distinct strategies are possible for the "isolating the database transactions" part:
MVCC
2-phase locking
I think the "snapshot versioning isolation level" means that the MVCC technique (well, one of its various possible variations) is being used for the database transaction. The other commonly known isolation levels apply more to transaction isolation using 2PL as the serialization(/isolation) technique. (And mixing them up can get messy ...)