Frequently Updated Table in Cassandra - database

I am doing an IoT sensor based project. Each sensor sends data to the server every minute, and I am expecting a maximum of 100k sensors in the future.
I am logging the data sent by each sensor in a history table, but I also have a Live Information table in which the latest status of each sensor is updated.
So I want to update the row corresponding to each sensor in the Live table every minute.
Is there any problem with this? I read that frequent update operations are bad in Cassandra.
Is there a better way?
I am already using Redis in my project for storing sessions etc. Should I move this Live table to Redis?

This is what you're looking for: https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_memtable_thruput_c.html
How you tune memtable thresholds depends on your data and write load. Increase memtable throughput under either of these conditions:
The write load includes a high volume of updates on a smaller set of data.
A steady stream of continuous writes occurs. This action leads to more efficient compaction.
So increasing commitlog_total_space_in_mb will make Cassandra flush memtables to disk less often. This means most of your updates will happen in memory only and you will end up with fewer duplicate copies of the data on disk.
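To make that concrete, here is a minimal CQL sketch (table and column names are made up for illustration); each per-minute write is just an upsert on the sensor's partition key, so the row is overwritten in the memtable rather than piling up versions on disk before a flush:
-- hypothetical "live" table: exactly one row per sensor, keyed by sensor id
CREATE TABLE live_status (
    sensor_id text PRIMARY KEY,
    last_seen timestamp,
    reading   double
);
-- the per-minute "update" is simply an INSERT on the same primary key (an upsert)
INSERT INTO live_status (sensor_id, last_seen, reading)
VALUES ('sensor-001', '2016-01-01 12:00:00+0000', 21.4);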

In Cassandra there are consistency levels for reads and consistency levels for writes. If you are going to have only one node this does not apply, zero problems, but if you are going to use more than one data center or rack you need to raise the read consistency level to guarantee that what you retrieve is the latest version of the updated row, or use a high consistency level on the write side. In my case I'm using ANY to write and QUORUM to read. This allows me to write with all nodes except one down, and to read with 51% of the nodes up. This is a trade-off in the CAP theorem. Please take a look at:
http://docs.datastax.com/en/cassandra/latest/cassandra/dml/dmlConfigConsistency.html
https://wiki.apache.org/cassandra/ArchitectureOverview
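To illustrate the two levels (a sketch only, reusing the hypothetical live_status table from the answer above; cqlsh applies one session-wide level at a time, whereas most drivers let you set it per statement):
-- in cqlsh: write cheaply, then read from a majority of replicas
CONSISTENCY ANY;
INSERT INTO live_status (sensor_id, last_seen, reading)
VALUES ('sensor-001', '2016-01-01 12:01:00+0000', 21.5);
CONSISTENCY QUORUM;
SELECT * FROM live_status WHERE sensor_id = 'sensor-001';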

Related

InfluxDB data structure & database model

Can you please tell me which data structure InfluxDB has and which data model InfluxDB uses? Is it a key-value model? I read the full documentation and I didn't catch that.
Thank you in advance!
1. Data model and terminology
An InfluxDB database stores points. A point has four components: a measurement, a tagset, a fieldset, and a timestamp.
The measurement provides a way to associate related points that might have different tagsets or fieldsets. The tagset is a dictionary of key-value pairs to store metadata with a point. The fieldset is a set of typed scalar values—the data being recorded by the point.
The serialization format for points is defined by the [line protocol] (which includes additional examples and explanations if you’d like to read more detail). An example point from the specification helps to explain the terminology:
temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035
The measurement is temperature.
The tagset is machine=unit42,type=assembly. The keys, machine and type, in the tagset are called tag keys. The values, unit42 and assembly, in the tagset are called tag values.
The fieldset is internal=32,external=100. The keys, internal and external, in the fieldset are called field keys. The values, 32 and 100, in the fieldset are called field values.
Each point is stored within exactly one database within exactly one retention policy. A database is a container for users, retention policies, and points. A retention policy configures how long InfluxDB keeps points (duration), how many copies of those points are stored in the cluster (replication factor), and the time range covered by shard groups (shard group duration). The retention policy makes it easy for users (and efficient for the database) to drop older data that is no longer needed. This is a common pattern in time series applications.
We’ll explain replication factor, shard groups, and shards later when we describe how the write path works in InfluxDB.
There’s one additional term that we need to get started: series. A series is simply a shortcut for saying retention policy + measurement + tagset. All points with the same retention policy, measurement, and tagset are members of the same series.
You can refer to the [documentation glossary] for these terms or others that might be used in this blog post series.
2. Receiving points from clients
Clients POST points (in line protocol format) to InfluxDB’s HTTP /write endpoint. Points can be sent individually; however, for efficiency, most applications send points in batches. A typical batch ranges in size from hundreds to thousands of points. The POST specifies a database and an optional retention policy via query parameters. If the retention policy is not specified, the default retention policy is used. All points in the body will be written to that database and retention policy. Points in a POST body can be from an arbitrary number of series; points in a batch do not have to be from the same measurement or tagset.
When the database receives new points, it must (1) make those points durable so that they can be recovered in case of a database or server crash and (2) make the points queryable. This post focuses on the first half, making points durable.
3. Persisting points to storage
To make points durable, each batch is written and fsynced to a write ahead log (WAL). The WAL is an append only file that is only read during a database recovery. For space and disk IO efficiency, each batch in the WAL is compressed using [snappy compression] before being written to disk.
While the WAL format efficiently makes incoming data durable, it is an exceedingly poor format for reading—making it unsuitable for supporting queries. To allow immediate query ability of new data, incoming points are also written to an in-memory cache. The cache is an in-memory data structure that is optimized for query and insert performance. The cache data structure is a map of series to a time-sorted list of fields.
The WAL makes new points durable. The cache makes new points queryable. If the system crashes or shuts down before the cache is written to TSM files, it is rebuilt when the database starts by reading and replaying the batches stored in the WAL.
The combination of WAL and cache works well for incoming data but is insufficient for long-term storage. Since the WAL must be replayed on startup, it is important to constrain it to a reasonable size. The cache is limited to the size of RAM, which is also undesirable for many time series use cases. Consequently, data needs to be organized and written to long-term storage blocks on disk that are size-efficient (so that the database can store a lot of points) and efficient for query.
Time series queries are frequently aggregations over time—scans of points within a bounded time range that are then reduced by a summary function like mean, max, or moving windows. Columnar database storage techniques, where data is organized on disk by column and not by row, fit this query pattern nicely. Additionally, columnar systems compress data exceptionally well, satisfying the need to store data efficiently. There is a lot of literature on column stores. [Columnar-oriented Database Systems] is one such overview.
Time series applications often evict data from storage after a period of time. Many monitoring applications, for example, will store the last month or two of data online to support monitoring queries. It needs to be efficient to remove data from the database if a configured time-to-live expires. Deleting points from columnar storage is expensive, so InfluxDB additionally organizes its columnar format into time-bounded chunks. When the time-to-live expires, the time-bounded file can simply be deleted from the filesystem rather than requiring a large update to persisted data.
Finally, when InfluxDB is run as a clustered system, it replicates data across multiple servers for availability and durability in case of failures.
The optional time-to-live duration, the granularity of time blocks within the time-to-live period, and the number of replicas are configured using an InfluxDB retention policy:
CREATE RETENTION POLICY <retention_policy_name> ON <database_name> DURATION <duration> REPLICATION <n> [SHARD DURATION <duration>] [DEFAULT]
The duration is the optional time to live (if data should not expire, set duration to INF). SHARD DURATION is the granularity of data within the expiration period. For example, a one-hour shard duration with a 24-hour duration configures the database to store 24 one-hour shards. Each hour, the oldest shard is expired (removed) from the database. Set REPLICATION to configure the replication factor—how many copies of a shard should exist within a cluster.
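As a concrete (made-up) example matching the description above, a policy that keeps 24 hours of data in one-hour shards with a single replica would be:
CREATE RETENTION POLICY "one_day" ON "telemetry" DURATION 24h REPLICATION 1 SHARD DURATION 1h DEFAULT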
Concretely, the database creates this physical organization of data on disk:
Database directory: /db
  Retention policy directory: /db/rp
    Shard group (time-bounded, logical)
      Shard directory: /db/rp/<shard id>
        TSM0001.tsm (data file)
        TSM0002.tsm (data file)
        …
The in-memory cache is flushed to disk in the TSM format. When the flush completes, flushed points are removed from the cache and the corresponding WAL is truncated. (The WAL and cache are also maintained per-shard.) The TSM data files store the columnar-organized points. Once written, a TSM file is immutable. A detailed description of the TSM file layout is available in the [InfluxDB documentation].
4. Compacting persisted points
The cache is a relatively small amount of data. The TSM columnar format works best when it can store long runs of values for a series in a single block. A longer run both compresses better and reduces the seeks needed to scan a field for a query. The TSM format is based heavily on log-structured merge-trees. New (level one) TSM files are generated by cache flushes. These files are later combined (compacted) into level two files. Level two files are further combined into level three files. Additional levels of compaction occur as the files become larger and eventually become cold (the time range they cover is no longer hot for writes). The documentation reference above offers a detailed description of compaction.
There’s a lot of logic and sophistication in the TSM compaction code. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries.
Refer: https://www.influxdata.com/blog/influxdb-internals-101-part-one/
It is essentially key-value, with the key being time and the value being one or more fields/columns. A point can also have optional indexed columns, called tags in InfluxDB, which are optimised for searching along with time (which is always required). At least one non-indexed value (a field) is required.
See schema design documentation for more details.
Much like Cassandra, in fact, though InfluxDB essentially builds its schema on write, while developers have to define the schema up front in Cassandra.
Storage-engine-wise it is again very similar to Cassandra, using a variation of the SSTables used in Cassandra, optimised for time series data.
I am not sure if the following influx document was there when you were looking for your answer:
https://docs.influxdata.com/influxdb/v1.5/concepts/key_concepts/
But it really helped me understand the data structure of InfluxDB.

Cassandra read write latency for large partition

For a large Cassandra partition, read latencies are usually huge.
But does write latency get impacted in this case? Since Cassandra is a columnar database and holds immutable data, shouldn't the write (which appends data at the end of the row) take less time?
In all the experiments I have conducted with Cassandra, I have noticed that write throughput is not affected by data size, while read performance takes a big hit if your SSTables are too big or your concurrent_reads thread count is too low (check with nodetool tpstats whether ReadStage is going into a pending state, and increase concurrent_reads in the cassandra.yaml file if so). Using LeveledCompaction seems to help, as data for the same key remains in the same SSTable. Make sure your data is distributed evenly across all nodes. Cassandra optimization is tricky and you may have to implement "hacks" to obtain the desired performance on the minimum possible hardware.
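For reference, LeveledCompaction is configured per table; a minimal CQL sketch (the keyspace and table names are hypothetical):
ALTER TABLE sensors.history
WITH compaction = {'class': 'LeveledCompactionStrategy'};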

Possible bottlenecks when inserting and updating BYTEA rows?

The project requires storing binary data in a PostgreSQL database (project requirement). For that purpose we made a table with the following columns:
id : integer, primary key, generated by client
data : bytea, for storing client binary data
The client is a C++ program, running on Linux.
The rows must be inserted (initialized with a chunk of binary data), and after that updated (concatenating additional binary data to the data field).
Simple tests have shown that this yields better performance.
Depending on your input, we will make the client use either concurrent threads to insert/update data (with different DB connections), or a single thread with only one DB connection.
We don't have much experience with PostgreSQL, so could you give us some pointers concerning possible bottlenecks, and whether using multiple threads to insert data is better than using a single thread?
Thank you :)
Edit 1:
More detailed information:
there will be only one client accessing the database, using only one Linux process
the database and the client are on the same high-performance server, but this must not matter; the client must be fast no matter the machine, without additional client configuration
we will get a new stream of data every 10 seconds; each stream will provide 16000 new bytes per 0.5 seconds (CBR, but we can use buffering and only do inserts every 4 seconds at most)
a stream will last anywhere between 10 seconds and 5 minutes
It makes extremely little sense that you should get better performance inserting a row and then appending to it if you are using bytea.
PostgreSQL's MVCC design means that an UPDATE is logically equivalent to a DELETE and an INSERT. When you insert the row then update it, what's happening is that the original tuple you inserted is marked as deleted and a new tuple is written that contains the concatenation of the old and added data.
I question your testing methodology - can you explain in more detail how you determined that insert-then-append was faster? It makes no sense.
Beyond that, I think this question is too broad as written to really say much of use. You've given no details or numbers; no estimates of binary data size, rowcount estimates, client count estimates, etc.
bytea insert performance is no different to any other insert performance tuning in PostgreSQL. All the same advice applies: Batch work into transactions, use multiple concurrent sessions (but not too many; rule of thumb is number_of_cpus + number_of_hard_drives) to insert data, avoid having transactions use each others' data so you don't need UPDATE locks, use async commit and/or a commit_delay if you don't have a disk subsystem with a safe write-back cache like a battery-backed RAID controller, etc.
Given the updated stats you provided in the main comments thread, the amount of data you want to consume sounds entirely practical with appropriate hardware and application design. Your peak load might be achievable even on a plain hard drive if you had to commit every block that came in, since it'd require about 60 transactions per second. You could use a commit_delay to achieve group commit and significantly lower fsync() overhead, or even use synchronous_commit = off if you can afford to lose a time window of transactions in case of a crash.
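A rough sketch of the two knobs mentioned (the values are placeholders; commit_delay is in microseconds, the right value depends on your workload, and setting it may require elevated privileges):
SET synchronous_commit = off;  -- a crash can lose the last few transactions, but never corrupt data
SET commit_delay = 10000;      -- allow up to 10 ms for several commits to share one fsync()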
With a write-back caching storage device like a battery-backed cache RAID controller or an SSD with reliable power-loss-safe cache, this load should be easy to cope with.
I haven't benchmarked different scenarios for this, so I can only speak in general terms. If designing this myself, I'd be concerned about checkpoint stalls with PostgreSQL, and would want to make sure I could buffer a bit of data. It sounds like you can so you should be OK.
Here's the first approach I'd test, benchmark and load-test, as it's in my view probably the most practical:
One connection per data stream, synchronous_commit = off + a commit_delay.
INSERT each 16kb record, as it comes in, into a staging table (if possible UNLOGGED or TEMPORARY, if you can afford to lose incomplete records) and let Pg synchronize and group up commits. When each stream ends, read the byte arrays, concatenate them, and write the record to the final table.
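A minimal sketch of the tables that approach implies (names chosen to match the aggregate query below; everything here is illustrative, not a prescribed schema):
-- staging table: one row per incoming 16kb block; UNLOGGED keeps WAL overhead low
CREATE UNLOGGED TABLE temp_stream_table (
    stream_id  integer NOT NULL,
    seq        integer NOT NULL,   -- block order within the stream
    data_block bytea   NOT NULL,
    PRIMARY KEY (stream_id, seq)
);
-- final table: one row per completed stream
CREATE TABLE final_table (
    stream_id integer PRIMARY KEY,
    data      bytea   NOT NULL
);
Note that the aggregate query shown below does not guarantee block order by itself, so something like the seq column is needed if ordering matters.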
For absolutely best speed with this approach, implement a bytea_agg aggregate function for bytea as an extension module (and submit it to PostgreSQL for inclusion in future versions). In reality it's likely you can get away with doing the bytea concatenation in your application by reading the data out, or with the rather inefficient and nonlinearly scaling:
CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat,STYPE=bytea);
INSERT INTO final_table SELECT stream_id, bytea_agg(data_block) FROM temp_stream_table;
You would want to be sure to tune your checkpointing behaviour, and if you were using an ordinary or UNLOGGED table rather than a TEMPORARY table to accumulate those 16kb records, you'd need to make sure it was being quite aggressively VACUUMed.
See also:
Whats the fastest way to do a bulk insert into Postgres?
How to speed up insertion performance in PostgreSQL

Measuring impact of sql server index on writes

I have a large table which is both heavily read and heavily written (append-only, actually).
I'd like to get an understanding of how the indexes are affecting write speed, ideally the duration spent updating them (vs the duration spent inserting), but otherwise some sort of feel for the resources used solely for the index maintenance.
Is this something that exists in sqlserver/profiler somewhere?
Thanks.
Look at the various ...wait... columns in sys.dm_db_index_operational_stats. These account for waits on locks and latches; however, they will not account for log write times. For log writes you can do simple math based on row size (i.e. a new index that is 10 bytes wide on a table that is 100 bytes wide will add roughly 10% log write), since log write time is driven purely by the number of bytes written. The Log Flush... counters under the Databases performance object measure the current overall DB-wide log wait times.
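A sketch of such a query (these wait columns exist in sys.dm_db_index_operational_stats; pick whichever ones you care about):
SELECT OBJECT_NAME(s.object_id)  AS table_name,
       i.name                    AS index_name,
       s.leaf_insert_count,
       s.row_lock_wait_in_ms,
       s.page_latch_wait_in_ms,
       s.page_io_latch_wait_in_ms
FROM   sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS s
JOIN   sys.indexes AS i
       ON i.object_id = s.object_id AND i.index_id = s.index_id
ORDER BY s.row_lock_wait_in_ms DESC;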
Ultimately, the best measurement is a baseline comparison under a well-controlled test load.
I don't believe there is a way to find out the duration of the update, but you can check the last user update on the index by querying sys.dm_db_index_usage_stats. This will give you some key information on how often the index is queried and updated, and the datetime stamp of this information.
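For example (a sketch, filtered to the current database):
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups,
       s.user_updates, s.last_user_update
FROM   sys.dm_db_index_usage_stats AS s
JOIN   sys.indexes AS i
       ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE  s.database_id = DB_ID();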

The best way to design a Reservation based table

One of my clients has a reservation-based system, similar to airlines, running on MS SQL 2005.
The way the previous company has designed it is to create an allocation as a set of rows.
Simple Example Being:
AllocationId | SeatNumber | IsSold
1234 | A01 | 0
1234 | A02 | 0
In the process of selling a seat the system will establish an update lock on the table.
We have a problem at the moment where the locking process is running slow and we are looking at ways to speed it up.
The table is already efficiently indexed, so we are looking at a hardware solution to speed up the process. The table has about 5 million active rows and sits on a RAID 50 SAS array.
I am assuming hard disk seek time is going to be the limiting factor in speeding up update locks when you have 5mil rows and are updating 2-5 rows at a time (I could be wrong).
I've heard about people partitioning indexes over several disk arrays. Has anyone had similar experiences with trying to speed up locking? Can anyone give me some advice on a possible solution, i.e. what hardware might be upgraded or what technology we could take advantage of in order to speed up the update locks (without moving to a cluster)?
One last try…
It is clear that there are too many locks held for too long. Once the system starts slowing down due to too many locks, there is no point in starting more transactions.
Therefore you should benchmark the system to find out the optimal number of concurrent transactions, then use some queue system (or otherwise) to limit the number of concurrent transactions. SQL Server may have some settings (number of active connections etc.) to help; otherwise you will have to write this in your application code.
Oracle is good at allowing reads to bypass writes; however, SQL Server is not, as standard...
Therefore I would split the stored proc to use two transactions. The first transaction should just:
be a SNAPSHOT (or READ UNCOMMITTED) transaction
find the “Id” of the rows for the seats you wish to sell.
You should then commit (or abort) this transaction,
and use a 2nd (hopefully very short) transaction that:
is most likely READ COMMITTED (or maybe SERIALIZABLE),
selects each row for update (use a locking hint),
checks it has not been sold in the meantime (abort and start again if it has),
sets the “IsSold” flag on the row.
(You may be able to do the above in a single UPDATE statement using “IN”, and then check that the expected number of rows was updated; see the sketch below.)
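A rough T-SQL sketch of that single-statement variant (the table name, hint choice, and variable are assumptions based on the question, not the actual schema):
DECLARE @AllocationId int;
SET @AllocationId = 1234;

BEGIN TRANSACTION;

UPDATE Allocation WITH (ROWLOCK, UPDLOCK)
SET    IsSold = 1
WHERE  AllocationId = @AllocationId
  AND  SeatNumber IN ('A01', 'A02')
  AND  IsSold = 0;              -- only touch seats not sold in the meantime

IF @@ROWCOUNT = 2               -- the number of seats we expected to sell
    COMMIT TRANSACTION;
ELSE
    ROLLBACK TRANSACTION;       -- someone beat us to a seat: roll back and start again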
Sorry, sometimes you do need to understand what each type of transaction does and how locking works in detail.
If the table is smaller, then the update is shorter and the locks are held for less time.
Therefore consider splitting the table:
so you have a table that JUST contains “AllocationId” and “IsSold”.
This table could be stored as a single B-tree (an index-organized table on AllocationId).
As all the other indexes will be on the table that contains the details of the seat, no indexes should be locked by the update.
I don't think you'd get anything out of table partitioning -- the only improvement would be fewer disk reads from a smaller (shorter) index tree (each read hits each level of the index at least once, so the fewer levels, the quicker the read). However, I've got a table with a 4M+ row partition, indexed on 4 columns, with a net 10-byte key length. It fits in three index levels, with the topmost level 42.6% full. Assuming you had something similar, it seems reasonable that partitioning might only remove one level from the tree, and I doubt that's much of an improvement.
Some off-the-cuff hardware ideas:
RAID 5 (and 50) can be slower on writes because of the parity calculation. Not an issue (or so I'm told) if the disk I/O cache is large enough to handle the workload, but if that's flooded you might want to look at RAID 10.
Partition the table across multiple drive arrays. Take two (or more) RAID arrays, distribute the table across the volumes [files/filegroups, with or without table partitioning or partitioned views], and you've got twice the disk I/O speed, depending on where the data lies relative to the queries retrieving it. (If everything is on array #1 and array #2 is idle, you've gained nothing.)
Worst case, there's probably leading edge or bleeding edge technology out there that will blow your socks off. If it's critical to your business and you've got the budget, might be worth some serious research.
How long is the update lock held for?
Why is the lock on the “table” not just the “rows” being sold?
If the lock is held for more than a fraction of a second, that is likely to be your problem. SQL Server does not like you holding locks while users fill in web forms etc.
With SQL Server, you have to implement a “shopping cart” yourself, by temporarily reserving the seat until the user pays for it. E.g. add “IsReserved” and “ReservedAt” columns; then any seat that has been reserved for more than n minutes should be automatically unreserved.
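For instance, the periodic clean-up might look something like this (table and column names follow the example in the question and are assumptions; the timeout is a placeholder):
DECLARE @ReservationMinutes int;
SET @ReservationMinutes = 10;

UPDATE Allocation
SET    IsReserved = 0,
       ReservedAt = NULL
WHERE  IsReserved = 1
  AND  IsSold = 0
  AND  ReservedAt < DATEADD(minute, -@ReservationMinutes, GETDATE());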
This is a hard problem, as a shopper does not expect a seat that is in stock to be sold to someone else while he is checking out. However, you don't know if the shopper will ever complete the checkout, so how do you show it in the UI? Think about having a look at what other booking websites do, then copy one that your users already know how to use.
(Oracle can sometimes cope with locks being kept for a long time, but even Oracle is a lot faster and happier if you keep your locking short.)
I would first try to figure out why you are locking the table rather than just a row.
One thing to check is the execution plan of the UPDATE statement, to see which indexes it causes to be updated, and then make sure that row-level and page-level locks (allow_row_locks / allow_page_locks) are enabled on those indexes.
You can do so with the following statement.
SELECT allow_row_locks, allow_page_locks FROM sys.indexes WHERE name = 'IndexNameHere'
Here are a few ideas:
Make sure your data and logs are on separate spindles, to maximize write performance.
Configure your drives to only use the first 30% or so for data, and have the remainder be for backups (minimize seek / random access times).
Use RAID 10 for the log volume; add more spindles as needed for performance (write performance is driven by the speed of the log)
Make sure your server has enough RAM. Ideally, everything needed for a transaction should be in memory before the transaction starts, to minimize lock times (consider pre-caching). There are a bunch of performance counters you can check for this.
Partitioning may help, but it depends a lot on the details of your app and data...
I'm assuming that the T-SQL, indexes, transaction size, etc, have already been optimized.
In case it helps, I talk about this subject in detail in my book (including SSDs, disk array optimization, etc) -- Ultra-Fast ASP.NET.
