Understanding SSTable immutability

I'm trying to better understand the immutability of SSTables in Cassandra. It's very clear what happens on an insert, and on an update/delete when the data still exists in the memtable. But it's not clear what happens when I want to modify data that has already been flushed to disk.
So I understand the simple scenario: I execute an insert operation and the data is written to a memtable. When the memtable is full, it's flushed to an SSTable.
Now, how does modification of data occur? What happens when I execute a delete or update command after the data has been flushed? If the SSTable is immutable, how does the data get deleted/updated? And how does the memtable behave on delete and update commands for data that no longer exists in it because it has been flushed? What will the memtable contain?

In Cassandra / Scylla you ALWAYS append. Any operation, whether it's an insert, update, or delete, will create a new entry for that partition containing the new data and a new timestamp. In the case of a delete operation, the new entry will actually be a tombstone with the new timestamp (indicating that the previous data was deleted). This applies whether the data is still in memory (memtable) or has already been flushed to disk as an SSTable.
Several "versions" of the same partition with different data and different timestamps can reside in multiple SSTables (and even in memory) at the same time. SSTables are merged during compaction, and there are several compaction strategies that can be applied.
Once gc_grace_seconds (default: 10 days, tunable) has expired, the next compaction will remove the tombstone, meaning that neither the data that was deleted nor the tombstone indicating the latest action (the delete) will get merged into the new SSTable.
The internal implementation of the memtables might be slightly different between Scylla and Cassandra, but for the sake of simplicity let's assume it is the same.
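To make the append-only behavior concrete, here is a minimal sketch using the Python cassandra-driver; the contact point, the keyspace "ks" and the table "users" are placeholders, not something from your question.

# Minimal sketch, assuming a local Cassandra/Scylla node and an existing
# keyspace "ks". Every statement below is an append; nothing rewrites an
# already-flushed SSTable in place.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ks")
session.execute("CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text)")

session.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
# ...suppose the memtable is flushed to an SSTable here (e.g. `nodetool flush`)...

# This UPDATE does not touch that SSTable. It writes a new cell with a newer
# timestamp, first into the memtable and later into a *new* SSTable.
session.execute("UPDATE users SET name = 'bob' WHERE id = 1")

# A read merges all versions it finds (memtable + SSTables); the newest
# timestamp wins.
print(session.execute("SELECT name FROM users WHERE id = 1").one().name)  # bob

# A DELETE is also an append: it writes a tombstone with a newer timestamp,
# which shadows the older data until compaction removes both.
session.execute("DELETE FROM users WHERE id = 1")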
You are welcome to read more about the architecture in the following documentation:
SSTables
Compaction strategies


How exactly does streaming data to PostgreSQL through STDIN work?

Let's say I am using COPY to stream data into my database.
COPY some_table FROM STDIN
I noticed that AFTER the stream had finished, the database needed a significant amount of time to process this data and insert the rows into the table. In pgAdmin's monitoring I can see that there are nearly 0 table writes throughout the streaming process, and then suddenly everything is written in one peak.
Some statistics:
I am inserting 450k rows into one table without indexes or keys,
the table has 28 fields,
I am sending all NULLs for every field.
I am worried that there is a problem with my implementation of the stream. Is this how streaming works? Does the database wait to gather all the text and then execute one gigantic command?
COPY inserts the rows as they are sent, so the data are really streamed. But PostgreSQL doesn't write them to disk immediately: rather, it only writes transaction log (WAL) information to disk, and the actual rows are written to the shared memory cache. The data are persisted later, during the next checkpoint. There is a delay between the start of COPY and actual writing to disk, which could explain what you observe.
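For what it's worth, here is a rough sketch of the client side with psycopg2 (the connection string is a placeholder, and the NULL formatting is an assumption about your data, not taken from your setup). The server applies the rows as they arrive; it only defers flushing the table pages until the next checkpoint.

# Minimal sketch, assuming psycopg2 and a table "some_table" with 28 nullable
# columns. COPY text format uses tab-separated columns and \N for NULL.
import io
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
cur = conn.cursor()

def null_rows(n_rows, n_cols=28):
    line = "\t".join([r"\N"] * n_cols) + "\n"
    return io.StringIO(line * n_rows)

# The rows are inserted as the stream is consumed; only the WAL is forced to
# disk right away, the heap pages are written out at the next checkpoint.
cur.copy_expert("COPY some_table FROM STDIN", null_rows(450_000))
conn.commit()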
The monitoring charts provided in pgAdmin are not fit for the purpose you are putting them to. Most of that data is coming from the stats collector, and that is generally only updated once per statement or transaction, at the end of the statement or transaction. So they are pretty much useless for monitoring rare, large, ongoing operations.
The type of monitoring you want to do is best done with OS tools, like top, vmstat, or sar.

Snowflake: Concurrent queries with CREATE OR REPLACE

When running a CREATE OR REPLACE TABLE AS statement in one session, are other sessions able to query the existing table, before the transaction opened by CORTAS is committed?
From reading the usage notes section of the documentation, it appears this is the case. Ideally I'm looking for someone who's validated this in practice and at scale, with a large number of read operations on the target table.
Using OR REPLACE is the equivalent of using DROP TABLE on the existing table and then creating a new table with the same name; however, the dropped table is not permanently removed from the system. Instead, it is retained in Time Travel. This is important to note because dropped tables in Time Travel can be recovered, but they also contribute to data storage for your account. For more information, see Storage Costs for Time Travel and Fail-safe.
In addition, note that the drop and create actions occur in a single atomic operation. This means that any queries concurrent with the CREATE OR REPLACE TABLE operation use either the old or new table version.
Recreating or swapping a table drops its change data. Any stream on the table becomes stale. In addition, any stream on a view that has this table as an underlying table becomes stale. A stale stream is unreadable.
I have not proven it via performance tests, but we did run this way for 5 years, reading from tables on one set of warehouses while rebuilding the underlying tables on others, and we never noticed any "corruption of results".
I always thought of Snowflake like double buffering in computer graphics: you have the active buffer that the video signal is reading from (the existing table state), you write to the back buffer while a MERGE/INSERT/UPDATE/DELETE is running, and when that write transaction is complete the active "current page/files/buffer" is flipped, so all reads going forward are from the "new" state.
Given that the files are immutable, the double-buffer analogy holds really well (this is also how Time Travel works). Thus there is just a "global state of what is current" maintained in the metadata.
As for CORTAS and transactions: since it is a DDL operation, I would assume it completes any open transactions, as all DDL operations do. So that is one hiccup to keep in mind on top of my double-buffer story.
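A minimal way to picture (and to try) the behavior described above, using the Python Snowflake connector; the credentials, database, schema and table names are placeholders.

# Minimal sketch, assuming snowflake-connector-python and credentials in
# environment variables. Session A rebuilds the table atomically while
# session B keeps reading; B sees either the old or the new version.
import os
import snowflake.connector

def connect():
    return snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        database="MY_DB",
        schema="PUBLIC",
    )

session_a, session_b = connect(), connect()

# Session A: drop-and-create in a single atomic operation (CORTAS).
session_a.cursor().execute(
    "CREATE OR REPLACE TABLE daily_sales AS SELECT * FROM daily_sales_staging"
)

# Session B: a concurrent read resolves against whichever table version is
# "current" in the metadata when the query starts.
for (row_count,) in session_b.cursor().execute("SELECT COUNT(*) FROM daily_sales"):
    print(row_count)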

Why aren't read replica databases just as slow as the main database? Do they not suffer the same "write burden" as they must be in sync?

My understanding: a read replica database exists to allow read volumes to scale.
So far, so good, lots of copies to read from - ok, that makes sense, share the volume of reads between a bunch of copies.
However, the things I'm reading seem to imply "tada! magic fast copies!". How are the copies faster, as surely they must also be burdened by the same amount of writing as the main db in order that they remain in sync?
How are the copies faster, as surely they must also be burdened by the same amount of writing as the main db in order that they remain in sync?
Good question.
First, the writes to the replicas may be more efficient than the writes to the primary if the replicas are maintained by replaying the Write-Ahead Logs into the secondaries (sometimes called a "physical replica"), instead of replaying the queries into the secondaries (sometimes called a "logical replica"). A physical replica doesn't need to do any query processing to stay in sync, and may not need to read the target database blocks/pages into memory in order to apply the changes, leaving more of the memory and CPU free to process read requests.
Even a logical replica might be able to apply changes more cheaply than the primary, as a query on the primary of the form
update t set status = 'a' where status = 'b'
might get replicated as a series of
update t set status = 'a' where id = ?
saving the replica from having to identify which rows to update.
Second, the secondaries allow the read workload to scale across more physical resources. So total read workload is spread across more servers.
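In application code this usually looks something like the sketch below (psycopg2, with made-up host names): the primary still takes every write, but the read traffic is spread across the replicas.

# Minimal sketch, assuming psycopg2 and hypothetical hosts "db-primary",
# "db-replica-1", "db-replica-2". Every server applies the same writes, but
# each replica only serves a fraction of the reads.
import random
import psycopg2

primary = psycopg2.connect(host="db-primary", dbname="app")
replicas = [
    psycopg2.connect(host="db-replica-1", dbname="app"),
    psycopg2.connect(host="db-replica-2", dbname="app"),
]

def write(sql, params=()):
    with primary, primary.cursor() as cur:  # commits on success
        cur.execute(sql, params)

def read(sql, params=()):
    conn = random.choice(replicas)          # naive load balancing
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()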

Handling duplicates in RocksDB

I want to use RocksDB and wanted to know explicitly how it handles duplicates.
The documentation says:
The entire database is stored in a set of sstfiles. When a memtable is full,
its content is written out to a file in Level-0 (L0). RocksDB removes
duplicate and overwritten keys in the memtable when it is flushed to a file in L0.
Now, in the case of having an environment with multiple databases, I couldn't find a description.
Are the keys, in this case, unique environment-wide, or does every database have its own keys? I couldn't find a description of the behavior for the whole environment.
Short answer to your question: there's a background process called compaction, which will periodically merge a couple of sst files into a single sorted run (this sorted run can be represented as multiple sst files, but each with a disjoint key range). During this compaction process, duplicate keys are handled.
Here's the long answer to your question:
RocksDB is an LSM database. When a key-value pair is written to RocksDB, it simply creates a data entry for it and appends it to an in-memory buffer called the MemTable.
When the MemTable becomes full, RocksDB sorts all the keys and flushes them as a single sst file. As we keep writing more data, more sst files get flushed, and these sst files usually have overlapping key ranges. At this point, suppose we have N sst files and a read request comes in. The read must check all N of these sst files to see whether they contain the requested key, since each sst file can have an overlapping key range. As a result, without any process to reorganize these sst files, reads become slower as we keep writing more data.
The process that reorganizes these sst files is called compaction, which is essentially a multi-way merge-sort-like operation that takes multiple sst files as input and outputs a single sorted run. During the compaction process, RocksDB sorts all the keys from the input sst files, merges duplicate data entries, and deletes data entries when it finds a matching deletion entry.
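Here is a small sketch of what that means from the API side (using the python-rocksdb bindings; the paths and keys are made up): duplicates are simply shadowed by the newest entry until a flush or compaction physically drops the older ones, and separate databases have completely independent key spaces.

# Minimal sketch, assuming the python-rocksdb bindings.
import rocksdb

db = rocksdb.DB("example.db", rocksdb.Options(create_if_missing=True))

db.put(b"user:1", b"alice")
db.put(b"user:1", b"bob")    # duplicate key: only the newest entry is visible
db.delete(b"user:1")         # a delete is also an append (a tombstone entry)
db.put(b"user:1", b"carol")

print(db.get(b"user:1"))     # b'carol'; older entries are dropped on flush/compaction

# A second database (a different directory) is a completely separate key space.
other = rocksdb.DB("other.db", rocksdb.Options(create_if_missing=True))
print(other.get(b"user:1"))  # None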

Implications of SSTable immutability in Cassandra for disk usage

According to:
http://www.datastax.com/docs/1.0/ddl/column_family#about-column-family-compression
The reason RDBMSs see a performance degradation as a result of compression is because the data being over-written must be seeked on disk, decompressed, over-written, and then recompressed. On the other hand, Cassandra can see performance increase for reads and writes because the SSTable is immutable, so no records are ever over-written and the overhead is thus much smaller than for a compressed RDBMS.
I'm wondering, what are the implications of this over the long term, as a Cassandra data store continues to grow? It seems like the only consequence is an ever-growing need for more disk space, is this correct?
Periodically Cassandra will run a compaction process on your existing SSTables. Compaction merges multiple SSTables into one new, larger SSTable, discarding obsolete data. After compaction has occurred, Cassandra will (eventually) delete the old SSTables.
So if the size of your data set is stable, your total SSTable size will not grow indefinitely. The Cassandra wiki contains more information on compaction.
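If you want to influence how aggressively that merging happens, the compaction strategy is set per table. A minimal sketch with the Python driver follows; the keyspace "ks" and table "events" are placeholders.

# Minimal sketch, assuming the cassandra-driver and an existing table
# "ks.events". Compaction itself runs automatically in the background;
# this only chooses the strategy it uses.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ks")
session.execute("""
    ALTER TABLE events WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'min_threshold': '4'
    }
""")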
