CommitLog Recovery with Cassandra - database

I have noted the following statement in the Cassandra documentation on commit log archive configuration:
https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configLogArchive.html
"Restore stops when the first client-supplied timestamp is greater than the restore point timestamp. Because the order in which the database receives mutations does not strictly follow the timestamp order, this can leave some mutations unrecovered."
This statement made us concerned about using point-in-time recovery based on Cassandra commit logs, since it indicates that a point-in-time recovery will not recover all mutations with a timestamp lower than the indicated restore point timestamp if mutations arrive out of timestamp order (which they will in our case).
I tried to verify this behavior with some experiments, but have not been able to reproduce it.
I did 2 experiments:
Simple row inserts
Set restore_point_in_time to 1 hour ahead in time.
insert 10 rows (using default current timestamp)
insert a row using timestamp <2 hours ahead in time>
insert 10 rows (using default current timestamp)
Now I killed my Cassandra instance, making sure it was terminated without having a chance to flush to SSTables.
During startup I could see from the Cassandra logs that it was doing CommitLog replay.
After replay I queried my table and could see that 20 rows had been recovered, but the one with the timestamp ahead of time was not inserted. Based on the documentation, though, I would have expected that only the first 10 rows had been inserted. I verified in the Cassandra log that CommitLog replay had been done.
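For reference, the future-dated insert in the experiment can be written in CQL roughly like this (keyspace, table, and column names are hypothetical, and the timestamp value is only an illustration, expressed in microseconds since the epoch):
INSERT INTO demo.events (id, payload) VALUES (1, 'row-1');      -- default (current) write timestamp
INSERT INTO demo.events (id, payload) VALUES (11, 'future-row')
    USING TIMESTAMP 1767225600000000;                           -- explicit write timestamp ahead of the restore point (illustrative value)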
Larger CommitLog split experiment
I wanted to see whether the documented behavior applied across a commitlog split/rollover.
So I set commitlog_segment_size_in_mb to 1 MB to cause the commitlog to roll over more frequently than with the 32 MB default.
I then ran a script to mass insert rows to force the commit log to split.
The result was that I inserted 12000 records, then inserted a record with a timestamp ahead of my restore_point_in_time, and then inserted 8000 records afterwards.
At about 13200 rows my commitlog rolled over to a new file.
I then again killed my Cassandra instance and restarted. Again I could see in the log that CommitLog replay was being done, and after replay I could see that all rows except the single row with a timestamp ahead of restore_point_in_time were recovered.
Notes
I did similar experiments using the commitlog_sync batch option, and to make sure my rows had not been flushed to SSTables I also tried restoring a snapshot with empty tables before starting Cassandra so it would perform commitlog replay. In all cases I got the same results.
I guess my question is whether the statement in the documentation is still valid, or whether I'm missing something in my experiments?
Any help would be greatly appreciated. I need an answer for this to be able to decide on a backup/recovery mechanism we want to implement in a larger-scale Cassandra cluster setup.
All experiments were done using Cassandra 3.11 (single-node setup) in a Docker container (the official cassandra Docker image). I ran the experiments on the image "from scratch", so no config changes were made other than what I included in the description here.

I think that it will be relatively hard to reproduce, as you'll need to make sure that some of the mutations arrive later than others, and this mostly happens when some clients have unsynchronized clocks, or nodes are overloaded and hints are replayed some time later, etc.
But this parameter may not be required at all - if you look into CommitLogArchiver.java, you can see that if this parameter is not specified, it is set to Long.MAX_VALUE, meaning that there is no upper bound and all commit logs will be replayed, and then Cassandra will handle it the standard way: "the latest timestamp wins".

Related

Change tracking SQL Server

What is the main difference between the begin_version and min_valid_version columns for a particular change-tracking-enabled table in sys.change_tracking_tables?
I am seeing the same values in both columns for every table (1000+ tables) in my database.
I also tried truncate and insert operations, but couldn't see any change.
When can we see different values in these two columns? Please give an example.
From https://learn.microsoft.com/en-us/sql/relational-databases/system-catalog-views/change-tracking-catalog-views-sys-change-tracking-tables?view=sql-server-2017,
begin_version - Version of the database when change tracking began for the table. This version usually indicates when change tracking was enabled, but this value is reset if the table is truncated.
When you enable change tracking at the database level, the version of the database at that point is considered to be 1. Every transaction that gets committed on this database increases the version sequentially. So the begin_version for each table is basically the version of the database at the time change tracking was enabled on that particular table.
min_valid_version - Minimum valid version of change tracking information that is available for the table.
Based on the retention period that you configured for change tracking auto cleanup, your min_valid_version gets updated every time the auto-cleanup thread wakes up. The change tracking auto-cleanup thread wakes up every 30 minutes in the background and updates the invalid cleanup version for all databases enabled for change tracking.
For example, if your retention period is 2 days (the default), any time this thread wakes up it will determine the maximum version of the database from the transactions that were committed 2 days ago. You can query sys.dm_tran_commit_table for more information about this; commit_ts is the version I just talked about in the lines above.
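To compare the two columns yourself, something along these lines can help (a minimal sketch using the built-in change tracking catalog view and functions):
SELECT OBJECT_NAME(ct.object_id)                        AS table_name,
       ct.begin_version,
       ct.min_valid_version,
       CHANGE_TRACKING_MIN_VALID_VERSION(ct.object_id)  AS min_valid_version_fn,
       CHANGE_TRACKING_CURRENT_VERSION()                AS current_version
FROM sys.change_tracking_tables AS ct;

-- Commit history the cleanup thread works from; commit_ts is the version mentioned above
SELECT TOP (10) commit_ts, commit_time
FROM sys.dm_tran_commit_table
ORDER BY commit_ts DESC;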

Why adding another LOOKUP transformation slows down performance significantly SSIS

I have a simple SSIS package that transfers data from a source on one server to a destination on another.
If it's a new record, it inserts it; otherwise it checks the HashByteValue column and, if it differs, updates the record.
The table contains approximately 1.5 million rows, and the update touches around 50 columns.
When I start debugging the package, nothing happens for around 2 minutes; I can't even see the green check-mark. After that I can see data start flowing through, but sometimes it stops, then flows again, then stops again, and so on.
The whole package looks like this:
But if I do just the INSERT part (without the update) then it works perfectly: 1 minute and all 1.5 million records are in the destination table.
So why does adding another LOOKUP transformation to the package, the one that drives the updates, slow down performance so significantly?
Is it something to do with memory? I am using FULL CACHE option in both lookups.
What would be the way to increase performance?
Could the reason be the Auto Growth file size?
Besides changing the AutoGrowth size to 100MB: your database log file is 29GB, which means you most likely are not doing transaction log backups.
If you're not, and you only do full backups nightly or periodically, change the recovery model of your database from Full to Simple:
Database Properties > Options > Recovery Model
Then Shrink your Log file down to 100MB using:
DBCC SHRINKFILE(Catalytic_Log, 100)
I don't think that your problem is in the lookup. The OLE DB Command is really slow in SSIS, and I don't think it is meant for massive updates of rows. Look at this answer on MSDN: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/4f1a62e2-50c7-4d22-9ce9-a9b3d12fd7ce/improve-data-load-perfomance-in-oledb-command?forum=sqlintegrationservices
To verify that the problem is not the lookup, try disabling the "OLE DB Command", rerun the process, and see how long it takes.
In my personal experience it is always better to create a Stored procedure to do the whole "dataflow" when you have to update or insert based on certain conditions. To do that you would need a Staging table and a Destination table (where you are going to load the transformed data).
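As a rough illustration of that approach (table, key, and column names below are hypothetical), the stored procedure could run a set-based update followed by an insert from the staging table:
-- Update rows whose hash has changed
UPDATE d
SET    d.Col1          = s.Col1,            -- repeat for the other updated columns
       d.HashByteValue = s.HashByteValue
FROM   dbo.DestinationTable AS d
JOIN   dbo.StagingTable     AS s ON s.BusinessKey = d.BusinessKey
WHERE  d.HashByteValue <> s.HashByteValue;

-- Insert rows that don't exist yet
INSERT INTO dbo.DestinationTable (BusinessKey, Col1, HashByteValue)
SELECT s.BusinessKey, s.Col1, s.HashByteValue
FROM   dbo.StagingTable AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.DestinationTable AS d
                   WHERE d.BusinessKey = s.BusinessKey);
(A MERGE statement is another option, but separate UPDATE and INSERT statements are often easier to tune.)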
Hope it helps.

Deleting same number of records from SQL Server database takes either 0.2 sec or 30 sec

I am using SQL Server 2008 R2. I am deleting ~5000 records from a table. While I was testing performance with the same data I have found that the deletion either takes 1 sec or 31 sec.
The test database is confidential so I cannot share it here.
I have already tried to split the load and only delete 1000 records at a time, but I still experience the deviation.
How should I continue my investigation? What could be the reason for the performance difference?
The query is simple, something like: delete from PART where INVOICE_ID = 64225
A DELETE statement uses a lot of system and transaction log resources, and that's why it can take a lot of time. If possible, try to TRUNCATE the table instead.
When you ran the DELETE for the first time, perhaps the data wasn't in the buffer pool and had to be read off storage; the second time you run it, it will already be in the buffer pool. Other factors include whether the checkpoint process is writing your changed pages out to storage. Post your query plan, captured like this (substitute your own statement for the SELECT below):
set statistics xml on
select * from sys.objects
set statistics xml off
Argh! After taking a closer look at the execution plan I noticed that one index was missing. After adding the index all executions run fast. Note that I then removed the index to test whether the problem comes back, and the performance deviations reappeared.
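For reference, based on the example query above, the missing index might have looked something like this (the actual definition in the confidential database is not shown):
CREATE NONCLUSTERED INDEX IX_PART_INVOICE_ID ON PART (INVOICE_ID)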

Change Data Capture (CDC) cleanup job only removes a few records at a time

I'm a beginner with SQL Server. For a project I need CDC to be turned on. I copy the CDC data to another (archive) database, and after that the CDC tables can be cleaned immediately. So the retention time doesn't need to be high; I just set it to 1 minute, and when the cleanup job runs (after the retention time has already passed) it appears that it only deletes a few records (the oldest ones). Why doesn't it delete everything? Sometimes it doesn't delete anything at all. After running the job a few times, the other records get deleted. I find this strange because the retention time has long passed.
I set the retention time to 1 minute (I actually wanted 0 but that was not possible) and didn't change the threshold (= 5000). I disabled the schedule since I want the cleanup job to run immediately after the CDC records are copied to my archive database, rather than at a particular time.
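For reference, these settings can be inspected and changed like this, and the cleanup job can also be started on demand (the job name is hypothetical, following the cdc.<database>_cleanup naming convention):
SELECT job_type, retention, threshold
FROM msdb.dbo.cdc_jobs;                                   -- current cleanup job settings

EXEC sys.sp_cdc_change_job @job_type = N'cleanup',
                           @retention = 1,                -- minutes
                           @threshold = 5000;

EXEC msdb.dbo.sp_start_job @job_name = N'cdc.MySourceDb_cleanup';   -- hypothetical database name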
My logic for this idea was that, for example, there will be updates in the afternoon. The task that copies CDC records to the archive database runs at 2:00 AM, and after this task the cleanup job gets called. So because of the minimal retention time, all the CDC records should be removed by the cleanup job; the retention time has passed, after all?
I also tried setting up a schedule again in the job, the way CDC is meant to be used in general. After the time had passed I checked the CDC table, and it turns out it also only deletes the oldest records. So what am I doing wrong?
I made a workaround with a new job whose task is to delete all records in the CDC tables (and I disabled the entire default CDC cleanup job). This works better, as it removes everything, but it bothers me because I want to work with the original cleanup job, and I think it should be able to work the way I want it to.
Thanks,
Kim
Rather than worrying about what's in the table, I'd use the helper functions that are created for each capture instance. Specifically, cdc.fn_cdc_get_all_changes_<capture_instance> and cdc.fn_cdc_get_net_changes_<capture_instance>. A typical workflow that I've used with these goes something like the steps below (do this for all of the capture instances). First, you'll need a table to keep processing status. I use something like:
create table dbo.ProcessingStatus (
    CaptureInstance sysname,
    LSN binary(10),        -- CDC LSNs are binary(10), as returned by the sys.fn_cdc_* functions
    IsProcessed bit
)
create unique index [UQ_ProcessingStatus]
    on dbo.ProcessingStatus (CaptureInstance)
    where IsProcessed = 0
Get the current max log sequence number (LSN) using fn_cdc_get_max_lsn.
Get the last processed LSN and increment it using fn_cdc_increment_lsn. If you don't have one (i.e. this is the first time you've processed), use fn_cdc_get_min_lsn for this instance and use that (but don't increment it!). Record whatever LSN you're using in the table and set IsProcessed = 0.
Select from whichever of the cdc.fn_cdc_get… functions makes sense for your scenario and process the results however you're going to process them.
Update IsProcessed = 1 for this run.
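A minimal sketch of these steps in T-SQL, assuming a hypothetical capture instance named dbo_MyTable and the ProcessingStatus table above, might look like this:
DECLARE @from_lsn binary(10), @to_lsn binary(10), @last_lsn binary(10)

-- Step 1: current max LSN
SET @to_lsn = sys.fn_cdc_get_max_lsn()

-- Step 2: start just after the last processed LSN, or at the instance minimum on the first run
SELECT @last_lsn = MAX(LSN)
FROM dbo.ProcessingStatus
WHERE CaptureInstance = 'dbo_MyTable' AND IsProcessed = 1

IF @last_lsn IS NULL
    SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_MyTable')   -- first run: don't increment
ELSE
    SET @from_lsn = sys.fn_cdc_increment_lsn(@last_lsn)

INSERT INTO dbo.ProcessingStatus (CaptureInstance, LSN, IsProcessed)
VALUES ('dbo_MyTable', @to_lsn, 0)

-- Step 3: pull and process the changes for this interval
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all')

-- Step 4: mark this run as processed
UPDATE dbo.ProcessingStatus
SET IsProcessed = 1
WHERE CaptureInstance = 'dbo_MyTable' AND LSN = @to_lsn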
As for monitoring your original issue, just make sure that the data in the capture table is generally within the retention period. That is, if you set it to 2 days, I wouldn't even think about it being a problem until it got to be over 4 days (assuming that your call to the cleanup job is scheduled at something like every hour). And when you process with the above scheme, you don't need to worry about "too much" data being there; you're always processing a specific interval rather than "everything".

What factors degrade the performance of a SQL Server 2000 job?

We are currently running a SQL job that archives data daily at 10 PM. However, the end users complain that from 10 PM to 12 the page shows a timeout error.
Here's the pseudocode of the job:
while @jobArchive = 1 and @countProcessedItem < @maxItem
begin
    exec ArchiveItems @countProcessedItem out
    if @@ERROR <> 0              -- an error occurred
        set @jobArchive = 0
    waitfor delay '00:10'        -- pause 10 minutes between batches
end
The ArchiveItems stored procedure grabs the top 100 items that were created 30 days ago, processes and archives them in another database, and deletes the items from the original table, including related rows in other tables. Finally it sets @countProcessedItem to the number of items processed. ArchiveItems also creates and deletes temporary tables it uses to hold some records.
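Purely for illustration, a procedure of that shape might look roughly like the sketch below; all table, column, and database names are hypothetical, since the real procedure is not shown here:
CREATE PROCEDURE dbo.ArchiveItems
    @countProcessedItem int OUTPUT
AS
BEGIN
    -- Grab the next batch of items older than 30 days
    CREATE TABLE #batch (ItemId int PRIMARY KEY)

    INSERT INTO #batch (ItemId)
    SELECT TOP 100 ItemId
    FROM dbo.Items
    WHERE CreatedDate < DATEADD(day, -30, GETDATE())

    -- Copy the batch to the archive database
    INSERT INTO ArchiveDb.dbo.Items (ItemId, CreatedDate, Payload)
    SELECT i.ItemId, i.CreatedDate, i.Payload
    FROM dbo.Items AS i
    JOIN #batch AS b ON b.ItemId = i.ItemId

    -- Remove the archived items (related tables would be handled the same way)
    DELETE i
    FROM dbo.Items AS i
    JOIN #batch AS b ON b.ItemId = i.ItemId

    SELECT @countProcessedItem = COUNT(*) FROM #batch

    DROP TABLE #batch
END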
Note: if the information I've provided is incomplete, reply and I'll gladly add more information if possible.
The only thing that's not clear is whether ArchiveItems also deletes data from the database. Deleting rows in SQL Server is a very expensive operation that causes a lot of locking on the database, possibly escalating to table or database locks, and this typically causes timeouts.
If you're deleting data what you can do is:
Set a "logical" deletion flag on the relevant row and consider it in the query you do to read data
Perform deletes in batches. I've found that (in my application) deleting about 250 rows in each transaction gives the fastest operation, taking a lot less time than issuing 250 separate DELETE commands (see the sketch below).
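A rough sketch of the batched approach in SQL Server 2000 syntax (table and column names are hypothetical):
DECLARE @rows int
SET @rows = 1
SET ROWCOUNT 250                      -- cap each DELETE at ~250 rows (SQL Server 2000 style)
WHILE @rows > 0
BEGIN
    DELETE FROM dbo.Items
    WHERE CreatedDate < DATEADD(day, -30, GETDATE())
    SET @rows = @@ROWCOUNT
END
SET ROWCOUNT 0                        -- reset the session limit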
Hope this helps, but archiving and deleting data from SQL Server is a very tough job.
While the ArchiveItems process is deleting the 100 records, it is locking the table. Make sure you have indexes in place to make the delete run quickly; run a Profiler session during that timeframe and see how long it takes. You may need to add an index on the date field if it is doing a Table Scan or Index Scan to find the records.
On the end user's side, you may be able to add a READUNCOMMITTED or NOLOCK hint on the queries; this allows the query to run while the deletes are taking place, but with the possibility of returning records that are about to be deleted.
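For example, a page query that tolerates dirty reads might look like this (names are hypothetical):
SELECT ItemId, ItemName
FROM dbo.Items WITH (NOLOCK)          -- allows the read to proceed while the archive job deletes
WHERE ItemId = 42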
Also consider a different timeframe for the job; find the time that has the least user activity, or only do the archiving once a month during a maintenance window.
As another poster mentioned, slow DELETEs are often caused by not having a suitable index, or a suitable index needs rebuilding.
During DELETEs it is not uncommon for locks to be escalated ROW -> PAGE -> TABLE. You reduce locking by
Adding a ROWLOCK hint (but be aware it will likely consume more memory)
Randomising the rows that are deleted (makes lock escalation less likely)
Easiest: adding a short WAITFOR in ArchiveItems
WHILE someCondition
BEGIN
DELETE some rows
-- Give other processes a chance...
WAITFOR DELAY '000:00:00.250'
END
I wouldn't use the NOLOCK hint if the deletes are happening during periods with other activity taking place, and you want to maintain integrity of your data.
