HBase: major_compact not working properly

When I run a major compaction in Apache HBase, it is not deleting rows marked for deletion unless I first perform a total reboot of HBase.
First I delete the row and then perform a scan to verify that it is marked for deletion:
column=bank:respondent_name, timestamp=1407157745014, type=DeleteColumn
column=bank:respondent_name, timestamp=1407157745014, value=STERLING NATL MTGE CO., INC
Then I run the command major_compact 'myTable' and wait a couple of minutes for the major compaction to finish in the background. Then when I perform the scan again, the row and tombstone marker are still there.
However, if I restart HBase and run another major compaction, the row and tombstone marker disappear. In a nutshell, major_compact only seems to be working properly if I perform a restart of HBase right before I run the major compaction. Any ideas on why this is the case? I would like to see the row and tombstone marker be deleted every time I run a major compaction. Thanks.

In my experience you need to flush the table first, before running major_compact on it. A major compaction only rewrites the store files on disk, so a tombstone that is still sitting in the MemStore is untouched by it; flushing writes it out to an HFile, where the compaction can drop it. This is also why a restart appears to fix things: a clean shutdown flushes the MemStores.
hbase> flush 'table'
hbase> major_compact 'table'

Step 1. Create the table:
create 'mytable', 'col1'
Step 2. Insert data into the table:
put 'mytable', '1', 'col1:name', 'srihari'
Step 3. Flush the table:
flush 'mytable'
Observe that there is now one store file in the location below:
Location: /hbase/data/default/mytable/*/col1
Repeat steps 2 and 3 one more time and observe the location again; now there are two files there.
Then execute the command below:
major_compact 'mytable'
Once the compaction has finished, only one file remains in that location.
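To verify, you can list the store files directly on HDFS; the wildcard stands in for the encoded region name, and the root of the path depends on your hbase.rootdir setting:
hdfs dfs -ls /hbase/data/default/mytable/*/col1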

Related

CommitLog Recovery with Cassandra

I have noted following statement in the Cassandra documentation on commit log archive configuration:
https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configLogArchive.html
"Restore stops when the first client-supplied timestamp is greater than the restore point timestamp. Because the order in which the database receives mutations does not strictly follow the timestamp order, this can leave some mutations unrecovered."
This statement made us concerned about using point-in-time recovery based on Cassandra commit logs, since it indicates that a point-in-time recovery will not recover all mutations with a timestamp lower than the indicated restore point if some mutations arrive out of timestamp order (which they will in our case).
I tried to verify this behavior via some experiments but have not been able to reproduce this behavior.
I did 2 experiments:
Simple row inserts
Set restore_point_in_time to 1 hour ahead in time.
insert 10 rows (using default current timestamp)
insert a row using timestamp <2 hours ahead in time>
insert 10 rows (using default current timestamp)
I then killed my Cassandra instance, making sure it was terminated without having a chance to flush to SSTables.
During startup I could see from the Cassandra logs that it was doing commit log replay.
After replay I queried my table and could see that 20 rows had been recovered, but the one with the timestamp ahead of time had not been inserted. Based on the documentation, though, I would have expected only the first 10 rows to have been recovered. I verified in the Cassandra log that commit log replay had been done.
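For reference, the out-of-order write in step 3 can be issued with an explicit timestamp like this (keyspace, table, and values are illustrative; USING TIMESTAMP takes microseconds since the epoch):
INSERT INTO ks.t (id, val) VALUES (11, 'future-row') USING TIMESTAMP 1893456000000000;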
Larger CommitLog split experiment
I wanted to see whether the documented behaviour also applied across a commit log split/rollover.
So I set commitlog_segment_size_in_mb to 1 MB to cause the commitlog to rollover more frequently instead of the 32MB default.
I then ran a script to mass insert rows to force the commit log to split.
The result was that I inserted 12000 records, then a record with a timestamp ahead of my restore_point_in_time, and then 8000 records afterwards.
At about 13200 rows my commit log rolled over to a new file.
I then killed my Cassandra instance again and restarted. Again I could see in the log that commit log replay was being done, and after replay all rows except the single row with a timestamp ahead of restore_point_in_time were recovered.
Notes
I did similar experiments using the commitlog_sync batch option, and also, to make sure my rows had not been flushed to SSTables, I tried restoring a snapshot with empty tables before starting up Cassandra to force it to perform commit log replay. In all cases I got the same results.
I guess my question is whether the statement in the documentation is still valid, or whether I'm missing something in my experiments.
Any help would be greatly appreciated. I need an answer to this to be able to settle on a backup/recovery mechanism for a larger-scale Cassandra cluster setup.
All experiments were done using Cassandra 3.11 (single-node setup) in a Docker container (the official cassandra Docker image). I ran the experiments on the image "from scratch", so no config changes were made other than what I included in the description here.
I think that it will be relatively hard to reproduce, as you'll need to make sure that some of the mutations arrive later than others, and this mostly happens when some clients have unsynchronized clocks, or nodes are overloaded and hints are replayed some time later, etc.
But this parameter may not be required at all: if you look into CommitLogArchiver.java, you can see that if the parameter is not specified, it is set to Long.MAX_VALUE, meaning that there is no upper bound and all commit logs will be replayed; Cassandra then handles it the standard way: "the latest timestamp wins".
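For reference, the parameter lives in conf/commitlog_archiving.properties; a minimal sketch with illustrative values (the timestamp format is yyyy:MM:dd HH:mm:ss):
restore_command=cp -f %from %to
restore_directories=/var/lib/cassandra/restored_commitlog
restore_point_in_time=2024:01:15 20:00:00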

Destination table gets truncated at the start of a Kettle script run

I have a Kettle script that reads from Table A, parses the data, then sends it to Table 1 and Table 2. From the whole Kettle script, I disabled the branch that populates Table 2 and ran the script; from this, Table 1 was populated. After this I did it the other way around to populate the other table (Table 2), i.e. I disabled the branch that populates Table 1. While the script was running, I noticed that Table 1 was being truncated while Table 2 was being populated. After the whole migration script had finished, both tables were populated.
I also noticed this 'Truncate Table' flag in the destination table. I just don't understand why the truncation is necessary given that I disabled the branch that runs it. Any explanations for this?
The truncation happens when the step is initialized. Regardless of the incoming hop being enabled or disabled, the truncation will always happen. Same happens in steps like Text file output, where a 0 byte file is created when the transformation starts.

Should I use a merge with this scenario?

I have a table that gets updated from an outside source. It normally sits empty until they push data to me. With this data I am supposed to add, update, or delete records in two other tables (linked by a primary/foreign key). Data is pushed to me one row at a time, and occasionally in a large download twice a year. They want me to update my tables in real time. Should I use a trigger and have it read line by line, or merge the tables?
I'd have a scheduled job that runs a sproc to check for work to do in that table, and then process it in batches. Have a column on the import/staging table that you can update with a batch number or timestamp, so that if something goes wrong (like they have pushed you some goofy data) you know where to restart from and can identify which row caused the problem (see the sketch below).
If you use a trigger, not only might it slow down them feeding you a large batch of data, but you'll also possibly lose the ability to keep a record of where the process got to if it fails.
If it was always one row at a time, then I think the trigger method would be an okay option.
Edit: Just to clarify the point about batch number/timestamp, this is so if you have new/unexpected data which crashes your import, you can alter the code and re-run the process as much as you like without having to ask for a fresh import.
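A minimal sketch of that pattern, assuming a staging table with a nullable BatchId column; every object and column name here is illustrative:
-- stamp all unprocessed rows with a new batch number
DECLARE @BatchId INT
SELECT @BatchId = ISNULL(MAX(BatchId), 0) + 1 FROM dbo.Staging_Import
UPDATE dbo.Staging_Import SET BatchId = @BatchId WHERE BatchId IS NULL

-- update target rows that already exist
UPDATE t
SET t.SomeValue = s.SomeValue
FROM dbo.TargetTable t
JOIN dbo.Staging_Import s ON s.KeyId = t.KeyId
WHERE s.BatchId = @BatchId

-- insert target rows that don't exist yet
INSERT INTO dbo.TargetTable (KeyId, SomeValue)
SELECT s.KeyId, s.SomeValue
FROM dbo.Staging_Import s
LEFT JOIN dbo.TargetTable t ON t.KeyId = s.KeyId
WHERE s.BatchId = @BatchId AND t.KeyId IS NULL
If a batch fails, the BatchId tells you exactly which rows were in flight, so you can fix the data or the code and re-run just that batch.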

SQL Server 2005 - managing concurrency on tables

In an ASP.NET application I've got this process:
Start a connection
Start a transaction
Insert a lot of rows into a table "LoadData" with the SqlBulkCopy class, with a column that contains a specific LoadId.
Call a stored procedure that:
reads the table "LoadData" for the specific LoadId.
for each line does a lot of calculations, which involves reading dozens of tables and writing the results into a temporary (#temp) table (a process that lasts several minutes).
deletes the lines in "LoadData" for the specific LoadId.
once everything is done, writes the result into the result table.
Commit transaction or rollback if something fails.
My problem is that if two users start the process, the second one has to wait until the first has finished (because the insert seems to take an exclusive lock on the table), and my application sometimes times out (and the users are not happy to wait :) ).
I'm looking for a way to let the users run everything in parallel, as there is no interaction between them except for the last step: writing the result. I think what is blocking me is the inserts / deletes in the "LoadData" table.
I checked the other transaction isolation levels, but it seems that none of them helps me.
What would be perfect would be to be able to release the exclusive lock on the "LoadData" table once the insert has finished, but without ending the transaction (is it possible to force SQL Server to lock only rows and not the whole table?).
Any suggestion?
Look up the READ_COMMITTED_SNAPSHOT database option (the row-versioning form of READ COMMITTED) in Books Online.
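For example, with an illustrative database name (the ALTER needs the database to be free of other active connections):
ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON
With this option on, statements running at the default READ COMMITTED level read committed row versions instead of taking shared locks, so readers no longer queue behind the writers' exclusive locks.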
Transactions should cover small and fast-executing pieces of SQL / code. They have a tendency to be implemented differently on different platforms. They will lock tables and then escalate the lock as the modifications grow, thus locking other users out of querying or updating the same row / page / table.
Why not forget the transaction and handle processing errors in another way? Is your data integrity truly being secured by the transaction, or can you do without it?
If you're sure that there is no issue with concurrent operations except for the last part, why not start the transaction just before those last statements (whichever they are that DO require isolation) and commit immediately after they succeed? Then all the upfront read operations will not block each other (see the sketch below).
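A minimal sketch of that shape, with illustrative object names (@LoadId identifies the current user's batch; dbo.CalculateResults is a hypothetical procedure standing in for the long calculation):
CREATE TABLE #Results (LoadId INT, ResultValue DECIMAL(18, 2))  -- illustrative schema

-- the heavy calculation runs with no surrounding transaction
INSERT INTO #Results (LoadId, ResultValue)
EXEC dbo.CalculateResults @LoadId

-- only the final write and cleanup are wrapped in a transaction
BEGIN TRANSACTION
INSERT INTO dbo.ResultTable (LoadId, ResultValue)
SELECT LoadId, ResultValue FROM #Results
DELETE FROM dbo.LoadData WHERE LoadId = @LoadId
COMMIT TRANSACTION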

What factors degrade the performance of a SQL Server 2000 job?

We are currently running a SQL job that archives data daily at 10 PM. However, the end users complain that from 10 PM to 12 the page shows a timeout error.
Here's the pseudocode of the job
WHILE @jobArchive = 1 AND @countProcessedItem < @maxItem
BEGIN
    EXEC ArchiveItems @countProcessedItem OUTPUT
    IF @@ERROR <> 0
        SET @jobArchive = 0
    WAITFOR DELAY '00:10'  -- pause 10 minutes between batches
END
The ArchiveItems stored procedure grabs the top 100 items that were created 30 days ago, processes and archives them in another database, and deletes the items from the original table, including related rows in other tables. Finally it sets @countProcessedItem to the number of items processed. ArchiveItems also creates and drops temporary tables that it uses to hold some records.
Note: if the information I've provide is incomplete, reply and I'll gladly add more information if possible.
The only thing that is not clear is whether ArchiveItems also deletes data from the database. Deleting rows in SQL Server is a very expensive operation that causes a lot of locking, possibly escalating to table and database locks, and this typically causes timeouts.
If you're deleting data, what you can do is:
Set a "logical" deletion flag on the relevant rows and take it into account in the queries that read data
Perform deletes in batches. I've found that (in my application) deleting about 250 rows in each transaction gives the fastest operation, taking a lot less time than issuing 250 DELETE commands separately (see the sketch below)
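A sketch of the batched variant, using SET ROWCOUNT (the SQL Server 2000 way to cap how many rows a DELETE touches; table and column names are illustrative):
DECLARE @rows INT
SET @rows = 1
SET ROWCOUNT 250  -- batch size
WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION
    DELETE FROM dbo.Items WHERE CreatedDate < DATEADD(day, -30, GETDATE())
    SET @rows = @@ROWCOUNT
    COMMIT TRANSACTION
END
SET ROWCOUNT 0  -- restore the default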
Hope this helps, but archiving and deleting data from SQL Server is a very tough job.
While the ArchiveItems process is deleting the 100 records, it is locking the table. Make sure you have indexes in place to make the delete run quickly; run a Profiler session during that timeframe and see how long it takes. You may need to add an index on the date field if it is doing a Table Scan or Index Scan to find the records.
On the end user's side, you may be able to add a READUNCOMMITTED or NOLOCK hint on the queries; this allows the query to run while the deletes are taking place, but with the possibility of returning records that are about to be deleted.
Also consider a different timeframe for the job; find the time that has the least user activity, or only do the archiving once a month during a maintenance window.
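For instance, assuming the archiving predicate filters on a creation-date column (all names illustrative):
CREATE INDEX IX_Items_CreatedDate ON dbo.Items (CreatedDate)

-- user-facing query that can tolerate dirty reads during the archive window
SELECT ItemId, CreatedDate
FROM dbo.Items WITH (NOLOCK)
WHERE CreatedDate >= DATEADD(day, -7, GETDATE())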
As another poster mentioned, slow DELETEs are often caused by not having a suitable index, or a suitable index needs rebuilding.
During DELETEs it is not uncommon for locks to be escalated ROW -> PAGE -> TABLE. You can reduce locking by:
Adding a ROWLOCK hint (but be aware it will likely consume more memory)
Randomising the rows that are deleted (makes lock escalation less likely)
Easiest: adding a short WAITFOR in ArchiveItems
SET ROWCOUNT 100  -- batch size
DECLARE @rows INT
SET @rows = 1
WHILE @rows > 0
BEGIN
    -- delete some rows; the table and predicate are illustrative
    DELETE FROM dbo.Items WHERE CreatedDate < DATEADD(day, -30, GETDATE())
    SET @rows = @@ROWCOUNT
    -- Give other processes a chance...
    WAITFOR DELAY '000:00:00.250'
END
SET ROWCOUNT 0
I wouldn't use the NOLOCK hint if the deletes are happening during periods with other activity taking place, and you want to maintain integrity of your data.
