My co-worker and I have been thrown into a project that uses Cassandra with no introductions.
Alright, let's do this!
SELECT * FROM reports WHERE timestamp < '2019-01-01 00:00:00' ALLOW FILTERING;
Error: 1300
Apparently, we have too many tombstones. What's that?
A tombstone is deleted data that hasn't been removed yet for performance reasons.
Tombstones should be deleted with nodetool repair before gc_grace_seconds has expired; the default is 10 days.
Now, this project is around 7 years old and it doesn't seem like there's a job that runs repair.
According to default warning and error values, 1K tombstones are a lot. We find about 1.4M.
We measured the number of tombstones with Tracing on, running a SELECT query, and accumulating the tombstones reported.
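The tally can be scripted once the trace output is saved as text. This is a minimal sketch, assuming each relevant trace line has the form "Read N live rows and M tombstone cells"; the exact wording may differ between versions, and the function name and sample lines are made up for illustration.

```python
import re

# Assumed shape of a trace activity line; adjust the pattern if your version words it differently.
TOMBSTONE_RE = re.compile(r"Read (\d+) live rows and (\d+) tombstone cells")

def count_tombstones(trace_text):
    """Sum live rows and tombstone cells over all matching lines of a saved trace."""
    live = tombstones = 0
    for match in TOMBSTONE_RE.finditer(trace_text):
        live += int(match.group(1))
        tombstones += int(match.group(2))
    return live, tombstones

# Hypothetical sample, not real output from our cluster.
sample = """\
Read 12 live rows and 350 tombstone cells [ReadStage-2]
Merging data from memtables and 3 sstables
Read 0 live rows and 1050 tombstone cells [ReadStage-1]
"""
print(count_tombstones(sample))
```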
We tried to run nodetool repair --full -pr -j 4 but we get Validation failed in /10.0.3.1.
DataStax's guide to repairing repairs wants us to fix the validation error with nodetool scrub.
But we still get the same error afterwards.
The guide then wants us to run sstablescrub, which failed with an out-of-memory exception.
Going back to our original problem of deleting data before 2019, we tried to run DELETE FROM reports WHERE timestamp < '2019-01-01 00:00:00'.
However, timestamp is not our partition key so we are not allowed to delete data like this, which has also been confirmed by many other StackOverflow posts and a DataStax issue on Jira.
Every post mentions that we should "just" change the schema of our Cassandra database to fit our queries.
First, we only need to do this once; second, our client wants to have this data deleted as soon as possible.
Is there a way of easily changing the schema of a Cassandra database?
Is there a way that we can make a slow solution that at least works?
All in all, we are new to Cassandra and we are unsure on how to proceed.
What we want is
delete all data from before 2019 and confirm that it is deleted
have stable selects, avoiding error 1300
Can you help?
We have 4 nodes running in Docker on Azure if that is necessary to know.
The version of Cassandra is 3.11.6.
Tombstones could exist in the SSTables longer than 10 days because they are evicted during compaction, and if compaction hasn't happened for a long time, they just stay there. You have the following options available (for 3.11.x):
if you have disk space you may force compaction using the nodetool compact -s that will combine all SSTables into several SSTables - this will put a lot of load onto the system as it will read all data & write them back
use nodetool garbagecollect to evict old data & expired tombstones - but it may not delete all tombstones
you can tune the compaction parameters of the specific table so that compaction happens more often, for example by lowering min_threshold (the minimum number of SSTables for compaction) from 4 to 2, or by adjusting other options such as tombstone_threshold
In future, for repairs it's recommended to use something like Reaper, that performs token range repair, putting less load onto the system.
Mass deletion of data could be done by external tools, for example:
Spark + Spark Cassandra Connector - see this answer for example
DSBulk - you can use the -query option to specify the query that unloads data to disk (select only the columns of the primary key, and use the :start/:end keywords), and then load that data back with -query 'DELETE FROM table WHERE primary_key = ....'
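As a rough illustration of the unload-then-delete idea, here is a Python sketch that filters an exported CSV down to the keys of rows older than the cutoff. The column names id and timestamp, the single-column primary key, and the ISO-8601 timestamp format are all assumptions made for the example; the real table's primary key will differ, and the DELETE must list every primary key column.

```python
import csv
import io
from datetime import datetime

CUTOFF = datetime(2019, 1, 1)

def keys_older_than(csv_text, cutoff=CUTOFF):
    """Return primary-key values of exported rows whose timestamp is before cutoff."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["id"]  # hypothetical single-column primary key
        for row in reader
        if datetime.fromisoformat(row["timestamp"]) < cutoff
    ]

# Made-up export for illustration.
export = (
    "id,timestamp\n"
    "a1,2018-12-31T23:59:59\n"
    "b2,2019-01-01T00:00:00\n"
    "c3,2017-06-01T08:30:00\n"
)
for key in keys_older_than(export):
    # Each key would then be bound to a prepared DELETE FROM reports WHERE id = ?
    print(key)
```

With DSBulk itself the date filter would normally live in the unload query, so this kind of post-filtering is only needed if the export contains more rows than you want to delete.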
And for schema change - it's not the most trivial task. To match your table structure to queries you most probably will need to change the primary key, and in Cassandra this is done only by creating the new table(s) and loading data into them. For that task you'll also need something like Spark or DSBulk, especially if you need to migrate data with TTL and/or WriteTime. See this answer for more details.
I'm moving data from an ODBC source to an OLE DB destination; records get inserted every day on the ODBC side in different tables. The packages get slower and slower: it takes about a day for a million records, sometimes more. The tables can have new rows inserted or existing rows updated, and loading and looking up the new data slows the process. Is there any way I can speed up the ETL process, or is there an open source platform I can use to load the data faster?
I tried counting the rows in the OLE DB Destination so that I could insert only the records beyond that count from the ODBC Source, but to my surprise the ROW_NUMBER() function isn't supported in OpenEdge ODBC.
Based on the limited information in your question, I'd design your packages like the following
SEQC PG to SQL
The point of these operations is to transfer data from our source system verbatim to the target. The target table should be brand new and the SQL Server equivalent of the PG table from a data type perspective. Clustered Key if one exists, otherwise, see how a heap performs. I am going to reference this as a staging table.
The Data Flow itself is going to be bang simple
By default, the destination will perform a fast load and lock the table.
Run the package and observe times.
Edit the OLE DB Destination and change the Maximum insert commit size to something less than 2147483647. Try 100000: is it better or worse? Move up or down an order of magnitude at a time until you have an idea of the fastest the package can move data.
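The same experiment can be rehearsed outside SSIS. The sketch below is a stand-in, not SSIS itself: it uses Python's sqlite3 module with an in-memory table and a made-up staging schema to time the same load at several commit sizes. The absolute timings will look nothing like your package's; only the shape of the experiment carries over.

```python
import sqlite3
import time

def load_with_commit_size(rows, commit_size):
    """Insert rows into a fresh in-memory table, committing every commit_size rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT)")
    started = time.perf_counter()
    for offset in range(0, len(rows), commit_size):
        batch = rows[offset:offset + commit_size]
        conn.executemany("INSERT INTO staging VALUES (?, ?)", batch)
        conn.commit()
    elapsed = time.perf_counter() - started
    loaded = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
    conn.close()
    return loaded, elapsed

# Made-up rows; step the commit size up an order of magnitude at a time.
rows = [(i, f"row-{i}") for i in range(50_000)]
for commit_size in (100, 1_000, 10_000, 100_000):
    loaded, elapsed = load_with_commit_size(rows, commit_size)
    print(f"commit size {commit_size:>7}: {loaded} rows in {elapsed:.3f}s")
```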
There are a ton of variables at this stage of the game (how busy is the source PG database, what are the data types involved, how far does the data need to travel from the Source, to your computer, to the Destination), but this can at least help you answer "can I pull (insert large number here) rows from the source system within the expected tolerance?" If you can get the data moved from PG to SQL within the expected SLA and you still have processing time left, then move on to the next section.
Otherwise, you have to rethink your strategy for what data gets brought over. Maybe there are reliable (system generated) insert/update times associated with the rows. Maybe it's a financial-like system where rows aren't updated, just new versions of the row are inserted and the net values are all that matters. Too many possibilities here but you'll likely need to find a Subject Matter Expert on the system - someone who knows the logical business process the database models as well as how the data is stored in the database. Buy that person some tasty snacks because they are worth their weight in gold.
Now what?
At this point, we have transferred the data from PG to SQL Server and we need to figure out what to do with it. 4 possibilities exist
The data is brand new. We need to add the row into the target table
The data is unchanged. Do nothing
The data exists but is different. We need to change the existing row in the target table
There is data in the target table that isn't in the staging table. We're not going to do anything about this case either.
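To make the four cases concrete, here is a small Python sketch that sorts keys into them. It assumes both tables fit in memory as dictionaries keyed by the business key, which is only for illustration; the names and sample rows are made up.

```python
def classify(staging, target):
    """Split keys into the four cases; both arguments map key -> row tuple."""
    return {
        "new": [k for k in staging if k not in target],
        "unchanged": [k for k in staging if k in target and staging[k] == target[k]],
        "changed": [k for k in staging if k in target and staging[k] != target[k]],
        "target_only": [k for k in target if k not in staging],
    }

# Made-up rows: key -> (name, amount)
staging = {1: ("alpha", 10), 2: ("beta", 25), 3: ("gamma", 30)}
target = {1: ("alpha", 10), 2: ("beta", 20), 4: ("delta", 40)}

for case, keys in classify(staging, target).items():
    print(case, keys)
```

Only the "new" and "changed" buckets lead to any work against the target table.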
Adding data, inserts, are easy and can be fast - it depends on table design.
Changing data, updates, are less easy in SSIS and are slower than adding new rows. Slower because behind the scenes, the database will delete and add the row back in.
Non-Clustered indexes are also potential bottlenecks here, but they can also be beneficial. Welcome to the world of "it depends"
Option 1 is to just write the SQL statements to handle the insert and update. Yes, you have a lovely GUI tool for creating data flows but you need speed and this is how you get it (especially since we've already moved all the data from the external system to a central repository)
Option 2 is to use a Data Flow and potentially an Execute SQL Task to move the data. The idea being, the Data Flow will segment your data into New which will use an OLE DB Destination to write the inserts. The updates - it depends on volume what makes the most sense from an efficiency perspective. If it's tens, hundreds, thousands of rows to update, eh take the performance penalty and use an OLE DB Command to update the row. Maybe it's hundreds of thousands and the package runs good enough, then keep it.
Otherwise, route your changed rows to yet another staging table and then do a mass update from the staged updates to the target table. But at this point, you just wrote half the query you needed for the first option so just write the Insert and be done (and speed up performance because now everything is just SQL Engine "stuff")
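For the pure-SQL route, the pair of statements looks roughly like the following. This is a sketch run against SQLite through Python's sqlite3 module so that it is self-contained; the table and column names are invented, and on SQL Server you would write the equivalent UPDATE ... FROM and INSERT ... SELECT (or a MERGE) against your real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target  (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, amount INTEGER);
    INSERT INTO target  VALUES (1, 10), (2, 20), (4, 40);
    INSERT INTO staging VALUES (1, 10), (2, 25), (3, 30);
""")

# Update rows that exist in both tables but differ.
# Note: <> does not treat NULLs as different; nullable columns need extra handling.
conn.execute("""
    UPDATE target
    SET amount = (SELECT s.amount FROM staging s WHERE s.id = target.id)
    WHERE EXISTS (SELECT 1 FROM staging s
                  WHERE s.id = target.id AND s.amount <> target.amount)
""")

# Insert rows that are only in staging.
conn.execute("""
    INSERT INTO target (id, amount)
    SELECT s.id, s.amount FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
""")
conn.commit()

result = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
print(result)
```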
You might want to investigate Progress' Change Data Capture feature. If you have a modern release of OpenEdge (11.7 or better) and the proper licenses you can enable CDC policies to track changes. Your ETL process could then use that information to target its efforts.
Warning: it's complicated. There is a lot more to actually doing it than marketing would have you believe. But if your use-case is straight-forward it might not be too terrible.
Or you could implement Progress "Pro2" product to do all the dirty work for you. (That's an extra cost option.)
We have a process which uses SQL Server's amazing tableDiff via:
Microsoft SQL Server\100\COM\Tablediff.exe
It's SQL Server 2008 R2. It connects from one instance to another identical instance. It works very well!
I have a situation where a table which now has 10767594 records is taking 2.5 hours to complete, it only has one table in the job. How can I improve this?
The process is triggered by a Windows Scheduled Task, this calls a .bat file, the .bat file contains the recommended code which has no issue. We have a couple of these in place and have had for some time. It's just the one job that deals with the big table from instance to instance that is taking too long.
I have realised that the source table does have an index but the destination table does not. I will put an index on this table, what else can I do?
Does table diff run better with indexes?
Is there a way to use table diff more effectively?
E.g. if I capture the lastProcessedID can I run tableDiff next time for all records where id > lastProcessedID?
Any advice would be great. Thank you in advance
EDITED:
MY SOLUTION - This was a very, very big surprise. As I mentioned above, the 10 million+ record table was identical on the source and destination except for 2 indexes (on the source). After waiting for out-of-hours, since this is an internal production server, I applied the indexes to the destination. Then I ran the tableDiff job, which had not been changed at all, and it completed in under 2 minutes. 2.5 hours to 2 mins!
I have accepted the answer below because it was very helpful. I did go down the Merge Replication path; however, after setting up replication and publishing, I found out that the production instance could not be a subscriber because replication was not ticked on install. As Jason says, it's a reasonable amount of research, learning and setting up. Since I am not a DBA and had not looked at this before, it was a worthwhile experience.
The performance issue is because the remote queries pull every record from each place to do the comparison to generate the output. Indexes can help slightly to make the pull a little faster from each location, but it's not likely to be significant.
An incremental approach is definitely better. I don't believe tablediff directly supports comparing 2 queries. If it did, you could do something like EXCEPT or INTERSECT to do the comparisons. If you're trying to keep these databases in sync, why not consider other solutions, like log shipping, mirroring, SSIS, replication, clustering, etc.
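If you do roll your own incremental comparison, the lastProcessedID idea combined with EXCEPT looks roughly like this. The sketch uses SQLite through Python's sqlite3 module purely so it can run anywhere; the table names and rows are made up, and this is not something tablediff does for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_rows (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE dest_rows   (id INTEGER PRIMARY KEY, val TEXT);
    INSERT INTO source_rows VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    INSERT INTO dest_rows   VALUES (1, 'a'), (2, 'b'), (3, 'x');
""")

def rows_missing_or_different(conn, last_processed_id):
    """Source rows above the watermark that have no identical row in the destination."""
    return conn.execute(
        """
        SELECT id, val FROM source_rows WHERE id > ?
        EXCEPT
        SELECT id, val FROM dest_rows WHERE id > ?
        ORDER BY id
        """,
        (last_processed_id, last_processed_id),
    ).fetchall()

print(rows_missing_or_different(conn, 2))
```

Keep in mind that a watermark on id only catches new rows and changes to rows above it; updates to older rows go unnoticed.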
I am working with two instances of an Oracle database, call them one and two. two is running on better hardware (hard disk, memory, CPU) than one, and two is one minor version behind one in terms of Oracle version (both are 11g). Both have the exact same table table_name with exactly the same indexes defined. I load 500,000 identical rows into table_name on both instances. I then run, on both instances:
delete from table_name;
This command takes 30 seconds to complete on one and 40 minutes to complete on two. Doing INSERTs and UPDATEs on the two tables has similar performance differences. Does anyone have any suggestions on what could have such a drastic impact on performance between the two databases?
I'd first compare the instance configurations - SELECT NAME, VALUE from V$PARAMETER ORDER BY NAME and spool the results into text files for both instances and use some file comparison tool to highlight differences. Anything other than differences due to database name and file locations should be investigated. An extreme case might be no archive logging on one database and 5 archive destinations defined on the other.
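If you don't have a file comparison tool handy, the comparison is easy to script. A minimal Python sketch, assuming each spooled line is a parameter name followed by whitespace and an optional value, with headers already stripped; the sample values are invented.

```python
def parse_params(spool_text):
    """Parse 'NAME VALUE' lines from a spooled dump into a dict (value may be empty)."""
    params = {}
    for line in spool_text.splitlines():
        parts = line.split(None, 1)
        if parts:
            params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params

def diff_params(one, two):
    """Return {name: (value_on_one, value_on_two)} for every parameter that differs."""
    names = sorted(set(one) | set(two))
    return {n: (one.get(n), two.get(n)) for n in names if one.get(n) != two.get(n)}

# Invented values for illustration only.
one_dump = "db_block_size 8192\nlog_archive_dest_2\nprocesses 300\n"
two_dump = "db_block_size 8192\nlog_archive_dest_2 LOCATION=/hypothetical/arch\nprocesses 150\n"

for name, values in diff_params(parse_params(one_dump), parse_params(two_dump)).items():
    print(name, values)
```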
If you don't have access to the filesystem on the database host find someone who does and have them obtain the trace files and tkprof results from when you start a session, ALTER SESSION SET sql_trace=true, and then do your deletes. This will expose any recursive SQL due to triggers on the table (that you may not own), auditing, etc.
If you can monitor the wait_class and event columns in v$session for the deleting session you'll get a clue as to the cause of the delay. Generally I'd expect a full table DELETE to be disk bound (a wait class indicating I/O or maybe configuration). It has to read the data from the table (so it knows what to delete) and update the data blocks and index blocks to remove the entries, which generates a lot of entries for the UNDO tablespace and the redo log.
In a production environment, the underlying files may be spread over multiple disks (even SSD). Dev/test environments may have them all stuck on one device and have a lot of head movement on the disk slowing things down. I could see that slowing a SQL statement down maybe tenfold. Yours is worse than that.
If there is concurrent activity on the table [wait_class of 'Concurrency'] (eg other sessions inserting) you may get locking contention or the sessions are both trying to hammer the index.
Something is obviously wrong in instance two. I suggest you take a look at these SO questions and their answers:
Oracle: delete suddenly taking a long time
oracle delete query taking too much time
In particular:
Do you have unindexed foreign key references (reason #1 of delete taking a looong time -- look at this script from AskTom),
Do you have any ON DELETE TRIGGER on the table ?
Do you have any activity on instance two (if this table is continuously updated, you may be blocked by other sessions)
please note: i am not a dba...
I have the following written on my office window:
In case of emergency ask the on call dba to:
Check Plan
Run Stats
Flush Shared Buffer Pool
Number 2 and/or 3 normally fix queries which work in one database but not the other or which worked yesterday but not today....
I have two SQL Server 2005 instances that are geographically separated. Important databases are replicated from the primary location to the secondary using transactional replication.
I'm looking for a way that I can monitor this replication and be alerted immediately if it fails.
We've had occasions in the past where the network connection between the two instances has gone down for a period of time. Because replication couldn't occur and we didn't know, the transaction log blew out and filled the disk causing an outage on the primary database as well.
My google searching some time ago led to us monitoring the MSrepl_errors table and alerting when there were any entries but this simply doesn't work. The last time replication failed (last night hence the question), errors only hit that table when it was restarted.
Does anyone else monitor replication and how do you do it?
Just a little bit of extra information:
It seems that last night the problem was that the Log Reader Agent died and didn't start up again. I believe this agent is responsible for reading the transaction log and putting records in the distribution database so they can be replicated on the secondary site.
As this agent runs inside SQL Server, we can't simply make sure a process is running in Windows.
We have emails sent to us for Merge Replication failures. I have not used Transactional Replication but I imagine you can set up similar alerts.
The easiest way is to set it up through Replication Monitor.
Go to Replication Monitor and select a particular publication. Then select the Warnings and Agents tab and then configure the particular alert you want to use. In our case it is Replication: Agent Failure.
For this alert, we have the Response set up to Execute a Job that sends an email. The job can also do some work to include details of what failed, etc.
This works well enough for alerting us to the problem so that we can fix it right away.
You could run a regular check that data changes are taking place, though this could be complex depending on your application.
If you have some form of audit trail table that is very regularly updated (i.e. our main product has a base audit table that lists all actions that result in data being updated or deleted) then you could query that table on both servers and make sure the result you get back is the same. Something like:
SELECT CHECKSUM_AGG(CHECKSUM(*))
FROM audit_base
WHERE action_timestamp BETWEEN <time1> AND <time2>
where <time1> and <time2> are round values to allow for different delays in contacting the databases. For instance, if you are checking at ten past the hour you might check items from the start of the last hour to the start of this hour. You now have two small values that you can transmit somewhere and compare. If they are different then something has most likely gone wrong in the replication process - have whatever process does the check/comparison send you a mail and an SMS so you know to check and fix any problem that needs attention.
By using CHECKSUM_AGG the amount of data returned for each table is very small, so the bandwidth use of the checks will be insignificant. You just need to make sure your checks are not too expensive in the load that they apply to the servers, and that you don't check data that might be part of open replication transactions and so might be expected to be different at that moment (hence checking the audit trail a few minutes back in time instead of now in my example), otherwise you'll get too many false alarms.
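The comparison step itself is tiny. The following Python sketch shows the idea with an XOR of per-row CRC32 values standing in for the database-side aggregate; it is not the same algorithm as CHECKSUM_AGG, the rows are made up, and the window here is half-open rather than inclusive like BETWEEN.

```python
import zlib
from functools import reduce

def window_checksum(rows, start, end):
    """XOR together a CRC32 of every row whose timestamp falls in [start, end)."""
    in_window = (r for r in rows if start <= r[0] < end)
    return reduce(lambda acc, r: acc ^ zlib.crc32(repr(r).encode()), in_window, 0)

# Made-up audit rows: (ISO timestamp, action). ISO strings compare correctly as text.
primary = [
    ("2024-01-01T09:05:00", "update order 1"),
    ("2024-01-01T09:40:00", "delete order 2"),
    ("2024-01-01T10:03:00", "update order 3"),
]
replica = primary[:1] + primary[2:]  # the 09:40 row never arrived

start, end = "2024-01-01T09:00:00", "2024-01-01T10:00:00"
if window_checksum(primary, start, end) != window_checksum(replica, start, end):
    print("checksums differ - raise an alert")
```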
Depending on your database structure the above might be impractical. For tables that are not insert-only (no updates or deletes) within the timeframe of your check (like an audit-trail as above), working out what can safely be compared while avoiding false alarms is likely to be both complex and expensive if not actually impossible to do reliably.
You could manufacture a rolling insert-only table if you do not already have one, by having a small table (containing just an indexed timestamp column) to which you add one row regularly - this data serves no purpose other than to exist so you can check updates to the table are getting replicated. You can delete data older than your checking window, so the table shouldn't grow large. Only testing one table does not prove that all the other tables are replicating (or any other tables for that matter), but finding an error in this one table would be a good "canary" check (if this table isn't updating in the replica, then the others probably aren't either).
This sort of check has the advantage of being independent of the replication process - you are not waiting for the replication process to record exceptions in logs, you are instead proactively testing some of the actual data.
There is a SQL Server 2000 database we have to update during the weekend.
Its size is almost 10 GB.
The updates range from schema changes and primary key updates to some million records being updated, corrected or inserted.
The weekend is hardly enough for the job.
We set up a dedicated server for the job,
turned the Database SINGLE_USER
made any optimizations we could think of: drop/recreate indexes, relations etc.
Can you propose anything to speedup the process?
SQL Server 2000 is not negotiable (not my decision). Updates are run through a custom-made program and not BULK INSERT.
EDIT:
Schema updates are done by Query analyzer TSQL scripts (one script per Version update)
Data updates are done by C# .net 3.5 app.
Data come from a bunch of Text files (with many problems) and written to local DB.
The computer is not connected to any Network.
Although dropping excess indexes may help, you need to make sure that you keep those indexes that will enable your upgrade script to easily find those rows that it needs to update.
Otherwise, make sure you have plenty of memory in the server (although SQL Server 2000 Standard is limited to 2 GB), and if need be pre-grow your MDF and LDF files to cope with any growth.
If possible, your custom program should be processing updates as sets instead of row by row.
EDIT:
Ideally, try and identify which operation is causing the poor performance. If it's the schema changes, it could be because you're making a column larger and causing a lot of page splits to occur. However, page splits can also happen when inserting and updating for the same reason - the row won't fit on the page anymore.
If your C# application is the bottleneck, could you run the changes first into a staging table (before your maintenance window), and then perform a single update onto the actual tables? A single update of 1 million rows will be more efficient than an application making 1 million update calls. Admittedly, if you need to do this this weekend, you might not have a lot of time to set this up.
What exactly does this "custom made program" look like? i.e. how is it talking to the data? Minimising the amount of network IO (from a db server to an app) would be a good start... typically this might mean doing a lot of work in TSQL, but even just running the app on the db server might help a bit...
If the app is re-writing large chunks of data, it might still be able to use bulk insert to submit the new table data. Either via command-line (bcp etc), or through code (SqlBulkCopy in .NET). This will typically be quicker than individual inserts etc.
But it really depends on this "custom made program".