Cassandra DELETEs with or without IF EXISTS - database

I have a situation where I do not know whether data exists in a set of tables, so as of now I am issuing DELETEs on all of those tables. A single API call therefore results in about 30-50 DELETEs in Cassandra. Recently it has turned out that most of the DELETEs are being issued against non-existent data. Would Cassandra's performance still be negatively affected by millions of DELETEs on data that does not exist? Should I use 'IF EXISTS' when deleting data that may or may not exist?

It's better to just issue a regular DELETE without the IF EXISTS, because with IF EXISTS the coordinator has to use serial consistency and the Paxos protocol, which takes longer and involves extra coordination round trips between the nodes. IF EXISTS turns the statement into a lightweight transaction, and lightweight transactions should be a small fraction of your workload (on the order of 1%), not something you do routinely.
Still, you don't want a lot of tombstones lying around (which is what a delete creates), so it also depends on how you model your data and how you do deletes. I'll be more than happy to provide insight on that if you share your schema and your insert and delete statements ;)

IF EXISTS will simply not apply the delete if the row doesn't exist.
Deletes do affect performance, but deleting a nonexistent row costs very little: a regular DELETE is just the write of a tombstone and does not read the row first, and it will not create tombstones for individual columns that were never there.
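For concreteness, here is a minimal CQL sketch of the two forms being discussed; the keyspace, table, and key are made up for illustration:

    -- Plain delete: a pure write. The coordinator just records a tombstone,
    -- whether or not the row ever existed.
    DELETE FROM myks.user_sessions WHERE user_id = 42;

    -- Conditional delete: a lightweight transaction. The coordinator runs Paxos,
    -- and the result includes an [applied] column telling you whether a row was there.
    DELETE FROM myks.user_sessions WHERE user_id = 42 IF EXISTS;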

Related

Are there best practices for deduplicating records when using auto-ingest Snowpipes?

Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage, as documented here. This works well and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is how best to safely perform this MERGE without missing any records. At the moment we perform a single data extraction job per day, so there is normally a point where the Snowpipe queue is empty, which we use as an indicator that it is safe to proceed. However, we are looking to move to more frequent extractions, where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work, but it is not clear to me that it is an advisable way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link), however we're talking about pausing for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is if the MERGE and DELETE could be contained in a transaction it may provide a safe way to process the incoming data throughout the day but I'm not sure if that's true.
Add in a third "processing" table and a task to swap the landing table with the processing table. The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At this point the MERGE and TRUNCATE would work off the processing table while the landing table continues to receive the incoming records (a rough sketch of the swap is included below).
Any additional insights into these options or completely different suggestions are very welcome.
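To make the table-swap option concrete, here is a rough Snowflake sketch; the raw.landing_events and raw.processing_events names are hypothetical, and the guard plus the MERGE/TRUNCATE would live in whatever procedure the scheduled task calls:

    -- The guard described above: only proceed when landing has rows and processing is empty.
    SELECT (SELECT COUNT(*) FROM raw.landing_events) > 0
       AND (SELECT COUNT(*) FROM raw.processing_events) = 0 AS safe_to_swap;

    -- Atomically exchange the two tables, so Snowpipe keeps loading into the
    -- (now empty) landing table while the swapped-out batch is merged and truncated.
    ALTER TABLE raw.landing_events SWAP WITH raw.processing_events;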
Look into table streams, which record insertions/updates/deletions to your Snowpipe table. You can then merge off the stream into your target table, which resets the stream's offset. Use a task to run your MERGE statement. Also, given it is Snowpipe, when creating your stream it is probably best to use an append-only stream.
However, I had a question here where, in some circumstances, we were missing some rows. Our task was set to 1-minute intervals, which may be partly the reason, but I never did get to the bottom of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a SELECT on the stream before the MERGE, seems to have solved the issue, i.e. no more missing rows.
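A minimal sketch of that stream-plus-task setup, with made-up table, stream, task, and warehouse names (the real merge keys and columns depend on your schema):

    -- Append-only stream over the Snowpipe landing table.
    CREATE OR REPLACE STREAM landing_stream ON TABLE raw.landing_events APPEND_ONLY = TRUE;

    -- Task that merges new rows whenever the stream has data; consuming the
    -- stream inside the MERGE advances its offset.
    CREATE OR REPLACE TASK merge_landing_task
      WAREHOUSE = etl_wh
      SCHEDULE = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('landing_stream')
    AS
    MERGE INTO analytics.events AS t
    USING landing_stream AS s
      ON t.event_id = s.event_id
    -- (if one batch can contain duplicate event_ids, dedupe the USING source first)
    WHEN MATCHED THEN UPDATE SET t.payload = s.payload
    WHEN NOT MATCHED THEN INSERT (event_id, payload) VALUES (s.event_id, s.payload);

    ALTER TASK merge_landing_task RESUME;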

Can you Delete in a replication based Distributed Database?

I have thus far been living under the impression that you cannot truly delete a row in a replication-based distributed database. It all works well in a copy-based one, but with replication you mark rows as "consider this deleted" and filter them out in every last query; you do not ever actually delete anything from the DB. I think it is time to verify whether that assumption is true.
My understanding is that you would run into a Race Condition with the Replication if there was ever a key collision. It goes something like this:
Database A:
Adds an Entry under Key 11 (11A)
Database B:
Adds an Entry under Key 11 (11B)
Database A:
Deletes the Entry under Key 11
Now it depends in which Order these 3 operations "meet" in the wild:
The expected order would be:
11A Create
11 Delete (which means 11A)
11B Create
But what if this happens instead?
11A Create
11B Create (fails, already a key 11)
11 Delete
Or even worse, this?
11B Create
11A Create (fails, already a key 11)
11 Delete (which will hit 11B)
I'll assume that we are talking about a leaderless distributed database, that is, one where all nodes play the same role (there is no master), so reads and writes can both be served by any node. Otherwise, if there is a single master, it can impose a specific ordering on all the writes/deletes and thus resolve the concurrency problem you are describing.
But in Replication you mark them as "consider this deleted" and filter them out in every last query.
That's right and it's done for 2 main reasons:
correctness: if items were deleted outright instead of tombstoned, there could be an ambiguous situation where two nodes are consulted and node A has the item but node B does not. The system as a whole cannot distinguish whether that item was deleted (but the delete failed in A) or whether the item was recently created (but the create failed in B). With tombstones, this distinction can be made clearly.
performance: most of these systems do not perform in-place updates (as RDBMSs usually do), but instead perform append-only operations. That is done to improve performance, since random-access operations on disk are much slower than sequential ones. As a result, performing deletes via tombstones aligns well with this approach.
But you do not ever actually delete anything from the DB.
That is not necessarily true. Usually, the tombstones are eventually removed from the database (in a garbage-collection fashion). Eventually here means that they are deleted when the system can be sure that the example described above cannot happen anymore for these items (because the deletes have propagated to all the nodes).
My understanding is that you would run into a Race Condition with the Replication if there was ever a key collision
That's right for most distributed systems of this kind: the result will depend on the order in which the operations reached the database. However, some of these databases provide alternative mechanisms, such as conditional writes/deletes. With these, you can delete only a specific version of an item, or update an item only if its version is a specific one (thus aborting the update if someone else updated it in the meantime). Examples of operations of this kind in Cassandra are conditional deletes and the so-called lightweight transactions.
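As a small CQL illustration of that idea (the table and its version column are hypothetical):

    -- Only delete if this exact version is still current; if another writer has
    -- bumped the version in the meantime, the result comes back with [applied] = false.
    DELETE FROM myks.items WHERE item_id = 11 IF version = 3;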
Below are some references that describe how Riak and Cassandra perform deletes, which contain a lot of information around tombstones as well:
Riak: Object deletion
About deletes and tombstones in Cassandra

In Oracle, is it safe to drop tables containing a large amount of data?

I have a production Oracle database which contains a large amount of data backed up in tables which were made during previous work. The tables are independent of each other and the rest of the database.
I want to remove these backups, preferably in one shot. I know that in more recent versions of Oracle dropped tables don't actually get dropped until they are purged from the recycle bin; I will take care of that.
Is it safe to DROP them all at once? Is there a performance penalty during the DROP operation? Is there a chance to run out of resources during the operation?
What is the safest way to do this?
It's probably safe to drop them all at once.
In general, dropping a table is very quick regardless of the size of the table. DROP doesn't really change any data, Oracle just changes the data dictionary to mark the space as available. I've dropped lots of tables with hundreds of gigabytes or more of data and never had a problem. (Your datafiles may not be sized properly anymore, but that's another issue.)
Other than dependencies and locks, the only time I've ever seen a drop take a (relatively) long time was because of delayed block cleanout. Basically, if you update, delete, or insert (without append) a lot of data, Oracle may write some transaction data to the blocks. The reason for this is to make COMMIT instantaneous, but it means that the next query that even reads from the table may have to clean up the old transaction entries.
But your chances of running into that problem are small. If you have very large tables they were probably created with direct-path inserts, or someone else has already queried the table and cleaned out the blocks. Even in the worst case, if your system was good enough to write the data it will probably be good enough to get rid of it (although you could run into ORA-01555 snapshot too old if the transactions are too old, or run out of archive log space because of the extra redo from delayed block cleanout, etc.).
If the tables have no dependents and are not in use, it's safe to drop them all at once. If you are worried about the recycle bin feature, you can do "drop table table_name purge" and it will bypass the recycle bin entirely, so you won't have to purge the tables from the recycle bin afterwards.
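For example, a minimal sketch (the table name here is invented):

    -- Drop and skip the recycle bin in one step
    DROP TABLE backup_orders_2019 PURGE;

    -- Or, for a table that was already dropped the normal way
    PURGE TABLE backup_orders_2019;

    -- Or empty your whole recycle bin afterwards
    PURGE RECYCLEBIN;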

Do triggers decrease performance? Inserted and deleted tables?

Suppose I have stored procedures which perform insert/update/delete operations on a table.
Depending upon some criteria, I want to perform some additional operations.
Should I create a trigger, or do the operation in the stored procedure itself?
Does using triggers decrease performance?
Do these two tables, viz. Inserted and Deleted, exist persistently, or are they created dynamically?
If they are created dynamically, is there a performance issue?
If they are persistent tables, then where are they?
Also, if they exist, can I access the Inserted and Deleted tables in stored procedures?
Will it be less performant than doing the same thing in a stored proc? Probably not, but as with all performance questions the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000-record table, don't test with a table of 100 records!).
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger, though, there are several things you should be aware of. First, the trigger fires once per triggering statement, so whether that statement inserted one record or 100,000 records the trigger only fires once. You cannot ever assume that only one record will be affected, nor can you assume that it will always be a small record set. This is why it is critical to write all triggers as if you are going to insert, update, or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also, do not send emails from a trigger; you do not want to stop all inserts, updates, or deletes if the email server is down.
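To illustrate the set-based style described above, here is a hedged sketch of an audit trigger; the tables and columns are invented, not taken from the question:

    -- Hypothetical audit trigger: one set-based statement handles any number of affected rows.
    CREATE TRIGGER trg_Orders_Audit
    ON dbo.Orders
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.OrdersAudit (OrderID, ChangeType, ChangedAt)
        -- Rows in "inserted" cover both INSERTs and the new side of UPDATEs.
        SELECT i.OrderID, 'INSERT_OR_UPDATE', SYSUTCDATETIME()
        FROM inserted AS i
        UNION ALL
        -- Rows only in "deleted" are true deletes.
        SELECT d.OrderID, 'DELETE', SYSUTCDATETIME()
        FROM deleted AS d
        WHERE NOT EXISTS (SELECT 1 FROM inserted AS i2 WHERE i2.OrderID = d.OrderID);
    END;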
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
The Inserted and Deleted tables are only available within the trigger, so accessing them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to manually be doing whatever the trigger is doing anyway then they increase performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? The trigger will perform an update on the DB after the event, so the user of your system won't even know it's going on; it happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your operation is important and must never be missed, then you have 2 choices:
Execute your operation immediately after Update/Delete with durability
Delay the operation by making it loosely coupled with durability.
We also faced the same issue: our production MSSQL 2016 DB is > 1 TB with > 500 tables, and we needed to send changes (insert, update, delete) of a few columns from 20 important tables to a 3rd party. The number of business processes that update those few columns in the 20 important tables was > 200, and modifying them all is a tedious task because it's a legacy application. Our existing processes had to keep working without any dependency on the data sharing, the order of data sharing was important, and FIFO had to be maintained.
For example, a user's mobile number 123-456-789 changes to 123-456-123 and then again to 123-456-456.
The order of sending is 123-456-789 --> 123-456-123 --> 123-456-456, and a subsequent request can only be sent if the response to the previous request was successful.
We created 20 new tables with only the columns we want. We compare each main table with its new table (MainTable1 JOIN MainTale_LessCol1) using a checksum of all columns plus a timestamp column to identify changes.
Changes are logged in APIrequest tables and written back into MainTale_LessCol1. This logic runs in a scheduled job every 15 minutes.
A separate process picks rows from APIrequest and sends the data to the 3rd party.
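A rough sketch of that checksum comparison, with invented column names standing in for the few tracked columns:

    -- Log rows whose tracked columns changed (or that are new) into the request table.
    INSERT INTO dbo.APIrequest (CustomerID, MobileNo, CapturedAt)
    SELECT m.CustomerID, m.MobileNo, SYSUTCDATETIME()
    FROM dbo.MainTable1 AS m
    LEFT JOIN dbo.MainTale_LessCol1 AS c
           ON c.CustomerID = m.CustomerID
    WHERE c.CustomerID IS NULL
       OR CHECKSUM(m.MobileNo, m.Email) <> CHECKSUM(c.MobileNo, c.Email);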
We explored:
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, cumulative changes on those 20 tables were > 1000/sec, and our system was already at peak capacity, our current design works for us.
You can try CDC and share your experience.

How to optimize a table for fast inserts only?

I have a log table that will receive inserts from several web apps. I won't be doing any searching/sorting/querying of this data; I will be pulling the data out to another database to run reports. The initial table is strictly for RECEIVING the log messages.
Is there a way to ensure that the web applications don't have to wait on these inserts? For example I know that adding a lot of indexes would slow inserts, so I won't. What else is there? Should I not add a primary key? (Each night the table will be pumped to a reports DB which will have a lot of keys/indexes)
If performance is key, you may not want to write this data to a database. I think most everything will process a database write as a round-trip, but it sounds like you don't want to wait for the returned confirmation message. Check if, as S. Lott suggests, it might not be faster to just append a row to a simple text file somewhere.
If the database write is faster (or necessary, for security or other business/operational reasons), I would put no indexes on the table--and that includes a primary key. If it won't be used for reads or updates, and if you don't need relational integrity, then you just don't need a PK on this table.
To recommend the obvious: as part of the nightly reports run, clear out the contents of the table. Also, never reset the database file sizes (ye olde shrink database command); after a week or so of regular use, the database files should be as big as they'll ever need to be and you won't have to worry about the file growth performance hit.
Here are a few ideas; note that for the last ones to matter you would need extremely high volumes:
do not have a primary key; a primary key is enforced via an index
do not have any other index
Create the database large enough that you do not have any database growth
Place the database on its own disk to avoid contention
Avoid software RAID
place the database on a mirrored disk; this saves the parity calculation done on RAID 5
No keys,
no constraints,
no validation,
no triggers,
No calculated columns
If you can, have the services insert async, so as to not wait for the results (if that is acceptable).
You can even try to insert into a "daily" table, which should then have fewer records, and then move this across before the batch runs at night.
But mostly, on the table: NO KEYS/validation (PK and unique indexes will kill you); see the sketch below.
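Pulling those suggestions together, a hedged T-SQL sketch of such a bare receiving table (the names are illustrative, not from the question):

    -- A plain heap: no primary key, no indexes, no constraints, no triggers,
    -- no computed columns; nothing for the insert to maintain beyond the row itself.
    CREATE TABLE dbo.AppLogLanding
    (
        LoggedAt  datetime2(3),
        Source    varchar(100),
        Message   nvarchar(4000)
    );

    -- The nightly job pumps the rows to the reporting DB and then clears the table:
    -- TRUNCATE TABLE dbo.AppLogLanding;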
