Can you Delete in a replication based Distributed Database?

I have thus far been living under the impression that you cannot truly delete a row in a replication-based distributed database. It all works well in a copy-based one. But in Replication you mark them as "consider this delete" and filter them out in every last query. But you do not ever actually delete something from the DB. I think it is time to verify if that assumption is true.
My understanding is that you would run into a Race Condition with the Replication if there was ever a key collision. It goes something like this:
Database A:
Adds an entry under key 11 (11A)
Database B:
Adds an entry under key 11 (11B)
Database A:
Deletes the entry under key 11
Now it depends on the order in which these 3 operations "meet" in the wild:
The expected order would be:
11A Create
11 Delete (which means 11A)
11B Create
But what if this happens instead?
11A Create
11B Create (fails, already a key 11)
11 Delete
Or even worse, this?
11B Create
11A Create (fails, already a key 11)
11 Delete (which will hit 11B)

I'll assume that we are talking about a leaderless distributed database, that is one where all nodes play the same role (there is no master), so reads and writes can both be served by all nodes. Otherwise, if there's a single master, it can impose a specific ordering on all the writes/deletes and thus resolve the concurrency problem you are describing.
But in Replication you mark them as "consider this delete" and filter them out in every last query.
That's right and it's done for 2 main reasons:
correctness: if items were deleted outright instead of tombstoned, there could be an ambiguous case in which 2 nodes are consulted and node A has the item but node B does not. The system as a whole cannot distinguish whether that item was deleted (but the delete failed on A) or whether it was recently created (but the create failed on B). With tombstones, this distinction can be made clear.
performance: most of these systems do not perform in-place updates (as RDBMS databases usually do), but instead perform append-only operations. That's done to improve performance, since random-access operations on disk are much slower than sequential ones. Writing a delete as a tombstone is just another append, so it aligns well with this approach.
But you do not ever actually delete something from the DB.
That is not necessarily true. Usually, the tombstones are eventually removed from the database (in a garbage-collection fashion). Eventually here means that they are deleted when the system can be sure that the example described above cannot happen anymore for these items (because the deletes have propagated to all the nodes).
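A minimal sketch of why a tombstone removes the ambiguity, assuming a simple last-write-wins merge on a per-write timestamp. The `Cell` type, `merge`, `can_purge`, and the `gc_grace` parameter are hypothetical names for illustration, not the API of Riak or Cassandra; `gc_grace` stands in for "long enough that the delete has reached every node".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One replica's view of a key: a value or a tombstone, plus a write timestamp."""
    timestamp: int
    value: Optional[str] = None
    deleted: bool = False  # True marks a tombstone


def merge(a: Optional[Cell], b: Optional[Cell]) -> Optional[Cell]:
    """Last-write-wins merge of two replica views (an illustrative assumption)."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a.timestamp >= b.timestamp else b


def can_purge(cell: Cell, now: int, gc_grace: int) -> bool:
    """A tombstone may be dropped once it is older than the grace period."""
    return cell.deleted and now - cell.timestamp > gc_grace


# Node A still holds the item; node B received the delete as a tombstone.
node_a = Cell(timestamp=10, value="row-11")
node_b = Cell(timestamp=20, deleted=True)

# The newer tombstone wins, so the row stays deleted after the merge.
print(merge(node_a, node_b))

# Without a tombstone, node B would simply hold nothing, and the merge
# cannot tell "deleted" apart from "not yet replicated": the row comes back.
print(merge(node_a, None))
```

Purging the tombstone too early would bring back exactly the second case, which is why removal waits for the grace period.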
My understanding is that you would run into a Race Condition with the Replication if there was ever a key collision
That's right for most distributed systems of that kind. The result will depend on the order in which the operations reach the database. However, some of these databases provide alternative mechanisms, such as conditional writes/deletes. With these, you can delete only a specific version of an item, or update an item only if its version is a specific one (thus aborting the update if someone else updated it in the meantime). Examples of such operations in Cassandra are conditional deletes and the so-called lightweight transactions.
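The idea behind a conditional delete can be sketched with a toy, single-process store in which every successful write gets a unique version token. `VersionedStore`, `create_if_absent`, and `delete_if_version` are hypothetical names, not Cassandra's API; in a real cluster the check and the write have to be made atomic across replicas, which is the job of lightweight transactions.

```python
import uuid
from typing import Dict, Optional, Tuple


class VersionedStore:
    """Toy key-value store: each stored item carries a unique version token."""

    def __init__(self) -> None:
        self._data: Dict[int, Tuple[str, str]] = {}  # key -> (token, value)

    def create_if_absent(self, key: int, value: str) -> Optional[str]:
        """Insert only if the key is free; return the new token, or None."""
        if key in self._data:
            return None
        token = uuid.uuid4().hex
        self._data[key] = (token, value)
        return token

    def delete_if_version(self, key: int, expected_token: str) -> bool:
        """Delete only if the stored token is the one the caller wrote."""
        current = self._data.get(key)
        if current is None or current[0] != expected_token:
            return False
        del self._data[key]
        return True

    def get(self, key: int) -> Optional[Tuple[str, str]]:
        return self._data.get(key)


# Replay the "even worse" ordering from the question.
store = VersionedStore()
token_b = store.create_if_absent(11, "11B")  # B's create arrives first
token_a = store.create_if_absent(11, "11A")  # A's create fails: key 11 is taken

# A never obtained a token for key 11, so its delete cannot hit 11B.
deleted_by_a = token_a is not None and store.delete_if_version(11, token_a)
print(deleted_by_a, store.get(11) is not None)
```

Because A's delete is tied to the version A wrote, the failed create also disarms the delete, and 11B survives.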
Below are some references that describe how Riak and Cassandra perform deletes, which contain a lot of information around tombstones as well:
Riak: Object deletion
About deletes and tombstones in Cassandra

Related

Updating database keys where one table's keys refer to another's

I have two tables in DynamoDB. One has data about homes, one has data about businesses. The homes table has a list of the closest businesses to it, with walking times to each of them. That is, the homes table has a list of IDs which refer to items in the businesses table. Since businesses are constantly opening and closing, both these tables need to be updated frequently.
The problem I'm facing is that, when either one of the tables is updated, the other table will have incorrect data until it is updated itself. To make this clearer: let's say one business closes and another one opens. I could update the businesses table first to remove the old business and add the new one, but the homes table would then still refer to the now-removed business. Similarly, if I updated the homes table first to refer to the new business, the businesses table would not yet have this new business' data. Whichever table I update first, there will always be a period of time where the two tables are not in sync.
What's the best way to deal with this problem? One way I've considered is to do all the updates to a secondary database and then swap it with my primary database, but I'm wondering if there's a better way.
Thanks!
Dynamo only offers atomic operations on the item level, not transaction level, but you can have something similar to an atomic transaction by enforcing some rules in your application.
Let's say you need to run a transaction with two operations:
Delete Business(id=123) from the table.
Update Home(id=456) to remove association with Business(id=123) from the home.businesses array.
Here's what you can do to mimic a transaction:
Generate a timestamp for locking the items
Let's say our current timestamp is 1234567890. Using a timestamp will allow you to clean up failed transactions (I'll explain later).
Lock the two items
Update both Business-123 and Home-456 and set an attribute lock=1234567890.
Do not change any other attributes yet on this update operation!
Use a ConditionExpression (check the Developer Guide and API) to verify that attribute_not_exists(lock) before updating. This way you're sure there's no other process using the same items.
Handle update lock responses
Check if both updates succeeded to Home and Business. If yes to both, it means you can proceed with the actual changes you need to make: delete the Business-123 and update the Home-456 removing the Business association.
For extra care, also use a ConditionExpression in both updates again, but now ensuring that lock == 1234567890. This way you're extra sure no other process overwrote your lock.
If both updates succeed again, you can consider the two items updated and consistent to be read by other processes. To do this, run a third update removing the lock attribute from both items.
When one of the operations fails, you may retry up to X times, for example. If it fails all X times, make sure the process cleans up the other lock that succeeded previously.
Enforce the transaction lock throughout your code
Always use a ConditionExpression in any part of your code that may update/delete Home and Business items. This is crucial for the solution to work.
When reading Home and Business items, you'll need to do this (this may not be necessary in all reads, you'll decide if you need to ensure consistency from start to finish while working with an item read from DB):
Retrieve the item you want to read
Generate a lock timestamp
Update the item with lock=timestamp using a ConditionExpression
If the update succeeds, continue using the item normally; if not, wait one or two seconds and try again;
When you're done, update the item removing the lock
Regularly clean up failed transactions
Every minute or so, run a background process to look for potentially failed transactions. If your processes take at most 60 seconds to finish and there's an item with a lock value older than, say, 5 minutes (remember that the lock value is the time the transaction started), it's safe to say that this transaction failed at some point and whatever process was running it didn't properly clean up the locks.
This background job ensures that no items stay locked forever.
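The locking steps above can be sketched against an in-memory stand-in for the tables. `FakeTable`, `acquire_lock`, `release_lock`, `run_transaction`, and `clear_stale_locks` are all hypothetical helpers; a real implementation would issue DynamoDB update and delete calls with a ConditionExpression instead of calling Python functions on a dictionary.

```python
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]


class FakeTable:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self, items: Dict[str, Item]) -> None:
        self.items = items

    def update_if(self, key: str, condition: Callable[[Item], bool],
                  change: Callable[[Item], Any]) -> bool:
        """Apply `change` only when `condition` holds for the stored item."""
        item = self.items.get(key)
        if item is None or not condition(item):
            return False
        change(item)
        return True

    def delete_if(self, key: str, condition: Callable[[Item], bool]) -> bool:
        """Delete the item only when `condition` holds for it."""
        item = self.items.get(key)
        if item is None or not condition(item):
            return False
        del self.items[key]
        return True


def acquire_lock(table: FakeTable, key: str, stamp: int) -> bool:
    # Mirrors attribute_not_exists(lock): only lock an unlocked item.
    return table.update_if(key, lambda i: "lock" not in i,
                           lambda i: i.update(lock=stamp))


def release_lock(table: FakeTable, key: str, stamp: int) -> bool:
    # Only the holder of this exact lock value may remove it.
    return table.update_if(key, lambda i: i.get("lock") == stamp,
                           lambda i: i.pop("lock"))


def run_transaction(table: FakeTable, keys: List[str], stamp: int,
                    work: Callable[[FakeTable], None]) -> bool:
    """Lock every key, run `work`, then unlock; undo the locks on failure."""
    held: List[str] = []
    for key in keys:
        if not acquire_lock(table, key, stamp):
            for k in held:  # clean up the locks that did succeed
                release_lock(table, k, stamp)
            return False
        held.append(key)
    work(table)
    for key in held:
        release_lock(table, key, stamp)  # no-op for items `work` deleted
    return True


def clear_stale_locks(table: FakeTable, now: int, max_age: int) -> List[str]:
    """Background cleanup: drop locks older than `max_age` seconds."""
    cleared = []
    for key, item in table.items.items():
        lock = item.get("lock")
        if isinstance(lock, int) and now - lock > max_age:
            del item["lock"]
            cleared.append(key)
    return cleared


table = FakeTable({
    "Business-123": {"name": "Cafe"},
    "Home-456": {"businesses": [123, 789]},
})
STAMP = 1234567890


def close_business(t: FakeTable) -> None:
    holds_lock = lambda i: i.get("lock") == STAMP  # re-check our own lock
    t.update_if("Home-456", holds_lock, lambda i: i["businesses"].remove(123))
    t.delete_if("Business-123", holds_lock)


ok = run_transaction(table, ["Business-123", "Home-456"], STAMP, close_business)
print(ok, table.items)
```

The sketch keeps the same weakness as the approach it illustrates: if the process dies between the two changes inside `close_business`, the items stay half-updated until someone repairs them.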
Beware that this implementation does not guarantee a truly atomic and consistent transaction in the sense that traditional ACID DBs do. If this is mission critical for you (e.g. you're dealing with financial transactions), do not attempt to implement this. Since you said you're ok if atomicity is broken on rare failure occasions, you may live with it happily. ;)
Hope this helps!

Cassandra DELETEs with or without IF EXISTS

I have a situation where I do not know if data exists in a set of tables. So, as of now, I am issuing DELETEs on all those tables. So, a single API call is resulting in about 30-50 DELETEs in Cassandra. Recently, it is so happening that most of the DELETEs are being issued on non-existent data. Would Cassandra's performance still be negatively affected because of the millions of DELETEs on data that does not exist? Should I use 'IF EXISTS' while deleting data that I am unsure if it exists or not?
It's better to just issue a regular delete without IF EXISTS, because with IF EXISTS the coordinator switches to serial consistency and the Paxos protocol, which takes longer and makes other nodes run in batches etc. Like IF NOT EXISTS, IF EXISTS is a lightweight transaction, and these should make up around 1% of the workload, not something you do regularly.
Still, you don't want to have a lot of tombstones around (which is what a delete creates), so it depends on how you model your data and how you do deletes. I'll be more than happy to provide insight on that if you give some schema, insert and delete statements ;)
IF EXISTS will just fail if the row doesn't exist.
Deletes do affect performance, but deleting a nonexistent row will do nothing (apart from searching for the row); it will not create tombstones for columns that aren't there.

What type of fact table / loading solution for a reservation system?

Background
I am designing a Data Warehouse with SQL Server 2012 and SSIS. The source system handles hotel reservations. The reservations are split between two tables, header and header line item. The Fact table would be at the line item level with some data from the header.
The issue
The challenge I have is that the reservation (and its line items) can change over time.
An example would be:
The booking is created.
A room is added to the booking (as a header line item).
The customer arrives and adds food/drinks to their reservation (more line items).
A payment is added to the reservation (as a line item).
A room could be subsequently cancelled and removed from the booking (a line item is deleted).
The number of people in a room can change, affecting that line item.
The booking status changes from "Provisional" to "Confirmed" at a point in its life cycle.
Those last three points are key, I'm not sure how I would keep that line updated without looking up the record and updating it. The business would like to keep track of the updates and deletions.
I'm resisting updating because:
From what I've read about Fact tables, it's not good practice to revisit rows once they've been written into the table.
I could do this with a look-up component but with upward of 45 million rows, is that the best approach?
The questions
What type of Fact table or loading solution should I go for?
Should I be updating the records, if so how can I best do that?
I'm open to any suggestions!
Additional Questions (following answer from ElectricLlama):
The fact does have a 1:1 relationship with the source. You talk about possible constraints on future development. Would you be able to elaborate on the type of constraints I would face?
Each line item will have a modified (and created) date. Are you saying that I should delete all records from the fact table which have been modified since the last import and add them again (sounds logical)?
If the answer to 2 is "yes" then for auditing purposes would I write the current fact records to a separate table before deleting them?
In point one, you mention deleting/inserting the last x days bookings based on reservation date. I can understand inserting new bookings. I'm just trying to understand why I would delete?
If you effectively have a 1:1 between the source line and the fact, and you store a source system booking code in the fact (no dimensional modelling rules against that) then I suggest you have a two step load process.
delete/insert the last x days bookings based on reservation date (or whatever you consider to be the primary fact date),
delete/insert based on all source booking codes that have changed (you will of course have to know this beforehand)
You just need to consider what constraints this puts on future development, i.e. when you get additional source systems to add, you'll need to maintain the 1:1 fact to source line relationship to keep your load process consistent.
I've never updated a fact record in a dataload process, but always delete/insert a certain data domain (i.e. that domain might be trailing 20 days or source system booking code). This is effectively the same as an update but also takes care of deletes.
With regards to auditing changes in the source, I suggest you write that to a different table altogether, not the main fact, as its purpose will be audit, not analysis.
The requirement to identify changed records in the source (for data loads and auditing) implies you will need to create triggers and log tables in the source or enable native SQL Server CDC if possible.
At all costs avoid using the SSIS lookup component as it is totally ineffective and would certainly be unable to operate on 45 million rows.
Stick with the 'insert/delete a data portion' approach, as it lends itself to SSIS's ability to insert/delete (and its inability to efficiently update or look up).
In answer to the follow up questions:
1:1 relationship in fact
What I'm getting at is you have no visibility on any future systems that need to be integrated, or any visibility on what future upgrades to your existing source system might do. This 1:1 mapping introduces a design constraint (it's not really a constraint, more a framework). Thinking about it, any new system does not need to follow this particular load design, as long as its data arrives in the fact consistently. I think implementing this 1:1 design is a good idea, just trying to consider any downside.
If your source has a reliable modified date then you're in luck as you can do a differential load - only load changed records. I suggest you:
Load all recently modified records (last 5 days?) into a staging table
Do a DELETE/INSERT based on the record key. Do the delete inside SSIS in an execute SQL task, don't mess about with feeding data flows into row-by-row delete statements.
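The staging-based DELETE/INSERT boils down to two set-based statements. The sketch below uses Python's built-in `sqlite3` module as a stand-in for SQL Server purely so that it runs anywhere; the table and column names (`fact_line`, `staging_line`, `line_key`) and the sample rows are hypothetical, and in SSIS the two statements would sit in an Execute SQL task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_line    (line_key INTEGER PRIMARY KEY, guests INTEGER, status TEXT);
    CREATE TABLE staging_line (line_key INTEGER PRIMARY KEY, guests INTEGER, status TEXT);
    INSERT INTO fact_line VALUES (1, 2, 'Provisional'), (2, 1, 'Confirmed');
    -- Staging holds only recently modified source rows: one changed, one new.
    INSERT INTO staging_line VALUES (1, 3, 'Confirmed'), (3, 2, 'Provisional');
""")


def reload_changed_rows(connection: sqlite3.Connection) -> None:
    """Replace every fact row whose key appears in staging, set-based."""
    with connection:  # commit both statements together, roll back on error
        connection.execute(
            "DELETE FROM fact_line "
            "WHERE line_key IN (SELECT line_key FROM staging_line)"
        )
        connection.execute(
            "INSERT INTO fact_line "
            "SELECT line_key, guests, status FROM staging_line"
        )


reload_changed_rows(conn)
rows = conn.execute(
    "SELECT line_key, guests, status FROM fact_line ORDER BY line_key"
).fetchall()
print(rows)
```

Running the load twice leaves the fact table unchanged, which is what makes an overlapping load window safe. A line item that was deleted in the source never shows up in staging, so this step alone would not remove it from the fact.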
Audit table:
The simplest and most accurate way to do this is simply implement triggers and logs in the source system and keep it totally separate to your star schema.
If you do want this captured as part of your load process, I suggest you do a comparison between your staging table and the existing audit table and only write new audit rows, i.e. reservation X last modified date in the audit table is 2 Apr, but reservation X last modified date in the staging table is 4 Apr, so write this change as a new record to the audit table. Note that if you do a daily load, any changes in between won't be recorded, that's why I suggest triggers and logs in the source.
DELETE/INSERT records in Fact
This is more about ensuring you have an overlapping window in your load process, so that if the process fails for a couple of days (as they always do), you have some contingency there, and it will seamlessly pick the process back up once it's working again. This is not so important in your case as you have a modified date to identify differential changes, but normally, for example, I would pick a transaction date and delete, say, 7 trailing days. This means that my load process can be broken for 6 days, and if I fix it by the seventh day everything will reload properly without needing extra intervention to load the intermediate days.
I would suggest having a deleted flag and update that instead of deleting. Your performance will also be better.
This will enable you to perform an analysis on how the reservations are changing over a period of time. You will need to ensure that this flag is used in all the analysis to ensure that there is no confusion.

DB Fk/Pk keys performance

In our DB we have a single centric table with millions of rows that is constantly being inserted and updated.
This table has a single column acting as the unique identifier, which is used to link the content of this table with multiple tables in a one-to-many relation.
This means that when inserting an entry into, say, the USERS table, USERS_PETS and USERS_PARENTS (and 10 more) will also be populated in the same transaction, with multiple rows, based on the same unique identifier from the main table.
Since the application using this DB is constantly inserting new entries and updating existing ones, the relation between these tables is kept only at the application level (i.e. a logical ERD instead of handling this via FK/PK declarations).
Questions:
Is it correct to assume that, from a pure performance point of view, this is the best approach?
Is there a way to set these keys (so that the DB will be more self-descriptive) without impacting performance?
This is the worst possible approach and I guarantee you will have data integrity issues eventually. Data integrity is far more critical than performance. This is stupid and short-sighted.
No, for the same reason we use seatbelts in cars even when we are in a hurry. The difference is negligible and totally not worth it.
Some specific dbms vendors may offer a way of declaring constraints while not enforcing them. In Oracle for example, you can specify the Integrity Constraint State as DISABLE NOVALIDATE.
You base data integrity on hope. Hope doesn't scale well.
And there's no such thing as "pure performance point of view". Unless, that is, you never read from the database. If you only insert, never update, never delete, and never read, you can make a case that there exists a "pure performance point of view". But if you ever update, delete, or read, then performance isn't a point--it's more like a surface or a solid, and all you can do is move the balancing point around among inserts, updates, deletes, and reads.
And, because somebody reading this still won't get it, the most critical part of read performance is getting back the right answer. If you can't guarantee the right answer, sensible people won't care how marginally faster your inserts are.

Do triggers decreases the performance? Inserted and deleted tables?

Suppose I have stored procedures which perform insert/update/delete operations on a table.
Depending upon some criteria, I want to perform some operations.
Should I create a trigger or do the operation in the stored procedure itself?
Does using triggers decrease performance?
Do the two tables, Inserted and Deleted, exist persistently or are they created dynamically?
If they are created dynamically, does that cause a performance issue?
If they are persistent tables, then where are they?
Also, if they exist, can I access the Inserted and Deleted tables in stored procedures?
Will it be less performant than doing the same thing in a stored proc? Probably not, but as with all performance questions, the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000 record table, don't test with a table with 100 records!).
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger, though, there are several things you should be aware of. First, the trigger fires once per statement, so whether you inserted one record or 100,000 records, the trigger only fires once. You can never assume that only one record will be affected. Nor can you assume that it will always be a small record set. This is why it is critical to write all triggers as if you are going to insert, update or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also, do not send emails from a trigger; you do not want to stop all inserts, updates, or deletes if the email server is down.
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
Inserted and deleted tables are available within the trigger, so calling them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to manually be doing whatever the trigger is doing anyway then they increase performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? The trigger will perform an update on the DB after the event, so the user of your system won't even know it's going on. It happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your operation is important and must never be missed, then you have 2 choices:
Execute your operation immediately after the update/delete, with durability.
Delay the operation by making it loosely coupled, with durability.
We faced the same issue. Our production MSSQL 2016 DB is > 1 TB with > 500 tables, and we need to send changes (insert, update, delete) to a few columns of 20 important tables to a 3rd party. More than 200 business processes update those few columns in the 20 important tables, and modifying them all is a tedious task because it's a legacy application. Our existing processes must keep working without any dependency on the data sharing. The order of data sharing matters: FIFO must be maintained.
E.g. a user's mobile no. is 123-456-789, it changes to 123-456-123 and then changes again to 123-456-456.
The order of sending is 123-456-789 --> 123-456-123 --> 123-456-456. A subsequent request can only be sent if the response to the previous request was successful.
We created 20 new tables with only the columns that we want. We compare each main table with its new table (MainTable1 JOIN MainTale_LessCol1) using a checksum of all columns and a TimeStamp column to identify changes.
Changes are logged in APIrequest tables and updated back in MainTale_LessCol1. This logic runs in a scheduled job every 15 min.
A separate process picks from APIrequest and sends the data to the 3rd party.
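A rough sketch of the checksum comparison and the ordered hand-off described above. It uses dictionaries in place of the main table and its reduced-column copy, and a `deque` in place of the APIrequest table; `row_checksum`, `detect_changes`, and the `mobile` column are hypothetical, and the real job runs as SQL inside SQL Server rather than in Python.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, Tuple

Row = Dict[str, str]


def row_checksum(row: Row) -> str:
    """Stable checksum over the tracked columns of one row."""
    joined = "|".join(f"{col}={row[col]}" for col in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def detect_changes(main: Dict[int, Row], shadow: Dict[int, Row],
                   queue: Deque[Tuple[int, Row]]) -> int:
    """Queue rows whose checksum differs from the copy, then sync the copy."""
    changed = 0
    for key in sorted(main):
        row = main[key]
        if key not in shadow or row_checksum(shadow[key]) != row_checksum(row):
            queue.append((key, dict(row)))  # appended in detection order (FIFO)
            shadow[key] = dict(row)         # "updated back" into the copy
            changed += 1
    return changed


main_table: Dict[int, Row] = {1: {"mobile": "123-456-789"}}
shadow_table: Dict[int, Row] = {}
api_requests: Deque[Tuple[int, Row]] = deque()

detect_changes(main_table, shadow_table, api_requests)   # first scheduled run
main_table[1]["mobile"] = "123-456-123"
detect_changes(main_table, shadow_table, api_requests)   # second scheduled run
unchanged = detect_changes(main_table, shadow_table, api_requests)

# The sender drains the queue from the left, preserving the order of changes.
sent = [api_requests.popleft()[1]["mobile"] for _ in range(len(api_requests))]
print(sent, unchanged)
```

With this polling design, several changes to the same row between two runs collapse into a single queued request, and rows deleted from the main table are not detected by the sketch.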
We Explored
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, cumulative changes on those 20 tables were > 1000/sec, and our system was already at peak capacity, we went with our current design, and it works.
You can try CDC and share your experience.
