Is my distributed database C+A or A+P?

Suppose a distributed database has a recovery mechanism such that, when a network partition splits it into two sides that can both keep working, both sides continue to commit transactions and, when the partition ends, they agree to roll back to the last common state.
Should we say that this system:
is A+P, as it has a built-in (not very clever) mechanism to deal with partitions, and consistency is sort of broken by the fact that a commit which has already been made can be undone?
is C+A, as it does not really deal with partitions, and the state is always consistent (never a weird mixed state recorded)?
or some other, more complicated option?

"when the partition is ending, they agree to go back to the last common state."
Do you mean that all the writes between time t1 (when the partition occurs) and time t2 (when the partition ends) are lost?
That makes your database neither consistent nor durable.
However, you'd be providing availability and partition tolerance.
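To make the lost-writes point concrete, here is a minimal sketch of "go back to the last common state". The commit logs and transaction ids are hypothetical; a real system would compare log sequence numbers or similar rather than Python lists.

```python
def last_common_prefix(log_a, log_b):
    """Return the longest shared prefix of two commit logs."""
    common = []
    for a, b in zip(log_a, log_b):
        if a != b:
            break
        common.append(a)
    return common

# Hypothetical commit logs: each entry is one committed transaction id.
shared = ["t1", "t2"]
side_a = shared + ["a3", "a4"]   # commits acknowledged on one side of the partition
side_b = shared + ["b3"]         # commits acknowledged on the other side

# When the partition heals, both sides fall back to the shared prefix.
healed = last_common_prefix(side_a, side_b)

# Every commit acknowledged during the partition is discarded.
lost = [tx for tx in side_a + side_b if tx not in healed]
```

Both sides accepted writes during the partition (availability), but everything in `lost` was acknowledged to a client and then thrown away, which is why durability is given up.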

Related

Why aren't read replica databases just as slow as the main database? Do they not suffer the same "write burden" as they must be in sync?

My understanding: a read replica database exists to allow read volumes to scale.
So far, so good, lots of copies to read from - ok, that makes sense, share the volume of reads between a bunch of copies.
However, the things I'm reading seem to imply "tada! magic fast copies!". How are the copies faster, as surely they must also be burdened by the same amount of writing as the main db in order that they remain in sync?
"How are the copies faster, as surely they must also be burdened by the same amount of writing as the main db in order that they remain in sync?"
Good question.
First, the writes to the replicas may be more efficient than the writes to the primary if the replicas are maintained by replaying the Write-Ahead Logs into the secondaries (sometimes called a "physical replica"), instead of replaying the queries into the secondaries (sometimes called a "logical replica"). A physical replica doesn't need to do any query processing to stay in sync, and may not need to read the target database blocks/pages into memory in order to apply the changes, leaving more of the memory and CPU free to process read requests.
Even a logical replica might be able to apply changes more cheaply than the primary, as a query on the primary of the form
update t set status = 'a' where status = 'b'
might get replicated as a series of
update t set status = 'a' where id = ?
saving the replica from having to identify which rows to update.
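As a rough illustration of that idea, the sketch below uses two in-memory SQLite databases as stand-ins for a primary and a replica. The `changes` list is a hypothetical change stream; real replication protocols differ in format and detail.

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("create table t (id integer primary key, status text)")
    db.executemany("insert into t (id, status) values (?, ?)",
                   [(1, "b"), (2, "c"), (3, "b")])
    db.commit()
    return db

primary, replica = make_db(), make_db()

# On the primary: the search for matching rows happens once, here.
ids = [row[0] for row in
       primary.execute("select id from t where status = 'b' order by id")]
primary.execute("update t set status = 'a' where status = 'b'")
primary.commit()

# Hypothetical change stream: one keyed update per affected row.
changes = [("a", row_id) for row_id in ids]

# On the replica: each change is a primary-key lookup, with no search by status.
replica.executemany("update t set status = ? where id = ?", changes)
replica.commit()

primary_rows = primary.execute("select id, status from t order by id").fetchall()
replica_rows = replica.execute("select id, status from t order by id").fetchall()
```

The replica ends up with the same rows as the primary without ever evaluating `where status = 'b'`.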
Second, the secondaries allow the read workload to scale across more physical resources. So total read workload is spread across more servers.

Confused about Apache HBase basics?

I'm currently reading through Seven Databases in Seven Weeks and I've come across this statement.
HBase also makes strong consistency guarantees, making it easier to
transition from relational databases for some use cases. Finally,
HBase guarantees atomicity at the row level, which means that you can
have strong consistency guarantees at a crucial level of HBase’s data
model.
I'm having some trouble understanding it.
My shallow understanding is that Apache HBase is a distributed database, so it's like a Master-Slave sort of thing?
So, when you do a write, you first do it on the Master and then the Master copies the writes over to the slaves. The consistency guarantee is that all the slaves have the same values for their records? So, a high consistency guarantee means that they will all have the same values, whereas a low consistency guarantee means that the master may have written changes to some of the slaves, but not all (so if you're reading values from one of the slaves, you might get different results depending on which slave you read from)?
Is this correct so far?
So, with HBase... "guaranteeing atomicity at the row level" means a transaction will only be completed when the master has written to all the slaves? And that provides the high consistency?
Am I headed on the right track? If not, I'd really appreciate some clarification on what that paragraph means.
Thank you very much!
If by 'master' you mean region/shard/partition master, then you are on the right track. Every row key is associated with exactly one Region (HBase terminology for shard), and every region is replicated across multiple servers/disks/racks/whatever. For any given row key, there is only one primary Region server (or 'master') that the client talks to.
"So, with HBase... "guaranteeing atomicity at the row level" means a transaction will only be completed when the master has written to all the slaves? And that provides the high consistency?"
No, consistency and atomicity are two different things.
HBase provides atomicity at the row level, which means that when you write to a row, the write operation either completes fully or changes nothing; there is no in-between (partial update). This does not hold across rows when you write to multiple rows in one command: some rows might change and some might not, but no row will be partially updated.
Consistency (in this context) means that updates must first be acknowledged by the remote replicas before the client gets an OK. This is done primarily via an HDFS-based transaction log file. You may read up on the HBase WAL for more details.
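The difference between per-row atomicity and a non-atomic multi-row write can be sketched with a toy in-memory store. `ToyRowStore` and its methods are hypothetical and are not the HBase client API; the rule that a `None` value is rejected is only there to force a failure partway through a batch.

```python
import threading

class ToyRowStore:
    """Hypothetical in-memory store; not the HBase client API."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, row_key, cells):
        # Validate everything first so a bad cell cannot leave a partial row.
        for column, value in cells.items():
            if value is None:
                raise ValueError(f"null value for {column}")
        with self._lock:
            new_row = dict(self._rows.get(row_key, {}))
            new_row.update(cells)
            self._rows[row_key] = new_row   # swap in the whole row at once

    def put_many(self, puts):
        # Each row is atomic on its own; the batch as a whole is not.
        for row_key, cells in puts:
            self.put(row_key, cells)

    def get(self, row_key):
        with self._lock:
            return dict(self._rows.get(row_key, {}))

store = ToyRowStore()
store.put("r1", {"name": "x", "age": 1})
try:
    store.put_many([
        ("r1", {"age": 2}),                  # applied
        ("r2", {"name": "y", "age": None}),  # rejected as a whole row
        ("r3", {"name": "z"}),               # never reached
    ])
except ValueError:
    pass
```

After the failed batch, `r1` has changed, `r2` has no partial data, and `r3` was never written: some rows changed and some did not, but no row is half-updated.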

Multiple writes in a relational database

I'm pretty sure that with a relational database, it's faster and better to read 50 records at once than to make 50 calls for one record each. Is there a performance benefit from performing multiple writes all at once? If not, why not?
Probably depends on the RDBMS and the storage engine, but at least in MySQL/InnoDB, multiple writes in one transaction (as well as the multi-insert syntax, which, afaik, is a MySQL extension) allow you to avoid updating non-unique indexes before the transaction is committed; the index update then happens all at once with all the new values (since it's a B-tree, this is much faster). It's possible that the RDBMS optimizes other writes as well, to get sequential instead of random writes.
Also, if there is table-level locking (as in MyISAM), locking the table once, writing multiple records and then unlocking removes the overhead of a lock/unlock for every write.
So generally, there is performance gain, but it depends on the database server used.
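For illustration, here is a minimal sketch of the two write patterns, using Python's built-in `sqlite3` as a stand-in engine. The table and data are made up, and no timing is measured, since the size of any gain depends on the database server.

```python
import sqlite3

rows = [(i, f"name-{i}") for i in range(50)]

# isolation_level=None puts the connection in autocommit mode,
# so transactions are controlled explicitly below.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("create table people (id integer primary key, name text)")

# One implicit transaction per write: 25 separate commits.
for row in rows[:25]:
    db.execute("insert into people (id, name) values (?, ?)", row)

# One explicit transaction for the whole batch: a single commit.
db.execute("begin")
db.executemany("insert into people (id, name) values (?, ?)", rows[25:])
db.execute("commit")

count = db.execute("select count(*) from people").fetchone()[0]
```

Both halves insert the same number of rows; the second half pays the per-commit cost once instead of 25 times.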
Doing all your reads at once makes sense, although there are some problems in it which I'll touch on in a minute.
Doing all your writes at once poses a particular problem: the data isn't in the database until you put it there. If you're waiting for some optimization threshold (let's say 50) then transaction 1 is going to have to wait for (unrelated) transactions 2-50 to complete before it goes to the database. This means that in the meantime (which could be several [seconds, minutes, hours]) nobody knows what those records are, or, if they're updates, what the new values are. (Same with reads but the other way around. Your data may be out of date by the time you get to use it.)
Performance-wise, I cannot imagine that grouping writes closer together would not bring some performance benefit. (If that was confusing to read, I meant "You should always get a performance boost by grouping.") If nothing else, you have a better chance to hit memory caches instead of disk caches than if you do them separately. @Darhazer brings up a good point about locking. So strictly from a total-time-spent-writing point of view, it would be better to group them. From an application performance point of view, it's difficult to say without an intricate knowledge of the business requirements of the app.

Sql Server transactions - usage recommendations

I have seen this sentence in more than one place:
"A transaction should be kept as short as possible to avoid concurrency issues and to enable maximum number of positive commits."
What does this really mean?
It puzzles me now because I want to use transactions in my app, which in normal use will insert hundreds of rows from many clients concurrently.
For example, I have a service which exposes a method, AddObjects(List<Objects>), and of course these objects contain other nested objects of different types.
I was thinking to start a transaction for each call from the client performing the appropriate actions (bunch of insert/update/delete for each object with their nested objects). EDIT1: I meant a transaction for entire "AddObjects" call in order to prevent undefined states/behaviour.
Am I going in the wrong direction? If yes, how would you do that and what are your recommendations?
EDIT2: Also, I understood that transactions are fast for bulk operations, but that seems to contradict the quoted sentence. What is the conclusion?
Thanks in advance!
A transaction has to cover a business-specific unit of work. It has nothing to do with generic 'objects'; it must always be expressed in domain-specific terms: 'debit of account X and credit of account Y must be in a transaction', 'subtraction of an inventory item and the sale must be in a transaction', etc. Everything that must either succeed together or fail together must be in a transaction. If you are heading down an abstract path of 'adding objects to a list is a transaction' then yes, you are on the wrong path. The fact that all inserts/updates/deletes triggered by an object save are in a transaction is not the purpose, but a side effect. The correct semantics should be 'update of object X and update of object Y must be in a transaction'. Even the degenerate case of a single 'object' being updated should still be regarded in domain-specific terms.
That recommendation is best understood as "Do not allow user interaction in a transaction." If you need to ask the user something during a transaction, roll back, ask, and run again.
Other than that, do use transactions whenever you need to ensure atomicity.
It is not the transactions' fault that they may cause "concurrency issues"; rather, the database might need some more thought, a better set of indices, or a more standardized data access order.
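The 'debit of account X and credit of account Y' unit of work can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` rather than SQL Server; the `account` table, the `transfer` helper and the CHECK constraint are assumptions made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "create table account (id text primary key, "
    "balance integer check (balance >= 0))"
)
db.executemany("insert into account values (?, ?)", [("X", 100), ("Y", 0)])
db.commit()

def transfer(db, src, dst, amount):
    # "with db" commits if the block succeeds and rolls back if it raises,
    # so the credit and the debit succeed or fail together.
    with db:
        db.execute("update account set balance = balance + ? where id = ?",
                   (amount, dst))
        db.execute("update account set balance = balance - ? where id = ?",
                   (amount, src))

transfer(db, "X", "Y", 30)
try:
    transfer(db, "X", "Y", 500)   # would overdraw X, so the CHECK constraint fails
except sqlite3.IntegrityError:
    pass

balances = dict(db.execute("select id, balance from account"))
```

In the failed transfer the credit to Y had already been applied when the debit from X was rejected; the rollback undoes it, so no money appears from nowhere.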
"A transaction should be kept as short as possible to avoid concurrency issues and to enable maximum number of positive commits."
The longer a transaction is kept open the more likely it will lock resources that are needed by other transactions. This blocking will cause other concurrent transactions to wait for the resources (or fail depending on the design).
SQL Server is usually set up in autocommit mode. This means that every SQL statement is a distinct transaction. Many times you want to use a multi-statement transaction so you can commit or roll back multiple updates together. The longer the updates take, the more likely it is that other transactions will conflict.

Database deadlocks

One of the classical reasons we have a database deadlock is when two transactions are inserting and updating tables in a different order.
For example, transaction A inserts in Table A then Table B.
And transaction B inserts in Table B followed by A.
Such a scenario is always at risk of a database deadlock (assuming you are not using serializable isolation level).
My questions are:
What kind of patterns do you follow in your design to make sure that all transactions insert and update in the same order?
A book I was reading suggested that you can sort the statements by the name of the table. Have you done something like this, or something different, that would enforce that all inserts and updates are in the same order?
What about deleting records? Delete needs to start from child tables and updates and inserts need to start from parent tables. How do you ensure that this would not run into a deadlock?
All transactions insert/update in the same order.
Deletes: identify the records to be deleted outside a transaction, and then attempt the deletion in the smallest possible transaction, e.g. looking up by the primary key or a similar identifier found during the lookup stage.
Small transactions generally.
Indexing and other performance tuning, both to speed up transactions and to promote index lookups over table scans.
Avoid 'hot tables', e.g. one table with incrementing counters for other tables' primary keys. Any other 'switchboard' type configuration is risky.
Especially if not using Oracle, learn the locking behaviour of the target RDBMS in detail (optimistic / pessimistic, isolation levels, etc.).
Ensure you do not allow row locks to escalate to table locks, as some RDBMSes will.
Deadlocks are no biggie. Just be prepared to retry your transactions on failure.
And keep them short. Short transactions consisting of queries that touch very few records (via the magic of indexing) are ideal to minimize deadlocks - fewer rows are locked, and for a shorter period of time.
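A minimal sketch of "be prepared to retry" follows. `DeadlockError` is a hypothetical stand-in for whatever deadlock-victim error your database driver raises, and the attempt count and delays are arbitrary.

```python
import random
import time

class DeadlockError(Exception):
    """Hypothetical stand-in for the driver-specific deadlock-victim error."""

def run_with_retry(transaction, attempts=3, base_delay=0.01):
    """Run a callable that performs one whole transaction, retrying on deadlock."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == attempts:
                raise
            # Back off with jitter so the same transactions do not collide in lockstep.
            time.sleep(base_delay * attempt * random.random())

# Simulated transaction that is picked as the deadlock victim twice, then succeeds.
calls = []

def flaky_transaction():
    calls.append(1)
    if len(calls) < 3:
        raise DeadlockError("chosen as deadlock victim")
    return "committed"

result = run_with_retry(flaky_transaction)
```

The callable must redo the whole transaction from the start on each attempt, because the database has already rolled back the victim's work.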
You need to know that modern database engines don't lock tables; they lock rows; so deadlocks are a bit less likely.
You can also avoid locking by using MVCC and the CONSISTENT READ transaction isolation level: instead of locking, some threads will just see stale data.
Carefully design your database processes to eliminate as much as possible transactions that involve multiple tables. When I've had database design control there has never been a case of deadlock for which I could not design out the condition that caused it. That's not to say they don't exist and perhaps abound in situations outside my limited experience; but I've had no shortage of opportunities to improve designs causing these kinds of problems. One obvious strategy is to start with a chronological write-only table for insertion of new complete atomic transactions with no interdependencies, and apply their effects in an orderly asynchronous process.
Always use the database default isolation levels and locking settings unless you are absolutely sure what risks they incur, and have proven it by testing. Redesign your process if at all possible first. Then, impose the least increase in protection required to eliminate the risk (and test to prove it.) Don't increase restrictiveness "just in case" - this often leads to unintended consequences, sometimes of the type you intended to avoid.
To repeat the point from another direction, most of what you will read on this and other sites advocating the alteration of database settings to deal with transaction risks and locking problems is misleading and/or false, as demonstrated by how they conflict with each other so regularly. Sadly, especially for SQL Server, I have found no source of documentation that isn't hopelessly confusing and inadequate.
I have found that one of the best investments I ever made in avoiding deadlocks was to use an Object Relational Mapper that could order database updates. The exact order is not important, as long as every transaction writes in the same order (and deletes in exactly the reverse order).
The reason that this avoids most deadlocks out of the box is that your operations are always table A first, then table B, then table C (which perhaps depends on table B).
You can achieve a similar result as long as you exercise care in your stored procedures or your data layer's access code. The only problem is that it requires great care to do it by hand, whereas an ORM with a Unit of Work concept can automate most cases.
UPDATE: A delete should run forward to verify that everything is the version you expect (you still need record version numbers or timestamps) and then delete backwards once everything verifies. As this should all happen in one transaction, the possibility of something changing out from under you shouldn't exist. The only reason for the ORM doing it backwards is to obey the key requirements, but if you do your check forward, you will have all the locks you need already in hand.
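The ordering idea can be sketched by hand as below. This is not any particular ORM's algorithm; the table names, the `TABLE_ORDER` list and the `(kind, table, key)` tuples are all hypothetical.

```python
# Hypothetical fixed table order, parents before children.
TABLE_ORDER = ["customer", "orders", "order_line"]
RANK = {name: i for i, name in enumerate(TABLE_ORDER)}

def order_operations(pending):
    """Sort pending (kind, table, key) operations into a deterministic order."""
    writes = [op for op in pending if op[0] in ("insert", "update")]
    deletes = [op for op in pending if op[0] == "delete"]
    # Inserts and updates run parent-first, in the fixed table order.
    writes.sort(key=lambda op: (RANK[op[1]], op[2]))
    # Deletes run in exactly the reverse table order, child-first.
    deletes.sort(key=lambda op: (-RANK[op[1]], op[2]))
    return writes + deletes

pending = [
    ("insert", "order_line", 7),
    ("delete", "customer", 2),
    ("insert", "customer", 1),
    ("delete", "order_line", 9),
    ("insert", "orders", 5),
]
plan = order_operations(pending)
```

Because every transaction flushes its pending work through the same function, two transactions touching the same tables request their locks in the same sequence.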
I analyze all database actions to determine, for each one, whether it needs to be in a multiple-statement transaction, and then, for each such case, what minimum isolation level is required to prevent deadlocks... As you said, serializable will certainly do so...
Generally, only a very few database actions require a multiple statement transaction in the first place, and of those, only a few require serializable isolation to eliminate deadlocks.
For those that do, set the isolation level for that transaction before you begin, and reset it to whatever your default is after it commits.
Your example would only be a problem if the database locked the ENTIRE table. If your database is doing that...run :)

Resources