I'm studying database locking mechanisms and see that there are two: table-level locking and row-level locking. I don't see column-level locking, and when I google I find no document about it except this link: database locking. In this link:
A column level lock just means that some columns within a given row in a given table are locked. This form of locking is not commonly used because it requires a lot of resources to enable and release locks at this level. Also, there is very little support for column level locking in most database vendors.
So, which vendors support column-level locking? And can you explain in more detail why column-level locking requires more resources than row-level locking?
Thanks :)
A lock cannot, in and of itself, require anything. It's an abstract verb acting on an abstract noun. Why should locking a column cost any more than locking a byte, or a file, or a door? So I wouldn't put a lot of stock in your link.
The answer to your question lies in why locks exist -- what they protect -- and how DBMSs are engineered.
One of the primary jobs of a DBMS is to manage concurrency: to give each user, insofar as possible, the illusion that all the data belong to each one, all the time. Different parties are changing the database, and the DBMS ensures that those changes appear to all users as a transaction, meaning no one sees "partial changes", and no one's changes ever "step on" another's. You and I can both change the same thing, but not at the same time: the DBMS makes sure one of us goes first, and can later show who that was. The DBMS uses locks to protect the data while they're being changed, or to prevent them from being changed while they're being viewed.
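To make "no one sees partial changes" concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the amounts are invented for illustration. Either both updates become visible together, or neither does.

```python
# Minimal sketch of transactional atomicity (invented table/values, sqlite3 only for illustration).
import sqlite3

conn = sqlite3.connect("demo.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES (1, 100), (2, 100)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
        # If anything above raised, neither UPDATE would ever be visible to other users.
except sqlite3.Error as e:
    print("transaction rolled back:", e)
```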
Note that when we "want to change the same thing", the thing is a row (or rows). Rows represent the things out there in the real world, the things we're counting and tracking. Columns are attributes of those things.
Most DBMSs are organized internally around rows of data. The data are in memory pages and disk blocks, row-by-row. Locks in these systems protect row-oriented data structures in memory. To lock individual rows is expensive; there are a lot of rows. As an expedient, many systems lock sets of rows (pages) or whole tables. Fancier ones have elaborate "lock escalation" to keep the lock population under control.
There are some DBMSs organized around columns. That's a design choice; it makes inserts more expensive, because a row appears in several physical places (one per column), not neatly nestled between two other rows. But the tradeoff is that summarization of individual columns is cheaper in terms of I/O. It could be, in such systems, that there are "column locks", and there's no reason to think they'd be particularly expensive. Observe, however, that for insertion they'd affect concurrency in exactly the same way as a table lock: you can't insert a row into a table whose column is locked. (There are ways to deal with that too. DBMSs are complex, with reason.)
So the answer to your question is that most DBMSs don't have "columns" as internal structures that a lock could protect. Of those that do, a column lock would be a specialty item, something that permits a certain degree of column-wise concurrency, at the expense of otherwise being basically a table lock.
Related
I read a topic on Cassandra DB. It said that Cassandra is good for applications that do not require the ACID properties.
Is there any application or situation where ACID is not important?
There are many (most?) scenarios where ACID isn't needed. For example, if you have a database of products where one table is id -> description and another is id -> todays_price, and yet another is id -> sales_this_week, then there's no need to have all tables locked when updating something. Even if (for some reason) there are some common bits of data across the three tables, depending on usage, having them out of sync for a few seconds may not be a problem. Perhaps the monthly sales are only needed at the end of the month when aggregating a report. Not being ACID compliant means not necessarily satisfying all four properties of ACID... in most business cases, as long as things are eventually consistent with each other, it may be good enough.
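As a rough sketch of that three-table example (using sqlite3 purely for illustration; the table names follow the ones above, the code is not from any real system): each table is updated in its own small transaction, so a reader may briefly see a new price next to last week's sales figure, and for this business that's fine.

```python
# Sketch: loosely related tables updated in independent transactions (invented schema).
import sqlite3

conn = sqlite3.connect("shop.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS description     (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE IF NOT EXISTS todays_price    (id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE IF NOT EXISTS sales_this_week (id INTEGER PRIMARY KEY, units INTEGER);
""")

def set_price(product_id, price):
    with conn:  # its own tiny transaction
        conn.execute("INSERT OR REPLACE INTO todays_price VALUES (?, ?)", (product_id, price))

def record_sale(product_id):
    with conn:  # committed independently of the price change; brief skew is acceptable here
        conn.execute("INSERT OR IGNORE INTO sales_this_week VALUES (?, 0)", (product_id,))
        conn.execute("UPDATE sales_this_week SET units = units + 1 WHERE id = ?", (product_id,))
```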
It's worth mentioning that commits affecting the same cassandra partition ARE atomic. If something does need to be atomically consistent, then your data model should strive to put that bit of information in the same partition (and as such, in the same table). When we talk about eventual consistency in the context of cassandra, we mean things affecting different partitions (which could be different rows in the same table), not the same partition.
The canonical example of a "transaction" is a debit from one account and a credit to another... this is "important" because "banks". In reality, this is not how banks operate. What such a system does need is a definitive list of transactions (read: transfers). If you were to model this for Cassandra, you could have a table of transfers consisting of (from_account, to_account, amount, time, etc.). These records would be atomically consistent. Your "accounts" tables would be updated from this list of transfers. How soon this gets reflected depends on the business. For example, in the UK, transfers from Lloyds bank to Lloyds bank are almost instant, whereas some inter-bank transfers can take a couple of days. In the latter case, your account's balance usually shows the un-deducted amount of pending transfers, while a separate "available balance" takes the pending transfers into account.
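A rough sketch of that transfers table using the DataStax Python driver; the keyspace name, the exact column names, and the choice of from_account as the partition key are assumptions made for illustration (a real design would probably also bucket by time).

```python
# Sketch of a transfers-as-source-of-truth table in Cassandra (invented keyspace/schema).
from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("bank")  # assumes a local node and an existing 'bank' keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS transfers (
        from_account   text,
        transfer_time  timeuuid,
        to_account     text,
        amount_pence   bigint,
        PRIMARY KEY ((from_account), transfer_time)
    ) WITH CLUSTERING ORDER BY (transfer_time DESC)
""")

# One transfer = one row in one partition, so the record is written atomically:
# it either exists in full or not at all.
session.execute(
    "INSERT INTO transfers (from_account, transfer_time, to_account, amount_pence) "
    "VALUES (%s, now(), %s, %s)",
    ("acct-123", "acct-456", 25000),
)

# Per-account balance tables would be refreshed from this list asynchronously,
# at whatever latency the business accepts.
```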
Different things operate at different latencies, and in some cases ACID, and the resulting immediate consistency across all updated records, may be important. For a lot of others though, especially when dealing with distributed systems with lots of data, ACID at the database level may not be required.
Even where "visible consistency" is required, it can often be handled with coordination mechanisms at the application level, CRDTs, etc. To the end user, the system is atomic: either something succeeds or it doesn't, and the user gets a confirmation. Internally, the system may be updating multiple databases, dealing with external services, etc., and only confirming when everything's peachy. So ACID for different rows in a table, or across tables in a single database, or even across multiple databases, may not be sufficient for externally visible consistency. Cassandra has tunable consistency, whereby you can use data modelling and deal with the tradeoffs to make a "good enough" system that meets business requirements. If you need ACID transactionality across tables, though, Cassandra wouldn't be fit for that use case. However, you may be able to model your business requirements within Cassandra's constraints and use it to get the other scalability benefits it provides.
We currently have an OLTP application storing information in a DB2 database.
The information regarding the declarations is stored in multiple DB2 tables and some of the tables are really huge (260 million records).
If we want to improve performance, would it make sense to split each of the tables used to store the declaration info? So instead of having a single table DECLARATION, we would have two tables, DECLARATION_A and DECLARATION_B. The idea is to store the information related to declarations with type A in the DECLARATION_A table and declarations with type B in the DECLARATION_B table.
So will this kind of split improve the performance of the business processes that are writing to and reading from those declaration tables?
Or even more specifically: will this kind of split also ensure that business process A, which is reading/updating/creating records for declarations with type A, is less impacted by a concurrent business process B that is reading/updating/creating records for declarations with type B?
In general, I don't believe that splitting the table would improve your performance, and I will try, at a high level, to explain why. If you split the table into A and B, then you have to route updates to one and reads to the other. Even if reads are accelerated by avoiding concurrent reads (let's assume that you currently use a pessimistic locking protocol), you periodically need to make sure that updates are propagated from the update table (e.g., A) to the read table (e.g., B). That propagation incurs additional overhead, so in the end I don't expect your performance to improve.
In addition, DBMSs are designed to avoid conflicts among different memory pages as much as possible. In other words, if your latest transaction is updating page X, then a read transaction will be affected until the lock on page X is released. Depending on your workload, page X will be the last page (in an append-only workload), or the contention will be spread across multiple pages in memory (in a random-update pattern).
In conclusion, I think you need to tell us which locking mechanism your database is using, and the operation profile (i.e., the percentage of SELECT/INSERT/DELETE/UPDATE operations) of your application.
It depends.
If it is an OLTP application, it means short transactions with record-level, index-based access. Is using the same hardware with two databases fighting for the same resources a good idea?
If it is a messy OLTP-like system with not-sure-what-we-are-doing implementation, revisiting the architecture and programs and eventually splitting the data by purpose/business app might be good.
In our DB we have a single centric table with millions of rows that is constantly being inserted and updated.
This table has a single column acting as the unique identifier, and it is used to link the content of this table with multiple tables in a one-to-many relation.
This means that when inserting an entry into, say, the USERS table, in the same transaction USERS_PETS and USERS_PARENTS (and 10 more) will also be populated, with multiple rows, based on the same unique identifier from the main table.
Since the application using this DB is constantly inserting new entries and updating existing ones, the relation between these tables is kept only at the application level (i.e., a logical ERD instead of handling this via FK/PK declarations).
Questions:
Is it correct to assume that, from a pure performance point of view, this is the best approach?
Is there a way to set these keys (so that the DB will be more self-descriptive) without impacting performance?
This is the worst possible approach and I guarantee you will have data integrity issues eventually. Data integrity is far more critical than performance. This is stupid and short-sighted.
No, for the same reason we use seatbelts in cars even when we are in a hurry. The difference is negligible and totally not worth it.
Some specific dbms vendors may offer a way of declaring constraints while not enforcing them. In Oracle for example, you can specify the Integrity Constraint State as DISABLE NOVALIDATE.
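For illustration, here is a sketch of what that might look like, borrowing hypothetical table and column names from the question (USERS, USERS_PETS); the constraint name is made up. DISABLE NOVALIDATE documents the relationship without enforcing it for existing or new rows.

```python
# Sketch of an Oracle constraint declared but not enforced (hypothetical names).
ddl = """
ALTER TABLE users_pets
  ADD CONSTRAINT fk_users_pets_user
      FOREIGN KEY (user_id) REFERENCES users (id)
      DISABLE NOVALIDATE
"""

# Executed via any Python DB-API driver for Oracle, e.g. python-oracledb:
# with oracledb.connect(...) as conn:
#     conn.cursor().execute(ddl)
```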
You base data integrity on hope. Hope doesn't scale well.
And there's no such thing as "pure performance point of view". Unless, that is, you never read from the database. If you only insert, never update, never delete, and never read, you can make a case that there exists a "pure performance point of view". But if you ever update, delete, or read, then performance isn't a point--it's more like a surface or a solid, and all you can do is move the balancing point around among inserts, updates, deletes, and reads.
And, because somebody reading this still won't get it, the most critical part of read performance is getting back the right answer. If you can't guarantee the right answer, sensible people won't care how marginally faster your inserts are.
I'm no DBA, but I respect database theory. Isn't adding columns like isDeleted and sequenceOrder bad database practice?
That depends. Being able to soft-delete a tuple (i.e., mark it as deleted rather than actually deleting it) is essential if there's any need to later access that tuple (e.g., to count deleted things, or do some type of historical analysis). It also has the possible benefit, depending on how indexes are structured, of causing less of a disk-traffic hit when soft-deleting a row (by having to touch fewer indexes). The downside is that the application takes on responsibility for managing foreign keys to soft-deleted things.
If soft deleting is done for performance, a periodic (e.g., nightly or weekly) task can clean soft-deleted tuples out during a low-traffic period.
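A minimal sketch of that pattern with sqlite3; the things table, the column names, and the 30-day retention window are all invented for illustration.

```python
# Soft delete plus periodic purge (invented schema, sqlite3 only for illustration).
import sqlite3, time

conn = sqlite3.connect("app.db")
conn.execute("""CREATE TABLE IF NOT EXISTS things (
                  id INTEGER PRIMARY KEY,
                  payload TEXT,
                  is_deleted INTEGER NOT NULL DEFAULT 0,
                  deleted_at INTEGER)""")

def soft_delete(thing_id):
    with conn:
        conn.execute("UPDATE things SET is_deleted = 1, deleted_at = ? WHERE id = ?",
                     (int(time.time()), thing_id))

def nightly_purge(older_than_days=30):
    cutoff = int(time.time()) - older_than_days * 86400
    with conn:  # run during a low-traffic window
        conn.execute("DELETE FROM things WHERE is_deleted = 1 AND deleted_at < ?", (cutoff,))

# Everyday queries simply filter on the flag: ... WHERE is_deleted = 0
```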
Using an explicit 'sequence order' for some tuples is useful in several cases, esp. when it's not possible or wise to depend on some other field (e.g., ids, which app developers are trained not to trust) to order things that need to be ordered in some specific way for business reasons.
IsDeleted columns have two purposes:
1. To hide a record from users instead of deleting it, thus retaining the record in the database for later use.
2. To provide a two-stage delete process, where one user marks a record for deletion and another user confirms.
Not sure what SequenceOrder is about. Do you have a specific application in mind?
Absolutely not. Each database has different requirements, and based on those requirements, you may need columns such as those.
An example for isDeleted could be if you want to allow the user interface to delete unneeded things, but retain them in the database for auditing or reporting purposes. Or if you have incredibly large datasets, deleting is a very slow operation and may not be possible to perform in real-time. In this case, you can mark it deleted, and run a batch clean-up periodically.
An example for sequenceOrder is to enable arbitrary sorting of database rows in the UI, without relying on intrinsic database order or sequential insertion. If you insert rows in order, you can usually get them back out in that order... until people start deleting and inserting new rows.
SequenceOrder doesn't sound great (although you've given no background at all), but I've used columns like IsDeleted for soft deletions all my career.
Since you explicitly state that you're interested in the theoretical perspective, here goes:
At the level of the LOGICAL design, it is almost by necessity a bad idea to have a boolean attribute in a table (btw the theory's correct term for this is "relvar", not "table"). The reason being that having a boolean attribute makes it very awkward to define/document the meaning (relational theory names this the "Predicate") that the relvar has in your system. If you include the boolean attribute, then the predicate defining such a relvar's meaning would have to include some construct like "... and it is -BOOLEANATTRIBUTENAME here- that this tuple has been deleted.". That is awkward circumlocution.
At the logical design level, you should have two distinct tables, one for the non-deleted rows, and one for the deleted-rows-that-someone-might-still-be-interested-in.
At the PHYSICAL design level, things may be different. If you have a lot of delete-and-undelete, or even just a lot of delete activity, then physically having two distinct tables is likely to be a bad idea. One table with a boolean attribute that acts as a "distinguishing key" between the two logical tables might indeed be better. If, on the other hand, you have a lot of query activity that only needs the non-deleted ones, and the volume of deleted ones is usually large in comparison to the non-deleted ones, it might be better to keep them apart physically too (and bite the bullet on the probably worse update performance you'd get, if that were even noticeable).
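One way to picture "two logical tables, one physical table" is a table carrying the boolean distinguishing key plus a view per logical relvar; this is just a sqlite3 sketch with invented names, not a prescription.

```python
# One physical table, two logical "tables" exposed as views (invented schema).
import sqlite3

conn = sqlite3.connect("crm.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        is_deleted INTEGER NOT NULL DEFAULT 0
    );
    -- Predicate: "Customer <name> is currently being serviced."
    CREATE VIEW IF NOT EXISTS active_customers AS
        SELECT id, name FROM customers WHERE is_deleted = 0;
    -- Predicate: "Customer <name> has been deleted but is kept for later interest."
    CREATE VIEW IF NOT EXISTS deleted_customers AS
        SELECT id, name FROM customers WHERE is_deleted = 1;
""")
```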
But you said you were interested in the theoretical perspective, and theory (well, as far as I know it) has actually very little to say about matters of physical design.
As for the sequenceOrder column, that really depends on the particular situation. I guess that most of the time you wouldn't need it, because ordering of items as required by the business is most likely to be on "meaningful" data. But I could imagine sequenceOrder columns being used to mimic insertion timestamps and the like.
Backing up what others have said, both can have their place.
In our CRM system I have an isDeleted-like field in our customer table so that we can hide customers we are no longer servicing while leaving all the information about them in the database. We can easily restore deleted customers and we can strictly enforce referential integrity. Otherwise, what happens when you delete a customer but do not want to delete all records of the work you have done for them? Do you leave references to the customer dangling?
SequenceOrder, again, is useful to allow user-defined ordering. I don't think I use it anywhere, but suppose you had to list say your five favorite foods in order. Five tasks to complete in the order they need to be completed. Etc.
Others have adequately tackled isDeleted.
Regarding sequenceOrder, business rules frequently require lists to be in an order that may not be determined by the actual data.
Consider a table of Priority statuses. You might have rows for High, Low, and Medium. Ordering on the description will give you either High, Low, Medium or Medium, Low, High.
Obviously that order does not give information about the relationship that exists between the three records. Instead you would need a sequenceOrder field so that it makes sense. So that you end up with [1] High, [2] Medium, [3] Low; or the reverse.
Not only does this help with human readability, but system processes can now give appropriate weight to each one.
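A tiny worked version of that Priority table (sqlite3, in-memory), showing why ordering on the description column loses the business meaning while sequence_order keeps it:

```python
# Worked example: ordering by description vs. ordering by sequence_order.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE priority (id INTEGER PRIMARY KEY, description TEXT, sequence_order INTEGER);
    INSERT INTO priority VALUES (1, 'High', 1), (2, 'Medium', 2), (3, 'Low', 3);
""")

print([r[0] for r in conn.execute("SELECT description FROM priority ORDER BY description")])
# -> ['High', 'Low', 'Medium']   (alphabetical, not meaningful)

print([r[0] for r in conn.execute("SELECT description FROM priority ORDER BY sequence_order")])
# -> ['High', 'Medium', 'Low']   (the order the business actually wants)
```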
One of the classical reasons we have a database deadlock is when two transactions are inserting and updating tables in a different order.
For example, transaction A inserts in Table A then Table B.
And transaction B inserts in Table B followed by A.
Such a scenario is always at risk of a database deadlock (assuming you are not using serializable isolation level).
My questions are:
What kind of patterns do you follow in your design to make sure that all transactions insert and update in the same order?
A book I was reading suggested that you can sort the statements by the name of the table. Have you done something like this, or something different, that would enforce that all inserts and updates are in the same order?
What about deleting records? Delete needs to start from child tables and updates and inserts need to start from parent tables. How do you ensure that this would not run into a deadlock?
- All transactions insert/update in the same order (see the sketch after this list).
- Deletes: identify the records to be deleted outside a transaction, and then attempt the deletion in the smallest possible transaction, e.g. looking up by the primary key or similar identified during the lookup stage.
- Small transactions generally.
- Indexing and other performance tuning, both to speed up transactions and to promote index lookups over table scans.
- Avoid 'hot tables', e.g. one table with incrementing counters for other tables' primary keys. Any other 'switchboard' type configuration is risky.
- Especially if not using Oracle, learn the locking behaviour of the target RDBMS in detail (optimistic/pessimistic, isolation levels, etc.).
- Ensure you do not allow row locks to escalate to table locks, as some RDBMSes will.
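As a sketch of the first point, here is one way to enforce a single write order by sorting queued writes by table name before a transaction flushes them; the (table, sql, params) queue is an invented structure, not any particular library's API.

```python
# Sketch: flush pending writes in a canonical (alphabetical) table order (DB-API style).
def flush_in_canonical_order(conn, pending_writes):
    """pending_writes: list of (table_name, sql, params) tuples collected by the app."""
    cur = conn.cursor()
    try:
        for table, sql, params in sorted(pending_writes, key=lambda w: w[0]):
            cur.execute(sql, params)   # every transaction touches tables in the same order
        conn.commit()
    except Exception:
        conn.rollback()
        raise

# Both transaction A and transaction B end up writing TableA before TableB,
# no matter which order the application code queued the statements in.
```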
Deadlocks are no biggie. Just be prepared to retry your transactions on failure.
And keep them short. Short transactions consisting of queries that touch very few records (via the magic of indexing) are ideal to minimize deadlocks - fewer rows are locked, and for a shorter period of time.
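A minimal retry loop for that advice; the is_deadlock() predicate is a placeholder, since every driver signals deadlocks with its own exception class or error code.

```python
# Sketch: retry a short transaction when the database reports a deadlock.
import random, time

def run_with_retry(do_transaction, is_deadlock, attempts=5):
    for attempt in range(attempts):
        try:
            return do_transaction()          # should BEGIN/COMMIT internally
        except Exception as e:
            if not is_deadlock(e) or attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, 0.1 * (2 ** attempt)))  # back off before retrying
```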
You need to know that modern database engines typically lock rows rather than whole tables, so deadlocks are a bit less likely.
You can also avoid locking by using MVCC and consistent (snapshot) reads: instead of locking, some transactions will simply see slightly stale data.
Carefully design your database processes to eliminate as much as possible transactions that involve multiple tables. When I've had database design control there has never been a case of deadlock for which I could not design out the condition that caused it. That's not to say they don't exist and perhaps abound in situations outside my limited experience; but I've had no shortage of opportunities to improve designs causing these kinds of problems. One obvious strategy is to start with a chronological write-only table for insertion of new complete atomic transactions with no interdependencies, and apply their effects in an orderly asynchronous process.
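A rough sketch of that write-only-table idea with sqlite3; the incoming_events table, the applied flag, and the apply_one callback are invented for illustration.

```python
# Sketch: append-only table of incoming business transactions, applied asynchronously.
import sqlite3

conn = sqlite3.connect("ledger.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS incoming_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        applied INTEGER NOT NULL DEFAULT 0
    );
""")

def record_event(payload):           # hot path: append only, no cross-table locks
    with conn:
        conn.execute("INSERT INTO incoming_events (payload) VALUES (?)", (payload,))

def apply_pending(apply_one):        # orderly asynchronous process, single writer
    rows = conn.execute(
        "SELECT id, payload FROM incoming_events WHERE applied = 0 ORDER BY id").fetchall()
    for event_id, payload in rows:
        with conn:
            apply_one(payload)       # update the derived tables here
            conn.execute("UPDATE incoming_events SET applied = 1 WHERE id = ?", (event_id,))
```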
Always use the database default isolation levels and locking settings unless you are absolutely sure what risks they incur, and have proven it by testing. Redesign your process if at all possible first. Then, impose the least increase in protection required to eliminate the risk (and test to prove it.) Don't increase restrictiveness "just in case" - this often leads to unintended consequences, sometimes of the type you intended to avoid.
To repeat the point from another direction, most of what you will read on this and other sites advocating the alteration of database settings to deal with transaction risks and locking problems is misleading and/or false, as demonstrated by how they conflict with each other so regularly. Sadly, especially for SQL Server, I have found no source of documentation that isn't hopelessly confusing and inadequate.
I have found that one of the best investments I ever made in avoiding deadlocks was to use an Object-Relational Mapper that could order database updates. The exact order is not important, as long as every transaction writes in the same order (and deletes in exactly the reverse order).
The reason that this avoids most deadlocks out of the box is that your operations are always table A first, then table B, then table C (which perhaps depends on table B).
You can achieve a similar result as long as you exercise care in your stored procedures or data layer's access code. The only problem is that it requires great care to do it by hand, whereas an ORM with a Unit of Work concept can automate most cases.
UPDATE: A delete should run forward to verify that everything is the version you expect (you still need record version numbers or timestamps) and then delete backwards once everything verifies. As this should all happen in one transaction, the possibility of something changing out from under you shouldn't exist. The only reason for the ORM doing it backwards is to obey the key requirements, but if you do your check forward, you will have all the locks you need already in hand.
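To illustrate the Unit of Work idea by hand, here is a toy version that flushes inserts/updates in one canonical table order and deletes in the reverse order; the table names, the canonical order, and the class itself are invented, not any real ORM's API.

```python
# Toy Unit of Work: writes in canonical table order, deletes in reverse order.
CANONICAL_ORDER = ["table_a", "table_b", "table_c"]   # parents before children (assumed)

class UnitOfWork:
    def __init__(self, conn):
        self.conn = conn
        self.upserts = []   # (table, sql, params)
        self.deletes = []   # (table, sql, params)

    def register_upsert(self, table, sql, params):
        self.upserts.append((table, sql, params))

    def register_delete(self, table, sql, params):
        self.deletes.append((table, sql, params))

    def commit(self):
        rank = {t: i for i, t in enumerate(CANONICAL_ORDER)}
        cur = self.conn.cursor()
        try:
            for _, sql, params in sorted(self.upserts, key=lambda w: rank[w[0]]):
                cur.execute(sql, params)            # always table_a, then table_b, then table_c
            for _, sql, params in sorted(self.deletes, key=lambda w: rank[w[0]], reverse=True):
                cur.execute(sql, params)            # children removed before parents
            self.conn.commit()
        except Exception:
            self.conn.rollback()
            raise
```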
I analyze all database actions to determine, for each one, whether it needs to be in a multiple-statement transaction, and then, for each such case, what minimum isolation level is required to prevent deadlocks... As you said, serializable will certainly do so...
Generally, only a very few database actions require a multiple statement transaction in the first place, and of those, only a few require serializable isolation to eliminate deadlocks.
For those that do, set the isolation level for that transaction before you begin, and reset it to whatever your default is after it commits.
Your example would only be a problem if the database locked the ENTIRE table. If your database is doing that...run :)