After posting the question here, I got to know that NoSQL are better at scaling out because they make a trade off between support for transaction and scalability.
So I wonder in what circumstances transactions are not that important so that scalability is more preferable to support of transaction?
Well, I would say first that NoSQL is better at scaling is some situations, but not all.
Full ACID transactions are Atomic, Consistent, Isolated and Durable. If you lose transactions, you will loose some or all of ACID within the datastore.
There are many ways to restore these functions with other asynchronous systems like message queues that themselves are durable. You can shove data onto a durable message queue, pop the data and deal with it in your NoSQL, then, when you can confirm it's stored to your required minimum, you can flag the message as consumed. It's the D in ACID, but distributed and asynchronous. There are ways to ensure the others, but they are often sacrificed to some extent, or moved into another place in the system. With some NoSQL solutions, you just have to move consistency into the application so it doesn't try to store invalid data.
When you start moving away from database driven transactions, you must increase your application testing dramatically to ensure your system doesn't fail (for some values of fail).
There are essentially no situations where transactions and constraints are not important in a system that has both read and write requirements. If they weren't you wouldn't care about your data at all (and some people don't, but regret it later). There are however levels of "caring". It's just a matter of how you end up at ACID or some pseudo-ACID that's "good enough". RDMBS makes caring about your data cheap. NoSQL makes caring about your data expensive, but, it makes scaling cheap(er) (in some cases). There are many companies with multi-terabyte database in RDBMSes, so to say unilaterally that "they don't scale" is simply inaccurate. Multi-terabyte SQL databases however, can cost lots of money, depending on the use case (you can after all just slap a RAID 10 array with a few 3TB drives onto a computer and throw a database engine on it. Might take several minutes to a few hours to do any kind of table scan on a big table, or even indexed look-up though, but if you don't care, it's cheap and multi-terabyte).
The biggest category is read-only type queries, where an aborted or botched transaction can simply be repeated. Anything where you are changing an underlying state, or want to guarantee once and only once activity, should have proper transactional semantics.
That is, "I want to order one widget, charge my credit card" should be a proper transaction: I don't want my card charged unless the widget is ordered, and the vendor doesn't want the widget sent unless the card is charged. "Report the shipment status of order xyz" doesn't need to be transactional -- if I don't get an answer, I can hit reload.
Much of it is just a bit of lateral thinking.
Thw whole point of transcation is you wrap up several operations, and should any fail all that have succeeded get rolled back, and while the transaction is in progress, records are locked and unless you have read uncommitted going, you don't see any on the individual changes of state until the transaction is committed.
Doing all that with distributed systems is expensive, because you need one 'central' and difficult to scale point that needs to 'know' all about the others.
So instead or Order this, charge my card, and show me my current balance.
You do Try to order this, if it's instock charge my card, and if my card gets charged the current known balance will be this.
There's a risk, that the order will be placed, put payment fail, so you need to deal with that. There's a risk that the proposed balance of the card my not be entirely accurate, hence add weasel words and show the potential effect of payment as opposed to the result.
It's not so much are transactions important, it's seeing as they aren't as well supported in NoSQL systems, where/how can I get away with not using them.
Related
I read a topic on Cassandra DB. It wrote that it's good for app that do not require ACID property.
Is there any application or situation that ACID is not important?
There are many (most?) scenarios where ACID isn't needed. For example, if you have a database of products where one table is id -> description and another is id -> todays_price, and yet another is id -> sales_this_week, then there's no need to have all tables locked when updating something. Even if (for some reason) there are some common bits of data across the three tables, depending on usage, having them out of sync for a few seconds may not be a problem. Perhaps the monthly sales are only needed at the end of the month when aggregating a report. Not being ACID compliant means not necessarily satisfying all four properties of ACID... in most business cases, as long as things are eventually consistent with each other, it may be good enough.
It's worth mentioning that commits affecting the same cassandra partition ARE atomic. If something does need to be atomically consistent, then your data model should strive to put that bit of information in the same partition (and as such, in the same table). When we talk about eventual consistency in the context of cassandra, we mean things affecting different partitions (which could be different rows in the same table), not the same partition.
The canonical example of "transaction" is a debit from one account, and a credit from another.... this is "important" because "banks". In reality, this is not how banks operate. What such a system does need is a definitive list of transactions (read transfers). If you were to model this for cassandra, you could have a table of transfers consisting of (from_account, to_account, amount, time, etc.). These records would be atomically consistent. Your "accounts" tables would be updated from this list of transfers. How soon this gets reflected depends on the business. For example, in the UK, transfers from Lloyds bank to Lloyds bank are almost instant. Whereas some inter-bank transfers can take a couple of days. In the case of the latter, your account's balance usually shows the un-deducted amount of pending transfers, while a separate "available balance" considers the pending transfers.
Different things operate at different latencies, and in some cases ACID, and the resultant immediate consistency across all updated records may be important. For a lot of others though, specially when dealing with distributed systems with lots of data, ACID at the database level may not be required.
Even where "visible consistency" is required, it can often be handled with coordination mechanisms at the application level, CRDTs, etc. To the end user, the system is atomic - either something succeeds, or it doesn't, and the user gets a confirmation. Internally, the system may be updating multiple database, dealing with external services, etc., and only confirming when everything's peachy. So, ACID for different rows in a table, or across tables in a single database, or even across multiple databases may not be sufficient for externally visible consistency. Cassandra has tunable consistency where by you can use data modelling, and deal with the tradeoffs to make a "good enough" system that meets business requirements. If you need ACID transactionality across tables, though - Cassandra wouldn't be fit for that use case. However, you may be able to model your business requirements within cassandra's constraints, and use it to get the other scalability benefits it provides.
I'm pretty sure that with a relational database, it's faster and better to read 50 records at once than to make 50 calls for one record each. Is there a performance benefit from performing multiple writes all at once? If not, why not?
Probably depends on the RDBMS and the storage engine, but at least in MySQL/InnoDB, multiple writes in one transaction (as well the multi-insert syntax, which, afaik, is MySQL extension) allows you not to update non-unique indexes before transaction is commited, and the update of the index happens at once with all new values (since it's a b-tree, in this way its much faster). It's possible that RDBMS optimizes other writes as well, to have sequential instead of random writes.
Also, if there is a table-level locking (as in MyISAM), locking the table once, writting multiple records and then unlocking removes the overhead of lock/unlock for every write.
So generally, there is performance gain, but it depends on the database server used.
Doing all your reads at once makes sense, although there are some problems in it which I'll touch on in a minute.
Doing all your writes at once poses a particular problem: the data is in the database until you put it there. If you're waiting for some optimization threshold (let's say 50) then transaction 1 is going to have to wait for (unrelated) transactions 2-50 to complete before it goes to the database. This means that in the mean time (which could be several [seconds, minutes, hours]) nobody knows what those records are, or if they're updated what the new values are. (Same with reads but the other way around. Your data may be out of date by the time you get to use it.)
Performance wise, I cannot imagine that combining writes closer together would not have some performance. (IF that was confusing to read, I meant "You should always get a performance boost by grouping.") If nothing else, you have a better chance to hit memory caches instead of disk caches than if you do them separately. #Darhazer brings up a good point about locking. So strictly from a total-time-spent-writing point of view, it would be better to group them. From an application performance point of view, it's difficult to say without an intricate knowledge of the business requirements of the app.
Which one is best, regarding the implementation of a database for a web application: a lean and very small database with only the bare information, sided with a application that "recalculates" all the secondary information, on demand, based on those basic ones, OR, a database filled with all those secondary information already previously calculated, but possibly outdated?
Obviously, there is a trade-of there and I think that anyone would say that the best answer to this question is: "depends" or "is a mix between the two". But I'm really not to comfortable or experienced enough to reason alone about this subject. Could someone share some thoughts?
Also, another different question:
Should a database be the "snapshot" of a particular moment in time or should a database accumulate all the information from previous time, allowing the retrace of what happened? For instance, let's say that I'm modeling a Bank Account. Should I only keep the one's balance on that day, or should I keep all the one's transactions, and from those transactions infer the balance?
Any pointer on this kind of stuff that is, somehow, more deep in database design?
Thanks
My quick answer would be to store everything in the database. The cost of storage is far lower than the cost of processing when talking about very large scale applications. On small scale applications, the data would be far less, so storage would still be an appropriate solution.
Most RDMSes are extremely good at handling vast amounts of data, so when there are millions/trillions of records, the data can still be extracted relatively quickly, which can't be said about processing the data manually each time.
If you choose to calculate data rather than store it, the processing time doesn't increase at the same rate as the size of data does - the more data ~ the more users. This would generally mean that processing times would multiply by the data's size and the number of users.
processing_time = data_size * num_users
To answer your other question, I think it would be best practice to introduce a "snapshot" of a particular moment only when data amounts to such a high value that processing time will be significant.
When calculating large sums, such as bank balances, it would be good practice to store the result of any heavy calculations, along with their date stamp, to the database. This would simply mean that they will not need calculating again until it becomes out of date.
There is no reason to ever have out of date pre-calulated values. That's what trigger are for (among other things). However for most applications, I would not start precalculating until you need to. It may be that the calculation speed is always there. Now in a banking application, where you need to pre-calculate from thousands or even millions of records almost immediately, yes, design a precalulation process bases on triggers that adjust the values every time they are changed.
As to whether to store just a picture in time or historical values, that depends largely on what you are storing. If it has anything to do with financial data, store the history. You will need it when you are audited. Incidentally, design to store some data as of the date of the action (this is not denormalization). For instance, you have an order, do not rely onthe customer address table or the product table to get data about where the prodcts were shipped to or what they cost at the time of the order. This data changes over time and then you orders are no longer accurate. You don't want your financial reports to change the dollar amount sold because the price changed 6 months later.
There are other things that may not need to be stored historically. In most applications we don't need to know that you were Judy Jones 2 years ago and are Judy Smith now (HR application are usually an exception).
I'd say start off just tracking the data you need and perform the calculations on the fly, but throughout the design process and well into the test/production of the software keep in mind that you may have to switch to storing the pre-calculated values at some point. Design with the ability to move to that model if the need arises.
Adding the pre-calculated values is one of those things that sounds good (because in many cases it is good) but might not be needed. Keep the design as simple as it needs to be. If performance becomes an issue in doing the calculations on the fly, then you can add fields to the database to store the calculations and run a batch overnight to catch up and fill in the legacy data.
As for the banking metaphor, definitely store a complete record of all transactions. Store any data that's relevant. A database should be a store of data, past and present. Audit trails, etc. The "current state" can either be calculated on the fly or it can be maintained in a flat table and re-calculated during writes to other tables (triggers are good for that sort of thing) if performance demands it.
It depends :) Persisting derived data in the database can be useful because it enables you to implement constraints and other logic against it. Also it can be indexed or you may be able to put the calculations in a view. In any case, try to stick to Boyce-Codd / 5th Normal Form as a guide for your database design. Contrary to what you may sometimes hear, normalization does not mean you cannot store derived data - it just means data shouldn't be derived from nonkey attributes in the same table.
Fundamentally any database is a record of the known facts at a particular point in time. Most databases include some time component and some data is preserved whereas some is not - requirements should dictate this.
You've answered your own question.
Any choices that you make depend on the requirements of the application.
Sometimes speed wins, sometimes space wins. Sometime data accuracy wins, sometimes snapshots win.
While you may not have the ability to tell what's important, the person you're solving the problem for should be able to answer that for you.
I like dynamic programming(not calculate anything twise). If you're not limited with space and are fine with a bit outdated data, then precalculate it and store in the DB. This will give you additional benefit of being able to run sanity checks and ensure that data is always consistent.
But as others already replied, it depends :)
The BASE acronym is used to describe the properties of certain databases, usually NoSQL databases. It's often referred to as the opposite of ACID.
There are only few articles that touch upon the details of BASE, whereas ACID has plenty of articles that elaborate on each of the atomicity, consistency, isolation and durability properties. Wikipedia only devotes a few lines to the term.
This leaves me with some questions about the definition:
Basically Available, Soft state, Eventual consistency
I have interpreted these properties as follows, using this article and my imagination:
Basically available could refer to the perceived availability of the data. If a single node fails, part of the data won't be available, but the entire data layer stays operational.
Is this interpretation correct, or does it refer to something else?
Update: deducing from Mau's answer, could it mean the entire data layer is always accepting new data, i.e. there are no locking scenarios that prevent data from being inserted immediately?
Soft state: All I could find was the concept of data needing a period refresh. Without a refresh, the data will expire or be deleted.
Automatic deletion of data in a database seems strange to me.
Expired or stale data makes more sense. But this concept would apply to any type of redundant data storage, not just NoSQL. Does it describe something else then?
Eventual consistency means that updates will eventually ripple through to all servers, given enough time.
This property is clear to me.
Can someone explain these properties in detail?
Or is it just a far-fetched and meaningless acronym that refers to the concepts of acids and bases as found in chemistry?
The BASE acronym was defined by Eric Brewer, who is also known for formulating the CAP theorem.
The CAP theorem states that a distributed computer system cannot guarantee all of the following three properties at the same time:
Consistency
Availability
Partition tolerance
A BASE system gives up on consistency.
Basically available indicates that the system does guarantee availability, in terms of the CAP theorem.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time, given that the system doesn't receive input during that time.
Brewer does admit that the acronym is contrived:
I came up with [the BASE] acronym with my students in their office earlier that year. I agree it is contrived a bit, but so is "ACID" -- much more than people realize, so we figured it was good enough.
It has to do with BASE: the BASE jumper kind is always Basically Available (to new relationships), in a Soft state (none of his relationship last very long) and Eventually consistent (one day he will get married).
Basic Availability: The database appears to work most of the time.
Soft State: Stores don’t have to be write-consistent or mutually consistent all the time.
Eventual consistency: Data should always be consistent, with regards how any number of changes are performed.
ACID and BASE are consistency models for RDBMS and NoSQL respectively. ACID transactions are far more pessimistic i.e. they are more worried about data safety. In the NoSQL database world, ACID transactions are less fashionable as some databases have loosened the requirements for immediate consistency, data freshness and accuracy in order to gain other benefits, like scalability and resiliency.
BASE stands for -
Basic Availability - The database appears to work most of the time.
Soft-state - Stores don't have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
Eventual consistency - Stores exhibit consistency at some later point (e.g., lazily at read time).
Therefore BASE relaxes consistency to allow the system to process request even in an inconsistent state.
Example: No one would mind if their tweet were inconsistent within their social network for a short period of time. It is more important to get an immediate response than to have a consistent state of users' information.
To add to the other answers, I think the acronyms were derived to show a scale between the two terms to distinguish how reliable transactions or requests where between RDMS versus Big Data.
From this article acid vs base
In Chemistry, pH measures the relative basicity and acidity of an
aqueous (solvent in water) solution. The pH scale extends from 0
(highly acidic substances such as battery acid) to 14 (highly alkaline
substances like lie); pure water at 77° F (25° C) has a pH of 7 and is
neutral.
Data engineers have cleverly borrowed acid vs base from chemists and
created acronyms that while not exact in their meanings, are still apt
representations of what is happening within a given database system
when discussing the reliability of transaction processing.
One other point, since I having been working with Big Data using Elasticsearch it would help if I explained how it is structured. An instance of Elasticsearch is a node and a group of nodes form a cluster.
To me, from a practical standpoint, BA (Basically Available), in this context, has the idea of multiple master nodes to handle the Elasticsearch cluster and it's operations.
If you have 3 master nodes and the currently directing master node goes down, the system stays up, albeit in a less efficient state, and another master node takes its place as the main directing master node. If two master nodes go down, the system still stays up and the last master node takes over.
It could just be because ACID is one set of properties that substances show( in Chemistry) and BASE is a complement set of them.So it could be just to show the contrast between the two that the acronym was made up and then 'Basically Available Soft State Eventual Consistency' was decided as it's full-form.
In plain English, what are the disadvantages and advantages of using
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
in a query for .NET applications and reporting services applications?
This isolation level allows dirty reads. One transaction may see uncommitted changes made by some other transaction.
To maintain the highest level of isolation, a DBMS usually acquires locks on data, which may result in a loss of concurrency and a high locking overhead. This isolation level relaxes this property.
You may want to check out the Wikipedia article on READ UNCOMMITTED for a few examples and further reading.
You may also be interested in checking out Jeff Atwood's blog article on how he and his team tackled a deadlock issue in the early days of Stack Overflow. According to Jeff:
But is nolock dangerous? Could you end
up reading invalid data with read uncommitted on? Yes, in theory. You'll
find no shortage of database
architecture astronauts who start
dropping ACID science on you and all
but pull the building fire alarm when
you tell them you want to try nolock.
It's true: the theory is scary. But
here's what I think: "In theory there
is no difference between theory and
practice. In practice there is."
I would never recommend using nolock
as a general "good for what ails you"
snake oil fix for any database
deadlocking problems you may have. You
should try to diagnose the source of
the problem first.
But in practice adding nolock to queries that you absolutely know are simple, straightforward read-only affairs never seems to lead to problems... As long as you know what you're doing.
One alternative to the READ UNCOMMITTED level that you may want to consider is the READ COMMITTED SNAPSHOT. Quoting Jeff again:
Snapshots rely on an entirely new data change tracking method ... more than just a slight logical change, it requires the server to handle the data physically differently. Once this new data change tracking method is enabled, it creates a copy, or snapshot of every data change. By reading these snapshots rather than live data at times of contention, Shared Locks are no longer needed on reads, and overall database performance may increase.
My favorite use case for read uncommited is to debug something that is happening inside a transaction.
Start your software under a debugger, while you are stepping through the lines of code, it opens a transaction and modifies your database. While the code is stopped, you can open a query analyzer, set on the read uncommited isolation level and make queries to see what is going on.
You also can use it to see if long running procedures are stuck or correctly updating your database using a query with count(*).
It is great if your company loves to make overly complex stored procedures.
This can be useful to see the progress of long insert queries, make any rough estimates (like COUNT(*) or rough SUM(*)) etc.
In other words, the results the dirty read queries return are fine as long as you treat them as estimates and don't make any critical decisions based upon them.
The advantage is that it can be faster in some situations. The disadvantage is the result can be wrong (data which hasn't been committed yet could be returned) and there is no guarantee that the result is repeatable.
If you care about accuracy, don't use this.
More information is on MSDN:
Implements dirty read, or isolation level 0 locking, which means that no shared locks are issued and no exclusive locks are honored. When this option is set, it is possible to read uncommitted or dirty data; values in the data can be changed and rows can appear or disappear in the data set before the end of the transaction. This option has the same effect as setting NOLOCK on all tables in all SELECT statements in a transaction. This is the least restrictive of the four isolation levels.
When is it ok to use READ UNCOMMITTED?
Rule of thumb
Good: Big aggregate reports showing constantly changing totals.
Risky: Nearly everything else.
The good news is that the majority of read-only reports fall in that Good category.
More detail...
Ok to use it:
Nearly all user-facing aggregate reports for current, non-static data e.g. Year to date sales.
It risks a margin of error (maybe < 0.1%) which is much lower than other uncertainty factors such as inputting error or just the randomness of when exactly data gets recorded minute to minute.
That covers probably the majority of what an Business Intelligence department would do in, say, SSRS. The exception of course, is anything with $ signs in front of it. Many people account for money with much more zeal than applied to the related core metrics required to service the customer and generate that money. (I blame accountants).
When risky
Any report that goes down to the detail level. If that detail is required it usually implies that every row will be relevant to a decision. In fact, if you can't pull a small subset without blocking it might be for the good reason that it's being currently edited.
Historical data. It rarely makes a practical difference but whereas users understand constantly changing data can't be perfect, they don't feel the same about static data. Dirty reads won't hurt here but double reads can occasionally be. Seeing as you shouldn't have blocks on static data anyway, why risk it?
Nearly anything that feeds an application which also has write capabilities.
When even the OK scenario is not OK.
Are any applications or update processes making use of big single transactions? Ones which remove then re-insert a lot of records you're reporting on? In that case you really can't use NOLOCK on those tables for anything.
Use READ_UNCOMMITTED in situation where source is highly unlikely to change.
When reading historical data. e.g some deployment logs that happened two days ago.
When reading metadata again. e.g. metadata based application.
Don't use READ_UNCOMMITTED when you know souce may change during fetch operation.
Regarding reporting, we use it on all of our reporting queries to prevent a query from bogging down databases. We can do that because we're pulling historical data, not up-to-the-microsecond data.
This will give you dirty reads, and show you transactions that's not committed yet. That is the most obvious answer. I don't think its a good idea to use this just to speed up your reads. There is other ways of doing that if you use a good database design.
Its also interesting to note whats not happening. READ UNCOMMITTED does not only ignore other table locks. It's also not causing any locks in its own.
Consider you are generating a large report, or you are migrating data out of your database using a large and possibly complex SELECT statement. This will cause a shared lock that's may be escalated to a shared table lock for the duration of your transaction. Other transactions may read from the table, but updates are impossible. This may be a bad idea if its a production database since the production may stop completely.
If you are using READ UNCOMMITTED you will not set a shared lock on the table. You may get the result from some new transactions or you may not depending where it the table the data were inserted and how long your SELECT transaction have read. You may also get the same data twice if for example a page split occurs (the data will be copied to another location in the data file).
So, if its very important for you that data can be inserted while doing your SELECT, READ UNCOMMITTED may make sense. You have to consider that your report may contain some errors, but if its based on millions of rows and only a few of them are updated while selecting the result this may be "good enough". Your transaction may also fail all together since the uniqueness of a row may not be guaranteed.
A better way altogether may be to use SNAPSHOT ISOLATION LEVEL but your applications may need some adjustments to use this. One example of this is if your application takes an exclusive lock on a row to prevent others from reading it and go into edit mode in the UI. SNAPSHOT ISOLATION LEVEL does also come with a considerable performance penalty (especially on disk). But you may overcome that by throwing hardware on the problem. :)
You may also consider restoring a backup of the database to use for reporting or loading data into a data warehouse.
It can be used for a simple table, for example in an insert-only audit table, where there is no update to existing row, and no fk to other table. The insert is a simple insert, which has no or little chance of rollback.
I always use READ UNCOMMITTED now. It's fast with the least issues. When using other isolations you will almost always come across some Blocking issues.
As long as you use Auto Increment fields and pay a little more attention to inserts then your fine, and you can say goodbye to blocking issues.
You can make errors with READ UNCOMMITED but to be honest, it is very easy make sure your inserts are full proof. Inserts/Updates which use the results from a select are only thing you need to watch out for. (Use READ COMMITTED here, or ensure that dirty reads aren't going to cause a problem)
So go the Dirty Reads (Specially for big reports), your software will run smoother...