When is returning stale state acceptable? - database

I am looking at possible combinations of a consistency with replication factor with Cassandra.
For this combination
RF 2
Write CL ONE
Read CL ONE
R. Strickland in Cassandra High Availability says
sometimes returning stale data is acceptable
How it comes that this is acceptable?
As I understand stale state does not reflect reality,original database could be changed.

This is really depends on your application - you might tolerate reading the "stale" data, or losing some data because node where you did write at ONE is down...
For example, quite often, when writing time series from sensors, people may tolerate losing several data points, but achieve very high write throughput.
For reading, there could be also cases, where the answering faster could be more important than showing up-to-date data. Example, for example, may include something like "Google Finance" where it's not so important that user see the latest data point, but for user is more important to get response fast, especially when user has many stocks in portfolio...

Related

Database System for timeseries & aggregation

I'm currently creating a raspberry pi based logging device for logging the power which is fed into the grid by a solar array.
The "main table" will be growing at ~ 20 entries representing the "current" power produced by several parts of the array.
Basically this isn't that much and can be handled at an acceptable performance using a raspberry pi, but with a growing amount of data queries like "select last 10 years, group by month" probably wouldn't be very effective... (the data should be displayed via an interactive web interface)
I thought of doing some "background aggregation" and maintaining several tables for containing the aggregated data of various timeframes, but this seems like a problem which probably has been dealt with by many people before.
What do you suggest me to do?
You do not know how much data growth is needed to affect performance.
You do not know by how much performance will be affected then.
You do not know if performance will be affected at all.
As long as you do not have even an estimate of how much performance improvement you need, it does not make sense to try to do optimizations.
Or, as said by Donald Knuth:
premature optimization is the root of all evil
If you really do want to create caches of aggregated values, I'd suggest to use triggers to keep the cache consistent after any change to the original data.

In what circumstances support of transaction is not that important?

After posting the question here, I got to know that NoSQL are better at scaling out because they make a trade off between support for transaction and scalability.
So I wonder in what circumstances transactions are not that important so that scalability is more preferable to support of transaction?
Well, I would say first that NoSQL is better at scaling is some situations, but not all.
Full ACID transactions are Atomic, Consistent, Isolated and Durable. If you lose transactions, you will loose some or all of ACID within the datastore.
There are many ways to restore these functions with other asynchronous systems like message queues that themselves are durable. You can shove data onto a durable message queue, pop the data and deal with it in your NoSQL, then, when you can confirm it's stored to your required minimum, you can flag the message as consumed. It's the D in ACID, but distributed and asynchronous. There are ways to ensure the others, but they are often sacrificed to some extent, or moved into another place in the system. With some NoSQL solutions, you just have to move consistency into the application so it doesn't try to store invalid data.
When you start moving away from database driven transactions, you must increase your application testing dramatically to ensure your system doesn't fail (for some values of fail).
There are essentially no situations where transactions and constraints are not important in a system that has both read and write requirements. If they weren't you wouldn't care about your data at all (and some people don't, but regret it later). There are however levels of "caring". It's just a matter of how you end up at ACID or some pseudo-ACID that's "good enough". RDMBS makes caring about your data cheap. NoSQL makes caring about your data expensive, but, it makes scaling cheap(er) (in some cases). There are many companies with multi-terabyte database in RDBMSes, so to say unilaterally that "they don't scale" is simply inaccurate. Multi-terabyte SQL databases however, can cost lots of money, depending on the use case (you can after all just slap a RAID 10 array with a few 3TB drives onto a computer and throw a database engine on it. Might take several minutes to a few hours to do any kind of table scan on a big table, or even indexed look-up though, but if you don't care, it's cheap and multi-terabyte).
The biggest category is read-only type queries, where an aborted or botched transaction can simply be repeated. Anything where you are changing an underlying state, or want to guarantee once and only once activity, should have proper transactional semantics.
That is, "I want to order one widget, charge my credit card" should be a proper transaction: I don't want my card charged unless the widget is ordered, and the vendor doesn't want the widget sent unless the card is charged. "Report the shipment status of order xyz" doesn't need to be transactional -- if I don't get an answer, I can hit reload.
Much of it is just a bit of lateral thinking.
Thw whole point of transcation is you wrap up several operations, and should any fail all that have succeeded get rolled back, and while the transaction is in progress, records are locked and unless you have read uncommitted going, you don't see any on the individual changes of state until the transaction is committed.
Doing all that with distributed systems is expensive, because you need one 'central' and difficult to scale point that needs to 'know' all about the others.
So instead or Order this, charge my card, and show me my current balance.
You do Try to order this, if it's instock charge my card, and if my card gets charged the current known balance will be this.
There's a risk, that the order will be placed, put payment fail, so you need to deal with that. There's a risk that the proposed balance of the card my not be entirely accurate, hence add weasel words and show the potential effect of payment as opposed to the result.
It's not so much are transactions important, it's seeing as they aren't as well supported in NoSQL systems, where/how can I get away with not using them.

What does 'soft-state' in BASE mean?

BASE stands for 'Basically Available, Soft state, Eventually consistent'
So, I've come this far: "Basically Available: the system is available, but not necessarily all items in it at any given point in time" and "Eventually Consistent: after a certain time all nodes are consistent, but at any given time this might not be the case" (please correct me if I'm wrong).
But, what is meant exactly by 'Soft State'? I haven't been able to find any decent explanations on the internet yet.
This page (originally here, now available only from the web archive) may help:
[soft state] is information (state) the user put into the system that
will go away if the user doesn't maintain it. Stated another way, the
information will expire unless it is refreshed.
By contrast, the position of a typical simple light-switch is
"hard-state". If you flip it up, it will stay up, possibly forever. It
will only change back to down when you (or some other user) explicitly
comes back to manipulate it.
The BASE acronym is a bit contrived, and most NoSQL stores don't actually require data to be refreshed in this way. There's another explanation suggesting that soft-state means that the system will change state without user intervention due to eventual consistency (but then the soft-state part of the acronym is redundant).
There are some specific usages where state must indeed be refreshed by the user; for example, in the Cassandra NoSQL database, one can give all rows a time-to-live to make them completely soft-state (they will expire unless refreshed), but this is an unusual mode of usage (a transient cache, essentially).
"Soft-state" might also apply to the gossip protocol within Cassandra; a new node can determine the state of the cluster from the gossip messages it receives, and this cluster state must be constantly refreshed to detect unresponsive nodes.
I was taught in classes that "Soft state" means that the state of the system could change over time (even during times without input), because there may be changes going on due to "eventual consistency". That's why says "soft" state.
Some source: link
Soft state means data that is not persisted on the disk, yet in case of failure it could be possible to restore it (e.g. recreate a lower quality image from a high quality one). A good article that addresses this and other interesting issues is Cluster-Based Scalable Network Services
A BASE system gives up on consistency to improve the performance of the database. Hence most of the famous NoSQL databases are highly available and scalable than ACID-compliant relational databases.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time.
For example, Consider two systems such as A and B. If a user writes data to a system A, there will be some delay in reflecting these written data to B usually within milliseconds(depending upon the network speed and the design for syncing).

2 Questions about Philosophy and Best Practices in Database Development

Which one is best, regarding the implementation of a database for a web application: a lean and very small database with only the bare information, sided with a application that "recalculates" all the secondary information, on demand, based on those basic ones, OR, a database filled with all those secondary information already previously calculated, but possibly outdated?
Obviously, there is a trade-of there and I think that anyone would say that the best answer to this question is: "depends" or "is a mix between the two". But I'm really not to comfortable or experienced enough to reason alone about this subject. Could someone share some thoughts?
Also, another different question:
Should a database be the "snapshot" of a particular moment in time or should a database accumulate all the information from previous time, allowing the retrace of what happened? For instance, let's say that I'm modeling a Bank Account. Should I only keep the one's balance on that day, or should I keep all the one's transactions, and from those transactions infer the balance?
Any pointer on this kind of stuff that is, somehow, more deep in database design?
Thanks
My quick answer would be to store everything in the database. The cost of storage is far lower than the cost of processing when talking about very large scale applications. On small scale applications, the data would be far less, so storage would still be an appropriate solution.
Most RDMSes are extremely good at handling vast amounts of data, so when there are millions/trillions of records, the data can still be extracted relatively quickly, which can't be said about processing the data manually each time.
If you choose to calculate data rather than store it, the processing time doesn't increase at the same rate as the size of data does - the more data ~ the more users. This would generally mean that processing times would multiply by the data's size and the number of users.
processing_time = data_size * num_users
To answer your other question, I think it would be best practice to introduce a "snapshot" of a particular moment only when data amounts to such a high value that processing time will be significant.
When calculating large sums, such as bank balances, it would be good practice to store the result of any heavy calculations, along with their date stamp, to the database. This would simply mean that they will not need calculating again until it becomes out of date.
There is no reason to ever have out of date pre-calulated values. That's what trigger are for (among other things). However for most applications, I would not start precalculating until you need to. It may be that the calculation speed is always there. Now in a banking application, where you need to pre-calculate from thousands or even millions of records almost immediately, yes, design a precalulation process bases on triggers that adjust the values every time they are changed.
As to whether to store just a picture in time or historical values, that depends largely on what you are storing. If it has anything to do with financial data, store the history. You will need it when you are audited. Incidentally, design to store some data as of the date of the action (this is not denormalization). For instance, you have an order, do not rely onthe customer address table or the product table to get data about where the prodcts were shipped to or what they cost at the time of the order. This data changes over time and then you orders are no longer accurate. You don't want your financial reports to change the dollar amount sold because the price changed 6 months later.
There are other things that may not need to be stored historically. In most applications we don't need to know that you were Judy Jones 2 years ago and are Judy Smith now (HR application are usually an exception).
I'd say start off just tracking the data you need and perform the calculations on the fly, but throughout the design process and well into the test/production of the software keep in mind that you may have to switch to storing the pre-calculated values at some point. Design with the ability to move to that model if the need arises.
Adding the pre-calculated values is one of those things that sounds good (because in many cases it is good) but might not be needed. Keep the design as simple as it needs to be. If performance becomes an issue in doing the calculations on the fly, then you can add fields to the database to store the calculations and run a batch overnight to catch up and fill in the legacy data.
As for the banking metaphor, definitely store a complete record of all transactions. Store any data that's relevant. A database should be a store of data, past and present. Audit trails, etc. The "current state" can either be calculated on the fly or it can be maintained in a flat table and re-calculated during writes to other tables (triggers are good for that sort of thing) if performance demands it.
It depends :) Persisting derived data in the database can be useful because it enables you to implement constraints and other logic against it. Also it can be indexed or you may be able to put the calculations in a view. In any case, try to stick to Boyce-Codd / 5th Normal Form as a guide for your database design. Contrary to what you may sometimes hear, normalization does not mean you cannot store derived data - it just means data shouldn't be derived from nonkey attributes in the same table.
Fundamentally any database is a record of the known facts at a particular point in time. Most databases include some time component and some data is preserved whereas some is not - requirements should dictate this.
You've answered your own question.
Any choices that you make depend on the requirements of the application.
Sometimes speed wins, sometimes space wins. Sometime data accuracy wins, sometimes snapshots win.
While you may not have the ability to tell what's important, the person you're solving the problem for should be able to answer that for you.
I like dynamic programming(not calculate anything twise). If you're not limited with space and are fine with a bit outdated data, then precalculate it and store in the DB. This will give you additional benefit of being able to run sanity checks and ensure that data is always consistent.
But as others already replied, it depends :)

Explanation of BASE terminology

The BASE acronym is used to describe the properties of certain databases, usually NoSQL databases. It's often referred to as the opposite of ACID.
There are only few articles that touch upon the details of BASE, whereas ACID has plenty of articles that elaborate on each of the atomicity, consistency, isolation and durability properties. Wikipedia only devotes a few lines to the term.
This leaves me with some questions about the definition:
Basically Available, Soft state, Eventual consistency
I have interpreted these properties as follows, using this article and my imagination:
Basically available could refer to the perceived availability of the data. If a single node fails, part of the data won't be available, but the entire data layer stays operational.
Is this interpretation correct, or does it refer to something else?
Update: deducing from Mau's answer, could it mean the entire data layer is always accepting new data, i.e. there are no locking scenarios that prevent data from being inserted immediately?
Soft state: All I could find was the concept of data needing a period refresh. Without a refresh, the data will expire or be deleted.
Automatic deletion of data in a database seems strange to me.
Expired or stale data makes more sense. But this concept would apply to any type of redundant data storage, not just NoSQL. Does it describe something else then?
Eventual consistency means that updates will eventually ripple through to all servers, given enough time.
This property is clear to me.
Can someone explain these properties in detail?
Or is it just a far-fetched and meaningless acronym that refers to the concepts of acids and bases as found in chemistry?
The BASE acronym was defined by Eric Brewer, who is also known for formulating the CAP theorem.
The CAP theorem states that a distributed computer system cannot guarantee all of the following three properties at the same time:
Consistency
Availability
Partition tolerance
A BASE system gives up on consistency.
Basically available indicates that the system does guarantee availability, in terms of the CAP theorem.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time, given that the system doesn't receive input during that time.
Brewer does admit that the acronym is contrived:
I came up with [the BASE] acronym with my students in their office earlier that year. I agree it is contrived a bit, but so is "ACID" -- much more than people realize, so we figured it was good enough.
It has to do with BASE: the BASE jumper kind is always Basically Available (to new relationships), in a Soft state (none of his relationship last very long) and Eventually consistent (one day he will get married).
Basic Availability: The database appears to work most of the time.
Soft State: Stores don’t have to be write-consistent or mutually consistent all the time.
Eventual consistency: Data should always be consistent, with regards how any number of changes are performed.
ACID and BASE are consistency models for RDBMS and NoSQL respectively. ACID transactions are far more pessimistic i.e. they are more worried about data safety. In the NoSQL database world, ACID transactions are less fashionable as some databases have loosened the requirements for immediate consistency, data freshness and accuracy in order to gain other benefits, like scalability and resiliency.
BASE stands for -
Basic Availability - The database appears to work most of the time.
Soft-state - Stores don't have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
Eventual consistency - Stores exhibit consistency at some later point (e.g., lazily at read time).
Therefore BASE relaxes consistency to allow the system to process request even in an inconsistent state.
Example: No one would mind if their tweet were inconsistent within their social network for a short period of time. It is more important to get an immediate response than to have a consistent state of users' information.
To add to the other answers, I think the acronyms were derived to show a scale between the two terms to distinguish how reliable transactions or requests where between RDMS versus Big Data.
From this article acid vs base
In Chemistry, pH measures the relative basicity and acidity of an
aqueous (solvent in water) solution. The pH scale extends from 0
(highly acidic substances such as battery acid) to 14 (highly alkaline
substances like lie); pure water at 77° F (25° C) has a pH of 7 and is
neutral.
Data engineers have cleverly borrowed acid vs base from chemists and
created acronyms that while not exact in their meanings, are still apt
representations of what is happening within a given database system
when discussing the reliability of transaction processing.
One other point, since I having been working with Big Data using Elasticsearch it would help if I explained how it is structured. An instance of Elasticsearch is a node and a group of nodes form a cluster.
To me, from a practical standpoint, BA (Basically Available), in this context, has the idea of multiple master nodes to handle the Elasticsearch cluster and it's operations.
If you have 3 master nodes and the currently directing master node goes down, the system stays up, albeit in a less efficient state, and another master node takes its place as the main directing master node. If two master nodes go down, the system still stays up and the last master node takes over.
It could just be because ACID is one set of properties that substances show( in Chemistry) and BASE is a complement set of them.So it could be just to show the contrast between the two that the acronym was made up and then 'Basically Available Soft State Eventual Consistency' was decided as it's full-form.

Resources