I'm designing a distributed system with a certain flow of data in it. I'd like to guarantee that at least N nodes have almost-current data at any given time.
I do not need complete consistency, only eventual consistency (i.e. for any time instant, the then-current snapshot of the data should eventually appear on at least N nodes; it is tricky to define "current" here, but still). Nodes may fail and come back up at any moment, and there is no single "central" node.
O overflowers! Point me to some good papers describing replication schemes. So far I've found one: "Consistency Management in Optimistic Replication Algorithms", plus a broader and more recent article by the same author: "Optimistic Replication".
A lot of the trick to this is finding your exact requirements, and yours still sound pretty vague. Do you just need to support operations like this?
Update key K to value V.
Look up a somewhat-recent value of key K.
You mentioned you need eventual consistency. So if you do a single update, it will eventually replicate everywhere. If you do two nearly-simultaneous updates, do you care which one wins? If one replica reports that an update was successfully completed, do you care if the value could be lost if that replica were to temporarily crash shortly afterward? Or if that replica were permanently destroyed?
How precise should somewhat-recent be? If there's a netsplit or something, a lookup might return a very stale result or just fail. Do you care which?
Do you ever need to support fancier operations like...
Get the absolute latest value of key K?
Update the value of key K to value V' provided the latest value is currently V?
Do you have rigid reliability, latency, and/or bandwidth requirements? How far apart are your replicas, and how good is the network between them? This determines whether you can afford cross-replica communication on every update (or even on every lookup), and whether you can/should fail operations over to a remote replica if the local one seems to be down.
Depending on your answers here, I've worked with a couple different schemes that might meet your requirements. There are several possible variations on them.
The simplest thing is to just have the application always talk to the local replica. Replicas timestamp values (using NTP-synced clocks) and only talk to each other for asynchronous replication. Highest timestamp wins in replication. Of course, if applications on two different replicas each do a read/modify/write near simultaneously, one of the modifications can easily be lost. (In fact, without a conditional update scheme, the same is even true for near-simultaneous changes on the same replica.) If a replica permanently fails, recent-ish updates can be lost. This is more or less what Bigtable's built-in replication does. In the paper you linked, it'd be the "Optimistic - Multimaster" branch but not caring too much about losing some updates makes it simpler than they suggest.
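A minimal sketch of that highest-timestamp-wins merge (all names here are assumptions; timestamps come from the locally NTP-synced clock):

```python
import time

# key -> (timestamp, value); one such map per replica
local_store = {}

def local_write(key, value):
    # application writes always go to the local replica
    local_store[key] = (time.time(), value)

def apply_replicated(key, remote_ts, remote_value):
    # called when an asynchronous replication message arrives from another replica
    current = local_store.get(key)
    if current is None or remote_ts > current[0]:
        local_store[key] = (remote_ts, remote_value)  # newer timestamp wins
    # otherwise the incoming update is older and is simply dropped
```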
Some databases use the Paxos algorithm (see, for example, "Data Management for Internet-Scale Single-Sign-On") to make fancier things possible. Each replica knows how far behind it might be, so you can say "give me a value that's no more than 1 minute old" or "give me the absolute latest value". An update isn't considered complete until a quorum of replicas have accepted it, so "give me the absolute latest value" will definitely always return that value until another update happens. You can do the conditional update operation I mentioned to prevent simultaneous writers from trampling each other. This doesn't fit neatly into either the optimistic or pessimistic category as defined by that author, because updates are replicated synchronously to a quorum, but replicas which didn't vote in the latest Paxos round may still be able to answer some queries. The scheme can be very complicated, though...
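To make the quorum idea concrete, here is a toy sketch (not actual Paxos; the Replica class is invented): an update only counts as complete once a majority of replicas have acknowledged it.

```python
import random

class Replica:
    """Toy in-memory replica; accept() sometimes fails to simulate an unreachable node."""
    def __init__(self):
        self.value = None

    def accept(self, update):
        if random.random() < 0.2:   # simulated crash / network failure
            return False
        self.value = update
        return True

def quorum_write(update, replicas):
    acks = sum(1 for r in replicas if r.accept(update))
    if acks < len(replicas) // 2 + 1:   # a majority must acknowledge
        raise RuntimeError("no quorum reached; update is not considered complete")

quorum_write("v1", [Replica() for _ in range(5)])
```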
Not RDBMS-agnostic, but SQL Server 2008 (and 2005 onwards) supports peer-to-peer replication.
As per this answer, it is recommended to go for a single table in Cassandra.
Cassandra 3.0
We are planning the schema below:
The second table has a composite primary key: PK(domain_id, item_id). So domain_id is the partition key and item_id is the clustering key.
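Roughly, the second table would look something like this (created here via the DataStax Python driver; the table name and non-key columns are just placeholders, since the full schema isn't shown):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")
session.execute("""
    CREATE TABLE IF NOT EXISTS second_table (
        domain_id text,
        item_id   text,
        details   text,                    -- placeholder column
        PRIMARY KEY ((domain_id), item_id) -- partition key, clustering key
    )
""")
```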
The GET request handler will access (read) both tables
The POST request handler will access (write) both tables
The PUT request handler will access (write) the details table only
As per the CAP theorem:
What are the consistency issues in having a multi-table schema in Cassandra?
Can we avoid consistency issues in Cassandra with QUORUM, consistency levels, etc.?
recommended to go for a single table in Cassandra.
I would recommend the opposite. If you have to support multiple queries for the same data in Apache Cassandra, you should have one table for each query.
What are the consistency issues in having a multi-table schema in Cassandra?
Consistency issues between query tables can happen when writes are applied to one table but not the other(s). In that case, the application should have a way to gracefully handle it. If it becomes problematic, perhaps running a nightly job to keep them in-sync might be necessary.
You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process. In that case, a given data point may have only a subset of its intended replicas.
This scenario can be countered by running regularly-scheduled repairs. Additionally, consistency can be increased on a per-query basis (QUORUM vs. ONE, etc), and consistency levels of QUORUM and higher will occasionally trigger a read-repair (which syncs all replicas in the current operation).
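For example, with the DataStax Python driver the consistency level can be chosen per statement: QUORUM for a read that needs to be fresher, ONE where latency matters more (the table and column names below just follow the question and are assumptions here):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

fresh_read = SimpleStatement(
    "SELECT * FROM domain_details_table WHERE domain_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)   # a majority of replicas must answer

fast_read = SimpleStatement(
    "SELECT * FROM domain_details_table WHERE domain_id = %s",
    consistency_level=ConsistencyLevel.ONE)      # first replica to answer wins

rows = session.execute(fresh_read, ("example.com",))
```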
Can we avoid consistency issues in Cassandra with QUORUM, consistency levels, etc.?
So Apache Cassandra was engineered to be highly available (HA), thereby embracing the paradigm of eventual consistency. Some might interpret that to mean Cassandra is inconsistent by design, and they would not be incorrect. I can say, after several years of supporting hundreds of clusters at web/retail scale, that consistency issues (while they do happen) are rare, and are usually caused by failures of components outside the Cassandra cluster.
Ultimately though, it comes down to the business requirements of the application. For some applications like product reviews or recommendations, a little inconsistency shouldn't be a problem. On the other hand, things like location-based pricing may need a higher level of query consistency. And if 100% consistency is indeed a hard requirement, I would question whether or not Cassandra is the proper choice for data storage.
Edit
I did not get this: "Consistency issues between query tables can happen when writes are applied to one table but not the other(s)." When writes are applied to one table but not the other(s), what happens?
So let's say that a new domain is added. Perhaps a scenario arises where the domain_details_table gets updated, but the id_table does not. Nothing wrong here on the database side. Except that when the application expects to find that domain_id in the id_table, but cannot.
In that case, maybe the application can retry using a secondary index on domain_details_table.domain_id. It won't be fast, but the decision to be made is really about which scenario is preferable: no answer, or a slow answer? Again, application requirements come into play here.
For your point: "You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process." How does an RDBMS (like MySQL) deal with this?
So the answer to this used to be simple. RDBMSs only run on a single server, so there's only one replica to keep in-sync. But today, most RDBMSs have HA solutions which can be used, and thus have to be kept in-sync. In that case (from what I understand), most of them will asynchronously update the secondary replica(s), while restricting traffic only to the primary.
It's also good to remember that RDBMSs enforce consistency through locking strategies, as well. So even a single-instance RDBMS will lock a data point during an update, blocking any reads until the lock is released.
In a node-down scenario, a single-instance RDBMS will be completely offline, so instead of inconsistent data you'd have data loss. In an HA RDBMS scenario, there would be a short pause (during which you would likely encounter connection/query failures) until it has failed over to the new primary. Once the failed replica comes back up, there would probably be additional time needed to sync up the replicas before HA can be restored.
I am new to distributed systems and I have read about the CAP theorem. I am interested in an AP system such as Cassandra.
My question is: in what cases can you actually sacrifice consistency? Effectively, what I am saying is that sacrificing consistency means serving inaccurate data. In what cases would you actually use an AP datastore like Cassandra? I can't think of any case where I wouldn't want my reads to be consistent.
By an AP system, I assume you will at least aim to ensure eventual consistency.
Imagine you're developing a social network where users have friends and their own news feeds. It doesn't matter if a particular user's feed has an occasional five-minute lag (his feed list has eventual consistency). Missing two or three very recent updates in the news feed is okay in this scenario, as long as those updates will eventually appear. And in fact, Facebook built its news feed using Cassandra.
Imagine a distributed key-value store cache system where updates are very rare. If there are almost no update operations, ensuring strong consistency is unnecessary, so you can focus on availability. An occasional cache miss (the key-value entry is not populated yet) and a request to the database due to eventual consistency should be okay.
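A toy read-through sketch of that pattern (fetch_from_db is a stand-in for the real database call): an occasional miss or slightly stale entry just costs one extra trip to the database.

```python
cache = {}

def fetch_from_db(key):
    return "value-for-" + key        # placeholder for the real database lookup

def get(key):
    if key not in cache:             # miss: the entry hasn't been populated yet
        cache[key] = fetch_from_db(key)
    return cache[key]
```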
My question is in what cases can you actually sacrifice consistency?
One case would be when building a recommendation engine data set and serving it with Cassandra. These data sets are essentially the aggregation of many, many users to determine purchasing/viewing patterns.
For example: If I add a Rey Star Wars action figure to my shopping cart, the underlying recommendation engine runs a query for similar resulting purchasing patterns based on others who have also purchased an action figure of Rey. The query returns the top 5 product results, and puts them at the bottom of the page.
Those 5 products returned are the result of analysis and aggregation of several thousand prior purchases. Let's assume that some of that data isn't consistent, causing a variance in the 5 products returned. Is that really a big deal?
tl;dr: The real question to ask is whether getting a somewhat-accurate list of 5 product recommendations in less than 10 ms is better than getting a 100% accurate list of 5 product recommendations in 100 ms.
Both result sets will help drive sales. But the one which is returned fast enough that it doesn't hinder the user experience is much preferred.
'C' in CAP refers to linearizability, which is a very strong form of consistency that you don't need most of the time.
Linearizability is a recency guarantee which makes it appear that there is a single copy of data. As soon as you make a change in the data, all subsequent reads will return the changed data. Such a level of consistency is expensive and doesn't scale well. Yet in certain scenarios we need linearizability, viz.
Leader election
Allowing end users to create their unique user id
Distributed locking etc.
When you have these use cases, you'd use something like ZooKeeper, etcd, etc. Cassandra also has Lightweight Transactions (LWT), which use an extension of the classic Paxos algorithm to implement linearizability. This feature can be used to address those rare use cases where you must have linearizability and serializability, but it is expensive. In the vast majority of cases you are just fine with slightly weaker consistency to get better scalability and performance. You trade a little bit of consistency for scalability and performance.
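For the unique-user-id case, a minimal sketch of an LWT with the DataStax Python driver (the users table is an assumption):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# IF NOT EXISTS turns the insert into a lightweight transaction (a Paxos round)
result = session.execute(
    "INSERT INTO users (user_id, email) VALUES (%s, %s) IF NOT EXISTS",
    ("alice", "alice@example.com"))

if not result.was_applied:           # another client already claimed this id
    print("user_id 'alice' is already taken")
```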
Some eCommerce websites send apology letters to customers for not being able to fulfil their orders. That is because the last copy of the product was sold to more than one customer due to the lack of linearizability. They prefer to deal with that over not being able to scale with the customer base and not being able to respond to requests within stringent SLAs.
Cassandra is said to have tunable consistency. You may want to record user clicks or activities for analysis. You are okay if some data is lost, but you cannot compromise on performance. You'd probably use a write consistency level of ANY with hints enabled (sloppy quorum).
If you want a little more consistency, you'd use a QUORUM consistency level for reads and writes, along with hints and read repair. In the vast majority of cases all nodes are updated almost instantly. Even if one or two nodes go down, a majority of nodes will have the data, and failed nodes are repaired when they come back using hints, read repair, and anti-entropy repair.
Cassandra is particularly useful for cases where you don't have many concurrent updates to the same data. The reason is that, unlike the Dynamo architecture, it does not use vector clocks for conflict resolution between replicas. Instead it uses Last Write Wins (LWW) based on timestamps. If the timestamps are the same, it uses lexicographical order. Since the time on different nodes cannot be perfectly accurate even in the presence of NTP, there is a possibility of data loss, although Cassandra has taken some steps to mitigate that, e.g. client-side timestamps instead of server-side timestamps.
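A minimal sketch of that resolution rule (pure illustration, not Cassandra's internal code):

```python
def resolve(a, b):
    """a and b are (timestamp, value) pairs written for the same key/column."""
    if a[0] != b[0]:
        return a if a[0] > b[0] else b   # last write wins by timestamp
    return a if a[1] > b[1] else b       # equal timestamps: lexicographically larger value wins
```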
The CAP theorem says that, given partition tolerance, you can choose either availability or consistency in a distributed database (no one would want to give up partition tolerance in any case). So if you want maximum availability, you'll have to give up consistency. That depends, of course, on how critical the business is.
You answered something on SO but the answer doesn't show up when you visit the page? That can be tolerated. SO being down? It can't be. Critical financial systems would rather have strong consistency than availability: every once in a while, my bank's servers go offline when I try to make a payment.
Normally, you choose availability and eventual consistency. The answer you wrote on SO will eventually show up.
Apart from the above mentioned cases where inconsistent data is tolerable, there are also scenarios where we can defer to the user to solve the inconsistency.
For example, if we find two different versions of someone's address in the database, we can prompt the user to identify the correct address.
In Cassandra, Hinted Handoff (HH) only happens when the consistency level can be met. Also, hints are unreadable to clients. With a consistency level > ANY, using HH can improve neither write nor read availability: requests still fail when the online replicas are insufficient to meet the consistency requirement.
What's the point of using Hinted Handoff? Trading capacity for performance?
Why not just synchronize the failed-and-back node with other replica nodes (i.e. re-replication)?
Hinted handoff is just an additional anti-entropy measure, i.e. you don't have to run repairs right away, and data becomes consistent (if there was only a minor outage) when the node comes back online.
I guess it would just be too complicated to take care of this with re-replication all the time, because you would have to somehow mark the data that hasn't been replicated, etc. Basically you would end up with something similar to hinted handoff again.
Some stuff from official docs:
https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_about_hh_c.html#concept_ds_ifg_jqx_zj__extreme-write-availability
Basically it's there to maximise the write throughput of the cluster during minor outages. It's configurable, and you can disable it in the case you described, where high consistency levels are involved in both reads and writes.
Plus you have to run the "re-replication" (i.e. repair) anyway, because hinted handoff can't really take care of it all.
Personally I have used hints in a setup with R-CL: ONE, W-CL: ONE, RF: 2, nodes: 3. They were very helpful because we maintained the write throughput while doing maintenance and rolling restarts on the cluster. So I would say it works well in situations where W-CL < RF.
Then again, there are opinions like this one:
https://blog.threatstack.com/scaling-cassandra-lessons-learned
In fact, just disable them in the configuration. It’s too easy to lose data during a prolonged outage or load spike, and if a node went down because of the load spike you’re just going to pass the problem around the ring, eventually taking multiple or all nodes down. We never experienced this on Cassandra, but have on other systems that supported hinted handoffs.
I recently came across a case that makes me wonder if I'm a newbie or something trivial has escaped me.
Suppose I have software that is run by many users and uses a table. When a user logs in to the app, a series of records from the table appears, and he just has to add or correct some information and save it. Now, if the software is run by many people, how can I guarantee that he is the only one working with that particular record? I mean, how can I know the record is not selected and being worked on by 2 or more users at the same time? And please, I wouldn't like the answer to be "SELECT FOR UPDATE...", because from what I've read it has too negative an impact on the database. Thanks to all of you. Keep up the good work.
This is something that is not solved primarily by the database. The database manages isolation and locking of "concurrent transactions". But by the time the records are sent to the client, you have usually (and hopefully) closed the transaction, and you start a new one when the data comes back.
So you have to take care of this yourself.
There are different approaches, the ones that come into my mind are:
optimistic locking strategies (first wins)
pessimistic locking strategies
last wins
Optimistic locking: when storing, you check whether the record has been changed in the meanwhile. Usually this is done with a version counter or timestamp column. Some ORMs and frameworks help a little to implement this.
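A minimal sketch of the version-counter approach, assuming a SQLite-style connection and an items table that carries a version column (all names are made up):

```python
import sqlite3

def save_item(conn, item_id, new_value, expected_version):
    cur = conn.execute(
        "UPDATE items SET value = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_value, item_id, expected_version))
    conn.commit()
    if cur.rowcount == 0:
        # someone else saved first (first wins); let the caller reload and retry
        raise RuntimeError("stale version: record was changed by another user")
```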
Pessimistic locking: build a mechanism that records that someone has started to edit something and does not allow anyone else to edit the same record. Especially in web projects, it needs a timeout after which the lock is released anyway.
Last wins: the second person storing the record simply overwrites the first person's changes.
... makes me wonder if I'm a newbie ...
That is what always happens when we discover that very common problems are still not solved by the tools and frameworks we use, and we have to solve them over and over again.
Now, if the software he uses is run by many people, how can I guarantee that he is the only one working with that particular record?
Ah...
And please, I wouldn't like the answer to be "SELECT FOR UPDATE...", because from what I've read it has too negative an impact on the database.
Who cares? I mean, it is the only way (keeping a lock on a row) to guarantee you are the only one who can change it. Yes, this limits throughput, but that is WHAT YOU WANT.
It is called programming: choosing the right tools for the job. In this case the impact is required because of the requirements.
The alternative (not a guarantee at the database level, but at the application-server level) is an in-memory or in-database locking mechanism (like a table indicating which objects belong to which user).
But if you need to guarantee at the db level that one record is only used by one person, then you MUST keep a lock around and deal with the impact.
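A minimal sketch of that, PostgreSQL-flavoured via psycopg2 (table and column names are made up):

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
with conn:                       # transaction scope: commits on success
    with conn.cursor() as cur:
        # the row stays locked until the transaction commits or rolls back
        cur.execute("SELECT value FROM items WHERE id = %s FOR UPDATE", (42,))
        (value,) = cur.fetchone()
        cur.execute("UPDATE items SET value = %s WHERE id = %s", (value + 1, 42))
```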
But seriously, most programs avoid this. They deal with it either with optimistic locking (the second user submitting changes gets an error) or other programmer-level decisions, BECAUSE the cost of such guarantees is ridiculously high.
Oracle is different from SQL Server.
In Oracle, when you update a record or data set, the old information is still available, because your uncommitted update is held back (in the database buffer cache) until commit.
Therefore anyone reading the same record will still see the old value.
If the access to this record is a write access, though, it will be blocked by the lock until commit; only then will you be able to write the same record.
If the lock waits can't be resolved (two sessions each waiting on the other), a deadlock will pop up.
SQL Server, though, doesn't let you read a record that has been locked for writing, so depending on which query you're running, you might end up locking an entire table.
First, you need to separate queries from inserts/updates by using a data-warehouse database. That way you can address slow updates that cause locks.
The next step is to identify what is causing the locks and work each case out separately.
Rebuilding indexes during working hours can cause very nasty locks; push those jobs to after hours.
Whenever I read something about NoSQL distributed databases, they mention the CAP theorem and that it means that in a partitioned system you can have full consistency, full availability, or a little bit of both, but never both entirely.
What is not really clear to me is what type of consistency they are talking about:
Is it consistency in data freshness, where some clients may get older data than others?
Or is it consistency in the sense that transactions may complete only partially and this may bring the data in an inconsistent state?
The second interpretation sounds quite dangerous to me and not really acceptable. The first interpretation sounds acceptable, but how can you prevent a client that requests a set of data from being served partly outdated data and partly fresh data?
How dangerous is it to only offer partial consistency and what are the possible negative effects?
Consistency in distributed databases is a huge problem, and it means both of your options: stale data in some places, and partially completed transactions. I'm not going to write an essay about it because it is a huge problem and the solutions are not easy. However, here are some key phrases.
Eventual Consistency is the solution to this, but implementing it sounds like a big job. The key to the implementation is Idempotent Messages. Let's say a complete transaction involves updating data on machines A, B, and C. How do you actually do that? You start sending messages around the place, and keep sending them until you receive an acknowledgement of receipt and successful processing. You may send the message to B twice, either because B never got the message, or because B's ack never got received. If you sent it twice because you never got the ack, then B had better do the right thing when it gets it again (which may be to ignore it), and send you an ack so you stop bothering it.
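A toy sketch of the idempotency idea (all names invented): the receiver remembers which message ids it has already applied, so a re-sent message is acknowledged without being applied twice.

```python
processed_ids = set()   # ids of messages this replica has already applied
store = {}

def handle_update(msg_id, key, value):
    if msg_id not in processed_ids:   # first delivery: apply the update
        store[key] = value
        processed_ids.add(msg_id)
    return "ack"                      # always ack so the sender stops resending

handle_update("m1", "k", "v1")
handle_update("m1", "k", "v1")        # duplicate delivery: ignored, but still acked
```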
This is a pretty good article, it looks like, and it's from a NoSQL point of view. There are loads of links about Idempotent Messages hidden in any search engine, so I'll let you root around.
Final note: Pat Helland, who worked on distributed databases for many years (at Microsoft and Google, among other places), eventually came to the conclusion that consistency for distributed DBs was impossible, and that you'd better settle for eventual consistency via idempotent messages.