I am trying to use the Mobile Data service in Bluemix and I came across the term "last write wins". Can anyone explain clearly what it is?
And what are the other options apart from it?
Last write wins is a strategy for deciding which copy of the data is the most up-to-date when replication is used; Cassandra is one example. It is simple and fast (each write carries a timestamp and the newest timestamp wins), but it offers limited guarantees: it can silently lose updates, because clocks in computer systems are never perfectly synchronized and nodes sometimes go down.
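To make the idea concrete, here is a minimal sketch of timestamp-based conflict resolution. The `VersionedValue` type and `resolve_lww` helper are invented for illustration and are not how any particular product implements it:

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp_us: int  # microseconds since epoch, assigned by whoever wrote the value

def resolve_lww(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    # Keep the write with the newest timestamp; break ties on the value itself,
    # roughly how Cassandra breaks timestamp ties lexically.
    if a.timestamp_us != b.timestamp_us:
        return a if a.timestamp_us > b.timestamp_us else b
    return a if a.value >= b.value else b

# Two replicas received conflicting writes for the same key:
mine = VersionedValue("court booked by me", 1_700_000_000_000_000)
theirs = VersionedValue("court booked by colleague", 1_700_000_000_000_500)
print(resolve_lww(mine, theirs).value)  # the later write wins; the earlier one is silently lost
```

Note that nothing in this scheme tells the loser that their write was discarded, which is exactly the "lost update" risk mentioned above.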
Check out CouchDB and MongoDB to see how they handle consistency: MongoDB uses locks to achieve consistency, while CouchDB uses eventual consistency. Mobile Data is based on Cloudant (CouchDB under the covers), which is why Mobile Data uses eventually consistent "last write wins".
Last write wins is basically used during synchronization of files for mobile applications through the File Sync plug-in.
File Sync is limited to using a "last write wins" policy when multiple applications are updating the same files. With "last write wins", the device's copy overwrites the copy stored by File Sync. The resulting behavior depends on whether you are running in automatic or manual mode.
You can visit the link below for reference:
https://www.ng.bluemix.net/docs/#services/mobiledata/index.html#filesync
My understanding of "last write wins" is like this:
Using Selenium, I was the first user to book a tennis court on a website. However, another user also managed to write to the same booking, as we were within milliseconds of each other. The bookings open at 7am, which gives us one whole second. I managed to book the court and got the email notification that I had successfully booked it. However, my colleague, who was slightly later (still within that second), ended up being the last user to write to the booking, and he also got an email notification. Much to my annoyance, it was his name that appeared on the website as the user who had booked the court. His was the last write to the database, so he wins, even though I beat him to the booking initially. After my write completed there was probably a thousandth of a second left, which was enough for him to get the final write in.
Ganz
I have previously built some very basic real-time applications using sockets, and I have been reading more about the topic out of curiosity. One very interesting article I read was about Operational Transformation, and I learned several new things from it. After reading it, I kept wondering when and how this data actually gets saved to the database if I want to keep it. I have two assumptions/theories about what might be going on, but I'm not sure whether they are correct and/or the best way to solve this issue. They are as follows:
(For this example, let's assume it's a real-time collaborative whiteboard:)
For every edit that happens (e.g. drawing a line), the socket will send a message to everyone collaborating, and at the same time I will store the data in my database. The problem I see with this solution is the time spent accessing the database: for every line a user draws, I would have to hit the database to store it.
Use polling. For this theory, I think of saving all the data in temporary storage on the server, and then after 'x' amount of time pulling everything from the temporary storage and saving it to the database. The issue with this theory is the possibility of a failure in the temporary storage (e.g. a power failure). If the temporary storage loses its data before it is saved to the database, I would never be able to recover it.
How do similar real-time collaborative applications like Google Docs, Slides, etc. store data in their databases? Are they following one of the theories I mentioned, or do they have a completely different way of storing the data?
They probably rely on a log of changes plus the latest document version, plus a periodic snapshot (if they allow time-travelling through the document history).
It is similar to how most databases' transaction systems work. After validating that the change is legitimate, the database writes the change to a very fast on-disk data structure, the log, which only appends the changed values. The log is mirrored in memory with a dedicated data structure to speed up reads.
When a read comes in, the database checks the in-memory data structure and merges the change with what is stored in the cache or on disk.
Periodically, the changes that are present in memory and in the log are merged with the on-disk data structure.
So to summarize, in your case:
When an Operational Transformation reaches the server, two things happen:
It is stored in the database as is, to avoid any loss (the equivalent of the log).
It updates an in-memory data structure so the change can be replayed quickly if a user requests the latest version (the equivalent of the in-memory data structure).
When a user requests the latest document, the server checks the in-memory data structure and replays the changes against the last stored consolidated document, which might be lagging behind because of the following point.
Periodically, the log is applied to the "last stored consolidated document" to reduce the number of OTs that must be replayed to produce the latest document. A rough sketch of this flow follows below.
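Here is a minimal sketch of that flow, assuming a simple dictionary-shaped document; the `OpLog` class, its file path, and the `apply_op` callback are invented for this example rather than taken from any real OT library:

```python
import json

class OpLog:
    """Append-only log of operations plus an in-memory tail for fast replay."""

    def __init__(self, path, snapshot):
        self.path = path            # durable append-only log file
        self.snapshot = snapshot    # last consolidated document state (a dict)
        self.pending = []           # ops received since the last compaction

    def record(self, op):
        # 1. Durably append the raw operation (the "log"), so nothing is lost.
        with open(self.path, "a") as f:
            f.write(json.dumps(op) + "\n")
        # 2. Keep it in memory so the latest document can be served quickly.
        self.pending.append(op)

    def latest_document(self, apply_op):
        # Replay pending ops on top of the last consolidated snapshot.
        doc = dict(self.snapshot)
        for op in self.pending:
            doc = apply_op(doc, op)
        return doc

    def compact(self, apply_op):
        # Periodically fold the pending ops into the snapshot so future
        # reads have fewer operations to replay.
        self.snapshot = self.latest_document(apply_op)
        self.pending.clear()
```

Here `compact` produces the "last stored consolidated document", while `record` plays the role of the append-only transaction log.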
Anyway, the best way to get a definitive answer is to look at open-source code that does what you are looking for, e.g. Etherpad.
Should I log request info (client IP, request status code, execution time, etc.) from my web app into the database to analyse user behaviour and errors that arise? And what information should I log for a better experience?
It's often tempting to log lots of information; however, I usually find that when I come to use it to answer a question, the wrong piece of information has been recorded, or only part of it. Or it has been recorded but not stored in a usable way, and it takes further programming to turn the log into meaningful information.
So I would start with the question of what you want to see or find, and log accordingly. Logging can generally be expanded in the future as new issues arise or new insights are needed.
Remember that every time you log something you are slowing your application down. You are also using more disk space, and no one is going to thank you for buying more disk or longer backups just because you logged everything on every action.
I guess I would follow a train of thought a bit like:
1) What are you trying to find? If it's an error you can predict, why not cater for it in your code in the first place? If it's usability, what format does the data need to be in and at what points should it be recorded?
2) How long do you need it for? Be sure to purge the logs after a period to conserve disk space.
3) Every element stored is a performance hit; it might be small, but for a high number of transactions it adds up.
4) Be wary of privacy rules; an IP address may be considered identifiable data, in which case you need to publish a data privacy policy (see point 2).
5) Consider using a flag to turn logging on or off, as in the sketch below. Then you can enable it during a known issue without recording everything all the time when it is not needed.
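A minimal sketch of point 5, assuming a simple environment-variable switch; the variable name `REQUEST_LOGGING` and the logged fields are just examples:

```python
import logging
import os
import time

# Toggle detailed request logging without redeploying the application.
REQUEST_LOGGING = os.environ.get("REQUEST_LOGGING", "off") == "on"

logger = logging.getLogger("requests")

def log_request(client_ip, status_code, started_at):
    if not REQUEST_LOGGING:
        return  # logging disabled: no I/O, negligible overhead
    elapsed_ms = (time.monotonic() - started_at) * 1000
    logger.info("ip=%s status=%d elapsed_ms=%.1f", client_ip, status_code, elapsed_ms)
```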
I am new to learning distributed systems and I have read about the CAP theorem. I am interested in AP systems such as Cassandra.
My question is: in what cases can you actually sacrifice consistency? Effectively, sacrificing consistency means serving inaccurate data, so in what cases would you actually use an AP datastore like Cassandra? I can't think of any case where I wouldn't want my reads to be consistent.
By an AP system, I assume you will at least aim to ensure eventual consistency.
Imagine you're developing a social network where users have friends and their own news feeds. It doesn't matter if a particular user's feed occasionally lags by five minutes (his feed list is eventually consistent). Missing two or three very recent updates in the news feed is okay in this scenario, as long as those updates eventually appear. And in fact, Facebook built its news feed on Cassandra.
Imagine a distributed key-value cache where updates are very rare. If there are almost no update operations, ensuring strong consistency is unnecessary, so you can focus on availability. The occasional cache miss (a key-value entry that has not been populated yet) and the resulting fallback request to the database due to eventual consistency should be okay.
My question is in what cases can you actually sacrifice consistency?
One case would be building a recommendation-engine data set and serving it with Cassandra. These data sets are essentially an aggregation over many, many users, used to determine purchasing/viewing patterns.
For example: If I add a Rey Star Wars action figure to my shopping cart, the underlying recommendation engine runs a query for similar resulting purchasing patterns based on others who have also purchased an action figure of Rey. The query returns the top 5 product results, and puts them at the bottom of the page.
Those 5 products returned are the result of analysis and aggregation of several thousand prior purchases. Let's assume that some of that data isn't consistent, causing a variance in the 5 products returned. Is that really a big deal?
tl;dr: The real question to ask is whether getting a somewhat-accurate list of 5 product recommendations in less than 10ms is better than getting a 100% accurate list of 5 product recommendations in 100ms.
Both result sets will help drive sales. But the one returned fast enough not to hinder the user experience is much preferred.
'C' in CAP refers to linearizability, which is a very strong form of consistency that you don't need most of the time.
Linearizability is a recency guarantee: it makes it appear as if there is a single copy of the data, so as soon as you make a change, all subsequent reads return the changed data. Such a level of consistency is expensive and doesn't scale well. Yet in certain scenarios we do need linearizability, viz.
Leader election
Allowing end users to create their unique user id
Distributed locking etc.
When you have these use cases, you'd use something like ZooKeeper, etcd, etc. Cassandra also has Lightweight Transactions (LWT), which use an extension of the classic Paxos algorithm to implement linearizability. This feature can be used for those rare cases where you must have linearizability and serializability, but it is expensive. In the vast majority of cases you are just fine with slightly weaker consistency: you trade a little bit of consistency for better scalability and performance.
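As an example of the unique-user-id case, here is a rough sketch using the DataStax Python driver; the `accounts` keyspace and `users` table are hypothetical:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("accounts")  # hypothetical keyspace

# "IF NOT EXISTS" turns this INSERT into a lightweight transaction (Paxos under
# the hood), so only one of two concurrent registrations of the same id can succeed.
result = session.execute(
    "INSERT INTO users (user_id, email) VALUES (%s, %s) IF NOT EXISTS",
    ("alice", "alice@example.com"),
)
applied = result.one()[0]  # the first column of an LWT result is the [applied] flag
print("user id claimed" if applied else "user id already taken")
```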
Some e-commerce websites send apology letters to customers when they cannot fulfil their orders. That happens because the last copy of a product was sold to more than one customer due to a lack of linearizability. They prefer dealing with that over failing to scale with the customer base or to respond to requests within stringent SLAs.
Cassandra is said to have tunable consistency. You may want to record user clicks or activity for analysis; you are okay with losing some data, but you cannot compromise on performance. You'd probably use a write consistency level of ANY with hints enabled (sloppy quorum).
If you want a little more consistency, you'd use a QUORUM consistency level for both reads and writes, along with hints and read repair. In the vast majority of cases all nodes are updated almost instantaneously. Even if one or two nodes go down, a majority of nodes will still have the data, and the failed nodes are repaired when they come back using hints, read repair, and anti-entropy repair.
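A sketch of setting a per-statement QUORUM level with the DataStax Python driver; the `analytics` keyspace and `clicks` table are made up for illustration:

```python
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("analytics")  # hypothetical keyspace

# The write succeeds once a majority of replicas have acknowledged it.
write = SimpleStatement(
    "INSERT INTO clicks (user_id, ts, url) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("alice", datetime.now(timezone.utc), "/home"))

# A QUORUM read overlaps with a QUORUM write, so it sees the latest acknowledged value.
read = SimpleStatement(
    "SELECT url FROM clicks WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(read, ("alice",))
```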
Cassandra is particularly suited to cases where you don't have many concurrent updates to the same data. The reason is that, unlike the Dynamo architecture, it does not use vector clocks for conflict resolution between replicas; instead it uses Last Write Wins (LWW) based on timestamps, falling back to lexicographical order when the timestamps are equal. Since clocks on nodes cannot be perfectly accurate even with NTP, there is a possibility of data loss, although Cassandra has taken some steps to mitigate it, for example client-side timestamps instead of server-side timestamps.
The CAP theorem says that, given partition tolerance, you can choose either availability or consistency in a distributed database (no one would want to give up partition tolerance in any case). So if you want maximum availability, you will have to give up on consistency. How far you go depends, of course, on how critical the business is.
You answered something on SO but the answer doesn't show up when you visit the page? That can be tolerated. SO being down entirely? That can't be. Critical financial systems would rather have strong consistency than availability; every once in a while, my bank's servers go offline when I try to make a payment.
Normally, you choose availability and eventual consistency: the answer you wrote on SO would eventually show up.
Apart from the above-mentioned cases where inconsistent data is tolerable, there are also scenarios where we can defer to the user to resolve the inconsistency.
For example, if we find two different versions of someone's address in the database, we can prompt the user to identify the correct address.
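A tiny sketch of that idea, with an invented prompt and conflict shape:

```python
def resolve_conflict_interactively(field, versions):
    """Show every conflicting version of a field and let the user pick the right one."""
    print(f"We found {len(versions)} versions of your {field}:")
    for i, value in enumerate(versions, start=1):
        print(f"  {i}. {value}")
    choice = int(input("Which one is correct? "))
    return versions[choice - 1]

# Example: two replicas disagree about the user's address.
address = resolve_conflict_interactively(
    "address", ["12 High St, Leeds", "14 High St, Leeds"]
)
```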
Would it be possible to build a Ticketmaster style ticket reservation system by storing all information in a Cassandra cluster?
The system needs to be able to
1. Display the correct number of tickets available at one time
2. Temporarily reserve a ticket while the customer is making the purchase
3. No two users can ever buy the same ticket.
For consistency, all reads and writes should be made at QUORUM. I'm not sure how to implement steps 2 or 3.
Yes, you can.
However, there will be some transactions where you want strict consistency. For example, consistency does not matter while the user is browsing the site and adding tickets to their shopping cart, but when they check out and select a specific seat number on a specific day, consistency matters a great deal (double bookings are a bad thing, especially for high-interest events).
So you could implement 99% of the functionality in an eventually consistent database and implement the checkout process in a strongly consistent one. This is also nice because you can scale the 99% of your system that likely gets >70% of the load horizontally and across multiple data centers. Just keep in mind that you will have to deal with the scenario of your site being up but your checkout process being down (e.g. an error dialog at checkout asking the user to wait/retry and giving them a promo code for their trouble).
The last detail is that you will need to update the eventually consistent database's "number of available tickets" after someone checks out. The good news is that this can be done lazily: queue up that job and do it whenever your system has some spare cycles, as in the sketch below. It certainly never has to happen in the critical path of the user's checkout process.
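A minimal sketch of that lazy refresh, assuming a simple in-process work queue; the queue and the `recount_available_tickets` helper are invented for illustration:

```python
import queue
import threading

# Jobs to refresh the advertised "tickets available" counts.
refresh_jobs = queue.Queue()

def checkout(event_id, seat_id):
    # 1. The strongly consistent store records the actual sale (not shown here).
    # 2. The eventually consistent count is refreshed later, off the critical path.
    refresh_jobs.put(event_id)
    return "purchase confirmed"

def refresh_worker():
    while True:
        event_id = refresh_jobs.get()
        recount_available_tickets(event_id)
        refresh_jobs.task_done()

def recount_available_tickets(event_id):
    pass  # placeholder: recompute remaining seats and write the new count

threading.Thread(target=refresh_worker, daemon=True).start()
```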
I have a high-performance application that I'm considering making distributed (using RabbitMQ as the message queue). The application uses a database (currently SQL Server, but I can still switch to something else) and caches most of it in RAM to increase performance.
This causes a problem: when one of the applications writes to the database, the others' cached copies of the database become out of date.
I figured this is something that happens a lot in the high-availability community, but I couldn't find anything useful. I guess I'm not searching for the right thing.
Is there an out-of-the-box solution?
PS: I'm sorry if this belongs on Server Fault. Since this is a development issue, I figured it belongs here.
EDIT:
The application reads from and writes to the database. Since I'm making the application distributed, more than one application now reads and writes to the database. The caching is done in each of the distributed applications, which are not aware of DB changes made by the other applications.
I mean: how can one application know that the DB was updated if it wasn't the one to update it?
So you have one database and many applications on various servers. Each application has its own cache and all the applications are reading and writing to the database.
Look at a distributed cache instead of caching locally. Check out memcached or AppFabric. I've had success using AppFabric to cache things in a Microsoft stack. You can simply add new nodes to AppFabric and it will automatically distribute the objects for high availability.
If you move to a shared cache, you can put expiration times on objects in the cache. Try to resist the temptation to proactively evict items whenever things change; that becomes a very difficult problem.
I would recommend isolating your critical items and caching each of them only once. As an example, when working on an auction site we cached very aggressively, but we only cached an auction listing's price in one place. That way, when someone bid on it, we only had to do one eviction; we didn't have to go through the entire cache asking "Where does the price appear? Change it!"
For 95% of your data, reads will simply expire on their own, and writes won't affect them immediately. The other 5% needs to be evicted when a new write comes in. This is what I call your "critical items": things that always need to be up to date.
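A minimal sketch of that split using memcached through the `pymemcache` client; the key names, TTLs, and the `save_bid_to_database` helper are illustrative assumptions:

```python
from pymemcache.client.base import Client

cache = Client(("memcached.internal", 11211))

# Non-critical data: let it expire on its own (here, after 5 minutes).
cache.set("listing:42:description", b"Vintage tennis racquet", expire=300)

# Critical item: cached in exactly one place, evicted on every write.
cache.set("listing:42:price", b"125.00", expire=3600)

def record_bid(listing_id, amount):
    save_bid_to_database(listing_id, amount)     # hypothetical persistence call
    cache.delete(f"listing:{listing_id}:price")  # single, targeted eviction

def save_bid_to_database(listing_id, amount):
    pass  # placeholder for the real database write
```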
Hope that gives you ideas!