I made a database system right here:
(comments on the normalization are highly appreciated as well - I have a feeling you'll hate me on what I did with tblIsolateSensitivity; tblHAIFile only has a bunch of Boolean fields and foreign keys).
Let's say we have x number of terminals accessing the database. X1 edits Patient 01, X2 edits Patient 02, X3 deletes Patient 01 at the same time. How can I ensure that the data between the three terminals are all up-to-date and consistent?
At the moment, I query the data only when a query is needed (i.e. when the user searches for a record, or when the program needs to verify something against a database record), meaning the data is only as current as the most recent query the user made. This makes it difficult to ensure that the data is up-to-date on all terminals. Of course, for deleted entries, I have error handling in place, but for the rest, well...
So, my question is: how do you guys typically handle this kind of situation? Is there a name for this concept so that I can look it up and read up on it?
From a database design perspective, you should read up on optimistic concurrency and pessimistic concurrency. These are two options for making sure that you either don't have two users modifying the same record at the same time, or at least if you do allow that, the conflict is detected so it can be resolved.
The basic idea behind optimistic concurrency is that you allow multiple users to view and modify the data at the same time, on the assumption that conflicting edits will be relatively rare. However, before any user writes changes to the data, a check is made to ensure that the underlying data hasn't changed since it was originally read. In some cases you do this manually with a read before update, checking each column value against a cached value. However, that is cumbersome. Some DBMSs have features that make this simpler. For example, SQL Server has the ROWVERSION (formerly known as TIMESTAMP) data type, which lets you check easily, using a single value, whether someone else has changed a record since the last time you read it.
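To make the version check concrete, here is a minimal sketch in Python using SQLite rather than SQL Server. The `patient` table, the integer `version` column (a hand-rolled stand-in for ROWVERSION) and the `StaleRecordError` name are all assumptions made up for illustration, not part of the schema in the question.

```python
import sqlite3


class StaleRecordError(Exception):
    """Raised when the row was changed or deleted after the caller read it."""


def read_patient(conn, patient_id):
    # Return the data together with the version it had when we read it.
    return conn.execute(
        "SELECT name, version FROM patient WHERE id = ?", (patient_id,)
    ).fetchone()


def update_patient(conn, patient_id, new_name, version_read):
    # The WHERE clause only matches if nobody bumped the version in between.
    cur = conn.execute(
        "UPDATE patient SET name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_name, patient_id, version_read),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise StaleRecordError(f"patient {patient_id} changed since it was read")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO patient VALUES (1, 'Patient 01', 1)")
conn.commit()

_, version_x1 = read_patient(conn, 1)  # terminal X1 reads the record
_, version_x2 = read_patient(conn, 1)  # terminal X2 reads the same version

update_patient(conn, 1, "edited by X1", version_x1)  # first writer wins
try:
    update_patient(conn, 1, "edited by X2", version_x2)
    conflict_detected = False
except StaleRecordError:
    conflict_detected = True  # X2 must re-read and decide what to do
```

The same zero-rows-affected check also catches the case where another terminal deleted the record in the meantime.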
The basic idea behind pessimistic concurrency is that you put a lock on a record in the expectation that you're going to change it. While you hold the lock, the DBMS will prevent anyone else from getting their own lock.
The advantage of optimistic concurrency is that it's pretty lightweight, doesn't interfere too much with your application, and lets you manually (or automatically) resolve any conflicts on those rare occasions when they happen. You also don't have to worry about someone reading a record, locking it and then going home for the weekend.
The advantage of pessimistic concurrency is that it prevents collisions, but it can stop one user from working while they wait for another to finish what they're doing.
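For comparison, here is a rough sketch of the pessimistic style, again in Python with SQLite. One loud assumption: SQLite only has a database-wide write lock (taken here with BEGIN IMMEDIATE), so this illustrates "lock first, then edit" rather than true per-record locking; the `patient` table and the connection names are made up for the example.

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file stand in for two terminals.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None lets us issue BEGIN/COMMIT ourselves;
# timeout=0 makes a blocked caller fail immediately instead of waiting.
holder = sqlite3.connect(path, isolation_level=None, timeout=0)
waiter = sqlite3.connect(path, isolation_level=None, timeout=0)

holder.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT)")
holder.execute("INSERT INTO patient VALUES (1, 'Patient 01')")

# Terminal 1 takes the write lock before it starts editing.
holder.execute("BEGIN IMMEDIATE")
holder.execute("UPDATE patient SET name = 'edited by X1' WHERE id = 1")

# Terminal 2 cannot get its own lock while terminal 1 holds one.
try:
    waiter.execute("BEGIN IMMEDIATE")
    blocked = False
except sqlite3.OperationalError:
    blocked = True

# Once terminal 1 commits, terminal 2 can proceed.
holder.execute("COMMIT")
waiter.execute("BEGIN IMMEDIATE")
waiter.execute("UPDATE patient SET name = 'edited by X2' WHERE id = 1")
waiter.execute("COMMIT")
final_name = waiter.execute("SELECT name FROM patient WHERE id = 1").fetchone()[0]
```

Note that nothing here stops the first terminal from holding the lock indefinitely, which is exactly the "went home for the weekend" problem.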
From the perspective of notifying users when records change in the background (i.e. they're changed by another user), that isn't a database design feature. It may be a feature of your application logic or of your application's data access layer.
As per this answer, it is recommended to go for single table in Cassandra.
Cassandra 3.0
We are planning for below schema:
Second table has composite key. PK(domain_id, item_id). So, domain_id is partition key & item_id will be clustering key.
GET request handler will access(read) two tables
POST request handler will access(write) into two tables
PUT request handler will access(write) details table(only)
As per CAP theorem,
What are the consistency issues in having multi-table schema? in Cassandra...
Can we avoid consistency issues in Cassandra? with these terms QUORUM, consistency level etc...
recommended to go for single table in Cassandra.
I would recommend the opposite. If you have to support multiple queries for the same data in Apache Cassandra, you should have one table for each query.
What are the consistency issues in having multi-table schema? in Cassandra...
Consistency issues between query tables can happen when writes are applied to one table but not the other(s). In that case, the application should have a way to gracefully handle it. If it becomes problematic, perhaps running a nightly job to keep them in-sync might be necessary.
You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process. In that case, a given data point may have only a subset of its intended replicas.
This scenario can be countered by running regularly-scheduled repairs. Additionally, consistency can be increased on a per-query basis (QUORUM vs. ONE, etc), and consistency levels of QUORUM and higher will occasionally trigger a read-repair (which syncs all replicas in the current operation).
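As a small worked example of why QUORUM helps: a quorum is a majority of the replicas (floor(RF / 2) + 1), and a read is guaranteed to touch at least one replica that saw the latest successful write whenever the number of replicas written plus the number read exceeds the replication factor. The function names below are made up for illustration; this is only the arithmetic, not a driver call.

```python
def quorum(replication_factor: int) -> int:
    # Majority of replicas: floor(RF / 2) + 1.
    return replication_factor // 2 + 1


def replicas_overlap(write_replicas: int, read_replicas: int, replication_factor: int) -> bool:
    # True when every read set must intersect every write set.
    return write_replicas + read_replicas > replication_factor


rf = 3
q = quorum(rf)

quorum_both_ways = replicas_overlap(q, q, rf)  # write QUORUM, read QUORUM
one_both_ways = replicas_overlap(1, 1, rf)     # write ONE, read ONE
```

With RF = 3, QUORUM on both sides overlaps (2 + 2 > 3), while ONE on both sides does not (1 + 1 <= 3), which is where stale reads can come from.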
Can we avoid consistency issues in Cassandra? with these terms QUORUM, consistency level etc...
So Apache Cassandra was engineered to be highly-available (HA), thereby embracing the paradigm of eventual consistency. Some might interpret that to mean Cassandra is inconsistent by design, and they would not be incorrect. I can say after several years of supporting hundreds of clusters at web/retail scale that consistency issues (while they do happen) are rare, and are usually caused by failures of components outside of a Cassandra cluster.
Ultimately though, it comes down to the business requirements of the application. For some applications like product reviews or recommendations, a little inconsistency shouldn't be a problem. On the other hand, things like location-based pricing may need a higher level of query consistency. And if 100% consistency is indeed a hard requirement, I would question whether or not Cassandra is the proper choice for data storage.
Edit
I did not get this: "Consistency issues between query tables can happen when writes are applied to one table but not the other(s)." When writes are applied to one table but not the other(s), what happens?
So let's say that a new domain is added. Perhaps a scenario arises where the domain_details_table gets updated, but the id_table does not. Nothing is wrong here on the database side. The problem only shows up when the application expects to find that domain_id in the id_table, but cannot.
In that case, maybe the application can retry using a secondary index on domain_details_table.domain_id. It won't be fast, but the decision to be made is more about which scenario is preferable: no answer, or a slow answer? Again, application requirements come into play here.
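A rough sketch of that fallback in Python, with plain dicts and lists standing in for the two Cassandra tables. The row shapes and the `find_domain` helper are hypothetical (the real schema isn't shown here), and the loop is only a stand-in for the slower secondary-index query.

```python
def find_domain(domain_id, id_table, details_rows):
    """Return (row, path), where path is 'fast', 'slow' or 'missing'."""
    # Fast path: the lookup table the application normally relies on.
    if domain_id in id_table:
        return id_table[domain_id], "fast"
    # Slow path: stand-in for querying domain_details_table by domain_id.
    for row in details_rows:
        if row["domain_id"] == domain_id:
            return row, "slow"
    return None, "missing"


# Hypothetical data: the write for "d2" reached the details table only.
details_rows = [
    {"domain_id": "d1", "item_id": "i1"},
    {"domain_id": "d2", "item_id": "i9"},
]
id_table = {"d1": {"domain_id": "d1"}}

row, path = find_domain("d2", id_table, details_rows)
```

Whether the slow path is acceptable, or whether "missing" should be returned straight away, is the application-requirements decision mentioned above.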
For your point: "You can also have consistency issues within a table. Maybe something happens (node crashes, down longer than 3 hours, hints not replayed) during the write process." How does RDBMS(like MySQL) deal with this?
So the answer to this used to be simple. RDBMSs only ran on a single server, so there was only one replica to keep in-sync. But today, most RDBMSs have HA solutions which can be used, and thus have replicas that must be kept in-sync. In that case (from what I understand), most of them will asynchronously update the secondary replica(s), while restricting traffic only to the primary.
It's also good to remember that RDBMSs enforce consistency through locking strategies as well. So even a single-instance RDBMS will lock a data point during an update, which (depending on the engine and isolation level) can block reads until the lock is released.
In a node-down scenario, a single-instance RDBMS will be completely offline, so instead of inconsistent data you'd have data loss. In an HA RDBMS scenario, there would be a short pause (during which you would likely encounter connection/query failures) until it has failed-over to the new primary. Once the replica comes up, there would probably be additional time necessary to sync-up the replicas, until HA can be restored.
Question: Is keeping a lock on a record for a long period of time common practice with modern database systems?
My understanding is that locking records in a database (optimistic or pessimistic) is usually done for a very short period of time during a transaction.
The software I'm working with right now keeps locks on records for long periods of time:
A lock is kept on the record of the logged-in user (in the ACTIVE_USERS table) for the whole time the user is logged in to the software.
Let's say USER A is working on a file. The record corresponding to the file is locked until USER A saves the file or exits the file. So if a colleague, USER B, tries to work on the same file, a popup shows up saying 'You can't work on this file because USER A is working on it right now'.
The company I work for wants to add compatibility with Microsoft SQL Server with minimal changes, so I need to implement the same kind of locking mechanism there. I've hacked together something that works on a minimal test project, but I'm not sure it is up to industry and MSSQL standards...
This is a bit long for a comment.
Using the database locking mechanism for this application-level locking seems unusual. Database locks could be on the row, page, or table level, and they also affect indexes, so there could be unexpected side effects. Obviously, a proliferation of locks also makes deadlocks much more likely.
Normally, application locks would be handled on the record level. Using flags (of some sort) in the record, the application would ensure that only one user would have access to the file.
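A minimal sketch of such a flag, in Python with SQLite standing in for the real database. The `files` table, the `locked_by` column and the helper names are assumptions invented for the example. The key point is that claiming the flag is a single conditional UPDATE, so two users cannot both succeed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, locked_by TEXT)")
conn.execute("INSERT INTO files VALUES (1, 'report', NULL)")
conn.commit()


def try_lock(conn, file_id, user):
    # Claim the record only if nobody else currently holds it.
    cur = conn.execute(
        "UPDATE files SET locked_by = ? WHERE id = ? AND locked_by IS NULL",
        (user, file_id),
    )
    conn.commit()
    return cur.rowcount == 1


def release(conn, file_id, user):
    # Only the current holder can clear the flag.
    conn.execute(
        "UPDATE files SET locked_by = NULL WHERE id = ? AND locked_by = ?",
        (file_id, user),
    )
    conn.commit()


def current_holder(conn, file_id):
    return conn.execute(
        "SELECT locked_by FROM files WHERE id = ?", (file_id,)
    ).fetchone()[0]


a_got_it = try_lock(conn, 1, "USER A")    # USER A opens the file
b_got_it = try_lock(conn, 1, "USER B")    # USER B is refused
holder_name = current_holder(conn, 1)     # can be shown in the popup
release(conn, 1, "USER A")                # USER A saves or exits
b_retry = try_lock(conn, 1, "USER B")     # now USER B can work on it
```

Unlike a database lock, the flag survives a crashed client, so a real system would also need some way to clear stale flags.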
I would say, it might work. But I would never design a system that way and I'd be wary of unexpected consequences.
I recently came across a case that makes me wonder whether I'm a newbie or whether something trivial has escaped me.
Suppose I have software that is run by many users and uses a table. When a user logs in to the app, a series of information from the table appears, and he just has to add, edit or correct some information and save it. Now, if the software is run by many people, how can I guarantee that he is the only one working with that particular record? I mean, how can I know the record is not selected and being worked on by two or more users at the same time? And please, I would prefer an answer that doesn't use “SELECT FOR UPDATE...”, because from what I've read it has too negative an impact on the database. Thanks to all of you. Keep up the good work.
This is something that is not solved primarily by the database. The database manages isolation and locking of "concurrent transactions". But by the time the records are sent to the client, you have usually (and hopefully) closed the transaction, and you start a new one when the data comes back.
So you have to take care of it yourself.
There are different approaches; the ones that come to mind are:
optimistic locking strategies (first wins)
pessimistic locking strategies
last wins
Optimistic locking: when storing, you check whether the record has been changed in the meantime. This is usually done with a version counter or timestamp. Some ORMs and frameworks may help a little to implement this.
Pessimistic locking: build a mechanism that records that someone has started to edit something and does not allow anyone else to edit the same thing. Especially in web projects it needs a timeout after which the lock is released anyway.
Last wins: the second person storing the record just overwrites the first person's changes.
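For the pessimistic variant with a timeout, a rough in-memory sketch in Python might look like this. The `EditLocks` class and its method names are made up for illustration; it is single-process and not thread-safe, and a real deployment would more likely keep this state in a shared table or cache.

```python
import time


class EditLocks:
    """Tracks who is editing which record; locks expire after a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock          # injectable so the example can fake time
        self._locks = {}             # record_id -> (user, expires_at)

    def acquire(self, record_id, user):
        now = self._clock()
        holder = self._locks.get(record_id)
        # Refuse only if someone else holds a lock that has not expired yet.
        if holder is not None and holder[0] != user and holder[1] > now:
            return False
        self._locks[record_id] = (user, now + self._timeout)
        return True

    def release(self, record_id, user):
        holder = self._locks.get(record_id)
        if holder is not None and holder[0] == user:
            del self._locks[record_id]


# Fake clock so the timeout can be demonstrated without sleeping.
now = [0.0]
locks = EditLocks(timeout_seconds=300, clock=lambda: now[0])

first = locks.acquire("record-1", "alice")   # alice starts editing
second = locks.acquire("record-1", "bob")    # bob is refused
now[0] = 301.0                               # alice walked away
third = locks.acquire("record-1", "bob")     # her lock has expired
```

Re-acquiring by the same user simply extends the expiry, which is one way to keep a lock alive while someone is still actively editing.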
... makes me wonder if I'm a newbie ...
That's what always happens when we discover that very common stuff is still not solved by the tools and frameworks we use and we have to solve it over and over again.
Now, if the software he uses is run by many people, how can I guarantee that he
is the only one working with that particular record?
Ah...
And please I wouldn't like the answer use “SELECT FOR UPDATE... “ because for
what I've read it has too negative impact on the database.
Who cares? I mean, it is the only way (keep a lock on a row) to guarantee you are the only one who can change it. Yes, this limits throughput, but then this is WHAT YOU WANT.
It is called programming - choosing the right tools for the job. In this case the impact is unavoidable because of the requirements.
The alternative - not a guarantee on the database but an application server - is an in memory or in database locking mechanism (like a table indicating what objects belong to what user).
But if you need to guarantee one record is only used by one person on db level, then you MUST keep a lock around and deal with the impact.
But seriously, most programs avoid this. They deal with it either with optimistic locking (second user submitting changes gets an error) or other programmer-level decisions BECAUSE the cost of such guarantees is ridiculously high.
Oracle is different from SQL Server.
In Oracle, when you update a record or data set, the old information is still available to other sessions until you commit, because Oracle keeps the previous version of the data (as undo) alongside your uncommitted change.
Therefore anyone reading the same record will still see the old result.
If another session wants write access to that record, however, it has to wait for the lock until you commit; only then can it write to the same record.
If two sessions each wait on a lock that the other one holds, the wait can never be resolved and a deadlock is raised.
SQL Server, on the other hand, by default does not let you read a record that has been locked for writing, so depending on which query you're running, you might end up locking an entire table.
First, you need to separate queries from inserts/updates using a data-warehouse database. That way you could fix the slow updates that cause locks.
The next step is to identify what is causing locks and work out each case separately.
Rebuilding indexes during working hours could cause very nasty locks. Push them to after hours.
I am using rowversion for optimistic concurrency over a set of data: the client gets a batch of data, makes some updates, and then the updates are sent to the database. The simplest solution for managing optimistic concurrency appears to be the one described here: on retrieval, just get the single largest rowversion from the data of interest (or even just the database's last-used rowversion value), and send it to the client. When updates are requested, have the client send the value back, and then ensure that all rows involved in the update have a rowversion value that is less than or equal to the value sent by the client. On update, any row in the database with a higher rowversion than the one sent to the client must have been updated after the initial retrieval, and the user should be prompted to refresh and try again, or whatever the desired experience is.
The problem that seems obvious to me in this is that it would be easy for the client to simply send back UInt64.MaxValue or some other large value and completely defeat this.
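To make both the check and the concern concrete, here is a small Python sketch with rowversions modelled as plain integers. The `find_conflicts` helper and the sample data are made up for illustration; in a real system this comparison would be done in the database as part of the update.

```python
def find_conflicts(current_versions, client_token):
    """Return the ids of rows modified after the client's snapshot.

    current_versions maps row id -> current rowversion (as an int).
    """
    return sorted(
        row_id
        for row_id, version in current_versions.items()
        if version > client_token
    )


# Snapshot at retrieval time; the client is handed the largest rowversion.
at_read_time = {"a": 101, "b": 105, "c": 103}
client_token = max(at_read_time.values())

# Meanwhile another user updates row "c", which gets a new, higher rowversion.
current = {"a": 101, "b": 105, "c": 110}

honest_result = find_conflicts(current, client_token)  # conflict on "c"
forged_result = find_conflicts(current, 2**64 - 1)     # client lies about its token
```

With the honest token the stale row is reported; with the forged maximum value the check passes for every row, which is the weakness described above.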
After some searching, I've seen quite a few descriptions of solutions that involve sending rowversions to the client to manage optimistic concurrency, but not a single mention of this kind of concern.
Should data values used for optimistic concurrency checking be signed and verified by the server, or perhaps stored server-side in a user session cache or something similar instead of actually sent to the user? Or should the design of an application consider optimistic concurrency checks to be only part of a good user experience and not a security feature - i.e. concurrency checking should only exist to help ensure that users (who should be properly authorized to touch this data in the first place anyways) are making decisions based on fresh data, and the app should function properly even if someone goes out of their way to defeat the concurrency checks?
I'm leaning toward the latter, but it gives me pause to think about apps that use insecure, client-provided rowversion values and just throw user updates blindly into the database without performing any kind of sanity checks on the rows being updated...
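For reference, the "signed and verified by the server" option could be sketched like this in Python with the standard `hmac` module. The secret, the token format (`value.signature`) and the function names are all assumptions for the example, not an established scheme.

```python
import hashlib
import hmac

# Hypothetical server-side secret; it never leaves the server.
SECRET = b"server-side-secret"


def sign_token(rowversion: int, key: bytes = SECRET) -> str:
    # Token handed to the client: the value plus an HMAC over that value.
    payload = str(rowversion).encode()
    mac = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return f"{rowversion}.{mac}"


def verify_token(token: str, key: bytes = SECRET):
    """Return the rowversion if the signature is valid, otherwise None."""
    value, _, mac = token.partition(".")
    expected = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison of the two signatures.
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return None
    return int(value)


token = sign_token(105)
round_trip = verify_token(token)

# A client that swaps in a huge value cannot produce a matching signature.
forged = f"{2**64 - 1}." + token.split(".")[1]
forged_result = verify_token(forged)
```

This only stops a client from inventing a larger value; it does not replace authorization checks on the rows being updated.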
Making my way through the GAE documents.
I have a question I can't find an obvious answer to. Given that transactions on an entity group are limited to about 1/sec, how can you scale a request where, say, 10,000 users all want to access a particular user's page at the same time?
Wouldn't this give you 10,000 reads on that user's entity group within one second, thereby causing catastrophic system failure and unhappy users?
Or am I confused, and only writes get contentious?
App Engine uses optimistic concurrency control for transactions, meaning that it does not lock the data but throws an exception when it detects that the data is "dirty". So the first transaction to change the data succeeds; the second gets the exception and must retry.
Given this, I assume that reads do not block if they are not part of a transaction, even if some other transaction is in progress.
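The "second gets the exception and must retry" pattern can be sketched in plain Python as below. Everything here (`VersionedStore`, `ConcurrentModification`, `run_with_retries`, the `before_commit` hook) is a hypothetical in-memory stand-in for illustration, not the App Engine API.

```python
class ConcurrentModification(Exception):
    """Raised when the data changed between our read and our write."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        current_version, _ = self.get(key)
        if current_version != expected_version:
            raise ConcurrentModification(key)
        self._data[key] = (current_version + 1, value)


def run_with_retries(store, key, update_fn, max_attempts=3, before_commit=None):
    """Read, compute, write; start over if someone else wrote first."""
    for attempt in range(1, max_attempts + 1):
        version, value = store.get(key)
        new_value = update_fn(value)
        if before_commit is not None:
            before_commit(attempt)  # test hook to simulate a competing writer
        try:
            store.put_if_version(key, version, new_value)
            return attempt
        except ConcurrentModification:
            continue
    raise ConcurrentModification(f"gave up on {key!r} after {max_attempts} attempts")


store = VersionedStore()
store.put_if_version("counter", 0, 10)


def interfere(attempt):
    # A competing transaction sneaks in during our first attempt only.
    if attempt == 1:
        version, value = store.get("counter")
        store.put_if_version("counter", version, value + 100)


attempts_needed = run_with_retries(
    store, "counter", lambda v: v + 1, before_commit=interfere
)
final_state = store.get("counter")
```

Nothing is ever locked: the losing transaction simply re-reads and recomputes, and plain reads never wait on anyone.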
Also, to make transactions less of a bottleneck, one should carefully organize entity groups, make them as small as possible, and arrange them in such a way that there is as little contention (parallel requests) as possible. Meaning:
Have small entity graphs - do not put a lot of entities under common parent.
Try having the user entity as the root parent. Users usually do not create parallel transactions (e.g. make multiple money transfers at the same time, etc.).
Right. I wasn't thinking. The answer is memcache. At least partially. That, and an efficient data model/ schema.
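For what it's worth, the caching idea boils down to something like the following Python sketch. A plain dict stands in for memcache, and `CachedReader` and `load_page` are made-up names; this is not the App Engine memcache API.

```python
class CachedReader:
    """Cache-aside reads: hit the datastore only on a cache miss."""

    def __init__(self, load_from_datastore):
        self._load = load_from_datastore
        self._cache = {}  # stand-in for memcache

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._load(key)
        return self._cache[key]

    def invalidate(self, key):
        # Call this after a write so the next read refreshes the cache.
        self._cache.pop(key, None)


datastore_reads = []


def load_page(user_id):
    # Hypothetical expensive datastore read; we just record that it happened.
    datastore_reads.append(user_id)
    return f"page for {user_id}"


reader = CachedReader(load_page)
pages = [reader.get("popular-user") for _ in range(10_000)]
reads_before_invalidate = len(datastore_reads)

reader.invalidate("popular-user")
reader.get("popular-user")
reads_after_invalidate = len(datastore_reads)
```

Ten thousand page views cost one datastore read here; only a write (followed by an invalidate) triggers another.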