A common bit of programming logic I find myself implementing is something like the following pseudo-code:
    Let X = some value
    Let Database = some external Database handle

    if !Database.contains(X):
        SomeCalculation()
        Database.insert(X)
However, in a multi-threaded program we have a race condition here. Thread A might check if X is in Database, find that it's not, and then proceed to call SomeCalculation(). Meanwhile, Thread B will also check if X is in Database, find that it's not, and insert a duplicate entry.
So of course, this needs to be synchronized like:
    Let X = some value
    Let Database = some external Database handle

    LockMutex()
    if !Database.contains(X):
        SomeCalculation()
        Database.insert(X)
    UnlockMutex()
This is fine, except what if the application is a distributed app, running across multiple computers, all of which communicate with the same back-end database machine? In this case, a Mutex is useless, because it only synchronizes a single instance of the app with other local threads. To make this work, we'd need some kind of "global" distributed synchronization technique. (Assume that simply disallowing duplicates in Database is not a feasible strategy.)
In general, what are some practical solutions to this problem?
I realize this question is very generic, but I don't want to make this a language-specific question because this is an issue that comes up across multiple languages and multiple Database technologies.
I intentionally avoided specifying whether I'm talking about an RDBMS or SQL Database, versus something like a NoSQL Database, because again - I'm looking for generalized answers based on industry practices. For example, is this situation something that Atomic Stored Procedures might solve? Or Atomic Transactions? Or is this something that requires something like a "Distributed Mutex"? Or more generally, is this problem generally addressed by the Database system, or is it something the Application itself should handle?
If it turns out this question is impossible to answer at all without further information, please tell me so I can modify it.
One sure way to guard against data stomping is to lock the data row. Many databases allow you to do that via transactions; some don't support transactions at all.
However, this is overkill for most cases, where contention is low in general. You might want to read up on Isolation levels to get more background on the topic.
A better general approach is often Optimistic Concurrency. The idea behind it is that each data row includes a signature; a timestamp works fine, but the signature need not be time-oriented. It could be a hash value, for example. This is a general concurrency-management approach and is not limited to relational stores.
The app that changes data first reads the row, and then performs whatever calculations it requires, and then at some point, writes the updated data back to the data store. Via Optimistic concurrency, the app writes the update with the stipulation (expressed in SQL if it is a SQL database) that the data row must be updated only if the signature has not changed in the interim. And, each time a data row is updated, the signature must be updated as well.
The result is that updates don't get stomped on. But for a more rigorous explanation of the concurrency issues, refer to that article on DB Isolation levels.
All distributed updaters must follow the OCC convention (or something stronger, like transactional locking) in order for this to work.
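To make that concrete, here is a minimal sketch of the optimistic write in Python, DB-API style; the items table, its version column, and the helper name are all invented for illustration:

    # Minimal optimistic-concurrency sketch against a hypothetical
    # "items" table with (id, data, version) columns.
    def update_with_occ(conn, item_id, compute_new_data):
        cur = conn.cursor()
        cur.execute("SELECT data, version FROM items WHERE id = %s", (item_id,))
        data, version = cur.fetchone()

        new_data = compute_new_data(data)  # whatever calculation is required

        # Write back only if nobody bumped the signature in the interim,
        # and bump the signature ourselves in the same statement.
        cur.execute(
            "UPDATE items SET data = %s, version = version + 1 "
            "WHERE id = %s AND version = %s",
            (new_data, item_id, version))
        conn.commit()
        return cur.rowcount == 1  # False means we lost the race; retry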
You can obviously move the "synch" part to the DB layer itself, using an exclusive lock on a specific resource.
This is a bit extreme; in most cases, attempting the insert and handling the exception when you actually discover that someone has already inserted the row would be more adequate, I think.
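As a sketch of that insert-and-handle-the-exception approach, assuming psycopg2 and a unique constraint on the key column (the table and column names are invented):

    from psycopg2 import errors

    def insert_once(conn, x):
        try:
            with conn.cursor() as cur:
                cur.execute("INSERT INTO results (key) VALUES (%s)", (x,))
            conn.commit()
            return True                  # we won the race
        except errors.UniqueViolation:
            conn.rollback()              # someone else inserted it first
            return False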
Well, since you ask a general question, I will try to provide another option. It's not very orthodox, but it may be useful: you could "define" a machine or a process as responsible for doing that. For example:
    Let X = some value
    Let Database = some external Database handle

    xResponsible = Definer.defineResponsibleFor(X)
    if xResponsible == me:
        if !Database.contains(X):
            SomeCalculation()
            Database.insert(X)
The trick here is to make defineResponsibleFor always return the same value regardless of who is calling. So, if the values of X are fairly distributed and the Definer is fair, all machines will have work to do, and you can use a simple thread mutex to avoid race conditions. Of course, now you have to take care of fault tolerance (if a machine or process goes out of business, your Definer must know about it and not assign any job to it). But you should do this anyway... :)
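A toy sketch of such a Definer in Python (the node names and the hashing choice are just for illustration); every node computes the same owner for a given X, so only that owner ever does the work:

    import hashlib

    NODES = ["node-a", "node-b", "node-c"]  # must be identical on every node
    ME = "node-b"                           # this machine's own name

    def define_responsible_for(x):
        # Deterministic: every caller gets the same answer for the same x.
        digest = hashlib.sha256(str(x).encode()).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    x = "some value"
    if define_responsible_for(x) == ME:
        # Only this node ever processes x, so a local mutex suffices here.
        ...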
Related
Let us say we have two users running a query against the same table in PostgreSQL:

    User 1: SELECT * FROM table WHERE year = '2020'
    User 2: SELECT * FROM table WHERE year = '2019'
Are they going to be executed at the same time as opposed to executing one after the other?
I would expect that if I have 2 processors, I can run both at the same time. But I am thinking that matters become far more complicated given that it is the same table, depending on where the data is located (e.g. on disk), whether there is partitioning, configuration, transactions, etc. Can someone help me understand how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned? Under which circumstances will I get my desired behaviour, and under which will I not?
EDIT: I have found this other question which is very close to what I was asking - https://dba.stackexchange.com/questions/72325/postgresql-if-i-run-multiple-queries-concurrently-under-what-circumstances-wo. It is a bit old and doesn't have many answers; I would appreciate a fresh outlook on it.
If the two users have two independent connections and they don't go out of their way to block each other, then the queries will execute at the same time. If they need to access the same buffer at the same time, or read the same disk page into a buffer at the same time, they will use very fast locking/coordination methods (LWLocks, spin locks, or atomic operations like CAS) to coordinate that. The exact techniques vary from version to version, as better methods become widely available on supported platforms and as people find the time to change the implementation to use those better methods.
"How can I ensure that I get my desired behaviour as far as PostgreSQL is concerned?"
You should always get the correct answer to your query. (Or possibly some kind of ERROR indicating a failure to serialize if you are using the highest, non-default, isolation level; but that doesn't seem to be a risk if each of those queries is run in a single-statement transaction.)
I think you are overthinking this. The point of using a database management system is that you don't need to micromanage it.
Also, "parallel-query" refers to a single query using multiple CPUs, not to different queries running at the same time.
My company is using Cassandra (2.1.8), running on debian servers.
Without setting a consistency on my queries/preparedStatements, I would sometimes get "null" back from a "SELECT * FROM table_name WHERE key = ?". After adding:
    statementGet.setConsistencyLevel(ConsistencyLevel.ONE);
I get an answer every time. But from the article below:
Read/Write Strategy For Consistency Level
If you don't care about consistency, then this discussion is moot; read/write with whatever Consistency Level and let Cassandra's asynchronous replication and anti-entropy features do their work. That said, though your goal is minimizing network traffic/cost, if your workload is mostly reads then the added cost of doing writes at CL QUORUM or ALL may not actually be that much.
it is said that my consistency levels do not matter if I don't care about consistency. And I don't. It doesn't matter much to me if the data is hours or days old. But I do care about getting data every time.
What I need to know is: do I need to put a consistency level on my writes as well, to ensure that they are actually written and that my reads will always retrieve a value if I have previously written anything?
And what might that consistency level be, given the above criteria?
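For reference, this is how I would set an explicit level on both the write and the read with the DataStax Python driver (my snippet above is Java, but the idea is identical; the keyspace and table names here are placeholders). The usual rule of thumb is that a read is guaranteed to see a prior write when the replicas contacted by the read plus those contacted by the write exceed the replication factor, e.g. QUORUM writes plus QUORUM reads:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("my_keyspace")

    # Write at QUORUM so a majority of replicas hold the value...
    write = SimpleStatement(
        "INSERT INTO table_name (key, value) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write, ("k1", "v1"))

    # ...so that a QUORUM read is guaranteed to overlap with the write.
    read = SimpleStatement(
        "SELECT * FROM table_name WHERE key = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    row = session.execute(read, ("k1",)).one()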
While learning SQLAlchemy I came across two ways of dealing with SQLAlchemy's sessions.
One was creating the session once globally while initializing my database like:
    DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
and import this DBSession instance in all my request (insert/update) operations that follow.
When I do this, my DB operations have the following structure:
    with transaction manager:
        for each_item in huge_file_of_million_rows:
            DBSession.add(each_item)
            # more create, read, update and delete operations
I do not commit or flush or rollback anywhere, assuming my Zope transaction manager takes care of it for me (it commits at the end of the transaction, or rolls back if it fails).
The second way, and the one most frequently mentioned on the web, was:
create a DBSession once like
    DBSession = sessionmaker(bind=engine)
and then create a session instance of this per transaction:
    session = DBSession()
    for row in huge_file_of_million_rows:
        for item in row:
            try:
                session.add(item)
                # more create, read, update and delete operations
                session.flush()
                session.commit()
            except:
                session.rollback()
    session.close()
I do not understand which is better (in terms of memory usage, performance, and robustness), and why.

In the first method, I accumulate all the objects in the session and then the commit happens at the end. For a bulky insert operation, does adding objects to the session keep them in memory (RAM) or somewhere else? Where do they get stored, and how much memory is consumed?

Both ways tend to be very slow when I have about a million inserts and updates. Trying SQLAlchemy Core also takes the same time to execute: 100K rows of selects, inserts and updates take about 25-30 minutes. Is there any way to reduce this?

Please point me in the right direction. Thanks in advance.
Here is a very generic answer, with the warning that I don't know much about Zope. Just some simple database heuristics. Hope it helps.
How to use SQLAlchemy sessions:
First, take a look at their own explanation here
As they say:
The calls to instantiate Session would then be placed at the point in the application where database conversations begin.
I'm not sure I understand what you mean by method 1; just in case, a warning: you should not have just one session for the whole application. You instantiate a Session when a database conversation begins, but you surely have several points in the application where different conversations begin. (I'm not sure from your text whether you have different users.)
One commit at the end of a huge number of operations is not a good idea
Indeed it will consume memory, probably in the Session object of your Python program, and surely in the database transaction. How much space? That's difficult to say with the information you provide; it will depend on the queries and on the database...
You could easily estimate it with a profiler. Take into account that if you run out of resources everything will go slower (or halt).
One commit per record is also not a good idea when processing a bulk file
It means you are asking the database to persist changes for every single row. That is certainly too much. Try an intermediate number: commit every few hundred rows. But then it gets more complicated; one commit at the end of the file assures you that the file is either fully processed or not at all, while intermediate commits force you to take into account, when something fails, that your file is half processed and you will have to reposition.
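A minimal sketch of such intermediate commits, reusing the (hypothetical) names from your second snippet; the batch size is arbitrary and worth tuning:

    BATCH_SIZE = 500

    session = DBSession()
    try:
        for i, item in enumerate(huge_file_of_million_rows, start=1):
            session.add(item)
            if i % BATCH_SIZE == 0:
                session.commit()   # persist every BATCH_SIZE rows
        session.commit()           # commit the final partial batch
    except Exception:
        session.rollback()         # note: already-committed batches remain
        raise
    finally:
        session.close()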
As for the times you mention, it is very difficult to judge with only the information you provide; it depends on the queries, the database, and the machine. Anyway, the order of magnitude of your numbers (one select + insert + update per 15 ms, probably plus commit) sounds pretty high, but is more or less in the expected range (again, it depends on queries + database + machine)... If you have to insert so many records this frequently, you could consider other database solutions; it will depend on your scenario, and probably on dialects, and may not be something an ORM like SQLAlchemy provides.
BASE stands for 'Basically Available, Soft state, Eventually consistent'
So, I've come this far: "Basically Available: the system is available, but not necessarily all items in it at any given point in time" and "Eventually Consistent: after a certain time all nodes are consistent, but at any given time this might not be the case" (please correct me if I'm wrong).
But, what is meant exactly by 'Soft State'? I haven't been able to find any decent explanations on the internet yet.
This page (originally here, now available only from the web archive) may help:
[Soft state] is information (state) the user put into the system that will go away if the user doesn't maintain it. Stated another way, the information will expire unless it is refreshed.

By contrast, the position of a typical simple light-switch is "hard state". If you flip it up, it will stay up, possibly forever. It will only change back to down when you (or some other user) explicitly comes back to manipulate it.
The BASE acronym is a bit contrived, and most NoSQL stores don't actually require data to be refreshed in this way. There's another explanation suggesting that soft-state means that the system will change state without user intervention due to eventual consistency (but then the soft-state part of the acronym is redundant).
There are some specific usages where state must indeed be refreshed by the user; for example, in the Cassandra NoSQL database, one can give all rows a time-to-live to make them completely soft-state (they will expire unless refreshed), but this is an unusual mode of usage (a transient cache, essentially).
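For example, a write with a TTL via the DataStax Python driver makes a row soft-state in exactly this sense (the keyspace and table names are invented); the row expires unless refreshed:

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")
    # The row disappears 3600 seconds after the write unless rewritten.
    session.execute(
        "INSERT INTO cache (key, value) VALUES (%s, %s) USING TTL 3600",
        ("session-42", "payload"))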
"Soft-state" might also apply to the gossip protocol within Cassandra; a new node can determine the state of the cluster from the gossip messages it receives, and this cluster state must be constantly refreshed to detect unresponsive nodes.
I was taught in class that "soft state" means the state of the system may change over time (even during periods without input), because of changes going on due to eventual consistency. That's why it is called "soft" state.
Some source: link
Soft state means data that is not persisted on disk, yet in case of failure it could be possible to restore it (e.g. recreate a lower-quality image from a high-quality one). A good article that addresses this and other interesting issues is Cluster-Based Scalable Network Services
A BASE system gives up on consistency to improve the performance of the database. Hence most of the famous NoSQL databases are more highly available and scalable than ACID-compliant relational databases.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time.
For example, consider two systems, A and B. If a user writes data to system A, there will be some delay before that data is reflected on B, usually within milliseconds (depending on the network speed and the design of the syncing mechanism).
I work in a company that uses a single-table Access database for its outbound CMS, which I moved to a SQL Server based system. There's a data list table (not normalized) and a calls table. This currently gets about one update per second. All call outcomes, along with date, time, and agent id, are stored in the calls table. Agents have a predefined set of records that they will call each day (this comprises records from various data lists, sorted to give an even spread throughout their set). Note that a data list record is called once per day.
In order to ensure speed, live updates to this system are stored in a duplicate of the calls table fields in the data list table. These are then copied to the calls table in a batch process at the end of the day.
The reason for this is obviously not the speed at which a new record can be added to the calls table live, but rather that when the user app is closed/opened and loads the user's data set again, I need to check which records have not been called today. Otherwise I would need to run a stored proc on the server that picked the most recent call from the calls table and checked whether its call date matched today's date; I believe that is a more expensive query than checking if a field in the data list table is NULL.
With this setup I only run the expensive query at the end of each day.
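For what it's worth, that per-record check might look something like this sketch (pyodbc against SQL Server; the DSN and all table/column names here are simplified stand-ins):

    import pyodbc

    conn = pyodbc.connect("DSN=callcenter")   # hypothetical DSN
    cur = conn.cursor()
    cur.execute("""
        SELECT d.record_id
        FROM data_list AS d
        LEFT JOIN (SELECT record_id, MAX(call_date) AS last_call
                   FROM calls GROUP BY record_id) AS c
               ON c.record_id = d.record_id
        WHERE c.last_call IS NULL
           OR CAST(c.last_call AS DATE) <> CAST(GETDATE() AS DATE)
    """)
    uncalled_today = [row.record_id for row in cur.fetchall()]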
There are many pitfalls in this design; the main limitation is my inexperience. This is my first SQL Server system. It's pretty critical, and I had to ensure it would work and that I could easily dump data back to the Access db during a live failure. It has worked for 11 months now (and no live failure, less downtime than the old system).
I have created pretty well normalized databases for other things (with far fewer users), but I'm hesitant to implement this for the calling database.
Specifically, I would like to know your thoughts on whether the duplication of the calls fields in the data list table is necessary in my current setup or whether I should be able to use the calls table. Please try and answer this from my perspective. I know you DBAs may be cringing!
Redesigning an already-working database may become the major flaw here. Rather, try to optimize what you currently have running instead of starting from scratch. Think of indices, referential integrity, key-assignment methods, proper usage of joins, and the like.
In fact, have a look here:
Database development mistakes made by application developers
This outlines some very useful pointers.
The thing the "Normalisation Nazis" out there forget is that database design typically has two stages, the "Logical Design" and the "Physical Design". The logical design is for normalisation; the physical design is for "now let's get the thing working", considering among other things the benefits of normalisation vs. the benefits of breaking normalisation.
The classic example is an Order table and an Order-Detail table, where the Order header table has "total price" and that value was derived from the Order-Detail and related tables. Having total price on Order in this case still makes sense, but it breaks normalisation.
A normalised database is meant to give you high maintainability and flexibility. But performance is one of the considerations that physical design takes into account. Look at reporting databases, for example. And don't get me started about storing time-series data.
Ask yourself, has my maintainability or flexibility been significantly hindered by this decision? Does it cause me lots of code changes or data redesign when I change something? If not, and you're happy that your design is working as required, then I wouldn't worry.
I think whether to normalize it depends on how much you can do, and what may be needed.
For example, as Ian mentioned, it has been working for so long, is there some features they want to add that will impact the database schema?
If not, then just leave it as it is; but if you need to add new features that change the database, you may want to look into normalizing it at that point.
You wouldn't need to call a stored procedure; you should be able to use a select statement to get the max(id) by user id, or the max(id) in the table, depending on what you need to do.
Before deciding to normalize, or to make any major architectural changes, first look at why you are doing it. If you are doing it just because you think it needs to be done, then stop and see if there is anything else you can do; perhaps add unit tests, so you can get some timings for how long operations take. Numbers are good before making major changes, to see if there is any real benefit.
I would ask you to be a little more clear about the specific dilemma you face. If your system has worked so well for 11 months, what makes you think it needs any change?
I'm not sure you are aware of the fact that "database design fundamentals" might refer to "logical database design fundamentals" as well as "physical database design fundamentals", or whether you are aware of the difference.
Logical database design fundamentals should not (and actually cannot) be "sacrificed" for speed, precisely because speed is determined only by physical design choices, in which the prime decision factor is precisely speed and performance.