In my recent interview, I got the following question:
There are multiple instances of a microservice, each connects to the
same relational database (let's say postgres). How can you ensure data
consistency and concurrency, if you cannot rely on the database to do
it for you?
Generally I'd say that there is no problem with multiple instances using the same database, as the DB can handle concurrent connections, even when the services try to update the same table or even the same rows.
But what I don't understand is what can we do if the database, for some reason, cannot handle concurrency?
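If the database's own concurrency control really were unavailable, one application-level option is optimistic locking: a version column that every writer checks and bumps. A minimal sketch (table and column names are made up):

    -- Read the row and remember its version
    SELECT balance, version FROM accounts WHERE id = 42;   -- say version = 7

    -- Write back only if nobody changed the row in the meantime
    UPDATE accounts
       SET balance = 150, version = version + 1
     WHERE id = 42 AND version = 7;
    -- 0 affected rows means another instance won the race:
    -- re-read and retry.

Each instance can run this without any coordination beyond the single UPDATE, which every SQL database executes atomically.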
Related
We have two MariaDB databases on different servers, a primary and a secondary. As a master-slave relation is not in place, we need to update both databases for each user action.
Example: if an "add employee" action is performed by a user from the front end, we need to insert the employee details into the primary database first and later into the secondary.
We designed it to insert into each database separately, i.e. two insert calls from the application, one per database.
As we have multiple DB interactions for a single operation, this solution will affect performance.
Is there any way we can achieve this by using procedures or UDFs?
Any better approaches or suggestions would be helpful.
There are multiple ways we can do what you are looking for. Of course, there are pros and cons of each approach.
The simplest one would be to use MaxScale's "Tee" filter, which can automatically execute all your queries on multiple servers. This is, of course, transparent to the application code.
Refer to: https://mariadb.com/kb/en/mariadb-maxscale-24-tee-filter/
The benefit is that this puts the least strain on your application's performance, as it doesn't have to make two calls to execute your statements on the databases, and your application code remains simple.
The negative is that there are no return values from the "Tee" filter! You won't be able to tell whether the insert, update, or delete was successful.
The other method that might be much better is to use the "Spider" storage engine in MariaDB and connect to the remote table on the remote MariaDB server.
Now you can create a trigger on your primary table and, depending on your business logic/requirements, update the Spider table, a.k.a. the remote table, from within the trigger. This gives you more control, and your application is still kept clean, without dual connections.
Another benefit of this is that if, for some reason, your trigger fails to update the remote "Spider" table, your primary transaction will also fail! This ensures proper data integrity is maintained :)
There are other engines that can be used similarly, like "CONNECT", but "Spider" is the only officially supported one among those that can connect to a remote database's table. Spider, however, can only connect to a remote server that is also running MariaDB.
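A minimal sketch of the Spider-plus-trigger setup (host, credentials, and table names are all assumptions):

    -- On the primary: register the secondary server...
    CREATE SERVER secondary_srv
      FOREIGN DATA WRAPPER mysql
      OPTIONS (HOST '10.0.0.2', DATABASE 'hr',
               USER 'spider', PASSWORD 'secret', PORT 3306);

    -- ...and a Spider table pointing at the remote employee table
    CREATE TABLE employee_remote (
      emp_id  INT PRIMARY KEY,
      name    VARCHAR(100)
    ) ENGINE=SPIDER
      COMMENT='wrapper "mysql", srv "secondary_srv", table "employee"';

    -- Trigger keeps the secondary in sync; if the remote insert fails,
    -- the local statement (and its transaction) fails with it
    DELIMITER //
    CREATE TRIGGER employee_ai AFTER INSERT ON employee
    FOR EACH ROW
    BEGIN
      INSERT INTO employee_remote (emp_id, name)
      VALUES (NEW.emp_id, NEW.name);
    END//
    DELIMITER ;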
Hope this answers your question and is what you are looking for :)
Cheers,
Faisal.
I'm managing an online book store website. For the sake of high availability I've set up two Tomcat instances running the website application; they are exactly the same program, and they share the same database, which is located on another server.
My question is: how can I avoid conflicts or dirty data when the two applications perform the same updates/inserts to the database at the same time?
For example: update t_sale set total='${num}' where category='cs'. If two processes execute the above SQL simultaneously, it can cause data loss.
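For illustration, here is a hypothetical interleaving (${num} is computed in the application from a previously read value):

    -- Both instances read total = 100
    -- Instance A computes 100 + 10, instance B computes 100 + 5
    UPDATE t_sale SET total = '110' WHERE category = 'cs';  -- A commits
    UPDATE t_sale SET total = '105' WHERE category = 'cs';  -- B overwrites A
    -- A's increment of 10 is silently lost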
If by "database" you are talking about a well designed schema that is running on an RDBMS such as Oracle, DB2, or SQL Server, then the database itself will prevent what you call "conflicts" by locking parts of the database during each update transaction.
You can prevent "dirty data" from getting into the database by adding features such as check clauses and primary-foreign key structures in the database itself.
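For example, a sketch of both ideas (table and constraint names are illustrative; SELECT ... FOR UPDATE is the syntax in Oracle, DB2, PostgreSQL and MariaDB, while SQL Server uses the UPDLOCK table hint instead):

    -- Make the read-modify-write atomic inside one transaction
    BEGIN;
    SELECT total FROM t_sale WHERE category = 'cs' FOR UPDATE;  -- row lock
    UPDATE t_sale SET total = total + 10 WHERE category = 'cs';
    COMMIT;

    -- Keep dirty data out with declarative constraints
    ALTER TABLE t_sale ADD CONSTRAINT chk_total_nonneg CHECK (total >= 0);
    ALTER TABLE t_sale
      ADD CONSTRAINT fk_sale_category
      FOREIGN KEY (category) REFERENCES t_category (code);

Note that the relative form of the update (total = total + 10) is atomic on its own; the explicit row lock only matters when the new value is computed in the application.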
The current single application server can handle about 5,000 concurrent requests. However, the user base will be in the millions, and I may need two application servers to handle the requests.
So the design is to add a load balancer, in the hope that it will handle over 10,000 concurrent requests. However, all users' data is stored in one single database. If the design is to have two or more database servers, shall I do the following?
Have two database instances
Sync the two databases in real time
Is this correct?
However, if so, will the sync process lower the performance of the servers, as database replication seems costly?
Thank you.
You probably want to think of your service in "tiers". In this instance, you've got two tiers; the application tier and the database tier.
Typically, your application tier is going to be considerably easier to scale horizontally (i.e. by adding more application servers behind a load balancer) than your database tier.
With that in mind, the best approach is probably to overprovision your database (i.e. put it on its own, meaty server) and have your application servers all connect to that same database. Depending on the database software you're using, you could also look at using read replicas (AWS docs) to reduce the strain on your database.
You can also look at caching via Memcached / Redis to reduce the amount of load you're placing on the database.
So – tl;dr – put your DB on its own big server, and spread your application code across many small servers, all connecting to that same DB server.
The best option could be synchronizing a standby node with data from the active node, as a cost-effective solution, since it is achievable with an open-source relational database (e.g. MariaDB).
Do not store computable results and statistics that can easily be derived at run time; this helps reduce the data size.
If historical data is not needed urgently for inquiries, it can be written to a text file in a format that is easily imported into the database (e.g. .csv).
Data objects that are updated very often can be kept in an in-memory database as key-value pairs; use a scheduled task to perform batch updates/inserts to the relational database to achieve persistence (a sketch follows this list).
Implement retry logic for database batch-update tasks to handle DB downtimes or network errors.
Consider writing data to the relational database as serialized objects.
Cache configuration data in memory from the database, refreshing the changing parts either periodically or via an API.
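A minimal sketch of such a scheduled batch flush, in MariaDB-flavoured SQL (table and column names are assumptions; the values are what the in-memory store accumulated since the last flush):

    -- Flush hot counters from the in-memory store back to the
    -- relational table in a single round trip
    INSERT INTO page_views (page_id, views)
    VALUES (101, 45), (102, 12), (103, 7)
    ON DUPLICATE KEY UPDATE views = views + VALUES(views);

If the statement fails because the database is down, the in-memory copy still holds the counts, so the retry logic mentioned above can simply re-run the same flush later.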
Here is the situation:
There is a transaction intensive database - used for both routine transactions and reports.
I was wondering if I could isolate these two operations into two independent databases, so reports could run off one database and all the transactions could occur in the other. This would improve performance of the OLTP SQL database.
I have gone over a few options - mirroring, log shipping, replication, snapshots, clustering - but would like to discuss the best possible strategy for the desired result.
Please advise the best solution to implement this strategy, or any other thoughts/suggestion you may have.
I am thinking this is a classic textbook case of separation of frontend and backend database.
For the projects and people I have worked with, there was a strong agreement that the two should be separated.
In one case, there were three tiers of databases:
Frontend transactions
Middle summary repository for reference by frontend transactions
Backend information repository
The frontend transaction speed was so critical, even that layer was dissected into multiple databases, one database per manufacturing area. The transactions were performed by equipment requiring very fast response.
Data from the frontend databases was used, together with customer- and management-oriented databases, to construct records for the backend reporting repository at an hourly frequency, because management needed short information latency for their operational and engineering decisions. If we could have performed the information compilation at 15-minute intervals, we would have done it. Depending on the project, that backend repository could be either Oracle or Sybase IQ.
However, the frontend transactions performed by equipment needed to refer to some meta information. The response time required by the equipment could not run the risk of being disrupted by someone running a huge ad hoc query on the backend repository, which happened frequently.
So a middle-layer bridging database was created, consisting of nightly abstracts of information from the backend repository.
Schema designed with commonality-keys
Schema design is very important, to optimise the response and performance of all the databases. You have to ensure your database records are commonality-key-indexed and discrete-time-indexed.
For a manufacturing plant filled with robots and equipment, divided into manufacturing areas, each area has a frontend transaction database, and each area database needs a commonality-key dispatcher. When a piece of equipment needs to perform a batch of operations, the beginOp event requests a discrete key from the dispatcher. An operation cycle may take seconds, days, or weeks. Every time a piece of equipment needs to perform a transaction on its state of operation, it includes that commonality-key. An operation can have sub-operations, sub-sub-operations, etc., but each such operation is required to obtain a commonality-key from its area dispatcher.
The commonality-key dispatcher is simply the beginOp table in the database with an auto-increment key. Any equipment sharing the same begun operation is able to infer/obtain that commonality-key from the table, thanks to a meticulous process-sequencing strategy.
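A minimal sketch of such a dispatcher table (MariaDB/MySQL-flavoured SQL; names are illustrative):

    -- The dispatcher is just an auto-increment table
    CREATE TABLE beginOp (
      op_key   BIGINT AUTO_INCREMENT PRIMARY KEY,
      area_id  INT NOT NULL,
      started  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Equipment starting an operation obtains its commonality-key:
    INSERT INTO beginOp (area_id) VALUES (7);
    SELECT LAST_INSERT_ID();   -- the key stamped on every subsequent event

Every transaction the equipment records afterwards carries that key, so the hourly compilation can group events by commonality-key plus area without the machines ever talking to each other.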
For areas where we could ensure that no two operations on the whole floor could start within the same 100 milliseconds, there was no need for a dispatcher, because we could simply use the date-time of the beginOp event; the datetime function of the database server becomes the natural/spontaneous key dispatcher.
The reason for this discussion of commonality-keys is that the required transaction response is so quick that you do not want pieces of equipment to have to communicate with each other unnecessarily just to tell each other they are recording events of the same operation. The robots and equipment simply perform their transactions with the commonality-key they are holding.
The hourly compilation of information for insertion into the backend repository conveniently uses the composite-key of commonality+area, to construct the hierarchy of events.
Frontend piping database
OK, this is really extreme. In some areas the transactions were so frequent that we had a FIFO database: we introduced a fourth-tier database. For optimal transaction response, we had to keep the database size below 1GB. A transaction-piping process existed to empty old transactions into the fourth-tier databases. I found that it was easier (and gave better response) to create a pool of new databases, so that every time a database's size reached 1GB it was moved out and immediately replaced with a fresh database from the pool, leaving the machines performing the hourly compilation to join up the databases. That left us depending on an existing metadata database to house the commonality-key dispatcher table and some metadata tables.
In retrospect, one might think the commonality-key dispatcher table and metadata tables could have been housed in the middle tier bridging database, but because the database management processes were automated and cookie-cut, it was cleaner to create a new process than to modify the process managing the mid-tier bridging database. Those management routines were used across the world, so you cannot willy-nilly change them without causing havoc to the financial performance of the company or offending the respective data layer architects maintaining them.
It took a lot of organisational skill for the managers to pull all this together. So transactional data design is not simply a technical skill but a process-planning skill, involving a whole lot of people head-butting each other until you get it right.
What you ask for is totally standard - OLAP and OLTP do not mix in heavy load scenarios.
You use SQL Server. Look into SSAS (SQL Server Analysis Services) for something to build cubes (a different approach than SQL) that you can then report against.
If you do not want that, then mirroring is the next best solution: you can put a mirror online in read-only mode for reporting, and it also gives you a backup to activate if the main server fails ;) Always good.
Clustering is a non-issue - it will allow you to move the database to another node, but it does not solve the performance issue at all. Log shipping and replication are good too, though I would go with mirroring: a read-only copy for reporting, and load the data into SSAS.
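One detail: a mirror itself isn't directly readable, so reporting reads go through a database snapshot created on the mirror. A T-SQL sketch (database, logical file, and path names are assumptions):

    -- On the mirror server: expose a point-in-time, read-only view
    CREATE DATABASE SalesDB_ReportSnap
    ON (NAME = SalesDB_Data,               -- logical name of the data file
        FILENAME = 'D:\Snapshots\SalesDB_ReportSnap.ss')
    AS SNAPSHOT OF SalesDB;

Reports then query SalesDB_ReportSnap; the snapshot is dropped and recreated whenever fresher data is needed.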
We have a read/write cluster which replicates (using transactional replication) to "read-only" servers (not physically read-only; the web app just performs reads on them). We do the same for reporting, and this scales pretty well.
We have multiple sites, 32+ servers, and a couple of reporting servers in this configuration, with a very high volume of inserts, updates and reads.
We primarily use Reporting Services for internal reporting. Reporting doesn't affect our core business, which I guess is your main concern.
Note: I have no intention of implementing this, it's more of a thought experiment.
Suppose I had multiple services available through a web interface. At least two of which required user registration and some data in a database. A single registration would grant access to all services. Like Google (GMail, Google Docs, etc.).
Would all of these services, which are related to registered users, be located within a single database, perhaps with table-prefixes for what service they were for?
Or would each service have its own database? The only plus I can see in doing this is that it would make table names cleaner. But any time any user interaction is needed, at least two different databases would have to be involved, which would needlessly complicate SQL queries.
Would this suggest that the 'big boys' use only a single database, and load it with tons of different (and perhaps completely unrelated) tables?
If you use the right DBMS, you can have the best of both strategies. In PostgreSQL, within a 'database' you can have separate schemas. The authentication service would access a single schema and provide the other services a key which is used as a reference for data in the other schemas. You can still deal with the entire database as a single entity i.e:
query across schemas without using dblink
store personally identifiable information separately (schemas can have separate per-user permissions to further protect data)
DBMS-managed foreign key constraints (I believe)
consistent (re the data) backup and restore
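A minimal PostgreSQL sketch of that layout (schema, table, and column names are illustrative):

    -- One database, one schema per service
    CREATE SCHEMA auth;
    CREATE SCHEMA mail;

    CREATE TABLE auth.users (
      user_id  BIGSERIAL PRIMARY KEY,
      email    TEXT NOT NULL UNIQUE
    );

    CREATE TABLE mail.messages (
      msg_id   BIGSERIAL PRIMARY KEY,
      user_id  BIGINT NOT NULL REFERENCES auth.users (user_id), -- cross-schema FK
      body     TEXT
    );

    -- Query across schemas without dblink
    SELECT u.email, count(*)
      FROM auth.users u
      JOIN mail.messages m USING (user_id)
     GROUP BY u.email;

The cross-schema foreign key above settles the "I believe" in the list: PostgreSQL enforces it like any other FK.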
You get these advantages at the cost of a more complex DAL (may not be supported by your favorite DAL framework) and less portability between DBMS's.
I do not think it is a good idea to make multiple services dependent on a single database. If you need to restore some service from a backup, you'll have to restore all.
You would probably be overloading a database server, too.
I would do that only if it is likely they will share much data at some future point.
Also, you might consider a smaller database holding only the shared user data.
I would consider having 1 user / role repository with a separate database for services.
I've never done this, but I think it would depend on performance. If there's almost no overhead to do separate databases, that might be the answer. Doing separate DBs may also make it easy to split DBs across machines.
Complexity is also an issue. Hopefully your schema would be defined in such a way that you wouldn't need to dip into several different databases for different queries.
There's always a problem with potentially overloading databases and access to them; replication is one potentially good solution.
There are several strategies.
When you move to multiple databases (or multiple servers), things get more complex. Your core user information could be in a single database. The individual services could be in other databases. The problem with that is that the database is the outer unit of referential integrity, so you cannot design in foreign keys across databases. One way around this is to distribute changes to the core master tables (additions and updates only, obviously, since deletions would be forbidden due to a foreign-key constraint) to separate databases on a regular basis, and then enforce RI against these copies of the core master database tables within the service databases. This also means that the service databases and their services can run while the other databases are down for maintenance. Obviously this is an increased architectural complexity for an improvement to your service windows and reduced coupling.
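A sketch of that copy-plus-local-FK idea inside one service database (table and column names are illustrative):

    -- Read-only copy of the core users table, refreshed by the
    -- regular distribution job
    CREATE TABLE core_users_copy (
      user_id  INT PRIMARY KEY,
      login    VARCHAR(64) NOT NULL
    );

    -- Service-owned data can now declare a real foreign key locally
    CREATE TABLE service_orders (
      order_id   INT PRIMARY KEY,
      user_id    INT NOT NULL REFERENCES core_users_copy (user_id),
      placed_at  DATETIME NOT NULL
    );

Because deletions in the core master are forbidden, the copy only ever grows or gets updated, so the local foreign keys never break between refreshes.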
I would recommend starting with a single database. If your RDBMS supports it, I would organize components according to SCHEMAs which would allow you to at least maintain a logical separation by design. You can more easily refactor later.
Many databases have tables which can be considered unrelated. Sometimes in a system you have multiple entity networks that hardly connect (sometimes not at all). You can use SCHEMAs in these cases too.