keeping databases in sync (after write/update) across regions/zones - database

I have to write a webservice in php to serve at three different zones/(cities or countries). Each zone will have its own machine to run this web service instance behind every webservice is a database which is exact clone/copy in each region, web service serves the clients with data from db. Main reason for multiples instances of web service is to distribute client load.
The clients can make read and write calls via web service APIs.
Write calls will modify the database for that instance but this change has to be applied as soon as possible to all databases in other zones also as all the databases in each zone are clones and exact copies, so changes in one db must be synced in all the databases in other zones.
I presume the write calls must go to some kind of master server which coordinates among all the web services etc. But I am sure this pattern is quite common and some solution is already out there.
Please advise if there is any database or application level technique which would keep the databases in sync when there are write calls so that modification or addition is reflected in all instances of db ? I can choose the database of my choice but primary choice would be mysql server or postgres, but can change to other database which can solve this issue.

You're right, this pattern is quite common and there is a name for it - Synchronous Master-Master replication. Most modern RDBMS support it:
PosgreSQL supports it thru pg_cluster https://wiki.postgresql.org/wiki/PgCluster
MySQL https://www.howtoforge.com/mysql_master_master_replication
But before implementing it straight away I'd recommend reading more about different types of replication, their pros and cons:
https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
https://dev.mysql.com/doc/refman/8.0/en/replication.html
Synchronous Master-Master replication will be quite slow, especially in a multi-zone scenario, so you might consider other techniques:
Asynchronous replication
Sharding/Partitioning
A mix of sharding and replication
There is a very good book on different distributed techniques(including sharding and replication) - "Designing Data Intensive Applications" by Martin Kleppmann.

Replication techniques are definitely worth looking at, but there can be a certain amount of technical overhead and cost to replication. I work for a company called Redactics (https://www.redactics.com), and we came up with a simpler solution that is sort of a near realtime replication based on delta updates using a pure SQL approach.
There are certainly pros and cons to both approaches, I'm not trying to push Redactics hard if this is not the most appropriate solution for your needs, but Redactics simply tracks the most recent primary keys and uses modification timestamps to find new and changed records, and then copies them over. You can run the sync pretty often without a lot of load since it is just a delta update. Obviously any workflow can break, but repairing broken replication can be tricky, so we like this approach and running these sync workflows within your own infrastructure.

Related

DB recommendation - Portable, Concurrent (multiple read only, one write)

I'm looking for a portable database solution I can use with a website that is designed to handle service outages. I need to nightly retrieve a list of users from SQL Server and upsert their details into a portable database. It's roughly about 250,000 users (and growing) and each one has probably 25 fields that are required. Of those fields, i'd say less than 5 need to be searched on. The rest just need retrieving.
The idea is, in times of a service outage, we can use a website that's designed to work from the portable database rather than SQL Server. Our long term goal, is to move to the cloud and handle things in an entirely different way, but for the short term this is our aim.
The website is going to be a .Net Core web api so will be being accessed by multiple users in multiple threads. The website will only ever need read access, it will not be updating these details what-so-ever.
To keep the portable database up-to-date i'm thinking of having another application that just runs nightly to update the data. Our business is 24 hours (albeit quieter overnight), so there is a potential this updater is in use while the website is in use. While service outage would assume the SQL Server is down, this may not be the case. There are other factors in play that could cause what we would describe as outages. This will be the only piece of software updating the database.
I've tried using LiteDB but I couldn't get it working in a way that worked with my concurrency requirements. It did seem to do some of the job, and was easy to get running. However, i'd often run into locked files due to the nature of web api. I did work out a solution for that, but then the updater app couldn't access the database file.
Does anyone have any recommendations I can look into?
Given the description of the problem (1 table, 250k rows with - I assume - relative fast growth rate) and requirements, I don't think a relational database is what you are looking for.
I think nosql databases, or, more specifically, document oriented databases are more fitted to meet your requirements. There are many choices: Mongo, Cassandra, CouchDB, ... the choice is yours.
Personally I have some experience with ElasticSearch (https://www.elastic.co/elasticsearch), that is quite easy to learn, is portable (runs on Linux, Windows, Containers, etc...), is scalable, and it is fast. I mean, really, really fast, you can get results in 10-20 milliseconds (even less, sometimes).
The NEST nuget package acts as a high level client for working with ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/client/net-api/7.x/nest-getting-started.html)

To CouchDB or not to?

Note: (I have investigated CouchDB for sometime and need some actual experiences).
I have an Oracle database for a fleet tracking service and some status here are:
100 GB db
Huge insertion/sec (our received messages)
Reliable replication (via Oracle streams on 4 servers)
Heavy complex queries.
Now the question: Can CouchDB be used in this case?
Note: Why I thought of CouchDB?
I have read about it's ability to scale horizontally very well. That's very important in our case.
Since it's schema free we can handle changes more properly since we have a lot of changes in different tables and stored procedures.
Thanks
Edit I:
I need transactions too. But I can tolerate other solutions too. And If there is a little delay in replication, that would be no problem IF it is guaranteed.
You are enjoying the following features with your database:
Using it in production
The data is naturally relational (related to itself)
Huge insertion rate (no MVCC concerns)
Complex queries
Transactions
These are all reasons not to switch to CouchDB.
Of course, the story is not so simple. I think you have discovered what many people never learn: complex problems require complex solutions. We cannot simply replace our database and take the rest of the month off. Sure, CouchDB (and BigCouch) supports excellent horizontal scaling (and cross-datacenter replication too!) but the cost will be rewriting a production application. That is not right.
So, where can CouchDB benefit you?
I suggest that you begin augmenting your application with CouchDB applications. Deploy CouchDB, import your data into it, and build non mission-critical applications. See where it fits best.
For your project, these are the key CouchDB strengths:
It is a small, simple tool—easy for you to set up on a workstation or server
It is a web server. It integrates very well with your infrastructure and security policies.
For example, if you have a flexible policy, just set it up on your LAN
If you have a strict network and firewall policy, you can set it up behind a VPN, or with your SSL certificates
With that step done, it is very easy to access now. Just make http or http requests. Whether you are importing data from Oracle with a custom tool, or using your web browser, it's all the same.
Yes! CouchDB is an app server too! It has a built-in administrative app, to explore data, change the config, etc. (like a built-in phpmyadmin). But for you, the value will be building admin applications and reports as simple, traditional HTML/Javascript/CSS applications. You can get as fancy or as simple as you like.
As your project grows and becomes valuable, you are in a great position to grow, using replication
Either expand the core with larger CouchDB clusters
Or, replicate your data and applications into different data centers, or onto individual workstations, or mobile phones, etc. (The strategy will be more obvious when the time comes.)
CouchDB gives you a simple web server and web site. It gives you a built-in web services API to your data. It makes it easy to build web apps. Therefore, CouchDB seems ideal for extending your core application, not replacing it.
I don't agree with this answer..
I think CouchDB suits especially well fleet tracking use case, due to their distributed nature. Moreover, the unreliable nature of gprs connections used for transmitting position data, makes the offline-first paradygm of couchapps the perfect partner for your application.
For uploading data from truck, Insertion-rate can take a huge advantage from couchdb replication and bulk inserts, especially if performed on ssd-based couchdb hosting.
For downloading data to truck, couchdb provides filtered replication, allowing each truck to download only the data it really needs, instead of the whole database.
Regarding complex queries, NoSQL database are more flexible and can perform much faster than relation databases.. It's only a matter of structuring and querying your data reasonably.

Synchronizing intranet and web data

I am just getting started breaking a .NET application and its SQL Server database into two systems - an intranet and a public website.
The various database tables will need to be synchronised between the two databases in different ways, for example:
Moving from web to intranet, with the intranet data becoming read-only
Moving from intranet to web, with the web data becoming read-only
Tables that need to be synchronised and are read/write on both the intranet and web databases.
Some of the synchronisation needs to occur relatively quickly with minimal lag, possibly with some type of transaction locking to ensure repeatable reads etc. Other times it doesn't matter if there is a delay between synchronisation.
I am not quite sure where to start with all this, as there seems to be many different ways of achieving this. Which technologies and strategies should I be looking at?
Any tips?
A system like that looks like the components are fairly tightly coupled. An upgrade across several systems all at once can turn into quite the nightmare.
It looks like this is less of a replication problem and more of a problem of how to maintain a constant connection to a remote database without much I/O lag. While it can be done, probably isn't going to work out very well in terms of scalability and being able to troubleshoot problems.
You might look at using some message queueing and asynchronous data processing from the remote site to the intranet. You'll probably have to adjust some expectations of the business side so that they don't assume that everything is accessible real-time all the time.
Of course, its hard to give specifics without more details. It might be a good idea to look into principles of SOA and messaging systems for what you're trying to do.
Out of the box you have SQL Server Replication. Sounds like a pair of filtered transactional replication publications can do the job. Transactional replication has a low overhead on the publisher and can ensure transactional consistency of the published changes.
Nathan raises some very valid points about the need for a more loosely coupled solution. Service Broker can fit that shoe quite well with its loosely coupled asynchronous nature, and provide a headache free upgrade future since SSB is compatible between SQL Server versions and editions. But this freedom comes at the cost of letting the heavy lifting of actually detecting the changes and applying them to the tables to you, as application code, not a trivial feats.

Sharing data between remote locations

I'm currently estimating how to best share data between offices at different geographical locations.
My current preference is for using SQL Server Merge Replication and have a main database and handful of subscribers.
The system will also need to allow a few work sites to work disconnected (no or little connectivity on construction sites).
The amount of data is not going to be large, we're talking about sharing data from a custom ERP system between a manufacturing plant, a handful of regional offices and work sites.
The Sync Framework also looks good and seems to have good support in SQL Server 2008.
What other proven system out there should I investigate that can answer these needs?
For those with experience on sharing data in a similar environment, do you have any particular recommendation and tips?
How difficult has it been for you to deal with data conflicts?
Definitely stick with SQL Server replication, then decide to go down the path of 'build your own replication framework.' I've seen some applications become horrible messes that way.
I've had environments that are setup for snapshot replication in a disconnected model, but the remote sites were read-only. They worked quite well with minimal issues.
I'd also be interested in hearing people's experiences with the sync framework.
You may want to look at what microsoft calls smart clients which is an architecture microsoft talk about for applications that may have temporary network connectivity.
I have already discussed my own experience of SQLServer2005 with #cycnus. My answer is not a real one, just a few arguments to initiate a subject I am very interested in.
Our choice for 'not allways connected' sites is to implement web-based merge replication. Data exchanges happen to be even quicker than through VPNs (as we also have a combination of LAN merge replications). I will easily get a speed of 30 to 40 rows per second through web (512 Down/128 Up, shared) while I'll get a 5 rows per second through LAN (overseas, 256 Up/Down, dedicated). Don't ask me why ...
Tips are numerous: subscription should be of the client type (data circulating basically from the suscriber to the publisher before being distributed). Primary Keys should allways be GUID, for many reasons exposed here, but also for replication issues: we are then sure that any newly created record will be able to find its way up to the publisher, as its PK will be unique. Moreover, I recently had a non-convergence issue with one of my replications (bad experience, exposed here) , where I felt very happy not to use natural keys, as the problem occured on the potential "natural key" column.
Data conflicts should then be basically limited to work organisation problems, where (ususally for bad reasons) the same data is modified by different users in different places at the same time. With our "PK is GUID rule", we do not have conflicts out of these specific situations.
One should always have the possibility to modify its database structure, even if replications are running. It is possible to keep on adding fields, indexes, constraints while running merge replication processes. I also find a workaround for adding tables without reinitialising the replication process (exposed here, still did not understand why I was downvoted on this answer!)

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client ,and move them to a different server. Or restore a backup of that client when the call up to say "We've deleted some key data!", using the builtin database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to doing easy reporting on all clients at once, are minimal - how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but not theirs. I don't care how well everything is indexed and optimized there will be queries that determine that they have to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx
Scalability. Security. Our company uses 1 DB per customer approach as well. It also makes code a bit easier to maintain as well.
In regulated industries such as health care it may be a requirement of one database per customer, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have have many smaller databases to distribute over multiple machines you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster but majority probably don't).
I'd be interested in hearing a little more context from you on this because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world Google Bigtable etc. but they are solving a different set of problems, and lose some of the useful features from an RDBMS.
There are a couple of meanings of "database"
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.

Resources