Accessing data from other databases - system architecture - database

I have a system that we have recently developed - a web application over a SQL server database. The SQL server database has been set up to be a 'multi-tenant' database, with many different 'installations' of our web site accessing the same database.
We have another application that runs along similar lines, the main difference being that it has many different 'installations' all accessing their own separate databases.
All these websites run on the same server and all the databases reside in the same SQL server instance.
Each of our clients would have one of each of these systems and up to this point, we have had some fairly light integration between these two systems, which has been handled via web service calls.
We now have a new change that is going to require me to return a list of data from the multi-tenant system, but filter it based on criteria stored in the databases of the other system. I can see a few ways of doing this, but was wondering if anybody had any bright ideas:
1. Web service again - don't like this idea, as it means taking a list of data and making a call for each individual item, which is both slow and ugly.
2. Writing some dynamic SQL within the database layer to do a join on .dbo.table, which is also a bit ugly, and can be hard to maintain.
3. Replicate the data from one database to the other. This is what I am tending towards; however, there is then a risk of the data getting out of sync.
I'd like to do something clever with views in my multi-tenant database, but I don't want to have to create a separate set of views each time we create a new database for the second system...
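For the dynamic SQL option, most of the maintenance risk comes from interpolating a database name into the query text. A minimal sketch of how that could be contained is shown here, assuming the per-client database name is known at call time. The table and column names (Items, FilterCriteria, ItemId, TenantId) are hypothetical and not taken from the systems described above.

```python
import re

# Hypothetical names throughout: Items, FilterCriteria, ItemId and TenantId
# are illustrative assumptions, not the real schema.
_VALID_DB_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_identifier(name: str) -> str:
    """Validate a database name and wrap it in T-SQL square brackets."""
    if not _VALID_DB_NAME.fullmatch(name):
        raise ValueError(f"unsafe database name: {name!r}")
    return f"[{name}]"


def build_filtered_query(tenant_db: str) -> str:
    """Build a query joining the multi-tenant table to one per-client database.

    Only the database name is interpolated, because it cannot be passed as a
    query parameter; the tenant id stays a placeholder (?) for the driver.
    """
    db = quote_identifier(tenant_db)
    return (
        "SELECT i.ItemId, i.Name "
        "FROM dbo.Items AS i "
        f"JOIN {db}.dbo.FilterCriteria AS f ON f.ItemId = i.ItemId "
        "WHERE i.TenantId = ?"
    )
```

Keeping the query in one place like this means a new client database needs no new views or stored procedures, only a new name passed to the builder.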

Depending on business size, I would go with #1 or #2.
#1 is more scalable and good for heterogeneous clients, but harder to implement and maintain. Since you don't have public APIs, you can go with #2.
#2 needs an expert DBA and is very error-prone.
#3 is the worst solution IMO, since it introduces redundancy that is hard to resolve later.
What I suggest is a short-term plan and a long-term plan. In the short term, use #1 or #2 and at the same time redesign your database. Then you can add the new data model to the system so that it coexists with the legacy database. When you are sure of its functionality, switch to the new database but keep the legacy system running. Finally, once the new database has run without problems for a while, take the legacy database out of the circuit.

Don't change the data model. It's risky. Just make another abstract wrapper over it.
You can replicate the database on another server and let this new wrapper work with the copy of the data.
If any data corruption happens, simply restore from the main copy.

Related

keeping databases in sync (after write/update) across regions/zones

I have to write a web service in PHP to serve three different zones (cities or countries). Each zone will have its own machine running an instance of this web service. Behind every instance is a database that is an exact copy of the databases in the other regions, and the web service serves clients with data from that database. The main reason for running multiple instances is to distribute client load.
The clients can make read and write calls via web service APIs.
Write calls will modify the database for that instance, but the change has to be applied as soon as possible to the databases in all other zones, since every database is supposed to be an exact copy of the others.
I presume the write calls must go to some kind of master server which coordinates among all the web services etc. But I am sure this pattern is quite common and some solution is already out there.
Please advise if there is any database-level or application-level technique that would keep the databases in sync on write calls, so that every modification or addition is reflected in all instances of the database. My primary choice would be MySQL or Postgres, but I can switch to any other database that solves this issue.
You're right, this pattern is quite common and there is a name for it: synchronous master-master replication. Most modern RDBMS support it:
PostgreSQL supports it through PgCluster: https://wiki.postgresql.org/wiki/PgCluster
MySQL: https://www.howtoforge.com/mysql_master_master_replication
But before implementing it straight away I'd recommend reading more about different types of replication, their pros and cons:
https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
https://dev.mysql.com/doc/refman/8.0/en/replication.html
Synchronous Master-Master replication will be quite slow, especially in a multi-zone scenario, so you might consider other techniques:
Asynchronous replication
Sharding/Partitioning
A mix of sharding and replication
There is a very good book on different distributed techniques (including sharding and replication): "Designing Data-Intensive Applications" by Martin Kleppmann.
Replication techniques are definitely worth looking at, but there can be a certain amount of technical overhead and cost to replication. I work for a company called Redactics (https://www.redactics.com), and we came up with a simpler solution that is sort of a near realtime replication based on delta updates using a pure SQL approach.
There are certainly pros and cons to both approaches, I'm not trying to push Redactics hard if this is not the most appropriate solution for your needs, but Redactics simply tracks the most recent primary keys and uses modification timestamps to find new and changed records, and then copies them over. You can run the sync pretty often without a lot of load since it is just a delta update. Obviously any workflow can break, but repairing broken replication can be tricky, so we like this approach and running these sync workflows within your own infrastructure.
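The delta-update idea described above (find rows whose modification timestamp is newer than the last sync, then copy them over) can be sketched in a few lines. This is only an illustration of the general technique, not Redactics' implementation; it uses SQLite in place of MySQL or Postgres, and the items table with its modified_at column is an assumption.

```python
import sqlite3

# Hypothetical table used on both sides; modified_at is an integer timestamp.
SCHEMA = "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, modified_at INTEGER)"


def sync_delta(source: sqlite3.Connection, target: sqlite3.Connection, last_sync: int) -> int:
    """Copy rows modified after last_sync from source to target.

    Returns the new high-water mark to pass to the next run.
    """
    rows = source.execute(
        "SELECT id, name, modified_at FROM items WHERE modified_at > ? ORDER BY modified_at",
        (last_sync,),
    ).fetchall()
    with target:  # commit all copied rows together, or none if an error occurs
        target.executemany(
            "INSERT OR REPLACE INTO items (id, name, modified_at) VALUES (?, ?, ?)",
            rows,
        )
    # If nothing changed, keep the previous high-water mark.
    return max((row[2] for row in rows), default=last_sync)
```

Two limits of this sketch are worth keeping in mind: deleted rows are never propagated, and a row committed late with a timestamp at or below the current high-water mark would be missed.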

Transfer data between NoSQL and SQL databases on different servers

Currently, I'm working on a MERN Web Application that'll need to communicate with a Microsoft SQL Server database on a different server but on the same network.
Data will only be "transferred" from the Mongo database to the MSSQL one based on a user action. I think I can accomplish this by simply transforming the data to transfer into the appropriate format on my Express server and connecting to the MSSQL via the matching API.
On the flip side, data will be transferred from the MSSQL database to the Mongo one when a certain field is updated in a record. I think I can accomplish this with a Trigger, but I'm not exactly sure how.
Do either of these solutions sound reasonable, or are there better/industry standard methods that I should be employing? Any and all help is much appreciated!
There are (in general) two ways of doing this.
If the data transfer needs to happen immediately, you may be able to use triggers to accomplish this, although be aware of your error handling.
The other option is to develop some form of worker process in your favourite scripting language and run this on a schedule. (This would be my preferred option, as my personal familiarity with triggers is fairly limited). If option 1 isn't viable, you could set your schedule to be very frequent, say once per minute or every x seconds, as long as a new task doesn't spawn before the previous is completed.
The broader question, though, is whether you need to have data duplicated across two different sources. The obvious pitfall with this approach is consistency: should anything fail, you can end up with two data sources wildly out of sync with each other, and your approach will have to account for this.
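The scheduled worker option, including the requirement that a new run must not start before the previous one has finished, could look roughly like this. It is a sketch in Python rather than the asker's Node stack, and the function name run_periodically is made up for illustration. The task passed in would do the actual MSSQL-to-Mongo copy.

```python
import threading
import time
from typing import Callable


def run_periodically(task: Callable[[], None], interval_s: float, stop: threading.Event) -> None:
    """Run task roughly every interval_s seconds until stop is set.

    The task is called inline in a single loop, so a new run cannot start
    before the previous one has finished. If a run takes longer than the
    interval, the next run simply starts late.
    """
    while not stop.is_set():
        started = time.monotonic()
        try:
            task()
        except Exception as exc:  # keep the worker alive and retry on the next tick
            print(f"sync run failed: {exc}")
        elapsed = time.monotonic() - started
        # Sleep for the rest of the interval, waking early if stop is set.
        stop.wait(max(0.0, interval_s - elapsed))
```

Catching the exception inside the loop addresses the error-handling concern to a degree: a failed run is logged and retried on the next tick instead of killing the worker.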

What is the best approach for decoupled database design in terms of data sharing?

I have a series of Oracle databases that need to access each other's data. The most efficient way to do this is to use database links - setting up a few database links I can get data from A to B with the minimum of fuss. The problem for me is that you end up with a tightly-coupled design and if one database goes down it can bring the coupled databases with it (or perhaps part of an application on those databases).
What alternative approaches have you tried for sharing data between Oracle databases?
Update after a couple of responses...
I wasn't thinking so much of replication, more of accessing "master data". For example, if I have a central database with currency conversion rates and I want to pull a rate into a separate database (application). For such a small dataset igor-db's suggestion of materialized views over DB links would work beautifully. However, when you are dynamically sampling from a very large dataset then the option of locally caching starts to become trickier. What options would you go for in these circumstances? I wondered about an XML service but tuinstoel (in a comment to le dorfier's reply) rightly questioned the overhead involved.
Summary of responses...
On the whole I think igor-db is closest, which is why I've accepted that answer, but I thought I'd add a little to bring out some of the other answers.
For my purposes, where I'm looking at data replication only, it looks like Oracle BASIC replication (as opposed to ADVANCED replication) is the one for me. Using materialized view logs on the master site and materialized views on the snapshot site looks like an excellent way forward.
Where this isn't an option, perhaps where the data volumes make full table replication an issue, then a messaging solution seems the most appropriate Oracle solution. Oracle Advanced Queueing seems the quickest and easiest way to set up a messaging solution.
The least preferable approach seems to be roll-your-own XML web services but only where the relative ease of Advanced Queueing isn't an option.
Streams is the Oracle replication technology.
You can use MVs over database links (so database 'A' has a materialized view of the data from database 'B'. If 'B' goes down, the MV can't be refreshed but the data is still in 'A').
Mileage may depend on DB volumes, change volumes...
It looks to me like it's by definition tightly coupled if you need simultaneous synchronous access to multiple databases.
If this is about transferring data, for instance, and it can be asynchronous, you can install a message queue between the two and have two processes, with one reading from the source and the other writing to the sink.
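The two-process arrangement described above, one side reading from the source and the other writing to the sink with a queue in between, can be sketched as follows. An in-process queue.Queue and two threads stand in for a real message broker and separate processes, so unlike a proper queueing product nothing here survives a restart.

```python
import queue
import threading

_DONE = object()  # sentinel telling the writer that no more messages will arrive


def reader(source_rows, q: queue.Queue) -> None:
    """Read rows from the source and publish them to the queue."""
    for row in source_rows:
        q.put(row)
    q.put(_DONE)


def writer(q: queue.Queue, sink: list) -> None:
    """Consume rows from the queue and write them to the sink."""
    while True:
        row = q.get()
        if row is _DONE:
            break
        sink.append(row)


def transfer(source_rows) -> list:
    """Move all rows from source to sink through a bounded queue."""
    q: queue.Queue = queue.Queue(maxsize=100)  # bounded, so a slow sink slows the reader
    sink: list = []
    threads = [
        threading.Thread(target=reader, args=(source_rows, q)),
        threading.Thread(target=writer, args=(q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

The point of the decoupling is that neither side calls the other directly: the reader only knows about the queue, and so does the writer.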
The OP has provided more information. He states that the dataset is very large. Well how large is large? And how often are the master tables changed?
With the use of materialized view logs Oracle will only propagate the changes made in the master table. A complete refresh of the data isn't necessary. Oracle streams also only communicate the modifications to the other side.
Buying storage is cheap, so why not local caching? Much cheaper than programming your own solutions.
An XML service doesn't help you when its database is not available so I don't understand why it would help? Oracle has many options for replication, explore them.
edit
I've built XML services. They provide interoperability between different systems with a clear interface (contract). You can build an XML service in C# and consume the service with Java. However, XML services are not fast.
Why not use Advanced Queuing? Why roll your own XML service to move messages (DML) between Oracle instances - It's already there. You can have propagation move messages from one instance to another when they are both up. You can process them as needed in the destination servers. AQ is really rather simple to set up and use.
Why do they need to be separate databases?
Having a single database/instance with multiple schemas might be easier.
Keeping one database up (with appropriate standby databases etc) will be easier than keeping N up.
What kind of immediacy do you need and how much bi-directionality? If the data can be a little older and can be pulled from one "master source", create a series of simple ETL scripts run on a schedule to pull the data from the "source" database into the others.
You can then tailor the structure of the data to feed the needs of the client database(s) more precisely and you can change the structure of the source data until you're blue in the face.

How do you keep two related, but separate, systems in sync with each other?

My current development project has two aspects to it. First, there is a public website where external users can submit and update information for various purposes. This information is then saved to a local SQL Server at the colo facility.
The second aspect is an internal application which employees use to manage those same records (conceptually) and provide status updates, approvals, etc. This application is hosted within the corporate firewall with its own local SQL Server database.
The two networks are connected by a hardware VPN solution, which is decent, but obviously not the speediest thing in the world.
The two databases are similar, and share many of the same tables, but they are not 100% the same. Many of the tables on both sides are very specific to either the internal or external application.
So the question is: when a user updates their information or submits a record on the public website, how do you transfer that data to the internal application's database so it can be managed by the internal staff? And vice versa... how do you push updates made by the staff back out to the website?
It is worth mentioning that the more "real time" these updates occur, the better. Not that it has to be instant, just reasonably quick.
So far, I have thought about using the following types of approaches:
Bi-directional replication
Web service interfaces on both sides with code to sync the changes as they are made (in real time).
Web service interfaces on both sides with code to asynchronously sync the changes (using a queueing mechanism).
Any advice? Has anyone run into this problem before? Did you come up with a solution that worked well for you?
This is a pretty common integration scenario, I believe. Personally, I think an asynchronous messaging solution using a queue is ideal.
You should be able to achieve near real time synchronization without the overhead or complexity of something like replication.
Synchronous web services are not ideal because your code will have to be very sophisticated to handle failure scenarios. What happens when one system is restarted while the other continues to publish changes? Does the sending system get timeouts? What does it do with those? Unless you are prepared to lose data, you'll want some sort of transactional queue (like MSMQ) to receive the change notices and take care of making sure they get to the other system. If either system is down, the changes (passed as messages) will just accumulate and as soon as a connection can be established the re-starting server will process all the queued messages and catch up, making system integrity much, much easier to achieve.
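The behaviour described above, where changes accumulate while the other system is down and are delivered in order once it comes back, is what a persistent queue provides. A minimal sketch of the idea using a SQLite table as the queue is shown here. It is not MSMQ or any of the tools named in this answer, and the outbox table and function names are assumptions for illustration.

```python
import sqlite3
from typing import Callable


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and create if needed) the hypothetical outbox table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, body: str) -> None:
    """Record a change notice; it stays stored until it has been delivered."""
    with conn:
        conn.execute("INSERT INTO outbox (body) VALUES (?)", (body,))


def deliver_pending(conn: sqlite3.Connection, send: Callable[[str], None]) -> int:
    """Send queued messages in order, stopping at the first failure.

    Returns the number of messages delivered in this call.
    """
    delivered = 0
    rows = conn.execute("SELECT id, body FROM outbox ORDER BY id").fetchall()
    for msg_id, body in rows:
        try:
            send(body)
        except ConnectionError:
            break  # other system is down; this and later messages stay queued
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
        delivered += 1
    return delivered
```

Because a message is deleted only after send returns, a crash between the two steps would cause it to be sent again, so the receiving side should tolerate duplicates.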
There are some open source tools that can really make this easy for you if you are using .NET (especially if you want to use MSMQ).
nServiceBus by Udi Dahan
Mass Transit by Dru Sellers and Chris Patterson
There are commercial products also, and if you are considering a commercial option see here for a list of options on .NET. Of course, WCF can do async messaging using MSMQ bindings, but a tool like nServiceBus or MassTransit will give you a very simple Send/Receive or Pub/Sub API that will make your requirement a very straightforward job.
If you're using Java, there are any number of open source service bus implementations that will make this kind of bi-directional, asynchronous messaging a snap, like Mule or maybe just ActiveMQ.
You may also want to consider reading Udi Dahan's blog, listening to some of his podcasts. Here are some more good resources to get you started.
I'm mid-way through a similar project except I have multiple sites that need to keep in sync over slow connections (dial-up in some cases).
Firstly you need to track changes. If you can use SQL 2008 (even the Express version is enough if the 2Gb limit isn't a problem), this will ease the pain greatly: just turn on Change Tracking on the database and each table. We're using SQL Server 2008 at the head office with the extended schema and SQL Express 2008 at each site with a sub-set of data and limited schema.
Secondly you need to sync your changes. Sync Services does the trick nicely and supports using a WCF gateway into the main database. In this example you will need to use the Sync using SQL Express Client sample as a starting point; note that it's based on SQL 2005, so you'll need to update it to take advantage of the Change Tracking features in 2008. By default the Sync Services uses SQL CE on the clients, which I'm sure isn't enough in your case. You'll need a service that runs on your Web Server that periodically (could be as often as every 10 seconds if you want) runs the Synchronize() method. This will tell your main database about changes made locally and then ask the server for all changes made there. You can set up the get and apply SQL code to call stored procedures and you can add event handlers to handle conflicts (e.g. Client Update vs Server Update) and resolve them accordingly at each end.
We have a shop as a client, with three stores connected to the same VPN.
Two of the shops have a computer running as a "server" for that shop, and the third one has the "master database".
To synchronize everything to the master we don't have the best solution, but it works: there is a dedicated PC running an application that checks the timestamp of every record in every table of the two stores and, if it is different from the last time you synchronized, copies the changes.
Note that this works both ways, i.e. if you update a product in the master database, this change will propagate to the other two shops. If you have a new order in one of the shops, it will be transmitted to the "master".
With some optimizations you can have all the shops synchronize in around 20 minutes.
Recently I have had a lot of success with SQL Server Service Broker which offers reliable, persisted asynchronous messaging out of the box with very little implementation pain.
It is quick to set up and as you learn more you can use some of the more advanced features.
Unknown to most, it is also part of the desktop editions so it can be used as a workstation messaging system
If you have existing T-SQL skills they can be leveraged as all the code to read and write messages is done in SQL
It is blindingly fast
It is a vastly under-hyped part of SQL Server and well worth a look.
I'd say just have a job that copies the data in the pub database input table into a private database pending table. Then once you update the data on the private side have it replicated to the public side. If you don't have any of the replicated data on the public side updated it should be a fairly easy transactional replication solution.

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client, and move it to a different server. Or restore a backup of that client when they call up to say "We've deleted some key data!", using the builtin database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to do easy reporting on all clients at once are minimal - how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
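The approach above hinges on a small lookup: given a customer, find the connection string in the master customer database, and update only that row when the customer is moved. A sketch of that lookup follows, using SQLite as a stand-in for the master database. The customers table, its columns, and the function names are all hypothetical.

```python
import sqlite3


def create_catalog() -> sqlite3.Connection:
    """Create the hypothetical master customer database."""
    catalog = sqlite3.connect(":memory:")
    catalog.execute(
        "CREATE TABLE customers ("
        "customer_id INTEGER PRIMARY KEY, connection_string TEXT NOT NULL)"
    )
    return catalog


def connection_string_for(catalog: sqlite3.Connection, customer_id: int) -> str:
    """Look up where a customer's data currently lives."""
    row = catalog.execute(
        "SELECT connection_string FROM customers WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown customer {customer_id}")
    return row[0]


def move_customer(catalog: sqlite3.Connection, customer_id: int, new_connection_string: str) -> None:
    """Repoint a customer after their data has been migrated (migration not shown)."""
    with catalog:
        catalog.execute(
            "UPDATE customers SET connection_string = ? WHERE customer_id = ?",
            (new_connection_string, customer_id),
        )
```

Application code always resolves the connection string through the catalog, so whether two customers share a database or not is invisible to it.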
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but not theirs. I don't care how well everything is indexed and optimized there will be queries that determine that they have to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx
Scalability. Security. Our company uses the one-DB-per-customer approach as well. It also makes the code a bit easier to maintain.
In regulated industries such as health care, one database per customer may be a requirement, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
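Running the upgrade as a transaction against each database in turn could be sketched as follows. SQLite stands in for the real database server here, and the orders table, schema_version table, and migration statements are invented for the example. The snapshot step mentioned above is not shown.

```python
import sqlite3
from typing import Dict, List

# Hypothetical migration from schema version 1 to 2.
MIGRATION_V2: List[str] = [
    "ALTER TABLE orders ADD COLUMN status TEXT",
    "CREATE INDEX idx_orders_status ON orders (status)",
    "UPDATE schema_version SET version = 2",
]


def upgrade(conn: sqlite3.Connection, statements: List[str]) -> bool:
    """Apply all statements atomically.

    Returns True on success, or False after rolling everything back.
    """
    conn.isolation_level = None  # autocommit mode; the transaction is managed explicitly
    conn.execute("BEGIN")
    try:
        for sql in statements:
            conn.execute(sql)
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        return False
    conn.execute("COMMIT")
    return True


def upgrade_all(databases: Dict[str, sqlite3.Connection], statements: List[str]) -> Dict[str, bool]:
    """Upgrade every client database and report which ones succeeded."""
    return {name: upgrade(conn, statements) for name, conn in databases.items()}
```

A database whose upgrade fails is left at its previous schema version, so the report returned by upgrade_all tells you exactly which clients still need attention.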
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have many smaller databases to distribute over multiple machines, you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster, but the majority probably don't).
I'd be interested in hearing a little more context from you on this, because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world (Google Bigtable etc.), but they are solving a different set of problems, and lose some of the useful features of an RDBMS.
There are a couple of meanings of "database"
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.

Resources