Synchronizing intranet and web data - sql-server

I am just getting started breaking a .NET application and its SQL Server database into two systems - an intranet and a public website.
The various database tables will need to be synchronised between the two databases in different ways, for example:
Moving from web to intranet, with the intranet data becoming read-only
Moving from intranet to web, with the web data becoming read-only
Tables that need to be synchronised and are read/write on both the intranet and web databases.
Some of the synchronisation needs to occur relatively quickly with minimal lag, possibly with some type of transaction locking to ensure repeatable reads etc. Other times it doesn't matter if there is a delay between synchronisation.
I am not quite sure where to start with all this, as there seems to be many different ways of achieving this. Which technologies and strategies should I be looking at?
Any tips?

A system like that sounds like its components are fairly tightly coupled. An upgrade across several systems all at once can turn into quite a nightmare.
It looks like this is less a replication problem and more a problem of how to maintain a constant connection to a remote database without much I/O lag. While it can be done, it probably isn't going to work out very well in terms of scalability and the ability to troubleshoot problems.
You might look at using some message queueing and asynchronous data processing from the remote site to the intranet. You'll probably have to adjust some expectations of the business side so that they don't assume that everything is accessible real-time all the time.
Of course, it's hard to give specifics without more details. It might be a good idea to look into the principles of SOA and messaging systems for what you're trying to do.
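To make the idea concrete, here is a minimal sketch of the kind of asynchronous hand-off described above, with an in-memory queue standing in for a durable broker (MSMQ, Service Broker, RabbitMQ, ...). The message type and field names are made up for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical change message published by the web side whenever a row changes.
public record CustomerChanged(int CustomerId, string Name, DateTime ChangedUtc);

public static class SyncSketch
{
    // Stand-in for a durable message queue; in production this would survive restarts.
    private static readonly BlockingCollection<CustomerChanged> Queue = new();

    // Web side: publish the change and return immediately, no blocking on the intranet.
    public static void Publish(CustomerChanged change) => Queue.Add(change);

    // Intranet side: drain the queue asynchronously and apply changes locally.
    public static Task StartConsumer() => Task.Run(() =>
    {
        foreach (var change in Queue.GetConsumingEnumerable())
        {
            // Apply the change to the intranet database here (e.g. an UPSERT).
            Console.WriteLine($"Applying customer {change.CustomerId} changed at {change.ChangedUtc:O}");
        }
    });

    public static void Main()
    {
        StartConsumer();
        Publish(new CustomerChanged(42, "Acme Ltd", DateTime.UtcNow));
        Thread.Sleep(500); // give the demo consumer a moment before exiting
    }
}
```

The point is the decoupling: the web side never waits on the intranet database, and a WAN outage only delays delivery rather than failing the user's request.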

Out of the box you have SQL Server Replication. Sounds like a pair of filtered transactional replication publications can do the job. Transactional replication has a low overhead on the publisher and can ensure transactional consistency of the published changes.
Nathan raises some very valid points about the need for a more loosely coupled solution. Service Broker can fit that shoe quite well with its loosely coupled, asynchronous nature, and it provides a headache-free upgrade path, since SSB is compatible between SQL Server versions and editions. But this freedom comes at the cost of leaving the heavy lifting of actually detecting the changes and applying them to the tables to you, as application code, which is not a trivial feat.
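For flavour, here is a sketch of what the application-code side of that might look like: sending one change notification onto a Service Broker conversation from C#. The service, contract and message-type names are invented, and it assumes those objects were already created with CREATE MESSAGE TYPE / CREATE CONTRACT / CREATE QUEUE / CREATE SERVICE (with VALIDATION = NONE on the message type):

```csharp
using System.Data;
using System.Data.SqlClient;

// Hypothetical sender: enqueues one change notification on an existing Service Broker service.
public static class ChangeNotifier
{
    private const string SendSql = @"
DECLARE @handle UNIQUEIDENTIFIER;
BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE [//Intranet/ChangeSender]
    TO SERVICE   '//Web/ChangeReceiver'
    ON CONTRACT  [//Sync/ChangeContract]
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @handle
    MESSAGE TYPE [//Sync/RowChanged] (@payload);";

    public static void Send(string connectionString, byte[] payload)
    {
        using var connection = new SqlConnection(connectionString);
        using var command = new SqlCommand(SendSql, connection);
        // Payload is whatever serialized change description you choose (XML, JSON, ...).
        command.Parameters.Add("@payload", SqlDbType.VarBinary, -1).Value = payload;
        connection.Open();
        command.ExecuteNonQuery();
        // A production version would reuse dialogs and END CONVERSATION appropriately.
    }
}
```

The receiving side still has to RECEIVE from its queue and apply the change to the target tables, which is exactly the "heavy lifting as application code" mentioned above.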

Related

DB recommendation - Portable, Concurrent (multiple read only, one write)

I'm looking for a portable database solution I can use with a website that is designed to handle service outages. I need to retrieve a list of users from SQL Server nightly and upsert their details into a portable database. It's roughly 250,000 users (and growing) and each one has probably 25 fields that are required. Of those fields, I'd say fewer than 5 need to be searched on. The rest just need retrieving.
The idea is, in times of a service outage, we can use a website that's designed to work from the portable database rather than SQL Server. Our long term goal, is to move to the cloud and handle things in an entirely different way, but for the short term this is our aim.
The website is going to be a .NET Core web API, so it will be accessed by multiple users on multiple threads. The website will only ever need read access; it will not be updating these details whatsoever.
To keep the portable database up to date I'm thinking of having another application that just runs nightly to update the data. Our business runs 24 hours (albeit quieter overnight), so there is a potential the updater is running while the website is in use. While a service outage would suggest SQL Server is down, this may not be the case; there are other factors in play that could cause what we would describe as outages. This will be the only piece of software updating the database.
I've tried using LiteDB but I couldn't get it working in a way that met my concurrency requirements. It did seem to do some of the job, and was easy to get running. However, I'd often run into locked files due to the nature of the web API. I did work out a solution for that, but then the updater app couldn't access the database file.
Does anyone have any recommendations I can look into?
Given the description of the problem (1 table, 250k rows with, I assume, a relatively fast growth rate) and the requirements, I don't think a relational database is what you are looking for.
I think NoSQL databases, or, more specifically, document-oriented databases, are a better fit for your requirements. There are many choices: MongoDB, Cassandra, CouchDB, ... the choice is yours.
Personally I have some experience with ElasticSearch (https://www.elastic.co/elasticsearch), which is quite easy to learn, is portable (runs on Linux, Windows, containers, etc.), is scalable, and is fast. I mean really, really fast: you can get results in 10-20 milliseconds (even less, sometimes).
The NEST NuGet package acts as a high-level client for working with ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/client/net-api/7.x/nest-getting-started.html).
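Roughly, using NEST looks something like the sketch below. The index name, document shape and query field are made up to match the scenario (nightly upsert of users, read-only search from the web API); only the NEST calls themselves are real:

```csharp
using System;
using Nest;

// Hypothetical document type for the nightly user export.
public class User
{
    public int Id { get; set; }
    public string Surname { get; set; }
    public string Email { get; set; }
}

public static class UserSearchSketch
{
    public static void Main()
    {
        // Assumes Elasticsearch is listening locally; "users" is an invented index name.
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
            .DefaultIndex("users");
        var client = new ElasticClient(settings);

        // Nightly updater: index (upsert) one user document, keyed by its Id.
        client.IndexDocument(new User { Id = 1, Surname = "Smith", Email = "smith@example.com" });

        // Read-only web API: search on one of the few fields that need to be searchable.
        var response = client.Search<User>(s => s
            .Query(q => q.Match(m => m.Field(f => f.Surname).Query("smith"))));

        Console.WriteLine($"Found {response.Documents.Count} user(s)");
    }
}
```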

keeping databases in sync (after write/update) across regions/zones

I have to write a web service in PHP to serve three different zones (cities or countries). Each zone will have its own machine running an instance of this web service, and behind every web service is a database that is an exact clone/copy in each region; the web service serves clients with data from its database. The main reason for multiple instances of the web service is to distribute client load.
The clients can make read and write calls via web service APIs.
Write calls will modify the database for that instance, but the change has to be applied as soon as possible to the databases in all other zones as well. Since the databases in each zone are exact copies, changes in one database must be synced to the databases in all other zones.
I presume the write calls must go to some kind of master server which coordinates among all the web services etc. But I am sure this pattern is quite common and some solution is already out there.
Please advise whether there is any database- or application-level technique that would keep the databases in sync on write calls, so that a modification or addition is reflected in all instances of the database. I can choose the database of my choice; my primary choice would be MySQL or PostgreSQL, but I can change to another database if it solves this issue.
You're right, this pattern is quite common and there is a name for it: synchronous master-master replication. Most modern RDBMSs support it:
PostgreSQL supports it through PgCluster: https://wiki.postgresql.org/wiki/PgCluster
MySQL: https://www.howtoforge.com/mysql_master_master_replication
But before implementing it straight away I'd recommend reading more about different types of replication, their pros and cons:
https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
https://dev.mysql.com/doc/refman/8.0/en/replication.html
Synchronous Master-Master replication will be quite slow, especially in a multi-zone scenario, so you might consider other techniques:
Asynchronous replication
Sharding/Partitioning
A mix of sharding and replication
There is a very good book on different distributed techniques(including sharding and replication) - "Designing Data Intensive Applications" by Martin Kleppmann.
Replication techniques are definitely worth looking at, but there can be a certain amount of technical overhead and cost to replication. I work for a company called Redactics (https://www.redactics.com), and we came up with a simpler solution that is a sort of near-real-time replication based on delta updates, using a pure SQL approach.
There are certainly pros and cons to both approaches, and I'm not trying to push Redactics hard if it is not the most appropriate solution for your needs, but Redactics simply tracks the most recent primary keys and uses modification timestamps to find new and changed records, and then copies them over. You can run the sync fairly often without a lot of load since it is just a delta update. Obviously any workflow can break, but repairing broken replication can be tricky, so we like this approach, and we like running these sync workflows within your own infrastructure.
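To illustrate the general high-water-mark idea (not Redactics' actual implementation), here is a sketch in C# against SQL Server. The table and column names (dbo.Users, UpdatedAt) are hypothetical, and the row-by-row MERGE is kept simple for readability:

```csharp
using System;
using System.Data.SqlClient;

// Illustrative high-water-mark delta sync: copy rows changed since the last run.
public static class DeltaSync
{
    public static DateTime CopyChangedUsers(string sourceCs, string targetCs, DateTime lastSyncedUtc)
    {
        var newWatermark = lastSyncedUtc;

        using var source = new SqlConnection(sourceCs);
        source.Open();

        using var select = new SqlCommand(
            "SELECT Id, Name, UpdatedAt FROM dbo.Users WHERE UpdatedAt > @since ORDER BY UpdatedAt", source);
        select.Parameters.AddWithValue("@since", lastSyncedUtc);

        using var reader = select.ExecuteReader();
        using var target = new SqlConnection(targetCs);
        target.Open();

        while (reader.Read())
        {
            // Upsert each changed row into the replica; MERGE keeps it a single statement.
            using var upsert = new SqlCommand(@"
MERGE dbo.Users AS t
USING (VALUES (@id, @name, @updated)) AS s (Id, Name, UpdatedAt)
    ON t.Id = s.Id
WHEN MATCHED THEN UPDATE SET Name = s.Name, UpdatedAt = s.UpdatedAt
WHEN NOT MATCHED THEN INSERT (Id, Name, UpdatedAt) VALUES (s.Id, s.Name, s.UpdatedAt);", target);
            upsert.Parameters.AddWithValue("@id", reader.GetInt32(0));
            upsert.Parameters.AddWithValue("@name", reader.GetString(1));
            upsert.Parameters.AddWithValue("@updated", reader.GetDateTime(2));
            upsert.ExecuteNonQuery();

            newWatermark = reader.GetDateTime(2);
        }

        return newWatermark; // persist this watermark somewhere for the next run
    }
}
```

Because each run only moves the delta, the job can be scheduled frequently without putting much load on either side; the obvious trade-off is that deletes and clock skew need separate handling.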

To CouchDB or not to?

Note: (I have investigated CouchDB for some time and need some actual experiences).
I have an Oracle database for a fleet tracking service, and some of its stats are:
100 GB db
Huge insertion/sec (our received messages)
Reliable replication (via Oracle streams on 4 servers)
Heavy complex queries.
Now the question: Can CouchDB be used in this case?
Note: Why I thought of CouchDB?
I have read about its ability to scale horizontally very well. That's very important in our case.
Since it's schema-free, we can handle changes more easily, given that we have a lot of changes in different tables and stored procedures.
Thanks
Edit I:
I need transactions too, but I can tolerate other solutions as well. And if there is a little delay in replication, that would be no problem, as long as the replication is guaranteed.
You are enjoying the following features with your database:
Using it in production
The data is naturally relational (related to itself)
Huge insertion rate (no MVCC concerns)
Complex queries
Transactions
These are all reasons not to switch to CouchDB.
Of course, the story is not so simple. I think you have discovered what many people never learn: complex problems require complex solutions. We cannot simply replace our database and take the rest of the month off. Sure, CouchDB (and BigCouch) supports excellent horizontal scaling (and cross-datacenter replication too!) but the cost will be rewriting a production application. That is not right.
So, where can CouchDB benefit you?
I suggest that you begin augmenting your application with CouchDB applications. Deploy CouchDB, import your data into it, and build non mission-critical applications. See where it fits best.
For your project, these are the key CouchDB strengths:
It is a small, simple tool—easy for you to set up on a workstation or server
It is a web server. It integrates very well with your infrastructure and security policies.
For example, if you have a flexible policy, just set it up on your LAN
If you have a strict network and firewall policy, you can set it up behind a VPN, or with your SSL certificates
With that step done, it is very easy to access now. Just make HTTP or HTTPS requests. Whether you are importing data from Oracle with a custom tool, or using your web browser, it's all the same.
Yes! CouchDB is an app server too! It has a built-in administrative app to explore data, change the config, etc. (like a built-in phpMyAdmin). But for you, the value will be building admin applications and reports as simple, traditional HTML/JavaScript/CSS applications. You can get as fancy or as simple as you like.
As your project grows and becomes valuable, you are in a great position to grow, using replication
Either expand the core with larger CouchDB clusters
Or, replicate your data and applications into different data centers, or onto individual workstations, or mobile phones, etc. (The strategy will be more obvious when the time comes.)
CouchDB gives you a simple web server and web site. It gives you a built-in web services API to your data. It makes it easy to build web apps. Therefore, CouchDB seems ideal for extending your core application, not replacing it.
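As a taste of the plain HTTP interface mentioned above, here is a minimal sketch that writes one position document. The database and document names are made up, and it assumes the "fleet" database already exists (a single PUT /fleet creates it):

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Everything in CouchDB is plain HTTP + JSON; no driver or client library is required.
public static class CouchSketch
{
    public static async Task Main()
    {
        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:5984/") };

        // Create (or update) one document in the "fleet" database by PUTting JSON to its id.
        var json = "{\"truck\":\"T-42\",\"lat\":35.70,\"lon\":51.42,\"ts\":\"2011-01-01T00:00:00Z\"}";
        var response = await http.PutAsync("fleet/position-00001",
            new StringContent(json, Encoding.UTF8, "application/json"));

        Console.WriteLine(await response.Content.ReadAsStringAsync()); // {"ok":true,"id":...,"rev":...}
    }
}
```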
I don't agree with this answer.
I think CouchDB suits the fleet-tracking use case especially well, due to its distributed nature. Moreover, the unreliable nature of the GPRS connections used for transmitting position data makes the offline-first paradigm of CouchApps the perfect partner for your application.
For uploading data from a truck, the insertion rate can benefit hugely from CouchDB replication and bulk inserts, especially if performed on SSD-based CouchDB hosting.
For downloading data to a truck, CouchDB provides filtered replication, allowing each truck to download only the data it really needs instead of the whole database.
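Kicking off such a filtered, continuous pull replication is just a POST to the standard _replicate endpoint. In this sketch the filter function name ("trucks/by_truck") and the truck_id parameter are hypothetical; they would live in a design document on the central server:

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Starts a continuous, filtered pull replication so one truck only receives its own documents.
public static class TruckReplication
{
    public static async Task StartAsync(string truckId)
    {
        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:5984/") };

        var body = "{"
            + "\"source\": \"https://central.example.com/fleet\","
            + "\"target\": \"fleet\","
            + "\"filter\": \"trucks/by_truck\","
            + $"\"query_params\": {{ \"truck_id\": \"{truckId}\" }},"
            + "\"continuous\": true"
            + "}";

        var response = await http.PostAsync("_replicate",
            new StringContent(body, Encoding.UTF8, "application/json"));
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```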
Regarding complex queries, NoSQL databases are more flexible and can perform much faster than relational databases. It's only a matter of structuring and querying your data sensibly.

Replication vs Sync Framework vs Service Broker

I've asked about each of these technologies separately, and really haven't found a suitable answer.
We have a server in our central office running SQL Server 2005 Enterprise that has several large (large in the sense that DSL is the limiting factor) databases that we need local copies of at each of our locations. We currently have a few dozen locations, and are needing to bring even more online. The total number of locations we'll need to sync these databases to will be in the several hundreds in the next 2 years.
We are trying to overcome issues with the WAN connection at each location. These are DSL lines and the wiring at the locations isn't always the best. We currently have issues with some of the locations going down as often as every hour. While we are working to resolve these issues with rewiring and assistance from the local telcos, it mainly highlights the problem at hand: we need a two-way sync that can handle being occasionally-connected.
We tried transactional replication for a while, and while it worked some of the time, it was too high-maintenance for us, and it seemed to error out randomly and often with no apparent explanation, forcing us to reinitialize subscriptions (which could take upwards of 4 hours, assuming the location stayed connected long enough to get the entire snapshot in one go). We've looked at rolling our own solution from scratch, but I don't feel this would be the best idea given the scale and reliability we need.
So far we've also looked at Sync Framework, and as suggested by someone else, Service Broker. Sync Framework seems a better fit, but I was told that Service Broker scales better and is more reliable? I can't find any empirical data on the overhead involved with Sync Framework or Service Broker, so it's proving impossible to compare the two in this regard.
What we really need is a two-way sync between the central office server and a remote client that can run autonomously and can report to an admin in the event of a failure that requires our intervention.
There are so many possible solutions to this problem, all involving completely different technologies, that I need a fresh eye on this.
What do you think would be the optimal solution for our situation, and why?
EDIT: Obviously, upgrading to SQL Server 2008 would solve this problem easily. However, we would like to try less expensive options first.
I don't have any hard data to offer on this, but we used the Sync Framework on a project a while ago. My experience with it is really bad. It's slow (even when synchronizing relatively small tables across a LAN), scales terribly and requires a lot of work to manually handle error conditions (it'll happily produce larger packets than WCF can handle by default, and it is only able to split updates into batches when syncing one way, not the other). And it only works with a few select databases (the client must use MS SQL Compact Edition, as I recall), unless you're willing to write your own SyncAdapter.
Overall, a lot of work just to get a fragile and inefficient solution to your problem. I wouldn't recommend it.
You can use the Sync Framework with SQL Express 2008 R1/R2 on one end and a multi-tenant SQL Server Enterprise database on the central end. Below is a sample application for n-tier sync over a secure WCF channel. You could write a Windows service to sync data from the back end:
http://www.rajneeshnoonia.com/blog/2012/03/n-tier-sync-framework/
It should be capable of handling a large number of clients (thousands).
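For orientation, the core two-way sync call in the Sync Framework looks roughly like the sketch below (this is not the linked n-tier/WCF sample). It assumes Sync Framework 2.1 providers and a scope named "UsersScope" that has already been provisioned on both databases with SqlSyncScopeProvisioning; the scope name and connection strings are placeholders:

```csharp
using System;
using System.Data.SqlClient;
using Microsoft.Synchronization;
using Microsoft.Synchronization.Data.SqlServer;

// Two-way sync of one already-provisioned scope between a branch office and the central server.
public static class BranchSync
{
    public static void Run(string branchCs, string centralCs)
    {
        var orchestrator = new SyncOrchestrator
        {
            LocalProvider  = new SqlSyncProvider("UsersScope", new SqlConnection(branchCs)),
            RemoteProvider = new SqlSyncProvider("UsersScope", new SqlConnection(centralCs)),
            Direction      = SyncDirectionOrder.UploadAndDownload
        };

        SyncOperationStatistics stats = orchestrator.Synchronize();
        Console.WriteLine($"Uploaded {stats.UploadChangesTotal}, downloaded {stats.DownloadChangesTotal} changes");
    }
}
```

A Windows service that runs this on a schedule, retries on failure, and reports the statistics (or exceptions) to an admin would cover the "autonomous with failure reporting" requirement.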
I think we'll look into the SQL Server 2008 upgrade route. It seems the native change tracking support will be the easiest way to accomplish this.

Keeping distributed databases synchronized in an unstable network

I'm facing the following challenge:
I have a bunch of databases in different geographical locations where the network may fail a lot (I'm using cellular network). I need to keep all the databases synchronized but there is no need to be in real time. I'm using Java but I have the freedom to choose any free database.
How can I achieve this?
This is a problem with a quite established corpus of research (of which people are apparently unaware). I suggest not reinventing a poor, defective wheel unless absolutely necessary (for example, unless your requirements are so unusual that they allow a trivial solution).
Some keywords: replication, mobile DBMSs, distributed disconnected DBMSs.
These research papers are also relevant (as examples of this research field):
Distributed disconnected databases,
The dangers of replication and a solution,
Improving Data Consistency in Mobile Computing Using Isolation-Only Transactions,
Dealing with Server Corruption in Weakly Consistent, Replicated Data Systems,
Rumor: Mobile Data Access Through Optimistic Peer-to-Peer Replication,
The Case for Non-transparent Replication: Examples from Bayou,
Bayou: replicated database services for world-wide applications,
Managing update conflicts in Bayou, a weakly connected replicated storage system,
Two-level client caching and disconnected operation of notebook computers in distributed systems,
Replicated document management in a group communication system,
... and so on.
I am not aware of any databases that will give you this functionality out of the box; there is a lot of complexity here due to the need for eventual consistency and conflict resolution (e.g., what happens if the network gets split into two halves, and you update something to the value 123 while I update it on the other half to 321, and then the networks reconnect?).
You may have to roll your own.
For some ideas on how to do this, check out the design of Yahoo's PNUTS system: http://research.yahoo.com/node/2304 and Amazon's Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
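Just to make the conflict-resolution problem concrete, here is a minimal sketch of the simplest reconciliation strategy, last-writer-wins on a version number with a timestamp tie-break. The type and field names are made up, and this is deliberately cruder than what PNUTS or Dynamo do (Dynamo, for instance, uses vector clocks) precisely because last-writer-wins silently drops one of the conflicting writes:

```csharp
using System;

// A versioned value, as each replica might store it.
public record Versioned<T>(T Value, long Version, DateTime WrittenUtc);

public static class ConflictResolution
{
    // Last-writer-wins reconciliation when two replicas reconnect after a partition:
    // prefer the higher version, break ties on timestamp. One write is lost by design.
    public static Versioned<T> Resolve<T>(Versioned<T> a, Versioned<T> b)
    {
        if (a.Version != b.Version) return a.Version > b.Version ? a : b;
        return a.WrittenUtc >= b.WrittenUtc ? a : b;
    }

    public static void Main()
    {
        var mine   = new Versioned<int>(123, 7, new DateTime(2010, 1, 1, 12, 0, 0, DateTimeKind.Utc));
        var theirs = new Versioned<int>(321, 7, new DateTime(2010, 1, 1, 12, 0, 5, DateTimeKind.Utc));
        Console.WriteLine(Resolve(mine, theirs).Value); // 321 wins on the later timestamp
    }
}
```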
Check out SymmetricDS. SymmetricDS is web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time. The software was designed to scale for a large number of databases, work across low-bandwidth connections, and withstand periods of network outage.
I don't know your requirements or your apps, but this isn't a quick-answer type of question. I'm very interested to see what others have to say. However, I have a suggestion that may or may not work for you, depending on your requirements and situation. In particular, this will not help if your users need to use the app even when the network is unavailable (offline access).
Keeping a bunch of small databases synchronized is a fairly complex task to do correctly. Is there any possibility of just having one centralized database, and either having the client applications connect directly to it or (my preferred solution) write some web services to handle accessing/updating data rather than having a bunch of client databases?
I realize this limits offline access, but there are various caching strategies you can use. (Which of course, leads you back to your original question.)
