Best Practice for synchronizing common distributed data - database

I have an internet application that supports an offline mode, where users might create data that will be synchronized with the server when they come back online. Because of this I'm using UUIDs for identity in my database, so disconnected clients can generate new objects without fear of using an ID already taken by another client. However, while this works great for objects owned by a single user, there are objects that are shared by multiple users. For example, tags used by a user might be global, and there's no way the remote database could hold all possible tags in the universe.
Suppose an offline user creates an object and adds some tags to it, and those tags don't exist in the user's local database, so the software generates a UUID for each of them. When those tags are synchronized, there needs to be a resolution process to handle any overlap: some way to match up existing tags in the remote database with the local versions.
One way is to use a process by which global objects are resolved by a natural key (the name, in the case of a tag), and the local database has to replace its existing object with the one from the global database. This can get messy when there are many connections to other objects. Something tells me to avoid this.
Another way to handle this is to use two IDs: one global ID and one local ID. I was hoping that using UUIDs would help me avoid this, but I keep going back and forth between using a single UUID and using two split IDs. Choosing this option makes me wonder if I've let the problem get out of hand.
Another approach is to track all changes through the non-shared objects: in this example, the object the user assigned the tags to. When the user synchronizes their offline changes, the server might replace their local tag with the global one. The next time this client synchronizes with the server, it detects a change in the non-shared object. When the client pulls down that object, it receives the global tag; the software simply resaves the non-shared object pointing to the server's tag, orphaning the local version. Some issues with this are the extra round trips needed to fully synchronize, and the orphaned data left behind in the local database. Are there other issues or bugs that could happen while the system is between synchronization states (e.g. talking to the server and sending it local UUIDs for objects)?
Another alternative is to avoid shared objects altogether. In my software that could be an acceptable answer. I'm not doing a lot of sharing of objects across users, but that doesn't mean I won't be in the future, which means choosing this option could paralyze my software later should I need to add these types of features. There are consequences to this choice, and I'm not sure I've completely explored them.
So I'm looking for any sort of best practice, existing algorithms for handling this type of system, guidance on choices, etc.

Depending on what application semantics you want to offer users, you may pick different solutions. For example, if you are actually talking about tagging objects created by an offline user with a keyword, and you want to share the tags across multiple objects created by different users, then using the tag's text as its identity is fine, as you suggested. Once everyone's changes are merged, tags with the same text, say "THIS IS AWESOME", will be shared.
There are other ways to handle disconnected updates to shared objects. SVN, CVS, and other version control systems try to resolve conflicts automatically, and when they cannot, they just tell the user there is a conflict. You can do the same: tell the user there have been concurrent updates and that they have to handle resolution themselves.
Alternatively, you can log updates as units of change and try to compose the changes together. For example, if your shared object is a canvas, and your application semantics allow shared drawing on the same canvas, then a disconnected update that draws a line from point A to point B and another disconnected update that draws a line from point C to point D can be composed. If you keep those two updates as just two operations, you can order them, and on re-connection each user uploads all of its disconnected operations and applies the missing operations from other users. You probably want some kind of ordering rule, perhaps based on a version number.
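A minimal sketch of that operation-log idea (in Python, with made-up names; the ordering rule here is simply base version, then client id):

    from dataclasses import dataclass

    # Hypothetical record of one disconnected operation in the canvas example.
    @dataclass(frozen=True)
    class DrawLine:
        op_id: str          # globally unique, e.g. a UUID generated offline
        base_version: int   # the server version the client last saw
        client_id: str
        start: tuple
        end: tuple

    def merge_logs(server_log, *client_logs):
        """Compose disconnected updates by ordering them deterministically.

        Operations already known to the server keep their order; offline
        operations are appended, sorted by the version they were based on
        and then by client id, so every replica converges on the same log.
        """
        seen = {op.op_id for op in server_log}
        pending = [op for log in client_logs for op in log if op.op_id not in seen]
        pending.sort(key=lambda op: (op.base_version, op.client_id, op.op_id))
        return list(server_log) + pending

On re-connection each client would upload its log, pull the merged log back, and replay the operations it had not yet seen.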
Another alternative: if updates to shared objects cannot be automatically reconciled, and your application semantics do not support notifying the user and asking them to resolve conflicts caused by disconnected updates, then you can use a version tree. Each update to a shared object creates a new version, with the past version as its parent. When there are disconnected updates to a shared object from two different users, two separate child versions (leaf nodes) result from the same parent version. If your application's internal representation of state is this version tree, then its internal state remains consistent despite disconnected updates, and you can handle the two branches of the version tree in some other way (e.g. letting users know about the branches and creating tools for them to merge branches, as in source control systems).
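Sketched very roughly (illustrative Python, not tied to any particular store), the version tree only ever grows, and concurrent offline edits just become sibling leaves:

    class Version:
        """One node in the version tree of a shared object."""
        def __init__(self, content, parent=None):
            self.content = content
            self.parent = parent
            self.children = []
            if parent is not None:
                parent.children.append(self)

    def commit(base_version, new_content):
        # Every update, connected or not, adds a child under the version the
        # client was working from; two offline edits of the same base simply
        # produce two siblings instead of an error.
        return Version(new_content, parent=base_version)

    def branch_heads(root):
        """The leaves are the branches a user or merge tool must reconcile."""
        if not root.children:
            return [root]
        return [leaf for child in root.children for leaf in branch_heads(child)]

If two users commit against the same base while offline, branch_heads returns both of their versions, and the application can surface them as branches to be merged.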
Just a few options. Hope this helps.

Your problem is quite similar to that of versioning systems like SVN; you could take a cue from them.
Each user would have a set of personal objects plus any shared objects that they need. Locally, they work as if they own all the objects.
During sync, the client would first download any changes to the objects and automatically synchronize whatever is obvious. In your example, if there is a new tag coming from the server with the same name, it would update the UUID accordingly on the local system.
This would also be a good place to detect and handle cases like data committed from another client, but by the same user.
Once the client has an updated and merged version of the data, you can do an upload.
There will be two round trips, but I see no way of avoiding that without overcomplicating the data structure and introducing potential pitfalls in the way you do the sync.
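Something like this, as a rough sketch (all of the helper names on local_db and server are made up for illustration):

    def sync(local_db, server):
        """Hypothetical two-phase sync: pull and reconcile, then push."""
        # Round trip 1: download server-side changes and reconcile shared
        # objects (tags) by their natural key, the tag name.
        for server_tag in server.tags_changed_since(local_db.last_sync):
            local_tag = local_db.find_tag_by_name(server_tag.name)
            if local_tag is not None and local_tag.uuid != server_tag.uuid:
                # Re-point every local reference from the locally generated
                # UUID to the server's UUID, then drop the local duplicate.
                local_db.remap_tag(old=local_tag.uuid, new=server_tag.uuid)
            local_db.upsert_tag(server_tag)

        # Round trip 2: upload local changes, which now only reference
        # UUIDs the server either already knows or can safely create.
        server.push(local_db.pending_changes())
        local_db.mark_synced()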

As a totally out of left-field suggestion, I'm wondering if using something like CouchDB might work for your situation. Its replication features could handle a lot of your online/offline synchronisation problems for you, including mechanisms to allow the application to handle conflict resolution when it arises.

Related

Domain Driven Design, should I use multiple databases or a single source of truth

I'm about to propose some fundamental changes to my employers and would like the opinion of the community (opinion because I know that getting a solid answer to something like this is a bit far-fetched).
Context:
The platform I'm working on was built by a tech consultancy before I joined. While I was being onboarded they explained that they used DDD to build it: there are two domains, the client side and the admin side; each has its own database, its own GraphQL server, and its own back-end and front-end frameworks. The data between the tables is synchronized through an HTTP service that's triggered by the GraphQL server on row insertions, updates, and deletes.
Problem:
All of the data present in the client domain is also found in the admin domain; there's no domain-specific data there. Synchronization is a mess and is buggy. The team isn't large enough to manage all the resources and keep track of the different schemas.
Proposal:
Remove the client database and GraphQL server, and have a single source-of-truth database for all current and potentially future applications. Rethink the schema: split the tables that need to be split, consolidate the ones that should be joined, and create new tables according to the actual current business flow.
Am I justified in my proposal, or was the tech consultancy doing the right thing and I'm sending us backwards?
Normally you have a database, or schema, for each separate bounded context. That means the initial idea of the consultancy was correct.
What's not correct is the way the consistency between the two is managed. You don't do it on table changes, but with services inside one (or both) of the domains listening to events and taking the update actions. It's a lot of work anyway, because you have to update the event handlers on every change (to the events or the table structure).
This code is what's called an anti-corruption layer, and that's exactly what it does: it prevents any corruption between a domain and the copy of its data held in another domain.
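As a rough illustration (the event shape, field names, and repository are all made up), such a listener inside the client domain might look like this:

    # Anti-corruption layer sketch: the client domain listens to events from
    # the admin domain and translates them into its own read-only model.

    def on_product_updated(event, client_read_models):
        # Translate the admin domain's representation into the client
        # domain's shape; this code only ever writes to the client's copy
        # and never back to the admin domain's data.
        product = {
            "id": event["product_id"],
            "display_name": event["name"].strip(),
            "price": event["price_cents"] / 100,
        }
        client_read_models.upsert(product)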
That said, as you pointed out, your team is small, and maintaining such a layer (and hence its code) could cost a lot of energy. But remember that once it's done, you only have to update it when needed.
Anyway, back to the proposal: you could also take this route. What you should (must, I would say) ensure is that in each domain the external tables are accessed only by a few services or queries, and that this code never, ever modifies the content it accesses. Never. But I suppose you already know this.
Nothing is written in stone; the rules should always be adapted to the real context. Two separate databases mean more work, but also a much better separation of the domains: it can never happen that someone accidentally modifies the content of the other domain's tables. On the other hand, one database means less work, but also much more care about what the code does.

Flutter: Shared Preference or Scoped Model for speed

I will be storing many small data strings in both a scoped model and shared preferences. My question is: in order to retrieve this data, are there any significant speed differences between these two sources?
Since I will be doing many "sets" and "gets", I would like to know if anybody has seen any difference in performance using one over the other.
I understand shared preferences are persistent and the scoped model is not; however, after the app is loaded the data is synced, and I would rather access the data from the fastest source.
Firstly, understand that they are not alternatives. You will likely want to back certain parts of your model with shared preferences, and this can be done behind a scoped model (or BLoC, etc.). Note that simply updating a shared preference will not trigger a rebuild, which is why you should use one of the shared-state patterns and have it persist to shared preferences those items it needs to keep.
Shared preferences are actually implemented as an in-memory map that triggers a background write to storage on each update, so 'reads' from shared preferences are inexpensive.
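Conceptually it behaves roughly like the sketch below (an illustration of the mechanism in Python rather than Dart, not the plugin's actual code):

    import json
    import threading

    class PreferencesCache:
        """Conceptual model: reads hit an in-memory map loaded once at
        startup; writes update the map and persist in the background."""

        def __init__(self, path):
            self._path = path
            try:
                with open(path) as f:
                    self._map = json.load(f)   # loaded once when the app starts
            except FileNotFoundError:
                self._map = {}

        def get(self, key):
            return self._map.get(key)          # cheap: no disk access

        def set(self, key, value):
            self._map[key] = value             # immediately visible to reads
            # the write to storage happens off the critical path
            threading.Thread(target=self._flush, daemon=True).start()

        def _flush(self):
            with open(self._path, "w") as f:
                json.dump(self._map, f)

So for a value that is already loaded, reading it from shared preferences and reading it from a scoped model are both in-memory lookups; the scoped model's real job is change notification, not speed.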

How to structure/coordinate multiple databases?

Imagine a large corporation with dozens of companies, each with their own website, and each website with its own unique functional requirements
Most data on each website will be specific to that website
Each website can edit its own data
Some data will be shared across all websites
There will be a central CMS that is allowed to edit this data, but other websites can read and use that data
e.g. say you're planning the infrastructure for a company that owns multiple sub-companies that make different kinds of products, some in the same category (cereal, food), others in completely different categories (books, instruments). Some are marketing websites, some are for CRM, some are online stores
there is a list of regulatory requirements that affect all products
each company should manage the compliance status of its own products against each requirement
when a new requirement surfaces, details regarding that requirement should only be entered once
How would the multiple databases be coordinated?
edit: added more info per Bob's suggestions
Thanks for the incredibly insightful questions!
compliance data is not shared; it's siloed within each site
shared data lives only in the one enterprise-wide database; it will mostly be "types of [thing]"
no conclusive list of instances where it will be used, but currently it'd be to populate CMS dropdowns for individual sites
changes to shared data would occur a few times a year
ideally changes would be reflected within a few minutes, but an hour or so would be acceptable
very low volume of shared data
All DBs will be new; the decision on which DB is pending current investigation
Sub-systems will expose a REST API
Here are some ways I have seen this handled; you need to think about the implications of each structure based on the details of your particular business domain. All can work, but all have to be carefully set up if they are going to work.
One database for shared information and one per client for client-specific information. Set up the overall application so that the first thing entered at login is the client, and it then connects to the correct client database. You might also need a way to change the client if some users will handle multiple clients.
Separate servers for each client if they need to be completely siloed. Database changes are made by script (kept in source control) and applied to each server as needed. The changes to the central database might then have a job that runs to push any data changes to the other servers.
All the data in one database, but making sure each table has a client_id so that the data is always filtered correctly by client. You can set up separate views per client, so that users can only see the clients they are supposed to see. This only works if the data for each client is substantially in the same form (there's a sketch of this option after this answer).
And since you are in a regulatory environment, I strongly urge you to create, for each database, an audit database that is updated by database triggers (never audit from the application; you will lose changes to the data).
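For the third option (one database, client_id on every table), the important part is that no query can run unscoped. A small sketch, with made-up table and column names:

    def fetch_products(connection, client_id, status=None):
        """All reads go through helpers like this, so a row can never leak
        across clients: the client_id filter is not optional."""
        if not client_id:
            raise ValueError("client_id is required for every tenant query")
        sql = "SELECT id, name, status FROM products WHERE client_id = ?"
        params = [client_id]
        if status is not None:
            sql += " AND status = ?"
            params.append(status)
        return connection.execute(sql, params).fetchall()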
I agree with Chris that, even after both sets of questions, there is still a big space of possible solutions. For instance, if the databases were the same technology, and the shared data were stored in the same way in each one, you could do DB-level replication from the central DB to the others. Is it OK to have two separate DBs per application (one with shared data and one with non-shared data)? This would influence the kind of replication.
Or you could have a purely code-based solution, where clicking publish in a GUI that updates the central DB also calls a set of APIs that update the other DBs. Or micro-services: updating the central DB also creates a message on a shared queue, which is picked up by services that each look after a different DB and apply the updates in whatever form makes sense for that DB.
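In the micro-services variant the flow can stay very small; roughly (the queue client and handler names are invented for the example):

    # Central CMS side: saving a shared record also publishes a change message.
    def save_requirement(central_db, queue, requirement):
        central_db.save(requirement)
        queue.publish("shared-data-changes", {
            "type": "requirement_upserted",
            "payload": requirement,
        })

    # Each site runs a small consumer that applies the change to its own
    # store, in whatever shape that store needs.
    def consume_shared_changes(queue, site_db):
        for message in queue.subscribe("shared-data-changes"):
            if message["type"] == "requirement_upserted":
                site_db.upsert_requirement(message["payload"])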
It depends on (among the things already mentioned) what your organisation's technology strategy is, what technology and skills you already have in-house, and so on.
So this is as much an architecture question as it is a db question.
I don't think this question is sufficiently clear to get a single answer. However there are a few possibilities.
In many cases where you have shared data, you want to have a single point of ownership of that information. It could be in a database, in an Excel file (which can then be turned into CSV and periodically loaded into all the DBs), or some other form. The specifics depend on what exactly is shared.
Now, in this case it sounds like you are going to have some sort of legal department in charge of some shared information; they will manage that data, which will then be shared with the other sites. This might be done with an application they manage which aggregates information from the other companies, or it could be data that is pushed to their systems.
A final point:
Software is at its best when it facilitates human solutions to human problems, not when it tries to solve those problems directly. In these cases, you probably want a good human solution in place and then to look at what software can do to support that. A lot of the issues (who owns the information?) will already have been solved and you will be simply automating what is already done.

Store some relational data in a config-file instead of database

Is it recommended to store certain data in a config file while referencing it by ID in the database? Specifically, when storing information that's rarely or never updated.
For example, I may have different roles for users in my application. As these roles are hardly ever (or never) updated, is it really necessary to store them in a roles table and reference them by ID from users? I could just as well have the roles defined in an array. That way the database wouldn't need to be queried every time just to get role information (which is required on every page).
Another option would be to cache the whole bunch of role-information from the database. Not sure if this is any better than simply having an array stored somewhere.
The same question can be asked about storing any application-related data that's edited not by the users but by the developers.
Another option would be to cache the whole bunch of role-information from the database. Not sure if this is any better than simply having an array stored somewhere.
It isn't, until you have a bug in the application, a network failure or a good old power cut. At that point, the cached solution will just re-read the "master" data from the database and keep churning away, safe in the knowledge that the data was protected by database transactions and a backup regime.
If such a calamity befell the "separate array" solution, it could corrupt the array storage or leave it inconsistent with the rest of the data (no referential integrity).
On top of that, storing data centrally means changes are automatically visible to all clients. You can store files "centrally" via shared folders or FTP and such, but why not use the database if it's already there?
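A tiny sketch of that cached approach (table and column names are illustrative): the database stays the master copy, and the in-process cache just saves a query per page.

    _roles_cache = None

    def get_roles(db):
        """Roles live in the database (transactional, backed up, visible to
        all clients); the cache only avoids re-querying on every page."""
        global _roles_cache
        if _roles_cache is None:
            rows = db.execute("SELECT id, name FROM roles").fetchall()
            _roles_cache = {role_id: name for role_id, name in rows}
        return _roles_cache

    def invalidate_roles_cache():
        # Call this on the rare occasion a role is added or renamed.
        global _roles_cache
        _roles_cache = None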

Does anyone else think instance variables are problematic in database-backed applications?

It occurs to me that state control in languages like C# is not well supported.
By this I mean that it is left up to the programmer to manage the state of in-memory objects. A common use case is that instance variables in the domain model are copies of information residing in persistent storage (i.e. the database). Clearly this violates the single-point-of-authority principle, and "synchronisation" has to be managed by the developer.
I envisage a system where, instead of instance variables, we have simple public accessor/mutator methods marked with attributes that link them to the database, and where reads and writes are mediated by a framework that decides whether to hit the database. Does such a system exist?
Am I completely missing the point, or is there some truth to this idea?
If I understand correctly what you want: any OR-mapper with lazy loading works this way. For example, I use Genome, where every entity is a pure proxy and you have quite a lot of influence over how the OR-mapper caches the fields.
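To make the idea concrete, here is a toy sketch (Python descriptors standing in for C# attributes; the session object and its methods are invented) of an attribute whose reads and writes are mediated by the mapping framework:

    class DbField:
        """Toy version of a lazily loaded, database-backed attribute."""
        def __init__(self, column):
            self.column = column

        def __set_name__(self, owner, name):
            self.attr = "_" + name

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            if not hasattr(obj, self.attr):
                # First read: the framework fetches the value from storage.
                setattr(obj, self.attr, obj.session.load(obj.id, self.column))
            return getattr(obj, self.attr)

        def __set__(self, obj, value):
            setattr(obj, self.attr, value)
            obj.session.mark_dirty(obj, self.column, value)  # writes go through the framework

    class Customer:
        name = DbField("name")
        email = DbField("email")

        def __init__(self, session, id):
            self.session = session
            self.id = id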
Actually, there's the concept of data prevalence (as implemented by Prevayler in Java), where the in-memory objects are the single point of authority (SPA) for the data.
Also, some object databases (such as db4o) blur the lines a bit between the object representation and the "store" representation.
On the other hand, by bringing the SPA for the data inside the application, you need to handle transactions and/or data persistence yourself. There is some work on transactional memory systems such as JVSTM (currently in use by the information system of my old college), but it's not in widespread use.
If, on the other hand, the data lives in a database, you can just commit the data when everything is good (or use the transaction support built into the database) and be sure the data isn't corrupted or lost. You trade the SPA principle for better data reliability and transactions (and the other advantages of using a separate data store).