How to maintain master data across multiple parties - database

The scenario is simple: a number of companies need to share some reference data (let's say, a list of products and their attributes).
The problem is that currently each entity collects/cleans the data internally, then shares it with the other companies, which leads to a lengthy process of exchanging files (spreadsheets).
What are some of the modern approaches for solving this? Surely this scenario is very common in modern corporate life, so I am looking for some guidance on standard processes / technologies to look into.
Thanks!

This is indeed a common scenario, and there are several ways of solving the problem. They range from dedicated commercial offerings (like Tibco MDM or similar offerings from IBM, Dell/Boomi, etc.) that tend to be large, monolithic master data solutions, to lightweight microservices that each own a particular subset of the data and publish changes in the relevant data to the other microservices that need it. If the database technology and schemas are largely similar between companies, you can sometimes use replication or database log shipping to synchronize data between companies. If your data does not need to be synchronized automatically, an ETL procedure can be used to extract changes to master data from one company and load them into the others.
A near-realtime approach that utilizes Domain-Driven Design would identify a Bounded Context for each set of Master Data elements that need to be managed, such as Products. Then, each company would define their Domain Model for Products, defining a Product Aggregate that exposes the public surface of the Product and abstracts the company-specific details. The Domain Model would also define the Events that each company can publish related to their Product aggregate. If different companies own different parts of the product, they may publish changes to those fields as Domain Events to which the other companies subscribe.
The underlying repositories and data structures that define the Product in each company would be updated by a command processor that receives the changes and applies them to the company-specific domain. This can be orchestrated using an ESB, microservices, or a message broker, depending on the complexity of the processing required between companies, such as routing, transformation and translation.
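To make this concrete, here is a minimal sketch of a Product aggregate recording domain events, with an in-process publish function standing in for the broker/ESB. All names (Product, ProductPriceChanged, publish) are illustrative, not any specific framework's API:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class ProductPriceChanged:
    """Domain event: a company changed a product field it owns."""
    sku: str
    new_price: float
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


class Product:
    """Aggregate root: exposes the public surface of a product and
    records the domain events produced by each change."""

    def __init__(self, sku: str, price: float):
        self.sku = sku
        self.price = price
        self.pending_events: list = []

    def change_price(self, new_price: float) -> None:
        self.price = new_price
        self.pending_events.append(ProductPriceChanged(self.sku, new_price))


def publish(event) -> None:
    # Stand-in for an ESB / message broker publish call.
    print(json.dumps({"type": type(event).__name__, **asdict(event)}))


product = Product(sku="ABC-123", price=9.99)
product.change_price(10.49)
for event in product.pending_events:
    publish(event)  # subscribing companies apply this to their own model
```

Each subscribing company's command processor would consume such events and update its own repository, keeping the company-specific details abstracted behind the aggregate.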

Related

How do big tech companies share databases across multiple teams?

How do multiple teams (which own different system components/microservices) in a big tech company share their databases?
I can think of multiple use cases where this would be required. For example, in an e-commerce firm the same product will be shared among multiple teams: at first the product will be part of a product onboarding service, then maybe a catalog service (which stores all products and categories), then a search service, cart service, order placing service, recommendation service, cancellation & return service, and so on.
If they don't share any DB, then:
Do they all have a redundant copy of the products with the same product ID?
Wouldn't there be a challenge to achieve consistency among multiple teams?
There are multiple related doubts I have in both cases, whether they share a DB or not.
I have been through multiple tech blogs and videos on software design, and still didn't get a satisfying answer. Do share some resources which can give a complete workflow of how things work end-to-end in a big tech firm.
Thank you
In the microservice architecture, each microservice exposes endpoints where other microservices can access the information shared between the services. So one service stores only minimal information about a record that is managed by another microservice.
For example, if a user service would like to fetch orders for a particular user in an e-commerce case, then the order service would expose an endpoint that, given a user ID, returns all orders related to that user ID, and so on. So essentially the only user-related field that the order service needs to store is the user ID; the rest of the user details are irrelevant to it.
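A toy sketch of that idea, assuming an in-memory order store; the function and field names are made up for illustration:

```python
# The order service stores only the user's id alongside each order;
# richer user details live in the user service.
ORDERS = [
    {"order_id": 1, "user_id": 42, "total": 19.99},
    {"order_id": 2, "user_id": 42, "total": 5.00},
    {"order_id": 3, "user_id": 7,  "total": 12.50},
]

def get_orders_for_user(user_id: int) -> list[dict]:
    """What an endpoint like GET /orders?userId=<id> would return."""
    return [o for o in ORDERS if o["user_id"] == user_id]

print(get_orders_for_user(42))  # called by other services over HTTP
```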
To further improve the cohesion and understanding between teams, data discovery APIs/documentation are also built to share database metadata with other teams, to further explain what each table/field means so that one can efficiently plan out a microservice. You can read more about how such companies build data discovery tools here.
If I understand you correctly, you are unsure how different departments receive data in a company?
The idea is that you create reusable and effective APIs to solve this problem.
Let's generically say the company we're looking at is Walmart. Walmart has millions of items in a database (or databases). Each item has a unique ID, etc.
If Walmart is selling items online via walmart.com, they have to have a way to get those items, so they create APIs and use them to grab items based on certain query conditions.
Now, let's say Walmart has decided to build an app... well, they need those exact same items! Good thing we already created those APIs; we will use the exact same ones to grab the data.
Now, how does Walmart manage which items are available at which store, and at what price? They would usually link this metadata through additional database schema tables, tying them all together with primary and foreign keys.
This essentially allows Walmart to grab ONLY the item out of their CORE database, which holds just the details intrinsic to the item (e.g. name, size, color, SKU, details, etc.), and link it to another database for, say, YOUR local Walmart, containing information relevant only to your Walmart location in regard to that item (e.g. price, stock, aisle number, etc.).
So yes, in a sense, multiple databases are used.
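Here's a minimal sketch of that core-plus-local split, using SQLite for brevity; all table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (                 -- the "core" catalog: item-only details
    item_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    sku     TEXT NOT NULL UNIQUE
);
CREATE TABLE store_item (           -- per-store details, linked by keys
    store_id INTEGER NOT NULL,
    item_id  INTEGER NOT NULL REFERENCES item(item_id),
    price    REAL,
    stock    INTEGER,
    aisle    TEXT,
    PRIMARY KEY (store_id, item_id)
);
""")
conn.execute("INSERT INTO item VALUES (1, 'Garden Hose', 'GH-100')")
conn.execute("INSERT INTO store_item VALUES (501, 1, 24.99, 12, 'A7')")

# One query joins core item details with your local store's details.
row = conn.execute("""
    SELECT i.name, s.price, s.stock, s.aisle
    FROM item i JOIN store_item s ON s.item_id = i.item_id
    WHERE s.store_id = 501 AND i.sku = 'GH-100'
""").fetchone()
print(row)  # ('Garden Hose', 24.99, 12, 'A7')
```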
Perhaps this may drive you down some more roads: https://learnsql.com/blog/why-use-primary-key-foreign-key/
https://towardsdatascience.com/designing-a-relational-database-and-creating-an-entity-relationship-diagram-89c1c19320b2
There's a substantial diversity of approaches used between and even within big tech companies, driven by different company/org cultures and different requirements around consistency and availability.
Any time you have an explicit "query another service/another DB" dependency, you have a coupling which tends to turn a problem in one service into a problem in both services. And this isn't necessarily a one-way thing: it's quite possible for the querying service to encounter a problem which cascades into a problem in the queried service (this is especially possible when a cache becomes load-bearing, which has led to major outages at at least one FANMAG in the not-that-distant past).
This has led some companies that could be fairly called big tech to eschew that approach in their service design, typically by having services publish events describing what has changed to a durable log (append-only storage). Other services subscribe to that log and use the events to construct their own eventually consistent view of the data owned by the other service (i.e. there's some level of data duplication, with services storing exactly the data they need to function).
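A small illustration of that pattern, with an in-memory list standing in for the durable log (Kafka, a ledger table, etc.) and hypothetical event shapes:

```python
LOG: list[dict] = []          # append-only event log (stand-in)

def publish(event: dict) -> None:
    LOG.append(event)

# The owning service publishes what changed.
publish({"type": "ProductRenamed", "product_id": 1, "name": "Blue Widget"})
publish({"type": "PriceChanged",   "product_id": 1, "price": 4.25})

# A subscribing service folds events into its own view: eventually
# consistent, and storing exactly the data it needs (here, name only).
search_index: dict[int, str] = {}
for event in LOG:
    if event["type"] == "ProductRenamed":
        search_index[event["product_id"]] = event["name"]

print(search_index)  # {1: 'Blue Widget'}
```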

Database Bottleneck In Distributed Application

I hear about SOA and distributed applications everywhere now. I would like to know some best practices for keeping a single data source responsive, or, in case you have a copy of the data on every server, how best to synchronise those databases to keep them updated?
There are many answers to this question and in order to choose the most appropriate solution, you need to carefully consider what kind of data you are storing and what you want to do with it.
Replication
This is the traditional mechanism for many RDBMSs, and normally relies on features provided by the RDBMS. Replication has a latency, which means that although servers can handle load independently, they may not necessarily be reading the latest data. This may or may not be a problem for a particular system. When replication is bidirectional, simultaneous changes on two databases can lead to conflicts that need resolving somehow. Depending on your data, the choice might be easy (e.g. audit log => append both) or difficult (e.g. hotel room booking - cancel one? select an alternative hotel?). You also have to consider what to do in the event that the replication network link is down (i.e. do you deny updates on both databases, on one database, or allow the databases to diverge and sort out the conflicts later?). This is all dependent on the exact type of data you have. One possible compromise, for read-heavy systems, is to use unidirectional replication to many databases for reading, and send all write operations to the source database. This is always a trade-off between Availability and Consistency (see CAP Theorem). The advantage of RDBMS and replication is that you can easily query your entire dataset in complex ways and have greater opportunity to remove duplication by using relational links to data items.
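A rough sketch of that read-heavy compromise (the connection objects are hypothetical placeholders, not tied to any particular RDBMS driver):

```python
import random

class RoutingDatabase:
    """Sends every write to the source (primary) database and spreads
    reads across replicas fed by unidirectional replication."""

    def __init__(self, primary, replicas):
        self.primary = primary      # receives every write
        self.replicas = replicas   # replicated copies, serve reads

    def execute_write(self, sql, params=()):
        return self.primary.execute(sql, params)

    def execute_read(self, sql, params=()):
        # Reads may lag the primary by the replication latency:
        # the Availability-vs-Consistency trade-off in practice.
        return random.choice(self.replicas).execute(sql, params)
```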
Sharding
If your data can be cleanly partitioned into disjoint subsets (e.g. different customers), such that all possible relational links between data items are contained within each subset (e.g. customers -> orders), then you can put each subset in a separate database. This is the principle behind NoSQL databases, or as Martin Fowler calls them, 'Aggregate-Oriented Databases'. The downside of this approach is that it requires more work to run queries over your entire dataset, as you have to query all your databases and then combine the results (e.g. map-reduce). Another disadvantage is that in separating your data you may need to duplicate some of it (e.g. sharding by customers -> orders might mean product data is duplicated). It is also hard to manage the data schema as it lies independently in multiple databases, which is why most NoSQL databases are schema-less.
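A toy illustration of shard routing and the extra work for whole-dataset queries, using plain dicts as stand-ins for the shard databases:

```python
SHARDS = [dict(), dict(), dict()]   # three disjoint customer subsets

def shard_for(customer_id: int) -> dict:
    return SHARDS[customer_id % len(SHARDS)]  # simple hash partitioning

def save_order(customer_id: int, order: dict) -> None:
    shard_for(customer_id).setdefault(customer_id, []).append(order)

def all_orders() -> list:
    # A whole-dataset query must visit every shard and combine the
    # results (the map-reduce style work noted above).
    return [o for shard in SHARDS
              for orders in shard.values()
              for o in orders]

save_order(17, {"order_id": "A1"})
save_order(42, {"order_id": "B2"})
print(all_orders())
```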
Database-per-service
In the microservice approach, it is advised that each microservice should have its own dedicated database, that is not allowed to be accessed by any other microservice (of a different type). Hence, a microservice that manages customer contact information stores the data in a separate database from the microservice that manages customer orders. Links can be made between the databases using globally unique ids, or URIs (especially if the microservices are RESTful) etc. The downside again from this is that it is even harder to perform complex queries on the entire dataset (especially since all access should go via the microservice API not direct to the databases).
Polyglot storage
So many of my projects in the past have involved a single RDBMS in which all data was placed. Some of this data was well suited to the relational model, much of it was not. For example, hierarchical data might be better stored in a graph database, stock ticks in a column-oriented database, html templates in a NoSQL database. The trend with micro-services is to move towards a model where different parts of your dataset are placed in storage providers that are chosen according to the need.
If you are thinking of keeping a different copy of the database for each microservice and you want to achieve eventual consistency, then you can use Kafka Connect. Briefly, Kafka Connect (with a change-data-capture connector) will watch your DBs, and whenever there are any changes it will read the log file and add these logged events as messages in a queue; the other databases that subscribe to this queue can then apply the same changes on their side.
Kafka Connect isn't the only framework; you can search and find other frameworks or applications for the same implementation.
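For the consuming side, a hedged sketch using the kafka-python client; the topic name and event shape here are assumptions, since the actual message format depends on the connector used (e.g. Debezium):

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "products.changes",                       # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Apply the logged change to this service's own copy of the data,
    # giving eventual consistency across the per-service databases.
    print("applying change to local db:", change)
```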

Universal data model and microservices integration

Cloud-native applications and the microservices architecture require a decentralized data model (each microservice has its own database), while a universal data model is a centralized data model.
So, how can we have a microservices architecture with universal data model patterns?
Is there any reference or implementation combining a universal data model with microservices?
In general the two concepts are not compatible. Using a universal data model for all of your services would clash with a couple of key ideas behind using Microservices, e.g. Polyglot Persistence, separate development & deployment of each service. Also, let's not forget that the "Data Model Resource Book" was last updated in 2009.
However, if you must combine the two approaches, e.g. because management insists on it, you can encapsulate all access to the universal data model by a dedicated service and make your other services dependent on it.
Some good thoughts on the subject can be found here: http://plainoldobjects.com/2015/09/02/does-each-microservice-really-need-its-own-database-2/
Yes, to @Fritz's point -- a universal data model and microservices are really two different concepts and are very difficult, if not impossible, to use together. I would like to add that the reasoning for polyglot persistence is also about how the data should be modeled. Microservices allow the use of different data stores that can best model the data according to their domain.
To elaborate, I don't think it would do justice to mention microservices and data modeling without also mentioning domain-driven design. From my experience, domain-driven design really helps in thinking about services, their responsibilities, and their right to exist. For instance, I have often found that there is usually a collection of services that carries out a particular domain function. An example could be an e-commerce application that has payments, shopping carts, etc. These could be separated into different "bounded contexts", in domain-driven design terminology.
With the different bounded contexts, each microservice no longer sees the same concept in the system the same way, so in effect there is no real universal data model. The easiest example I can think of to show this is when you also want reporting on the metrics in the system. If the example is an e-commerce application, the notion of a transaction in the orders microservice is going to be different from a transaction in a reporting service. The reporting service, for instance, may want to know about transactions at a different level of detail, such as the profit or revenue generated for a particular order, instead of the particular line items in the order. However, from the perspective of the orders service, the order details, such as the line items and the address of the individual that made the purchase, are probably important and should be known. This then requires two different data models.
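A tiny sketch of that point: the "same" order modeled differently per bounded context (both classes are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class Order:                    # orders context: fulfillment details matter
    order_id: str
    line_items: list[dict]
    shipping_address: str

@dataclass
class OrderTransaction:         # reporting context: aggregates matter
    order_id: str
    revenue: float
    profit: float
```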
With respect to domain modeling, I may be a bit extreme, but I would go as far as saying that if there are multiple services sharing the same data source, they should really be the same service; there should be only one service per data source. My argument is that otherwise the domain hasn't been properly modeled, and the coupling makes it difficult to evolve any one service when multiple services rely on a single data source. It could be that one service requires the schema of the data source to change while another does not, yet the latter is still forced to accommodate the schema change. Hope this helps!

How to classify data into different categories by purpose in ERP applications?

Hi all,
I have recently been pondering how to classify data into different categories in ERP solutions. Based on that, I can decide which data I should strip out and put into a shared database for multiple tenant instances.
As an industry practice, the ERP product is separated into 2 layers. The technology platform layer provides a lot of reusable components and modeling tools and makes business applications follow a consistent architecture; the business application layer, which is based on it, provides the business functions.
So, basically, the data can be categorized into 2 main types: one is platform data, the other is business data. Further, the platform data can be categorized into sub-categories:
platform
1) environment
2) engine related (form engine, workflow engine, data access engine... which make the business functions work)
3) metadata (for example: form descriptions, business object descriptions, data models, workflow definitions)
4) configurations (organization- or user-related configurations)
5) management related (data structures for managing the models)
business
1) model instances (actual orders data)
2) business configurations
3) derived data (from model instance data, and from query or analysis)
After analysis, I found that the environment data, configurations, management-related data, and business data are coupled to a high degree. The only category that can be separated from the instance database is the metadata.
1. Is my analysis reasonable?
2. Are there any patterns for reference?
Thanks.
I would like to suggest the following pattern for splitting the data in the database:
The metadata [3] that you use to identify, authenticate and authorize the users, the settings or customization configurations [4] per user or tenant, along with the basic and other components [1 through 4] that are platform-specific (since you have not specified what "environment" covers, etc.) and common for all tenants, can live in a separate database, while the rest of the business-specific data resides in another database (or databases).
This will help tenants that come from geographically different locations to store their own data in their own separate database, for example to comply with national laws and regulations on data safety.
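A sketch of that split, with hypothetical connection strings: platform data shared, business data per tenant (and per region where regulations require it):

```python
# Shared database for platform-level data: metadata, configurations, etc.
PLATFORM_DB = "postgresql://host-a/platform"

# One business database per tenant, placed in the tenant's region to
# satisfy data-residency rules (names/URLs are illustrative).
TENANT_DBS = {
    "tenant_eu": "postgresql://host-eu/tenant_eu",
    "tenant_us": "postgresql://host-us/tenant_us",
}

def business_db_for(tenant: str) -> str:
    return TENANT_DBS[tenant]
```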
Please post your understanding in this regard.

Separating data from different sites

We are creating a web solution that contains a large number of users, their events, calendars, and content to be managed. This solution can be white-labeled and sold to other vendors as a service; i.e., though the hosting is on our SINGLE server, they will have their own administrators, their own users, and separate content, completely disconnected from the other vendors. For example, we are going to host the solution as
www.example.com/company1
www.example.com/company2
www.example.com/company3
The question is: should we use a different database for each company, or should we use a single database for managing all the companies?
Thanks
You should use separate databases for each company, unless you are offering some sort of service where the companies know that data is being pooled.
This is a question of data protection. No matter how much you swear that one company can only see their data in the table, you may not be able to convince prospective clients of this fact.
In addition, you need to keep the options open of running the databases on different servers. You don't want peak performance at one company to affect another company. Or, you don't want a special change for one company -- which might require bringing down the application with their knowledge -- to affect other clients.
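A minimal sketch of routing requests to a per-company database based on the URL paths above; the connection strings and request parsing are illustrative only:

```python
DATABASES = {
    "company1": "postgresql://db1.internal/company1",
    "company2": "postgresql://db2.internal/company2",
    "company3": "postgresql://db3.internal/company3",
}

def database_for_request(path: str) -> str:
    """Map /company1/... to that company's dedicated database, so one
    tenant's load or maintenance never touches another tenant's data."""
    company = path.strip("/").split("/", 1)[0]
    return DATABASES[company]

print(database_for_request("/company2/calendar"))
```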
