How do big tech companies share databases across multiple teams?

How do multiple teams (each owning different system components/microservices) in a big tech company share their databases?
I can think of multiple use cases where this would be required. For example, in an e-commerce firm, the same product will be shared among multiple teams: the product will first be part of a product onboarding service, then perhaps a catalog service (which stores all products and categories), then a search service, cart service, order placing service, recommendation service, cancellation & return service, and so on.
If they don't share any DB, then:
do they all keep a redundant copy of the products with the same product ID, and
wouldn't there be a challenge in achieving consistency among multiple teams?
I have multiple related doubts in both cases, whether they share a DB or not.
I have been through multiple tech blogs and videos on software design and still haven't found a satisfying answer. Please share some resources that give a complete end-to-end workflow of how things work in a big tech firm.
Thank you

In a microservice architecture, each microservice exposes endpoints through which other microservices can access shared information. As a result, one service stores only minimal information about a record that is managed by another microservice.
For example, if a user service would like to fetch orders for a particular user in an e-commerce system, the order service would expose an endpoint that, given a user ID, returns all orders related to that user. Essentially, the only user-related field the order service needs to store is the user ID; the rest of the user's details are irrelevant to it.
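As a rough sketch of what such an endpoint might look like (in Python/Flask; the route, field names, and in-memory store below are illustrative assumptions, not anything specified above):

```python
# Minimal order-service sketch: the only user-related field stored
# alongside each order is the user's ID.
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical order records (a real service would query its own DB).
ORDERS = [
    {"order_id": "o-1001", "user_id": "u-42", "total": 59.90},
    {"order_id": "o-1002", "user_id": "u-42", "total": 12.50},
    {"order_id": "o-1003", "user_id": "u-7",  "total": 230.00},
]

@app.route("/users/<user_id>/orders")
def orders_for_user(user_id):
    # Given a user ID, return all orders related to that user.
    return jsonify([o for o in ORDERS if o["user_id"] == user_id])

if __name__ == "__main__":
    app.run(port=5000)
```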
To further improve cohesion and understanding between teams, data-discovery APIs and documentation are also built to share database metadata with other teams, explaining what each table and field means so that a new microservice can be planned efficiently. You can read more about how such companies build data discovery tools.

If I understand you correctly, you are unsure how different departments in a company receive data?
The idea is that you create reusable and effective APIs to solve this problem.
Let's generically say the company we're looking at is Walmart. Walmart has millions of items in its database(s), and each item has a unique ID, etc.
If Walmart is selling items online via walmart.com, they have to have a way to get those items, so they create APIs and use them to grab items based on certain query conditions.
Now, let's say Walmart has decided to build an app... well, they need those exact same items! Good thing we already created those APIs; we will use the exact same ones to grab the data.
Now, how does Walmart manage which items are available at which store, and at what price? They would usually link this metadata through additional database schema tables, tying them all together with primary and foreign keys.
^^ This essentially allows Walmart to grab ONLY the item out of their CORE database, which holds just the details intrinsic to the item (e.g. name, size, color, SKU, details, etc.), and link it to another database for, say, YOUR local Walmart, which contains information relevant only to your Walmart location in regard to that item (e.g. price, stock, aisle number, etc.).
So yes, using multiple databases, in a sense.
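To make the core-table/location-table idea concrete, here is a hedged sketch using Python's built-in sqlite3; every table and column name below is an assumption for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Core catalog: only details intrinsic to the item itself.
    CREATE TABLE item (
        sku   TEXT PRIMARY KEY,
        name  TEXT NOT NULL,
        size  TEXT,
        color TEXT
    );
    -- Store-specific data, tied back to the core item by foreign key.
    CREATE TABLE store_item (
        store_id TEXT NOT NULL,
        sku      TEXT NOT NULL REFERENCES item(sku),
        price    REAL NOT NULL,
        stock    INTEGER NOT NULL,
        aisle    TEXT,
        PRIMARY KEY (store_id, sku)
    );
""")
conn.execute("INSERT INTO item VALUES ('SKU-1', 'Cereal', '500g', NULL)")
conn.execute("INSERT INTO store_item VALUES ('store-77', 'SKU-1', 3.49, 120, 'A4')")

# Join the core item details with the data for one particular store.
row = conn.execute("""
    SELECT i.name, s.price, s.stock, s.aisle
    FROM item i JOIN store_item s ON s.sku = i.sku
    WHERE s.store_id = ?
""", ("store-77",)).fetchone()
print(row)  # ('Cereal', 3.49, 120, 'A4')
```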
Perhaps this may drive you down some more roads: https://learnsql.com/blog/why-use-primary-key-foreign-key/
https://towardsdatascience.com/designing-a-relational-database-and-creating-an-entity-relationship-diagram-89c1c19320b2

There's a substantial diversity of approaches used between and even within big tech companies, driven by different company/org cultures and different requirements around consistency and availability.
Any time you have an explicit "query another service/another DB" dependency, you have a coupling that tends to turn a problem in one service into a problem in both services. And this isn't necessarily a one-way thing: it's quite possible for the querying service to encounter a problem that cascades into a problem in the queried service. This is especially likely when a cache becomes load-bearing, which has led to major outages at at least one FANMAG in the not-that-distant past.
This has led some companies that could be fairly called big tech to eschew that approach in their service design, typically by having services publish events describing what has changed to a durable log (append-only storage). Other services subscribe to that log and use the events to construct their own eventually consistent view of the data owned by the other service (i.e. there's some level of data duplication, with services storing exactly the data they need to function).
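A toy, in-process sketch of that publish/subscribe pattern (a real system would use a durable log such as Kafka; the event names and payloads here are assumptions):

```python
import json

event_log = []  # stands in for a durable, append-only log

def publish(event_type, payload):
    # The owning service appends an event describing what changed.
    event_log.append(json.dumps({"type": event_type, "payload": payload}))

# The product service publishes changes to data it owns.
publish("ProductCreated", {"product_id": "p-1", "name": "Cereal"})
publish("ProductRenamed", {"product_id": "p-1", "name": "Corn Cereal"})

# A subscribing service replays the log to build its own eventually
# consistent view, storing exactly the fields it needs to function.
search_index = {}
for raw in event_log:
    event = json.loads(raw)
    if event["type"] in ("ProductCreated", "ProductRenamed"):
        p = event["payload"]
        search_index[p["product_id"]] = p["name"]

print(search_index)  # {'p-1': 'Corn Cereal'}
```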

Related

Extending data models within a MicroService architecture

As part of a personal research project, I'm attempting to learn more about Microservice architecture and how to incorporate it within the industry I work in.
I've been reading a lot of books and articles on Microservice architecture, and have been researching and working with multiple software components that support this architecture, such as RabbitMQ; however, I have come unstuck at the initiation of the data models.
To put the requirement in its simplest form, let's say I have a requirement for two Microservices for the following high-level requirement (please note, I am excluding the WebAPI/Bridge microservice and UI microservices from this process; I am just focusing on the backend Microservices that house core data):
Requirement:
Provide the ability for a customer to log into a portal and register for a "scheme", which allows them to add money or credit to their record. This will be a multi-tenanted solution, where data can be placed in a single database or over multiple databases (already covered), and each Microservice will be responsible for its own table(s).
Some "tenants" may or may not have the "Credit Microservice" enabled, as defined below, and may only have the "Customer Microservice" and potentially other Microservices, such as a "Marketing Microservice" (not defined below, but referenced as an example).
Customer Microservice:
Responsible for managing customer information within the system, such as First Name, Last Name, email address, etc. This Microservice will expose various functions such as creating a new customer, updating an existing customer, deleting a customer, and finally retrieving customers (basic CRUD operations).
This Microservice will be fed data directly from our internal CRM (Customer Relationship Management) system via integration (this part is covered)
Data Schema:
CustomerId
FirstName
LastName
EmailAddress
Once a customer is created, an event is posted on the message queue informing other microservices that a Customer has been created.
Credit Microservice:
Responsible for managing a customer's balance and the scheme they are enrolled in. By default, a customer will NOT be enrolled in a scheme, and will therefore NOT be able to deposit credit to their account.
This Microservice will expose functions to "Enroll" a member in a scheme and "AddCredit" to their account, as well as retrieving the information for a particular customer (their scheme and balance). In real life, this would also let them change scheme and have various other options, but I'm keeping that out of this simple example.
Within the system, this Microservice would likely also be responsible for storing the credit "transactions", in a separate table within its database (or a unique schema within a single database instance). This is not defined below as it is irrelevant to the issue at hand.
Data Schema:
CustomerId
SchemeName
Balance
This Microservice will be responsible for listening to events from the Customer Microservice and creating new customers within its own unique SQL table, allowing that customer to enroll or add credit to their account.
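For illustration only, the listener side might look roughly like this sketch using pika for RabbitMQ (which was mentioned above); the queue name, event shape, and table layout are all assumptions:

```python
import json
import sqlite3
import pika

db = sqlite3.connect("credit.db")
db.execute("""CREATE TABLE IF NOT EXISTS customer (
    customer_id TEXT PRIMARY KEY,
    scheme_name TEXT,                     -- NULL until the customer enrolls
    balance     REAL NOT NULL DEFAULT 0
)""")

def on_customer_created(ch, method, properties, body):
    event = json.loads(body)
    # Store only what the Credit service needs: the customer's ID.
    db.execute(
        "INSERT OR IGNORE INTO customer (customer_id) VALUES (?)",
        (event["CustomerId"],),
    )
    db.commit()
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="customer-created", durable=True)
channel.basic_consume(queue="customer-created",
                      on_message_callback=on_customer_created)
channel.start_consuming()
```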
Issue:
Within the UI element of the application, I need to show admin users a list of customers, including whether or not they are enrolled in a scheme, as well as how much credit each customer has, but only if the tenant has enabled the "Credit" functionality (identifying this is already covered; each tenant can enable certain Microservices at setup).
The issue is that the data is stored across two tables and mastered by two different Microservices...
Do I create a new Microservice (e.g. CustomerCredit) that joins the two tables together (read-only) to display the results? Or does the API call the Customer Microservice's "retrieve" first, then call the Credit Microservice's "retrieve" with the relevant IDs, and join the results together?
The first option above does not work in practice, as I might have MULTIPLE microservices extending the schema of the customer model, for example Marketing, where I might store "Last Email Date" against a customer ID.
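(For reference, the second option -- composing the two "retrieve" calls at the API layer -- might look roughly like the sketch below, with hypothetical service URLs and response shapes.)

```python
import requests

# Hypothetical endpoints for the two backend microservices.
CUSTOMER_SVC = "http://customer-service/customers"
CREDIT_SVC = "http://credit-service/credits"

def list_customers_with_credit(tenant_id):
    customers = requests.get(CUSTOMER_SVC, params={"tenant": tenant_id}).json()
    ids = [c["CustomerId"] for c in customers]
    # One batched call with the relevant IDs, rather than N single calls.
    credits = requests.get(CREDIT_SVC, params={"ids": ",".join(ids)}).json()
    credit_by_id = {c["CustomerId"]: c for c in credits}
    for customer in customers:
        credit = credit_by_id.get(customer["CustomerId"])
        customer["SchemeName"] = credit["SchemeName"] if credit else None
        customer["Balance"] = credit["Balance"] if credit else None
    return customers
```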

How to maintain master data across multiple parties

The scenario is simple: a number of companies need to share some reference data (let's say, a list of products and their attributes).
The problem is that currently each entity collects and cleans the data internally, then shares it with the other companies, which leads to a lengthy process of exchanging files (spreadsheets).
What are some of the modern approaches for solving this? Surely this scenario is very common in modern corporate life, so I am looking for some guidance on standard processes / technologies to look into.
Thanks!
This is indeed a common scenario, and there are several ways of solving the problem. They range from dedicated commercial offerings (like Tibco MDM or similar offerings from IBM, Dell/Boomi, etc.), which tend to be large, monolithic master data solutions, to lightweight microservices that each own a particular subset of the data and publish changes in the relevant data to the other microservices that need it. If the database technology and schemas are largely similar between companies, you can sometimes use things like replication or database log shipping to synchronize data between companies. If your data does not need to be synchronized automatically, an ETL procedure could be used to extract changes to master data from one company and load them into the others.
A near-realtime approach that utilizes Domain-Driven Design would identify a Bounded Context for each set of Master Data elements that need to be managed, such as Products. Then, each company would define their Domain Model for Products, defining a Product Aggregate that exposes the public surface of the Product and abstracts the company-specific details. The Domain Model would also define the Events that each company can publish related to their Product aggregate. If different companies own different parts of the product, they may publish changes to those fields as Domain Events to which the other companies subscribe.
The underlying repositories and data structures that define the Product in each company would be updated by a command processor that receives the changes and applies them to the company-specific domain. This can be orchestrated using an ESB, microservices, or a message broker, depending on the complexity of the processing required between companies, such as routing, transformation, and translation.
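A rough sketch of that event-plus-command-processor flow (the class and field names are illustrative, not a prescribed model):

```python
from dataclasses import dataclass

@dataclass
class ProductPriceChanged:
    # A Domain Event published by the company that owns pricing.
    sku: str
    new_price: float

class ProductRepository:
    """Company-specific storage behind the Product aggregate."""
    def __init__(self):
        self.products = {"SKU-1": {"sku": "SKU-1", "price": 9.99}}

    def save_price(self, sku, price):
        self.products[sku]["price"] = price

class ProductCommandProcessor:
    """Receives events from other companies and applies them locally."""
    def __init__(self, repository):
        self.repository = repository

    def handle(self, event):
        if isinstance(event, ProductPriceChanged):
            # Translate the shared event into this company's own domain.
            self.repository.save_price(event.sku, event.new_price)

repo = ProductRepository()
processor = ProductCommandProcessor(repo)
processor.handle(ProductPriceChanged(sku="SKU-1", new_price=10.49))
print(repo.products["SKU-1"])  # {'sku': 'SKU-1', 'price': 10.49}
```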

How to structure/coordinate multiple databases?

Imagine a large corp with dozens of companies, each with their own website, and each website has its own unique functional requirements.
Most data on each website will be specific to that website
Each website can edit its own data
Some data will be shared across all websites
There will be a central CMS that is allowed to edit this data, but other websites can read and use that data
E.g., say you're planning the infrastructure for a company that owns multiple sub-companies that make different kinds of products, some in the same category (cereal, food), others in completely different categories (books, instruments). Some are marketing websites, some are for CRM, some are online stores.
there is a list of regulatory requirements that affect all products
each company should manage the status of compliance of its own products to each requirement
when a new requirement surfaces, details regarding that requirement should only be entered once
How would the multiple databases be coordinated?
edit: added more info per Bob's suggestions
Thanks for the incredibly insightful questions!
compliance data is not shared; it's siloed within each site
shared data lives only in the one enterprise-wide database; it will mostly be "types of [thing]"
there's no conclusive list of instances where they'll be used, but currently it'd be to populate CMS dropdowns for individual sites.
changes to shared data would occur a few times a year.
Ideally changes would be reflected within a few minutes, but an hour or so should be acceptable
very low volume of shared data.
All DBs will be new, decision on which DB is pending current investigation.
Sub-systems will expose REST api
Here are some ways I have seen this handled; you need to think about the implications of each structure based on the details of your particular business domain. All can work, but all have to be carefully set up if they are going to work.
One database for shared information and one per client for client-specific information. Set up the overall application so that the first thing you specify on login is the client, and it connects to the correct client database. People might also need a way to change the client if users will handle multiple clients.
Separate servers for each client if they need to be completely siloed. Database changes are made by script (kept in source control) and applied to each server as needed. The central database might have a job that runs to push any data changes to the other servers.
All the data in one database, but making sure each table has a client_id so that the data is always filtered correctly by client. You can set up separate views by client, so that the users can only see the clients they are supposed to see. This only works if the data for each client is substantially in the same form.
And since you are in a regulatory environment, I strongly urge you to create an audit database for each database, updated by database triggers (never audit from the application, or you will miss changes made directly to the data).
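As a minimal sketch of trigger-based auditing (shown with Python's sqlite3 so it is self-contained; in practice this would be a migration script in your RDBMS, and the table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        status     TEXT NOT NULL
    );
    CREATE TABLE product_audit (
        product_id INTEGER NOT NULL,
        old_status TEXT,
        new_status TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- The trigger fires on every update, even ones made outside the app,
    -- which is why auditing in the database beats auditing in code.
    CREATE TRIGGER product_status_audit
    AFTER UPDATE OF status ON product
    BEGIN
        INSERT INTO product_audit (product_id, old_status, new_status)
        VALUES (OLD.product_id, OLD.status, NEW.status);
    END;
""")
conn.execute("INSERT INTO product VALUES (1, 'pending')")
conn.execute("UPDATE product SET status = 'compliant' WHERE product_id = 1")
print(conn.execute("SELECT * FROM product_audit").fetchall())
```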
I agree with Chris that, even after both sets of questions, there is still a big set of possible solutions. For instance, if the databases were the same technology, and the shared data were stored in the same way in each one, you could do DB-level replication from the central DB to the others. Is it OK to have two separate DBs per application (one with shared data and one without)? This would influence the kind of replication.
Or you could have a purely code-based solution, where clicking Publish in a GUI that updates the central DB calls a set of APIs that also update the other DBs. Or microservices: updating the central DB also creates a message on a shared queue, which is picked up by services that each look after a different DB and apply the updates in whatever form makes sense for that DB.
It depends on (among the things already mentioned) what your organisation's technology strategy is, what technology and skills you already have in-house, and so on.
So this is as much an architecture question as it is a db question.
I don't think this question is sufficiently clear to get a single answer. However, there are a few possibilities.
In many cases, where you have shared data, you want a single point of ownership for that information. It could be in a database, in an Excel file (which can be turned into CSV and periodically loaded into all DBs), or in some other form. The specifics depend on what exactly is shared.
Now, in this case, it sounds like you are going to have some sort of legal department in charge of some shared information; they will manage that data, which will then be shared with the other sites. This might be done with an application they manage which aggregates information from the other companies, or it could be data which is pushed to their systems.
A final point:
Software is at its best when it facilitates human solutions to human problems, not when it tries to solve those problems directly. In these cases, you probably want a good human solution in place and then to look at what software can do to support that. A lot of the issues (who owns the information?) will already have been solved and you will be simply automating what is already done.

Separating data from different sites

We are creating a web solution that contains a large number of users, their events, calendars, and content to be managed. This solution can be white-labeled and sold to other vendors as a service; i.e., though the hosting is on our SINGLE server, they will each have their own administrators, their own users, and separate content, completely disconnected from the other vendors. For example, we are going to host the solution as
www.example.com/company1
www.example.com/company2
www.example.com/company3
The question is: should we use a different database for each company, or a single database for managing all the companies?
Thanks
You should use separate databases for each company, unless you are offering some sort of service where the companies know that data is being pooled.
This is a question of data protection. No matter how much you swear that one company can only see their data in the table, you may not be able to convince prospective clients of this fact.
In addition, you need to keep open the option of running the databases on different servers. You don't want peak load at one company to affect another company. And you don't want a special change for one company -- which might require bringing down the application with their knowledge -- to affect other clients.

Software as a service - Database

If I am building a CRM web application to sell as a membership service, what is the best method to design and deploy the database?
Do I have one database that houses hundreds of records per table, or deploy multiple databases for different clients?
Is it really an issue to use a single database, since I believe sites like Flickr use one?
Multiple clients is called "multi-tenant". See for example this article "Multi-Tenant Data Architecture" from Microsoft.
In a situation like a CRM system, you will probably need to have separate instances of your database for each customer.
I say this because, if you'd like larger clients, most companies have security policies in place regarding customer data. If you store their customer data in the same database as another customer's, you run the risk of exposing one company's confidential data to another company (a competitor, etc.).
Sites like Flickr don't have to worry about this as much since the majority of us out on the Interwebs don't have such strict policies regarding our personal data.
Long term, it is easiest to maintain one database with multiple clients' data in it. Think about deployment, backup, etc. However, this doesn't keep you from having several instances of this database, each containing a subset of the full client dataset. I'd recommend growing the number of databases after you have established the usefulness/desirability of your product. Complex infrastructure is not necessary if you have no traffic...
So, I'd just put a client id in the relevant tables and smile when client 4 comes in and the extent of your new deployment is one insert statement.
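A minimal sketch of that client_id approach, with an illustrative schema (Python's sqlite3 used just to keep it self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client  (client_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE contact (
        contact_id INTEGER PRIMARY KEY,
        client_id  INTEGER NOT NULL REFERENCES client(client_id),
        email      TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO client VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex"), (3, "Initech")])
conn.execute("INSERT INTO contact VALUES (1, 1, 'a@acme.example')")

# Every query is filtered by client_id, so tenants never see each other.
rows = conn.execute(
    "SELECT email FROM contact WHERE client_id = ?", (1,)
).fetchall()
print(rows)

# Client 4 comes in: the extent of the new "deployment" is one insert.
conn.execute("INSERT INTO client VALUES (4, 'Hooli')")
```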
