Combining single-tenant and multi-tenant infrastructure - database

I am evaluating the best approach for migrating our current on-premises Java web app to a SaaS platform. Application multi-tenancy seems straightforward, but less so for the database. We're probably all aware of the database-per-tenant pros at this point: isolation, performance, reduced backup/restore complexity, and much lower retrofit complexity. Naturally the row-per-tenant approach has its benefits as well, reduced infrastructure costs being a major one.
Is it unheard of to combine the two approaches? That way the database-per-tenant approach offers faster time-to-market while the development changes to support a multi-tenant database are made gradually. Once both approaches are operational, customers with particularly heavy workloads or strict security constraints could have their own isolated database, while the default would be a shared common database (for cost/efficiency reasons). Does anyone have experience using or seeing this combination of approaches in the real world?
Whether requests are routed to a datasource by tenant ID, or the tenant ID is passed as an argument to the SQL queries, the major differences should be contained within the persistence layer/database, somewhat limiting the added complexity of combining the two approaches.
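For example, with Spring the datasource routing can live in an AbstractRoutingDataSource. A minimal sketch, where the TenantContext holder and the way tenants are registered in targetDataSources are assumptions:

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

// Hypothetical thread-local holder for the current tenant,
// populated by e.g. a servlet filter at the start of each request.
class TenantContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();
    static void set(String tenantId) { CURRENT.set(tenantId); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }
}

// Routes each request to the tenant's dedicated DataSource when one is
// registered in targetDataSources; with the default lenient fallback,
// tenants without their own database fall through to the shared default.
public class TenantRoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        return TenantContext.get();
    }
}
```

With this shape, moving a tenant from the shared database to an isolated one is a configuration change plus a data migration; the rest of the application never sees the difference.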

There are complexities when we scale out a tenant, i.e. move a tenant's data from the shared database to an isolated one.
Automating this process requires effort and testing: the entity tables and mapping tables have to be identified, and the steps ordered so that the migration completes successfully. The data-access strategy used for the database, such as an ORM or ADO.NET, also needs to be considered in this process.
Compared to having a row-wise tenant ID, using a schema per tenant within the same database makes this kind of migration easier.
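As a rough illustration of schema-per-tenant, assuming PostgreSQL and a hypothetical tenant_<id> schema naming scheme, switching tenants is just a matter of pointing the connection at the right schema:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class TenantSchemas {
    // Point the connection at the tenant's schema before running any queries.
    // Assumes schemas are named tenant_<id>, e.g. tenant_42; moving a tenant
    // out later means dumping/restoring one schema rather than filtering rows.
    public static void useTenantSchema(Connection conn, long tenantId) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SET search_path TO tenant_" + tenantId);
        }
    }
}
```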
We did try this out initially, but since there was both framework data and application/business data, it was a little difficult to make the migration happen automatically within the short time-frame we had; with the right time and plan, though, this can be achieved.

Related

SAAS application with microservices and database per tenant

I am designing a web application for a multi-tenant SaaS model where, due to some regulations, we decided to have one database per tenant, and we are considering microservices for our middleware. But I am a bit confused, since microservice architecture says that 'every microservice has its own database'. So here are my questions:
If I use a shared database for the microservices, it violates the concept/purpose of microservices design, where they should all be isolated and any change should be limited to that microservice.
I can think of a layer which mimics my database tables as DTOs, and every microservice communicates with the database through this layer; correct me if I am wrong here.
If every microservice has its own database, then it's almost impossible to have one database per tenant; otherwise each microservice ends up with n databases.
So what would be the right approach to designing our middleware with one database per tenant? Or if anyone has a better approach, feel free to share.
Below is the high-level design we started with.
You should distinguish 2 things here:
Database sharding
Sharding is a database architecture pattern related to horizontal partitioning in which you split your database based on some logical key. In your case the logical key is the tenant (tenantId or tenantCode). Sharding allows you to split your data from one database across multiple physical databases. Ideally you can have a database per logical shard key, which in your case means, in the best case, a database per tenant. Keep in mind that you don't have to split it that far. If your data is not big enough to be worth putting every tenant's data in a separate database, start with 2 databases and put half of your tenants in one physical database and half in a second. You can then coordinate this at your application layer by recording, in some configuration file or in another database table, which tenant lives in which database. As your data grows you can migrate and/or create additional physical databases or physical shards.
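A minimal sketch of that coordination, with hypothetical names; the application resolves the physical database from the tenant key, falling back to a shared default:

```java
import java.util.Map;
import javax.sql.DataSource;

// Maps each tenant to the physical database holding its shard.
// In practice the mapping would be loaded from a configuration table,
// not hard-coded; it is passed in here for illustration.
public class ShardResolver {
    private final Map<String, DataSource> shardByTenant;
    private final DataSource defaultShard;

    public ShardResolver(Map<String, DataSource> shardByTenant, DataSource defaultShard) {
        this.shardByTenant = shardByTenant;
        this.defaultShard = defaultShard;
    }

    public DataSource resolve(String tenantCode) {
        // Tenants without an explicit entry live in the default physical database.
        return shardByTenant.getOrDefault(tenantCode, defaultShard);
    }
}
```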
Database per micro-service
It is one of the basic rules in micro-services architecture that each micro-service has its own database. There are multiple reasons for it, some of them being:
Scaling
Isolation
Using different database technologies for different micro-services(if needed)
Development team separation
You can read more about it here. Sure, it has some drawbacks, but keep in mind that it is one of the key rules in micro-services architecture.
Your questions
If I use a shared database for the microservices, it violates the concept/purpose of microservices design, where they should all be isolated and any change should be limited to that microservice.
If you can, try to avoid a shared database for multiple micro-services. If you end up doing it, you should perhaps reconsider your architectural design. Sometimes being forced into one database is an indicator that some micro-services should be merged into one, because the coupling between them is too big and the overhead of working with them separately becomes very complex.
If every microservice has its own database, then it's almost impossible to have one database per tenant; otherwise each microservice ends up with n databases.
I don't really agree that it's impossible. Yes, it is hard to manage, but if you decided to use micro-services and you need database sharding, you have to deal with the extra complexity. Sure, having one database per micro-service and then n databases for each micro-service could be very challenging.
As a solution I would suggest the following:
Include the tenant (tenantId or tenantCode) as a column in every table in your database. This way you will be able to migrate easily later if you decide that you need to shard a table, a set of tables in a schema, or the whole db belonging to some micro-service. As already said in the part above on database sharding, you can start with one physical shard (one physical database) but already define your logical shard, in this case using the tenant info in each table (see the sketch after this list).
Separate the data physically into different shards only in the micro-services where you need it. Let's say you have 2 micro-services: product-inventory-micro-service and customers-micro-service, with 300 million products in the product-inventory-micro-service db and only 500,000 customers. You don't need a database per tenant in the customers-micro-service, but in the product-inventory-micro-service with 300 million records it would be very helpful performance-wise.
As I said above, start small with 1 or 2 physical databases and increase and migrate over time, as your data grows and you have the need for it. This way you will save yourself some overhead in the development and maintenance of your servers, at least for the time that you don't need it.
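As a sketch of the first suggestion, a hypothetical JPA entity (names invented for illustration) that carries the logical shard key in every table from day one:

```java
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;

@Entity
public class Product {
    @Id
    private Long id;

    // Logical shard key: present in every table from the start,
    // even while all tenants still share one physical database.
    @Column(name = "tenant_id", nullable = false)
    private String tenantId;

    private String name;
    // getters/setters omitted for brevity
}
```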
Our application is SaaS and multi-tenant with a microservices architecture. We do indeed use 1 database per service, though in some cases it's actually just a separate schema per tenant, not a separate instance.
The key concept in "database per service" is all to do with avoiding sharing of tables/documents between services and less to do with exactly how you implement that. If two services are both accessing the same table, that becomes a point of tight coupling between the services as well as a common point of failure - two things microservices are designed to avoid.
Database-per-service means: don't share persistence between multiple services; how you implement that is up to you.
Multi-tenancy is another challenge and there are multiple ways to approach it. In our case (where regulations do not dictate our architecture) we have designed our Tenant-aware table-structures with TenantId as part of the Primary Key. Each Tenant-aware service implements this separation and our Authorization service helps keep users within the boundaries of their assigned Tenant(s).
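A minimal JPA sketch of that design, assuming Jakarta Persistence and hypothetical entity names, with TenantId as part of a composite primary key:

```java
import java.io.Serializable;
import java.util.Objects;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.IdClass;
import jakarta.persistence.Table;

// Composite key class: every lookup must name the tenant, so rows
// belonging to different tenants can never be confused.
class OrderId implements Serializable {
    Long tenantId;
    Long orderNumber;

    @Override
    public boolean equals(Object o) {
        return o instanceof OrderId other
                && Objects.equals(tenantId, other.tenantId)
                && Objects.equals(orderNumber, other.orderNumber);
    }

    @Override
    public int hashCode() { return Objects.hash(tenantId, orderNumber); }
}

@Entity
@Table(name = "orders")
@IdClass(OrderId.class)
public class Order {
    @Id private Long tenantId;     // TenantId is part of the primary key
    @Id private Long orderNumber;
    private String status;
}
```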
If you are required to have more separation than key/value separation, I would look to leverage schemas as a great tool to segregate both data and security:
You could have a database instance (say a SQL Server instance) per microservice, and within each instance have a schema per tenant.
That may be overkill at the start, and I think you'd be safe to do a schema per service/tenant combination until that number grew to be a problem.
In either case, you'd probably want to write some tooling in your DB of choice to help manage your growing collection of schemas, but unless you are going to end up with thousands of tenants, that should get you pretty far down the road.
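A tiny sketch of what such tooling might start as, assuming PostgreSQL and a hypothetical tenant_<code> schema naming convention:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class SchemaProvisioner {
    // Create the tenant's schema inside this service's database instance.
    // Afterwards you would run the service's migrations against the new schema.
    public static void provision(Connection conn, String tenantCode) throws SQLException {
        // tenantCode must be validated/whitelisted before being spliced into DDL.
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE SCHEMA IF NOT EXISTS tenant_" + tenantCode);
        }
    }
}
```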
The last point to make is that you're going to lean heavily on an Integration bus of some sort to keep your multiple, independent data stores in sync. I strongly encourage you to spend as much time on designing that as you do the rest of your persistence as those events become the lifeblood of the system. For example, in our multi-tenant setup, creating a new tenant touches 80% of our other services (by broadcasting a message) so that those services are ready to interact with the new tenant. There are some integration events that need to happen quickly, others that can take their time, but managing all those moving parts is a challenge.

Microservices and database joins

For people who are splitting up monolithic applications into microservices: how are you handling the conundrum of breaking apart the database? Typical applications that I've worked on do a lot of database integration, for performance and simplicity reasons.
If you have two tables that are logically distinct (bounded contexts, if you will) but you often do aggregate processing on large volumes of that data, then in the monolith you're more than likely to eschew object orientation and instead use your database's standard JOIN feature to process the data on the database before returning the aggregated view to your app tier.
How do you justify splitting such data into microservices, where presumably you will be required to 'join' the data through an API rather than in the database?
I've read Sam Newman's Microservices book, and in the chapter on splitting the monolith he gives an example of "Breaking Foreign Key Relationships" where he acknowledges that doing a join across an API is going to be slower - but he goes on to ask: if your application is fast enough anyway, does it matter that it is slower than before?
This seems a bit glib. What are people's experiences? What techniques did you use to make the API joins perform acceptably?
When performance or latency doesn't matter too much (yes, we don't always need them) it's perfectly fine to just use simple RESTful APIs for querying the additional data you need. If you need to make multiple calls to different microservices and return one result, you can use the API Gateway pattern.
It's perfectly fine to have redundancy in polyglot-persistence environments. For example, you can use a messaging queue for your microservices and send "update" events every time you change something. Other microservices will listen for the events they need and save the data locally. So instead of querying, you keep all the required data in the appropriate storage for each specific microservice.
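A sketch of that pattern with hypothetical event and repository types; the messaging wiring (Kafka, RabbitMQ, etc.) is assumed to exist:

```java
// Hypothetical event published by the service that owns customer data.
record CustomerUpdated(String customerId, String name, String email) {}

// Local, read-only copy kept inside this microservice's own database.
interface CustomerReplicaRepository {
    void upsert(String customerId, String name, String email);
}

class CustomerEventListener {
    private final CustomerReplicaRepository replicas;

    CustomerEventListener(CustomerReplicaRepository replicas) {
        this.replicas = replicas;
    }

    // Invoked by the messaging infrastructure whenever the owning
    // service publishes a change; keeps the local copy up to date.
    void onCustomerUpdated(CustomerUpdated event) {
        replicas.upsert(event.customerId(), event.name(), event.email());
    }
}
```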
Also, don't forget about caching :) You can use tools like Redis or Memcached to avoid querying other databases too often.
It's OK for services to have read-only replicated copies of certain reference data from other services.
Given that, when trying to refactor a monolithic database into microservices (as opposed to a rewrite), I would:
create a db schema for the service
create versioned* views** in that schema to expose data from that schema to other services (see the sketch below)
do joins against these readonly views
This will let you independently modify table data/structure without breaking other applications.
Rather than use views, I might also consider using triggers to replicate data from one schema to another.
This would be incremental progress in the right direction, establishing the seams of your components, and a move to REST can be done later.
*the views can be extended. If a breaking change is required, create a v2 of the same view and remove the old version when it is no longer required.
**or Table-Valued-Functions, or Sprocs.
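A sketch of such a versioned, read-only projection, with hypothetical schema and column names, run from Java over JDBC:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class CustomerViews {
    // Expose a stable, versioned projection of the customers table to other
    // services. Consumers join against customers_api.customer_v1; the owning
    // service can reshape the underlying table and later publish customer_v2
    // alongside it, retiring v1 once nobody depends on it.
    public static void createV1(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(
                "CREATE VIEW customers_api.customer_v1 AS " +
                "SELECT id, tenant_id, display_name " +
                "FROM customers.customer");
        }
    }
}
```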
CQRS (Command Query Responsibility Segregation) is the answer to this, as per Chris Richardson.
Let each microservice update its own data model and generate events that update the materialized view holding the pre-joined data from the other microservices. This MV could be any NoSQL DB, Redis, or Elasticsearch, i.e. something query-optimized. This technique leads to eventual consistency, which is definitely not bad and avoids real-time application-side joins.
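A rough sketch of such an event-driven materialized view, with hypothetical events; a plain in-memory map stands in for the query-optimized store (Redis, Elasticsearch, etc.):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical events from two different microservices.
record OrderPlaced(String orderId, String customerId, double total) {}
record CustomerRenamed(String customerId, String newName) {}

// Denormalized row: order and customer fields pre-joined, so reads need no join.
record OrderView(String orderId, String customerId, String customerName, double total) {}

class OrderViewProjector {
    private final Map<String, OrderView> views = new ConcurrentHashMap<>();
    private final Map<String, String> customerNames = new ConcurrentHashMap<>();

    void on(CustomerRenamed e) {
        customerNames.put(e.customerId(), e.newName());
        // Refresh the denormalized name on every order view for this customer.
        views.replaceAll((id, v) -> v.customerId().equals(e.customerId())
                ? new OrderView(v.orderId(), v.customerId(), e.newName(), v.total())
                : v);
    }

    void on(OrderPlaced e) {
        String name = customerNames.getOrDefault(e.customerId(), "unknown");
        views.put(e.orderId(), new OrderView(e.orderId(), e.customerId(), name, e.total()));
    }
}
```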
Hope this answers.
I would separate the solutions by area of use, let's say operational and reporting.
For the microservices that provide data for single forms needing data from other microservices (this is the operational case), I think using API joins is the way to go. You will not be handling big amounts of data, and you can do the data integration in the service.
The other case is when you need to run big queries on large amounts of data to do aggregations etc. (the reporting case). For this need I would think about maintaining a shared database, similar to your original scheme, and updating it with events from your microservice databases. On this shared database you could continue to use your stored procedures, which would save your effort and keep the database optimizations.
In microservices you create different read models; for example, if you have two different bounded contexts and somebody wants to search across both sets of data, then something needs to listen to events from both bounded contexts and create a view specific to that application.
In this case more space will be needed, but no joins are required.

Pluggable database interface

I am working on a project where we are scoping out the specs for an interface to the backend systems of multiple wholesalers. Here is what we are working with:
Each wholesaler has multiple products, upwards of 10,000. And each wholesaler has customized prices for their products.
The list of wholesalers being accessed will keep growing in the future, so potentially 1000s of wholesalers could be accessed by the system.
Wholesalers are geographically dispersed.
The interface to this system will allow the user to select the wholesaler they wish and browse their products.
Product price updates should be reflected on the site in real time. So, if the wholesaler updates the price it should immediately be available on the site.
System should be database agnostic.
The system should be easy to setup on the wholesalers end, and be minimally intrusive in their daily activities.
Initially, I thought about creating databases for each wholesaler on our end, but with potentially 1000s of wholesalers in the future, is this the best option as far as performance and storage go?
Would it be better to query the wholesalers' databases directly instead of storing their data locally? Can we do this and still remain database-agnostic?
What would be the best technology stack for such an implementation? I need some kind of ORM tool.
Java based frameworks and technologies preferred.
Thanks.
If you want to create software that can switch databases, I would suggest using Hibernate (or NHibernate if you use .NET).
Hibernate is an ORM which is not tied to a specific database, which allows you to switch the DB very easily. It is already proven in large applications and well integrated in the Spring framework (but can be used without the Spring framework, too). (Spring.NET is the equivalent if using .NET.)
Spring is a good technology stack for building large scalable applications (it contains an IoC container, a database access layer, transaction management, AOP support and much more).
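As a small illustration of the database-agnostic setup (assuming Hibernate 6 / Jakarta Persistence; all property values and entity names here are examples), switching databases is mostly a matter of swapping the dialect, driver, and connection properties:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

// Minimal example entity; mappings stay the same across databases.
@Entity
class Product {
    @Id Long id;
    String name;
}

public final class SessionFactoryBuilder {
    public static SessionFactory build() {
        return new Configuration()
                // Swapping databases is mostly a matter of changing these
                // properties; the mapped entities and HQL stay the same.
                .setProperty("hibernate.dialect", "org.hibernate.dialect.PostgreSQLDialect")
                .setProperty("hibernate.connection.driver_class", "org.postgresql.Driver")
                .setProperty("hibernate.connection.url", "jdbc:postgresql://localhost/wholesale")
                .setProperty("hibernate.connection.username", "app")
                .addAnnotatedClass(Product.class)
                .buildSessionFactory();
    }
}
```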
Wikipedia gives you a short overview:
http://en.wikipedia.org/wiki/Hibernate_(Java)
http://en.wikipedia.org/wiki/Spring_Framework
Would it be better to query the wholesalers' databases directly instead of storing their data locally?
This depends on the availability and latency of accessing the remote data. Databases themselves have several possibilities to keep multiple server instances in sync. Ask yourself what should/would happen if a wholesaler database goes (partly) offline. Maybe not all data needs to be duplicated.
Can we do this and still remain database-agnostic?
Yes, see my answer related to the ORM (N)Hibernate.
What would be the best technology stack for such an implementation?
"Best" depends on your requirements. I like Spring. If you go with .Net the built-in ADO.NET Entity Framework might be fit, too.

To CouchDB or not to?

Note: (I have investigated CouchDB for some time and need some actual experiences.)
I have an Oracle database for a fleet tracking service, and some stats are:
100 GB db
Huge insertion rate per second (our received messages)
Reliable replication (via Oracle streams on 4 servers)
Heavy complex queries.
Now the question: Can CouchDB be used in this case?
Note: Why I thought of CouchDB?
I have read about its ability to scale horizontally very well. That's very important in our case.
Since it's schema-free, we can handle changes more easily, as we have a lot of changes in different tables and stored procedures.
Thanks
Edit I:
I need transactions too, but I can tolerate other solutions. And if there is a little delay in replication, that would be no problem IF it is guaranteed.
You are enjoying the following features with your database:
Using it in production
The data is naturally relational (related to itself)
Huge insertion rate (no MVCC concerns)
Complex queries
Transactions
These are all reasons not to switch to CouchDB.
Of course, the story is not so simple. I think you have discovered what many people never learn: complex problems require complex solutions. We cannot simply replace our database and take the rest of the month off. Sure, CouchDB (and BigCouch) supports excellent horizontal scaling (and cross-datacenter replication too!) but the cost will be rewriting a production application. That is not right.
So, where can CouchDB benefit you?
I suggest that you begin augmenting your application with CouchDB applications. Deploy CouchDB, import your data into it, and build non mission-critical applications. See where it fits best.
For your project, these are the key CouchDB strengths:
It is a small, simple tool—easy for you to set up on a workstation or server
It is a web server. It integrates very well with your infrastructure and security policies.
For example, if you have a flexible policy, just set it up on your LAN
If you have a strict network and firewall policy, you can set it up behind a VPN, or with your SSL certificates
With that step done, it is very easy to access now. Just make HTTP or HTTPS requests. Whether you are importing data from Oracle with a custom tool, or using your web browser, it's all the same.
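For example, fetching a document is a single HTTP GET; a minimal Java sketch against a hypothetical database (fleet) and document id (truck-42):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // CouchDB exposes every document at /<database>/<doc-id> as JSON;
        // 5984 is the default CouchDB port.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/fleet/truck-42"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // the document as JSON
    }
}
```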
Yes! CouchDB is an app server too! It has a built-in administrative app for exploring data, changing the config, etc. (like a built-in phpMyAdmin). But for you, the value will be building admin applications and reports as simple, traditional HTML/JavaScript/CSS applications. You can get as fancy or as simple as you like.
As your project grows and becomes valuable, you are in a great position to grow, using replication:
Either expand the core with larger CouchDB clusters
Or, replicate your data and applications into different data centers, or onto individual workstations, or mobile phones, etc. (The strategy will be more obvious when the time comes.)
CouchDB gives you a simple web server and web site. It gives you a built-in web services API to your data. It makes it easy to build web apps. Therefore, CouchDB seems ideal for extending your core application, not replacing it.
I don't agree with this answer.
I think CouchDB suits the fleet-tracking use case especially well, due to its distributed nature. Moreover, the unreliable nature of the GPRS connections used for transmitting position data makes the offline-first paradigm of couchapps the perfect partner for your application.
For uploading data from a truck, the insertion rate can take huge advantage of CouchDB replication and bulk inserts, especially if performed on SSD-based CouchDB hosting.
For downloading data to a truck, CouchDB provides filtered replication, allowing each truck to download only the data it really needs instead of the whole database.
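A sketch of triggering such a filtered replication through CouchDB's _replicate endpoint; the database, design document, and filter names here are hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FilteredReplication {
    public static void main(String[] args) throws Exception {
        // Ask the local server to pull only the documents that pass the
        // (hypothetical) trucks/by_truck filter function for truck 42.
        String body = """
            {"source": "http://central:5984/fleet",
             "target": "fleet_truck_42",
             "filter": "trucks/by_truck",
             "query_params": {"truck_id": "42"}}""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/_replicate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```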
Regarding complex queries, NoSQL databases are more flexible and can perform much faster than relational databases. It's only a matter of structuring and querying your data reasonably.

Schema-free/flexible ACID database for a SaaS application?

I am looking at rewriting a VB-based on-premises (locally installed) application (invoicing + inventory) as a web-based Clojure application for small enterprise customers. I intend to offer this as a SaaS application to customers in similar trades.
I was looking at database options. My choice was an RDBMS: PostgreSQL/MySQL. I might scale up to 400 users in the first year, with typically 20-40 page views per day per user, mostly for transactions rather than static views. Each view will involve fetching data and updating data. ACID compliance is necessary (or so I think). So the transaction volume is not huge.
It would have been a no-brainer to pick either of these based on my preference, but for this one requirement, which I believe is typical of a SaaS app: the schema will keep changing as I add more customers/users and as each existing customer's business requirements change (I will be offering only limited flexibility to start with). As I am not a DB expert, based on what I can think of and have read, I could handle that in a number of ways:
Have a traditional RDBMS schema design in MySQL/PostgreSQL with a single DB hosting multiple tenants, and add enough "free-floating" columns in each table to allow for future changes as I add more customers or an existing customer's needs change. This has the downside of propagating changes to the DB every time a small change is made to the schema. I remember reading that in PostgreSQL schema updates can be done in real time without locking, but I'm not sure how painful or practical that is in this use case. The schema changes might also introduce new/minor SQL changes as well.
Have an RDBMS, but design the database schema in a flexible manner: close to entity-attribute-value, or just as a key-value store (Workday and FriendFeed, for example).
Have the entire thing in memory as objects and store them in log files periodically (e.g., edval, lmax).
Go for a NoSQL DB like MongoDB or Redis. But based on what I can gather, they are not suitable for this use case and are not fully ACID-compliant.
Go for a NewSQL DB like VoltDB or JustOneDB (cloud-based), which retain SQL and ACID-compliant behaviour and are "new-gen" RDBMSs.
I looked at Neo4j (a graph DB), but I'm not sure it fits this use case.
In my use case, more than scalability or distributed computing, I am looking for a better way to achieve "flexibility in schema + ACID + some reasonable performance". Most of the articles I could find on the net present schema flexibility as a route to performance (in the case of NoSQL DBs) and scalability, while leaving out the ACID/transactions side.
Is this an "either or" case of 'Schema flexibility vs ACID' transactions or Is there a better way out?
I think Tarantool can help you. That solution has transactions, Lua, msgpack, etc. And also see that video.
