Orleans - how to implement storage sharding by Tenant with AdoNetStorageProvider - sql-server

I am trying to find out how we can implement storage sharding by tenant using the built-in AdoNetStorageProvider.
We are planning for SQL Server on-premises.
For example:
Grains that belong to tenant 1 should persist to shard A
Grains that belong to tenant 2 should persist to shard B
Grains that belong to tenant 3 should persist to shard A
where our sharding function indicates which shard to use.
For this purpose the sharding function looks up a grain's shard in a database, based on the grain's extended key (so the mapping should not live entirely in configuration files, as the number of shards rarely changes but new tenants are added frequently).
If this can be implemented by some built in framework then even better.
As per https://dotnet.github.io/orleans/Documentation/Core-Features/Grain-Persistence.html?q=sharded#shardedstorageprovider
ShardedStorageProvider distributes (shards) the data equally across the shards based on a hash function, which does not achieve this purpose. The shards may (or may not) be geo-located.
The sharding example on GitHub refers to the Elastic Database client library on Azure, which as far as I understand is not available for on-premises SQL Server.
I know we can write our own storage provider, but whenever possible we try to stay with the core.

Based on the answer on the project Gitter, there is no built-in ability to shard by tenant.
The way to implement this is by deriving from https://github.com/dotnet/orleans/blob/master/src/OrleansProviders/Storage/ShardedStorageProvider.cs
and overriding the HashFunction method.
Credit to #SebastianStehle
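The thread contains no code, but the idea behind overriding HashFunction is just a tenant-to-shard lookup instead of an even hash. Below is a minimal language-agnostic sketch in Python; the class and function names are hypothetical, and in Orleans this logic would live in a C# class derived from ShardedStorageProvider:

```python
class TenantShardMap:
    """Resolves a tenant ID to a shard index via a lookup table.

    A sketch only: in Orleans this mapping would be consulted from an
    overridden HashFunction, and load_mapping would query the database
    table that stores tenant-to-shard assignments.
    """

    def __init__(self, load_mapping):
        # load_mapping() returns {tenant_id: shard_index}
        self._load_mapping = load_mapping
        self._cache = {}

    def shard_for(self, tenant_id):
        if tenant_id not in self._cache:
            # Refresh from the backing store on a cache miss so newly
            # added tenants are picked up without redeploying config.
            self._cache = dict(self._load_mapping())
        return self._cache[tenant_id]


# Example from the question: tenants 1 and 3 on shard A (0), tenant 2 on shard B (1).
mapping = {1: 0, 2: 1, 3: 0}
resolver = TenantShardMap(lambda: mapping)
```

Because the mapping is loaded from a store rather than baked into configuration, adding a tenant only requires a new row in the mapping table.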

Related

Database design in Microservices

I am planning to move a monolithic ASP.NET MVC application to a micro-service architecture.
The application is in the educational domain, and below are the sub-modules it currently has:
System Admin
Institute Admin
Candidate/Student Portal
Facilitator/Teacher Portal
Course Setup (this can be done by a Facilitator or Institute Admin)
Exam Portal
Reporting Portal [NEW]
Video Conf Portal [NEW]
To achieve a micro-service architecture, I break the current system up like the diagram below and create a DB for each module.
Here I face a problem: the Exam DB currently has some relations with Course, Subject, etc.,
and the Course- and Subject-related tables have moved to another DB, so how do I maintain the relations and data in the Exam DB?
Should I create a replica of the Course and Subject tables inside the Exam DB and populate them when new data is inserted in the original DB (with an event queue)?
Or is there any other technique to achieve this with minimal change and effort?
Also, how do I maintain transactions between different DBs?
Database per service
Using a database per service has the following benefits:
Helps ensure that the services are loosely coupled. Changes to one service's database do not impact any other services.
Each service can use the type of database that is best suited to its needs. For example, a service that does text searches could use ElasticSearch. A service that manipulates a social graph could use Neo4j.
Using a database per service has the following drawbacks:
Implementing business transactions that span multiple services is not straightforward. Distributed transactions are best avoided because of the CAP theorem. Moreover, many modern (NoSQL) databases don’t support them.
Implementing queries that join data that is now in multiple databases is challenging.
The complexity of managing multiple SQL and NoSQL databases
There are various patterns/solutions for implementing transactions and queries that span services:
Implementing transactions that span services:
use the Saga pattern.
Implementing queries that span services:
API Composition - the application performs the join rather than the database. For example, a service (or the API gateway) could retrieve a customer and their orders by first retrieving the customer from the customer service and then querying the order service to return the customer’s most recent orders.
Command Query Responsibility Segregation (CQRS) - maintain one or more materialized views that contain data from multiple services. The views are kept up to date by services that subscribe to the events that each service publishes when it updates its data. For example, the online store could implement a query that finds customers in a particular region and their recent orders by maintaining a view that joins customers and orders. The view is updated by a service that subscribes to customer and order events.
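The API Composition pattern above can be sketched in a few lines. The two fetch_* functions are hypothetical in-memory stand-ins for HTTP calls to a customer service and an order service; the "join" happens in the composing code, not in a database:

```python
# Hypothetical stand-ins for the two services' APIs.
def fetch_customer(customer_id):
    customers = {1: {"id": 1, "name": "Ada"}}
    return customers[customer_id]

def fetch_recent_orders(customer_id, limit=2):
    orders = [
        {"id": 10, "customer_id": 1, "total": 30},
        {"id": 11, "customer_id": 1, "total": 15},
        {"id": 12, "customer_id": 1, "total": 99},
    ]
    mine = [o for o in orders if o["customer_id"] == customer_id]
    # Most recent first, capped at `limit`.
    return sorted(mine, key=lambda o: o["id"], reverse=True)[:limit]

def customer_with_orders(customer_id):
    # The application-level "join": one call per service, merged here.
    customer = dict(fetch_customer(customer_id))
    customer["recent_orders"] = fetch_recent_orders(customer_id)
    return customer
```

In a real system each fetch would be a network call, so latency and partial-failure handling (timeouts, fallbacks) become part of the composing service's job.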
Read more here
Shared database
The benefits of this pattern are:
A developer uses familiar and straightforward ACID transactions to enforce data consistency
A single database is simpler to operate
The drawbacks of this pattern are:
Development time coupling - a developer working on, for example, the OrderService will need to coordinate schema changes with the developers of other services that access the same tables. This coupling and additional coordination will slow down development.
Runtime coupling - because all services access the same database, they can potentially interfere with one another. For example, if a long-running CustomerService transaction holds a lock on the ORDER table, then the OrderService will be blocked.
A single database might not satisfy the data storage and access requirements of all services.
Read more here
Additional must reads: CQRS & API Composition

SAAS application with microservices and database per tenant

I am designing a web application for a multi-tenant SAAS model where, due to some regulation, we decided to have one database per tenant, and we are considering microservices for our middleware. But I am a bit confused, because microservice architecture says 'every microservice has its own database'. So here are my questions:
If I use a shared database for the microservices, it violates the concept/purpose of the microservices design, where they should all be isolated and any change should be limited to that microservice.
I can think of a layer which mimics my database tables as DTOs, and every microservice communicates with the database through this layer; correct me if I am wrong here.
If every microservice has its own database, then it's almost impossible to have one database per tenant; otherwise each microservice ends up with n databases.
So what would be the right approach to designing our middleware with one database per tenant?
Or if anyone has a better approach, feel free to share.
Below is the high-level design we started with.
You should distinguish 2 things here:
Database sharding
Sharding is a database architecture pattern related to horizontal partitioning, in which you split your database based on some logical key. In your case the logical key is your tenant (tenantId or tenantCode). Sharding allows you to split your data from one database into multiple physical databases; ideally you can have a database per logical shard key, which in your case means, in the best case, a database per tenant. Keep in mind that you don't have to split it that far. If your data is not big enough to be worth putting every tenant's data in a separate database, start with 2 databases and put half of your tenants in one physical database and half in the other. You can then coordinate this at the application layer by recording, in some configuration or in another database table, which tenant is in which database. As your data grows you can migrate and/or create additional physical databases or physical shards.
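The tenant-to-database coordination described above can be sketched as follows. The tenant codes and connection strings are made up for illustration; in practice the mapping would live in a config store or a database table rather than in code:

```python
# Hypothetical mapping of tenants to physical databases. Two tenants
# share shard A, one lives on shard B; new tenants just add a row.
TENANT_TO_DB = {
    "acme":    "Server=shard-a;Database=app",
    "globex":  "Server=shard-a;Database=app",
    "initech": "Server=shard-b;Database=app",
}

def connection_string_for(tenant_code):
    """Return the physical database for a tenant's logical shard."""
    try:
        return TENANT_TO_DB[tenant_code]
    except KeyError:
        raise LookupError(f"unknown tenant: {tenant_code}")
```

Moving a tenant to a new physical shard then becomes a data migration plus a one-row update to this mapping.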
Database per micro-service
It is one of the basic rules in micro-services architecture that each micro-service has its own database. There are multiple reasons for it; some of them are:
Scaling
Isolation
Using different database technologies for different micro-services(if needed)
Development team separation
You can read more about it here. Sure, it has some drawbacks, but keep in mind that it is one of the key rules in micro-services architecture.
Your questions
If I use shared database for microservices, it violate concept/purpose
of microservices design where they all should be isolated and any
change should limited to that microservice.
If you can, try to avoid a shared database for multiple micro-services. If you end up doing it, you should maybe reconsider your architectural design. Sometimes being forced into one database is an indicator that some micro-services should be merged into one, as the coupling between them is too big and the overhead of working with them separately becomes very complex.
If I use every microservice has their own database then its almost
impossible to have one database per tenant, otherwise each
microservice end-up with n number of database.
I don't really agree that it's impossible. Yes, it is hard to manage, but if you decided to use micro-services and you need database sharding, you need to deal with the extra complexity. Sure, having one database per micro-service and then n databases for each micro-service could be very challenging.
As a solution I would suggest the following:
Include the tenant (tenantId or tenantCode) as a column in every table in your database. This way you will be able to migrate easily later if you decide that you need to shard that table, a set of tables in a schema, or the whole database belonging to some micro-service. As already said in the part above regarding database sharding, you can start with one physical shard (one physical database) but already define your logical shards (in this case using the tenant info in each table).
Physically separate the data into different shards only in the micro-services where you need it. Let's say you have 2 micro-services: product-inventory-micro-service and customers-micro-service, with 300 million products in the product-inventory-micro-service DB and only 500,000 customers. You don't need a database per tenant in the customers-micro-service, but in the product-inventory-micro-service, with 300 million records, it would be very helpful performance-wise.
As I said above, start small with 1 or 2 physical databases, and add and migrate over time as your data grows and you have the need for it. This way you will save yourself some overhead in the development and maintenance of your servers, at least for the time that you don't need it.
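The logical-shard idea, with the tenant as a column in every table, can be illustrated with a small query helper that scopes every read to one tenant. This is only a sketch (the table and column names are hypothetical, and real code would go through your data-access layer):

```python
def tenant_scoped_query(table, tenant_id, columns="*"):
    """Build a parameterized query that is always filtered by tenant.

    Every tenant-aware table carries a tenant_id column, so each query
    targets exactly one logical shard even while all tenants share one
    physical database. Returns (sql, params) for a DB-API cursor.
    """
    sql = f"SELECT {columns} FROM {table} WHERE tenant_id = %s"
    return sql, (tenant_id,)
```

Because the tenant filter is logical, the same helper keeps working after tenants are moved onto separate physical databases; only the connection choice changes.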
Our application is SaaS and multi-tenant with a microservices architecture. We do indeed use 1 database per service, though in some cases it's actually just a separate schema per tenant, not a separate instance.
The key concept in "database per service" is all to do with avoiding sharing of tables/documents between services and less to do with exactly how you implement that. If two services are both accessing the same table, that becomes a point of tight coupling between the services as well as a common point of failure - two things microservices are designed to avoid.
Database-per-service means don't share persistence between multiple services; how you implement that is up to you.
Multi-tenancy is another challenge and there are multiple ways to approach it. In our case (where regulations do not dictate our architecture) we have designed our Tenant-aware table-structures with TenantId as part of the Primary Key. Each Tenant-aware service implements this separation and our Authorization service helps keep users within the boundaries of their assigned Tenant(s).
If you are required to have more separation than key/value separation, I would look to leverage schemas as a great tool to segregate both data and security:
You could have a database instance (say a SQL Server instance) per microservice, and within each instance have a schema per tenant.
That may be overkill at the start, and I think you'd be safe to do a schema per service/tenant combination until that number grew to be a problem.
In either case, you'd probably want to write some tooling in your DB of choice to help manage your growing collection of schemas, but unless you are going to end up with thousands of tenants, that should get you pretty far down the road.
The last point to make is that you're going to lean heavily on an Integration bus of some sort to keep your multiple, independent data stores in sync. I strongly encourage you to spend as much time on designing that as you do the rest of your persistence as those events become the lifeblood of the system. For example, in our multi-tenant setup, creating a new tenant touches 80% of our other services (by broadcasting a message) so that those services are ready to interact with the new tenant. There are some integration events that need to happen quickly, others that can take their time, but managing all those moving parts is a challenge.
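The broadcast-on-tenant-creation flow above can be sketched with a toy in-process bus. A real system would use a message broker, and the subscriber names here are made up; the point is only that one TenantCreated event fans out to every service that must provision for the new tenant:

```python
class Bus:
    """Minimal publish/subscribe bus, standing in for a real broker."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        # Fan the event out to every subscribed service handler.
        for handler in self._subscribers:
            handler(event)


bus = Bus()
provisioned = []
# Hypothetical services reacting to tenant creation.
bus.subscribe(lambda e: provisioned.append(("billing", e["tenant_id"])))
bus.subscribe(lambda e: provisioned.append(("search", e["tenant_id"])))
bus.publish({"type": "TenantCreated", "tenant_id": "t-42"})
```

With a real broker the handlers run asynchronously, so the "fast" and "can take their time" integrations mentioned above would typically live on separate queues.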

Modeling Forward and Reverse Query Questions in Bigtable

Let's say that we have the following three entities:
Organization
- id
Role
- id
Member
- id
A Role can be granted to a Member within an Organization, thus giving that Member certain access control rights to that Organization. I'd like to be able to answer the following two queries:
List the IDs of all the members who have a given Role within a given Organization (e.g. given a Role ID and Org ID give me the list of Members).
List all of the IDs of the Roles that a member has been granted within a given Organization (e.g. given a Member ID and Org ID give me the list of Roles).
I'm trying to find recommendations on how to model this in Bigtable (ideally with a single row for atomic mutations)... I'm also open to other technology recommendations (I'm trying to design within the constraints my company has given me).
If we model the relationship described above using the Bigtable row key org#{orgID}#role#{roleID}#member#{memberID}, I can easily answer the first question. However, it doesn't allow me to easily answer the second question. If I duplicate data and store another row key org#{orgID}#member#{memberID}#role#{roleID} then I can easily answer the second question, but now I have two rows to manage and atomic updates cannot be guaranteed between the two, so that may lead to consistency issues.
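The two-row-key layout can be sketched as a small helper that derives both keys from one relationship; the key format follows the one given in the question, and any write process would have to write both rows and reconcile failures:

```python
def role_member_keys(org_id, role_id, member_id):
    """Build the forward and reverse Bigtable row keys for one grant.

    forward: scan prefix org#{org}#role#{role}# lists members with a role.
    reverse: scan prefix org#{org}#member#{member}# lists a member's roles.
    """
    forward = f"org#{org_id}#role#{role_id}#member#{member_id}"
    reverse = f"org#{org_id}#member#{member_id}#role#{role_id}"
    return forward, reverse
```

Since Bigtable mutations are atomic only within a single row, the two writes can diverge on failure; a common mitigation is an idempotent writer plus a periodic reconciliation scan.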
Has anyone in the community run into a similar problem, and if so, how did you solve it?
Cloud Bigtable doesn't natively support secondary indexes, which is what you would need in order to keep a single row and still be able to run both of those queries efficiently without requiring a full table scan. The alternative, which you've already identified, is to write two rows via a process that ensures eventual consistency. This might be sufficient for your needs, depending on the underlying requirements of your system.
Depending on your constraints (cloud provider, data scale, atomicity, multi-region replication, etc.), you might be better served with a standard relational database (e.g. Postgres, MySQL), or Google Cloud Spanner.
Possible approaches with Spanner to accomplish this:
Have a single table that represents a Member <-> Role relationship. Make RoleID the primary key for the row, and then add a secondary index on MemberID, and you'd be able to run queries against either.
Go the traditional relational-database route of having Member, Role, and a MemberRole joining table. With Spanner you get atomic updates via a transaction. When querying you could potentially have issues with reads going across multiple splits, but you'd have to do some real-world testing to see what your performance would be like.
Disclosures:
I lead product management for Cloud Bigtable.
I co-founded the JanusGraph project.
Reading through your problem statement, it sounds like you want to use either a relational database or a graph database. Each one will have its own pros/cons.
Relational DBMS approach
As Dan mentioned in his answer, you can use a managed MySQL or PostgreSQL via Google Cloud SQL, or Google Cloud Spanner, depending on your needs for scale, replication, consistency, compatibility with existing code/frameworks, etc.
Graph database approach
Alternatively, you can use a graph database which can help you model this information easily and query it efficiently.
For example, you can deploy JanusGraph on GKE with Bigtable and Elasticsearch and query the data using the Gremlin language, which is a standard graph traversal/query language supported by many graph databases.
Note that JanusGraph + Bigtable inherits the transactionality of Bigtable (which as you noted, is row-level atomic). Since JanusGraph stores each vertex in a separate row in Bigtable, only single-vertex updates will be atomic. If you want transactional updates via JanusGraph, you may need to use a different storage backend, e.g.,
BerkeleyDB (local, non-distributed storage backend)
FoundationDB (recent contribution by the JanusGraph community)
There are many other graph databases you can consider, some of which also support Gremlin or other graph query languages. For example, you can deploy Neo4j on GCP if you prefer, which supports Gremlin as well as Cypher.

Database sharding on Heroku

At some point in the next few months our app will be at the size where we need to shard our DB. We are using Heroku for hosting, Node.js/PostgreSQL stack.
Conceptually, it makes sense for our app to have each logical shard represent one user and all data associated with that user (each user of our app generates a lot of data, and there are no interactions between users). We need to retain the ability for the user to do complex ad-hoc querying on their data. I have read many articles such as this one which talk about sharding: http://www.craigkerstiens.com/2012/11/30/sharding-your-database/
Conceptually, I understand how Sharding works. However in practice I have no idea how to go about implementing this on Heroku, in terms of what code I need to write and what parts of my application I need to modify. A link to a tutorial or some pointers would be much appreciated.
Here are some resources I have already looked at:
http://www.craigkerstiens.com/2012/11/30/sharding-your-database/
MySQL sharding approaches?
Heroku takes care of multiple database servers?
http://petrohi.me/post/30848036722/scaling-out-postgres-partitioning
http://adam.heroku.com/past/2009/7/6/sql_databases_dont_scale/
https://devcenter.heroku.com/articles/heroku-postgres-follower-databases
Why do people use Heroku when AWS is present? What distinguishes Heroku from AWS?
As the author of the first article happy to chime in further. When it comes to sharding one of the very key components is what key are you sharding on. The complexity of sharding really comes into play when you have data that is intermingled across different physical nodes. If you're something like a multi-tenant app then modeling all your data around this idea of a tenant or customer can fit very cleanly in this setup. In that case you're going to want to break up all tables that are related to customer and shard them the same way as other tenant related tables.
As for doing this on Heroku, there are two options. You can roll your own with Heroku Postgres and application logic, or use something like Citus (an add-on that helps manage more of this for you).
For rolling your own, you'll first create the application logic to handle creating all your shards and knowing where to route the appropriate queries. For Rails there are some gems to help with this, like activerecord-multi-tenant or apartment. When it comes to actually migrating to sharding, what you'll want to do is create a Heroku follower to start. During the migration you'll have it stop following. Then you'll remove half of the data from the original primary and the other half from the separated follower, accordingly.
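A minimal sketch of the roll-your-own routing layer, assuming a fixed list of shard URLs (placeholders here) and hashing on the user ID, since each logical shard in this app is one user:

```python
import hashlib

# Placeholder connection URLs; on Heroku these would come from config vars.
SHARD_URLS = [
    "postgres://shard-0.example/app",
    "postgres://shard-1.example/app",
]

def shard_url_for(user_id):
    """Pick the physical database for a user via a stable hash.

    Using a cryptographic hash (rather than Python's built-in hash())
    keeps the assignment stable across processes and restarts.
    """
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARD_URLS[int(digest, 16) % len(SHARD_URLS)]
```

Note that plain modulo hashing reshuffles most users when the shard count changes; a lookup table or consistent hashing avoids that at the cost of more bookkeeping.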
I am not sure I would call this "sharding."
In LedgerSMB, here is how we do things. Each company (business entity) is a separate database with fully separate data. Data cannot be shared between companies. One PostgreSQL cluster can run any number of company databases. We have an administrative interface that creates the database and loads the schema. The administrative interface can also create new users, which can (optionally) be shared between companies. I don't know quite how well sharing users between DBs would work on Heroku, but I am including that detail in terms of how we work with PostgreSQL.
So this is a viable approach.
What you really need is something to spin up databases and manage users in an automated way. From there you can require that the user specifies a company name that you can map to a database however you'd like (this mapping could be stored in another database for example).
I know this is fairly high level. It should get you started however.

Solr in a multi-tenant environment

I am considering using Solr in a multi-tenant application and I am wondering if there are any best practices or things I should watch out for?
One question in particular is would it make sense to have a Solr Core per tenant. Are there any issues with have a large number of Solr Cores?
I am considering use a core per tenant because I could secure each core separately.
Thanks
Solr cores are an excellent idea for multi-tenancy, particularly as they can be managed at runtime (so no server restart is required). You shouldn't run into too many performance problems from having multiple Solr cores, but be aware that the performance of one core will be impacted by the work on the other cores - they're probably going to be sharing the same disk.
I can see why you might want to give direct API access - for example if each 'user' is a Drupal site or similar, for a shared hosting type environment. The best thing would be to secure the different URLs, e.g. if you had /solr/admin/cores, /solr/client1 for a client core, and /solr/client2 for another, you would have three different authentications, one for your admin, and one each for your tenants. This is done in the container (Jetty, Tomcat etc.), take a look at the general Solr Security page: http://wiki.apache.org/solr/SolrSecurity - you'll want to setup a basic access login for each path in the same way.
You would no more use a separate table in a database for each tenant than you would a Solr core for each tenant.
If you think of a core like a database table and organize your project in such a way that each core represents an object in your problem space, then you can better leverage Solr.
Where Solr shines is when you need to index text and then search it quickly. If you are not doing that, you might as well use a relational database.
Also, from your question about securing Solr for each tenant, I hope you're not suggesting allowing your logged-in users to access the Solr output directly? Your users should not be able to directly access your Solr instance.
Good luck.
That's OK, but you may not be able to use the built-in cache effectively for your requirements. Another option is to add a permission bit to each document and modify the query component so that results are filtered according to the user's permissions; bitwise operations are also available for this. Make use of this for your needs.
