Database sharding on Heroku - database

At some point in the next few months our app will be at the size where we need to shard our DB. We are using Heroku for hosting, Node.js/PostgreSQL stack.
Conceptually, it makes sense for our app to have each logical shard represent one user and all data associated with that user (each user of our app generates a lot of data, and there are no interactions between users). We need to retain the ability for the user to do complex ad-hoc querying on their data. I have read many articles such as this one which talk about sharding: http://www.craigkerstiens.com/2012/11/30/sharding-your-database/
Conceptually, I understand how sharding works. However, in practice I have no idea how to go about implementing this on Heroku, in terms of what code I need to write and what parts of my application I need to modify. A link to a tutorial or some pointers would be much appreciated.
Here are some resources I have already looked at:
http://www.craigkerstiens.com/2012/11/30/sharding-your-database/
MySQL sharding approaches?
Heroku takes care of multiple database servers?
http://petrohi.me/post/30848036722/scaling-out-postgres-partitioning
http://adam.heroku.com/past/2009/7/6/sql_databases_dont_scale/
https://devcenter.heroku.com/articles/heroku-postgres-follower-databases
Why do people use Heroku when AWS is present? What distinguishes Heroku from AWS?

As the author of the first article, I'm happy to chime in further. When it comes to sharding, one of the key decisions is which key you shard on. The complexity of sharding really comes into play when you have data that is intermingled across different physical nodes. If you have something like a multi-tenant app, then modeling all your data around the idea of a tenant or customer can fit very cleanly in this setup. In that case you'll want to shard every table that is related to a customer on the same key, the same way as the other tenant-related tables.
As for doing this on Heroku, there are two options. You can roll your own with Heroku Postgres and application logic, or use something like Citus (an add-on that helps manage more of this for you).
For rolling your own, you'll first write the application logic that creates all your shards and knows where to route each query. For Rails there are some gems to help with this, like activerecord-multi-tenant or apartment. When it comes to the actual migration to shards, start by creating a Heroku follower of your database. During the migration, have the follower stop following (unfollow) so that it becomes a standalone database. Then remove one half of the data from the original primary and the other half from the former follower, so that each database keeps only its own share of the tenants.
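The routing part of a roll-your-own setup can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes each tenant has an ID and that every shard is reachable through its own connection URL. The names `SHARD_URLS` and `shard_for_tenant` and the example URLs are hypothetical.

```python
import hashlib

# Hypothetical shard list. On Heroku each entry would typically come from a
# config var holding the URL of a separate Heroku Postgres database.
SHARD_URLS = [
    "postgres://shard0.example.com/app",
    "postgres://shard1.example.com/app",
]


def shard_for_tenant(tenant_id, shard_urls=SHARD_URLS):
    """Return the connection URL of the shard that holds this tenant's data."""
    # Use a stable hash: Python's built-in hash() of a string is randomized
    # per process, so it cannot be used to route consistently.
    digest = hashlib.sha256(str(tenant_id).encode("utf-8")).hexdigest()
    return shard_urls[int(digest, 16) % len(shard_urls)]
```

Hashing the tenant ID keeps routing stateless, but adding a shard later changes where most tenants map to. A lookup table from tenant to shard is the usual alternative when individual tenants need to be moved.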

I am not sure I would call this "sharding."
In LedgerSMB, here is how we do things. Each company (business entity) is a separate database with fully separate data. Data cannot be shared between companies. One PostgreSQL cluster can run any number of company databases. We have an administrative interface that creates the database and loads the schema. The administrative interface can also create new users, which can be shared between companies (optionally). I don't know quite how well it would work to share users between DBs on Heroku, but I am including that detail in terms of how we work with PostgreSQL.
So this is a viable approach.
What you really need is something to spin up databases and manage users in an automated way. From there you can require that the user specifies a company name that you can map to a database however you'd like (this mapping could be stored in another database for example).
I know this is fairly high level. It should get you started however.
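The company-to-database mapping mentioned above could look roughly like this. It is only a sketch under assumptions: the catalog is kept in SQLite here so the example runs on its own, whereas in practice it would live in a separate PostgreSQL database. The table name `company_db` and the function names are made up for illustration.

```python
import sqlite3


def create_catalog(conn):
    """Create the lookup table that maps a company name to its database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS company_db ("
        " company TEXT PRIMARY KEY,"
        " database_name TEXT NOT NULL)"
    )


def register_company(conn, company, database_name):
    """Record which database was provisioned for a company."""
    conn.execute(
        "INSERT INTO company_db (company, database_name) VALUES (?, ?)",
        (company, database_name),
    )


def database_for(conn, company):
    """Return the database name for a company, or raise KeyError if unknown."""
    row = conn.execute(
        "SELECT database_name FROM company_db WHERE company = ?", (company,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown company: {company}")
    return row[0]


# Stand-in for the separate catalog database.
catalog = sqlite3.connect(":memory:")
create_catalog(catalog)
register_company(catalog, "acme", "company_acme")
```

At login the application would call `database_for(catalog, company_name)` and open its connection against the returned database.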

Related

Designing databases and applications for hosted / cloud solutions

Are there any resources available that can guide someone on how to 'think' about the various components of a hosted / cloud solution before going ahead and starting to make a hosted application? If that made no sense, what I mean to ask is are there any guidance books/websites on what things need to be considered when making a cloud application?
I am attempting to make a hosted CRM-style software application that will serve many hundreds of customers. The application is powered by a SQL Server database with many tables and a ColdFusion, HTML5, CSS, JavaScript front-end. If I were installing this application and its components at each client site, then each installation would be unique to that customer. But somehow I have to replicate this uniqueness in the cloud, which is baffling me.
Only two things have come to mind so far:
The need for a unique database per customer in SQL Server
The need to change DB connection strings per customer in the web application
My thought process has come to a block when I am trying to envisage how to design the application to serve so many different customers. Even though the application that all customers use is the same (same DB tables, same front-end), the data that they store and retrieve will be specific to them. So I was thinking that surely each customer needs a separate database created for them? Is it feasible to create a replica database for each customer? If I need to update some tables or add a new table, how would I do this for hundreds of different databases?
From the front-end I guess each unique customer log-in would change DB connection strings so that they can only access their database. Other than this I can't think of anything else that needs to change per customer basis.
When a new customer wants to sign up, it needs to be clear to me what I need to create for them to have access to the application. I guess this is ultimately what I need to think of but I'm stuck.
If anyone can suggest some things to think of or if there is a book or website on this kind of thing that someone could point me to I'd really be very thankful.
EDIT:
I was looking at an article about Salesforce.com and it says
"In order to ensure privacy of data for each user and give an effect of each having their own database, the data from different users are securely isolated from one another."
Anyone know how this is achieved or how it may be done?
Found some great information here. It is called multi-tenant database design and seems to be a common topic. Once I get the database designed then the application can sit nicely on top.
https://dba.stackexchange.com/questions/1043/what-problems-will-i-get-creating-a-database-per-customer

Application database/instance decomposition

I'm designing a service that will serve some business entities. Logically it will be divided into two parts:
Frontend - bells and whistles like Wiki, Pricing, Landing Page, maybe account information (billing, account status, and so on).
The service itself, where the business entity's employees will do their work.
It uses the Play 2.x framework, and I am planning to host it on Heroku.
It is not clear for now how to decompose instances and DB stuff.
Should I split the DB per client, with one database per business entity? Or should I store all data in one database, and add to every table the ID of the business entity that owns each row? What issues (performance, administrative, scaling) may come up with this decision?
If I choose to split the databases, how can I do this? I would need to launch an app instance with the DB of the client that the instance belongs to. That gives us non-uniform instances, which can be an obstacle to scaling. And as far as I know, Heroku doesn't support non-uniform (web) instances.
Please help, I'm totally stuck here.
Expected stack:
Scala
Play 2.0
Anorm
JDBC
PostgreSQL
Heroku
All of these (except Scala, and maybe Play 2.0) are interchangeable.
This is a pretty classic problem. You have many clients and you wonder if you should create separate databases for each client - or if they should share a database.
I would recommend starting with one shared database and using that until you outgrow it. Think of some of the disadvantages of giving each client their own database instance:
Like you mention the schema management can be tough. You'd need to write tools to maintain all databases across all servers.
If you tell clients you have structured your system this way, some of them might push you to fork the database. In other words they might argue, "I have my own database! I want a new table just for me."
It's a bit harder to run queries across servers/databases. If you wanted to count how many items all clients have, you'd have to think about that a bit.
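For comparison, here is a minimal sketch of the shared-database approach, where every row carries the ID of the client that owns it. It uses SQLite only so that it runs standalone, and the `items` table and `client_id` column are illustrative names, not anything from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " id INTEGER PRIMARY KEY,"
    " client_id INTEGER NOT NULL,"  # owner of the row
    " name TEXT NOT NULL)"
)
# Most queries filter by client, so index the owner column.
conn.execute("CREATE INDEX items_by_client ON items (client_id)")
conn.executemany(
    "INSERT INTO items (client_id, name) VALUES (?, ?)",
    [(1, "widget"), (1, "gadget"), (2, "sprocket")],
)


def items_for_client(conn, client_id):
    """Return only the rows that belong to one client."""
    rows = conn.execute(
        "SELECT name FROM items WHERE client_id = ? ORDER BY id", (client_id,)
    )
    return [name for (name,) in rows]


def item_counts(conn):
    """Count items for all clients at once: a single query in a shared DB."""
    return dict(
        conn.execute("SELECT client_id, COUNT(*) FROM items GROUP BY client_id")
    )
```

The cross-client count is one `GROUP BY` here, which is the kind of query that needs extra tooling once each client has a separate database. The cost is that every query must remember the `client_id` filter.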
But if you want to start by sharding based on client (http://en.wikipedia.org/wiki/Shard_(database_architecture)), you might consider:
As mentioned previously, you'll need some tools/scripts to launch a new database instance for a client. Often those tools will need to "seed" the database with configuration information - like populating a states table for addresses.
You'll want to have a tool to monitor/maintain the databases. Start one, stop another, see if one has high CPU usage etc.
You'll need some kind of system to aggregate statistics across all clients.
You'll need a tool to roll out schema changes and a plan on how you can gracefully upgrade the database while their web application is running.
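A schema rollout tool of the kind listed above can start out very small. The sketch below is an assumption-laden illustration: it keeps a `schema_version` table in each client database and applies whatever migrations that database has not seen yet. SQLite stands in for the real database servers, and `MIGRATIONS`, `migrate` and `migrate_all` are hypothetical names.

```python
import sqlite3

# Ordered list of schema changes; position N in the list is version N + 1.
MIGRATIONS = [
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE customers ADD COLUMN email TEXT",
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply migrations not yet recorded in this database; return its version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_version"
    ).fetchone()[0]
    for version, statement in enumerate(migrations[current:], start=current + 1):
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
    return max(current, len(migrations))


def migrate_all(connections):
    """Run the same migrations against every client database."""
    return {name: migrate(conn) for name, conn in connections.items()}
```

Because each database records its own version, the rollout can be re-run safely after a partial failure, and a newly created client database is brought up to date by the same call.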
Overall I would advise to start small and simple and only start worrying about scale when you get there.

Building a web application with multiple database instances or just a single instance

I am currently designing a web application where I will have customers signing up as companies. Each company will have its own set of users. As I am designing this I am wondering which approach would work best. I see sites like FogBugz or Basecamp which use subdomains. In cases with subdomains, do you have a database instance per subdomain? I'm wondering if it is recommended to have a database instance per company, or if I should have some kind of company table and manage the company and user data/credentials all from one database.
Which approach is best? Is there literature on this subject (i.e. any web or book)?
thanks in advance!
You have to weigh up your options, as some of this will be a matter of opinion and might not be feasible for your implementation.
That being said, I'd consider the single database approach, for these reasons:
Maintenance: when running a database per registered 'client', you will very easily reach a situation where any changes or upgrades you make to your app's schema have to be applied to every single database instance. This will get ridiculous, fast.
Convenience: You might want analytics and usage stats, or some way to administer all these databases. Querying a single database is trivial compared to trying to aggregate the same query across all your databases. This isn't going to scale.
Scalability *: As mentioned in 2, you're going to require a special sort of aggregation to query things about your clients, and your app as a whole. The bigger your app gets, the more complex your querying. The other issue is, if one client uses the app a lot more than another, what will you be encouraged to optimise? Your app, the bigger client's database, or the smaller client's? Not forgetting anything you do change has to be copied to all databases.
Backups: You can back up one database easily, just by creating a dump and stashing it somewhere. Get a thousand clients and now you have to run 1000 database dumps, and name them well enough to be able to identify them if one single database gets corrupted. How will you even know if this happens? Database errors will be localised to that specific database, as opposed to your entire app.
UI: A user signs up or is invited to use your app, and belongs to one particular client. Are you going to save that user account to the client's database? If so, see scalability for the issue of working with that data when the user wants to change their password, or you want to email them. So, do you tell the user to let you know which database they're in so you can find them?
Simplification: You have a database per client and want to just use a single one. How do you merge them all together without significantly breaking things? There'll be primary key conflicts if you use auto incremented IDs; bookmarked URLs will break if you decide to just regenerate the keys; foreign keys across tables will no longer point to the right records. Your data integrity will go down the pan.
You mention 'white label' services that offer their product through custom subdomains. I'm not privy to how these work, but the subdomain is only a basic CNAME or A record in their DNS zonefile. The process of adding these can be automated, and the design of the application and a bit of server configuration can deal with linking these subdomains to the correct accounts and data. They're just URLs, so maybe on the backend, the app doesn't differentiate between:
http://client.example.com
http://example.com/client
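To illustrate how the backend might treat both URL forms the same way, here is a small sketch that extracts the client name from either a subdomain or the first path segment. `BASE_DOMAIN` and `client_from_url` are assumed names, and a real app would also exclude reserved subdomains such as `www`.

```python
from urllib.parse import urlsplit

BASE_DOMAIN = "example.com"  # assumed domain the app is served from


def client_from_url(url, base_domain=BASE_DOMAIN):
    """Return the client name from a subdomain or a leading path segment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.endswith("." + base_domain):
        # client.example.com -> "client"
        return host[: -len(base_domain) - 1]
    if host == base_domain:
        # example.com/client -> "client"; no path means no client
        first_segment = parts.path.strip("/").split("/")[0]
        return first_segment or None
    return None
```

Both forms then resolve to the same account lookup, so the choice between subdomains and paths stays a DNS and routing detail rather than a database design decision.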
Overall though, you may decide that all these problems are things you can and would prefer to deal with. Be warned, however, that by doing so you may be shooting yourself in the foot, and you can gain a lot more from crafting a well-designed single database schema and a well-abstracted front-end.
*@xQbert mentions the very real benefit of scalability with multiple databases. I've amended this answer to clarify that I was more concerned with other aspects.

How to design a DB for several projects

I'm wondering what would be the best way to organize my DB. Let me explain:
I'm starting a new "big" project. This big project will be composed of a few little ones. In general the little projects are not related to each other; they are just features of the big one.
One thing that all the projects have in common is the users that are going to use them.
So my questions are:
Should I create a different DB for each one of the little projects? (Currently each project will contain 4-5 tables.)
How should I deal with the users? Should I create one DB for all the users, or should I duplicate the users table in every DB? Keep in mind that the information about the users is used a lot in every little project; it's NOT only for identification purposes.
Thanks in advance for your advice.
This greatly depends on the database you choose to use.
If these "sub-projects" are designed to work as one coherent unit, then I strongly recommend you keep it all in the same database. One backup, one restore, one unit.
For organizational purposes, if you are using a database which supports it, select a different Schema per project. PostgreSQL and SQL Server are two databases (among others) which support this effortlessly.
In the case of a database like MySQL, I recommend you pick a short prefix for each subproject and prefix all tables accordingly. "P1_Customer" for example.
Shared data would go in its own schema or prefix, like Global or something like that.
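As a rough illustration of the prefix convention, the sketch below keeps two sub-projects and a shared users table in one database and queries across them. SQLite is used only so the example is runnable, and the table names (`Global_User`, `P1_Order`, `P2_Ticket`) are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Shared data lives under the Global_ prefix.
    CREATE TABLE Global_User (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Each sub-project gets its own short prefix.
    CREATE TABLE P1_Order (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES Global_User (id),
        total REAL NOT NULL
    );
    CREATE TABLE P2_Ticket (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES Global_User (id),
        subject TEXT NOT NULL
    );
    INSERT INTO Global_User (id, name) VALUES (1, 'alice');
    INSERT INTO P1_Order (user_id, total) VALUES (1, 9.5);
    INSERT INTO P2_Ticket (user_id, subject) VALUES (1, 'login problem');
    """
)


def activity_for(conn, user_name):
    """Return (order count, ticket count) for a user, or None if unknown."""
    return conn.execute(
        "SELECT (SELECT COUNT(*) FROM P1_Order o WHERE o.user_id = u.id),"
        "       (SELECT COUNT(*) FROM P2_Ticket t WHERE t.user_id = u.id)"
        " FROM Global_User u WHERE u.name = ?",
        (user_name,),
    ).fetchone()
```

Because everything is in one database, both sub-projects can reference the shared users table with ordinary foreign keys, and one backup covers all of it.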
Actually, this was one of the many reasons we switched our main database from MySQL to PostgreSQL. We've been heavy users of both, and I really appreciate the features that PostgreSQL offers. SQL Server, if you are in a Windows environment, is a great database IMO as well.
If the little projects are "features of the big one" then I don't see a reason why you wouldn't want just one user table for the main project. The way you set up the question makes this seem true: "If there is a user A in little project 1, then there must be a user A in the 'big' project." If that is true, you should likely keep the users in the big DB instead of duplicating them, unless you have more qualifying details.
I think the proper answer is "it depends".
Starting your organization down the path of a single centralized system is good on many levels. I think in general I would recommend this.
However:
If you are going to have dramatically different development schedules, or dramatically different user experiences with the various sub-projects, then you may be better off keeping them separate.
I'd have a look at OpenID or some other single sign-on protocol depending on the nature of your application. OpenID includes a mechanism called "attribute exchange", which allows applications to retrieve profile information from the OpenID provider.
This allows you to create a central user profile repository, with an authentication scheme, and have your individual apps query that repository for profile information.
The question as to how to design your database is hard to answer without more information. In most architectures, "features" within an application tend to be closely linked - "users" are related to "accounts" are related to "organisations" etc.
I'd recommend looking at the foreign key relationships to answer this question. If you have lots of foreign keys, build a single database for all tables. If you have "clusters" of foreign keys, and you want to have a different life cycle for each application (assuming the clusters map neatly to the applications), consider separate databases.
By "life cycle", I mean mostly the development lifecycle - app 1 might deploy weekly, app 2 monthly, app 3 once only and then be frozen.
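One way to look for such clusters is to treat tables as nodes and foreign keys as edges, then list the connected groups. The sketch below does that with a plain graph traversal. The function name `fk_clusters` is hypothetical, and the foreign keys are passed in as (child table, parent table) pairs that you would read from your own schema.

```python
from collections import defaultdict


def fk_clusters(tables, foreign_keys):
    """Group tables into clusters connected by foreign keys.

    foreign_keys is an iterable of (child_table, parent_table) pairs.
    Returns a list of clusters, each a sorted list of table names.
    """
    neighbours = defaultdict(set)
    for child, parent in foreign_keys:
        neighbours[child].add(parent)
        neighbours[parent].add(child)

    seen = set()
    clusters = []
    for table in tables:
        if table in seen:
            continue
        # Walk everything reachable from this table through foreign keys.
        cluster, stack = set(), [table]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            cluster.add(current)
            stack.extend(neighbours[current] - seen)
        clusters.append(sorted(cluster))
    return clusters
```

If this returns a single cluster, the tables are too intertwined to split. Several clusters that line up with the applications suggest where separate databases might be drawn.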

Any recommendations for an effective way to sync data from one database to other apps' databases?

Here's my problem. I built a web app, and naturally kept the data in a database which describes that app's domain. Afterwards, I built another web app for the same organization, and used a separate database to describe that app's domain and store its data... and naturally a couple more projects came up, and for each app I've isolated its data in a single database. Development-wise, I think it's OK, as I can maintain changes to the data structure and data in each app's database.
Considering these apps belong to the same organization, there tends to be plenty of data replicated between them, like department names, job titles, shop names, etc. Most of these tables hold the same data, but are not exactly the same in each database, and are not always used by all of the apps. Changes to this data, though, need to be made in all the apps (sometimes in different ways), creating a growing management "hassle".
So I've been thinking of a way to get some synchronization between the data. I want easier management - update in one app (or a central app) and update all the databases as needed by each app - and also a better way to share data between apps (like maybe mashing up data from different apps in a new app to allow specific analysis). Most of the data I'm referring to is used as constraints rather than being a core domain concept, describing the organization rather than describing a particular domain.
I'm looking for opinions on some ways to get this done.
My first idea was to grab common data structures, like the department names table I mentioned, and stick them in a core database. Any updates to the data would be done in this database, through a dedicated web app, and I'd apply some sort of Observer or Publisher/Subscriber pattern for these changes: on changes the app would notify observing apps (through their dedicated web services) that the changes occurred and allow each app to grab the new data and use it as it needs. GUIDs could be used as a reference to identify the same data throughout the apps. Also, I could build web services for read and search operations that don't need to be in a specific app's database, but could be useful to it.
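The first idea can be sketched in a few lines. This is a simplified, in-process illustration only: in the setup described, each callback would instead be a call to the observing app's web service. The class name `CoreData` and its methods are made up for the example.

```python
import uuid


class CoreData:
    """Central store for shared records that notifies subscribed apps."""

    def __init__(self):
        self._records = {}      # guid -> (kind, value)
        self._subscribers = {}  # kind -> list of callbacks

    def subscribe(self, kind, callback):
        """Register callback(guid, value) for changes to one kind of data."""
        self._subscribers.setdefault(kind, []).append(callback)

    def upsert(self, kind, value, guid=None):
        """Create or update a record, then notify every subscriber of that kind."""
        guid = guid or str(uuid.uuid4())
        self._records[guid] = (kind, value)
        for callback in self._subscribers.get(kind, []):
            callback(guid, value)
        return guid
```

An app that cares about departments would subscribe to the `"department"` kind and store each value in its own database under the GUID, so a later rename arrives as an update to the same key rather than as a new row.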
A second idea would be that each app manages its own data, and the apps could observe one another. A change in one could notify others that share the same data structure that the change occurred. I could still use GUIDs and even build services on any of the apps. I think this would also involve less duplication of data, but might be harder to manage, as each app would eventually be coupled to other apps, and I would somehow have to distribute responsibilities as to which app controls what information.
I'm really curious whether something in this genre of data distribution and syncing would work and even be recommended. Opinions and other ideas are more than welcome!
What you describe here is a typical case for a "Master Data Management" system. EAI vendors (Oracle, TIBCO, IBM) offer such products. They resemble your first solution, being centralised databases with synchronization processes, detecting changes in external data sources, grabbing the changes and synchronizing data out to other external databases. They also provide a user interface to change master data directly.
MDM software is expensive, but you can implement a custom solution which will be - at least initially - cheaper than purchasing one. Both of your solutions make technical sense, but there is a difference in their manageability.
The first one is better, if you can dedicate a responsible person/organization to take care of it and the business owners of your services can agree on making changes via this new centralised system.
The second solution shares the responsibility between the service owners. The hard task here is to identify the owner of each type of information (business object).
I cannot advise a solution without a deeper knowledge of your systems and organizations, but I hope I could give some ideas.

Resources