How many databases are needed for a social website? I have my tech team working on developing a social site but all their tables are in 1 database. I wanted to create separate table sets for user data, temporary tables, etc and thinking maybe have one separate database only for critical data, etc but I am not a tech person and now sure how this works? The site is going to be a local reviews website.
This is what happens when management tries to make tech decisions...
The simple answer, as always, is as few as possible.
The slightly more complicated answer is that once your begin to push the limits of your server and begin to think about multiple servers with master/slave replication then your may want your frequent write tables separated from your seldom write tables which will lower the master-slave update requirements.
If you start using seperate databases you can also run into an with you backup / restore strategy. If you have 5 databases and backup all five, what happens when you need to restore one of them, do you then need to restore all five?
I would opt for the fewest number of databases.
The reason you would want to have multiple databases is for scaling-out to multiple machines. In the context of a "social application" where large volume / high availability is a concern. If you anticipate the need to scale out to multiple machines to handle high volumes then the breakout of tables should be those that logically need to stay together.
So, for example, maybe you want to keep tables related to a specific subject area (maybe status updates) together in one database and other tables that are related to a different subject area (let's say user's picture libraries) together in a different database.
There are logical and performance reasons to keep tables in separate physical or logical databases.
What is the reason that you want it in different databases?
You could just put all tables in one database without a problem, even with for example multiple installations of an open source package. In that case you can use table prefixes.
Unless you are developing a really BIG website, one database is the way to proceed (by the way, did you consider the possible issues that may raise when working with various databases?).
If you are worried about performance, you can always configure different tablespaces on several storage devices in order to improve timings.
If you are worried about security, just increase it (better passwords, no direct root login, no port forwarding, avoid tunneling, etc.)
I am not a tech person only doing the functional analysis but I own the project so I need to oversee the tech team. My reason to have multiple database is security and performance.
Since this is going to be a new startup, there is no money to invest into strong security or getting the database designed flawless. Plus there are currently no backup policies in place so:
1) I want to separate critical data like user password/basic profile info, then separate out user media (photos they upload on their profile) and then the user content. Then separate out the system content. Current design is to have to layers of tables: Master tables for entire system and module tables for each individual module.
2) Performance: There are a lot of modules being designed and this is a data intensive social site with lots of reporting / analytic being builtin so lots of read/writes. Maybe better to distribute load across database based on purpose?
Since there isn't much funding hence I want to get it right the first time with my investment so the database can scale & work well until revenue comes in to actually invest in getting it right. Ofcourse that could be maybe 6 months away and say a million users away too.
Oh & there is plan to add staging/production mode also so seperate or same database?
You'll be fine sticking with using one database for now. Your developers can isolate/seperate application data by making use of database schema. Working with multiple databases can quickly become a journey through a world of pain and is to be avoided unless its absolutely crucial.
Related
Raven DB creating multiple Databases to support different replication strategies.
Recently I was tasked with creating an additional raven database to store information pertaining to users. So the solution I working on would have some information in one Raven database and user information in another Raven Database. The reason for the request is so we could support different replications strategies for the two databases. Given my understanding raven only supports a single replication strategy per RavenDB.
First I would like to know if anyone has created an application with two raven databases?
Second I would like to know what problems you might have encountered, and a general sense of what issues I can plan for or mitigate early on?
Thank you ahead of time,
Having multiple Raven databases is possible, but only advisable in certain situations.
If each database could potentially be on a different server (as one would assume since you're talking about replicating differently) then each must have its own DocumentStore, which is fairly expensive to set up, but this only should happen once at application startup anyway, and you're talking about 2, not 50.
As Matt mentions in the comments, if you have two databases on the same server, then you can use the same DocumentStore and specify the database name when you open the session.
Each database should be for logically very different things. You won't easily be able to commingle data between the two databases. If a document in one database contained a reference to a document id in the other database, you wouldn't be able to use the Include features to get both of those documents in one round trip - there would essentially be a wall between the databases. Indexes could not span between the databases, for example.
Accessing both databases would require spinning up an IDocumentSession for each, both of which would need to be managed separately. If you're managing your document sessions at an infrastructure level (i.e. one session per HTTP request) then having two complicates things quite a bit.
However, if you have a segmented type of application, this can work quite well. For example, if you had a Users database that provided single sign-on across multiple websites (or areas of a website) then this could be a good fit. On most pages the user info would be essentially read-only (like to display the black bar at the top of Stack Overflow), except for the user management pages.
This could also be common if you're going for an Udi Dahan style SOA application structure where you define service boundaries and each service has its own independent database.
Im wondering what will be the best way to organize my DB. Let me explain:
Im starting a new "big" project. This big project will be composed by few litle ones. In general the litle projects are not related to each other, they are just features of the big one.
One thing that all the projects have in common is the users that are going to use it.
So my questions are:
Should i create different DB for each one of the litle projects
(currently each project will contain 4-5 tables)
How to deal with the users? Should I create one DB for all the users
or should i
duplicate the users table in every DB? Have in mind that the
information about the users is used a lot in every litle project,
it's NOT only for identification purposes.
Thanks in advance for your advice.
This greatly depends on the database you choose to use.
If these "sub-projects" are designed to work as one coherent unit, then I strongly recommend you keep it all in the same database. One backup, one restore, one unit.
For organizational purposes, if you are using a database which supports it, select a different Schema per project. PostgreSQL and SQL Server are two databases (among others) which support this effortlessly.
In the case of a database like MySQL, I recommend you pick a short prefix for each subproject and prefix all tables accordingly. "P1_Customer" for example.
Shared data would go in it's own schema or prefix, like Global or something like that.
Actually, this was one of the many reasons we switched our main database from MySQL to PostgreSQL. We've been heavy users of both, and I really appreciate the features that PostgreSQL offers. SQL Server, if you are in a windows environment, is a great database IMO as well.
If the little projects are "features of the big one" then I don't see a reason why you wouldn't want just one user table for the main project. The way you setup the question makes this seem true "If there is a user A in little project 1, then there must be a user A in the 'big' project." If that is true, you should likely have the users in the big db instead of doing duplication unless you have more qualifying details.
i think the proper answer is 'it depends'.
Starting your organization down the path of single centralized system is good on many levels. I think in general i would recommend this.
however:
if you are going to have dramatically different development schedules, or dramatically different user experiences with the various sub projects, then you may be better off keeping them separate.
I'd have a look at OpenID or some other single sign-on protocol depending on the nature of your application. OpenID includes a mechanism called "attribute exchange", which allows applications to retrieve profile information from the OpenID provider.
This allows you to create a central user profile repository, with an authentication scheme, and have your individual apps query that repository for profile information.
The question as to how to design your database is hard to answer without more information. In most architectures, "features" within an application tend to be closely linked - "users" are related to "accounts" are related to "organisations" etc.
I'd recommend looking at the foreign key relationships to answer this question. If you have lots of foreign keys, build a single database for all tables. If you have "clusters" of foreign keys, and you want to have a different life cycle for each application (assuming the clusters map neatly to the applications), consider separate databases.
By "life cycle", I mean mostly the development lifecycle - app 1 might deploy weekly, app 2 monthly, app 3 once only and then be frozen.
I'm planing to create an accounting application in asp.net mvc.
Each user will pay monthly subscription, I'll provide daily backups, etc.
I don't know which strategy to choose:
To use SQL CE4 and run application in separate virtual directory for each user.
To put everything in a single SQL Server database
What are cons and pros of these two options?
If I choose option 2, I'll have to write more logic inside code. I must prevent users from seeing things they should not see (more trips to database).
The two most important things are:
General performance of the system
Ability to easily create backups for separate users.
I expect between 20 and 30 users.
Any suggestion will be appreciated.
I'd go for option 2 as always. I come from an intense financial integration background so option 2 (to me at least) is best.
There are no con's when considering the use of data. There are no more round trips than there would be with option 1.
Performance of the system is generally left up to the programmer. If you write shitty code you can expect terrible performance. Option 2 provides the easiest backup option as you only have to worry about 1 database. There's no need to consider that for separate users at all if you go with 1 database.
RE: the round trip's for users (or more "logic" in your code to handle that), how hard is it to add "Where userID = x" ?
You should be separating your .net code with stored procedures anyway so writing from the ground up shouldn't have anymore (or even less) round trips/file access issues than you would have with option 1.
You are probably not going to realistically be able to use SQL Server's backup feature if you go for a multi-tenant design in a single database, because the unit of backup is not going to be fine enough to back up each user's data individually. You can use partitions and filegroups and backup only filegroups, but the partition feature is not exactly easy to administer, and would probably not be good for this purpose.
If you are expecting to use SQL Server's backup and restore capability independently for users (which it sounds like from your comment), you probably need to have them in separate databases.
Having said that, I would re-think that requirement (think about implementing selective export/import), because a multi-tenant architecture is a lot easier to deal with overall.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I'm designing a few applications that will share 2 or 3 database tables and all of the other tables will be independent of each app. The shared databases contain mostly user information, and there might occur the case where other tables need to be shared, but that's my instinct speaking.
I'm leaning over the one database for all applications solution because I want to have referential integrity, and I won't have to keep the same information up to date in each of the databases, but I'm probably going to end with a database of 100+ tables where only groups of ten tables will have related information.
The database per application approach helps me keep everything more organized, but I don't know a way to keep the related tables in all databases up to date.
So, the basic question is: which of both approaches do you recommend?
Thanks,
Jorge Vargas.
Edit 1:
When I talk about not being able to have referential integrity, it's because there's no way to have foreign keys in tables when those tables are in different databases, and at least one of the tables per application will need a foreign key to one of the shared tables.
Edit 2:
Links to related questions:
SQL design around lack of cross-database foreign key references
Keeping referential integrity across multiple databases
How to salvage referential integrity with mutiple databases
Only the second one has an accepted answer. Still haven't decided what to do.
Answer:
I've decided to go with a database per application with cross-database references to a shared database, adding views to each database mimicking the tables in the shared database, and using NHibernate as my ORM. As the membership system I'll be using the asp.net one.
I'll also use triggers and logical deletes to try and keep to a minimum the number of ID's I'll have flying around livin' la vida loca without a parent. The development effort needed to keep databases synced is too much and the payoff is too little (as you all have pointed out). So, I'd rather fight my way through orphaned records.
Since using an ORM and Views was first suggested by svinto, he gets the correct answer.
Thanks to all for helping me out with this tough decision.
Neither way looks ideal
I think you should consider not making references in database layer for cross-application relations, and make them in application layer. That would allow you to split it to one database per app.
I'm working on one app with 100+ tables. I have them in one database, and are separated by prefixes - each table has prefix for module it belongs to. Then i have built a layer on top of database functions to use this custom groups. I'm also building data administrator, which takes advantage of this table groups and makes editing data very easy.
It depends and your options are a bit different depending on the database and frameworks you're using. I'd recommend using some sort of ORM and that way you don't need to bother that much. Anyways you could probably put each app in it's own schema in the database and then either reference the shared tables by schemaname.tablename or create views in each application schema that's just a SELECT * FROM schemaname.tablename and then code against that view.
There are no hard and fast rules to choose one over the other.
Multiple databases provide modularity. As far as sync-ing across multiple databases are concerned, one can use the concept of linked servers and views thereof and can gain the advantages of integrated database (unified access) as well.
Also, keeping multiple databases can help better management of security, data, backup & restore, replication, scaling out etc!
My 2cents.
THat does not sound like "a lot of applications" at all, but like "one application system with different executables". Naturally they can share one database. Make smart usage of Schemata to isolate the different funcational areas within one database.
One database for all application in my opinion .Data would be stored once no repitation.
With the other approach you would end up replicating and in my opinion when you start replicating it will bring its own headache and data would go out of sync too
The most appropriate approach from scalability and maintenence point of view would be to make the "shared/common" tables subset self-sufficient and put it to "commons" database, for all others have 1 db per application of per logical scope (business logic determined) and maintain this structure always
This will ease the planning and execution commissioning/decommissioning/relocation/maintenence procedures of your software (you will know exactly which two affected DBs (commons+app_specific) are involved if you know which app you are going to touch and vice versa.
At our business, we went with a separate database per application, with cross database references for the small amount of shared information and an occasional linked server. This has worked pretty well with a development, staging, build and production environments.
For users, our entire user base is on windows. We use Active Directory to manage the users with application references to groups, so that the apps don't have to manage users, which is nice. We did not centralize the group management, that is each application has tables for groups and security which is not so nice but works.
I would recommend, that if your applications are really different, to have a database per application. Looking back, the central shared database for users sounds workable as well.
You can use triggers for cross database referential integrity:
Create a linked server to the server that holds the database that you want to reference. Then use 4-part naming to reference the table in the remote database that holds the reference data. Then put this in the insert and update triggers on the table.
EXAMPLE(assumes single row inserts and updates):
DECLARE #ref (datatype appropriate to your field)
SELECT #ref = refField FROM inserted
IF NOT EXISTS (SELECT *
FROM referenceserver.refDB.dbo.refTable
WHERE refField = #ref)
BEGIN
RAISERROR(...)
ROLLBACK TRAN
END
To do multi row inserts and updates you can join the tables on the reference field but it can be very slow.
I think the answer to this question depends entirely on your non functional requirements. If you are designing a application that will one day need to be deployed across 100's of nodes then you need to design your database so that if need be it could be horizontally scaled. If on the other hand this application is to be used by a hand full of users and may have a short shelf life then you approach will be different. I have recently listened to a pod cast of how EBAY's architecture is set-up, http://www.se-radio.net/podcast/2008-09/episode-109-ebay039s-architecture-principles-randy-shoup, and they have a database per application stream and they use sharding to split tables across physical nodes. Now their non-functional requirements are that the system is available 24/7, is fast, can support thousands of users and that is does not lose any important data. EBAY make millions of pounds and so can support the effort that this takes to develop and maintain.
Anyway this does not answer your question:) my personnel opinion would be to make sure your non-functional requirements have been documented and signed off by someone. That way you can decide on the best Architecture. I would be tempted to have each application using its own database and a central database for shared data. And I would try to minimise the dependencies between them, which I'm sure is not easy or you would have done it:), but I would also try to steer clear of having to produce some sort of middle ware software to keep tables in sync as this could create a headaches for you.
At the end of the day you need to get your system up and running and the guys with the pointy hair wont give a monkeys chuff about how cool your design is.
We went for splitting the database down, and having one common database for all the shared tables. Due to them all being on the save SQL Server instance it didn't affect the cost of running queries across multiple database.
The key in replication for us was that whole server was on a Virtual Machine (VM), so for replication to create Dev/Test environments, IT Support would just create a copy of that image and restore additional copies when required.
In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client ,and move them to a different server. Or restore a backup of that client when the call up to say "We've deleted some key data!", using the builtin database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to doing easy reporting on all clients at once, are minimal - how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but not theirs. I don't care how well everything is indexed and optimized there will be queries that determine that they have to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx
Scalability. Security. Our company uses 1 DB per customer approach as well. It also makes code a bit easier to maintain as well.
In regulated industries such as health care it may be a requirement of one database per customer, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have have many smaller databases to distribute over multiple machines you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster but majority probably don't).
I'd be interested in hearing a little more context from you on this because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world Google Bigtable etc. but they are solving a different set of problems, and lose some of the useful features from an RDBMS.
There are a couple of meanings of "database"
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.