Related
I'm working on a project aimed to analyze biometric data collected from various terminals. The process is not very performance critical. Rather it's I/O bounded. Amount of data is very huge. (hundreds of millions records per table). Unfortunately database is relational. And there are 20 foreign keys. Changing values of referenced keys is very common during completion of job. So there will be lots of UPDATE and SET NULL s during collecting data.
Currently, semantics of database is designed. All programs are almost completed, and also a MySQL prototype for database is created. It works fine with sample (small-scale) data.
I do a search to find a suitable DBMS for the project. Googling around "DBMS comparisons" ,... didn't help. People say antithesis things. Some say MySQL will perform faster inserts and updates, some say Oracle9 is better...
I can't find any reliable, benchmark-based comparison between DBMS. I use MySQL in everyday projects, but this one looks more critical.
What we need:
License and cost of DBMS is not important, but of course an open source (GPL or LGPL) is preferred (since whole project is will be published under LGPL).
Very fast inserts, very fast updates, a lot of foreign keys is needed.
DBMS should response to 0 - 100 connections at a time.
Terminals are connected to server by a local network (LAN).
What I'm actually looking for, is a benchmark of various DBMS's. It may contain charts, separated comparisons of different operations (insert, update, delete) in various situations (on a relation with referenced fields, or normal table)...
For this sort of answer, I would recommend PostgreSQL, Informix, or Oracle. PostgreSQL is open source (BSDL, GPL compatible, as everyone agrees). The reasons have to do with some aspects of data modelling that may be extremely helpful in your case. In general you have two important questions:
1) How far can I tune my db for what I am doing? How far can I scale it?
and
2) How can I model my data?
On the first, Oracle and PostgreSQL are more complex but more flexible. That flexibility may come in handy. On the second, the flexibility may save you a lot of effort later. Moreover it opens up new doors regarding optimization which are not possible in a straight relational model. First I would recommend looking at this: http://db.cs.berkeley.edu/papers/Informix/www.informix.com/informix/corpinfo/zines/whitpprs/illuswp/wave.htm as it will give you some background as to what I am thinking. Additionally, if you look at what Stonebraker is talking about you will see that straight benchmarks are really an apples to oranges comparison here.
The idea of going with an ORDBMS means a few important things:
You can model data functionally dependent on your data. For example you can have a function in Java or Python which manipulates your data and returns a result. You can index the output of those functions, trading insert for select performance if you need to, or not, trading between insert and select performance.
Less data being stored means faster inserts.
An ability to extend your data with custom types and functions, providing higher performance access to your data.
PostgreSQL 9.2 will support up to approx 14000 writes per sec on sufficient hardware, which is nothing to sneeze at. Of course this depends on the width of the write, hardware performance on the server, etc. PostgreSQL is used by Affilias to manage the .org and .info top-level domains (web-scale!) and also by Skype's infrastructure (still, even after Microsoft bought them).
Finally as a part of your information pipeline, if you are processing huge amounts of data and need to do some preprocessing before sending to PostgreSQL, you might look at array-native db's (for a NoSQL approach common in scientific work) or VoltDB (for an in-memory store for high-throughput processing). Despite the fact that they are extremely different systems, VoltDB and Postgres were actually started by the same individual.
Finally regarding benchmark charts, the major db vendors more or less ban publication of such in their license agreements so you won't find them.
I'm working for a company running a software product based on a MS SQL database server, and through the years I have developed 20-30 quite advanced reports in PHP, taking data directly from the database. This has been very successful, and people are happy with it.
But it has some drawbacks:
For new changes, it can be quite development intensive
The user can't experiment much with the data - it is locked to a hard-coded view
It can be slow for big reports
I am considering gradually going to a OLAP-based approach, which can be queried from Excel or some web-based service. But I would like to do this in a way that introduces the least amount of new complexity in the IT environment - the least amount of different services, synchronization jobs etc!
I have some questions in this regard:
1) Workflow-related:
What is a good development route from "black box SQL server" to "OLAP ready to use"?
Which servers and services should be set up, and which scripts should be written?
Which are the hardest/most critical/most time-intensive parts?
2) ETL:
I suppose it is best to have separate servers for their Data Warehouse and Production SQL?
How are these kept in sync (push/pull)? Using which technologies/languages?
For me SSIS looks overly complicated, and the graphical workflow doesn't appeal much to me -- I would rather like a text based script that does the job. Is this feasible?
Or is it advantagous to use the graphical client with only one source and one destination?
3) Development:
How much of this (data integration, analysis services) can be efficiently maintained from a CLI-tool?
Can the setup be transferred back and forth between production and development easily?
I'm happy with any answer that covers just some of this - and even though it is a MS environment, I'm also interested to hear about advantages in other technologies.
I only have experience with Microsoft OLAP, so here are my two cents regarding what I know:
If you are implementing cubes, then separate the production SQL Server from the source for the cubes. Cubes require a lot of SELECT DISTINCT column_name FROM source.table. You don't want cube processing to block your mission critical production system.
Although you can implement OLAP cubes with standard relation tables, you will quickly find that unless your data is a ledger-style system you will probably need to fully reprocess your fact and dimension tables and this will require requerying the source database over and over again. That's a large argument for building a separate data warehouse that uses ledger-style transactions for the fact tables. For instance, if a customer orders something and then cancels it, your source system may track this as a status change. In your fact table, you probably need to show this as a row for ordering that has a positive quantity and revenue stream and a row for cancelling that has a negative quantity and revenue stream.
OLAP may be overkill for your environment. The main issue you appeared to raise was that your reports are static and users want access to the data directly. You could build a data model and give users Report Builder access in SSRS, or report writing access in some other BI suite like Cognos, Business Objects, etc. I don't generally recommend this approach since it is way beyond what most users should have to know to get data, but in a small shop this may be sufficient and it is easy to implement. Let's face it -- users generally just want to get the data into Excel to manipulate it further. So if you don't want to give them a web front-end and you just want them to get to the data from Excel, you could give them direct database access to a copy of the production data. The downside of this approach is users don't generally understand SQL or database relationships. OLAP helps you avoid forcing users to learn SQL or relationships, but is isn't easy to implement on your end. If you only have a couple of power users who need this kind of access, it could be easy enough to teach the few power users how to do basic queries in Excel against the database and they will be happy to get this tomorrow. OLAP won't be ready by tomorrow.
If you only have a few kinds of source data systems, you could get away with building a super-dynamic static report. For instance, I have a report that was written in C# that basically allows users to select as many columns as they want from a list of 30 columns and filter the data on a few date range fields and field filter lists. This simple report covers about 40% of all ad hoc report requests from end-users since it covers all the basic, core customer metrics and fields. We recently moved this report to SSRS and that allowed us to up the number of fields to about 100 and improved the overall user experience. Regardless of the reporting platform, it is possible to give users some dynamic flexibility even in the confines of a static reporting system.
If you only have a couple of databases, you can probably backup and restore the databases as your ETL. However, if you want to do anything beyond that, then you might as well bite the bullet and use SSIS (or some other ETL tool). Once you get into ETL for data warehousing, you are going to use a graphic-oriented design tool. Coding works well for applications, but ETL is more about workflows and that's why the tools tend to converge on a graphical UI. You can work around this and try to code a data warehouse from a text editor, but in the end you are going to lose out on a lot. See this post for more details on the differences between loading data from code and loading data from SSIS.
FEEDBACK ON HOW TO USE CUBES WITH A RELATIONAL DATA STORE
It is possible to implement a cube over a relational data store, but there are some major problems with using this approach. The main reason it is technically feasible has to do with how you configure your DSV. The DSV is essentially a logical layer between the physical database and the cube/dimension definitions. Instead of importing the relational tables into the DSV, you could define Named Queries or create views in the database that flatten the data.
The advantage of this approach are as follows:
It is relatively easy to implement since you don't have to build an entire ETL subsystem to get started with OLAP.
This approach works well for prototyping how you want to build a more long-term solution. You can prototype it in 1-2 days and show some of the benefits of OLAP today.
Some very, very large tables don't have to be completely duplicated just to support an OLAP cube. I have several multi-billion row tables that are almost completely standardized fact tables. The only columns they don't have are date keys and they also contain some NULL values on fields that shouldn't have nulls at all. Instead of duplicating these very massive tables, you can create the surrogate date keys and set values for the nulls in the view or named query. If you aren't going to see a huge performance boon for duplicating the table, then this may be a candidate for leaving in a more raw format in the database itself.
The disadvantages of this approach are as follows:
If you haven't built a true Kimball method data warehouse, then you probably aren't tracking transactions in a ledger-style. Kimball method fact tables (at least as I understand them) always change values by adding and subtracting rows. If someone cancels part of an order, you can't update the value in the cube for the single transaction. Instead, you have to balance out the transaction with a negative value. If you have to update the transaction, then you will have to fully reprocess the partition of the cube to replace the value which can be a very expensive operation. Unless your source system is a ledger-style transaction system, you will probably have to build a ledger-style copy in your ETL subsystem.
If you don't build a Kimball method data warehouse, then you are probably using unobscured and possibly non-integer primary keys in your database. This directly impacts query performance inside the cube. It also sets you up for having a theoretically inflexible data warehouse. For instance, if you have an product ordering system that uses an integer key and you start using a second product ordering system either as a replacement for the legacy system or in tandem with the legacy system, you may struggle to combine the data together merely through the DSV since each system has different data points, metrics, workflows, data types, etc. Worse, if they have the same data types for the order id and the order id values overlap between systems, then you must declare a surrogate key that you can use across both systems. This can be difficult, but not impossible, to implement without using a flattened data warehouse.
You may have to build the system twice if you start with the relational data store and then move to flattened database. Frankly, I think the amount of duplicated work is trivial. Most of what you learned building the cube off a relational data store will translate to setting up the new OLAP cube. The main problem, though, is that you will probably create a new cube altogether and then any users of the old cube will have to migrate to the new cube. Any reports built in SSRS or Excel will probably break at that point and need to be rewritten from the ground up. So the main cost of rebuilding the cube is really on rebuilding dependent objects -- not on rebuilding the cube itself.
Let me know if you want me to expand on any of the above points. good luck.
You're basically asking the million dollar question of "How do I build a DWH". This is not really a question that can decisively be answered.
Nevertheless, here is a kickstart:
If you are looking for a minimum viable product, be aware that you are in a data environment, and not a pure software one. In data-heavy environments, it is much harder to incrementally build a product, because the amount of effort to introduce changes in the system is much greater. Think about it as if every change you make in a piece of software has to be somehow backwards-compatible with anything you've ever done. Now you understand the hell Microsoft are in :-).
Also, data systems involve many third-party tools such as DBs, ETL tools and reporting platforms. The choices you make should be viable for the expected development of your system, else you might have to completely replace these tools down the road.
While you can start with a DB cloning that will be based on simple copy SQLs and then aggregating it or pushing it into an OLAP, I would recommend getting your hands dirty with a real ETL tool from the start. This is especially true if you foresee the need to grow. 9 out of 10 times, the need will grow.
MS-SQL is a good choice for a DB if you don't mind the cost. The natural ETL tool would be SSIS, and it's a solid tool as well.
Even if your first transformations are merely "take this table and dump it in there", you still gain a lot in terms of process management (has the job run? What happens if it fails? etc) and debugging. Also, it is easier to organically grow as requirements and/or special cases have to be dealt with.
Here at work (a multi-billion dollar manufaturing company with a 12 person Windows development team) we are about to go to a single master database for all new applications and will have it broken up with schemas for what we normally would have had databases for before. There will also be a few common schemas with stuff like employee directory and branch directory and so on...
I'm still not sure how I feel about this move, but we're about to have a meeting on this in a few hours to discuss pros, cons, best practices, pitfalls and so on... so I'm looking for your thoughts on this... Is it good? Is it bad? What problems are we going to run into a year from now?
Any thoughts, tips, or advice is welcome. Thanks
EDIT
In response to a comment on this question, we are using SQL Server 2005 and we are actually talking about moving what would have been seperate databases on the same instance into a single database. The driving issue is the complete lack of referential integrity accross databases as the majority of our applications need access to common data such as an employee record, or branch information.
UPDATE
Several people requested that I update this question with the results from our meeting so here it is. We debated back and forth the pros and cons of doing this (I even showed them this question using the projector) and by the time we were done we had pretty much covered the pros and cons covered here. About half of us thought we could get it done with the right resources and commitment, and about half thought we couldn't do it (or that it wouldn't work out well). We decided to use some time with Microsoft to get their thoughts and platform specific advice. I will be sure to update this question and my blog after we've talked to them. Thanks for all the help and helpful answers.
Larger database are harder to maintain due to sheer size: backups take longer, disaster recovery is slower which in turn requires more often backups. You can address these by creating filegroups and using filegroup level backup in your maintenance plans and on crash recovery you can use the 'piecemeal restore' strategy to speed things up.
Proper use of filegroups will make most of the 'cons' cited by previous replies go away: they can distribute the I/O, they can sanitize your maintenance plans and backup/restore strategy, they offer availability by taking offline only the damaged portion of the the db in case of crash. So I'd say that while those 'cons' are legit concerns, they have can be mitigated by a proper deployment strategy. Its true though that these mitigation actions require a true, experienced, dba at the helm as they will go beyond the comfort zone of a developer turned dba by need.
Some of the pros I can think of quickly:
Consistency. You can have a backup-restore so that all data is consistent. Separate dbs don't allow this because you cannot coordinate a consistent set of backups unless you take them all offline, or make them r/o, during the backup.
Dirt cheap high availability: you can deploy database mirroring for disaster recoverability and high availability. Multiple databases have problems because one cannot coordinate a simultaneous failover and apps are faced with the dilemma of seeking each database current location.
Security. While most other posts see one database harder to secure, I'd say is easier to secure. Multiple databases seem harder to secure properly simply because what everyone does is they make one login and add it to that database db_owner group. Having one database will make things harder (unless you end up making everyone dbo, very bad) but once you start doing the right thing (granular access) then one db is not harder than multiple dbs, is actually easier because you won't have to copy/maintain some common groups/rights across multiple dbs.
Control. Will be easier to impose certain policies and good practices on a single db rather than multiple ones (no data access to developers, app data access only through execute rights on the schema to enforce procedures access etc).
There are also some cons I did not see in other posts:
This will be much harder to pull off that you think right now
Increase coupling between formerly separated applications will impose development restrictions: you can't simply alter your schema, you will have to coordinate it with the rest of the apps (you can argue that this was also the case before, but was brushed under the carpet by having separate dbs, and you're right)
Log writes that are now distributed across multiple db logs will be consolidated into one single log file. If your writes are significant, this may turn out to be a serious bottleneck and force you to buy some expensive fast drives for the new, consolidated, log file. In general this can be addresses by making the log drive a stripped array across as many stripes as needed to make it fast enough (usually raid 10).
GAM/SGAM/PFS allocations will also be consolidated, but again this will be alleviated by proper use of file groups.
Pros:
You only need to remember one connection string
When users report that access is slow, you know which DB is causing the trouble
Cons:
Backups of The One DB will take a long time and will get progressively longer over time.
Restoring data from a backup will get increasingly difficult.
Performance Tuning (SQL Profiler, Execution Plan estimation) for a feature for one app will slow down every app.
Restricting access to a single application's data is cumbersome if at all possible which will likely mean in practice that all devs and DBAs will be given keys to the ENTIRE kingdom.
New developers/DBAs have a much larger learning curve as they need to navigate a large and mostly useless (to them) database structure which means higher costs for training/ramp up.
When The One database goes down, everyone in your organization plays solitaire until it is restored.
Creating test instances for app development means copying your entire db
The only "Pro" I can think of is that all of your systems will be in the one database and therefore a single place to backup, store, etc. However, I would consider this to also be one of the biggest "Cons".
Some other general Cons:
Much harder to move an application to a different location/server in the future.
Possible locking issues if any applications make use of tempdb.
Possible unrelated performance degredation on one application when another application is being used.
Much harder to implement an application level security model if all tables are in the same database.
It sounds to me as though your company is transitioning between two completely distinct motives for using database technology. The first is application support. The second is data integration. If I'm right about this, the process will open up a huge can of worms, and many of the issues won't even be addressed by putting all the data in one big database.
Consider two of the points you made. The first is the complete lack of referential integrity across different databases. The second is the idea that each application will have its own schema. What this permits to happen is complete lack of referential integrity across schemas, putting you back in the quicksand you are in now.
Fixing the data so that referential integrity is present, and fixing the schemas so that referential integrity is enforced, and fixing the applications so that the applications agree with the new schemas will turn out to be a monumental task.
Here's what your company really needs to do: Have one single CONCEPTUAL database that contains all "enterprise data", and defined in such a way that both referential integrity and entity integrity are enforced. Revise existing schemas so that they conform to the CONCEPTUAL database except for data that is both purely local to that schema and undocumented in the unified conceptual database. Use constraints wherever needed to guarantee that the data covered by these schemas doesn't lose integrity.
Make the decision about whether these schemas belong in one database or many databases based on database administration, fail soft, security, and performance requirements and NOT on the need to integrate data. Whether you use one platform or multiple platforms is a separable decision.
Where necessary, maintain synchronized copies of the same data in separate databases. Include the overhead of doing this in your performance considerations above.
Document the conceptual database out the gazoo. Don't just settle for definitions of the FORM of data. Insist on definitions of the semantics of the data as well.
Notice that if you use ID fields instead of natural keys to enforce referential integrity, you will have to generate each ID field in one schema, and let the association between ID and dependent data propagate by means of synonyms, views, and synchronized replication.
This is not going to be easy.
If DB is getting bigger, making back-up is getting more difficult because of it's size.
This could mean a serious scalability problem if you want to add high-traffic applications in the future, since it is much easier to add new database servers which run seperate dbs than it is to parrallelize a single DB. At least in SQL Server.
Pros:
The convenience of having everything in one place
Thinking less about good database design
Cons:
Even unrelated things are in one place
Less thinking about good database design leading to poorly normalized data
To me this just sounds like laziness and a belief that all this "fancy ivory tower database stuff" is worthless.
I can see that being scary, but considering the number of businesses that use Oracle EBS, or SAP, or other systems that are, in essence, this same configuration, I don't see it being a Bad Thing™. It's a big move, and will be tough to get correct, but it can really improve integration across the enterprise in the long run.
I've never heard of this approach and would like to know how the meeting goes. I see no real benefit in combining multiple applications into a single database when the data doesn't relate to each other.
I'm thinking you might have issues if you decide that an application requires it's own database server at one point.
Ah, the old EggsInOneBasket design pattern. It's not a favourite.
You're just compounding any problems caused by damage to that database. Spread the risk!
For the referential integrity issue, you can make copies of those shared tables in the subsidiary databases. You can't use real replication, but what you do is deny everything but select on these to most users.
On the same server, you can either push or pull data from the official repository of the master data and insert any new rows/update any changed rows. You can even do this with a trigger in the master database (I don't recommend it, though).
If it's different instances or servers, you can use linked servers or SSIS.
You can put the common data into a "core" schema in each database. Then you can have tools to check that all your core tables in every subsidiary database are consistent. The worse that can happen is that an application is not seeing a new employee because the core isn't updated. And keeping your database separate gives you an ability to decouple and gives you maintenance windows. (You can even decouple and run "standalone" if your master is down for maintenance).
I expect you'll only be seeing a few dozen of these core entity tables in even a largish enterprise.
There are many other ways to solve the referential integrity (RI) issue. I am not as familiar with SQL Server as other DB's. In Informix you can use synonyms to point to objects in other DB's and use these for your RI. In Oracle you can make a DB links to one or more DB's to accomplish the same thing.
These approaches have the issue that if any of the DB's are down the RI will fail causing issues in the dependent DB's. selects would work, but inserts would fail.
Consolidation can be a good idea, depending upon the size of the schema's, and other issues with scalability. SQL Server has serious scalability issues. Other DB platforms allow horizontal scaling with either a share everything approach (Oracle's RAC, latest Informix release) or a partitioned share nothing approach (DB2's DPF, Informix XPS, Netezza, Teradata)
I am with some of the others here interested to hear the results of your meeting.
When dealing with small projects, what do you feel is the break even point for storing data in simple text files, hash tables, etc., versus using a real database? For small projects with simple data management requirements, a real database is unnecessary complexity and violates YAGNI. However, at some point the complexity of a database is obviously worth it. What are some signs that your problem is too complex for simple ad-hoc techniques and needs a real database?
Note: To people used to enterprise environments, this will probably sound like a weird question. However, my problem domain is bioinformatics. Most of my programming is prototypes, not production code. I'm primarily a domain expert and secondarily a programmer. Most of my code is algorithm-centric, not data management-centric. The purpose of this question is largely for me to figure out how much work I might save in the long run if I learn to use proper databases in my code instead of the more ad-hoc techniques I typically use.
1) Concurrency. Do you have multiple people accessing the same dataset? Then it's going to get pretty involved to broker all of the different readers and writers in a scalable fashion if you roll your own system.
2) Formatting and relationships: Is your data something that doesn't fit neatly into a table structure? Long nucleotide sequences and stuff like that? That's not really conveniently tabular data.
Another example: Nobody would consider implementing software like Photoshop to store PSDs in a relational format, because the data structures don't really lend themselves to that type of storage or query pattern.
3) ACID (sort of a corollary to #1): If Atomicity, Consistency, Integrity, and Durability are not challenges with a flat file, then go with a flat file.
For me, the line is crossed once I have to query my data in ways that involve more than a single relationship. Relating two flat data structures on disk is fairly simple, but once we get beyond that, a set-based language like SQL and formal database relationships actually reduce complexity.
I think at some point you'll miss the querying capabilities of a database, but you can consider some minimalistic database alternatives:
SQLite (Great, almost SQL-92 standard compliant)
shsql
SQL Server Compact
I would only write my own on-disk format under very special circumstances. Reusing someone else's code is nearly always faster.
For relational data, I would use SQLite. For key/value pairs, I would use BerkeleyDB (perhaps via KiokuDB). For simple objects, I would use JSON or YAML, but only if I only had a few.
With SQLite and BDB, "a real database" is literally two lines of code away. It is hard to beat that.
The problem with small projects is that they become bigger before we know it. And once they do , we start missing the sql capabilities.
Always design such that a db can be utilized later on if required without ripping apart half of the application.
It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.
If what you need is many appends (INSERTS?) and sequential/few access little/no concurrency, files are the way to go.
On the other hand, when your requirements for concurrency, non-sequential reading/writing, atomicity, atomic permissions, your data is relational by the nature etc., you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C/C++, and highly ubiquitous (if it isn't already included in your programming language -for example Python-, there is surely one available). It can be useful even on db files as big as 1GB, possible more.
If your requirements where bigger, there wouldn't even be a discussion, go for a full-blown RDBMS.
For the kind of applications you are developing in bioinformatics, you are often doing one-shot applications (often scripts that define a workflow of calculations) that answer a specific questions, and you are not likely to be reusing these applications after you answered your question.
Often, you should therefore avoid creating databases to store the results, as after all you are not going to use their features very much.
You will probably be querying some webservices, files, or databases, run some local algorithms on the data gathered from different sources, and produce some tabular or structured output format (xml, json, etc).
For that, I would suggest you to use workflow tools like Knime (or a commercial solution like Inforsense KDE, Accelrys's Pipeline pilot, or Snaplogic, as they allow you to query data in a variety of formats and locations (rdbms, flat files, webservices), run algorithms, and build powerful web apps that allow you to easily publish your workflows to your users and let them interact at specific points).
If your prototype "grows" and you have to build more functionality on top of the data your workflows output, and if the output of your prototype is not likely to change everyday, then it's a wise decision to store a subset of the results in a database. This allows you to plug in powerful reporting tools like BusinessObjects, Crystal reports, jasper reports or whatever reporting solution available out there and show data to your users in a better shape than a spreadsheet or a csv file.
Finally, some development frameworks will make your choices more obvious : if you build a web application using an MVC framework, it is likely that your data will reside in an RDBMS (but please, don't put genomic sequences in a table column :-)).
All in all, it's a case by case choice, depending on your needs for each particular application.
In software I can usually get away with storing values in a XML configuration file or in the registry, e.g. software options. Once I need to persist objects I move to a database because the upfront cost is not that bad compared to the long term effects that relations and reporting can offer.
For bioinformatics you may be interested on that: Blast on DB. The guy who is working on that is a friend of mine and has a work on fast similarity sequence search, he found out to make his own binary storage better than using databases at this point.
I don't know specific details about his solution but you probably can exchange one or two ideias mailing the guy, even sharing code.
Do you need/want SQL queries?
Are multiple people going to want to access the data?
Is your data relational?
If you answered no to those questions, you (probably) don't need a full on database.
First, I'd consider:
How large will the database initially be: # of tables, # of rows
How quickly will it grow?
Is the data frequently queried?
If I were to create a personal recipe app, for example, I know I might add 50 favorite recipes to start and add no more than 5 recipes a year. With that being said, I could easily get by without a database since the size of the data store will have minimal impact on queries.
That said, I would probably use a database for any application where data entry and queries occur (even a small personal recipe app). I don't think it adds a lot of overhead especially when your framework (e.g. Rails) allows you to keep your database dumb (primarily tables, indexes, and constraints). It alleviates the chance that I'll have to eventually port to a database if I decide to scale up.
If you know the format of your data, flat files, if faster/easier to develop with, will be fine. If you expect your record formats to change frequently during development then I'd suggest that ALTER TABLE is your friend. Flat files will also tend to be faster (if you care about speed) unless you expect to implement the equivalent of joins across many combinations of files.
The real benefit of using a RDBMS during development is the flexibility with which you can modify your data schema and the ease with which you can access your data via queries.
Good design will ensure that you keep your data access layer relatively isolated (because of separation of concerns) so it should be a fairly straightforward (if tedious) matter to rework to a database later should it be worthwhile. Or, of course, if you use a database to develop your structures you may subsequently take the app back to flat/indexed files once those structures are crystallized in order to gain performance.
Use whatever persistence technology you're most comfortable with, and scales sufficiently.
YAGNI at least means "Don't add a new technology to your personal stack unless you can't be productive with whatever is already there."
For many (most?) of us, our comfort zone for data persistence is SQL. For some, it might be XML. Just don't write your own until (see paragraph 2).
As someone also doing research in Bioinformatics, I would suggest NOT using a database for these kinds of prototype projects unless you are sure it needs it. If you are on the fence, go with the databaseless solution and stick with flat files. It is also important to note that traditionally Bioinformatics researchers have go the flat file route, which means there are well defined file formats for most types of data in the feild. If you decide to go with a database solution, it may hurt your compatibility with existing research projects.
In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client ,and move them to a different server. Or restore a backup of that client when the call up to say "We've deleted some key data!", using the builtin database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to doing easy reporting on all clients at once, are minimal - how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but not theirs. I don't care how well everything is indexed and optimized there will be queries that determine that they have to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx
Scalability. Security. Our company uses 1 DB per customer approach as well. It also makes code a bit easier to maintain as well.
In regulated industries such as health care it may be a requirement of one database per customer, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have have many smaller databases to distribute over multiple machines you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster but majority probably don't).
I'd be interested in hearing a little more context from you on this because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world Google Bigtable etc. but they are solving a different set of problems, and lose some of the useful features from an RDBMS.
There are a couple of meanings of "database"
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.