How do you write your applications to be database independent? - database

My boss asks me to write only ANSI SQL to make it database independent. But I learned that it is not that easy as no database fully ANSI SQL compatible. SQL code can rarely be ported between database systems without modifications.
I saw people do different way to make their program database independent. For example:
Externalize SQL statements to resource files.
Write many providers class to support different database.
Write only simple SQL, and keep away from advance functions/joins.
Do you always write your code "any database ready"? Or do it only if needed?
If yes, how do you achieve it?

You could use one of the many Object/Relational Mapper tools, like Hibernate/NHibernate, LLBLGen, and so forth. That can get you a long way to database portability. No matter what you do, you need to have some abstraction layer between your model and the rest of your code. This doesn't mean you need some sort of dependency injection infrastructure, but good OO design can get you pretty far. Also, sticking with plain SQL and thinking that will get you portability is rather naive. That would be true if your application was trivial and only used very trivial queries.
As for always writing an application to be "any database ready," I usually use some sort of abstraction layer so it is not hard to move from one database system to another. However, in many circumstances, this is not required, you are developing for the Oracle platform or SQL Server or MySQL whatever so you shouldn't sacrifice the benefits of your chosen RDBMS just for the possibility of an entirely seamless transition. Nevertheless, if you build a good abstraction layer, even targeting a specific RDBMS won't necessarily be too difficult to migrate to a different RDBMS.

To decouple the database engine from your application, use a database abstraction layer (also data access layer, or DAL). You didn't mention what language you use, but there are good database abstraction libraries for all the major languages.
However, by avoiding database-specific optimizations you will be missing out on the advantages of your particular brand. I usually abstract what's possible and use what's available. Changing database engines is a major decision and doesn't happen often, and it's best to use the tools you have available to the max.

Tell your boss to mind his own business. No, of course one can't say such things to one's boss, but stay tuned.
What's interesting is what business value is supposed to be supported by this requirement. One obvious candidate seems to be that the database code should be ready for working on other database engines than the current. If that's the case then that's what should be stated in the requirement.
From there it's up to you as an engineer to figure out the different ways to achieve that. One might be writing ANSI SQL. One might be using a database abstraction layer.
Further it's your responsibility to inform your boss what the costs of the different alternatives are (in terms of performance, speed of development, etcetera).
"Write ANSI SQL"... gah!

Just for the record. There is a similar question here on Stackoverflow:
Database design for database-agnostic applications

Being 100% compliant to ANSI SQL is a difficult goal to meet, and yet it doesn't guarantee portability anyway. So it's an artificial goal.
Presumably your boss is asking for this in order to make it easy and quick to switch database brands for some hypothetical purpose in the future (which actually may never come). But he's trading that future efficiency for a greater amount of work now, since it's harder to make the code database-neutral.*
So if you can phrase the problem in terms of the goals your manager should be focusing on, like finishing the current project phase on time and on budget, it may be more effective than just telling him it's too difficult to make the code database-neutral.
There is a scenario when you need to make truly database-neutral code, that is when you are developing a shrink-wrap application that is required to support multiple brands of database.
Anyway, even if you currently support only one brand, there are certainly cases where you have a choice to code some SQL using proprietary features, but there also exists a more portable way to achieve the same result. You can treat these as "low-hanging fruit" cases, and you can make it easier to port the code in the future if the need arises. But don't limit yourself either, use proprietary solutions if they give good value. Perhaps add a note in the comments that this deserves review if/when you need to make a port.
* I prefer the word "neutral" instead of "agnostic" when talking about platform-independence. It avoids the religious overtone. :-)

Related

ORM vs traditional database query, which are their fields?

ORM seems to be a fast-growing model, with both pros and cons in their side. From Ultra-Fast ASP.NET of Richard Kiessig (http://www.amazon.com/Ultra-Fast-ASP-NET-Build-Ultra-Scalable-Server/dp/1430223839/ref=pd_bxgy_b_text_b):
"I love them because they allow me to develop small, proof-of-concept sites extremely quickly. I can side step much of the SQL and related complexity that I would otherwise need and focus on the objects, business logic and presentation. However, at the same time, I also don't care for them because, unfortunately, their performance and scalability is usually very poor, even when they're integrated with a comprehensive caching system (the reason for that becomes clear when you realize that when properly configured, SQL Server itself is really just a big data cache"
My questions are:
What is your comment about Richard's idea. Do you agree with him or not? If not, please tell why.
What is the best suitable fields for ORM and traditional database query? in other words, where you should use ORM and where you should use traditional database query :), which kind/size... of applications you should undoubtedly choose ORM/traditional database query
Thanks in advance
I can't agree to the common complain about ORMs that they perform bad. I've seen many plain-SQL applications until now. While it is theoretically possible to write optimized SQL, in reality, they ruin all the performance gain by writing not optimized business logic.
When using plain SQL, the business logic gets highly coupled to the db model and database operations and optimizations are up to the business logic. Because there is no oo model, you can't pass around whole object structures. I've seen many applications which pass around primary keys and retrieve the data from the database on each layer again and again. I've seen applications which access the database in loops. And so on. The problem is: because the business logic is already hardly maintainable, there is no space for any more optimizations. Often when you try to reuse at least some of your code, you accept that it is not optimized for each case. The performance gets bad by design.
An ORM usually doesn't require the business logic to care too much about data access. Some optimizations are implemented in the ORM. There are caches and the ability for batches. This automatic (and runtime-dynamic) optimizations are not perfect, but they decouple the business logic from it. For instance, if a piece of data is conditionally used, it loads it using lazy loading on request (exactly once). You don't need anything to do to make this happen.
On the other hand, ORM's have a steep learning curve. I wouldn't use an ORM for trivial applications, unless the ORM is already in use by the same team.
Another disadvantage of the ORM is (actually not of the ORM itself but of the fact that you'll work with a relational database an and object model), that the team needs to be strong in both worlds, the relational as well as the oo.
Conclusion:
ORMs are powerful for business-logic centric applications with data structures that are complex enough that having an OO model will advantageous.
ORMs have usually a (somehow) steep learning curve. For small applications, it could get too expensive.
Applications based on simple data structures, having not much logic to manage it, are most probably easier and straight forward to be written in plain sql.
Teams with a high level of database knowledge and not much experience in oo technologies will most probably be more efficient by using plain sql. (Of course, depending on the applications they write it could be recommendable for the team to switch the focus)
Teams with a high level of oo knowledge and only basic database experience are most probably more efficient by using an ORM. (same here, depending on the applications they write it could be recommendable for the team to switch the focus)
ORM is pretty old, at least in the Java world.
Major problems with ORM:
Object-Oriented model and Relational model are quite different.
SQL is a high level language to access data based on relational algebra, different from any OO language like C#, Java or Visual Basic.Net. Mixing those can you the worst of two worlds, instead of the best
For more information search the web on things like 'Object-relational impedance mismatch'
Either case, a good ORM framework saves you on quite some boiler-plate code. But you still need to have knowlegde of SQL, how to setup a good SQL databasemodel. Start with creating a good databasemodel using SQL, then base your OO model on that (not the other way around)
However, the above only holds if you really need to use a SQL database. I recommend looking into NoSQL movement as well. There's stuff like Cassandra, Couch-db. While google'ing for .net solutions I found this stackoverflow question: https://stackoverflow.com/questions/1777103/what-nosql-solutions-are-out-there-for-net
I'm the author of the book with the text quoted in the question.
Let me emphatically add that I am not arguing against using business objects or object oriented programming.
One issue I have with conventional ORM -- for example, LINQ to SQL or Entity Framework -- is that it often leads to developers making DB calls when they don't even realize that they're doing so. This, in turn, is a performance and scalability killer.
I review lots of websites for performance issues, and have found that DB chattiness is one of the most common causes of serious problems. Unfortunately, ORM tends to encourage chattiness, in spades.
The other complaints I have about ORM include:
No support for command batching
No support for multiple result sets
No support for table valued parameters
No support for native async calls (making them from a background thread doesn't count)
Support for SqlDependency and SqlCacheDependency is klunky if/when it works at all
I have no objection to using ORM tactically, to address specific business issues. But I do object to using it haphazardly, to the point where developers do things like make the exact same DB call dozens of time on the same page, or issue hugely expensive queries without considering caching and change notifications, or totally neglect async operations when scalability is a concern.
This site uses Linq-to-SQL I believe, and it's 'fairly' high traffic... I think that the time you save from writing the boiler plate code to access/insert/update simple items is invaluable, but there is always the option to drop down to calling a SPROC if you have something more complex, where you know you can write some screaming fast SQL directly.
I don't think that these things have to be mutually exclusive - use the advantages of both, and if there are sections of your application that start to slow down, then you can optimise as you need to.
ORM is far older than both Java and .NET. The first one I knew about was TopLink for Smalltalk. It's an idea as old as persistent objects.
Every "CRUD on the web" framework like Ruby on Rails, Grails, Django, etc. uses ORM for persistence because they all presume that you are starting with a clean sheet object model: no legacy schema to bother with. You start with the objects to model your problem and generate the persistence from it.
It often works the other way with legacy systems: the schema is long-lived, and you may or may not have objects.
It's astonishing how quickly you can get a prototype up and running with "CRUD on the web" frameworks, but I don't see them being used to develop enterprise apps in large corporations. Maybe that's a Fortune 500 prejudice.
Database admins that I know tell me they don't like the SQL that ORMs generate because it's often inefficient. They all wish for a way to hand-tune it.
I agree with most points already made here.
ORM's are not new in .NET, LLBLGen has been around for a long time, I've been using them for >5 years now in .NET.
I've seen very bad performing code written without ORMs (in-efficient SQL queries, bad indexes, nested database calls - ouch!) and bad code written with ORMs - I'm sure I've contributed to some of the bad code too :)
What I would add is that an ORM is generally a powerful and productivity-enhancing tool that allows you to stop worrying about plumbing db code for most of your application and concentrate on the application itself. When you start trying to write complex code (for example reporting pages or complex UI's) you need to understand what is happening underneath the hood - ignorance can be very costly. But, used properly, they are immensely powerful, and IMO won't have a detrimental effect on your apps performance. I for one wouldn't be happy on a project that didn't use an ORM.
Programming is about writing software for business use. The more we can focus on business logic and presentation and less with technicalities that only matter at certain points in time (when software goes down, when software needs upgrading, etc), the better.
Recently I read about talks of scalability from a Reddit founder, from here, and one line of him that caught my attention was this:
"Having to deal with the complexities
of relational databases (relations,
joins, constraints) is a thing of the
past."
From what I have watched, maintaining a complex database schema, when it comes to scalability, becomes a major pain as the site grows (you add a field, you reassign constraints, re-map foreign keys...etc). It was not entirely clear to me as to why is that. They're not using a NOSQL database though, they're in Postgres.
Add to that, here comes ORM, another layer of abstraction. It simplifies code writing, but almost often at a performance penalty. For me, a simple database abstraction library will do, much like lightweight AR libs out there together with database-specific "plain text" queries. I can't show you any benchmark but with the ORMs I have seen, most of them say that "ORM can often be slow".
Richard covers both sides of the coin, so I agree with him.
As for the fields, I really don't quite get the context of the "fields" you are asking about.
As others have said, you can write underperforming ORM code, and you can also write underperforming SQL.
Using ORM doesn't excuse you from knowing your SQL, and understanding how a query fits together. If you can optimize a SQL query, you can usually optimize an ORM query. For example, hibernate's criteria and HQL queries let you control which associations are joined to improve performance and avoid additional select statements. Knowing how to create an index to improve your most common query can make or break your application performance.
What ORM buys you is uniform, maintainable database access. They provide an extra layer of verification to ensure that your OO code matches up as closely as possible with your database access, and prevent you from making certain classes of stupid mistake, like writing code that's vulnerable to SQL injection. Of course, you can parameterize your own queries, but ORM buys you that advantage without having to think about it.
Never got anything but pain and frustration from ORM packages. If I'd write my SQL the way they autogen it - yeah I'd claim to be fast while my code would be slow :-) Have you ever seen SQL generated by an ORM ? Barely has PK-s, uses FK-s only for misguided interpretation of "inheritance" and if it wants to do paging it dumps the whole recordset on you and then discards 90% of it :-))) Then it locks everything in sight since it has to take in a load of records like it went back to 50 yr old IBM's batch processing.
For a while I thought that the biggest problem with ORM was splintering (not going to have a standard in 50 yrs - every year different API, pardon "model" :-) and ideologizing (everyone selling you a big philosophy - always better than everyone else's of course :-) Then I realized that it was really the total amateurism that's the root cause of the mess and everything else is just the consequence.
Then it all started to make sense. ORM was never meant to be performant or reliable - that wasn't even on the list :-) It was academic, "conceptual" toy from the day one, the consolation prize for professors pissed off that all their "relational" research papers in Prolog went down the drain when IBM and Oracle started selling that terrible SQL thing and making a buck :-)
The closest I came to trusting one was LINQ but only because it's possible and quite easy to kick out all "tracking" and use is just as deserialization layer for normal SQL code. Then I read how the object that's managing connection can develop spontaneous failures that sounded like premature GC while it still had some dangling stuff around. No way I was going to risk my neck with it after that - nope, not my head :-)
So, let me make a list:
Totally sloppy code - not going to suffer bugs and poor perf
Not going to take deadlocks from ORM's 10-100 times longer "transactions"
Drastic reduction of capabilities - SQL has huge expressive power these days
Tying you up into fringe and sloppy API (every ORM aims to hijack your codebase)
SQL queries are highly portable and SQL knowledge is totally portable
I still have to know SQL just to clean up ORM's mess anyway
For "proof-of-concept" I can just serialize to binary or XML files
not much slower, zero bug libraries and one XPath can select better anyway
I've actually done heavy traffic web sites all from XML files
if I actually need real graph then I have no use for DB - nothing real to query
I can serialize a blob and dump into SQL in like 3 lines of code
If someone claims that he does it all from DB to UI - keep your codebase locked :-)
and backup your payroll DB - you'll thank me latter :-)))
NoSQL bases are more honest than ORM - "we specialize in persistence"
and have better code quality - not surprised at all
That would be the short list :-) BTW, modern SQL engines these days do trees and spatial indexing, not to mention paging without a single record wasted. ORM-s are actually "solving" problems of 10yrs ago and promoting amateurism. To that extent NoSQL, also known as document

Database independence

We are in the early stages of design of a big business application that will have multiple modules. One of the requirements is that the application should be database independent, it should support SQL Server, Oracle, MySQL and DB2.
From what I have read on the web, database independence is a very bad idea: it would result in a hard-to-maintain code, database design with the least-common features in all supported DBMSs, bad performance and bad scalability. My personal gut feeling is that the complexity of this feature, more than any other feature, could increase the development cost and time exponentially. The code will be dreadful.
But I cannot persuade anybody to ignore this feature. The problem is that most data on this issue are empirical data, lacking numbers to support the case. If anyone can share any numbers-supported data on the issue I would appreciate it.
One of the possible design options is to use Entity framework for the database tier with provider for each DBMS. My personal feeling is that writing SQL statements manually without any ORM would be a "must" since you have no control on the SQL generated by the entity framework, and a database-independent scenario will need some SQL tweaking based on the DBMS the code is targeting, and I think that third-party entity framework providers will have a significant amount of bugs that only appear in the complex scenarios that the application will have. I would like to hear from anyone who has had an experience with using entity framework for database-independent scenario before.
Also, one of the possibilities discussed by the team is to support one DBMS (SQL Server, for example) in the first iteration and then add support for other DBMSs in successive iterations. I think that since we will need a database design with the least common features, this development strategy is bad, since we need to know all the features of all databases before we start writing code for the first DBMS. I need to hear from you about this possibility, too.
Have you looked at Comparison of different SQL implementations ?
This is an interesting comparison, I believe it is reasonably current.
Designing a good relational data model for your application should be database agnostic, for the simple reason that all RDBMSs are designed to support the features of relational data models.
On the other hand, implementation of the model is normally influenced by the personal preferences of the people specifying the implementation. Everybody has their own slant on doing things, for instance you mention autoincremented identity in a comment above. These personal preferences for implementation are the hurdles that can limit portability.
Reading between the lines, the requirement for database independence has been handed down from above, with the instruction to make it so. It also seems likely that the application is intended for sale rather than in-house use. In context, the database preference of potential clients is unkown at this stage.
Given such requirements, then the practical questions include:
who will champion each specific database for design and development ? This is important, inasmuch as the personal preferences for implementation of each of these people need to be reconciled to achieve a database-neutral solution. If a specific database has no champion, chances are that implementing the application on this database will be poorly done, if at all.
who has the depth of database experience to act as moderator for the champions ? This person will have to make some hard decisions at times, but horsetrading is part of the fun.
will the programming team be productive without all of their personal favourite features ? Stored procedures, triggers etc. are the least transportable features between RDBMs.
The specification of the application itself will also need to include a clear distinction between database-agnostic and database specific design elements/chapters/modules/whatever. Amongst other things, this allows implementation with one DBMS first, with a defined effort required to implement for each subsequent DBMS.
Database-agnostic parts should include all of the DML, or ORM if you use one.
Database-specific parts should be more-or-less limited to installation and drivers.
Believe it or not, vanilla-flavoured sql is still a very powerful programming language, and personally I find it unlikely that you cannot create a performant application without database-specific features, if you wish to.
In summary, designing database-agnostic applications is an extension of a simple precept:
Encapsulate what varies
I work with Hibernate which gives me the benefits of the ORM plus the database independence. Database specific features are out of the question and this usually improves my design. Everything (domain model, business logic and data access methods) are testable so development is not painful.
مرحبا , Muhammed!
Database independence is neither "good" nor "bad". It is a design decision; it is a trade-off.
Let's talk about the choices:
It would result in a hard to maintain code
This is the choice of your programmers. If you make your code database-independent, then you should use a layer between your code and the database. The best kind of layer is one that someone else has written.
...Database design with the least common features in all supported DBMSs
This is, by definition, true. Luckily, the common features in all supported databases are fairly broad; they should all implement the SQL-99 standard.
...bad performance and bad scalability
This should not be true. The layer should add minimal cost to the database.
...this is the most feature ever that could increase the development cost and time exponentially with complexity. The code will be dreadful.
Again, I recommend that you use a layer between your code and the database.
You didn't specify which language or platform you're writing for. Luckily, many languages have already abstracted out databases:
Java has JDBC drivers
Python has the Python Database API
.NET has ADO.NET
Good luck.
Database independence is an overrated application feature. In reality, it is very rare for a large business application to be moved onto a new database platform after it's built and deployed. You can also miss out on DBMS specific features and optimisations.
That said, if you really want to include database independence, you might be best to write all your database access code against interfaces or abstract classes, like those used in the .NET System.Data.Common namespace (DbConnection, DbCommand, etc.) or use an O/RM library that supports multiple databases like NHibernate.

Why can application developers do datasebase stuff but database developers try to stay clear of application stuff?

In my experience, this has been a contentious issue between "backend" (database developer) and "frontend" guys (application developer, client and server side).
There have been many heated pub discussions on this subject.
I just want to know is it just people have different mindsets, or lazy to learn more and feel comfortable in what they know, or something else.
I might re-phrase the question: why do (some) application developers think they can do "database stuff" without actually bothering to understand it properly? Whereas database developers do not (in general) assume they can write a good application without some training and experience!
It is about levels of abstraction. A database is the lowest level of abstraction in a typical business application (software-wise). It is much more likely that a developer working on an outer layer of the abstraction would have knowledge of an inner layer than a developer in an inner layer would know about the outer layer.
This is because inner layers of abstraction best perform when they are ignorant of the outer layers who depend on them.
So a designer in the presentation layer of a website may know a bit about the server-side code they depend on because they interact with it. But the developer working on the server does not need to know anything about design at all.
I would say it's on a need to know basis. Applications developers often need to know how to connect to databases, add records, delete records etc... This is taken further with new technologies such as LINQ where developers can write database queries within their actual code.
Database developers on the other hand only really need to know how to write database queries as that is their job and probably won't need to worry about the code at application level.
Because programmers very often must understand and interact with databases to do their job, but DBAs very often don't need to do any programming (outside of the DBMS) to do their jobs.
I believe it stems from the fact that programming in sql looks easy, and to get started you have to have a small amount of knowledge (Really for a programmer to learn SELECT * FROM Table is pretty easy). Application programming is not the same way. It becomes very complex in a small amount of time, and that discourages a lot of people. Now I am not saying that database people are any less intelligent it is just what they do looks easier than building applications.
If you develop applications, then the chances are, that sooner or later, you'll have to connect an app to a back-end.
The opposite is not as true.
I think it stems from necessity. If you consider the roles of each person, a programmer needs to to database related stuff far more than database workers need to do programming tasks.
From my experience, having developed both "databases" and "applications" (following your nomenclature...), I guess there's a big difference in state management.
Properly designed databases are always in a "clean" state, and every transaction keeps this consistency. So when developing a database, you have to very clearly specify your data abstractions into tables and which updates are legal and so on.
I've found that most application developers (myself included :)) do a very sloppy job in keeping this consistent state in the application. Any non-trivial interface has many more possible states to manage than a modest database, and it's not as easy to make sure it's always in a clean state. It's also harder to analyze every possible sequence of steps that users will perform.
From my experience, the application developers don't do all the database stuff. Consider all the administration that is related to the databse, backups, replication, etc.
A typical DBA (at least on most of the projects I've been involved to) takes care about everything that is related to project databases - all administration, cooperates with application developers on performance tuning, gives advices about SQL used by the app, does some of the stored procs coding, creates (or, at least reviews and consults) physical DB designs, etc.
So, aren't the database guys "lazy", or "fine with what they already know" just from an application developer's perspective? I'm an app developer myself and there is a whole lot of things that I just don't know about the DBs we're using on our projects.
Part of my education ensured I got a decent understanding of how Databases work. I went into the field expecting to do database work, and a lot of it. I'm a web app guy; it comes with the territory I guess.
My two jobs as a developer have been at two shops that would best be described as tiny (2 people myself included, and then just me) and tiny (3 developers, briefly having a fourth). I have not observed an immediate business need for, nor worked anywhere that had the resources to employ a dedicated DB guy. I can envision some scenarios where that would change (including a new job :P).
As to the rest, I agree that abstraction is also a factor and as developers we're way up on top/outside looking in. I can't imagine doing web app development without DB skills, and I consider Sql/DB Management to be both an important tool and an area I need to stay sharp in.
I'll add that I treat the database side as its own field. There's skills that translate between the two, but there's a lot of specialized knowledge I need to acquire to get better at it, and that being a good programmer doesn't necessarily mean I'm doing a good job on the back end either (fortunately, I'm not a good programmer ;) ). Also, I'm pretty sure that's what she said.
2 reasons:
DB Vendors facilitate bad SQL, and
SQL is hidden from view while
application UI is front and center.
Most naive developers think SQL is a procedural language and write it as such because vendors ensure that the tools exist so that they can do so. DBAs know that good SQL is set-oriented and has optimization principles that are totally different from those involved in application programming.
The visibility aspect makes it so the application developers can write bad SQL against a database and get it to perform in a marginal way, and no one ever sees quite how bad it is. When a DBA writes an application, there are immediate critiques on its appearance and behavior because it's directly visible to the end user.
Good question. Actually why developers do Database Stuff because where no dedicated Database guys then developers have to do that. But a company have Database Guys also have Development guys.
:) what is your idea ?

Databases versus plain text

When dealing with small projects, what do you feel is the break even point for storing data in simple text files, hash tables, etc., versus using a real database? For small projects with simple data management requirements, a real database is unnecessary complexity and violates YAGNI. However, at some point the complexity of a database is obviously worth it. What are some signs that your problem is too complex for simple ad-hoc techniques and needs a real database?
Note: To people used to enterprise environments, this will probably sound like a weird question. However, my problem domain is bioinformatics. Most of my programming is prototypes, not production code. I'm primarily a domain expert and secondarily a programmer. Most of my code is algorithm-centric, not data management-centric. The purpose of this question is largely for me to figure out how much work I might save in the long run if I learn to use proper databases in my code instead of the more ad-hoc techniques I typically use.
1) Concurrency. Do you have multiple people accessing the same dataset? Then it's going to get pretty involved to broker all of the different readers and writers in a scalable fashion if you roll your own system.
2) Formatting and relationships: Is your data something that doesn't fit neatly into a table structure? Long nucleotide sequences and stuff like that? That's not really conveniently tabular data.
Another example: Nobody would consider implementing software like Photoshop to store PSDs in a relational format, because the data structures don't really lend themselves to that type of storage or query pattern.
3) ACID (sort of a corollary to #1): If Atomicity, Consistency, Integrity, and Durability are not challenges with a flat file, then go with a flat file.
For me, the line is crossed once I have to query my data in ways that involve more than a single relationship. Relating two flat data structures on disk is fairly simple, but once we get beyond that, a set-based language like SQL and formal database relationships actually reduce complexity.
I think at some point you'll miss the querying capabilities of a database, but you can consider some minimalistic database alternatives:
SQLite (Great, almost SQL-92 standard compliant)
shsql
SQL Server Compact
I would only write my own on-disk format under very special circumstances. Reusing someone else's code is nearly always faster.
For relational data, I would use SQLite. For key/value pairs, I would use BerkeleyDB (perhaps via KiokuDB). For simple objects, I would use JSON or YAML, but only if I only had a few.
With SQLite and BDB, "a real database" is literally two lines of code away. It is hard to beat that.
The problem with small projects is that they become bigger before we know it. And once they do , we start missing the sql capabilities.
Always design such that a db can be utilized later on if required without ripping apart half of the application.
It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.
If what you need is many appends (INSERTS?) and sequential/few access little/no concurrency, files are the way to go.
On the other hand, when your requirements for concurrency, non-sequential reading/writing, atomicity, atomic permissions, your data is relational by the nature etc., you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C/C++, and highly ubiquitous (if it isn't already included in your programming language -for example Python-, there is surely one available). It can be useful even on db files as big as 1GB, possible more.
If your requirements where bigger, there wouldn't even be a discussion, go for a full-blown RDBMS.
For the kind of applications you are developing in bioinformatics, you are often doing one-shot applications (often scripts that define a workflow of calculations) that answer a specific questions, and you are not likely to be reusing these applications after you answered your question.
Often, you should therefore avoid creating databases to store the results, as after all you are not going to use their features very much.
You will probably be querying some webservices, files, or databases, run some local algorithms on the data gathered from different sources, and produce some tabular or structured output format (xml, json, etc).
For that, I would suggest you to use workflow tools like Knime (or a commercial solution like Inforsense KDE, Accelrys's Pipeline pilot, or Snaplogic, as they allow you to query data in a variety of formats and locations (rdbms, flat files, webservices), run algorithms, and build powerful web apps that allow you to easily publish your workflows to your users and let them interact at specific points).
If your prototype "grows" and you have to build more functionality on top of the data your workflows output, and if the output of your prototype is not likely to change everyday, then it's a wise decision to store a subset of the results in a database. This allows you to plug in powerful reporting tools like BusinessObjects, Crystal reports, jasper reports or whatever reporting solution available out there and show data to your users in a better shape than a spreadsheet or a csv file.
Finally, some development frameworks will make your choices more obvious : if you build a web application using an MVC framework, it is likely that your data will reside in an RDBMS (but please, don't put genomic sequences in a table column :-)).
All in all, it's a case by case choice, depending on your needs for each particular application.
In software I can usually get away with storing values in a XML configuration file or in the registry, e.g. software options. Once I need to persist objects I move to a database because the upfront cost is not that bad compared to the long term effects that relations and reporting can offer.
For bioinformatics you may be interested on that: Blast on DB. The guy who is working on that is a friend of mine and has a work on fast similarity sequence search, he found out to make his own binary storage better than using databases at this point.
I don't know specific details about his solution but you probably can exchange one or two ideias mailing the guy, even sharing code.
Do you need/want SQL queries?
Are multiple people going to want to access the data?
Is your data relational?
If you answered no to those questions, you (probably) don't need a full on database.
First, I'd consider:
How large will the database initially be: # of tables, # of rows
How quickly will it grow?
Is the data frequently queried?
If I were to create a personal recipe app, for example, I know I might add 50 favorite recipes to start and add no more than 5 recipes a year. With that being said, I could easily get by without a database since the size of the data store will have minimal impact on queries.
That said, I would probably use a database for any application where data entry and queries occur (even a small personal recipe app). I don't think it adds a lot of overhead especially when your framework (e.g. Rails) allows you to keep your database dumb (primarily tables, indexes, and constraints). It alleviates the chance that I'll have to eventually port to a database if I decide to scale up.
If you know the format of your data, flat files, if faster/easier to develop with, will be fine. If you expect your record formats to change frequently during development then I'd suggest that ALTER TABLE is your friend. Flat files will also tend to be faster (if you care about speed) unless you expect to implement the equivalent of joins across many combinations of files.
The real benefit of using a RDBMS during development is the flexibility with which you can modify your data schema and the ease with which you can access your data via queries.
Good design will ensure that you keep your data access layer relatively isolated (because of separation of concerns) so it should be a fairly straightforward (if tedious) matter to rework to a database later should it be worthwhile. Or, of course, if you use a database to develop your structures you may subsequently take the app back to flat/indexed files once those structures are crystallized in order to gain performance.
Use whatever persistence technology you're most comfortable with, and scales sufficiently.
YAGNI at least means "Don't add a new technology to your personal stack unless you can't be productive with whatever is already there."
For many (most?) of us, our comfort zone for data persistence is SQL. For some, it might be XML. Just don't write your own until (see paragraph 2).
As someone also doing research in Bioinformatics, I would suggest NOT using a database for these kinds of prototype projects unless you are sure it needs it. If you are on the fence, go with the databaseless solution and stick with flat files. It is also important to note that traditionally Bioinformatics researchers have go the flat file route, which means there are well defined file formats for most types of data in the feild. If you decide to go with a database solution, it may hurt your compatibility with existing research projects.

Can you use an object-database in a production system?

Object-databases are used very seldomly, albeit they offer a way to live without SQL, which is, I think, a benefit of its own.
Yet, I have seen them about never in production systems. Is there something fundamentally wrong with object-databases? Can I use a object-database in a production system?
Edit: So, maybe I should confess that I love object-databases. I cannot really get my head around why they are not used a lot more often.
Sure you could, as long as it was stable. The problem is the relative lack of high quality Object Oriented DB systems, as well as the fact that most people don't even know what one is.
db4o is being used a lot by many Fortune 500 companies (especially for embedded applications), so I wouldn't say that OODBs are not used for real-world production systems
There are production systems written using the GemStone OODB. It's a distributed, persistent Smalltalk system.
Heard of Cache? Used by EpicSystems for their Enterprise Health Record(EHR) product. Plenty of production shops using it.
The real question here is whether you require tooling to support your database or not. By tooling I mean reporting, data migration, data mining, etc. Do you need to provide self service reports? Heck, even reports with a fast turnaround time that doesn't require deploying new code? (Providing report functionality in the application is a real drag.)
There are countless tools available to perform these operations against traditional RDBMs. Against OODBs? I'm not familiar with any major products. Though, admittedly, I'm not an OODB kinda guy.
If you don't need those tools, they go for it. Otherwise, stick to traditional RDBMs. With current ORM technology, the pain of mapping objects to records is much less than it used to be.
The problem I believe, is that SQL isn't inherently a bad thing. It is very good at performing set based operations. From what I've seen, object databases work well when working with individual objects, yet fail when trying to do set based operations. Also, people are very good at working with SQL databases. It's easy to find people to work with them. Object databases are another story.

Resources