Good Data Tier Dev & Design: What are the common bad practises in data tier development? - database

I am currently researching the best practises (at a reasonably high level) for application design for highly maintainable systems which result in minimal friction to change. By "Data Tier" I mean database design, Object Relation Mappers (ORMs) and general data access technologies.
From your experiences, what have you found to be the common mistakes & bad practises when it comes to data tier development and what measures have you taken / put in place / or can recommend to make the data tier a better place to be from a developer perspective?
An example answer may include: What is the most common causes of a slow, poorly scalable and extendible data tiers? + What measures can be taken (be it in design or refactoring) to cure this issue?
I am looking for war stories here and some real world advice that I can build into publicly available guidance documents and samples.

Magic.
I have used Hibernate, which automatically stores and fetches objects from a database. It also supports lazy loading, so that a related object is only retrieved from the database when you ask for it. This works in some magic way I don't understand.
This all works fine as long as it works, but when it breaks down it is impossible to track it down. I think we had a problem when we combined Hibernate with AOP, that somehow the object was not yet initialized by Hibernate when our code was executed. This problem was very hard to debug, because Hibernate works in such mysterious ways.

Object-Relational mapping is bad practice. By this, I mean that it tends to produce data schemas that can only loosely be described as "relational", and so they scale poorly and exhibit poor data integrity.
This is because properly relational schemas have been through the process of normalisation, whereas the results of O-R Mapping are normally object classes implemented as database tables. These will not normally have been normalised, but will instead have been designed for the immediate convenience of the OO developer.
Of course, in cases where the persistent data requirements are minimal, this is unimportant.
However, I once worked for a shipping company that had grown by taking over several other companies, and had outsourced development of an integrated operational system (to replace the various company-specific systems it had inherited) to a company using an OO methodology, with a data schema produced by O-R mapping. The performance characteristics of the system being developed were so poor, and the data schema so complex, that the shipping company dropped it after something like two years of development - before it even went live!
This was a direct consequence of the O-R mapping; the worst complexity in the schema (and the consequently poor performance) was caused by the existence of tables created solely as artifacts of the OO design process - they reflected screen layouts, not data relationships.

Related

NoSQL movement - justification for SQL RDBMS

In past year I've made numerous projects with NoSQL json based databases - the rich kinds (not the key/value stores) - such as CouchDB, MongoDB, RavenDB. I talk to fellow programmers often about my adoption, I notice though I'm always quick to add "of course SQL RDBMS system still have a place, its always whats best for particular project/task" - as a little disclaimer so not be seen as kool aid drinker, however its pretty shallow statement. Outside of legacy projects that already have an investment in RDBMS, or corporate mandates insisting on Oracle, I struggling to think of any future green field project I'd opt for a SQL database. Its CouchDB all the way for me as far as I can see with rich map/reduce, changes feed, replication support, RESTFUL api (sorry thats starting to sound like a plug)
I'd like to hear those that do "get" (beyond screencasts) NoSQL Json M/R databases such as CouchDb, what type of projects do you think you'll use MS-SQL, Oracle, Postgresql etc.. in the future ?
Thanks
One of the biggest strengths of SQL is that there is a standard way of modelling just about anything - for any given project it may not provide an optimal solution, but it does provide a reasonable one. This means that in a corporate environment you can decide that everything will be stored in Oracle to get the maintenance benefits of having a single system without the risk that it will be completely inappropriate for future projects.
This ability to handle different requirements without needing a lot of planning is also relevant at the start of projects where the design doesn't get signed off until six months into development - again, something that applies mostly in corporate development.
Using NoSQL properly generally requires better developers than you need for SQL development. One of the many SQL code samples available can be edited into a working system by someone who barely knows what they are doing. NoSQL does things like eliminating integrity checks for performance - a good developer produces well tested code that doesn't insert invalid records or understands why a few invalid records don't matter for a particular app. A bad developer thinks anything without error messages is good code. The average developer at a successful web startup can probably handle it. The average developer maintaining internal corporate apps probably can't.
On my own projects where I have complete control of the platform choice and know both the requirements and who will do the development, NoSQL is as good a complete solution as you suggest. Nothing really matters except the technical advantages - easier to code, easier to maintain, better performance, horizontal scaling.
In the corporate environment it's the other realities that dominate - incompetent developers, unknown/changing requirements, long application life cycles, the need to minimize the number of systems - NoSQL wasn't designed for that problem set.
Off-hand, I'd simply say that if the data is inherently relational, an RDBMS is the optimal choice. For instance, an order management system is a good fit for an RDBMS. If the data is not inherently relational (like Google's search index, for example), than NoSQL is a better choice. They both have their place.
My latest application is massively hierarchical/relational with objects containing sub-objects containing sub-objects, all related by natural keys. It's a perfect fit for an RDBMS, and would have been much trickier in a NoSQL DB.

'e-Commerce' scalable database model

I would like to understand database scalability so I've just heard a talk about Habits of Highly Scalable Web Applications
http://techportal.inviqa.com/2010/03/02/habits-of-highly-scalable-web-applications/
On it, the presenter mainly talk about relational database scalability.
I also have read something about MapReduce and Column oriented tables, big tables, hypertable etc... trying to understand which are the most up to date methods to scale web application data. But the second group, to me, is being hard to understand where it fits.
It serves as transactional, reliable data store? or not, its just for large access and processing and to handle fine graned operations we will ever need to rely on RDBMSs?
Could someone give a comprehensive landscape for those new technologies and how to use it?
Basically it's about using the right tool for the job. Relational databases have been around for decades, which means they are very good at solving the problems that haven't changed in that time - things like keeping track of sales for example. Although they have become the default data store for just about everything, they are not so good at handling the problems that didn't exist twenty years ago - particularly scalability and data without a clearly defined, unchanging schema.
NOSQL is a class of tools designed to solve the problems that are not perfectly suited to relational databases. Scalability is the best known, though unlikely to be a relevant to most developers. I think the other key use case that we don't see so much of yet is for small projects that don't need to worry about the data storage characteristics at all, and can just use the default - being able to skip database design, ORM and database maintenance is quite attractive.
For Ecommerce specifically you're probably better off using sql at least in part - You might use NOSQL for product details or a recommendation engine, but you want your sales data in an easily queried sql table.

Database independence

We are in the early stages of design of a big business application that will have multiple modules. One of the requirements is that the application should be database independent, it should support SQL Server, Oracle, MySQL and DB2.
From what I have read on the web, database independence is a very bad idea: it would result in a hard-to-maintain code, database design with the least-common features in all supported DBMSs, bad performance and bad scalability. My personal gut feeling is that the complexity of this feature, more than any other feature, could increase the development cost and time exponentially. The code will be dreadful.
But I cannot persuade anybody to ignore this feature. The problem is that most data on this issue are empirical data, lacking numbers to support the case. If anyone can share any numbers-supported data on the issue I would appreciate it.
One of the possible design options is to use Entity framework for the database tier with provider for each DBMS. My personal feeling is that writing SQL statements manually without any ORM would be a "must" since you have no control on the SQL generated by the entity framework, and a database-independent scenario will need some SQL tweaking based on the DBMS the code is targeting, and I think that third-party entity framework providers will have a significant amount of bugs that only appear in the complex scenarios that the application will have. I would like to hear from anyone who has had an experience with using entity framework for database-independent scenario before.
Also, one of the possibilities discussed by the team is to support one DBMS (SQL Server, for example) in the first iteration and then add support for other DBMSs in successive iterations. I think that since we will need a database design with the least common features, this development strategy is bad, since we need to know all the features of all databases before we start writing code for the first DBMS. I need to hear from you about this possibility, too.
Have you looked at Comparison of different SQL implementations ?
This is an interesting comparison, I believe it is reasonably current.
Designing a good relational data model for your application should be database agnostic, for the simple reason that all RDBMSs are designed to support the features of relational data models.
On the other hand, implementation of the model is normally influenced by the personal preferences of the people specifying the implementation. Everybody has their own slant on doing things, for instance you mention autoincremented identity in a comment above. These personal preferences for implementation are the hurdles that can limit portability.
Reading between the lines, the requirement for database independence has been handed down from above, with the instruction to make it so. It also seems likely that the application is intended for sale rather than in-house use. In context, the database preference of potential clients is unkown at this stage.
Given such requirements, then the practical questions include:
who will champion each specific database for design and development ? This is important, inasmuch as the personal preferences for implementation of each of these people need to be reconciled to achieve a database-neutral solution. If a specific database has no champion, chances are that implementing the application on this database will be poorly done, if at all.
who has the depth of database experience to act as moderator for the champions ? This person will have to make some hard decisions at times, but horsetrading is part of the fun.
will the programming team be productive without all of their personal favourite features ? Stored procedures, triggers etc. are the least transportable features between RDBMs.
The specification of the application itself will also need to include a clear distinction between database-agnostic and database specific design elements/chapters/modules/whatever. Amongst other things, this allows implementation with one DBMS first, with a defined effort required to implement for each subsequent DBMS.
Database-agnostic parts should include all of the DML, or ORM if you use one.
Database-specific parts should be more-or-less limited to installation and drivers.
Believe it or not, vanilla-flavoured sql is still a very powerful programming language, and personally I find it unlikely that you cannot create a performant application without database-specific features, if you wish to.
In summary, designing database-agnostic applications is an extension of a simple precept:
Encapsulate what varies
I work with Hibernate which gives me the benefits of the ORM plus the database independence. Database specific features are out of the question and this usually improves my design. Everything (domain model, business logic and data access methods) are testable so development is not painful.
مرحبا , Muhammed!
Database independence is neither "good" nor "bad". It is a design decision; it is a trade-off.
Let's talk about the choices:
It would result in a hard to maintain code
This is the choice of your programmers. If you make your code database-independent, then you should use a layer between your code and the database. The best kind of layer is one that someone else has written.
...Database design with the least common features in all supported DBMSs
This is, by definition, true. Luckily, the common features in all supported databases are fairly broad; they should all implement the SQL-99 standard.
...bad performance and bad scalability
This should not be true. The layer should add minimal cost to the database.
...this is the most feature ever that could increase the development cost and time exponentially with complexity. The code will be dreadful.
Again, I recommend that you use a layer between your code and the database.
You didn't specify which language or platform you're writing for. Luckily, many languages have already abstracted out databases:
Java has JDBC drivers
Python has the Python Database API
.NET has ADO.NET
Good luck.
Database independence is an overrated application feature. In reality, it is very rare for a large business application to be moved onto a new database platform after it's built and deployed. You can also miss out on DBMS specific features and optimisations.
That said, if you really want to include database independence, you might be best to write all your database access code against interfaces or abstract classes, like those used in the .NET System.Data.Common namespace (DbConnection, DbCommand, etc.) or use an O/RM library that supports multiple databases like NHibernate.

With Cloud Computing increasingly getting popular, will Relational DBs suffer death?

While computer programming evangelists predicting the future of Cloud Computing to be very bright, is there a chance for relational databases to be on their way out?
What are the DBs that are more suitable for Cloud Computing?
Here's a good article that may answer some of your questions. It features a good comparison between RDBMS systems and the ones usually used for cloud storage infrastructure:
http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
The relational database model has a firm mathematical basis in relational algebra. This makes it easy to reason about, to extend, and to use properly (in theory). Even if database access patterns change significantly as a result of these new APIs and uses, it's likely that a relational database will form the underlying implementation for this reason.
No, RDBMSs will always have a place because of their functionality. Not just on their own, but also as backbones to other systems (like OODBMSs).
Relational databases are still relevant, both for localized storage (such as application-specific storage) and for server storage.
The cloud computing platforms that I've seen each have a relational database offering. So, I don't see cloud computing really changing the picture in reference to database types being used.
However, something will eventually replace the databases that we're all used to. The question is whether that will be a higher-level version of RDBs or something different. Another aspect of that question is how long will it take for the current crop of RDBs to fade out? (I don't have an answer for either.)
Clouds go poof still these days, so I don't think so anytime soon.
I don't think that cloud computing will kill RDBMSs. Something else might though.
First, what type of storage engine a given application uses does not (or should not) depend on where it is running (the cloud or a specific server), but rather on how it needs to store the data.
Second, as far as I can tell the only reason people think RDBMSs are on their way out is because they don't scale as well as non-relational DBMSs (such as document-oriented DBMSs like CouchDB) which can more easily be distributed into the cloud. However, there is no reason that RDBMSs cannot become more cloud-friendly in the future. As an early example, look at Drizzle:
The Drizzle project is building a database optimized for Cloud and Net applications. It is being designed for massive concurrency on modern multi-cpu/core architecture.
So no, I don't think that cloud computing will kill RDBMSs. They will just be forced to adapt. What might kill them, however, is if an existing alternative, or a new one, becomes as robust and easy to use as RDBMSs. What I mean is a solution that has both completely solid software (betas not allowed) and is easy for programmers to switch to. They give out degrees to people who understand RDBMSs. Because of all the assisting software (such as ORMs like ActiveRecord, SQLAlchemy, and whatever the .NET folk use I'm assuming), using RDBMSs has become easy even for people who don't know what the first normal form is. So I think that until there is a way for people to use (for instance) a DODBMS just as easily, RDBMSs will continue to dominate. I'm also not saying that is necessarily bad. Again, which DBMS you use should depend on your data, not what people say is cool and better.
A quote from the article :
"The inherent constraints of a relational database ensure that data at the lowest level have integrity. Data that violate integrity constraints cannot physically be entered into the database. These constraints don't exist in a key/value database, so the responsibility for ensuring data integrity falls entirely to the application. But application code often carries bugs. Bugs in a properly designed relational database usually don't lead to data integrity issues; bugs in a key/value database, however, quite easily lead to data integrity issues."
What this means to me is that RDBMS's are doomed, and hotshot new technologies are facing a great and brilliant future, to the same extent that users aren't anywhere near interested in the correctness of their data.
IMHO.
There's nothing wrong with relational databases for applications that need to query more structured data (e.g., "How many people bought product XYZ, on this date, paid more than $100, but less than $150?"). There are potentially significant architectural issues that will need to be addressed as these systems scale and grow. Once your DB outgrows the one machine you started on and/or traffic/requests begin to overload available resources, then (if you still want to keep your relational database) you have to start adding layers. Thankfully today, there are many more options available then in previous years... including caching, map and reduce, and other functionality - but these add-on layers do add complexity and maintenance overhead. In one sense I'd consider these engineered "band-aids" which will most likely solve the scalability and distribution problems with a relational DB today, but longer term? Who knows. I also see these popular layers today - all of which are basically trying to emulate functionality already available in object DBs, giving developers a "virtual object DB" layer that they can use with their object languages to do things faster and more efficiently, and get past the growth and performance obstacles. So I guess my overall opinion is, relational DBs became the defacto DB probably mostly due to how (relatively) easy it was to query a database, and get results back to the one client/app using it. As volumes have grown though, and application complexity is exponentially greater today, I think more developers will decide to bite the bullet, learn the syntax for object DBs (which is actually about as standardized today as relational DBs), and just skip all the middleware and layers that only emulate functionality that one could get natively in an OODBMS. I've seen OODBs that simply get installed on any number of servers, and automatically distributing data as needed, and giving the developer a single view of any size federation of databases... Seems to me the best solution as systems become more distributed, to get a DB that can has native distributed architecture. Anyway, just a thought.

What are the pros and cons of object databases?

There is a lot of information out there on object-relational mappers and how to best avoid impedance mismatch, all of which seem to be moot points if one were to use an object database. My question is why isn't this used more frequently? Is it because of performance reasons or because object databases cause your data to become proprietary to your application or is it due to something else?
Familiarity. The administrators of databases know relational concepts; object ones, not so much.
Performance. Relational databases have been proven to scale far better.
Maturity. SQL is a powerful, long-developed language.
Vendor support. You can pick between many more first-party (SQL servers) and third-party (administrative interfaces, mappings and other kinds of integration) tools than is the case with OODBMSs.
Naturally, the object-oriented model is more familiar to the developer, and, as you point out, would spare one of ORM. But thus far, the relational model has proven to be the more workable option.
See also the recent question, Object Orientated vs Relational Databases.
I've been using db4o which is an OODB and it solves most of the cons listed:
Familiarity - Programmers know their language better then SQL (see Native queries)
Performance - this one is highly subjective but you can take a look at PolePosition
Vendor support and maturity - can change over time
Cannot be used by programs that don't also use the same framework - There are OODB standards and you can use different frameworks
Versioning is probably a bit of a bitch - Versioning is actually easier!
The pros I'm interested in are:
Native queries - Db4o lets you write queries in your static typed language so you don't have to worry about mistyping a string and finding data missing at runtime,
Ease of use - Defining buissiness logic in the domain layer, persistence layer (mapping) and finally the SQL database is certainly violation of DRY. With OODB you define your domain where it belongs.
I agree - OODB have a long way to go but they are going. And there are domain problems out there that are better solved by OODB,
One objection to object databases is that it creates a tight coupling between the data and your code. For certain apps this may be OK, but not for others. One nice thing that a relational database gives you is the possibility to put many views on your data.
Ted Neward explains this and a lot more about OODBMSs a lot better than this.
It has nothing to do with performance. That is to say, basically all applications would perform better with an OODB. But that would also put lots of DBA's out of work/having to learn a new technology. Even more people would be out of work correcting errors in the data. That's unlikely to make OODBs popular with established companies. Gavin seems to be totally clueless, a better link would be Kirk
Cons:
Cannot be used by programs that
don't also use the same framework
for accessing the data store, making
it more difficult to use across the
enterprise.
Less resources available online for
non SQL-based database
No compatibility across database
types (can't swap to a different db
provider without changing all the
code)
Versioning is probably a bit of a
bitch. I'd guess adding a new
property to an object isn't quite as
easy as adding a new column to a
table.
Sören
All of the reasons you stated are valid, but I see the problem with OODBMS is the logical data model. The object-model (or rather the network model of the 70s) is not as simple as the relational one, and is therefore inferior.
jodonnel, i dont' see how use of object databases couples application code to the data. You can still abstract your application from the OODB through using a Repository pattern and replace with an ORM backed SQL database if you design things properly.
For an OO application, an OO database will provide a more natural fit for persisting objects.
What's probably true is that you tie your data to your domain model, but then that's the crux!
Wouldn't it be good to have a single way of looking at both data, business rules and processes using a domain centric view?
So, a big pro is that an OODB matches how most modern, enterprise level object orientated software applications are designed, there is no extra effort to design a data layer using a different (relational) design. Cheaper to build and maintain, and in many cases general higher performance.
Cons, just general lack of maturity and adoption i reckon...

Resources