How to present a database design? - database

I am doing a project in the university and it includes a MySQL database. I have a design for the database in terms of a list of tables and their respective fields.
In what form should I present this design? Just the list of tables and content? In an ERD? How do you present your designs?
To clarify - whatever you answer, I expect not only specification of how you present your design, but also which tools do you use the create the diagrams/list/tables etc.

ERD is the only way to go. As they say, a picture is worth a thousand words.
But don't try to put the whole database on one diagram. It will, in all but the most trivial cases, be overwhelming to your audience to try to digest the entire database design in one go. Instead, break the diagrams into subject areas depicting only the most relevant tables in each diagram. For example, a point-of-sale system might have separate diagrams for Inventory, Sales, Accounting, Customer Management, Security, Auditing, and Reporting. Some tables will show up in more than one subject area -- this is to be expected.
As far as tooling, nothing beats ErWin, but it is really expensive and only available for Windows. Visio is ubiquitous in a corporate environment, but is only available on Windows and is not exactly cheap either. Macs offer some really nice diagramming tools; most of them are not free.
Dia is a decent, free, and cross-platform diagramming tool. It is a bit quirky, though; and I have not had much success making the diagrams look as nice I want them to look.
For MySQL, I have played with fabFORCE dbDesigner and it is not bad, but I did find its support for multiple subject areas to be a bit lacking at the time -- perhaps they've improved it since. But it is free and works on Windows and Linux.
For the actual presentation, I create images from these diagramming tools and pull them into presentation software (PowerPoint, KeyNote, or OpenOffice Impress). These presentations can be exported to PDF and distributed to the audience; they won't need anything more than a PDF viewer to review the information later.

Let's look at this from your professor's perspective. If I were him/her:
I would require an ERD. Without it, I cannot see one of the most fundamental issues of a database design, how are the tables related.
I would also expect some basic use cases/ requirements. What problems are you trying to solve with this database design?
I would want to see what indexes are in place, especiall on the foreign key columns. I would want to see expected row counts in all tables to determine if indexes are even required.
I would want to see column data types to determine if they meet the requirements. I would want to see what columns accept NULL values, since that often can cause problems if you're not careful.
If I were using SQL Server, I would probably create a diagram in SSMS to display a somewhat basic ERD. Visio can be used as well. I might use Visio to create my use cases, or perhaps Microsoft Word.

mysql workbench will make you pretty graphics for presentation amongst other many sophisticated features.

Depends on the audience. ERD certainly isn't the only answer and may not be the best. You should choose a medium that your audience will understand.

Don't forget to discuss design aspects that can't fit to ERD:
1) how inheritance/aggregation relationships from your analytical model implemented in your db.
2) how you are going to support hierarchies of your objects in the rdb (if you have any)
3) list relationships that are in your analytical model but are not supported by the rdb design.
4) ETL process, track changes, track schema changes, security based on resource.
5) storage partitioning and maintenance aspects (one of the goal optimize backup time)
6) in prod test (test island data) and easy cloning db for test environment

Related

Bad real-world database schemas

Our masters thesis project is creating a database schema analyzer. As a foundation to this, we are working on quantifying bad database design.
Our supervisor has tasked us with analyzing a real world schema, of our choosing, such that we can identify some/several design issues. These issues are to be used as a starting point in the schema analyzer.
Finding a good schema is a bit difficult because we do not want a schema which is well designed in all aspects, but a schema that is more "rare to medium".
We have already scheduled the following schemas for analysis: wikimedia, moodle and drupal. Not sure in which category each fit. It is not necessary that the schema is open source.
The database engine used is not important, though we would like to focus on SQL server, Posgresql and Oracle.
For now literature will be deferred, as this task is supposed to give us real world examples which can be used in the thesis. i.e. "Design X is perceived by us as bad design, which our analyzer identifies and suggests improvements to", instead of coming up with contrived examples.
I will update this post when we have some kind of a tool ready.
Check the Dell-dvd-store, you can use it for free.
The Dell DVD Store is an open source
simulation of an online ecommerce site
with implementations in Microsoft SQL
Server, Oracle and MySQL along with
driver programs and web applications
Bill Karwin has written a great book about bad designs: SQL antipatterns
I'm working on a project including a geographical information system. And in my opinion these designs are often "medium" to "rare".
Here are some examples:
1) Geonames.org
You can find the data and the schema here: http://download.geonames.org/export/dump/ (scroll down to the bottom of the page for the schema, it's in plain text on the site !)
It'd be interesting how this DB design performs with such a HUGE amount of data!
2) OpenGeoDB
This one is very popular in german-speaking countries (Germany, Austria, Switzerland) because it's a database containing nearly every city/town/village in the german speaking region with zip-code, name, hierarchy and coordinates.
This one comes with a .sql schema and the table fields are in english, so this shouldn't be a problem.
http://fa-technik.adfc.de/code/opengeodb/
The interesting thing in both examples is how they managed the hierarchy of entities like Country -> State -> County -> City -> Village etc.
PS: Maybe you could judge my DB design too ;) DB Schema of a Role Based Access Control
vBulletin has a really bad database schema.
"we are working on quantifying bad database design."
It seems to me like you are developing a model, or process, or apparatus, that takes a relational schema as input and scores it for quality.
I invite you to ponder the following:
Can a physical schema be "bad" while the logical schema is nonetheless "extremely good" ? Do you intend to distinguish properly between "logical schema" and "physical schema" ? How do you dream to achieve that ?
How do you decide that a certain aspect of physical design is "bad" ? Take for example the absence of some index. If the relvar that that "supposedly desirable index" is to be on, is itself constrained to be a singleton, then what detrimental effects would the absence of that index cause for the system ? If there are no such detrimental effects, then what grounds are there for qualifying the absence of such an index as "bad" ?
How do you decide that a certain aspect of logical design is "bad" ? Choices in logical design are done as a consequence of what the actual requirements are. How can you make any judgment whatsoever about a logical design, without a formalized and machine-readable way to specify what the actual requirements are ?
Wow - you have an ambitious project ahead of you. To determine what is a good database design may be impossible, except for broadly understood principles and guidelines.
Here are a few ideas that come to mind:
I work for a company that does database management for several large retail companies. We have custom databases designed for each of these companies, according to how they intend for us to use the data (for direct mail, email campaigns, etc.), and what kind of analysis and selection parameters they like to use. For example, a company that sells musical equipment in stores and online will want to distinguish between walk-in and online customers, categorize the customers according to the type of items they buy (drums, guitars, microphones, keyboards, recording equipment, amplifiers, etc.), and keep track of how much they spent, and what they bought, over the past 6 months or the past year. They use this information to decide who will receive catalogs in the mail. These mailings are very expensive; maybe one or two dollars per customer, so the company wants to mail the catalogs only to those most likely to buy something. They may have 15 million customers in their database, but only 3 million buy drums, and only 750,000 have purchased anything in the past year.
If you were to analyze the database we created, you would find many "work" tables, that are used for specific selection purposes, and that may not actually be properly designed, according to database design principles. While the "main" tables are efficiently designed and have proper relationships and indexes, these "work" tables would make it appear that the entire database is poorly designed, when in reality, the work tables may just be used a few times, or even just once, and we haven't gone in yet to clear them out or drop them. The work tables far outnumber the main tables in this particular database.
One also has to take into account the volume of the data being managed. A customer base of 10 million may have transaction data numbering 10 to 20 million transactions per week. Or per day. Sometimes, for manageability, this data has to be partitioned into tables by date range, and then a view would be used to select data from the proper sub-table. This is efficient for this huge volume, but it may appear repetitive to an automated analyzer.
Your analyzer would need to be user configurable before the analysis began. Some items must be skipped, while others may be absolutely critical.
Also, how does one analyze stored procedures and user-defined functions, etc? I have seen some really ugly code that works quite efficiently. And, some of the ugliest, most inefficient code was written for one-time use only.
OK, I am out of ideas for the moment. Good luck with your project.
If you can get ahold of it, the project management system Clarity has a horrible database design. I don't know if they have a trial version you can download.

What are the use cases of Graph-based Databases (http://neo4j.org/)? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I have used Relational DB's a lot and decided to venture out on other types available.
This particular product looks good and promising: http://neo4j.org/
Has anyone used graph-based databases? What are the pros and cons from a usability prespective?
Have you used these in a production environment? What was the requirement that prompted you to use them?
I used a graph database in a previous job. We weren't using neo4j, it was an in-house thing built on top of Berkeley DB, but it was similar. It was used in production (it still is).
The reason we used a graph database was that the data being stored by the system and the operations the system was doing with the data were exactly the weak spot of relational databases and were exactly the strong spot of graph databases. The system needed to store collections of objects that lack a fixed schema and are linked together by relationships. To reason about the data, the system needed to do a lot of operations that would be a couple of traversals in a graph database, but that would be quite complex queries in SQL.
The main advantages of the graph model were rapid development time and flexibility. We could quickly add new functionality without impacting existing deployments. If a potential customer wanted to import some of their own data and graft it on top of our model, it could usually be done on site by the sales rep. Flexibility also helped when we were designing a new feature, saving us from trying to squeeze new data into a rigid data model.
Having a weird database let us build a lot of our other weird technologies, giving us lots of secret-sauce to distinguish our product from those of our competitors.
The main disadvantage was that we weren't using the standard relational database technology, which can be a problem when your customers are enterprisey. Our customers would ask why we couldn't just host our data on their giant Oracle clusters (our customers usually had large datacenters). One of the team actually rewrote the database layer to use Oracle (or PostgreSQL, or MySQL), but it was slightly slower than the original. At least one large enterprise even had an Oracle-only policy, but luckily Oracle bought Berkeley DB. We also had to write a lot of extra tools - we couldn't just use Crystal Reports for example.
The other disadvantage of our graph database was that we built it ourselves, which meant when we hit a problem (usually with scalability) we had to solve it ourselves. If we'd used a relational database, the vendor would have already solved the problem ten years ago.
If you're building a product for enterprisey customers and your data fits into the relational model, use a relational database if you can. If your application doesn't fit the relational model but it does fit the graph model, use a graph database. If it only fits something else, use that.
If your application doesn't need to fit into the current blub architecture, use a graph database, or CouchDB, or BigTable, or whatever fits your app and you think is cool. It might give you an advantage, and its fun to try new things.
Whatever you chose, try not to build the database engine yourself unless you really like building database engines.
We've been working with the Neo team for over a year now and have been very happy. We model scholarly artifacts and their relationships, which is spot on for a graph db, and run recommendation algorithms over the network.
If you are already working in Java, I think that modeling using Neo4j is very straight forward and it has the flattest / fastest performance for R/W of any other solutions we tried.
To be honest, I have a hard time not thinking in terms of a Graph/Network because it's so much easier than designing convoluted table structures to hold object properties and relationships.
That being said, we do store some information in MySQL simply because it's easier for the Business side to run quick SQL queries against. To perform the same functions with Neo we would need to write code that we simply don't have the bandwidth for right now. As soon as we do though, I'm moving all that data to Neo!
Good luck.
Two points:
First, on the data I've been working with the past 5 years in SQL Server, I've recently hit the scalability wall with SQL for the type of queries we need to run (nested relationhsips...you know...graphs). I've been playing around with neo4j, and my lookup times are several orders of magnitude faster when I need this kind of lookup.
Second, to the point that graph databases are outdated. Um...no. Early on, as people were trying to figure out how to store and lookup data efficiently, they created and played with graph and network style database models. These were designed so the physical model reflected the logical model, so their efficiency wasnt that great. This type of data structure was good for semi-structured data, but not as good for structured dense data. So, this IBM dude named Codd was researching efficient ways to arrange and store structured data and came up with the idea for the relational database model. And it was good, and people were happy.
What do we have here? Two tools for two different purposes. Graph database models are very good for representing semi-structured data and the relationships between entities (that may or may not exist). Relational databases are good for structured data that has a very static schema, and where join depths do not go very deep. One is good for one kind of data, the other is good for other kinds of data.
To coin the phrase, there is no Silver Bullet. Its very short sighted to say that graph database models are out of date and to use one gives up 40 years of progress. That's like saying using C is giving up all the technological progress we've gone through to get things like Java and C#. That's not true though. C is a tool that is needed for certain tasks. And Java is a tool for other tasks.
I've been using MySQL for years to manage engineering data, and it worked well, but one of the problems we had (but didn't realise we had) was that we always had to plan the schema up-front. Another problem we knew we had was mapping the data up to domain objects and back.
Now we've just started trying out neo4j and it looks like it is solving both problems for us. The ability to add different properties to each node (and relation) has allowed us to re-think our entire approach to data. It is like dynamic versus static languages (Ruby versus Java), but for databases. Building the data model in the database can be done in a much more agile and dynamic way, and that is dramatically simplifying our code.
And since the object model in code is generally a graph structure, mapping from the database is also simpler, with less code and consequently fewer bugs.
And as a additional bonus, our initial prototype code for loading our data into neo4j is actually performing faster than the previous MySQL version. I have no solid numbers on this (yet), but that was a nice additional feature.
But at the end of the day, the choice probably should be based mostly on the nature of your domain model. Does it map better to tables or graphs? Decide by doing some prototypes, load the data and play with it. Use neoclipse to look at different views of the data. Once you've done that, hopefully you know if you're on to a good thing or not.
Here is a good article that talks about the needs that non relational databases fill: http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
It does a good job at pointing out (aside from the name) that relational databases arent flawed or wrong, its just that these days people are starting to process more and more data in mainstream software and web sites, and that relational databases just wont scale for these needs.
I am building an intranet at my company.
I am interested in understanding how to load data that was stored in tables (Oracle, MySQL, SQL Server, Excel, Access, various random lists) and loading it into Neo4J, or some other graph database. Specifcally, what happens when common data overlaps existing data already in the system.
Yes, I know some data is best modeled in RDBMS, but I have this idea itching me, that when you need to superimpose several distinct tables, the graph model is better than the table structure.
For instance, I work in a manufacturing environment. There is a major project we are working on and because of the complexity, each department has created a seperate Excel spreadsheet that has a BOM (Bill Of Materials) hierarchy in a column on the left and then several columns of notes and checks made by individuals who made these sheets.
So one of the problems is merging all these notes together into one "view" so that someone can see all the issues that need to be addressed in any particular part.
The second problem is that an Excel spreadsheet sucks at representing a hierarchial BOM when a common component is used in more than one subassembly. Meaning that, if someone writes a note about the P34 relay in the ignition subassembly, the same comment should be associated with the P34 relays used in the motor driver subassembly. This won't occur in the excel spreadsheet.
For the company intranet, I want to be able to search for anything easily. Such as data related to a part number, a BOM structure, a phone number, an email address, a company policy, or procedure. I want to even extend this to manage computer hardware assets, and installed software.
I envision that once the information network starts to get populated you can start doing cool traversals such as "I want to write an email to everyone working on the XYZ project". People will have been associated with the project because they will be tagged as creating and modifying the data within the XYZ project. So by using the XYZ project as a search key, a huge set with everything related to the XYZ project will be created. Including links to people who built the XYZ project. The people links will connect to their email addresses. So by their involvement in the XYZ project, they will be included in my email. This is in stark contrast to some secretary trying to maintain a list of people work on the project. We generate a lot of lists. We spend a lot of time maintaining lists and making sure they are up to date. And most of it doesn't add any value to our products.
Another cool traversal could report all the computers that have a certain piece of software installed, by version. That report could be used to generate tasks to remove extra copies of old software and to update people who need to have the latest copy. It would also be useful for license tracking.
might be a bit late, but there is a growing number of projects using Neo4j, the better known ones listed at Neo4j . Also NeoTechnology, the company behind Neo4j, has some references at their customers page
Note: I am part of the Neo4j team

Tracing Designs - Screen to Database Traceability

This is vaguely related to:
Should I design the application or model (database) first?
Design from the database first through to UI or t’other way round?
But my question is more about modeling and artifacts and less about the right way to do design. I'm trying to figure out what sort of design artifact would best enunciate the link between features (use cases), screens and database elements (tables and columns, most particularly). UML is very code-centric. Database models are very database centric. And of screen designs are UI centric!
Here's the deal... my team is working on the first release of the product. We used use cases, then did screen designs and database design was somewhat isolated from the two. A critical area for bugs was the lack of traceability between the use cases and their accompanying screens and the database. In our product, there's a very high degree of overlap between use cases and database elements. Many use cases touch over 75% of the database infrastructure. So we have high contention over database design areas, and it's easy for a small database change to disrupt the lower levels of business logic.
For our next release, I want the developers and our DBA to have a really clear insight into what parts of the database each feature touches. The use case/screen design approach worked well, so we're keeping it... the trick is linking each use case and screens to the database model so the relationships are really obvious and hard to forget about.
On smaller projects (we're only 10 people, but often I've worked on teams of 3 or less), I've created my own custom diagrams to show this part of the design. Sort of a fusion of screen, UML and database table, done in Visio with no link to actual code or SQL. I'm not sure it will work for a larger team, as its highly manual to keep up to date, and it doesn't auto generate code the way our database modeling tool does.
Any recommendations? Is there a commonly accepted mechanism for this?
FYI - we're pretty waterfall, that isn't going to change any time soon. And we do love artifacts... Saying "switch to agile" is not a viable solution for our group.
I can't tell from your question how detailed your use cases are. I get the impression that they may be high-level use cases, not broken down into detailed uses cases (perhaps through include or extend relationships.
In any case, I prefer to start with Requirements, and trace them to use cases. While I'm writing the use cases and the use case diagrams, I'm also creating a domain model (a high-level class diagram). This is mostly to give me something to discuss with stakeholders ("did I get that right?").
When the use cases and domain model are finished, it's possible to begin to work on screen design, and possibly also for an activity model, if there are complex interactions between screens. I would treat the screens as though they were classes with UI - a screen might have a FirstName attribute, which I'd note as being related to the FirstName attribute of the Person entity in my domain model. Yet the FirstName attribute might be represented on that screen as a text box.
At the same time, physical database design can occur. This would produce a class or ER diagram, with traceability back to the domain model. Eventually, you might find that some of your screen attributes or activity modeling refer to things that are part of the physical database model that are not present in the domain model. It's ok to relate a screen "PersonalName" attribute to the computed PersonalName column in the Person table in the People schema.
The tool I use for this sort of thing is Sparx Enterprise Architect. It's a great tool, and can do all of this and more, even in the Professional edition.
I also have to say for the sake of Truth that I mostly model on my own - I haven't yet worked on a project where the model, code and database were being developed by separate people. If someone told me that the above wouldn't work in "real life", I might be forced to believe them.
I am not sure I understand your question clearly, but I'll try to respond based on some reasonable assumptions...
Essentially, my recommendation is the same as what John Saunders already suggested: to consider using UML along with a good UML tool. But I would like to add a few points that might be important in your specific situation.
First and foremost, I don't think that UML is "too code-centric". To the contrary, it can be used to model pretty much any aspect of a software system, almost at any level of abstraction. With a good tool at hand (like Sparx EA), the beauty of UML is that you actually do get a well-defined model under the hood (as opposed to just a set of unconnected drawings/charts). As a result, even if the tool itself does not give you a feature that you are looking for (like traceability from DB to use cases)... well, at least you have an option to automate (or at least semi-automate) the task yourself: for example, you can export your UML model into an XMI (standard!) and then derive whatever traceability-related info you need from there (e.g. using XSL or any XML-aware library for your favorite programming/scripting language). I am not saying that it would be easy to do (especially if you want traceability on the level of individual DB columns 8-), but it's possible and it is very likely to beat any manual method if it has to scale along the size of the project.
BTW, speaking of Sparx EA... I don't know all of its capabilities yet, but it has so many that I would not be surprized if it allowed you to select a class (or even an attribute of a class) and show you other model elements that depend on it in some way. You might want to check this out.
Having said all that, I do understand that you may have at least the following two important concerns about UML:
It may appear to require too many modeling details to be in place to get what you want.
As any "universal tool", it may be grossly inferior to specialized modeling tools that you already use.
Regarding issue #1: Again, with a good UML tool at hand, you might be able to do as many shortcuts as you want. For example, instead of building a very detailed and accurate activity model for a use case, you could focus just on classes involved in the use case (just enough to enable tracking classes back to use cases). The same applies to UI, of course.
Regarding issue #2: I don't know what exact tools do you use now to model use cases, UI, and DB schema. So, theoretically it is possible that they are all so superior to UML that you wouldn't want to give any of them up in exchange of easier traceability. However, something tells me that your DB modeling tool (with its code-generating capabilities) might be the only one that is truly indispensable. If that's the case, then I would still recommend to consider using UML: you just do not model down to DB schema level and "stop" at the level of domain model (even if you do not have it in your application!). At that point, the UML tool would give you traceability from domain model elements (entities, their attributes, and their relationships) back to use cases and UI elements, and mappings between your domain model and DB schema could be left "in the air" because, in the vast majority of cases, they should be simple enough to track without drawing anything. This might not give you 100% of what you want, but it could give you 80% that would be sufficient to mitigate most of you problems.
The bottom line: if you are using three different tools/technologies to model three different aspects of your system... well, it's obvious that any traceability between those three aspects remains at mercy of those three tools: you could automate only as much as those tools allow (which probably means that you are going to be stuck with a lot of laborous manual tasks to do). As of today, UML appears the only well-defined and widely supported "lingua franca" that could help you to connect your models and could enable automation of substantial part of your analytical activities. Just make sure you distinguish UML "just-drawing tools" (like most of Visio add-ons and stencils) from true UML modeling tools (like Sparx EA and a bunch of others).
Your use cases is a good place to start from.
Convert your use cases into
executable test code. This test code
needs to verify that the resultant
return values are according to the
requirements of your use cases.
The smaller the parts of the work you
can identify and test, the more
robust you will be able to build
your application.
This means that the interaction of
your use cases with a large part
of your database and the GUI, will
be simpler to understand.
It's hard to lock-down the architecture or business logic interplay in complex projects with complete upfront design of the different layers. One truly learns about what will be able to facilitate your requirements only as you get to the point of implementing them.
As a developer, find the techniques, tools and processes that help you do your job in the best way possible. Don't judge these on their origin. Judge them on their value in making you the best developer possible.
Some items from the agile world has certainly made a big difference to the quality and productivity of my work. This doesn't require throwing out the apple cart and putting an experienced waterfall team into disarray.
The database should model your Problem Domain. It should model it completely enough so that you can extract solutions -- truths -- from it. Bad design is essentially "lying" to the database (allowing the possibility of invalid data or relationships), and when you "lie" to your database, it'll "lie" to you when you ask it questions.
Simple examples are modeling many-to-many relation where a relation can only be one-to-many, or assuming values can't be null, or treating a foreign key as an attribute. Many of these can be avoided by proper normalization, which requires you to explicitly find out what is a key and what isn't.
By "making illegal states unrepresentable" in the model, you avoid having to write "defensive" code to check for the impossible or to validate that a relation is possible, as impossible things are made unrepresentable because of table structures or declarative check constraints.
This lowers the cost of writing your code, as you can concentrate for the most part, on what it needs to do, rather than guarding against the impossible.

How would you design your database to allow user-defined schema

If you have to create an application like - let's say a blog application, creating the database schema is relatively simple. You have to create some tables, tblPosts, tblAttachments, tblCommets, tblBlaBla… and that's it (ok, i know, that's a bit simplified but you understand what i mean).
What if you have an application where you want to allow users to define parts of the schema at runtime. Let's say you want to build an application where users can log any kind of data. One user wants to log his working hours (startTime, endTime, project Id, description), the next wants to collect cooking recipes, others maybe stock quotes, the weekly weight of their babies, monthly expenses they spent for food, the results of their favorite football teams or whatever stuff you can think about.
How would you design a database to hold all that very very different kind of data? Would you create a generic schema that can hold all kind of data, would you create new tables reflecting the user data schema or do you have another great idea to do that?
If it's important: I have to use SQL Server / Entity Framework
Let's try again.
If you want them to be able to create their own schema, then why not build the schema using, oh, I dunno, the CREATE TABLE statment. You have a full boat, full functional, powerful database that can do amazing things like define schemas and store data. Why not use it?
If you were just going to do some ad-hoc properties, then sure.
But if it's "carte blanche, they can do whatever they want", then let them.
Do they have to know SQL? Umm, no. That's your UIs task. Your job as a tool and application designer is to hide the implementation from the user. So present lists of fields, lines and arrows if you want relationships, etc. Whatever.
Folks have been making "end user", "simple" database tools for years.
"What if they want to add a column?" Then add a column, databases do that, most good ones at least. If not, create the new table, copy the old data, drop the old one.
"What if they want to delete a column?" See above. If yours can't remove columns, then remove it from the logical view of the user so it looks like it's deleted.
"What if they have eleventy zillion rows of data?" Then they have a eleventy zillion rows of data and operations take eleventy zillion times longer than if they had 1 row of data. If they have eleventy zillion rows of data, they probably shouldn't be using your system for this anyway.
The fascination of "Implementing databases on databases" eludes me.
"I have Oracle here, how can I offer less features and make is slower for the user??"
Gee, I wonder.
There's no way you can predict how complex their data requirements will be. Entity-Attribute-Value is one typical solution many programmers use, but it might be be sufficient, for instance if the user's data would conventionally be modeled with multiple tables.
I'd serialize the user's custom data as XML or YAML or JSON or similar semi-structured format, and save it in a text BLOB.
You can even create inverted indexes so you can look up specific values among the attributes in your BLOB. See http://bret.appspot.com/entry/how-friendfeed-uses-mysql (the technique works in any RDBMS, not just MySQL).
Also consider using a document store such as Solr or MongoDB. These technologies do not need to conform to relational database conventions. You can add new attributes to any document at runtime, without needing to redefine the schema. But it's a tradeoff -- having no schema means your app can't depend on documents/rows being similar throughout the collection.
I'm a critic of the Entity-Attribute-Value anti-pattern.
I've written about EAV problems in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Here's an SO answer where I list some problems with Entity-Attribute-Value: "Product table, many kinds of products, each product has many parameters."
Here's a blog I posted the other day with some more discussion of EAV problems: "EAV FAIL."
And be sure to read this blog "Bad CaRMa" about how attempting to make a fully flexible database nearly destroyed a company.
I would go for a Hybrid Entity-Attribute-Value model, so like Antony's reply, you have EAV tables, but you also have default columns (and class properties) which will always exist.
Here's a great article on what you're in for :)
As an additional comment, I knocked up a prototype for this approach using Linq2Sql in a few days, and it was a workable solution. Given that you've mentioned Entity Framework, I'd take a look at version 4 and their POCO support, since this would be a good way to inject a hybrid EAV model without polluting your EF schema.
On the surface, a schema-less or document-oriented database such as CouchDB or SimpleDB for the custom user data sounds ideal. But I guess that doesn't help much if you can't use anything but SQL and EF.
I'm not familiar with the Entity Framework, but I would lean towards the Entity-Attribute-Value (http://en.wikipedia.org/wiki/Entity-Attribute-Value_model) database model.
So, rather than creating tables and columns on the fly, your app would create attributes (or collections of attributes) and then your end users would complete the values.
But, as I said, I don't know what the Entity Framework is supposed to do for you, and it may not let you take this approach.
Not as a critical comment, but it may help save some of your time to point out that this is one of those Don Quixote Holy Grail type issues. There's an eternal quest for probably over 50 years to make a user-friendly database design interface.
The only quasi-successful ones that have gained any significant traction that I can think of are 1. Excel (and its predecessors), 2. Filemaker (the original, not its current flavor), and 3. (possibly, but doubtfully) Access. Note that the first two are limited to basically one table.
I'd be surprised if our collective conventional wisdom is going to help you break the barrier. But it would be wonderful.
Rather than re-implement sqlservers "CREATE TABLE" statement, which was done many years ago by a team of programmers who were probably better than you or I, why not work on exposing SQLSERVER in a limited way to the users -- let them create thier own schema in a limited way and leverage the power of SQLServer to do it properly.
I would just give them a copy of SQL Server Management Studio, and say, "go nuts!" Why reinvent a wheel within a wheel?
Check out this post you can do it but it's a lot of hard work :) If performance is not a concern an xml solution could work too though that is also alot of work.

Defining the database schema in the application or in the database?

I know that the title might sound a little contradictory, but what I'm asking is with regards to ORM frameworks (SQLAlchemy in this case, but I suppose this would apply to any of them) that allow you to define your schema within your application.
Is it better to change the database schema directly and then update the column types in your program manually, or does it make more sense to define the tables in your application and then use the ORM framework's table generation functions to make the schema and then build the tables on the database side for you?
Bear in mind that applications and databases tend to live in a M:M relationship in any but the most trivial cases. If your application is at all likely to have interfaces to other systems, reports, data extracts or loads, or data migrated onto or off it from another system then the database has more than one stakeholder.
Be nice to the other stakeholders in your application. Take the time and get the schema right and put some thought into data quality in the design of your application. Keep an eye on anyone else using the application and make sure you don't break bits of the schema that they depend on without telling them. This means that the database has a life of its own to a greater or lesser extent. The more integration, the more independent the database.
Of course, if nobody else uses or cares about the data, feel free to ignore my advice.
My personal belief is that you should design the database on its own merits. The database is the best place to handle things modeling your Domain data. The database is also the biggest source of slow down in applications and letting your ORM design your database seems like a bad idea to me. :)
Of course, I've only got a couple of big projects behind me. I'm still learning daily. :)
The best way to define your database schema is to start with modeling your application domain (domain driven design anyone?) and seeing what tables take shape based on the domain objects you define.
I think this is the best way because really the database is simply a place to persist information from the application, it should never lead the design. It's not the only place to persist information as well. We have users that want to work from flat files or the database for instance. They could also use XML files. So by starting with your domain objects and then generating tables (or flat file or XML schema or whatever) from there will lead to a much better design in the end.
While this may depend on you using an object-oriented language, using an ORM tool like Hibernate/NHibernate, SubSonic, etc. can really make this transition easy for you up to, and including generating the database creation scripts.
In reference to performance, performance should be one of the last things you look at in an application, it should never drive the design. After you get a good schema up and running based on your domain you can always make tweaks to improve its performance.
Alot depends on your skill level with the specific database product that you're going to use. Think of it as the difference between a "manual" and "automatic" transmission car. ORMs provide you with that "automatic" transmission, just start designing your classes, and let the ORM worry about getting it stored into the database somehow.
Sounds good. The problem with most ORMs is that in their quest to be PI "persistence ignorant", they often don't take advantage of specific database features that can provide elegant solutions for a given task. Notice, I didn't say ALL ORMs, just most.
My take is to design the conceptual data model first yourself. Then you can go in either direction, up towards the application space, or down towards the physical database. But remember, only YOU know if it's more advantageous to use a view instead of a table, should you normalize or de-normalize a table, what non-clustered index(es) make sense with this table, is a natural or surrogate key more appropriate for this table, etc... Of course, if you feel that these questions are beyond your grasp, then let the ORM help you out.
One more thing, you really need to seperate the application design from the database design. They are almost never the same. How important is that data? Could another application be designed to use that data? It's a lot easier to refactor an application than it is to refactor a database with a billion rows of data spread across thousands of tables.
Well, if you can get away with it, doing it in the application is probably the best way. Since it's a perfect example of the DRY principle.
Having said that however, getting away with it is always going to be hard to pull off since you're practically choosing to give up most database specific optimizations. (more so, with querying, but it still applies to schemas (indexes, etc)).
You'll probably end up changing the schema by hand anyway, and then you'll be stuck with a brittle database schema that's going to be the source of your worst nightmares :)
My 2 Cents
Design each based on their own requirements as much as possible. Trying to keep them in too rigid sync is a good illustration of increased coupling/decreased cohesion.
Come to think of it, ORMs can easily be used to spread coupling (even though it can be avoided to some degree).

Resources