Top-down vs. Bottom-up database design: Real world examples - database

OK, I can find hundreds of references on the internet about the difference between top-down and bottom-up database design approaches; however, I can't seem to find any real-world examples, or any information on which design is really more suitable for which circumstances.
Can anyone help me out?

I'm basing this answer on this Data Modeling Wikipedia article.
About half way down the Wikipedia page, there's a section called "Modeling methodologies".
A top down approach is used to create a new database. You model the objects at a logical level, then you apply the objects to a physical database design. For example, a relational database would need the objects to be mapped to tables.
To use a real world example, a payroll system would have to have person objects, along with other objects that hold pay rules (overtime for over 40 hours a week, overtime for more than 10 hours a day, etc.). There would be a pay period object, which holds the dates of the pay period and the pay day. This description isn't a complete design. As you think about the application more, you come up with additional objects that need to exist, and additional entities that need to be part of existing objects.
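To make that mapping concrete, here is a minimal sketch of how such logical objects might land as relational tables. All table and column names here are hypothetical, purely to illustrate the object-to-table step:

    -- Hypothetical tables derived from the logical payroll objects described above
    CREATE TABLE person (
        person_id  INTEGER      PRIMARY KEY,
        full_name  VARCHAR(100) NOT NULL,
        hire_date  DATE         NOT NULL
    );

    CREATE TABLE pay_period (
        pay_period_id INTEGER PRIMARY KEY,
        start_date    DATE NOT NULL,
        end_date      DATE NOT NULL,
        pay_day       DATE NOT NULL
    );

    CREATE TABLE pay_rule (
        pay_rule_id INTEGER      PRIMARY KEY,
        description VARCHAR(200) NOT NULL,  -- e.g. 'overtime for over 40 hours a week'
        multiplier  DECIMAL(4,2) NOT NULL   -- e.g. 1.50 for time-and-a-half
    );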
A bottom up approach is used to migrate a database from one physical database to another. Migrating from Oracle to IBM's DB2 usually requires some changes, as the column data types are not completely compatible. You would create tables based on the existing tables. Sometimes, you try to make a near exact copy, to minimize the application coding changes. Other times, you alter the table structure, usually to normalize further or to group columns together in a more logical way. Yes, the application code would have to change to accommodate the new database schema. But sometimes, the pain is worth the gain.
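As a rough illustration (the exact type mapping depends on the product versions involved, and these table and column names are hypothetical), re-creating an Oracle table in DB2 might look like this:

    -- Original Oracle definition (hypothetical example):
    -- CREATE TABLE customer (
    --     customer_id  NUMBER(10)    PRIMARY KEY,
    --     name         VARCHAR2(100) NOT NULL,
    --     created_at   DATE          NOT NULL
    -- );

    -- Near-exact DB2 copy, with the Oracle-specific types replaced
    CREATE TABLE customer (
        customer_id  DECIMAL(10,0) NOT NULL PRIMARY KEY,
        name         VARCHAR(100)  NOT NULL,
        created_at   TIMESTAMP     NOT NULL  -- Oracle's DATE also carries a time part
    );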
I've seen lots of database migrations. They're hard to describe in a post. They are painful to work through.

To understand the differences between these approaches, let's consider some jobs that are bottom-up in nature. In statistical analysis, analysts are taught to take a small sample from a population and then infer the results to the overall population. Physicians are also trained in the bottom-up approach. Doctors examine specific symptoms and then infer the general disease that causes the symptoms.
Examples of jobs that require the top-down approach include project management and engineering tasks where the overall requirements must be specified before the details can be understood. For example, an automobile manufacturer must follow a top-down approach to meet the overall specifications for the car: say the car must cost less than 15,000 dollars, get 25 miles per gallon, and seat five people. To meet these requirements, the designers start by creating a specification document and then drill down into the details.
taken from http://www.dba-oracle.com/t_object_top_down_bottom_up.htm


In what cases are bitemporal tables actually used?

I am trying to collect information about temporal databases. I know it is not a new technology, but I have noticed that many people who work with databases don't even know how the temporal approach works (I asked some senior programmers and system analysts about temporal databases and they answered something like "Huh?").
I know there are valid-time state tables and transaction-time state tables, along with bitemporal tables. I think that bitemporal tables are way too complex for most usages, because nowadays space is not a problem anymore and it is more efficient to write the same information into 2 different tables, even if the data is redundant. However, I have searched online quite a bit for places where bitemporal tables are actually used, but I didn't find anything useful.
Are there cases where using a bitemporal table is really more convenient than using valid-time and transaction-time state tables separately? Are there real-world examples?
Of course! Take, for example, balance sheet data. You will find that this information will change from WD1 (Working Day 1) to WD x due to late-arriving data, adjustments, manual errors and suchlike.
In order to enable repeatable reporting, audit trail and temporal comparisons, a record must be kept of 'old' (invalid?) results. Bitemporal is a great way to manage such updates, especially on an intraday basis. I don't think it's that complicated from a user perspective - just another filter on the where clause.
I admit that the loading process is complicated, but it's not that bad. I literally just finished writing a generic transform (in SAS, coping with all scenarios for a unique business key) and it took a single day.
Coming back to use cases: having both valid (business) time and transaction (version) time on the same table enables:
Repeatable results (having separate tables and corresponding updates could mean getting different results for the same query on two different days)
Comparable results (can answer questions such as "what was the value of X, as we knew it at time Y?")
Rapid results (only dealing with a single table, updated in a transparent and easy-to-query way).
In this sense it is an appropriate structure to use on many, if not all, tables in a DWH.
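As a minimal sketch of the idea (hypothetical table and column names, not the SAS transform mentioned below), a bitemporal table simply carries both pairs of dates, and the "as we knew it" question becomes two extra predicates:

    -- Hypothetical bitemporal balance table: valid (business) time plus transaction (version) time
    CREATE TABLE account_balance (
        account_id  INTEGER       NOT NULL,
        balance     DECIMAL(18,2) NOT NULL,
        valid_from  DATE          NOT NULL,  -- business time: when the balance applies
        valid_to    DATE          NOT NULL,  -- open rows use a far-future date such as '9999-12-31'
        tx_from     TIMESTAMP     NOT NULL,  -- version time: when this row was recorded
        tx_to       TIMESTAMP     NOT NULL   -- when this row was superseded
    );

    -- "What was the balance for 2020-03-02, as we knew it on 2020-03-05 at 09:00?"
    SELECT balance
    FROM   account_balance
    WHERE  account_id = 42
    AND    DATE '2020-03-02' >= valid_from AND DATE '2020-03-02' < valid_to
    AND    TIMESTAMP '2020-03-05 09:00:00' >= tx_from
    AND    TIMESTAMP '2020-03-05 09:00:00' <  tx_to;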
UPDATE 2020: A bitemporal data transform for SAS (both SAS 9 and Viya) is available with Data Controller for SAS. A demo version is available: https://docs.datacontroller.io/dcc-tables/#var_busfrom-var_busto
I think your question raises more issues but it all comes down to how much is enough.
I developed a Bi_Temporal SQL Server engine that supports object versioning and relationship by time as well as all the other beautiful parts of Temporal DB's.
This was because the project needed to be able to be rewound to a place in time and see everything as it was at that time.
I mean everything including data, relationships and User access.
It was the most complex thing I have built but in the end it was so complex no-one else could maintain it, or understand what was happening.
So there was a real world use case and a deliverable.
It's not everyone's cup of tea, as you have to be able to think in the time dimension as well as in object version changes, as all DBs do.
Hope this helps. I know the post is old, but as it was the first I found when searching for temporal DBs, it might be of interest to someone.

What's the attraction of schemaless database systems?

I've been hearing a lot of talk about schema-less (often distributed) database systems like MongoDB, CouchDB, SimpleDB, etc...
While I can understand they might be valuable for some purposes, in most of my applications I'm trying to persist objects that have a specific number of fields of a specific type, and I just automatically think in the relational model. I'm always thinking in terms of rows with unique integer ids, null/not null fields, SQL datatypes, and select queries to find sets.
While I'm attracted to the distributed nature and easy JSON/RESTful interfaces of these new systems, I don't understand how loosely typed key/value hashes will help me with my development. Why would a loosely typed, schema-less system be good for keeping clean data sets? How can I, for example, find all items with dates between x and y when they might not have dates? Is there any concept of a join?
I understand many systems have their own differences and strengths, but I'm wondering at the difference in paradigm. I suppose this is an open-ended question, but perhaps the community's answers and ways they have personally seen the advantages of these systems will help enlighten me and others about when I would want to make use of these (admittedly more hip) systems instead of the traditional RDBMS.
I'll just call out one or two common reasons (I'm sure people will be writing essay answers)
With highly distributed systems, any given data set may be spread across multiple servers. When that happens, the relational constraints which the DB engine can guarantee are greatly reduced. Some of your referential integrity will need to be handled in application code. When doing so, you will quickly discover several pain points:
your logic is spread across multiple layers (app and db)
your logic is spread across multiple languages (SQL and your app language of choice)
The outcome is that the logic is less encapsulated, less portable, and MUCH more expensive to change. Many devs find themselves writing more logic in app code and less in the database. Taken to the extreme, the database schema becomes irrelevant.
Schema management is difficult, especially on systems where downtime is not an option. Reducing schema complexity reduces that difficulty.
ACID doesn't work very well for distributed systems (BASE, CAP, etc.). The SQL language (and the entire relational model, to a certain extent) is optimized for a transactional ACID world. So some of the SQL language features and best practices are useless, while others are actually harmful. Some developers feel uncomfortable working "against the grain" and prefer to drop SQL entirely in favor of a language designed from the ground up for their requirements.
Cost: most RDBMS systems aren't free. The leaders in scaling (Oracle, Sybase, SQL Server) are all commercial products. When dealing with large ("web scale") systems, database licensing costs can meet or exceed the hardware costs! The costs are high enough to change the normal build/buy considerations drastically towards building a custom solution on top of an OSS offering (all the significant NOSQL offerings are OSS)
The primary concern should be what you need to do with your data. If you have a huge data set and are finding a traditional RDBMS to be a bottleneck, then you may want to experiment with a schemaless or NOSQL solution.
Most environments that I am aware of using NOSQL solutions also use an RDBMS solution in some form or fashion. RDBMS-based solutions are the norm where data integrity is extremely important and you need ACID transactions. However, if your system is not highly transaction-based but you need to scale up or scale out quickly, a NOSQL solution may be desirable.
Schemaless is great for two reasons:
Brain optimising intuitiveness of document storage
Resolves Sparse-Matrix and Entity-Attribute-Value storage problems.
I've used both SQL and No-SQL for production applications in Ruby on Rails. I'm not a database expert and I have to confess to googling ACID and similar terms as they're not familiar to me.
"Ah ha! Another know-nothing trend follower jumping on the latest bandwagon" you may say. But, actually, I'm really pleased with my decision to use MongoDB on our most recent 2 year old app and here's why...
The flip-side of brain-optimising intuitiveness was my experience with the Magento e-commerce system. I don't want to bash it because it served me well at the time but it really hit the processor hard trying to calculate the attributes for each product. The underlying reason was the Entity-Attribute-Value store of product data. Cache or be damned was the solution.
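For readers who haven't met it, this is roughly what an Entity-Attribute-Value layout looks like (hypothetical names, not Magento's actual schema); every attribute you want back costs another join against the value table:

    -- Hypothetical EAV layout: one row per product per attribute
    CREATE TABLE product (
        product_id INTEGER     PRIMARY KEY,
        sku        VARCHAR(64) NOT NULL
    );

    CREATE TABLE product_attribute_value (
        product_id INTEGER      NOT NULL REFERENCES product (product_id),
        attribute  VARCHAR(64)  NOT NULL,   -- 'colour', 'size', 'wattage', ...
        attr_value VARCHAR(255) NOT NULL,
        PRIMARY KEY (product_id, attribute)
    );

    -- Reassembling one product means one join per attribute you want back
    SELECT p.sku, colour_row.attr_value AS colour, size_row.attr_value AS size
    FROM   product p
    LEFT JOIN product_attribute_value colour_row
           ON colour_row.product_id = p.product_id AND colour_row.attribute = 'colour'
    LEFT JOIN product_attribute_value size_row
           ON size_row.product_id = p.product_id AND size_row.attribute = 'size';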
The major advantage to me is the optimisation in the only place that really matters - your own brain. So many technologies are critiqued on their efficiency in memory, processors, hardware and yet having a DB that's extremely intuitive to understand brings its own merits. We've found it quick to add features to our code because the database simply looks a lot like the real world we're modelling. When I've asked e-commerce clients to present me with their product list they will naturally tend to use Excel (think table store). The first columns are easy:
Product Name
Price
Product Type
Then it gets harder and covered in notes, colour coding and links to other tables (yep.. relationships)
Colour (Only some products)
Size (X Large, Large, Small) - only for products 8'9'10, golf clubs use a different scale
Colour 2. The cat collars have two colour choices.
Wattage
Fixing type (Male, Female)
So it ends in a terrible mess of Excel tables that make no sense to me and not much sense to the people who work with the products day in and day out. We throw our arms in the air and decide to go through the catalogue and then it hits me! Wouldn't it be great if you could store the data as it appears in the catalogue!? Just collections of records on each product that just lists the attribute of that product. You can then pick out common attributes to index for retrieval at a later date. Of course, that's a document store.
In summary, document stores are great when you have a sparse matrix problem or objects that mutate their attributes over time. Having lived in a No-SQL world for 2 years, I can't think of a real world application that doesn't have those features because the world itself looks like a document store.
I've only played with MongoDB but one thing that really interested me was how you could nest documents. In MongoDB a document is basically like a record. This is really nice because traditionally, in a RDBMS, if you needed to pull a "Person" record and get the associated address, employer info, etc. you'd frequently have to go to multiple tables, join them up, make multiple database calls. In a NoSQL solution like MongoDB, you can just nest the associated records (documents) and not have to mess with foreign keys, joining, multiple database calls. Everything associated with that one record is pulled.
This is especially handy when dealing with objects. You can in many cases just store an object as a series of nested documents.
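For contrast, this is roughly the multi-table query the nested document replaces on the relational side (table and column names here are hypothetical):

    -- Hypothetical relational layout for the data a single nested document would hold
    SELECT p.full_name,
           a.street, a.city,
           e.company_name
    FROM   person   p
    JOIN   address  a ON a.person_id   = p.person_id
    JOIN   employer e ON e.employer_id = p.employer_id
    WHERE  p.person_id = 42;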
NoSQL databases are not schemaless; the schema is embedded in the data. They are properly called semistructured. In some KV data stores, however, the schema may even be embedded in code. The advantage of the semi-structured approach is twofold: flexibility in which columns are part of a row (one row could have 5 columns and another could have 5 different columns), and flexibility in the characteristics of the columns (e.g., variable lengths).
Normally the attraction is that of snake oil: most people favouring them have no clue about the relational model and speak SQL at a level that makes professionals puke. No idea what the ACID properties are, why they are important, etc.
Not saying they do not have valid uses... just saying that mostly the attraction is people not knowing what they should know and drawing poor conclusions. Again, not everyone is like that, but most developers favouring them are not good in their understanding of what a database system actually is responsible for.

Bad real-world database schemas

Our master's thesis project is creating a database schema analyzer. As a foundation for this, we are working on quantifying bad database design.
Our supervisor has tasked us with analyzing a real world schema, of our choosing, such that we can identify some/several design issues. These issues are to be used as a starting point in the schema analyzer.
Finding a good schema is a bit difficult because we do not want a schema which is well designed in all aspects, but a schema that is more "rare to medium".
We have already scheduled the following schemas for analysis: Wikimedia, Moodle and Drupal. We're not sure which category each fits into. It is not necessary that the schema be open source.
The database engine used is not important, though we would like to focus on SQL Server, PostgreSQL and Oracle.
For now literature will be deferred, as this task is supposed to give us real world examples which can be used in the thesis. i.e. "Design X is perceived by us as bad design, which our analyzer identifies and suggests improvements to", instead of coming up with contrived examples.
I will update this post when we have some kind of a tool ready.
Check out the Dell DVD Store; you can use it for free.
The Dell DVD Store is an open source simulation of an online ecommerce site with implementations in Microsoft SQL Server, Oracle and MySQL along with driver programs and web applications.
Bill Karwin has written a great book about bad designs: SQL antipatterns
I'm working on a project that includes a geographical information system, and in my opinion these designs are often "medium" to "rare".
Here are some examples:
1) Geonames.org
You can find the data and the schema here: http://download.geonames.org/export/dump/ (scroll down to the bottom of the page for the schema; it's in plain text on the site!)
It'd be interesting to see how this DB design performs with such a HUGE amount of data!
2) OpenGeoDB
This one is very popular in German-speaking countries (Germany, Austria, Switzerland) because it's a database containing nearly every city/town/village in the German-speaking region with zip code, name, hierarchy and coordinates.
This one comes with a .sql schema and the table fields are in English, so this shouldn't be a problem.
http://fa-technik.adfc.de/code/opengeodb/
The interesting thing in both examples is how they managed the hierarchy of entities like Country -> State -> County -> City -> Village etc.
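One common way to hold such a hierarchy in a single table is a self-referencing parent pointer; whether these projects actually do it this way is best checked against their dumps, so treat this as a minimal sketch with hypothetical names:

    -- Hypothetical adjacency-list hierarchy: Country -> State -> County -> City -> Village
    CREATE TABLE place (
        place_id    INTEGER      PRIMARY KEY,
        name        VARCHAR(200) NOT NULL,
        admin_level VARCHAR(20)  NOT NULL,                    -- 'country', 'state', 'county', ...
        parent_id   INTEGER      REFERENCES place (place_id)  -- NULL for countries
    );

    -- All direct children of a given state
    SELECT child.name
    FROM   place child
    JOIN   place parent ON child.parent_id = parent.place_id
    WHERE  parent.name = 'Bavaria' AND parent.admin_level = 'state';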
PS: Maybe you could judge my DB design too ;) DB Schema of a Role Based Access Control
vBulletin has a really bad database schema.
"we are working on quantifying bad database design."
It seems to me like you are developing a model, or process, or apparatus, that takes a relational schema as input and scores it for quality.
I invite you to ponder the following:
Can a physical schema be "bad" while the logical schema is nonetheless "extremely good"? Do you intend to distinguish properly between "logical schema" and "physical schema"? How do you plan to achieve that?
How do you decide that a certain aspect of physical design is "bad"? Take, for example, the absence of some index. If the relvar that that "supposedly desirable index" is to be on is itself constrained to be a singleton, then what detrimental effects would the absence of that index cause for the system? If there are no such detrimental effects, then what grounds are there for qualifying the absence of such an index as "bad"?
How do you decide that a certain aspect of logical design is "bad"? Choices in logical design are made as a consequence of what the actual requirements are. How can you make any judgment whatsoever about a logical design without a formalized and machine-readable way to specify what the actual requirements are?
Wow - you have an ambitious project ahead of you. To determine what is a good database design may be impossible, except for broadly understood principles and guidelines.
Here are a few ideas that come to mind:
I work for a company that does database management for several large retail companies. We have custom databases designed for each of these companies, according to how they intend for us to use the data (for direct mail, email campaigns, etc.), and what kind of analysis and selection parameters they like to use. For example, a company that sells musical equipment in stores and online will want to distinguish between walk-in and online customers, categorize the customers according to the type of items they buy (drums, guitars, microphones, keyboards, recording equipment, amplifiers, etc.), and keep track of how much they spent, and what they bought, over the past 6 months or the past year. They use this information to decide who will receive catalogs in the mail. These mailings are very expensive; maybe one or two dollars per customer, so the company wants to mail the catalogs only to those most likely to buy something. They may have 15 million customers in their database, but only 3 million buy drums, and only 750,000 have purchased anything in the past year.
If you were to analyze the database we created, you would find many "work" tables that are used for specific selection purposes and that may not actually be properly designed according to database design principles. While the "main" tables are efficiently designed and have proper relationships and indexes, these "work" tables would make it appear that the entire database is poorly designed, when in reality the work tables may just be used a few times, or even just once, and we haven't gone in yet to clear them out or drop them. The work tables far outnumber the main tables in this particular database.
One also has to take into account the volume of the data being managed. A customer base of 10 million may have transaction data numbering 10 to 20 million transactions per week. Or per day. Sometimes, for manageability, this data has to be partitioned into tables by date range, and then a view would be used to select data from the proper sub-table. This is efficient for this huge volume, but it may appear repetitive to an automated analyzer.
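A minimal sketch of that pattern, with hypothetical names: the per-range tables look repetitive to an automated analyzer, but the view is what queries actually hit.

    -- Hypothetical manual partitioning by date range
    CREATE TABLE txn_2024_01 (
        txn_id   BIGINT        NOT NULL,
        cust_id  INTEGER       NOT NULL,
        txn_date DATE          NOT NULL CHECK (txn_date >= DATE '2024-01-01'
                                           AND txn_date <  DATE '2024-02-01'),
        amount   DECIMAL(12,2) NOT NULL
    );

    CREATE TABLE txn_2024_02 (
        txn_id   BIGINT        NOT NULL,
        cust_id  INTEGER       NOT NULL,
        txn_date DATE          NOT NULL CHECK (txn_date >= DATE '2024-02-01'
                                           AND txn_date <  DATE '2024-03-01'),
        amount   DECIMAL(12,2) NOT NULL
    );

    -- The view applications query; it selects from the proper sub-tables
    CREATE VIEW txn AS
        SELECT * FROM txn_2024_01
        UNION ALL
        SELECT * FROM txn_2024_02;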
Your analyzer would need to be user configurable before the analysis began. Some items must be skipped, while others may be absolutely critical.
Also, how does one analyze stored procedures and user-defined functions, etc? I have seen some really ugly code that works quite efficiently. And, some of the ugliest, most inefficient code was written for one-time use only.
OK, I am out of ideas for the moment. Good luck with your project.
If you can get ahold of it, the project management system Clarity has a horrible database design. I don't know if they have a trial version you can download.

DataWarehouse - What is a good definition?

Could someone give me a good, practical definition of what a data warehouse is?
I'm surprised no one has posted Inmon's definition:
A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
From the same page you can pick up Kimball's definition:
A copy of transaction data specifically structured for query and analysis.
I think that, unfortunately, data warehousing is a wide-ranging field. There is a lot of variety with very few standard paradigms; specifically, I'm thinking of Kimball's dimensional modelling. Inmon does not have as specific a methodology as Kimball's, and thus some 3NF models may or may not conform to his principles.
Because Inmon has broadened his scope for what warehousing is meant to accomplish, it can encompass unstructured data. However, analysis of unstructured data is very different from traditional analysis.
As applied to SQL Server, typically the largest Data Warehouses on SQL Server are modelled dimensionally, because this lends itself well to the non-distributed, non-massively parallel model. Massively parallel systems like Teradata generally perform a lot better with 3NF models. These are still table-based systems with the various tables connected with foreign key constraints (perhaps not enforced, but at least logical).
Of course, we are also seeing NoSQL data processing systems like Map/Reduce which are not really databases at all in the sense of normalized, denormalized or non/poorly-normalized relational databases which we have had for 40 years now.
I just started with data warehousing and business intelligence, and looking around the web you can find some interesting links:
Get Start With Datawarehousing
I think these two links could help you understand the concepts of data warehousing.
Sorry, I'm new, so I can post only one link ^^
A database optimized for retrieval: in general denormalized data, usually a star schema (but it could be a snowflake), using dimensional modeling (fact and dimension tables).
While this is not an academic definition, it might serve as a practical one. A data warehouse is a collection of datamarts and will combine datasets across the breadth of an organization.
A datamart will contain datasets specific to certain portions of the business. In the datamart you will find fact tables, measurable pieces of information, along with dimensions, attributes of your measurable pieces.
A true data warehouse will have conformed dimension tables that can be shared across datamarts.
An example...
Your company may build a datamart around sales. And another datamart around human resources. If the customer dimension table is shared across both these datamarts, it would be considered a conformed dimension. All three of these entities together would make up a data warehouse.
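A minimal sketch of that layout, with hypothetical names: two fact tables, one per datamart, both keyed to the same customer dimension, which is what makes the dimension conformed.

    -- Shared (conformed) dimension
    CREATE TABLE dim_customer (
        customer_key  INTEGER      PRIMARY KEY,
        customer_name VARCHAR(100) NOT NULL,
        region        VARCHAR(50)  NOT NULL
    );

    -- Sales datamart fact table
    CREATE TABLE fact_sales (
        sale_id      BIGINT        PRIMARY KEY,
        customer_key INTEGER       NOT NULL REFERENCES dim_customer (customer_key),
        sale_date    DATE          NOT NULL,
        amount       DECIMAL(12,2) NOT NULL
    );

    -- Human-resources datamart fact table (hypothetical measure)
    CREATE TABLE fact_support_case (
        case_id      BIGINT       PRIMARY KEY,
        customer_key INTEGER      NOT NULL REFERENCES dim_customer (customer_key),
        opened_date  DATE         NOT NULL,
        hours_spent  DECIMAL(6,2) NOT NULL
    );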
As someone else stated you can find more detailed information by searching for Ralph Kimball's Data Strategies.
Definition: a data warehouse is a database used for analysis purposes rather than for transaction processing.
Check the link below for more information on data warehousing:
http://www.idatastage.com/datawarehouse/

The right way in designing a database

I started my first MySQL project designing the ERD, logical and physical diagrams.
A friend of mine is working on the same project. I started planning my database by making an ERD and then normalizing.
However, he uses relational database diagrams where he designs the interfaces and other parts first, before making the ERD. For example, he just "stacks" multiple values in the phonenumbers column instead of making a helper table. He says it is best to make the interfaces first and then make the ERD.
Which one of us is planning it better, in your opinion?
One could write, and many people have written, a book on this.
However, to generalise, what I would do is:
Analyse your data and reduce it down to 3rd normal form. This should be pretty formulaic to accomplish.
In light of the likely business use of the data, decide if and where data should be denormalized. Typically most databases will be overwhelmingly in 3rd normal form, with a few critical exceptions. This part is where experience and craft come in.
In light of the above create any additional indexes that may be necessary, or modify existing primary indexes (which should have been assigned in phase 1).
Create views for user access as necessary. The number you require may vary from none (as in simple embedded application) to many (as in no direct data access to tables allowed).
Create any procedures you need, and possibly triggers (generally best avoided but appropriate for audit purposes).
In practice of course the process is considerably more iterative, but the general design path from data to interface holds true. Also it's a good idea when designing a database to keep in mind that you will want to change it later and try if possible to make that a reasonably straightforward task.
I'm not sure what you mean by "interfaces and operations", but the way you're designing the schemas is right -- doing an ERD and properly normalizing. A lot of people, when they are just starting out, will take shortcuts on the design in order to fit the schema to their current level of querying skills.
For example, instead of creating a table of phone numbers and mapping these phone numbers to a "customers" table, they might just stick in columns called Phone1 Phone2 Phone3... instead. That can be the kiss of death later on when designing your queries.
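To make the difference concrete, here's a minimal sketch with hypothetical names: the Phone1/Phone2/Phone3 shortcut versus a separate phone number table mapped to the customers table.

    -- The shortcut ("find the customer with phone X" needs three ORs, and a fourth number won't fit):
    -- CREATE TABLE customer (
    --     customer_id INTEGER PRIMARY KEY,
    --     name        VARCHAR(100),
    --     phone1 VARCHAR(20), phone2 VARCHAR(20), phone3 VARCHAR(20)
    -- );

    -- The normalized version: one row per phone number, mapped to the customer
    CREATE TABLE customer (
        customer_id INTEGER      PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE customer_phone (
        customer_id INTEGER     NOT NULL REFERENCES customer (customer_id),
        phone       VARCHAR(20) NOT NULL,
        PRIMARY KEY (customer_id, phone)
    );

    -- "Which customer owns this number?" is now a single join
    SELECT c.name
    FROM   customer c
    JOIN   customer_phone cp ON cp.customer_id = c.customer_id
    WHERE  cp.phone = '555-0100';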
So my advice... Create the normalized data model with ERDs. Then read up on VIEWs and user-defined functions in order to "flatten" out your schema where necessary for people who wish to query it. Sorry for the general answer, but it's kind of a general question...
