Retrieving precomputed values vs. calculating on the fly - best performance? - database

Which is better in terms of performance when working with a large MySQL database (50k+ rows and growing): retrieving "calculated" data that is already stored, or performing the calculation on the fly?
The calculations are simply divisions and multiplications of data already retrieved from the DB, but there are a lot of them (a minimum of ~300 calculations per load).
Another note: this "page" can be loaded multiple times (it is not a one-off load).
I think that calculating the values ahead of time (e.g. using a background task) and then only retrieving them from the DB is the best solution in terms of performance. Is that right?
Is there much difference between the two approaches?

I'd play to the strengths of each system.
Aggregating, joining and filtering logic obviously belongs on the data layer. It's faster, not only because most DB engines have 10+ years of optimisation for doing just that, but also because you minimise the data shipped between your DB and web server.
On the other hand, most DB platforms I've used have very poor functionality for working with individual values. Things like date formatting and string manipulation just suck in SQL; you're better off doing that work in PHP.
Basically, use each system for what it's built to do.
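For example, the kind of per-row multiplication and division described in the question can usually be pushed straight into the SELECT. A minimal sketch in Python (SQLite stands in for MySQL here so it runs self-contained; the table and column names are made up, and the same expression works in a MySQL query):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT, price REAL, quantity INTEGER, tax_rate REAL)")
    conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)",
                     [("widget", 9.99, 3, 0.2), ("gadget", 24.50, 1, 0.2)])

    # Let the database do the arithmetic; only the finished numbers cross the wire.
    rows = conn.execute(
        "SELECT name, price * quantity * (1 + tax_rate) AS total FROM items"
    ).fetchall()
    for name, total in rows:
        print(name, round(total, 2))

If the same totals are requested on every page load, caching them in a summary table (refreshed by a background task, as the question suggests) avoids recomputing them, but for simple arithmetic over ~50k rows the query above is usually fast enough on its own.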
In terms of maintainability, as long as the division between what happens where is clear, separating these two types of logic shouldn't cause much of a problem, and certainly not enough to outweigh the benefits. In my opinion, code clarity and maintainability are more about consistency than about putting all the logic in one place.
TO SUM UP:
I'd definitely use the database for your problem.

Related

Is a strongly normalized relational database not efficient?

I was reading this question https://meta.stackexchange.com/questions/26398/stackoverflow-database-design-join-issues and it got me wondering: is a very normalized DB inefficient?
How should the right compromise be found?
I'm not sure whether this question fits better here or on Programmers. There are some similar questions here, but if I should move it, just tell me.
Whether it will speed things up or slow them down depends strongly on the nature of the data, the size of the tables, the type of querying and the indexing. I have seen it go both ways, although more often than not in my experience normalization to the third normal form speeds things up. Relational databases are built to be normalized and are designed with that expectation.
One thing the denormalization advocates often forget is that speed is critical to transactions (possibly more critical due to blocking potential) and that denormalization often slows down updates. You can't measure performance just on select statements. Denormalized tables are often wider, and wider tables can cause slowdowns too.
Denormalized databases are a major problem for maintaining data integrity. A change of company name in a normalized database might mean one record needs to be updated; in a denormalized one it might mean 100,000,000 records need to be updated. That is why denormalization is generally only preferred for databases (like data warehouses) where the data is loaded through an ETL process but the database itself is frequently queried for complex reporting scenarios. Transactional databases that have a lot of user updates, deletions and inserts are often much faster if they are normalized to at least the third normal form.
Now, you can go crazy with normalization too, don't get me wrong. I shouldn't have to join to 10 tables to get a simple address, especially if I need it often. Data that is often used together often belongs together, especially if the items are unlikely to require changing a million records when one change is made. For instance, with the address it would require a large update if Chicago changed its name to New Chicago, but those types of massive address changes are pretty rare in my part of the world. On the other hand, company name changes are frequent and could cause massive disruption if they needed to be made to millions of denormalized records.
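To make the company-name example concrete, here is a minimal sketch (Python with SQLite; the tables and names are invented for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Normalized: the company name lives in exactly one place.
    conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders_norm (id INTEGER PRIMARY KEY, "
                 "company_id INTEGER REFERENCES companies(id))")

    # Denormalized: the name is repeated on every order row.
    conn.execute("CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY, company_name TEXT)")

    # Renaming the company touches one row here...
    conn.execute("UPDATE companies SET name = 'New Chicago Ltd' WHERE id = 1")

    # ...but every order the company ever placed here (potentially millions of rows).
    conn.execute("UPDATE orders_denorm SET company_name = 'New Chicago Ltd' "
                 "WHERE company_name = 'Old Chicago Ltd'")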
If you are not designing a data warehouse, then normalize your data. Never denormalize unless you are a database specialist with at least five years' experience in large systems. You can harm things tremendously if you don't know what you are doing. If things are slow, denormalization is one of the last performance improvements to try. Generally, the problem is fixed by writing better queries that are sargable and that do not use poorly performing techniques like correlated subqueries, or by getting the correct indexing applied.
Normalization optimizes storage requirements and data consistency. As a tradeoff, it can make queries more complex and slow.
How should the right compromise be found?
Unfortunately, that cannot be answered with generality.
It all depends on your application and its requirements.
If your queries run too slow, and indexing or caching or query rewriting or database parameter tuning don't cut it, denormalization may be appropriate for you.
(OTOH, if your queries run just fine, or can be made to run just fine, there is probably no need to go there).
It depends. Every time I've worked to normalize a database, it has radically sped things up. But the performance problems with the non-normalized DBs were that they needed many indices (most of which were not used by any particular query), had too many columns, forced DISTINCT clauses onto queries that wouldn't have needed them with a normalized DB, and made table searches inefficient.
If common queries need to perform many joins on large tables for the simplest of lookups, or hit many tables for writes to update what the user/application sees as an atomic update of a single entity, then as traffic grows, so will that burden, and at a higher rate than with lower or no normalization. Typically what happens is that everything runs OK until either the database and application are moved onto different production servers after sharing the same dev server, or the data gets big enough to start hitting the disks all the time.
DBMS products couple logical layout and physical storage, so while it may be as likely to increase speed as decrease it, normalization of base tables will in some way affect performance of the system.
Usually, the right compromise is views, with an SQL DBMS. If you are using any variation of design by contract, views are likely the correct design decision even without any concerns for normalization or performance, so that the application gets a model fitting its needs. Scalability concerns, like for major websites, create problems that don't have quick and easy solutions, at this point in time.
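As a sketch of the views-as-the-compromise idea (Python with SQLite; the tables and columns are invented), the base tables stay normalized while the application reads a denormalized-looking shape:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER REFERENCES customers(id),
                             total REAL);

        -- The application queries this view as if it were one wide table.
        CREATE VIEW order_summary AS
        SELECT o.id AS order_id, c.name AS customer_name, o.total
        FROM orders o JOIN customers c ON c.id = o.customer_id;
    """)

    rows = conn.execute("SELECT customer_name, total FROM order_summary").fetchall()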
In addition to Thilo's post:
Normalizing on SAP HANA is wrong because the database normalizes the data itself. If you do it anyway, you will slow down the database.

Best Database for remote sensor data logging

I need to choose a Database for storing data remotely from a big number (thousands to tens of thousands) of sensors that would generate around one entry per minute each.
This data needs to be queried in a variety of ways, from counting entries with certain characteristics for statistics to simply outputting it for plotting.
I am looking around for the right tool. I started with MySQL, but I feel like it lacks the scalability needed for this project, and this led me to NoSQL databases, which I don't know much about.
Which Database, either relational or not would be a good choice?
Thanks.
There is usually no "best" database since they all involve trade-offs of one kind or another. Your question is also very vague because you don't say anything about your performance needs other than the number of inserts per minute (how much data per insert?) and that you need "scalability".
It also looks like a case of premature optimization because you say you "feel like [MySQL] lacks the scalability needed for this project", but it doesn't sound like you've run any tests to confirm whether this is a real problem. It's always better to get real data rather than base an important architectural decision on "feelings".
Here's a suggestion:
Write a simple test program that inserts 10,000 rows of sample data per minute
Run the program for a decent length of time (a few days or more) to generate a sizable chunk of test data
Run your queries to see if they meet your performance needs (which you haven't specified -- how fast do they need to be? how often will they run? how complex are they?)
You're testing at least two things here: whether your database can handle 10,000 inserts per minute and whether your queries will run quickly enough once you have a huge amount of data. With large datasets these will become competing priorities since you need indexes for fast queries, but indexes will start to slow down your inserts over time. At some point you'll need to think about data archival as well (or purging, if historical data isn't needed) both for performance and for practical reasons (finite storage space).
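A rough sketch of such a test harness in Python (SQLite is used here so the example is self-contained; swap in your MySQL driver, your real schema and a realistic payload per row for an honest test):

    import random
    import sqlite3
    import time

    conn = sqlite3.connect("sensor_test.db")
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor_id INTEGER, ts REAL, value REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sensor_ts ON readings (sensor_id, ts)")

    SENSORS = 10_000  # one reading per sensor per simulated minute

    while True:
        start = time.time()
        batch = [(sensor, start, random.random()) for sensor in range(SENSORS)]
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", batch)
        conn.commit()
        print(f"inserted {SENSORS} rows in {time.time() - start:.2f}s")
        time.sleep(max(0.0, 60 - (time.time() - start)))  # pace to one batch per minute

Let it run for a few days, then time your real queries against the accumulated data and watch how the insert times change as the table and its indexes grow.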
These will be concerns no matter what database you select. From what little you've told us about your retrieval needs ("counting data with certain characteristics" and "simple outputting for plotting") it sounds like any type of database will do. It may be that other concerns are more important, such as ease of development (what languages and tools are you using?), deployment, management, code maintainability, etc.
Since this is sensor data we're talking about, you may also want to look at a round robin database (RRD) such as RRDTool to see if that approach better serves your needs.
Found this question while googling for "database for sensor data"
One of the most helpful search results (along with this SO question) was this blog:
I actually started a similar project (http://reatha.de) but realized too late that I was not using the best technologies available. My approach was similar: MySQL + PHP. Eventually I realized that this was not scalable and stopped the project.
Additionally, a good starting point is to look at the list of databases offered on Heroku:
If they offer one, then it should not be the worst one.
I hope this helps.
You could try the Redis NoSQL database.

What's the attraction of schemaless database systems?

I've been hearing a lot of talk about schema-less (often distributed) database systems like MongoDB, CouchDB, SimpleDB, etc...
While I can understand they might be valuable for some purposes, in most of my applications I'm trying to persist objects that have a specific number of fields of a specific type, and I just automatically think in the relational model. I'm always thinking in terms of rows with unique integer ids, null/not null fields, SQL datatypes, and select queries to find sets.
While I'm attracted to the distributed nature and easy JSON/RESTful interfaces of these new systems, I don't understand how loosely typed key/value hashes will help me with my development. Why would a loosely typed, schema-less system be good for keeping clean data sets? How can I, for example, find all items with dates between x and y when they might not have dates? Is there any concept of a join?
I understand many systems have their own differences and strengths, but I'm wondering at the difference in paradigm. I suppose this is an open-ended question, but perhaps the community's answers and ways they have personally seen the advantages of these systems will help enlighten me and others about when I would want to make use of these (admittedly more hip) systems instead of the traditional RDBMS.
I'll just call out one or two common reasons (I'm sure people will be writing essay answers)
With highly distributed systems, any given data set may be spread across multiple servers. When that happens, the relational constraints which the DB engine can guarantee are greatly reduced. Some of your referential integrity will need to be handled in application code. When doing so, you will quickly discover several pain points:
your logic is spread across multiple layers (app and db)
your logic is spread across multiple languages (SQL and your app language of choice)
The outcome is that the logic is less encapsulated, less portable, and MUCH more expensive to change. Many devs find themselves writing more logic in app code and less in the database. Taken to the extreme, the database schema becomes irrelevant.
Schema management, especially on systems where downtime is not an option, is difficult. Reducing the schema complexity reduces that difficulty.
ACID doesn't work very well for distributed systems (BASE, CAP, etc). The SQL language (and the entire relational model, to a certain extent) is optimized for a transactional ACID world. So some of the SQL language features and best practices are useless while others are actually harmful. Some developers feel uncomfortable working "against the grain" and prefer to drop SQL entirely in favor of a language designed from the ground up for their requirements.
Cost: most RDBMS systems aren't free. The leaders in scaling (Oracle, Sybase, SQL Server) are all commercial products. When dealing with large ("web scale") systems, database licensing costs can meet or exceed the hardware costs! The costs are high enough to change the normal build/buy considerations drastically towards building a custom solution on top of an OSS offering (all the significant NOSQL offerings are OSS)
The primary concern should be what you need to do with your data. If you have a huge data set and are finding a traditional RDBMS to be a bottleneck, then you may want to experiment with a schemaless or NoSQL solution.
Most environments that I am aware of that use NoSQL solutions also use an RDBMS in some form or fashion. RDBMS-based solutions are the norm where data integrity is extremely important and you need ACID transactions. However, if your system is not highly transaction-based but you need to scale up or scale out quickly, a NoSQL solution may be desirable.
Schemaless is great for two reasons:
The brain-optimising intuitiveness of document storage
It resolves sparse-matrix and entity-attribute-value (EAV) storage problems.
I've used both SQL and No-SQL for production applications in Ruby on Rails. I'm not a database expert and I have to confess to googling ACID and similar terms as they're not familiar to me.
"Ah ha! Another know-nothing trend follower jumping on the latest bandwagon" you may say. But, actually, I'm really pleased with my decision to use MongoDB on our most recent 2 year old app and here's why...
The flip-side of brain-optimising intuitiveness was my experience with the Magento e-commerce system. I don't want to bash it because it served me well at the time but it really hit the processor hard trying to calculate the attributes for each product. The underlying reason was the Entity-Attribute-Value store of product data. Cache or be damned was the solution.
The major advantage to me is the optimisation in the only place that really matters - your own brain. So many technologies are critiqued on their efficiency in memory, processors, hardware and yet having a DB that's extremely intuitive to understand brings its own merits. We've found it quick to add features to our code because the database simply looks a lot like the real world we're modelling. When I've asked e-commerce clients to present me with their product list they will naturally tend to use Excel (think table store). The first columns are easy:
Product Name
Price
Product Type
Then it gets harder, covered in notes, colour coding and links to other tables (yep... relationships):
Colour (Only some products)
Size (X Large, Large, Small) - only for products 8, 9 and 10; golf clubs use a different scale
Colour 2. The cat collars have two colour choices.
Wattage
Fixing type (Male, Female)
So it ends in a terrible mess of Excel tables that make no sense to me and not much sense to the people who work with the products day in and day out. We throw our arms in the air and decide to go through the catalogue, and then it hits me! Wouldn't it be great if you could store the data as it appears in the catalogue? Just collections of records on each product that simply list the attributes of that product. You can then pick out common attributes to index for retrieval at a later date. Of course, that's a document store.
In summary, document stores are great when you have a sparse matrix problem or objects that mutate their attributes over time. Having lived in a No-SQL world for 2 years, I can't think of a real world application that doesn't have those features because the world itself looks like a document store.
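A hedged sketch of that catalogue-as-documents idea using pymongo (the collection and field names are made up); each document carries only the attributes its product actually has:

    from pymongo import MongoClient

    products = MongoClient()["shop"]["products"]

    products.insert_many([
        {"name": "Cat collar", "price": 4.99, "colours": ["red", "blue"]},
        {"name": "Golf club", "price": 89.00, "size": "Large", "flex": "stiff"},
        {"name": "Bulb", "price": 2.50, "wattage": 60, "fixing": "bayonet"},
    ])

    # Index the attributes you actually query on; documents lacking them are simply skipped.
    products.create_index("price")
    cheap_items = list(products.find({"price": {"$lt": 10}}))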
I've only played with MongoDB but one thing that really interested me was how you could nest documents. In MongoDB a document is basically like a record. This is really nice because traditionally, in a RDBMS, if you needed to pull a "Person" record and get the associated address, employer info, etc. you'd frequently have to go to multiple tables, join them up, make multiple database calls. In a NoSQL solution like MongoDB, you can just nest the associated records (documents) and not have to mess with foreign keys, joining, multiple database calls. Everything associated with that one record is pulled.
This is especially handy when dealing with objects. You can in many cases just store an object as a series of nested documents.
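For instance (a minimal pymongo sketch; the database, collection and field names are invented), the person, their address and their employer can live in one document and come back in a single read:

    from pymongo import MongoClient

    people = MongoClient()["hr"]["people"]

    people.insert_one({
        "name": "Ada",
        "address": {"street": "1 Main St", "city": "Springfield"},
        "employer": {"name": "Acme", "department": "R&D"},
    })

    # One round trip and no joins; dot notation reaches into the nested documents.
    ada = people.find_one({"employer.name": "Acme"})
    print(ada["address"]["city"])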
NoSQL databases are not schemaless; the schema is embedded in the data. They are properly called semistructured. In some KV data stores, however, the schema may even be embedded in code. The advantage of the semi-structured approach is twofold: flexibility in which columns are part of a row (one row could have 5 columns and another 5 different columns), and flexibility in the characteristics of the columns (e.g., variable lengths).
Normally the attraction is that of snake oil: most people favouring them have no clue about the relational model and speak SQL at a level that makes professionals puke. They have no idea what the ACID properties are, why they are important, etc.
I'm not saying they do not have valid uses... just that mostly the attraction is people not knowing what they should know and drawing poor conclusions. Again, not everyone is like that, but most developers favouring them are not good in their understanding of what a database system actually is responsible for.

Should Databases be used just for persistence

A lot of web applications with a 3-tier architecture do all the processing in the app server and use the database just for persistence, in order to have database independence. After paying a huge amount for a database, doing all the processing (including batch work) in the app server and not using the power of the database seems like a waste. I have difficulty convincing people that we need to use the best of both worlds.
What "power" of the database are you not using in a 3-tier architecture? Presumably we exploit SQL to the full, and all the data management, paging, caching, indexing, query optimisation and locking capabilities.
I'd guess that the argument is where what we might call "business logic" should be implemented. In the app server or in database stored procedure.
I see two reasons for putting it in the app server:
1). Scalability. It's comparatively hard to add more database engines if the DB gets too busy. Partitioning data across multiple databases is really tricky. So instead pull the business logic out to the app server tier. Now we can have many app server instances all doing business logic.
2). Maintainability. In principle, stored procedure code can be well-written, modularised and reusable. In practice it seems much easier to write maintainable code in an OO language such as C# or Java. For some reason, re-use in stored procedures seems to happen by cut and paste, and so over time the business logic becomes hard to maintain. I would concede that with discipline this need not happen, but discipline seems to be in short supply right now.
We do need to be careful to truly exploit the database query capabilities to the full, for example avoiding pulling large amounts of data across to the app server tier.
It depends on your application. You should set things up so your database does things databases are good for. An eight-table join across tens of millions of records is not something you're going to want to handle in your application tier. Nor is performing aggregate operations on millions of rows to emit little pieces of summary information.
On the other hand, if you're just doing a lot of CRUD, you're not losing much by treating that large expensive database as a dumb repository. But simple data models that lend themselves to application-focused "processing" sometimes end up leading you down the road to creeping unforeseen inefficiencies. Design knots. You find yourself processing recordsets in the application tier. Looking things up in ways that begin to approximate SQL joins. Eventually you painfully refactor these things back to the database tier where they run orders of magnitude more efficiently...
So, it depends.
No. They should be used for business rules enforcement as well.
Alas the DBMS big dogs are either not competent enough or not willing to support this, making this ideal impossible, and keeping their customers hostage to their major cash cows.
I've seen one application designed (by a pretty smart guy) with tables of the form:
id | one or two other indexed columns | big_chunk_of_serialised_data
Access to that in the application is easy: there are methods that will load one (or a set) of objects, deserialising it as necessary. And there are methods that will serialise an object into the database.
But as expected (though only in hindsight, sadly), there are so many cases where we want to query the DB in some way outside that application! This is worked around in various ways: an ad-hoc query interface in the app (which adds several layers of indirection to getting the data); reuse of some parts of the app code; hand-written deserialisation code (sometimes in other languages); and simply having to do without any fields that are inside the deserialised chunk.
I can readily imagine the same thing occurring for almost any app: it's just handy to be able to access your data. Consequently I think I'd be pretty averse to storing serialised data in a real DB -- with possible exceptions where the saving outweighs the increase in complexity (an example being storing an array of 32-bit ints).
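A small sketch of the pattern being described and why it bites later (Python with SQLite and pickle; everything here is hypothetical):

    import pickle
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, data BLOB)")

    person = {"name": "Ada", "email": "ada@example.com", "logins": 42}
    conn.execute("INSERT INTO people (name, data) VALUES (?, ?)",
                 (person["name"], pickle.dumps(person)))

    # Easy from inside the application: load the blob and deserialise it.
    blob = conn.execute("SELECT data FROM people WHERE name = 'Ada'").fetchone()[0]
    print(pickle.loads(blob)["logins"])

    # But a question like "how many people have more than 10 logins?" cannot be
    # answered in SQL: every field inside the blob is invisible to WHERE, GROUP BY
    # and indexes, so ad-hoc queries need application code or hand-written
    # deserialisation instead.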

In terms of databases, is "Normalize for correctness, denormalize for performance" a right mantra?

Normalization leads to many essential and desirable characteristics, including aesthetic pleasure. Besides it is also theoretically "correct". In this context, denormalization is applied as a compromise, a correction to achieve performance.
Is there any reason other than performance that a database could be denormalized?
The two most common reasons to denormalize are:
Performance
Ignorance
The former should be verified with profiling, while the latter should be corrected with a rolled-up newspaper ;-)
I would say a better mantra would be "normalize for correctness, denormalize for speed - and only when necessary"
To fully understand the import of the original question, you have to understand something about team dynamics in systems development, and the kind of behavior (or misbehavior) different roles / kinds of people are predisposed to. Normalization is important because it isn't just a dispassionate debate of design patterns -- it also has a lot to do with how systems are designed and managed over time.
Database people are trained that data integrity is a paramount issue. We like to think in terms of 100% correctness of data, so that once data is in the DB, you don't have to think about or deal with it ever being logically wrong. This school of thought places a high value on normalization, because it causes (forces) a team to come to grips with the underlying logic of the data & system. To consider a trivial example -- does a customer have just one name & address, or could he have several? Someone needs to decide, and the system will come to depend on that rule being applied consistently. That sounds like a simple issue, but multiply that issue by 500x as you design a reasonably complex system and you will see the problem -- rules can't just exist on paper, they have to be enforced. A well-normalized database design (with the additional help of uniqueness constraints, foreign keys, check values, logic-enforcing triggers etc.) can help you have a well-defined core data model and data-correctness rules, which is really important if you want the system to work as expected when many people work on different parts of the system (different apps, reports, whatever) and different people work on the system over time. Or to put it another way -- if you don't have some way to define and operationally enforce a solid core data model, your system will suck.
Other people (often, less experienced developers) don't see it this way. They see the database as at best a tool that's enslaved to the application they're developing, or at worst a bureaucracy to be avoided. (Note that I'm saying "less experienced" developers. A good developer will have the same awareness of the need for a solid data model and data correctness as a database person. They might differ on what's the best way to achieve that, but in my experience are reasonably open to doing those things in a DB layer as long as the DB team knows what they're doing and can be responsive to the developers). These less experienced folks are often the ones who push for denormalization, as more or less an excuse for doing a quick & dirty job of designing and managing the data model. This is how you end up getting database tables that are 1:1 with application screens and reports, each reflecting a different developer's design assumptions, and a complete lack of sanity / coherence between the tables. I've experienced this several times in my career. It is a disheartening and deeply unproductive way to develop a system.
So one reason people have a strong feeling about normalization is that the issue is a stand-in for other issues they feel strongly about. If you are sucked into a debate about normalization, think about the underlying (non-technical) motivation that the parties may be bringing to the debate.
Having said that, here's a more direct answer to the original question :)
It is useful to think of your database as consisting of a core design that is as close as possible to a logical design -- highly normalized and constrained -- and an extended design that addresses other issues like stable application interfaces and performance.
You should want to constrain and normalize your core data model, because to not do that compromises the fundamental integrity of the data and all the rules / assumptions your system is being built upon. If you let those issues get away from you, your system can get crappy pretty fast. Test your core data model against requirements and real-world data, and iterate like mad until it works. This step will feel a lot more like clarifying requirements than building a solution, and it should. Use the core data model as a forcing function to get clear answers on these design issues for everyone involved.
Complete your core data model before moving on to the extended data model. Use it and see how far you can get with it. Depending on the amount of data, number of users and patterns of use, you may never need an extended data model. See how far you can get with indexing plus the 1,001 performance-related knobs you can turn in your DBMS.
If you truly tap out the performance-management capabilities of your DBMS, then you need to look at extending your data model in a way that adds denormalization. Note this is not about denormalizing your core data model, but rather adding new resources that handle the denorm data. For example, if there are a few huge queries that crush your performance, you might want to add a few tables that precompute the data those queries would produce -- essentially pre-executing the query. It is important to do this in a way that maintains the coherence of the denormalized data with the core (normalized) data. For example, in DBMS's that support them, you can use a MATERIALIZED VIEW to make the maintenance of the denorm data automatic. If your DBMS doesn't have this option, then maybe you can do it by creating triggers on the tables where the underlying data exists.
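As a sketch of that idea (Python with SQLite, which has triggers but no materialized views; the tables are invented), a trigger keeps a precomputed summary table coherent with the core data:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Core (normalized) data.
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);

        -- Extended model: a precomputed answer to an expensive aggregate query.
        CREATE TABLE customer_totals (customer_id INTEGER PRIMARY KEY, lifetime_total REAL);

        -- The trigger keeps the denormalized copy in step with the core data.
        CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
        BEGIN
            INSERT OR IGNORE INTO customer_totals (customer_id, lifetime_total)
            VALUES (NEW.customer_id, 0);
            UPDATE customer_totals
            SET lifetime_total = lifetime_total + NEW.total
            WHERE customer_id = NEW.customer_id;
        END;
    """)

    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 25.0)")
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 10.0)")
    print(conn.execute("SELECT * FROM customer_totals").fetchall())  # [(1, 35.0)]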
There is a world of difference between selectively denormalizing a database in a coherent manner to deal with a realistic performance challenge vs. just having a weak data design and using performance as a justification for it.
When I work with low-to-medium experienced database people and developers, I insist they produce an absolutely normalized design ... then later may involve a small number of more experienced people in a discussion of selective denormalization. Denormalization is more or less always bad in your core data model. Outside the core, there is nothing at all wrong with denormalization if you do it in a considered and coherent way.
In other words, denormalizing from a normal design to one that preserves the normal while adding some denormal -- that deals with the physical reality of your data while preserving its essential logic -- is fine. Designs that don't have a core of normal design -- that shouldn't even be called de-normalized, because they were never normalized in the first place, because they were never consciously designed in a disciplined way -- are not fine.
Don't accept the terminology that a weak, undisciplined design is a "denormalized" design. I believe the confusion between intentionally / carefully denormalized data vs. plain old crappy database design that results in denormal data because the designer was a careless idiot is the root cause of many of the debates about denormalization.
Denormalization normally means some improvement in retrieval efficiency (otherwise, why do it at all), but at a huge cost in complexity of validating the data during modify (insert, update, sometimes even delete) operations. Most often, the extra complexity is ignored (because it is too damned hard to describe), leading to bogus data in the database, which is often not detected until later - such as when someone is trying to work out why the company went bankrupt and it turns out that the data was self-inconsistent because it was denormalized.
I think the mantra should go "normalize for correctness, denormalize only when senior management offers to give your job to someone else", at which point you should accept the opportunity to go to pastures new since the current job may not survive as long as you'd like.
Or "denormalize only when management sends you an email that exonerates you for the mess that will be created".
Of course, this assumes that you are confident of your abilities and value to the company.
Mantras almost always oversimplify their subject matter. This is a case in point.
The advantages of normalizing are more than merely theoretical or aesthetic. For every departure from a normal form, for 2NF and beyond, there is an update anomaly that occurs when you don't follow the normal form and that goes away when you do. Departure from 1NF is a whole different can of worms, and I'm not going to deal with it here.
These update anomalies generally fall into inserting new data, updating existing data, and deleting rows. You can generally work your way around these anomalies with clever, tricky programming. The question then is whether the benefit of that clever, tricky programming was worth the cost. Sometimes the cost is bugs. Sometimes the cost is loss of adaptability. Sometimes the cost is actually, believe it or not, bad performance.
If you learn the various normal forms, you should consider your learning incomplete until you understand the accompanying update anomaly.
The problem with "denormalize" as a guideline is that it doesn't tell you what to do. There are myriad ways to denormalize a database. Most of them are unfortunate, and that's putting it charitably. One of the dumbest ways is to simply denormalize one step at a time, every time you want to speed up some particular query. You end up with a crazy mish mosh that cannot be understood without knowing the history of the application.
A lot of denormalizing steps that "seemed like a good idea at the time" turn out later to be very bad moves.
Here's a better alternative, when you decide not to fully normalize: adopt some design discipline that yields certain benefits, even when that design discipline departs from full normalization. As an example, there is star schema design, widely used in data warehousing and data marts. This is a far more coherent and disciplined approach than merely denormalizing by whimsy. There are specific benefits you'll get out of a star schema design, and you can contrast them with the update anomalies you will suffer because star schema design contradicts normalized design.
In general, many people who design star schemas are building a secondary database, one that does not interact with the OLTP application programs. One of the hardest problems in keeping such a database current is the so called ETL (Extract, Transform, and Load) processing. The good news is that all this processing can be collected in a handful of programs, and the application programmers who deal with the normalized OLTP database don't have to learn this stuff. There are tools out there to help with ETL, and copying data from a normalized OLTP database to a star schema data mart or warehouse is a well understood case.
Once you have built a star schema, and if you have chosen your dimensions well, named your columns wisely, and especially chosen your granularity well, using this star schema with OLAP tools like Cognos or Business Objects turns out to be almost as easy as playing a video game. This permits your data analysts to focus on analysing the data instead of learning how the container of the data works.
There are other designs besides star schema that depart from normalization, but star schema is worth a special mention.
Data warehouses in a dimensional model are often modelled in a (denormalized) star schema. These kinds of schemas are not (normally) used for online production or transactional systems.
The underlying reason is performance, but the fact/dimensional model also allows for a number of temporal features like slowly changing dimensions which are doable in traditional ER-style models, but can be incredibly complex and slow (effective dates, archive tables, active records, etc).
Don't forget that each time you denormalize part of your database, your capacity to further adapt it decreases, as the risk of bugs in code increases, making the whole system less and less sustainable.
Good luck!
Normalization has nothing to do with performance. I can't really put it better than Erwin Smout did in this thread:
What is the resource impact from normalizing a database?
Most SQL DBMSs have limited support for changing the physical representation of data without also compromising the logical model, so unfortunately that's one reason why you may find it necessary to denormalize. Another is that many DBMSs don't have good support for multi-table integrity constraints, so as a workaround to implement those constraints you may be forced to put extraneous attributes into some tables.
Database normalization isn't just for theoretical correctness, it can help to prevent data corruption. I certainly would NOT denormalize for "simplicity" as #aSkywalker suggests. Fixing and cleaning corrupted data is anything but simple.
You don't normalize for 'correctness' per se. Here is the thing:
Denormalized tables have the benefit of better read performance, but require redundancy and more developer brain power.
Normalized tables have the benefit of reducing redundancy and increasing ease of development, but at a cost in query performance.
It's almost like a classic balanced equation. So depending on your needs (such as how many users are hammering your database server), you should stick with normalized tables unless denormalization is really needed. It is, however, easier and less costly for development to go from normalized to denormalized than vice versa.
No way. Keep in mind that what you're supposed to be normalizing is your relations (logical level), not your tables (physical level).
Denormalized data is much more often found at places where not enough normalization was done.
My mantra is 'normalize for correctness, eliminate for performance'. RDBMs are very flexible tools, but optimized for the OLTP situation. Replacing the RDBMS by something simpler (e.g. objects in memory with a transaction log) can help a lot.
I take issue with the assertion by folks here that normalized databases are always associated with simpler, cleaner, more robust code. It is certainly true that there are many cases where fully normalized schemas are associated with simpler code than partially denormalized ones, but at best this is a guideline, not a law of physics.
Someone once defined a word as the skin of a living idea. In CS, you could say an object or table is defined in terms of the needs of the problem and the existing infrastructure, instead of being a platonic reflection of an ideal object. In theory, there will be no difference between theory and practice, but in practice, you do find variations from theory. This saying is particularly interesting for CS because one of the focuses of the field is to find these differences and to handle them in the best way possible.
Taking a break from the DB side of things and looking at the coding side of things, object-oriented programming has saved us from a lot of the evils of spaghetti coding by grouping a lot of closely related code together under an object-class name that has an English meaning that is easy to remember and that somehow fits with all of the code it is associated with. If too much information is clustered together, then you end up with large amounts of complexity within each object, and it is reminiscent of spaghetti code. If you make the clusters too small, then you can't follow threads of logic without searching through large numbers of objects with very little information in each object, which has been referred to as "macaroni code".
If you look at the trade-off between the ideal object size on the programming side of things and the object size that results from normalizing your database, I will give a nod to those who would say it is often better to choose based on the database and then work around that choice in code, especially because you have the ability in some cases to create objects from joins with Hibernate and technologies like that. However, I would stop far short of saying this is an absolute rule. Any OR-mapping layer is written with the idea of making the most complex cases simpler, possibly at the expense of adding complexity to the most simple cases. And remember that complexity is not measured in units of size, but rather in units of complexity. There are all sorts of different systems out there. Some are expected to grow to a few thousand lines of code and stay there forever. Others are meant to be the central portal to a company's data and could theoretically grow in any direction without constraint. Some applications manage data that is read millions of times for every update. Others manage data that is only read for audit and ad-hoc purposes. In general the rules are:
Normalization is almost always a good idea in medium-sized apps or larger when data on both sides of the split can be modified and the potential modifications are independent of each other.
Updating or selecting from a single table is generally simpler than working with multiple tables; however, with a well-written OR mapper this difference can be minimized for a large part of the data model space. Working with straight SQL, this is almost trivial to work around for an individual use case, albeit in a non-object-oriented way.
Code needs to be kept relatively small to be manage-able and one effective way to do this is to divide the data model and build a service-oriented architecture around the various pieces of the data model. The goal of an optimal state of data (de)normalization should be thought of within the paradigm of your overall complexity management strategy.
In complex object hierarchies there are complexities that you don't see on the database side, like the cascading of updates. If you model relational foreign keys and crosslinks with an object ownership relationship, then when updating the object, you have to decide whether to cascade the update or not. This can be more complex than it would be in sql because of the difference between doing something once and doing something correctly always, sort of like the difference between loading a data file and writing a parser for that type of file. The code that cascades an update or delete in C++, java, or whatever will need to make the decision correctly for a variety of different scenarios, and the consequences of mistakes in this logic can be pretty serious. It remains to be proven that this can never be simplified with a bit of flexibility on the SQL side enough to make any sql complexities worthwhile.
There is also a point deserving delineation with one of the normalization precepts. A central argument for normalization in databases is the idea that data duplication is always bad. This is frequently true, but cannot be followed slavishly, especially when there are different owners for the different pieces of a solution. I saw a situation once in which one group of developers managed a certain type of transaction, and another group supported auditability of these transactions, so the second group wrote a service which scraped several tables whenever a transaction occurred and created a denormalized snapshot record stating, in effect, what the state of the system was at the time of the transaction. This scenario stands as an interesting use case (for the data duplication part of the question at least), but it is actually part of a larger category of issues.
Data consistency requirements will often put certain constraints on the structure of data in the database that can make error handling and troubleshooting simpler by making some of the incorrect cases impossible. However, this can also have the effect of "freezing" portions of data, because changing that subset of the data would cause past transactions to become invalid under the consistency rules. Obviously some sort of versioning system is required to sort this out, so the obvious question is whether to use a normalized versioning system (effective and expiration times) or a snapshot-based approach (value as of transaction time). There are several internal structure questions for the normalized version that you don't have to worry about with the snapshot approach, like:
Can date range queries be done efficiently even for large tables?
Is it possible to guarantee non-overlap of date ranges?
Is it possible to trace status events back to operator, transaction, or reason for change? (probably yes, but this is additional overhead)
By creating a more complicated versioning system, are you putting the right owners in charge of the right data?
I think the optimal goal here is to learn not only what is correct in theory, but why it is correct, and what are the consequences of violations, then when you are in the real world, you can decide which consequences are worth taking to gain which other benefits. That is the real challenge of design.
Reporting systems and transaction systems have different requirements.
For a transaction system, I would recommend always using normalization for data correctness.
For a reporting system, use normalization unless denormalization is required for some reason, such as ease of ad-hoc queries, performance, etc.
Simplicity? Not sure if Steven is gonna swat me with his newspaper, but where I hang, sometimes the denormalized tables help the reporting/readonly guys get their jobs done without bugging the database/developers all the time...
