Accounting Payables/Receivables database schema

I searched this topic and found plenty of explanations, most of which recommend separate tables for A/P and A/R.
I'm thinking of another approach: combining the two tables into one, because A/P and A/R are 98% similar and differ by only one or two attributes (e.g. A/P: vendor/supplier).
Question:
Is it possible to merge the 2 tables and just categorize each transaction with a Transaction_Type column whose values are Acnt_Payable or acnt_Receivable?
I know this is possible to do, but is it good practice? And what issues will my system face when it comes to reports?
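For illustration, a minimal sketch of the merged design being asked about (all table and column names here are placeholders I've made up, not a prescription):

CREATE TABLE Account_Transaction (
    Transaction_ID   INT           NOT NULL PRIMARY KEY,
    Transaction_Type VARCHAR(20)   NOT NULL,  -- 'Acnt_Payable' or 'acnt_Receivable'
    Party_ID         INT           NOT NULL,  -- vendor/supplier for A/P, customer for A/R
    Transaction_Date DATE          NOT NULL,
    Amount           DECIMAL(12,2) NOT NULL,
    CHECK (Transaction_Type IN ('Acnt_Payable', 'acnt_Receivable'))
    -- note: MySQL before 8.0.16 parses but does not enforce CHECK constraints
);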

Your question is: "another table or another column". Given that I've done lots of MySql and accounting database manipulation, I wouldn't hesitate to have type as an extra column.
The deal with reports is that you'll need to rework the code that pulls the reports. Having the extra column means fewer joins (if you have them now). You will need to step up your error checking, as changing already-designed software always seems to catch me off guard: some nuance of the design I hadn't considered before suddenly becomes a paradigm shift for the logic.
I'd say rewrite the report code from a blank canvas; between your increased experience from "last time" and a fresh look at it this time, you might do a much better job of it.
IMHO, for performance aspects, most businesses and most average coders are OK with an off-the-shelf install of MySql or PostgreSQL. When I observe a query taking more than a second, it's usually because I need to add an index. Given that, you could have yet another table mapping IDs to AR vs AP (or any other human-readable terms), as I think numeric IDs will perform faster.
Once a company gets to where performance really is an issue, they'll have the resources to hire a real MySql guy to come in and mess with it. Sarcasm: Unless, of course, you're using quickbooks, in which case, once you're past a 20 transaction-per-day mom and pop joint, you're overtaxing the software.
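As a rough sketch of what the reworked report code might look like against the merged table sketched above (MySQL-flavoured date formatting; the names are the same placeholders as before):

SELECT Transaction_Type,
       DATE_FORMAT(Transaction_Date, '%Y-%m') AS period,
       SUM(Amount)                            AS total
FROM   Account_Transaction
GROUP  BY Transaction_Type, DATE_FORMAT(Transaction_Date, '%Y-%m');
-- If you take the numeric-ID suggestion above, Transaction_Type becomes a small
-- integer joined to a two-row lookup table instead of a string.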

Related

Relational Database schema design for metric storage

Considering a system that has the following characteristics:
Stores time series data/metrics collected from multiple sensors/inputs.
Data points (metrics) are collected from many different systems at different times.
Each of these metrics is generally one data point (e.g. temp and humidity are not reported at the same time, but rather individually, and each will have a different timestamp)
The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and cpu, tomorrow a sensor may be added that monitors co2 and RAM).
A summary of all metrics for a given time bucket needs to be obtained via a query and is likely to be the most common querying scenario.
I can think of three ways of modeling this.
1. Wide table - with table per category (covered)
Notes: has lots of sparse values due to the data points being collected individually. Storage of new metrics requires a new column
2. Narrow table - with table per metric (covered)
Notes: Storage of new metrics requires a new table
3. Typed table - with single metric table (not covered)
Notes: Storage of new metrics just requires a new row in the metricType table, no schema changes. Concerned about performance implications due to chunk size, although grouping by a time bucket across all metrics would not require joins and could therefore be faster?
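For concreteness, minimal sketches of the three options (all table and column names below are placeholders of mine):

-- 1. Wide table: one column per metric, mostly NULL in each row
CREATE TABLE reading_wide (
    sensor_id INT       NOT NULL,
    ts        TIMESTAMP NOT NULL,
    temp      FLOAT NULL,
    humidity  FLOAT NULL,
    cpu       FLOAT NULL,          -- each new metric requires a new column
    PRIMARY KEY (sensor_id, ts)
);

-- 2. Narrow table: one table per metric
CREATE TABLE reading_temp (
    sensor_id INT       NOT NULL,
    ts        TIMESTAMP NOT NULL,
    value     FLOAT     NOT NULL,
    PRIMARY KEY (sensor_id, ts)
);                                 -- each new metric requires another table like this

-- 3. Typed table: one row per data point, typed via a lookup table
CREATE TABLE metricType (
    type_id INT         NOT NULL PRIMARY KEY,
    name    VARCHAR(50) NOT NULL UNIQUE      -- 'temp', 'humidity', 'cpu', ...
);
CREATE TABLE metric (
    sensor_id INT       NOT NULL,
    type_id   INT       NOT NULL REFERENCES metricType (type_id),
    ts        TIMESTAMP NOT NULL,
    value     FLOAT     NOT NULL,
    PRIMARY KEY (sensor_id, type_id, ts)
);                                 -- a new metric is just a new row in metricType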
I was wondering if anyone could comment on the options presented, point me to some performance benchmarks that include 3 as well as 1 and 2, or generally give any advice on the suitability of each approach. I'm planning to run my own experiments on this and I will post the results when done, but any insight at this stage would be gratefully received. :)
Please note, do not suggest a NoSQL solution; I'm aware of the options in that space and am assessing that option separately.
1 Proposal
"Wide table"
That has gross Normalisation errors (as well as, if taken seriously, masses of Nulls and integrity problems). It is unusable; no further comment is required.
"Narrow table"
That is free of errors, but the Normalisation is not yet complete.
"Typed table"
That is sort of complete, the "best" of your three scenarios. But it views the issue through a narrow lens, and in total isolation from the context in which the issue exists. Thus it is in error for reasons other than those you inquire about.
2 Problem
The first problem is that you are comparing three things which are not reasonably comparable, not reasonably equal to each other.
The second problem is, EAV is the flavour of the month, and many people are attracted to it. However, it has major problems, and requires an additional set of "metadata" tables if it is to be implemented with some data integrity. The point is, EAV is not needed.
3 Solution
The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and cpu, tomorrow a sensor maybe added that monitors co2 and RAM).
This is actually a straight-forward Relational database problem, which is solved by a perfectly ordinary Relational design, which provides full Relation Power; Relational Integrity; and Speed (which other designs will not have).
3.1 Caveat
But there are a few caveats, due to the fact that what is marketed as "relational" is not Relational.
Get rid of the Record ID fields, they are anti-Relational.
Record IDs reduce your schema to a 1970's style Record Filing system (located in an SQL container for convenience).
Record IDs do not provide row uniqueness, which is demanded by the Relational Model.
Further, they require one additional field and one additional index per file.
When modelling a database (Relational or not), perceive the data, as data, and nothing but data. Do not view the data in terms of your need re the GUI, or some query or other.
It is an error to concern yourself with performance issues at this (modelling) stage. First get it right. Second, make it fast. Do not reverse the prescribed sequence.
Relational Keys provide meaning, as well as Relational Integrity (which is Logical, and distinct from Referential Integrity, which is a physical facility of SQL). What this addresses is the context in which an object exists.
A Sensor does not exist in isolation (except when it is in a package on a shelf in a shop ... but even then, it exists in the context of the shop inventory)
An active Sensor exists only in the context of the object in which it is housed. You have not provided any info regarding that. Let's call the thing Article as a generic label.
Further, it is the Article that requires a limit on the Metric that is being measured by the Sensor (for the purpose of out-of-range alarms, etc), and not the Sensor itself. (The Sensor may have a range, which is a different thing.)
Likewise, a Sensor exists in a Location, which is a second vector. Or else, the Article exists in a Location, and the Article Key carries the Location. I have modelled the latter.
3.2 Data Model
Here is the solution:
Sensor Data Model (IDEF1X diagram; also provided as a PDF).
It will satisfy both OLTP and OLAP (Dimension-Fact) requirements.
If you provide more context, we can get that modelled precisely. This may take a bit of to-and-fro.
It is limited to the info provided.
I have taken MetricType and SensorType to be synonymous.
Article is shown as Dependent on (exists within) Location, alternately they could be separate vectors. In any case, Article and Location together qualify Sensor.
Since SensorSerialNo is unique (AK2), therefore Reading(SensorSerialNo, DateTime) is unique. An index is not required. However, in the event there are many queries on Reading via SensorSerialNo alone, such an index will boost performance.
Please feel free to ask questions, and I will answer.
For those who are completely new to IDEF1X, refer to IDEF1X Introduction.
For those who are familiar with IDEF1X, and only want a brush-up, refer to IDEF1X Anatomy.
4 Performance
Your concern re performance is good, but far too premature to be applied at this stage. First get the data model right, second get the data structures fast. The reasons for that are many, not the least of which is, when the data is Normalised, Relationally, the structures are already very fast. Further, one should never optimise for a particular query (one can add indices, if necessary, in the second stage).
Nevertheless, I will respond to your stated concerns.
Eg. a ClusteredIndex on the prescribed Reading PK will:
Serve most queries, most Dimensions (except queries that use SensorSerialNo alone, in case of which I have suggested an additional index)
Serve all OLTP Transactions and ensure the highest concurrency, because the Sensors are distributed per the real world: across Locations and Articles.
Whereas an Index on a Record ID guarantees a HotSpot on every single INSERT. Great for creating Deadlocks.
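To make the index discussion concrete, here is a rough sketch in Sybase/SQL Server-style DDL; the column names are my guesses at the shape of the model described above, not the author's actual diagram:

CREATE TABLE Reading (
    Location   VARCHAR(12) NOT NULL,   -- assumed key columns migrated from
    Article    VARCHAR(12) NOT NULL,   -- Location > Article > Sensor
    SensorType VARCHAR(12) NOT NULL,
    ReadingDtm DATETIME    NOT NULL,
    Value      FLOAT       NOT NULL
)
-- Clustered index on the composite natural key: INSERTs from different
-- Locations/Articles land on different pages, so there is no single hotspot.
CREATE UNIQUE CLUSTERED INDEX UC_PK
    ON Reading (Location, Article, SensorType, ReadingDtm)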
4.1 Benchmark
I do have a hundred or so benchmarks for data structures such as this, collected over the last four decades for both OLTP & OLAP use. Most of my customers are banks (Think: Sensor Readings are very much like Stock Prices that change over the period of a day; several vectors (Dimensions); billions of rows). Banks are very strict about confidentiality, so I cannot publish the benchmarks as is, and redacting them will take time and effort.
I do have one benchmark for a very similar requirement, that is public. In fact, it was included in an Answer to a SO Question re Time Series data, but the seeker got the moderators to excise it (it is embarrassing to Oracle). Here is the Benchmark Summary for the Sybase ASE vs Oracle 10.2 benchmark on a fixed DDL (Time Series data) and population.
Finally, the structures and code required are simple enough for you to run your own benchmark.
5 Response to Other Answers
Re Neville's comments:
However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least one self-join.
Noting that his comments regard EAV, but that they may be taken to apply equally to the subject table (an ordinary Relational, non-EAV, Reading table) in this case, because they concern the query type (and not the EAV concept vs the Relational concept):
The declaration does not apply to Relational tables (it may well apply to EAV; the masses of problems introduced due to Record IDs; etc)
As long as you have
a genuine Relational database schema (as I have suggested), and
a genuine SQL platform (not a pretend "sql", which does not comply but fraudulently uses the name), and
you understand IN and NOT IN, and how to compare Sets in SQL
... such queries are straight-forward to code.
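A simplified illustration of the style, using the hypothetical Reading shape sketched above, an IN subquery, Sybase/SQL Server-style date functions, and a "within the same hour" stand-in for the full duration test (all names are assumptions):

SELECT DISTINCT CONVERT(DATE, c.ReadingDtm) AS ReadingDate
FROM   Reading c
WHERE  c.SensorType = 'cpu'
  AND  c.Value > 30
  AND  c.Article IN (
           SELECT h.Article
           FROM   Reading h
           WHERE  h.SensorType = 'humidity'
             AND  h.Location   = c.Location
             AND  h.Value      < 56
             AND  h.ReadingDtm BETWEEN DATEADD(hh, -1, c.ReadingDtm)
                                   AND DATEADD(hh,  1, c.ReadingDtm)
       )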
6 Response to Comments
Record ID is Anti-Relational
Do you have any links on the record_id being anti-relational, I don't disbelieve you for a second but I'm interested to learn more about why this anti-pattern is so prevalent.
In this mess of anti-science, the academics manufacture and contrive various "solutions" to "problems", that do not exist in the Relational Model, and then you have a second level of endless "debates" about which correction to the non-problem is better or worse.
You don't need links because there is nothing to "debate", and whatever "debate" you might happen to read misses the above point.
The one and only authority is the great Dr E F Codd. All the authors of all books and textbooks alleging to be about the Relational Model, other than Codd, are actually false, they are about implementing 1970's style Record Filing Systems, and anti-Relational (no Relational Power; no Relational Integrity; no Relational Speed). They made the mistake, from 1970, of trying to fit the RM into their 1970's RFS mindset, rather than releasing it and taking on the RM mindset. And they have spent the last FIVE DECADES reinforcing that, even justifying it with "mathematical definitions"; 17 "relational algebras"; 42 abnormal "normal forms". All completely anti-Relational. And they cite each other, so they get published.
The second problem is, sites such as SO are predicated on the basis of populism. The popular answer is not the best or correct answer. For that you need an Authority (very scary to populists), and objective, absolute truth. (People love their relative or subjective "truths", that change all the time).
Therefore, you need just the single, authoritative definition, the original paper, the Relational model.
Yes, the terms are out-dated, and not well understood these days.
Yes, it is seminal (every word counts, has deep meaning).
No, you need not read section 2 (math).
You need to glean from that, that:
the Relational Key is “made up from the data” (my paraphrase, to the several entries, which are layers in the RM), which is Logical
that surrogates are (a) not only against that definition, (b) they are the pre-Relational paradigm, that is Physical pointers, the very thing the RM replaces, and (c) explicitly prohibited.
Very important, you need to understand not only the definition of the Relational Key, but the whys and the wherefores.
Eg. that it transcends import/export problems that pointer-based systems have.
Eg. the temporal definition (seminal; 8 letters; scary).
Therefore, there is no argument, no "debate", to be had.
Anyone going against that is anti-Relational. Not because I say so, but because it contradicts evidenced facts, and the single Authority.
I have named the explicit technical benefits of using the RM correctly (Relational Power; Relational Integrity; Relational Speed), but an expansion of that requires a fair amount of effort.
The consequence of NOT complying with the RM is, you get (a) none of the benefits, AND (b) you get the complete set of problems that pre-Relational Record Filing Systems had in 1970, AND (c) the contrived "solutions" supplied by the "academics" that have never worked.
If you need an expansion of those benefits of the RM, which of course you do need to understand to some degree, because each one is very deep and very important, the best I can provide is this. As you can imagine, this is a battle that I have to fight on every Answer that relates to this subject, so I have posted a fair amount, over the years, across many Answers.
Go to my profile, select All Answers, and read any that relate to this subject.
Why is this Record ID anti-pattern so prevalent ?
The short answer is, people love their ignorance, their subjective "truths", and will fight tooth and nail to protect it. They quickly accept and repeat any justification for remaining the same. Learning something that is a paradigm shift away from what they know, is very scary, because it threatens their comfortable ignorance, and exposes it for what it really is. They will have to admit that what they have been writing for FIVE DECADES is wrong. That is why populism thrives. In ignorance.
The slightly longer answer is this. Just look at the internet. In the old days, for any particular subject, we had one source, one absolute authority: eg. buy the Encyclopædia Britannica; spend your entire childhood devouring it. Permanent truth. Honest history. But now anyone with a keyboard and two fingers plus some connective tissue (no brain required) can post. As an instant "authority". The web is chock-full of (a) superficial answers (the antithesis of "Now THAT is an answer") (b) in many flavours (c) that get upvoted due to populism (d) that are nowhere near the correct or full answer. Sound bites that can be easily understood by the populace. Very few want the depth of the full answer.
Even when an authority of sorts becomes established (eg. Wikipedia; Stack Overflow), it is easily subverted, because there are literally millions of people who change the entries (truth does not change, therefore, as long as something is changing, it is not truth). Mostly to serve their political positions; their ideologies; their re-writes of history to make the past wrong (it wasn't, it already happened), and the present insanity "good".
The definitive answer is this: academic envy. It took a whole decade for Codd's Relational Model to be understood and accepted. And even then, only by the few. IBM, and Britton-Lee (which became Sybase) implemented Codd's RM, in spirit and word. (Digital Equipment Corp did as well, but they are defunct.) Those academics who appeared to be working with Codd turned out to be actually working against him (by virtue of the evidence). They hated the fact that they did not come up with it themselves, that one man came up with the first real model, with a sound; logical; mathematical, foundation, complete with a Relational Algebra. All integrated. All requirements of the day (eg. the Bill Of Materials problem) answered. That has stood the test of time: five decades and nothing has been added or changed.
Typically they will declare, "but Codd did not define this or that, so here I am defining it ...". So they came up with their own RA. Now they have 17, all irrelevant. And abnormal "normal forms" to elevate fragmented bits of their Record Filing Systems to seem "relational". Now they have 42, all irrelevant. And many books, alleging to be "relational", but by evidenced fact, anti-Relational. Each "academic" seeks to reinforce their "academic" position, against all others.
Which is why I say, again, go to the one and only Authority. Read nothing from the anti-Relational crowd, because it will diminish your understanding of the RM (at best), or poison your mind (at worst).
One Clarification
If you examine a Relational PK, eg. Location.Location, it may seem odd. This is a %Code or %ShortName that is data, that the user actually uses. Usually 4 to 6 characters, max 12. As distinct from the long Name, which has to exist, and which is an Alternate Key. And of course, it is definitely not a number of any kind (which is not data, not something that the user uses). Users too, like their short forms. Obviously, use any International Standard if such exists.
The Key must be stable (not static, nothing in the universe is static), and one that is used in the real world to uniquely identify the object (data row).
Eg. for Security, which is a company listed on the stock exchange, in America, it would be TickerSymbol, in Australia ASXCode. The ISO code, an ISINCode, is an AlternateKey.
For cities, use one of the geographic location standards: ISO; FIPS; etc. (I use Statoids because it existed long before the others, but those days are numbered). At worst, use Airport Code.
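As a small worked example of the above (a sketch only; the column sizes are my assumptions):

CREATE TABLE Security (
    TickerSymbol VARCHAR(12) NOT NULL PRIMARY KEY,  -- the key users actually use
    ISINCode     CHAR(12)    NOT NULL UNIQUE,       -- ISO code, Alternate Key
    Name         VARCHAR(60) NOT NULL UNIQUE        -- long name, also an Alternate Key
)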
Genuine SQL Platform
What do you consider to be genuine SQL? Sql Server, Postgres, MySQL, Oracle I guess all would be?
No. I mean any platform that actually complies with the published SQL Standard, and therefore can actually support relational tables; relational processing of Sets; and ACID Transactions.
That automatically excludes freeware/vapourware/nowhere/"open source", for which bits are written by 10,000 developers spread across the universe, with no governing principles. Eg. no ACID Transactions, or the structures that are required for it, which are required in every code segment. Too late to insert that now, because it will require a 100% re-write, and heaven forbid ... a Server Architecture.
Commercial
which means paid-for and supported, is also important. Either you have a maintenance contract and support is immediate, or you post a bug report and you check for updates every day for the next year or three.
Server Architecture
If either scalability or performance (high throughput; high concurrency; low latency) is required, then the Server Architecture is most important. Again, that excludes the freeware, and Oracle, because they have no Architecture, they are massive collections of interacting programs, that get the o/s to perform all the functions that an architected Database Server would normally perform.
Check this Comparison of Oracle vs Sybase Architecture.
The exact same applies to PostgreSQL and other freeware. PostgreSQL (son of the total failure Ingres) famously failed under pressure, with masses of locking problems and very low concurrency.
1 High-End, Commercial, SQL Compliant
Something like 5% market share, but 95% of the Financial Services and Automation markets. Great Architecture, hopeless marketing.
Sybase ASE
IBM DB2
2 Commercial, SQL Compliant
MS SQL Server
Easily the most common. Good Architecture (originally stolen from Sybase) and then "progressed" in the usual insane MS style. Pain to use; masses of overhead; poorly integrated with various add-ons and must-uses.
3 Commercial, SQL Non-Compliant
Hopeless Architecture, great marketing.
Oracle
Generally, Oracle developers are quite good at using the product in the ways that are required to get it to work, but that means they have strayed quite far away from the Relational Model.
Eg. in the Time Series benchmark, the whole point was, Oracle cacks itself when a Subquery is requested, so it has to use an "Inline View". Which the OP alleged was just as fast as a Subquery (avoiding the fact that it requires far more code, and the coder must step outside the Relational mindset). Which the benchmark proved to be hilariously false, in each scenario tested (Oracle was 3 to 4.8 times slower than Sybase on a COUNT(), and 26 to 36 times slower on a SUM()), and the Subquery (Sybase: 2.1 secs) had to be abandoned on Oracle after 120 mins.
Eg. Oracle is non-compliant re ACID Transactions, and developers work around that obstacle to a degree, but Phantom Updates and Lost Updates (technical terms) are simply not prevented. If the work-arounds are not written properly, entire rows (UPDATEs or INSERTs) are lost.
All that applies to the below ...
4 Non-Commercial, SQL Non-Compliant
These guys spend an awful lot of time developing "features" that are not required for a Relational database, but very attractive to the anti-Relational Record ID Filing Systems.
Eg. "deferred constraint checking"; ENUMs; etc.
They lack the basics of SQL compliance. Eg. no genuine ACID Transactions.
Further, as explained above, zero Architecture. This results in systems that perform wonderfully under single-use, and fail miserably under any order of pressure from concurrency or scalability.
Due to their non-compliance with the SQL requirement, they take pains to post a notice of compliance on every page in the Commands manual. (Just one declaration of compliance at the front of the manual is all that is required.) Of course, the missing commands are simply missing, so gee whiz, they do not have a compliance declaration.
PostgreSQL
The worst piece of software I have ever had to examine since the days of Ingres. Dearly loved by the "academic" crowd, simply because it was scrawled by a fellow "academic".
5 users max, or deal with the concurrency problems (just take a cursory look at the problems reported on SO).
MySQL
Head and shoulders above PostgreSQL, but still in this category.
The InnoDB engine is distinctly better in the performance department, but nowhere near the Sybase/DB2 level (still no genuine Server Architecture). No respite in the SQL non-compliance department.
5 Summary
You get what you pay for.
Server Architecture, most visibly, performance in every scenario.
SQL Compliance, thought through deeply, and implemented in every applicable code segment.
Last but not least, Support.
Whatever you choose, remember, when you port it to another platform, your SQL code will require a complete check-and-change, because the "flavours" of SQL (or NON-sql) are very different. For the Non-Commercial program suites, that means a complete rewrite. Therefore choose carefully, with the long term implementation in mind.
It depends largely on the types of query you'll need to run. I think performance may not be your biggest concern if, as you say
A summary of all metrics for a given time bucket needs to be obtained via a query and is likely to be the most common querying scenario.
As queries in all scenarios would hit an indexable timestamp column, it really is just a question of the performance of joins, and pretty much every relational database is really good at that.
If your queries really are just "show data for a time range", your option 3 (an entity/attribute/value design) is the most effective from a development-effort point of view.
Your query would have a single, inner join, and the timestamp column would provide a good index. As you say, you wouldn't need to change schema or queries when collecting new measurement points.
The alternative designs would require outer joins for each table. In performance terms, that's not a huge deal, but managing the schema and associated queries would be a pain.
However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least one self-join.
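To illustrate the point, a sketch of what one such pair of criteria might look like against a hypothetical EAV table reading(source_id, metric, ts, value), ignoring the "for more than 3 hours" part entirely (Postgres-flavoured interval syntax):

SELECT DISTINCT CAST(cpu.ts AS DATE) AS day
FROM   reading cpu
JOIN   reading hum                       -- one self-join per extra criterion
  ON   hum.source_id = cpu.source_id
 AND   hum.ts BETWEEN cpu.ts - INTERVAL '1 hour'
                  AND cpu.ts + INTERVAL '1 hour'
WHERE  cpu.metric = 'cpu'      AND cpu.value > 30
  AND  hum.metric = 'humidity' AND hum.value < 56;
-- The duration test would need further self-joins or window functions on top.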
TimescaleDB's documentation discusses wide versus narrow data models:
https://docs.timescale.com/timescaledb/latest/overview/data-model-flexibility/
In summary:
"A narrow model makes sense if you collect each metric independently. It allows you to add new metrics as you go by adding a new tag without requiring a formal schema change."
"If you typically query multiple metrics together, it is both faster and easier to store them in a wide table format"
Indeed, option 3 is a sort of EAV model on relational storage, with the timestamp included in the EAV key.
+---------+            +-----+            +-------------+
| Sensors | -- 1:M --< | EAV | >-- M:1 -- | Value kinds |
+---------+            +-----+            +-------------+
A summary of all metrics for a given time bucket needs to be obtained via a query and is likely to be the most common querying scenario
If queries don't require joins but need to be grouped by time, a clustered index on the timestamp column ensures good performance.
However, any queries with joins (i.e. comparing values of different sensors) risk degrading performance. The solution can be a separate OLAP storage for the collected EAV data.
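A sketch of the common case (names taken loosely from the diagram above; they are assumptions, not a fixed schema):

-- One time-bucket summary, no joins, served by a clustered index on ts
SELECT value_kind_id,
       COUNT(*)   AS readings,
       AVG(value) AS avg_value
FROM   eav
WHERE  ts >= '2020-01-01 00:00:00'
  AND  ts <  '2020-01-01 01:00:00'
GROUP  BY value_kind_id;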
From a developer's point of view, I would like to recommend the third option. In your third option, you might consider having indexes on the MetricType (i.e. typeId) and the timestamp column, which will greatly optimize the query performance.
Your first option, by contrast, may require system downtime: when a new column needs to be added, you have to take the live system down, add the column, initialize it with default or null values, and then bring the system back up. In my opinion, it will also contain unnecessary (garbage) data in all rows that existed before the column was added. The table will be huge and may hold a significant amount of such garbage, which affects query time.
The second idea is an improvement over the first; however, even though it avoids the garbage data, it requires joining multiple tables, which will increase query time as the data grows, and you cannot maintain one set of indexes across all those tables the way you can index the single table in the third option.
Hence I think going for the third option is the most effective. The tables are normalized and effective indexes will provide efficient query results.
I would like to suggest another thing. You might also consider having a separate table which will contain aggregated data. For example, if your system requires aggregated data, you might consider storing it in a denormalized style in a separate table, where the aggregated values for a certain timeline are kept so that you can remove the already-processed data from your original table. I am referring to an OLAP-style database, which you might consider looking into.
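A sketch of both suggestions, under the assumption that the option-3 table is called metric(type_id, ts, value) (the names are mine):

CREATE INDEX ix_metric_type_time ON metric (type_id, ts);
CREATE INDEX ix_metric_time      ON metric (ts);

-- Separate, denormalized aggregate table, filled periodically (OLAP-style)
CREATE TABLE metric_hourly (
    type_id     INT       NOT NULL,
    bucket_hour TIMESTAMP NOT NULL,
    avg_value   FLOAT     NOT NULL,
    max_value   FLOAT     NOT NULL,
    PRIMARY KEY (type_id, bucket_hour)
);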
I wouldn't recommend a design where you need to ALTER the table whenever you add a sensor (given that you know you will). That's why I believe you should eliminate option 1: whenever you alter your table you will get plenty of null values and unnecessary work in your code.
The same applies to option 2, except perhaps for the nulls, but you will still get unnecessary work whenever you add a new data source to your system.
Option 3 looks like a good fit to me, as it is ready for expanding data sources and keeps the data clean and neat.

Sql server Index Size is 100 GB [duplicate]

What are common database development mistakes made by application developers?
1. Not using appropriate indices
This is a relatively easy one but still it happens all the time. Foreign keys should have indexes on them. If you're using a field in a WHERE clause you should (probably) have an index on it. Such indexes should often cover multiple columns based on the queries you need to execute.
2. Not enforcing referential integrity
Your database may vary here but if your database supports referential integrity--meaning that all foreign keys are guaranteed to point to an entity that exists--you should be using it.
It's quite common to see this failure on MySQL databases. I don't believe MyISAM supports it. InnoDB does. You'll find people who are using MyISAM, or people who are using InnoDB but aren't using referential integrity anyway.
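A minimal MySQL sketch of the point (hypothetical tables; InnoDB enforces the constraint, while MyISAM would parse and silently ignore it):

CREATE TABLE customers (
    id INT NOT NULL PRIMARY KEY
) ENGINE = InnoDB;

CREATE TABLE orders (
    id          INT NOT NULL PRIMARY KEY,
    customer_id INT NOT NULL,
    CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (id)
) ENGINE = InnoDB;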
More here:
How important are constraints like NOT NULL and FOREIGN KEY if I’ll always control my database input with php?
Are foreign keys really necessary in a database design?
3. Using natural rather than surrogate (technical) primary keys
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What's the best practice for primary keys in tables?
Which format of primary key would you use in this situation.
Surrogate vs. natural/business keys
Should I have a dedicated primary key field?
This is a somewhat controversial topic on which you won't get universal agreement. While you may find some people, who think natural keys are in some situations OK, you won't find any criticism of surrogate keys other than being arguably unnecessary. That's quite a small downside if you ask me.
Remember, even countries can cease to exist (for example, Yugoslavia).
4. Writing queries that require DISTINCT to work
You often see this in ORM-generated queries. Look at the log output from Hibernate and you'll see all the queries begin with:
SELECT DISTINCT ...
This is a bit of a shortcut to ensuring you don't return duplicate rows and thus get duplicate objects. You'll sometimes see people doing this as well. If you see it too much it's a real red flag. Not that DISTINCT is bad or doesn't have valid applications. It does (on both counts) but it's not a surrogate or a stopgap for writing correct queries.
From Why I Hate DISTINCT:
Where things start to go sour in my opinion is when a developer is building substantial query, joining tables together, and all of a sudden he realizes that it looks like he is getting duplicate (or even more) rows and his immediate response... his "solution" to this "problem" is to throw on the DISTINCT keyword and POOF all his troubles go away.
5. Favouring aggregation over joins
Another common mistake by database application developers is to not realize how much more expensive aggregation (ie the GROUP BY clause) can be compared to joins.
To give you an idea of how widespread this is, I've written on this topic several times here and been downvoted a lot for it. For example:
From SQL statement - “join” vs “group by and having”:
First query:
SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3
Query time: 0.312 s
Second query:
SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
AND t1.roleid = 1
Query time: 0.016 s
That's right. The join version I proposed is twenty times faster than the aggregate version.
6. Not simplifying complex queries through views
Not all database vendors support views but for those that do, they can greatly simplify queries if used judiciously. For example, on one project I used a generic Party model for CRM. This is an extremely powerful and flexible modelling technique but can lead to many joins. In this model there were:
Party: people and organisations;
Party Role: things those parties did, for example Employee and Employer;
Party Role Relationship: how those roles related to each other.
Example:
Ted is a Person, being a subtype of Party;
Ted has many roles, one of which is Employee;
Intel is an organisation, being a subtype of a Party;
Intel has many roles, one of which is Employer;
Intel employs Ted, meaning there is a relationship between their respective roles.
So there are five tables joined to link Ted to his employer. You assume all employees are Persons (not organisations) and provide this helper view:
CREATE VIEW vw_employee AS
SELECT p.title, p.given_names, p.surname, p.date_of_birth, p2.party_name employer_name
FROM person p
JOIN party py ON py.id = p.id
JOIN party_role child ON p.id = child.party_id
JOIN party_role_relationship prr ON child.id = prr.child_id AND prr.type = 'EMPLOYMENT'
JOIN party_role parent ON parent.id = prr.parent_id
JOIN party p2 ON parent.party_id = p2.id
And suddenly you have a very simple view of the data you want but on a highly flexible data model.
7. Not sanitizing input
This is a huge one. Now I like PHP but if you don't know what you're doing it's really easy to create sites vulnerable to attack. Nothing sums it up better than the story of little Bobby Tables.
Data provided by the user by way of URLs, form data and cookies should always be treated as hostile and sanitized. Make sure you're getting what you expect.
8. Not using prepared statements
Prepared statements are when you compile a query minus the data used in inserts, updates and WHERE clauses and then supply that later. For example:
SELECT * FROM users WHERE username = 'bob'
vs
SELECT * FROM users WHERE username = ?
or
SELECT * FROM users WHERE username = :username
depending on your platform.
I've seen databases brought to their knees by failing to do this. Basically, each time any modern database encounters a new query it has to compile it. If it encounters a query it's seen before, you're giving the database the opportunity to cache the compiled query and the execution plan. By doing the query a lot you're giving the database the opportunity to figure that out and optimize accordingly (for example, by pinning the compiled query in memory).
Using prepared statements will also give you meaningful statistics about how often certain queries are used.
Prepared statements will also better protect you against SQL injection attacks.
9. Not normalizing enough
Database normalization is basically the process of optimizing database design or how you organize your data into tables.
Just this week I ran across some code where someone had imploded an array and inserted it into a single field in a database. Normalizing that would be to treat each element of that array as a separate row in a child table (ie a one-to-many relationship).
This also came up in Best method for storing a list of user IDs:
I've seen in other systems that the list is stored in a serialized PHP array.
But lack of normalization comes in many forms.
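A sketch of that particular fix, with made-up names:

-- Instead of one imploded field, e.g. newsletter(id, name, subscriber_ids = '3,17,42'):
CREATE TABLE newsletter (
    id   INT         NOT NULL PRIMARY KEY,
    name VARCHAR(60) NOT NULL
);

-- ...one row per element in a child table (a one-to-many relationship)
CREATE TABLE newsletter_subscriber (
    newsletter_id INT NOT NULL REFERENCES newsletter (id),
    user_id       INT NOT NULL,
    PRIMARY KEY (newsletter_id, user_id)
);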
More:
Normalization: How far is far enough?
SQL by Design: Why You Need Database Normalization
10. Normalizing too much
This may seem like a contradiction to the previous point but normalization, like many things, is a tool. It is a means to an end and not an end in and of itself. I think many developers forget this and start treating a "means" as an "end". Unit testing is a prime example of this.
I once worked on a system that had a huge hierarchy for clients that went something like:
Licensee -> Dealer Group -> Company -> Practice -> ...
such that you had to join about 11 tables together before you could get any meaningful data. It was a good example of normalization taken too far.
More to the point, careful and considered denormalization can have huge performance benefits but you have to be really careful when doing this.
More:
Why too much Database Normalization can be a Bad Thing
How far to take normalization in database design?
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
The Mother of All Database Normalization Debates on Coding Horror
11. Using exclusive arcs
An exclusive arc is a common mistake where a table is created with two or more foreign keys where one and only one of them can be non-null. Big mistake. For one thing it becomes that much harder to maintain data integrity. After all, even with referential integrity, nothing is preventing two or more of these foreign keys from being set (complex check constraints notwithstanding).
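For illustration, a sketch of the arc and one common alternative, using hypothetical tables in generic SQL:

CREATE TABLE invoice (id INT NOT NULL PRIMARY KEY);
CREATE TABLE refund  (id INT NOT NULL PRIMARY KEY);

-- Exclusive arc: exactly one of the two FKs is supposed to be non-null,
-- but nothing below actually enforces that
CREATE TABLE payment (
    id         INT NOT NULL PRIMARY KEY,
    invoice_id INT NULL REFERENCES invoice (id),
    refund_id  INT NULL REFERENCES refund  (id)
);

-- One alternative: subtype tables, each with a mandatory FK
CREATE TABLE invoice_payment (
    payment_id INT NOT NULL PRIMARY KEY REFERENCES payment (id),
    invoice_id INT NOT NULL REFERENCES invoice (id)
);
CREATE TABLE refund_payment (
    payment_id INT NOT NULL PRIMARY KEY REFERENCES payment (id),
    refund_id  INT NOT NULL REFERENCES refund (id)
);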
From A Practical Guide to Relational Database Design:
We have strongly advised against exclusive arc construction wherever possible, for the good reason that they can be awkward to write code and pose more maintenance difficulties.
12. Not doing performance analysis on queries at all
Pragmatism reigns supreme, particularly in the database world. If you're sticking to principles to the point that they've become a dogma then you've quite probably made mistakes. Take the example of the aggregate queries from above. The aggregate version might look "nice" but its performance is woeful. A performance comparison should've ended the debate (but it didn't) but more to the point: spouting such ill-informed views in the first place is ignorant, even dangerous.
13. Over-reliance on UNION ALL and particularly UNION constructs
A UNION in SQL terms merely concatenates congruent data sets, meaning they have the same type and number of columns. The difference between them is that UNION ALL is a simple concatenation and should be preferred wherever possible whereas a UNION will implicitly do a DISTINCT to remove duplicate tuples.
UNIONs, like DISTINCT, have their place. There are valid applications. But if you find yourself doing a lot of them, particularly in subqueries, then you're probably doing something wrong. That might be a case of poor query construction or a poorly designed data model forcing you to do such things.
UNIONs, particularly when used in joins or dependent subqueries, can cripple a database. Try to avoid them whenever possible.
14. Using OR conditions in queries
This might seem harmless. After all, ANDs are OK. OR should be OK too right? Wrong. Basically an AND condition restricts the data set whereas an OR condition grows it but not in a way that lends itself to optimisation. Particularly when the different OR conditions might intersect, thus forcing the optimizer to effectively do a DISTINCT operation on the result.
Bad:
... WHERE a = 2 OR a = 5 OR a = 11
Better:
... WHERE a IN (2, 5, 11)
Now your SQL optimizer may effectively turn the first query into the second. But it might not. Just don't do it.
15. Not designing their data model to lend itself to high-performing solutions
This is a hard point to quantify. It is typically observed by its effect. If you find yourself writing gnarly queries for relatively simple tasks or that queries for finding out relatively straightforward information are not efficient, then you probably have a poor data model.
In some ways this point summarizes all the earlier ones but it's more of a cautionary tale that doing things like query optimisation is often done first when it should be done second. First and foremost you should ensure you have a good data model before trying to optimize the performance. As Knuth said:
Premature optimization is the root of all evil
16. Incorrect use of Database Transactions
All data changes for a specific process should be atomic. I.e. If the operation succeeds, it does so fully. If it fails, the data is left unchanged. - There should be no possibility of 'half-done' changes.
Ideally, the simplest way to achieve this is that the entire system design should strive to support all data changes through single INSERT/UPDATE/DELETE statements. In this case, no special transaction handling is needed, as your database engine should do so automatically.
However, if any processes do require multiple statements be performed as a unit to keep the data in a consistent state, then appropriate Transaction Control is necessary.
Begin a Transaction before the first statement.
Commit the Transaction after the last statement.
On any error, Rollback the Transaction. And very NB! Don't forget to skip/abort all statements that follow after the error.
It is also recommended to pay careful attention to the subtleties of how your database connectivity layer and database engine interact in this regard.
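A bare-bones sketch of the pattern on a hypothetical account table (the transaction keywords and the error-handling mechanism vary by platform):

BEGIN TRANSACTION;

UPDATE account SET balance = balance - 100 WHERE account_id = 1;
UPDATE account SET balance = balance + 100 WHERE account_id = 2;

-- If both statements succeeded:
COMMIT;
-- On any error: issue ROLLBACK instead, and skip/abort the remaining statements.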
17. Not understanding the 'set-based' paradigm
The SQL language follows a specific paradigm suited to specific kinds of problems. Various vendor-specific extensions notwithstanding, the language struggles to deal with problems that are trivial in languages like Java, C#, Delphi etc.
This lack of understanding manifests itself in a few ways.
Inappropriately imposing too much procedural or imperative logic on the database.
Inappropriate or excessive use of cursors. Especially when a single query would suffice.
Incorrectly assuming that triggers fire once per row affected in multi-row updates.
Determine clear division of responsibility, and strive to use the appropriate tool to solve each problem.
Key database design and programming mistakes made by developers
Selfish database design and usage. Developers often treat the database as their personal persistent object store without considering the needs of other stakeholders in the data. This also applies to application architects. Poor database design and data integrity makes it hard for third parties working with the data and can substantially increase the system's life cycle costs. Reporting and MIS tends to be a poor cousin in application design and only done as an afterthought.
Abusing denormalised data. Overdoing denormalised data and trying to maintain it within the application is a recipe for data integrity issues. Use denormalisation sparingly. Not wanting to add a join to a query is not an excuse for denormalising.
Scared of writing SQL. SQL isn't rocket science and is actually quite good at doing its job. O/R mapping layers are quite good at doing the 95% of queries that are simple and fit well into that model. Sometimes SQL is the best way to do the job.
Dogmatic 'No Stored Procedures' policies. Regardless of whether you believe stored procedures are evil, this sort of dogmatic attitude has no place on a software project.
Not understanding database design. Normalisation is your friend and it's not rocket science. Joining and cardinality are fairly simple concepts - if you're involved in database application development there's really no excuse for not understanding them.
Not using version control on the database schema
Working directly against a live database
Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
Over-use and/or dependence on stored procedures.
Some application developers see stored procedures as a direct extension of middle tier/front end code. This appears to be a common trait in Microsoft stack developers (I'm one, but I've grown out of it) and produces many stored procedures that perform complex business logic and workflow processing. This is much better done elsewhere.
Stored procedures are useful where it has actually been proven that some real technical factor necessitates their use (for example, performance and security). For example, keeping aggregation/filtering of large data sets "close to the data".
I recently had to help maintain and enhance a large Delphi desktop application of which 70% of the business logic and rules were implemented in 1400 SQL Server stored procedures (the remainder in UI event handlers). This was a nightmare, primarily due to the difficulty of introducing effective unit testing to TSQL, lack of encapsulation and poor tools (debuggers, editors).
Working with a Java team in the past I quickly found out that often the complete opposite holds in that environment. A Java Architect once told me: "The database is for data, not code.".
These days I think it's a mistake to not consider stored procs at all, but they should be used sparingly (not by default) in situations where they provide useful benefits (see the other answers).
Number one problem? They only test on toy databases. So they have no idea that their SQL will crawl when the database gets big, and someone has to come along and fix it later (that sound you can hear is my teeth grinding).
Not using indexes.
Poor Performance Caused by Correlated Subqueries
Most of the time you want to avoid correlated subqueries. A subquery is correlated if, within the subquery, there is a reference to a column from the outer query. When this happens, the subquery is executed at least once for every row returned and could be executed more times if other conditions are applied after the condition containing the correlated subquery is applied.
Forgive the contrived example and the Oracle syntax, but let's say you wanted to find all the employees that have been hired in any of your stores since the last time the store did less than $10,000 of sales in a day.
select e.first_name, e.last_name
from employee e
where e.start_date >
(select max(ds.transaction_date)
from daily_sales ds
where ds.store_id = e.store_id and
ds.total < 10000)
The subquery in this example is correlated to the outer query by the store_id and would be executed for every employee in your system. One way that this query could be optimized is to move the subquery to an inline-view.
select e.first_name, e.last_name
from employee e,
     (select ds.store_id,
             max(ds.transaction_date) transaction_date
      from daily_sales ds
      where ds.total < 10000
      group by ds.store_id) dsx
where e.store_id = dsx.store_id and
      e.start_date > dsx.transaction_date
In this example, the query in the from clause is now an inline-view (again some Oracle specific syntax) and is only executed once. Depending on your data model, this query will probably execute much faster. It would perform better than the first query as the number of employees grew. The first query could actually perform better if there were few employees and many stores (and perhaps many of the stores had no employees) and the daily_sales table was indexed on store_id. This is not a likely scenario but shows how a correlated query could possibly perform better than an alternative.
I've seen junior developers correlate subqueries many times and it usually has had a severe impact on performance. However, when removing a correlated subquery be sure to look at the explain plan before and after to make sure you are not making the performance worse.
In my experience:
Not communicating with experienced DBAs.
Using Access instead of a "real" database. There are plenty of great small and even free databases like SQL Express, MySQL, and SQLite that will work and scale much better. Apps often need to scale in unexpected ways.
Forgetting to set up relationships between the tables. I remember having to clean this up when I first started working at my current employer.
Using Excel for storing (huge amounts of) data.
I have seen companies holding thousands of rows and using multiple worksheets (due to the row limit of 65,536 in previous versions of Excel).
Excel is well suited for reports, data presentation and other tasks, but should not be treated as a database.
I'd like to add:
Favoring "Elegant" code over highly performing code. The code that works best against databases is often ugly to the application developer's eye.
Believing that nonsense about premature optimization. Databases must consider performance in the original design and in any subsequent development. Performance is 50% of database design (40% is data integrity and the last 10% is security) in my opinion. Databases which are not built from the bottom up to perform will perform badly once real users and real traffic are placed against the database. Premature optimization doesn't mean no optimization! It doesn't mean you should write code that will almost always perform badly because you find it easier (cursors for example which should never be allowed in a production database unless all else has failed). It means you don't need to look at squeezing out that last little bit of performance until you need to. A lot is known about what will perform better on databases, to ignore this in design and development is short-sighted at best.
Not using parameterized queries. They're pretty handy in stopping SQL Injection.
This is a specific example of not sanitizing input data, mentioned in another answer.
I hate it when developers use nested select statements, or even functions that return the result of a select statement, inside the "SELECT" portion of a query.
I'm actually surprised I don't see this anywhere else here, perhaps I overlooked it, although #adam has a similar issue indicated.
Example:
SELECT
(SELECT TOP 1 SomeValue FROM SomeTable WHERE SomeDate = c.Date ORDER BY SomeValue desc) As FirstVal
,(SELECT OtherValue FROM SomeOtherTable WHERE SomeOtherCriteria = c.Criteria) As SecondVal
FROM
MyTable c
In this scenario, if MyTable returns 10000 rows the result is as if the query just ran 20001 queries, since it had to run the initial query plus query each of the other tables once for each line of result.
Developers can get away with this working in a development environment where they are only returning a few rows of data and the sub tables usually only have a small amount of data, but in a production environment, this kind of query can become exponentially costly as more data is added to the tables.
A better (not necessarily perfect) example would be something like:
SELECT
s.SomeValue As FirstVal
,o.OtherValue As SecondVal
FROM
MyTable c
LEFT JOIN (
SELECT SomeDate, MAX(SomeValue) as SomeValue
FROM SomeTable
GROUP BY SomeDate
) s ON c.Date = s.SomeDate
LEFT JOIN SomeOtherTable o ON c.Criteria = o.SomeOtherCriteria
This allows database optimizers to shuffle the data together, rather than requery on each record from the main table and I usually find when I have to fix code where this problem has been created, I usually end up increasing the speed of queries by 100% or more while simultaneously reducing CPU and memory usage.
For SQL-based databases:
Not taking advantage of CLUSTERED INDEXES or choosing the wrong column(s) to CLUSTER.
Not using a SERIAL (autonumber) datatype as a PRIMARY KEY to join to a FOREIGN KEY (INT) in a parent/child table relationship.
Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
Not reorganizing (i.e. unloading, dropping, re-creating, loading and re-indexing) tables when many rows have been inserted or deleted (some engines physically keep deleted rows in a table with a delete flag.)
Not taking advantage of FRAGMENT ON EXPRESSION (if supported) on large tables which have high transaction rates.
Choosing the wrong datatype for a column!
Not choosing a proper column name.
Not adding new columns at the end of the table.
Not creating proper indexes to support frequently used queries.
Creating indexes on columns with few possible values and creating unnecessary indexes.
...more to be added.
Not taking a backup before fixing some issue inside production database.
Using DDL commands on stored objects (like tables, views) in stored procedures.
Fear of using stored procs, or fear of using ORM queries, wherever one of them is the more efficient/appropriate choice.
Ignoring the use of a database profiler, which can tell you exactly what your ORM query is finally converted into, and hence lets you verify the logic or even debug when not using an ORM.
Not doing the correct level of normalization. You want to make sure that data is not duplicated, and that you are splitting data into different tables as needed. You also need to make sure you are not taking normalization too far, as that will hurt performance.
Treating the database as just a storage mechanism (i.e. glorified collections library) and hence subordinate to their application (ignoring other applications which share the data)
Dismissing an ORM like Hibernate out of hand, for reasons like "it's too magical" or "not on my database".
Relying too heavily on an ORM like Hibernate and trying to shoehorn it in where it isn't appropriate.
1 - Unnecessarily using a function on a value in a where clause, with the result that the index on that column is not used.
Example:
where to_char(someDate,'YYYYMMDD') between :fromDate and :toDate
instead of
where someDate >= to_date(:fromDate,'YYYYMMDD') and someDate < to_date(:toDate,'YYYYMMDD')+1
And to a lesser extent: Not adding functional indexes to those values that need them...
2 - Not adding check constraints to ensure the validity of the data. Constraints can be used by the query optimizer, and they REALLY help to ensure that you can trust your invariants. There's just no reason not to use them.
3 - Adding unnormalized columns to tables out of pure laziness or time pressure. Things are usually not designed this way, but evolve into this. The end result, without fail, is a ton of work trying to clean up the mess when you're bitten by the lost data integrity in future evolutions.
Think of this, a table without data is very cheap to redesign. A table with a couple of millions records with no integrity... not so cheap to redesign. Thus, doing the correct design when creating the column or table is amortized in spades.
4 - not so much about the database per se but indeed annoying. Not caring about the code quality of SQL. The fact that your SQL is expressed in text does not make it OK to hide the logic in heaps of string manipulation algorithms. It is perfectly possible to write SQL in text in a manner that is actually readable by your fellow programmer.
This has been said before, but: indexes, indexes, indexes. I've seen so many cases of poorly performing enterprise web apps that were fixed by simply doing a little profiling (to see which tables were being hit a lot), and then adding an index on those tables. This doesn't even require much in the way of SQL writing knowledge, and the payoff is huge.
Avoid data duplication like the plague. Some people advocate that a little duplication won't hurt, and will improve performance. Hey, I'm not saying that you have to torture your schema into Third Normal Form, until it's so abstract that not even the DBA's know what's going on. Just understand that whenever you duplicate a set of names, or zipcodes, or shipping codes, the copies WILL fall out of synch with each other eventually. It WILL happen. And then you'll be kicking yourself as you run the weekly maintenance script.
And lastly: use a clear, consistent, intuitive naming convention. In the same way that a well written piece of code should be readable, a good SQL schema or query should be readable and practically tell you what it's doing, even without comments. You'll thank yourself in six months, when you have to do maintenance on the tables. "SELECT account_number, billing_date FROM national_accounts" is infinitely easier to work with than "SELECT ACCNTNBR, BILLDAT FROM NTNLACCTS".
Not executing a corresponding SELECT query before running the DELETE query (particularly on production databases)!
The most common mistake I've seen in twenty years: not planning ahead. Many developers will create a database and tables, and then continually modify and expand the tables as they build out the applications. The end result is often a mess that is inefficient and difficult to clean up or simplify later on.
a) Hardcoding query values in strings
b) Putting the database query code in the "OnButtonPress" action in a Windows Forms application
I have seen both.
Not paying enough attention to managing database connections in your application. Then you find out that the application, the computer, the server, and the network are all clogged.
Thinking that they are DBAs and data modelers/designers when they have no formal indoctrination of any kind in those areas.
Thinking that their project doesn't require a DBA because that stuff is all easy/trivial.
Failure to properly discern between work that should be done in the database, and work that should be done in the app.
Not validating backups, or not backing up.
Embedding raw SQL in their code.
Here is a link to a video called ‘Classic Database Development Mistakes and five ways to overcome them’ by Scott Walz
Not having an understanding of the database's concurrency model and how this affects development. It's easy to add indexes and tweak queries after the fact. However, applications designed without proper consideration for hotspots, resource contention and correct operation (assuming what you just read is still valid!) can require significant changes within the database and application tier to correct later.
Not understanding how a DBMS works under the hood.
You cannot properly drive a stick without understanding how a clutch works. And you cannot understand how to use a Database without understanding that you are really just writing to a file on your hard disk.
Specifically:
Do you know what a Clustered Index is? Did you think about it when you designed your schema?
Do you know how to use indexes properly? How to reuse an index? Do you know what a Covering Index is?
So great, you have indexes. How big is one row in your index? How big will the index be when you have a lot of data? Will that fit easily into memory? If it won't, it's useless as an index.
Have you ever used EXPLAIN in MySQL? Great. Now be honest with yourself: Did you understand even half of what you saw? No, you probably didn't. Fix that.
Do you understand the Query Cache? Do you know what makes a query un-cachable?
Are you using MyISAM? If you NEED full text search, MyISAM's is crap anyway. Use Sphinx. Then switch to Inno.
Using an ORM to do bulk updates
Selecting more data than needed. Again, typically done when using an ORM
Firing SQL statements in a loop.
Not having good test data and noticing performance degradation only on live data.

Will normalising my database kill the scalability?

I have a database that will form part of a highly trafficked web app.
I'm wondering if I should normalise the tables, so that things such as (e.g.) 'question_type' go in a separate table to all the basic information about the question, such as 'title' and 'question_body'?
I'm only asking because I need this database to be as scalable as possible and I'm told normalisation isn't always the way to go when you need scalability.
Thanks
The thing that makes normalization an issue with scaling is that it tends to need to have multiple tables join together. Joins are great on small tables but the larger the table grows the harder the server needs to work.
The main thing to look to is avoiding joins. If you can do the query without a join by adding a field to one of the tables, you just speed up the performance of that query.
If your table has a question_body and question_type, then I don't see how moving it to another table achieves normalization. e.g.:
table question (
question_body text,
question_user text,
question_user_rank integer,
question_type text
);
Splitting out a single value into a single column table won't achieve anything other than useless joins. That is:
select * from question q join question_type qt on (q.qt_id = qt.id)
where qt.name = 'sql questions';
is an equivalent, but wasteful form of
select * from question
where question_type = 'sql questions';
On the other hand, (using the example above), it makes a lot of sense to split out the question user information into its own table:
table question (
question_body text,
question_type text,
question_user_id integer references question_user(id) on delete cascade
);
table question_user (
id integer,
name text,
rank integer
);
So if a user has his rank changed (à la SO), you only have to change it in one place rather than in every row where he's asked a question. You've increased your ability to handle scaling, since you've turned hundreds of updates into a single update.
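Using the tables sketched above, the rank change becomes a single statement (the literal values are just for illustration):
update question_user
   set rank = rank + 10
 where id = 42;   -- one row touched, no matter how many questions the user has asked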
Now that's a loaded question. Normalization isn't a hard rule so much as a guideline. Designing a database is made up of a series of decisions regarding the level of normalization that makes sense given your need for code efficiency, performance and integrity, among other things. That's greatly oversimplifying it, but the spectrum of design decisions spans volumes of well-authored books.
Can you tell me a little bit more about your application and intended platform? I might be able to steer you in the direction of some very useful reference material if I can better understand your situation.
Will adding salt make my food taste better?
Same question. No one can answer.
The main problem is that it depends on your usage patterns and, to some degree, your competence as a programmer, e.g. whether you use lookup caches in the application instead of database joins. Quite a lot of programmers never get above the "scrambled eggs, burned" level of SQL, to keep a cooking analogy.
For scalability, application design AND database technology have a lot more to say. Hard to beat an Oracle RAC installation, possibly on an Exadata platform depending on what you need. Cost is, I think, around half a million USD for the smallest unit. Still sure you need "as scalable as possible"? Not joking here: right now I work on a 6,000 GB data warehouse, and we just ordered 3 of those monsters, and not the smallest one.
So, what do you mean by "as scalable as possible"? This is like "my car needs to go as fast as a car has ever gone and more"; then you end up with a special-made car with a jet engine in it ;)
General rule:
* Separate transactions and reporting into two databases. The second being a data warehouse.
* Normalize transactional db
* Use star schema on data warehouse.
BIG chance is: you don't know what you're talking about and have never actually done scalability work, so there is an 80% chance your "high scalability" requirement is a joke for a decent database server. Now, that is not meant to be insulting, but I have seen SO many people say "I have a ton of data in a table" which turns out to be 10,000 rows maximum. That is not a ton - it is a joke. We load 100 million rows daily into our data warehouse's main table (and have to keep them for many years). Most people don't really get the speed a decent database server can provide. Which means many discs.

Am i right to sacrifice database design fundamentals in this case for the sake of speed?

I work in a company that uses single table Access database for its outbound cms, which I moved to a SQL server based system. There's a data list table (not normalized) and a calls table. This has about one update per second currently. All call outcomes along with date, time, and agent id are stored in the calls table. Agents have a predefined set of records that they will call each day (this comprises records from various data lists sorted to give an even spread throughout their set). Note a data list record is called once per day.
In order to ensure speed, live updates to this system are stored in a duplicate of the calls table fields in the data list table. These are then copied to the calls table in a batch process at the end of the day.
The reason for this is obviously not the speed at which a new record could be added to the calls table live, but that when the user app is closed/opened and loads the user's data set again, I need to check which records have not been called today. I would need to run a stored proc on the server that picked the most recent call from the calls table and checked whether its call date didn't match today's date. I believe that is a more expensive query than checking if a field in the data list table is NULL.
With this setup I only run the expensive query at the end of each day.
There are many pitfalls in this design, the main limitation is my inexperience. This is my first SQL server system. It's pretty critical, and I had to ensure it would work and I could easily dump data back to access db during a live failure. It has worked for 11 months now (and no live failure, less downtime than the old system).
I have created pretty well normalized databases for other things (with far fewer users), but I'm hesitant to implement this for the calling database.
Specifically, I would like to know your thoughts on whether the duplication of the calls fields in the data list table is necessary in my current setup or whether I should be able to use the calls table. Please try and answer this from my perspective. I know you DBAs may be cringing!
Redesigning an already working database may become the major flaw here. Rather, try to optimise what you currently have running instead of starting from scratch. Think of indices, referential integrity, key assigning methods, proper usage of joins and the like.
In fact, have a look here:
Database development mistakes made by application developers
This outlines some very useful pointers.
The thing the "Normalisation Nazis" out there forget is that database design typically has two stages, the "Logical Design" and the "Physical Design". The logical design is for normalisation, and the physical design is for "now lets get the thing working", considering among other things the benefits of normalisation vs. the benefits of breaking nomalisation.
The classic example is an Order table and an Order-Detail table, where the Order header table has "total price", a value derived from the Order-Detail and related tables. Having total price on Order in this case still makes sense, but it breaks normalisation.
A normalised database is meant to give your database high maintainability and flexibility. But optimising for performance is one of the considerations that physical design considers. Look at reporting databases for example. And don't get me started about storing time-series data.
Ask yourself, has my maintainability or flexibility been significantly hindered by this decision? Does it cause me lots of code changes or data redesign when I change something? If not, and you're happy that your design is working as required, then I wouldn't worry.
I think whether to normalize it depends on how much you can do, and what may be needed.
For example, as Ian mentioned, it has been working for so long; are there features they want to add that will impact the database schema?
If not, then just leave it as it is, but, if you need to add new features that change the database, you may want to see about normalizing it at that point.
You wouldn't need to call a stored procedure, you should be able to use a select statement to get the max(id) by the user id, or the max(id) in the table, depending on what you need to do.
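As a rough sketch of that idea (table and column names are guesses, not your actual schema), the "not called today" check can be a single anti-join rather than a per-record lookup:
select dl.id
from data_list dl
left join calls c
  on c.data_list_id = dl.id
 and c.call_date = cast(getdate() as date)   -- SQL Server; adjust for your engine
where c.id is null;                          -- no call logged today for this record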
Before deciding to normalize, or to make any major architectural changes, first look at why you are doing it. If you are doing it just because you think it needs to be done, then stop, and see if there is anything else you can do, perhaps add unit tests, so you can get some times for how long operations take. Numbers are good before making major changes, to see if there is any real benefit.
I would ask you to be a little more clear about the specific dilemma you face. If your system has worked so well for 11 months, what makes you think it needs any change?
I'm not sure you are aware of the fact that "Database design fundamentals" might relate to "logical database design fundamentals" as well as "physical database design fundamentals", nor whether you are aware of the difference.
Logical database design fundamentals should not (and actually cannot) be "sacrificed" for speed, precisely because speed is determined only by physical design choices, and the prime decision factor in physical design is precisely speed and performance.

Database development mistakes made by application developers [closed]

What are common database development mistakes made by application developers?
1. Not using appropriate indices
This is a relatively easy one but still it happens all the time. Foreign keys should have indexes on them. If you're using a field in a WHERE you should (probably) have an index on it. Such indexes should often cover multiple columns based on the queries you need to execute.
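For instance, a minimal sketch (names are hypothetical):
-- foreign key column used in joins and WHERE clauses
create index ix_orders_customer_id on orders (customer_id);
-- composite index covering a common query pattern
create index ix_orders_status_date on orders (status, order_date);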
2. Not enforcing referential integrity
Your database may vary here but if your database supports referential integrity--meaning that all foreign keys are guaranteed to point to an entity that exists--you should be using it.
It's quite common to see this failure on MySQL databases. I don't believe MyISAM supports it; InnoDB does. You'll find people who are using MyISAM, or who are using InnoDB but aren't enforcing referential integrity anyway. A sketch of declaring the constraint follows the links below.
More here:
How important are constraints like NOT NULL and FOREIGN KEY if I’ll always control my database input with php?
Are foreign keys really necessary in a database design?
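A minimal sketch of declaring the constraint (hypothetical names, InnoDB assumed on MySQL):
alter table orders
  add constraint fk_orders_customer
  foreign key (customer_id) references customers (id);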
3. Using natural rather than surrogate (technical) primary keys
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What's the best practice for primary keys in tables?
Which format of primary key would you use in this situation.
Surrogate vs. natural/business keys
Should I have a dedicated primary key field?
This is a somewhat controversial topic on which you won't get universal agreement. While you may find some people who think natural keys are OK in some situations, you won't find any criticism of surrogate keys other than that they are arguably unnecessary. That's quite a small downside if you ask me.
Remember, even countries can cease to exist (for example, Yugoslavia).
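A hedged sketch of the pattern: keep the natural identifier as a unique column, but join on a meaningless surrogate (MySQL-style auto-increment shown; use IDENTITY or a sequence elsewhere):
create table product (
  id           int auto_increment primary key,  -- surrogate key, never changes
  product_code varchar(20) not null unique,     -- natural key, still enforced, free to change
  name         varchar(100) not null
);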
4. Writing queries that require DISTINCT to work
You often see this in ORM-generated queries. Look at the log output from Hibernate and you'll see all the queries begin with:
SELECT DISTINCT ...
This is a bit of a shortcut to ensuring you don't return duplicate rows and thus get duplicate objects. You'll sometimes see people doing this as well. If you see it too much it's a real red flag. Not that DISTINCT is bad or doesn't have valid applications. It does (on both counts) but it's not a surrogate or a stopgap for writing correct queries.
From Why I Hate DISTINCT:
Where things start to go sour in my opinion is when a developer is building a substantial query, joining tables together, and all of a sudden he realizes that it looks like he is getting duplicate (or even more) rows and his immediate response... his "solution" to this "problem" is to throw on the DISTINCT keyword and POOF all his troubles go away.
5. Favouring aggregation over joins
Another common mistake by database application developers is to not realize how much more expensive aggregation (ie the GROUP BY clause) can be compared to joins.
To give you an idea of how widespread this is, I've written on this topic several times here and been downvoted a lot for it. For example:
From SQL statement - “join” vs “group by and having”:
First query:
SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3
Query time: 0.312 s
Second query:
SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
AND t1.roleid = 1
Query time: 0.016 s
That's right. The join version I proposed is twenty times faster than the aggregate version.
6. Not simplifying complex queries through views
Not all database vendors support views but for those that do, they can greatly simplify queries if used judiciously. For example, on one project I used a generic Party model for CRM. This is an extremely powerful and flexible modelling technique but can lead to many joins. In this model there were:
Party: people and organisations;
Party Role: things those parties did, for example Employee and Employer;
Party Role Relationship: how those roles related to each other.
Example:
Ted is a Person, being a subtype of Party;
Ted has many roles, one of which is Employee;
Intel is an organisation, being a subtype of a Party;
Intel has many roles, one of which is Employer;
Intel employs Ted, meaning there is a relationship between their respective roles.
So there are five tables joined to link Ted to his employer. You assume all employees are Persons (not organisations) and provide this helper view:
CREATE VIEW vw_employee AS
SELECT p.title, p.given_names, p.surname, p.date_of_birth, p2.party_name employer_name
FROM person p
JOIN party py ON py.id = p.id
JOIN party_role child ON p.id = child.party_id
JOIN party_role_relationship prr ON child.id = prr.child_id AND prr.type = 'EMPLOYMENT'
JOIN party_role parent ON parent.id = prr.parent_id
JOIN party p2 ON parent.party_id = p2.id
And suddenly you have a very simple view of the data you want but on a highly flexible data model.
7. Not sanitizing input
This is a huge one. Now I like PHP but if you don't know what you're doing it's really easy to create sites vulnerable to attack. Nothing sums it up better than the story of little Bobby Tables.
Data provided by the user by way of URLs, form data and cookies should always be treated as hostile and sanitized. Make sure you're getting what you expect.
8. Not using prepared statements
Prepared statements are when you compile a query minus the data used in inserts, updates and WHERE clauses and then supply that later. For example:
SELECT * FROM users WHERE username = 'bob'
vs
SELECT * FROM users WHERE username = ?
or
SELECT * FROM users WHERE username = :username
depending on your platform.
I've seen databases brought to their knees by doing this. Basically, each time any modern database encounters a new query it has to compile it. If it encounters a query it's seen before, you're giving the database the opportunity to cache the compiled query and the execution plan. By doing the query a lot you're giving the database the opportunity to figure that out and optimize accordingly (for example, by pinning the compiled query in memory).
Using prepared statements will also give you meaningful statistics about how often certain queries are used.
Prepared statements will also better protect you against SQL injection attacks.
9. Not normalizing enough
Database normalization is basically the process of optimizing database design or how you organize your data into tables.
Just this week I ran across some code where someone had imploded an array and inserted it into a single field in a database. Normalizing that would be to treat each element of that array as a separate row in a child table (ie a one-to-many relationship); a sketch of the fix follows the links below.
This also came up in Best method for storing a list of user IDs:
I've seen in other systems that the list is stored in a serialized PHP array.
But lack of normalization comes in many forms.
More:
Normalization: How far is far enough?
SQL by Design: Why You Need Database Normalization
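A hedged sketch of the fix for the imploded-array case (hypothetical names): one row per element in a child table instead of a delimited string.
-- instead of: insert into user_group (id, member_ids) values (1, '7,12,31');
create table user_group_member (
  group_id int not null,
  user_id  int not null,
  primary key (group_id, user_id)   -- add foreign keys to user_group and users as appropriate
);
insert into user_group_member (group_id, user_id) values (1, 7), (1, 12), (1, 31);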
10. Normalizing too much
This may seem like a contradiction to the previous point but normalization, like many things, is a tool. It is a means to an end and not an end in and of itself. I think many developers forget this and start treating a "means" as an "end". Unit testing is a prime example of this.
I once worked on a system that had a huge hierarchy for clients that went something like:
Licensee -> Dealer Group -> Company -> Practice -> ...
such that you had to join about 11 tables together before you could get any meaningful data. It was a good example of normalization taken too far.
More to the point, careful and considered denormalization can have huge performance benefits but you have to be really careful when doing this.
More:
Why too much Database Normalization can be a Bad Thing
How far to take normalization in database design?
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
The Mother of All Database Normalization Debates on Coding Horror
11. Using exclusive arcs
An exclusive arc is a common mistake where a table is created with two or more foreign keys where one and only one of them can be non-null. Big mistake. For one thing it becomes that much harder to maintain data integrity. After all, even with referential integrity, nothing is preventing two or more of these foreign keys from being set (complex check constraints notwithstanding).
From A Practical Guide to Relational Database Design:
We have strongly advised against exclusive arc construction wherever possible, for the good reason that they can be awkward to write code for and pose more maintenance difficulties.
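To make the shape of the mistake concrete, a sketch with hypothetical names: two nullable foreign keys of which exactly one is supposed to be set, which only a check constraint can even attempt to police.
create table payment (
  id             int primary key,
  invoice_id     int null,   -- points to invoice ...
  credit_note_id int null,   -- ... or to credit_note, but never both
  check ((invoice_id is null and credit_note_id is not null)
      or (invoice_id is not null and credit_note_id is null))
);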
12. Not doing performance analysis on queries at all
Pragmatism reigns supreme, particularly in the database world. If you're sticking to principles to the point that they've become a dogma then you've quite probably made mistakes. Take the example of the aggregate queries from above. The aggregate version might look "nice" but its performance is woeful. A performance comparison should've ended the debate (but it didn't) but more to the point: spouting such ill-informed views in the first place is ignorant, even dangerous.
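The first step costs almost nothing. A minimal sketch, assuming MySQL and the userrole example above:
explain
select userid
from userrole
where roleid in (1, 2, 3)
group by userid
having count(1) = 3;
-- read the plan: which indexes are used, how many rows are examined, any filesort or temporary table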
13. Over-reliance on UNION ALL and particularly UNION constructs
A UNION in SQL terms merely concatenates congruent data sets, meaning they have the same type and number of columns. The difference between them is that UNION ALL is a simple concatenation and should be preferred wherever possible whereas a UNION will implicitly do a DISTINCT to remove duplicate tuples.
UNIONs, like DISTINCT, have their place. There are valid applications. But if you find yourself doing a lot of them, particularly in subqueries, then you're probably doing something wrong. That might be a case of poor query construction or a poorly designed data model forcing you to do such things.
UNIONs, particularly when used in joins or dependent subqueries, can cripple a database. Try to avoid them whenever possible.
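As a small illustration (hypothetical tables): when the branches cannot overlap, or duplicates are acceptable, UNION ALL avoids the implicit DISTINCT.
-- implicit DISTINCT: the server must sort/hash the whole combined result
select customer_id from online_orders
union
select customer_id from phone_orders;
-- plain concatenation: no de-duplication work
select customer_id from online_orders
union all
select customer_id from phone_orders;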
14. Using OR conditions in queries
This might seem harmless. After all, ANDs are OK. OR should be OK too, right? Wrong. Basically an AND condition restricts the data set whereas an OR condition grows it, but not in a way that lends itself to optimisation. Particularly when the different OR conditions might intersect, thus forcing the optimizer to effectively do a DISTINCT operation on the result.
Bad:
... WHERE a = 2 OR a = 5 OR a = 11
Better:
... WHERE a IN (2, 5, 11)
Now your SQL optimizer may effectively turn the first query into the second. But it might not. Just don't do it.
15. Not designing their data model to lend itself to high-performing solutions
This is a hard point to quantify. It is typically observed by its effect. If you find yourself writing gnarly queries for relatively simple tasks or that queries for finding out relatively straightforward information are not efficient, then you probably have a poor data model.
In some ways this point summarizes all the earlier ones but it's more of a cautionary tale that doing things like query optimisation is often done first when it should be done second. First and foremost you should ensure you have a good data model before trying to optimize the performance. As Knuth said:
Premature optimization is the root of all evil
16. Incorrect use of Database Transactions
All data changes for a specific process should be atomic. I.e. If the operation succeeds, it does so fully. If it fails, the data is left unchanged. - There should be no possibility of 'half-done' changes.
Ideally, the simplest way to achieve this is that the entire system design should strive to support all data changes through single INSERT/UPDATE/DELETE statements. In this case, no special transaction handling is needed, as your database engine should do so automatically.
However, if any processes do require multiple statements be performed as a unit to keep the data in a consistent state, then appropriate Transaction Control is necessary.
Begin a Transaction before the first statement.
Commit the Transaction after the last statement.
On any error, Rollback the Transaction. And very important: don't forget to skip/abort all statements that follow after the error.
It is also recommended to pay careful attention to the subtleties of how your database connectivity layer and database engine interact in this regard.
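A minimal sketch of the pattern (hypothetical account table; exact transaction syntax varies by engine and connectivity layer):
begin;                                                    -- start the unit of work
update account set balance = balance - 100 where id = 1;
update account set balance = balance + 100 where id = 2;
commit;                                                   -- both changes become visible together
-- on any error: issue ROLLBACK and skip every remaining statement in the unit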
17. Not understanding the 'set-based' paradigm
The SQL language follows a specific paradigm suited to specific kinds of problems. Various vendor-specific extensions notwithstanding, the language struggles to deal with problems that are trivial in languages like Java, C#, Delphi etc.
This lack of understanding manifests itself in a few ways.
Inappropriately imposing too much procedural or imperative logic on the database.
Inappropriate or excessive use of cursors. Especially when a single query would suffice.
Incorrectly assuming that triggers fire once per row affected in multi-row updates.
Determine clear division of responsibility, and strive to use the appropriate tool to solve each problem.
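As a small illustration of the mindset difference (hypothetical order_item table): the row-at-a-time habit issues one statement per row, while the set-based habit expresses the whole change at once.
-- row-at-a-time thinking (often a cursor or an application loop):
--   for each item in order 42: update order_item set price = ... where id = :item_id
-- set-based thinking: one statement covers every affected row
update order_item
   set price = price * 0.90
 where order_id = 42;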
Key database design and programming mistakes made by developers
Selfish database design and usage. Developers often treat the database as their personal persistent object store without considering the needs of other stakeholders in the data. This also applies to application architects. Poor database design and data integrity make it hard for third parties working with the data and can substantially increase the system's life cycle costs. Reporting and MIS tend to be poor cousins in application design, done only as an afterthought.
Abusing denormalised data. Overdoing denormalised data and trying to maintain it within the application is a recipe for data integrity issues. Use denormalisation sparingly. Not wanting to add a join to a query is not an excuse for denormalising.
Scared of writing SQL. SQL isn't rocket science and is actually quite good at doing its job. O/R mapping layers are quite good at doing the 95% of queries that are simple and fit well into that model. Sometimes SQL is the best way to do the job.
Dogmatic 'No Stored Procedures' policies. Regardless of whether you believe stored procedures are evil, this sort of dogmatic attitude has no place on a software project.
Not understanding database design. Normalisation is your friend and it's not rocket science. Joining and cardinality are fairly simple concepts - if you're involved in database application development there's really no excuse for not understanding them.
Not using version control on the database schema
Working directly against a live database
Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
Over-use and/or dependence on stored procedures.
Some application developers see stored procedures as a direct extension of middle tier/front end code. This appears to be a common trait in Microsoft stack developers, (I'm one, but I've grown out of it) and produces many stored procedures that perform complex business logic and workflow processing. This is much better done elsewhere.
Stored procedures are useful where it has actually been proven that some real technical factor necessitates their use (for example, performance and security), such as keeping aggregation/filtering of large data sets "close to the data".
I recently had to help maintain and enhance a large Delphi desktop application of which 70% of the business logic and rules were implemented in 1400 SQL Server stored procedures (the remainder in UI event handlers). This was a nightmare, primarily due to the difficulty of introducing effective unit testing to TSQL, the lack of encapsulation and poor tools (debuggers, editors).
Working with a Java team in the past I quickly found out that often the complete opposite holds in that environment. A Java Architect once told me: "The database is for data, not code.".
These days I think it's a mistake to not consider stored procs at all, but they should be used sparingly (not by default) in situations where they provide useful benefits (see the other answers).
Number one problem? They only test on toy databases. So they have no idea that their SQL will crawl when the database gets big, and someone has to come along and fix it later (that sound you can hear is my teeth grinding).
Not using indexes.
Poor Performance Caused by Correlated Subqueries
Most of the time you want to avoid correlated subqueries. A subquery is correlated if, within the subquery, there is a reference to a column from the outer query. When this happens, the subquery is executed at least once for every row returned and could be executed more times if other conditions are applied after the condition containing the correlated subquery is applied.
Forgive the contrived example and the Oracle syntax, but let's say you wanted to find all the employees that have been hired in any of your stores since the last time the store did less than $10,000 of sales in a day.
select e.first_name, e.last_name
from employee e
where e.start_date >
(select max(ds.transaction_date)
from daily_sales ds
where ds.store_id = e.store_id and
ds.total < 10000)
The subquery in this example is correlated to the outer query by the store_id and would be executed for every employee in your system. One way that this query could be optimized is to move the subquery to an inline-view.
select e.first_name, e.last_name
from employee e,
(select ds.store_id,
max(ds.transaction_date) transaction_date
from daily_sales ds
where ds.total < 10000
group by ds.store_id) dsx
where e.store_id = dsx.store_id and
e.start_date > dsx.transaction_date
In this example, the query in the from clause is now an inline view (again, some Oracle-specific syntax) and is only executed once. Depending on your data model, this query will probably execute much faster. It would perform better than the first query as the number of employees grew. The first query could actually perform better if there were few employees and many stores (and perhaps many of the stores had no employees) and the daily_sales table was indexed on store_id. This is not a likely scenario, but it shows how a correlated query could possibly perform better than an alternative.
I've seen junior developers correlate subqueries many times and it usually has had a severe impact on performance. However, when removing a correlated subquery be sure to look at the explain plan before and after to make sure you are not making the performance worse.
In my experience:
Not communicating with experienced DBAs.
Using Access instead of a "real" database. There are plenty of great small and even free databases like SQL Express, MySQL, and SQLite that will work and scale much better. Apps often need to scale in unexpected ways.
Forgetting to set up relationships between the tables. I remember having to clean this up when I first started working at my current employer.
Using Excel for storing (huge amounts of) data.
I have seen companies holding thousands of rows and using multiple worksheets (due to the row limit of 65,536 on previous versions of Excel).
Excel is well suited for reports, data presentation and other tasks, but should not be treated as a database.
I'd like to add:
Favoring "Elegant" code over highly performing code. The code that works best against databases is often ugly to the application developer's eye.
Believing that nonsense about premature optimization. Databases must consider performance in the original design and in any subsequent development. Performance is 50% of database design (40% is data integrity and the last 10% is security) in my opinion. Databases which are not built from the bottom up to perform will perform badly once real users and real traffic are placed against the database. Premature optimization doesn't mean no optimization! It doesn't mean you should write code that will almost always perform badly because you find it easier (cursors for example which should never be allowed in a production database unless all else has failed). It means you don't need to look at squeezing out that last little bit of performance until you need to. A lot is known about what will perform better on databases, to ignore this in design and development is short-sighted at best.
Not using parameterized queries. They're pretty handy in stopping SQL Injection.
This is a specific example of not sanitizing input data, mentioned in another answer.
I hate it when developers use nested select statements, or even functions that return the result of a select statement, inside the "SELECT" portion of a query.
I'm actually surprised I don't see this anywhere else here, perhaps I overlooked it, although #adam has a similar issue indicated.
Example:
SELECT
(SELECT TOP 1 SomeValue FROM SomeTable WHERE SomeDate = c.Date ORDER BY SomeValue desc) As FirstVal
,(SELECT OtherValue FROM SomeOtherTable WHERE SomeOtherCriteria = c.Criteria) As SecondVal
FROM
MyTable c
In this scenario, if MyTable returns 10000 rows the result is as if the query just ran 20001 queries, since it had to run the initial query plus query each of the other tables once for each line of result.
Developers can get away with this working in a development environment where they are only returning a few rows of data and the sub tables usually only have a small amount of data, but in a production environment this kind of query becomes increasingly costly as more data is added to the tables.
A better (not necessarily perfect) example would be something like:
SELECT
s.SomeValue As FirstVal
,o.OtherValue As SecondVal
FROM
MyTable c
LEFT JOIN (
SELECT SomeDate, MAX(SomeValue) as SomeValue
FROM SomeTable
GROUP BY SomeDate
) s ON c.Date = s.SomeDate
LEFT JOIN SomeOtherTable o ON c.Criteria = o.SomeOtherCriteria
This allows the database optimizer to shuffle the data together rather than re-query each record from the main table. When I have to fix code where this problem has been created, I usually end up increasing the speed of queries by 100% or more while simultaneously reducing CPU and memory usage.
For SQL-based databases:
Not taking advantage of CLUSTERED INDEXES or choosing the wrong column(s) to CLUSTER.
Not using a SERIAL (autonumber) datatype as a PRIMARY KEY to join to a FOREIGN KEY (INT) in a parent/child table relationship.
Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
Not reorganizing (i.e. unloading, dropping, re-creating, loading and re-indexing) tables when many rows have been inserted or deleted (some engines physically keep deleted rows in a table with a delete flag.)
Not taking advantage of FRAGMENT ON EXPRESSION (if supported) on large tables which have high transaction rates.
Choosing the wrong datatype for a column!
Not choosing a proper column name.
Not adding new columns at the end of the table.
Not creating proper indexes to support frequently used queries.
Creating indexes on columns with few possible values and creating unnecessary indexes.
...more to be added.
Not taking a backup before fixing some issue inside production database.
Using DDL commands on stored objects (like tables, views) in stored procedures.
Fear of using stored proc or fear of using ORM queries wherever the one is more efficient/appropriate to use.
Ignoring the use of a database profiler, which can tell you exactly what your ORM query is finally converted into, and hence lets you verify the logic, or even debug when not using an ORM.

Resources