Updating a table with INSERT vs UPDATE (performance vs clearness) - sql-server

I update my big table incrementally. My big table is a result of a join of many tables. There are so many joins of so tables in my query that it became maze-like and I stumbled on a problem while adding another join to get another column to the table.
Do you recommend to make the query even more complex by adding another join - all this for making then one insert to the table. Or do you recommend to split the query into two queries - one insert and then update. In a second query just update the column based on a query.
I could imagine that one insert of all columns could be faster then small insert of few columns accompanied by update of the rest columns. But the number of tables in a join has its limits at least in the terms of clearness. What are the best practices here.

Have you heard of the MERGE statement? Handles inserts/updates/deletes in the same query. Aaron Bertrand has an article talking about the pitfalls of it but overall, it helps simplify when you have to do any of those operations: http://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/
As for your question, maintenance-wise and index tuning wise, it would be much more advantageous to break those two operations out. Adding another join to a table does not make a query complex. That is how an RDBMS is supposed to function. Well commented code and concise naming standards should make any joins clear to someone reading the query.
If adding another join complicates the update/insert process, then yes, break them out. If it is just a matter of deciphering what the query is doing but the operation is fine, give it some context with comments.
Asking for best practices on something like this can tend to have people say 'it depends'. It is kind of a broad subject and can be handled multiple ways, depending on what your situation is like. Someone with a small row count that isn't expecting it to get large could get away with a somewhat expensive operation. Someone with a large row count (read millions-billions) would most likely be looking for a new job if they brought the server to its knees all the time.

Related

Can joining with an iTVF be as fast as joining with a temp table?

Scenario
Quick background on this one: I am attempting to optimize the use of an inline table-valued function uf_GetVisibleCustomers(#cUserId). The iTVF wraps a view CustomerView and filters out all rows containing data for customers whom the provided requesting user is not permitted to see. This way, should selection criteria ever change in the future for certain user types, we won't have to implement that new condition a hundred times (hyperbole) all over the SQL codebase.
Performance is not great, however, so I want to fix that before encouraging use of the iTVF. Changed database object names here just so it's easier to demonstrate (hopefully).
Queries
In attempting to optimize our iTVF uf_GetVisibleCustomers, I've noticed that the following SQL …
CREATE TABLE #tC ( idCustomer INT )
INSERT #tC
SELECT idCustomer
FROM [dbo].[uf_GetVisibleCustomers]('requester')
SELECT T.fAmount
FROM [Transactions] T
JOIN #tC C ON C.idCustomer = T.idCustomer
… is orders of magnitude faster than my original (IMO more readable, likely to be used) SQL here…
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
I don't get why this is. The former (top block of SQL) returns ~700k rows in 17 seconds on a fairly modest development server. The latter (second block of SQL) returns the same number of rows in about ten minutes when there is no other user activity on the server. Maybe worth noting that there is a WHERE clause, however I have omitted it here for simplicity; it is the same for both queries.
Execution Plan
Below is the execution plan for the first. It enjoys automatic parallelism as mentioned while the latter query isn't worth displaying here because it's just massive (expands the entire iTVF and underlying view, subqueries). Anyway, the latter also does not execute in parallel (AFAIK) to any extent.
My Questions
Is it possible to achieve performance comparable to the first block without a temp table?
That is, with the relative simplicity and human-readability of the slower SQL.
Why is a join to a temp table faster than a join to iTVF?
Why is it faster to use a temp table than an in-memory table populated the same way?
Beyond those explicit questions, if someone can point me in the right direction toward understanding this better in general then I would be very grateful.
Without seeing the DDL for your inline function - it's hard to say what the issue is. It would also help to see the actual execution plans for both queries (perhaps you could try: https://www.brentozar.com/pastetheplan/). That said, I can offer some food for thought.
As you mentioned, the iTVF accesses the underlying tables, views and associated indexes. If your statistics are not up-to-date you can get a bad plan, that won't happen with your temp table. On that note, too, how long does it take to populate that temp table?
Another thing to look at (again, this is why DDL is helpful) is: are the data type's the same for Transactions.idCustomer and #TC.idCustomer? I see a hash match in the plan you posted which seems bad for a join between two IDs (a nested loops or merge join would be better). This could be slowing both queries down but would appear to have a more dramatic impact on the query that leverages your iTVF.
Again this ^^^ is speculation based on my experience. A couple quick things to try (not as a perm fix but for troubleshooting):
1. Check to see if re-compiling your query when using the iTVF speeds things up (this would be a sign of a bad stats or a bad execution plan being cached and re-used)
2. Try forcing a parallel plan for the iTVF query. You can do this by adding OPTION (QUERYTRACEON 8649) to the end of your query of by using make_parallel() by Adam Machanic.

Sql server Index Size is 100 GB [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
What are common database development mistakes made by application developers?
1. Not using appropriate indices
This is a relatively easy one but still it happens all the time. Foreign keys should have indexes on them. If you're using a field in a WHERE you should (probably) have an index on it. Such indexes should often cover multiple columns based on the queries you need to execute.
2. Not enforcing referential integrity
Your database may vary here but if your database supports referential integrity--meaning that all foreign keys are guaranteed to point to an entity that exists--you should be using it.
It's quite common to see this failure on MySQL databases. I don't believe MyISAM supports it. InnoDB does. You'll find people who are using MyISAM or those that are using InnoDB but aren't using it anyway.
More here:
How important are constraints like NOT NULL and FOREIGN KEY if I’ll always control my database input with php?
Are foreign keys really necessary in a database design?
Are foreign keys really necessary in a database design?
3. Using natural rather than surrogate (technical) primary keys
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What's the best practice for primary keys in tables?
Which format of primary key would you use in this situation.
Surrogate vs. natural/business keys
Should I have a dedicated primary key field?
This is a somewhat controversial topic on which you won't get universal agreement. While you may find some people, who think natural keys are in some situations OK, you won't find any criticism of surrogate keys other than being arguably unnecessary. That's quite a small downside if you ask me.
Remember, even countries can cease to exist (for example, Yugoslavia).
4. Writing queries that require DISTINCT to work
You often see this in ORM-generated queries. Look at the log output from Hibernate and you'll see all the queries begin with:
SELECT DISTINCT ...
This is a bit of a shortcut to ensuring you don't return duplicate rows and thus get duplicate objects. You'll sometimes see people doing this as well. If you see it too much it's a real red flag. Not that DISTINCT is bad or doesn't have valid applications. It does (on both counts) but it's not a surrogate or a stopgap for writing correct queries.
From Why I Hate DISTINCT:
Where things start to go sour in my
opinion is when a developer is
building substantial query, joining
tables together, and all of a sudden
he realizes that it looks like he is
getting duplicate (or even more) rows
and his immediate response...his
"solution" to this "problem" is to
throw on the DISTINCT keyword and POOF
all his troubles go away.
5. Favouring aggregation over joins
Another common mistake by database application developers is to not realize how much more expensive aggregation (ie the GROUP BY clause) can be compared to joins.
To give you an idea of how widespread this is, I've written on this topic several times here and been downvoted a lot for it. For example:
From SQL statement - “join” vs “group by and having”:
First query:
SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3
Query time: 0.312 s
Second query:
SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
AND t1.roleid = 1
Query time: 0.016 s
That's right. The join version I
proposed is twenty times faster than
the aggregate version.
6. Not simplifying complex queries through views
Not all database vendors support views but for those that do, they can greatly simplify queries if used judiciously. For example, on one project I used a generic Party model for CRM. This is an extremely powerful and flexible modelling technique but can lead to many joins. In this model there were:
Party: people and organisations;
Party Role: things those parties did, for example Employee and Employer;
Party Role Relationship: how those roles related to each other.
Example:
Ted is a Person, being a subtype of Party;
Ted has many roles, one of which is Employee;
Intel is an organisation, being a subtype of a Party;
Intel has many roles, one of which is Employer;
Intel employs Ted, meaning there is a relationship between their respective roles.
So there are five tables joined to link Ted to his employer. You assume all employees are Persons (not organisations) and provide this helper view:
CREATE VIEW vw_employee AS
SELECT p.title, p.given_names, p.surname, p.date_of_birth, p2.party_name employer_name
FROM person p
JOIN party py ON py.id = p.id
JOIN party_role child ON p.id = child.party_id
JOIN party_role_relationship prr ON child.id = prr.child_id AND prr.type = 'EMPLOYMENT'
JOIN party_role parent ON parent.id = prr.parent_id = parent.id
JOIN party p2 ON parent.party_id = p2.id
And suddenly you have a very simple view of the data you want but on a highly flexible data model.
7. Not sanitizing input
This is a huge one. Now I like PHP but if you don't know what you're doing it's really easy to create sites vulnerable to attack. Nothing sums it up better than the story of little Bobby Tables.
Data provided by the user by way of URLs, form data and cookies should always be treated as hostile and sanitized. Make sure you're getting what you expect.
8. Not using prepared statements
Prepared statements are when you compile a query minus the data used in inserts, updates and WHERE clauses and then supply that later. For example:
SELECT * FROM users WHERE username = 'bob'
vs
SELECT * FROM users WHERE username = ?
or
SELECT * FROM users WHERE username = :username
depending on your platform.
I've seen databases brought to their knees by doing this. Basically, each time any modern database encounters a new query it has to compile it. If it encounters a query it's seen before, you're giving the database the opportunity to cache the compiled query and the execution plan. By doing the query a lot you're giving the database the opportunity to figure that out and optimize accordingly (for example, by pinning the compiled query in memory).
Using prepared statements will also give you meaningful statistics about how often certain queries are used.
Prepared statements will also better protect you against SQL injection attacks.
9. Not normalizing enough
Database normalization is basically the process of optimizing database design or how you organize your data into tables.
Just this week I ran across some code where someone had imploded an array and inserted it into a single field in a database. Normalizing that would be to treat element of that array as a separate row in a child table (ie a one-to-many relationship).
This also came up in Best method for storing a list of user IDs:
I've seen in other systems that the list is stored in a serialized PHP array.
But lack of normalization comes in many forms.
More:
Normalization: How far is far enough?
SQL by Design: Why You Need Database Normalization
10. Normalizing too much
This may seem like a contradiction to the previous point but normalization, like many things, is a tool. It is a means to an end and not an end in and of itself. I think many developers forget this and start treating a "means" as an "end". Unit testing is a prime example of this.
I once worked on a system that had a huge hierarchy for clients that went something like:
Licensee -> Dealer Group -> Company -> Practice -> ...
such that you had to join about 11 tables together before you could get any meaningful data. It was a good example of normalization taken too far.
More to the point, careful and considered denormalization can have huge performance benefits but you have to be really careful when doing this.
More:
Why too much Database Normalization can be a Bad Thing
How far to take normalization in database design?
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
The Mother of All Database Normalization Debates on Coding Horror
11. Using exclusive arcs
An exclusive arc is a common mistake where a table is created with two or more foreign keys where one and only one of them can be non-null. Big mistake. For one thing it becomes that much harder to maintain data integrity. After all, even with referential integrity, nothing is preventing two or more of these foreign keys from being set (complex check constraints notwithstanding).
From A Practical Guide to Relational Database Design:
We have strongly advised against exclusive arc construction wherever
possible, for the good reason that they can be awkward to write code
and pose more maintenance difficulties.
12. Not doing performance analysis on queries at all
Pragmatism reigns supreme, particularly in the database world. If you're sticking to principles to the point that they've become a dogma then you've quite probably made mistakes. Take the example of the aggregate queries from above. The aggregate version might look "nice" but its performance is woeful. A performance comparison should've ended the debate (but it didn't) but more to the point: spouting such ill-informed views in the first place is ignorant, even dangerous.
13. Over-reliance on UNION ALL and particularly UNION constructs
A UNION in SQL terms merely concatenates congruent data sets, meaning they have the same type and number of columns. The difference between them is that UNION ALL is a simple concatenation and should be preferred wherever possible whereas a UNION will implicitly do a DISTINCT to remove duplicate tuples.
UNIONs, like DISTINCT, have their place. There are valid applications. But if you find yourself doing a lot of them, particularly in subqueries, then you're probably doing something wrong. That might be a case of poor query construction or a poorly designed data model forcing you to do such things.
UNIONs, particularly when used in joins or dependent subqueries, can cripple a database. Try to avoid them whenever possible.
14. Using OR conditions in queries
This might seem harmless. After all, ANDs are OK. OR should be OK too right? Wrong. Basically an AND condition restricts the data set whereas an OR condition grows it but not in a way that lends itself to optimisation. Particularly when the different OR conditions might intersect thus forcing the optimizer to effectively to a DISTINCT operation on the result.
Bad:
... WHERE a = 2 OR a = 5 OR a = 11
Better:
... WHERE a IN (2, 5, 11)
Now your SQL optimizer may effectively turn the first query into the second. But it might not. Just don't do it.
15. Not designing their data model to lend itself to high-performing solutions
This is a hard point to quantify. It is typically observed by its effect. If you find yourself writing gnarly queries for relatively simple tasks or that queries for finding out relatively straightforward information are not efficient, then you probably have a poor data model.
In some ways this point summarizes all the earlier ones but it's more of a cautionary tale that doing things like query optimisation is often done first when it should be done second. First and foremost you should ensure you have a good data model before trying to optimize the performance. As Knuth said:
Premature optimization is the root of all evil
16. Incorrect use of Database Transactions
All data changes for a specific process should be atomic. I.e. If the operation succeeds, it does so fully. If it fails, the data is left unchanged. - There should be no possibility of 'half-done' changes.
Ideally, the simplest way to achieve this is that the entire system design should strive to support all data changes through single INSERT/UPDATE/DELETE statements. In this case, no special transaction handling is needed, as your database engine should do so automatically.
However, if any processes do require multiple statements be performed as a unit to keep the data in a consistent state, then appropriate Transaction Control is necessary.
Begin a Transaction before the first statement.
Commit the Transaction after the last statement.
On any error, Rollback the Transaction. And very NB! Don't forget to skip/abort all statements that follow after the error.
Also recommended to pay careful attention to the subtelties of how your database connectivity layer, and database engine interact in this regard.
17. Not understanding the 'set-based' paradigm
The SQL language follows a specific paradigm suited to specific kinds of problems. Various vendor-specific extensions notwithstanding, the language struggles to deal with problems that are trivial in langues like Java, C#, Delphi etc.
This lack of understanding manifests itself in a few ways.
Inappropriately imposing too much procedural or imperative logic on the databse.
Inappropriate or excessive use of cursors. Especially when a single query would suffice.
Incorrectly assuming that triggers fire once per row affected in multi-row updates.
Determine clear division of responsibility, and strive to use the appropriate tool to solve each problem.
Key database design and programming mistakes made by developers
Selfish database design and usage. Developers often treat the database as their personal persistent object store without considering the needs of other stakeholders in the data. This also applies to application architects. Poor database design and data integrity makes it hard for third parties working with the data and can substantially increase the system's life cycle costs. Reporting and MIS tends to be a poor cousin in application design and only done as an afterthought.
Abusing denormalised data. Overdoing denormalised data and trying to maintain it within the application is a recipe for data integrity issues. Use denormalisation sparingly. Not wanting to add a join to a query is not an excuse for denormalising.
Scared of writing SQL. SQL isn't rocket science and is actually quite good at doing its job. O/R mapping layers are quite good at doing the 95% of queries that are simple and fit well into that model. Sometimes SQL is the best way to do the job.
Dogmatic 'No Stored Procedures' policies. Regardless of whether you believe stored procedures are evil, this sort of dogmatic attitude has no place on a software project.
Not understanding database design. Normalisation is your friend and it's not rocket science. Joining and cardinality are fairly simple concepts - if you're involved in database application development there's really no excuse for not understanding them.
Not using version control on the database schema
Working directly against a live database
Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
Over-use and/or dependence on stored procedures.
Some application developers see stored procedures as a direct extension of middle tier/front end code. This appears to be a common trait in Microsoft stack developers, (I'm one, but I've grown out of it) and produces many stored procedures that perform complex business logic and workflow processing. This is much better done elsewhere.
Stored procedures are useful where it has actuallly been proven that some real technical factor necessitates their use (for example, performance and security) For example, keeping aggregation/filtering of large data sets "close to the data".
I recently had to help maintain and enhance a large Delphi desktop application of which 70% of the business logic and rules were implemented in 1400 SQL Server stored procedures (the remainder in UI event handlers). This was a nightmare, primarily due to the difficuly of introducing effective unit testing to TSQL, lack of encapsulation and poor tools (Debuggers, editors).
Working with a Java team in the past I quickly found out that often the complete opposite holds in that environment. A Java Architect once told me: "The database is for data, not code.".
These days I think it's a mistake to not consider stored procs at all, but they should be used sparingly (not by default) in situations where they provide useful benefits (see the other answers).
Number one problem? They only test on toy databases. So they have no idea that their SQL will crawl when the database gets big, and someone has to come along and fix it later (that sound you can hear is my teeth grinding).
Not using indexes.
Poor Performance Caused by Correlated Subqueries
Most of the time you want to avoid correlated subqueries. A subquery is correlated if, within the subquery, there is a reference to a column from the outer query. When this happens, the subquery is executed at least once for every row returned and could be executed more times if other conditions are applied after the condition containing the correlated subquery is applied.
Forgive the contrived example and the Oracle syntax, but let's say you wanted to find all the employees that have been hired in any of your stores since the last time the store did less than $10,000 of sales in a day.
select e.first_name, e.last_name
from employee e
where e.start_date >
(select max(ds.transaction_date)
from daily_sales ds
where ds.store_id = e.store_id and
ds.total < 10000)
The subquery in this example is correlated to the outer query by the store_id and would be executed for every employee in your system. One way that this query could be optimized is to move the subquery to an inline-view.
select e.first_name, e.last_name
from employee e,
(select ds.store_id,
max(s.transaction_date) transaction_date
from daily_sales ds
where ds.total < 10000
group by s.store_id) dsx
where e.store_id = dsx.store_id and
e.start_date > dsx.transaction_date
In this example, the query in the from clause is now an inline-view (again some Oracle specific syntax) and is only executed once. Depending on your data model, this query will probably execute much faster. It would perform better than the first query as the number of employees grew. The first query could actually perform better if there were few employees and many stores (and perhaps many of stores had no employees) and the daily_sales table was indexed on store_id. This is not a likely scenario but shows how a correlated query could possibly perform better than an alternative.
I've seen junior developers correlate subqueries many times and it usually has had a severe impact on performance. However, when removing a correlated subquery be sure to look at the explain plan before and after to make sure you are not making the performance worse.
In my experience:
Not communicating with experienced DBAs.
Using Access instead of a "real" database. There are plenty of great small and even free databases like SQL Express, MySQL, and SQLite that will work and scale much better. Apps often need to scale in unexpected ways.
Forgetting to set up relationships between the tables. I remember having to clean this up when I first started working at my current employer.
Using Excel for storing (huge amounts of) data.
I have seen companies holding thousands of rows and using multiple worksheets (due to the row limit of 65535 on previous versions of Excel).
Excel is well suited for reports, data presentation and other tasks, but should not be treated as a database.
I'd like to add:
Favoring "Elegant" code over highly performing code. The code that works best against databases is often ugly to the application developer's eye.
Believing that nonsense about premature optimization. Databases must consider performance in the original design and in any subsequent development. Performance is 50% of database design (40% is data integrity and the last 10% is security) in my opinion. Databases which are not built from the bottom up to perform will perform badly once real users and real traffic are placed against the database. Premature optimization doesn't mean no optimization! It doesn't mean you should write code that will almost always perform badly because you find it easier (cursors for example which should never be allowed in a production database unless all else has failed). It means you don't need to look at squeezing out that last little bit of performance until you need to. A lot is known about what will perform better on databases, to ignore this in design and development is short-sighted at best.
Not using parameterized queries. They're pretty handy in stopping SQL Injection.
This is a specific example of not sanitizing input data, mentioned in another answer.
I hate it when developers use nested select statements or even functions the return the result of a select statement inside the "SELECT" portion of a query.
I'm actually surprised I don't see this anywhere else here, perhaps I overlooked it, although #adam has a similar issue indicated.
Example:
SELECT
(SELECT TOP 1 SomeValue FROM SomeTable WHERE SomeDate = c.Date ORDER BY SomeValue desc) As FirstVal
,(SELECT OtherValue FROM SomeOtherTable WHERE SomeOtherCriteria = c.Criteria) As SecondVal
FROM
MyTable c
In this scenario, if MyTable returns 10000 rows the result is as if the query just ran 20001 queries, since it had to run the initial query plus query each of the other tables once for each line of result.
Developers can get away with this working in a development environment where they are only returning a few rows of data and the sub tables usually only have a small amount of data, but in a production environment, this kind of query can become exponentially costly as more data is added to the tables.
A better (not necessarily perfect) example would be something like:
SELECT
s.SomeValue As FirstVal
,o.OtherValue As SecondVal
FROM
MyTable c
LEFT JOIN (
SELECT SomeDate, MAX(SomeValue) as SomeValue
FROM SomeTable
GROUP BY SomeDate
) s ON c.Date = s.SomeDate
LEFT JOIN SomeOtherTable o ON c.Criteria = o.SomeOtherCriteria
This allows database optimizers to shuffle the data together, rather than requery on each record from the main table and I usually find when I have to fix code where this problem has been created, I usually end up increasing the speed of queries by 100% or more while simultaneously reducing CPU and memory usage.
For SQL-based databases:
Not taking advantage of CLUSTERED INDEXES or choosing the wrong column(s) to CLUSTER.
Not using a SERIAL (autonumber) datatype as a PRIMARY KEY to join to a FOREIGN KEY (INT) in a parent/child table relationship.
Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
Not reorganizing (i.e. unloading, droping, re-creating, loading and re-indexing) tables when many rows have been inserted or deleted (some engines physically keep deleted rows in a table with a delete flag.)
Not taking advantage of FRAGMENT ON EXPRESSION (if supported) on large tables which have high transaction rates.
Choosing the wrong datatype for a column!
Not choosing a proper column name.
Not adding new columns at the end of the table.
Not creating proper indexes to support frequently used queries.
creating indexes on columns with few possible values and creating unnecessary indexes.
...more to be added.
Not taking a backup before fixing some issue inside production database.
Using DDL commands on stored objects(like tables, views) in stored procedures.
Fear of using stored proc or fear of using ORM queries wherever the one is more efficient/appropriate to use.
Ignoring the use of a database profiler, which can tell you exactly what your ORM query is being converted into finally and hence verify the logic or even for debugging when not using ORM.
Not doing the correct level of normalization. You want to make sure that data is not duplicated, and that you are splitting data into different as needed. You also need to make sure you are not following normalization too far as that will hurt performance.
Treating the database as just a storage mechanism (i.e. glorified collections library) and hence subordinate to their application (ignoring other applications which share the data)
Dismissing an ORM like Hibernate out of hand, for reasons like "it's too magical" or "not on my database".
Relying too heavily on an ORM like Hibernate and trying to shoehorn it in where it isn't appropriate.
1 - Unnecessarily using a function on a value in a where clause with the result of that index not being used.
Example:
where to_char(someDate,'YYYYMMDD') between :fromDate and :toDate
instead of
where someDate >= to_date(:fromDate,'YYYYMMDD') and someDate < to_date(:toDate,'YYYYMMDD')+1
And to a lesser extent: Not adding functional indexes to those values that need them...
2 - Not adding check constraints to ensure the validity of the data. Constraints can be used by the query optimizer, and they REALLY help to ensure that you can trust your invariants. There's just no reason not to use them.
3 - Adding unnormalized columns to tables out of pure laziness or time pressure. Things are usually not designed this way, but evolve into this. The end result, without fail, is a ton of work trying to clean up the mess when you're bitten by the lost data integrity in future evolutions.
Think of this, a table without data is very cheap to redesign. A table with a couple of millions records with no integrity... not so cheap to redesign. Thus, doing the correct design when creating the column or table is amortized in spades.
4 - not so much about the database per se but indeed annoying. Not caring about the code quality of SQL. The fact that your SQL is expressed in text does not make it OK to hide the logic in heaps of string manipulation algorithms. It is perfectly possible to write SQL in text in a manner that is actually readable by your fellow programmer.
This has been said before, but: indexes, indexes, indexes. I've seen so many cases of poorly performing enterprise web apps that were fixed by simply doing a little profiling (to see which tables were being hit a lot), and then adding an index on those tables. This doesn't even require much in the way of SQL writing knowledge, and the payoff is huge.
Avoid data duplication like the plague. Some people advocate that a little duplication won't hurt, and will improve performance. Hey, I'm not saying that you have to torture your schema into Third Normal Form, until it's so abstract that not even the DBA's know what's going on. Just understand that whenever you duplicate a set of names, or zipcodes, or shipping codes, the copies WILL fall out of synch with each other eventually. It WILL happen. And then you'll be kicking yourself as you run the weekly maintenance script.
And lastly: use a clear, consistent, intuitive naming convention. In the same way that a well written piece of code should be readable, a good SQL schema or query should be readable and practically tell you what it's doing, even without comments. You'll thank yourself in six months, when you have to to maintenance on the tables. "SELECT account_number, billing_date FROM national_accounts" is infinitely easier to work with than "SELECT ACCNTNBR, BILLDAT FROM NTNLACCTS".
Not executing a corresponding SELECT query before running the DELETE query (particularly on production databases)!
The most common mistake I've seen in twenty years: not planning ahead. Many developers will create a database, and tables, and then continually modify and expand the tables as they build out the applications. The end result is often a mess and inefficient and difficult to clean up or simplify later on.
a) Hardcoding query values in string
b) Putting the database query code in the "OnButtonPress" action in a Windows Forms application
I have seen both.
Not paying enough attention towards managing database connections in your application. Then you find out the application, the computer, the server, and the network is clogged.
Thinking that they are DBAs and data modelers/designers when they have no formal indoctrination of any kind in those areas.
Thinking that their project doesn't require a DBA because that stuff is all easy/trivial.
Failure to properly discern between work that should be done in the database, and work that should be done in the app.
Not validating backups, or not backing up.
Embedding raw SQL in their code.
Here is a link to video called ‘Classic Database Development Mistakes and five ways to overcome them’ by Scott Walz
Not having an understanding of the databases concurrency model and how this affects development. It's easy to add indexes and tweak queries after the fact. However applications designed without proper consideration for hotspots, resource contention
and correct operation (Assuming what you just read is still valid!) can require significant changes within the database and application tier to correct later.
Not understanding how a DBMS works under the hood.
You cannot properly drive a stick without understanding how a clutch works. And you cannot understand how to use a Database without understanding that you are really just writing to a file on your hard disk.
Specifically:
Do you know what a Clustered Index is? Did you think about it when you designed your schema?
Do you know how to use indexes properly? How to reuse an index? Do you know what a Covering Index is?
So great, you have indexes. How big is 1 row in your index? How big will the index be when you have a lot of data? Will that fit easily into memory? If it won't it's useless as an index.
Have you ever used EXPLAIN in MySQL? Great. Now be honest with yourself: Did you understand even half of what you saw? No, you probably didn't. Fix that.
Do you understand the Query Cache? Do you know what makes a query un-cachable?
Are you using MyISAM? If you NEED full text search, MyISAM's is crap anyway. Use Sphinx. Then switch to Inno.
Using an ORM to do bulk updates
Selecting more data than needed. Again, typically done when using an ORM
Firing sqls in a loop.
Not having good test data and noticing performance degradation only on live data.

How to Pre Stage Queries on SQL Server?

I've looked for tips of how to speed up sql intensive application and found this particularly useful link.
In point 6 he says
Do pre-stage data This is one of my favorite topics because it's an old technique that's often overlooked. If you have a report or a
procedure (or better yet, a set of them) that will do similar joins to
large tables, it can be a benefit for you to pre-stage the data by
joining the tables ahead of time and persisting them into a table. Now
the reports can run against that pre-staged table and avoid the large
join.
You're not always able to use this technique, but when you can, you'll
find it is an excellent way to save server resources.
Note that many developers get around this join problem by
concentrating on the query itself and creating a view-only around the
join so that they don't have to type the join conditions again and
again. But the problem with this approach is that the query still runs
for every report that needs it. By pre-staging the data, you run the
join just once (say, 10 minutes before the reports) and everyone else
avoids the big join. I can't tell you how much I love this technique;
in most environments, there are popular tables that get joined all the
time, so there's no reason why they can't be pre-staged
From what I understood you join the tables once and several SQL Queries can "benefit" from it. That looks extremely interesting for the application I'm working at.
The thing is, I've been looking for pre staging data around and couldn't find anything that seems to be related to that technique. Maybe I'm missing a few keywords ?
I'd like to know how to use the described technique within SQL Server. The link says it's an old technique so it shouldn't be a problem that I'm using SQL Server 2008.
What I would like is the following: I have several SELECT queries that run in a row. All of them join the same 7-8 tables and they're all really heavy, that impacts performance. So I'm thinking of joining them, run the queries and them drop this intermediate table. How/Can it be done ?
If your query meets the requirements for an indexed view then you can just create such an object materialising the result of your query. This means that it will always be up-to-date.
Otherwise you would need to write code to materialise it yourself. Either eagerly on a schedule or potentially on first request and then cached for some amount of time that you deem acceptable.
The second approach is not rocket science and can be done with a TRUNCATE ... INSERT ... SELECT for moderate amounts of data or perhaps an ALTER TABLE ... SWITCH if the data is large and there are concurrent queries that might be accessing the cached result set.
Regarding your edit it seems like you just need to create a #temp table and insert the results of the join into it though. Then reference the #temp table for the several SELECTs and drop it. There is no guarantee that this will improve performance though and insufficient details in the question to even hazard a guess.

How to deal with extremely wide tables in SQL server?

I have encountered the following dilemma several times and would be interested to hear how others have addressed this issue or if there is a canonical way that the situation can be addressed.
In some domains, one is naturally led to consider very wide tables. Take, for instance, time series surveys that evolve over many years. Such surveys can have hundreds, if not thousands, of variables. Typically though there are probably only a few thousand or tens-of-thousands of rows. It is absolutely natural to consider such a result set as a table where each variable corresponds to a column in the table however, in SQL Server at least, one is limited to 1024 (non sparse) columns.
The obvious workarounds are to
Distribute each record over multiple tables
Stuff the data into a single table with columns of say, ResponseId, VariableName, ResponseValue
Number 2. I think is very bad for a number of reasons (difficult to query, suboptimal storage, etc) so really the first choice is the only viable option I see. This choice can be improved perhaps by grouping columns that are likely to be queried together into the same table - but one can't really know this until the database is actually being used.
So, my basic question is: Are there better ways to handle this situation?
You might want to put a view in front of the tables to make them appear as if they are a single table. The upside is that you can rearrange the storage later without queries needing to change. The downside is that only modifications to the base table can be done through the view. If necessary, you could mitigate this with stored procedures for frequently used modifications. Based on your use case of time series surveys, it sounds like inserts and selects are far more frequent than updates or deletes, so this might be a viable way to stay flexible without forcing your clients to update if you need to rearrange things later.
Hmmm it really depends on what you do with it. If you want to keep the table as wide as it is (possibly this is for OLAP or data warehouse), I would just use proper indexes. Also based on the columns that are selected more often , I could also use covering indexes. Based on the rows that are searched more often, I could also use filtered indexes. If there are, let’s say, billions of records in the table, you could partition the table as well. If you just want to store the table over multiple tables, definitely use proper normalization techniques, probably up to 3NF or 3.5NF, to divide the big table into smaller tables. I would use the first method of yours, normalization, to store data for the big table just because it seems like it makes sense better to me that way.
This is an old topic but something we are currently working on resolving. Neither of the above answers really give as many benefits as the solution we feel we have found.
We previously believed that having wide tables wasn't really a problem. Having spent time analysing this we have seen the light and realise that costs on inserts/updates are indeed getting out of control.
As John states above, the solution is really to create a VIEW to provide your application with a consistent schema. One of the challenges in any redesign may be, as in our case, that you have thousands or millions of lines of code referencing an old wide table and you may want to provide backwards compatibility.
Views can also be used for UPDATES and INSERTS as John alludes to, however a problem we found initially was that if you take the example of myWideTable which may have hundreds of columns and you want to split this to myWideTable_a with columns a, b and c and myWideTable_b with columns x, y and z then an insert to a view which only sets column a will only insert a record for myWideTable_a
This causes a problem when you want to later update your record and set myWideTable.z as this will fail.
The solution we're adopting, and performance testing, is to have an 'insteadof' trigger on the View insert to always insert to our split-tables so that we can continue to update or read from the view with impunity.
The question as to whether using this trigger on inserts provides more overhead than a wide table is still open, but it is clear that it will improve subsequent writes to columns in each split table.

Database development mistakes made by application developers [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
What are common database development mistakes made by application developers?
1. Not using appropriate indices
This is a relatively easy one but still it happens all the time. Foreign keys should have indexes on them. If you're using a field in a WHERE you should (probably) have an index on it. Such indexes should often cover multiple columns based on the queries you need to execute.
2. Not enforcing referential integrity
Your database may vary here but if your database supports referential integrity--meaning that all foreign keys are guaranteed to point to an entity that exists--you should be using it.
It's quite common to see this failure on MySQL databases. I don't believe MyISAM supports it. InnoDB does. You'll find people who are using MyISAM or those that are using InnoDB but aren't using it anyway.
More here:
How important are constraints like NOT NULL and FOREIGN KEY if I’ll always control my database input with php?
Are foreign keys really necessary in a database design?
Are foreign keys really necessary in a database design?
3. Using natural rather than surrogate (technical) primary keys
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What's the best practice for primary keys in tables?
Which format of primary key would you use in this situation.
Surrogate vs. natural/business keys
Should I have a dedicated primary key field?
This is a somewhat controversial topic on which you won't get universal agreement. While you may find some people, who think natural keys are in some situations OK, you won't find any criticism of surrogate keys other than being arguably unnecessary. That's quite a small downside if you ask me.
Remember, even countries can cease to exist (for example, Yugoslavia).
4. Writing queries that require DISTINCT to work
You often see this in ORM-generated queries. Look at the log output from Hibernate and you'll see all the queries begin with:
SELECT DISTINCT ...
This is a bit of a shortcut to ensuring you don't return duplicate rows and thus get duplicate objects. You'll sometimes see people doing this as well. If you see it too much it's a real red flag. Not that DISTINCT is bad or doesn't have valid applications. It does (on both counts) but it's not a surrogate or a stopgap for writing correct queries.
From Why I Hate DISTINCT:
Where things start to go sour in my
opinion is when a developer is
building substantial query, joining
tables together, and all of a sudden
he realizes that it looks like he is
getting duplicate (or even more) rows
and his immediate response...his
"solution" to this "problem" is to
throw on the DISTINCT keyword and POOF
all his troubles go away.
5. Favouring aggregation over joins
Another common mistake by database application developers is to not realize how much more expensive aggregation (ie the GROUP BY clause) can be compared to joins.
To give you an idea of how widespread this is, I've written on this topic several times here and been downvoted a lot for it. For example:
From SQL statement - “join” vs “group by and having”:
First query:
SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3
Query time: 0.312 s
Second query:
SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
AND t1.roleid = 1
Query time: 0.016 s
That's right. The join version I
proposed is twenty times faster than
the aggregate version.
6. Not simplifying complex queries through views
Not all database vendors support views but for those that do, they can greatly simplify queries if used judiciously. For example, on one project I used a generic Party model for CRM. This is an extremely powerful and flexible modelling technique but can lead to many joins. In this model there were:
Party: people and organisations;
Party Role: things those parties did, for example Employee and Employer;
Party Role Relationship: how those roles related to each other.
Example:
Ted is a Person, being a subtype of Party;
Ted has many roles, one of which is Employee;
Intel is an organisation, being a subtype of a Party;
Intel has many roles, one of which is Employer;
Intel employs Ted, meaning there is a relationship between their respective roles.
So there are five tables joined to link Ted to his employer. You assume all employees are Persons (not organisations) and provide this helper view:
CREATE VIEW vw_employee AS
SELECT p.title, p.given_names, p.surname, p.date_of_birth, p2.party_name employer_name
FROM person p
JOIN party py ON py.id = p.id
JOIN party_role child ON p.id = child.party_id
JOIN party_role_relationship prr ON child.id = prr.child_id AND prr.type = 'EMPLOYMENT'
JOIN party_role parent ON parent.id = prr.parent_id = parent.id
JOIN party p2 ON parent.party_id = p2.id
And suddenly you have a very simple view of the data you want but on a highly flexible data model.
7. Not sanitizing input
This is a huge one. Now I like PHP but if you don't know what you're doing it's really easy to create sites vulnerable to attack. Nothing sums it up better than the story of little Bobby Tables.
Data provided by the user by way of URLs, form data and cookies should always be treated as hostile and sanitized. Make sure you're getting what you expect.
8. Not using prepared statements
Prepared statements are when you compile a query minus the data used in inserts, updates and WHERE clauses and then supply that later. For example:
SELECT * FROM users WHERE username = 'bob'
vs
SELECT * FROM users WHERE username = ?
or
SELECT * FROM users WHERE username = :username
depending on your platform.
I've seen databases brought to their knees by doing this. Basically, each time any modern database encounters a new query it has to compile it. If it encounters a query it's seen before, you're giving the database the opportunity to cache the compiled query and the execution plan. By doing the query a lot you're giving the database the opportunity to figure that out and optimize accordingly (for example, by pinning the compiled query in memory).
Using prepared statements will also give you meaningful statistics about how often certain queries are used.
Prepared statements will also better protect you against SQL injection attacks.
9. Not normalizing enough
Database normalization is basically the process of optimizing database design or how you organize your data into tables.
Just this week I ran across some code where someone had imploded an array and inserted it into a single field in a database. Normalizing that would be to treat element of that array as a separate row in a child table (ie a one-to-many relationship).
This also came up in Best method for storing a list of user IDs:
I've seen in other systems that the list is stored in a serialized PHP array.
But lack of normalization comes in many forms.
More:
Normalization: How far is far enough?
SQL by Design: Why You Need Database Normalization
10. Normalizing too much
This may seem like a contradiction to the previous point but normalization, like many things, is a tool. It is a means to an end and not an end in and of itself. I think many developers forget this and start treating a "means" as an "end". Unit testing is a prime example of this.
I once worked on a system that had a huge hierarchy for clients that went something like:
Licensee -> Dealer Group -> Company -> Practice -> ...
such that you had to join about 11 tables together before you could get any meaningful data. It was a good example of normalization taken too far.
More to the point, careful and considered denormalization can have huge performance benefits but you have to be really careful when doing this.
More:
Why too much Database Normalization can be a Bad Thing
How far to take normalization in database design?
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
The Mother of All Database Normalization Debates on Coding Horror
11. Using exclusive arcs
An exclusive arc is a common mistake where a table is created with two or more foreign keys where one and only one of them can be non-null. Big mistake. For one thing it becomes that much harder to maintain data integrity. After all, even with referential integrity, nothing is preventing two or more of these foreign keys from being set (complex check constraints notwithstanding).
From A Practical Guide to Relational Database Design:
We have strongly advised against exclusive arc construction wherever
possible, for the good reason that they can be awkward to write code
and pose more maintenance difficulties.
12. Not doing performance analysis on queries at all
Pragmatism reigns supreme, particularly in the database world. If you're sticking to principles to the point that they've become a dogma then you've quite probably made mistakes. Take the example of the aggregate queries from above. The aggregate version might look "nice" but its performance is woeful. A performance comparison should've ended the debate (but it didn't) but more to the point: spouting such ill-informed views in the first place is ignorant, even dangerous.
13. Over-reliance on UNION ALL and particularly UNION constructs
A UNION in SQL terms merely concatenates congruent data sets, meaning they have the same type and number of columns. The difference between them is that UNION ALL is a simple concatenation and should be preferred wherever possible whereas a UNION will implicitly do a DISTINCT to remove duplicate tuples.
UNIONs, like DISTINCT, have their place. There are valid applications. But if you find yourself doing a lot of them, particularly in subqueries, then you're probably doing something wrong. That might be a case of poor query construction or a poorly designed data model forcing you to do such things.
UNIONs, particularly when used in joins or dependent subqueries, can cripple a database. Try to avoid them whenever possible.
14. Using OR conditions in queries
This might seem harmless. After all, ANDs are OK. OR should be OK too right? Wrong. Basically an AND condition restricts the data set whereas an OR condition grows it but not in a way that lends itself to optimisation. Particularly when the different OR conditions might intersect thus forcing the optimizer to effectively to a DISTINCT operation on the result.
Bad:
... WHERE a = 2 OR a = 5 OR a = 11
Better:
... WHERE a IN (2, 5, 11)
Now your SQL optimizer may effectively turn the first query into the second. But it might not. Just don't do it.
15. Not designing their data model to lend itself to high-performing solutions
This is a hard point to quantify. It is typically observed by its effect. If you find yourself writing gnarly queries for relatively simple tasks or that queries for finding out relatively straightforward information are not efficient, then you probably have a poor data model.
In some ways this point summarizes all the earlier ones but it's more of a cautionary tale that doing things like query optimisation is often done first when it should be done second. First and foremost you should ensure you have a good data model before trying to optimize the performance. As Knuth said:
Premature optimization is the root of all evil
16. Incorrect use of Database Transactions
All data changes for a specific process should be atomic. I.e. If the operation succeeds, it does so fully. If it fails, the data is left unchanged. - There should be no possibility of 'half-done' changes.
Ideally, the simplest way to achieve this is that the entire system design should strive to support all data changes through single INSERT/UPDATE/DELETE statements. In this case, no special transaction handling is needed, as your database engine should do so automatically.
However, if any processes do require multiple statements be performed as a unit to keep the data in a consistent state, then appropriate Transaction Control is necessary.
Begin a Transaction before the first statement.
Commit the Transaction after the last statement.
On any error, Rollback the Transaction. And very NB! Don't forget to skip/abort all statements that follow after the error.
Also recommended to pay careful attention to the subtelties of how your database connectivity layer, and database engine interact in this regard.
17. Not understanding the 'set-based' paradigm
The SQL language follows a specific paradigm suited to specific kinds of problems. Various vendor-specific extensions notwithstanding, the language struggles to deal with problems that are trivial in langues like Java, C#, Delphi etc.
This lack of understanding manifests itself in a few ways.
Inappropriately imposing too much procedural or imperative logic on the databse.
Inappropriate or excessive use of cursors. Especially when a single query would suffice.
Incorrectly assuming that triggers fire once per row affected in multi-row updates.
Determine clear division of responsibility, and strive to use the appropriate tool to solve each problem.
Key database design and programming mistakes made by developers
Selfish database design and usage. Developers often treat the database as their personal persistent object store without considering the needs of other stakeholders in the data. This also applies to application architects. Poor database design and data integrity makes it hard for third parties working with the data and can substantially increase the system's life cycle costs. Reporting and MIS tends to be a poor cousin in application design and only done as an afterthought.
Abusing denormalised data. Overdoing denormalised data and trying to maintain it within the application is a recipe for data integrity issues. Use denormalisation sparingly. Not wanting to add a join to a query is not an excuse for denormalising.
Scared of writing SQL. SQL isn't rocket science and is actually quite good at doing its job. O/R mapping layers are quite good at doing the 95% of queries that are simple and fit well into that model. Sometimes SQL is the best way to do the job.
Dogmatic 'No Stored Procedures' policies. Regardless of whether you believe stored procedures are evil, this sort of dogmatic attitude has no place on a software project.
Not understanding database design. Normalisation is your friend and it's not rocket science. Joining and cardinality are fairly simple concepts - if you're involved in database application development there's really no excuse for not understanding them.
Not using version control on the database schema
Working directly against a live database
Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
Over-use and/or dependence on stored procedures.
Some application developers see stored procedures as a direct extension of middle tier/front end code. This appears to be a common trait in Microsoft stack developers, (I'm one, but I've grown out of it) and produces many stored procedures that perform complex business logic and workflow processing. This is much better done elsewhere.
Stored procedures are useful where it has actuallly been proven that some real technical factor necessitates their use (for example, performance and security) For example, keeping aggregation/filtering of large data sets "close to the data".
I recently had to help maintain and enhance a large Delphi desktop application of which 70% of the business logic and rules were implemented in 1400 SQL Server stored procedures (the remainder in UI event handlers). This was a nightmare, primarily due to the difficuly of introducing effective unit testing to TSQL, lack of encapsulation and poor tools (Debuggers, editors).
Working with a Java team in the past I quickly found out that often the complete opposite holds in that environment. A Java Architect once told me: "The database is for data, not code.".
These days I think it's a mistake to not consider stored procs at all, but they should be used sparingly (not by default) in situations where they provide useful benefits (see the other answers).
Number one problem? They only test on toy databases. So they have no idea that their SQL will crawl when the database gets big, and someone has to come along and fix it later (that sound you can hear is my teeth grinding).
Not using indexes.
Poor Performance Caused by Correlated Subqueries
Most of the time you want to avoid correlated subqueries. A subquery is correlated if, within the subquery, there is a reference to a column from the outer query. When this happens, the subquery is executed at least once for every row returned and could be executed more times if other conditions are applied after the condition containing the correlated subquery is applied.
Forgive the contrived example and the Oracle syntax, but let's say you wanted to find all the employees that have been hired in any of your stores since the last time the store did less than $10,000 of sales in a day.
select e.first_name, e.last_name
from employee e
where e.start_date >
(select max(ds.transaction_date)
from daily_sales ds
where ds.store_id = e.store_id and
ds.total < 10000)
The subquery in this example is correlated to the outer query by the store_id and would be executed for every employee in your system. One way that this query could be optimized is to move the subquery to an inline-view.
select e.first_name, e.last_name
from employee e,
(select ds.store_id,
max(s.transaction_date) transaction_date
from daily_sales ds
where ds.total < 10000
group by s.store_id) dsx
where e.store_id = dsx.store_id and
e.start_date > dsx.transaction_date
In this example, the query in the from clause is now an inline-view (again some Oracle specific syntax) and is only executed once. Depending on your data model, this query will probably execute much faster. It would perform better than the first query as the number of employees grew. The first query could actually perform better if there were few employees and many stores (and perhaps many of stores had no employees) and the daily_sales table was indexed on store_id. This is not a likely scenario but shows how a correlated query could possibly perform better than an alternative.
I've seen junior developers correlate subqueries many times and it usually has had a severe impact on performance. However, when removing a correlated subquery be sure to look at the explain plan before and after to make sure you are not making the performance worse.
In my experience:
Not communicating with experienced DBAs.
Using Access instead of a "real" database. There are plenty of great small and even free databases like SQL Express, MySQL, and SQLite that will work and scale much better. Apps often need to scale in unexpected ways.
Forgetting to set up relationships between the tables. I remember having to clean this up when I first started working at my current employer.
Using Excel for storing (huge amounts of) data.
I have seen companies holding thousands of rows and using multiple worksheets (due to the row limit of 65535 on previous versions of Excel).
Excel is well suited for reports, data presentation and other tasks, but should not be treated as a database.
I'd like to add:
Favoring "Elegant" code over highly performing code. The code that works best against databases is often ugly to the application developer's eye.
Believing that nonsense about premature optimization. Databases must consider performance in the original design and in any subsequent development. Performance is 50% of database design (40% is data integrity and the last 10% is security) in my opinion. Databases which are not built from the bottom up to perform will perform badly once real users and real traffic are placed against the database. Premature optimization doesn't mean no optimization! It doesn't mean you should write code that will almost always perform badly because you find it easier (cursors for example which should never be allowed in a production database unless all else has failed). It means you don't need to look at squeezing out that last little bit of performance until you need to. A lot is known about what will perform better on databases, to ignore this in design and development is short-sighted at best.
Not using parameterized queries. They're pretty handy in stopping SQL Injection.
This is a specific example of not sanitizing input data, mentioned in another answer.
I hate it when developers use nested select statements or even functions the return the result of a select statement inside the "SELECT" portion of a query.
I'm actually surprised I don't see this anywhere else here, perhaps I overlooked it, although #adam has a similar issue indicated.
Example:
SELECT
(SELECT TOP 1 SomeValue FROM SomeTable WHERE SomeDate = c.Date ORDER BY SomeValue desc) As FirstVal
,(SELECT OtherValue FROM SomeOtherTable WHERE SomeOtherCriteria = c.Criteria) As SecondVal
FROM
MyTable c
In this scenario, if MyTable returns 10000 rows the result is as if the query just ran 20001 queries, since it had to run the initial query plus query each of the other tables once for each line of result.
Developers can get away with this working in a development environment where they are only returning a few rows of data and the sub tables usually only have a small amount of data, but in a production environment, this kind of query can become exponentially costly as more data is added to the tables.
A better (not necessarily perfect) example would be something like:
SELECT
s.SomeValue As FirstVal
,o.OtherValue As SecondVal
FROM
MyTable c
LEFT JOIN (
SELECT SomeDate, MAX(SomeValue) as SomeValue
FROM SomeTable
GROUP BY SomeDate
) s ON c.Date = s.SomeDate
LEFT JOIN SomeOtherTable o ON c.Criteria = o.SomeOtherCriteria
This allows database optimizers to shuffle the data together, rather than requery on each record from the main table and I usually find when I have to fix code where this problem has been created, I usually end up increasing the speed of queries by 100% or more while simultaneously reducing CPU and memory usage.
For SQL-based databases:
Not taking advantage of CLUSTERED INDEXES or choosing the wrong column(s) to CLUSTER.
Not using a SERIAL (autonumber) datatype as a PRIMARY KEY to join to a FOREIGN KEY (INT) in a parent/child table relationship.
Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
Not reorganizing (i.e. unloading, droping, re-creating, loading and re-indexing) tables when many rows have been inserted or deleted (some engines physically keep deleted rows in a table with a delete flag.)
Not taking advantage of FRAGMENT ON EXPRESSION (if supported) on large tables which have high transaction rates.
Choosing the wrong datatype for a column!
Not choosing a proper column name.
Not adding new columns at the end of the table.
Not creating proper indexes to support frequently used queries.
creating indexes on columns with few possible values and creating unnecessary indexes.
...more to be added.
Not taking a backup before fixing some issue inside production database.
Using DDL commands on stored objects(like tables, views) in stored procedures.
Fear of using stored proc or fear of using ORM queries wherever the one is more efficient/appropriate to use.
Ignoring the use of a database profiler, which can tell you exactly what your ORM query is being converted into finally and hence verify the logic or even for debugging when not using ORM.
Not doing the correct level of normalization. You want to make sure that data is not duplicated, and that you are splitting data into different as needed. You also need to make sure you are not following normalization too far as that will hurt performance.
Treating the database as just a storage mechanism (i.e. glorified collections library) and hence subordinate to their application (ignoring other applications which share the data)
Dismissing an ORM like Hibernate out of hand, for reasons like "it's too magical" or "not on my database".
Relying too heavily on an ORM like Hibernate and trying to shoehorn it in where it isn't appropriate.
1 - Unnecessarily using a function on a value in a where clause with the result of that index not being used.
Example:
where to_char(someDate,'YYYYMMDD') between :fromDate and :toDate
instead of
where someDate >= to_date(:fromDate,'YYYYMMDD') and someDate < to_date(:toDate,'YYYYMMDD')+1
And to a lesser extent: Not adding functional indexes to those values that need them...
2 - Not adding check constraints to ensure the validity of the data. Constraints can be used by the query optimizer, and they REALLY help to ensure that you can trust your invariants. There's just no reason not to use them.
3 - Adding unnormalized columns to tables out of pure laziness or time pressure. Things are usually not designed this way, but evolve into this. The end result, without fail, is a ton of work trying to clean up the mess when you're bitten by the lost data integrity in future evolutions.
Think of this, a table without data is very cheap to redesign. A table with a couple of millions records with no integrity... not so cheap to redesign. Thus, doing the correct design when creating the column or table is amortized in spades.
4 - not so much about the database per se but indeed annoying. Not caring about the code quality of SQL. The fact that your SQL is expressed in text does not make it OK to hide the logic in heaps of string manipulation algorithms. It is perfectly possible to write SQL in text in a manner that is actually readable by your fellow programmer.
This has been said before, but: indexes, indexes, indexes. I've seen so many cases of poorly performing enterprise web apps that were fixed by simply doing a little profiling (to see which tables were being hit a lot), and then adding an index on those tables. This doesn't even require much in the way of SQL writing knowledge, and the payoff is huge.
Avoid data duplication like the plague. Some people advocate that a little duplication won't hurt, and will improve performance. Hey, I'm not saying that you have to torture your schema into Third Normal Form, until it's so abstract that not even the DBA's know what's going on. Just understand that whenever you duplicate a set of names, or zipcodes, or shipping codes, the copies WILL fall out of synch with each other eventually. It WILL happen. And then you'll be kicking yourself as you run the weekly maintenance script.
And lastly: use a clear, consistent, intuitive naming convention. In the same way that a well written piece of code should be readable, a good SQL schema or query should be readable and practically tell you what it's doing, even without comments. You'll thank yourself in six months, when you have to to maintenance on the tables. "SELECT account_number, billing_date FROM national_accounts" is infinitely easier to work with than "SELECT ACCNTNBR, BILLDAT FROM NTNLACCTS".
Not executing a corresponding SELECT query before running the DELETE query (particularly on production databases)!
The most common mistake I've seen in twenty years: not planning ahead. Many developers will create a database, and tables, and then continually modify and expand the tables as they build out the applications. The end result is often a mess and inefficient and difficult to clean up or simplify later on.
a) Hardcoding query values in string
b) Putting the database query code in the "OnButtonPress" action in a Windows Forms application
I have seen both.
Not paying enough attention towards managing database connections in your application. Then you find out the application, the computer, the server, and the network is clogged.
Thinking that they are DBAs and data modelers/designers when they have no formal indoctrination of any kind in those areas.
Thinking that their project doesn't require a DBA because that stuff is all easy/trivial.
Failure to properly discern between work that should be done in the database, and work that should be done in the app.
Not validating backups, or not backing up.
Embedding raw SQL in their code.
Here is a link to video called ‘Classic Database Development Mistakes and five ways to overcome them’ by Scott Walz
Not having an understanding of the databases concurrency model and how this affects development. It's easy to add indexes and tweak queries after the fact. However applications designed without proper consideration for hotspots, resource contention
and correct operation (Assuming what you just read is still valid!) can require significant changes within the database and application tier to correct later.
Not understanding how a DBMS works under the hood.
You cannot properly drive a stick without understanding how a clutch works. And you cannot understand how to use a Database without understanding that you are really just writing to a file on your hard disk.
Specifically:
Do you know what a Clustered Index is? Did you think about it when you designed your schema?
Do you know how to use indexes properly? How to reuse an index? Do you know what a Covering Index is?
So great, you have indexes. How big is 1 row in your index? How big will the index be when you have a lot of data? Will that fit easily into memory? If it won't it's useless as an index.
Have you ever used EXPLAIN in MySQL? Great. Now be honest with yourself: Did you understand even half of what you saw? No, you probably didn't. Fix that.
Do you understand the Query Cache? Do you know what makes a query un-cachable?
Are you using MyISAM? If you NEED full text search, MyISAM's is crap anyway. Use Sphinx. Then switch to Inno.
Using an ORM to do bulk updates
Selecting more data than needed. Again, typically done when using an ORM
Firing sqls in a loop.
Not having good test data and noticing performance degradation only on live data.

Resources