Alternatives to SQL COUNT(*) practices? - database

I was looking for ways to improve on the PostgreSQL/InnoDB MVCC COUNT(*) problem when I found an article about implementing a workaround in PostgreSQL. However, the author made a statement that caught my attention:
MySQL zealots tend to point to PostgreSQL's slow count() as a weakness, however, in the real world, count() isn't used very often, and if it's really needed, most good database systems provide a framework for you to build a workaround.
Are there ways to skip using COUNT(*) in the way you design your applications?
Is it true that most applications are designed so they don't need it? I use COUNT() on most of my pages since they all need pagination. What is this guy talking about? Is that why some sites only have a "next/previous" link?
Carrying this over into the NoSQL world, is this also something that has to be done there since you can't COUNT() records very easily?

I think when the author said
however, in the real world, count() isn’t used very often
they specifically meant an unqualified count(*) isn't used very often, which is the specific case that MyISAM optimises.
My own experience backs this up: apart from some dubious Munin plugins, I can't think of the last time I did a select count(*) from sometable.
For example, anywhere I'm doing pagination, it's usually the output of some search. Which implies there will be a WHERE clause to limit the results anyway- so I might be doing something like select count(*) from sometable where conditions followed by select ... from sometable limit n offset m. Neither of which can use the direct how-many-rows-in-this-table shortcut.
Now it's true that if the conditions are purely index conditions, then some databases can merge together the output of covering indices to avoid looking at the table data too, which certainly decreases the number of blocks looked at... if it works. It may be that, for example, this is only a win if the query can be satisfied by a single index; it depends on the db implementation.
Still, this is by no means always the case: a lot of our tables have an active flag which isn't indexed, but often is filtered on, so they would require a heap check anyway.
If you just need an idea of whether a table has data in it or not, PostgreSQL and many other systems do retain estimated statistics for each table: you can examine the reltuples and relpages columns in the catalogue for an estimate of how many rows the table has and how much space it is taking. Which is fine so long as ~6 significant figures is accurate enough for you, and some lag in the statistics being updated is tolerable. In my use case that I can remember (plotting the number of items in the collection) it would have been fine for me...
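For example, a minimal sketch of that catalogue lookup (assuming a table named sometable; the cast just rounds the stored estimate):

SELECT reltuples::bigint AS estimated_rows,
       relpages          AS pages_on_disk
FROM   pg_class
WHERE  relname = 'sometable';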
Trying to maintain an accurate row counter is tricky. The article you cited caches the row count in an auxiliary table, which introduces two problems:
a race condition between SELECT and INSERT populating the auxiliary table (minor, you could seed this administratively)
as soon as you add a row to the main table, you take an update lock on the corresponding row in the auxiliary table, and any other process trying to add to the main table then has to wait.
The upshot is that concurrent transactions get serialised instead of being able to run in parallel, and you've lost the writers-don't-have-to-block benefits of MVCC: you should reasonably expect to be able to insert two independent rows into the same table at the same time.
MyISAM can cache the row count per table because it takes an exclusive lock on the table when someone writes to it (iirc). InnoDB allows finer-grained locking, but it doesn't try to cache the row count for the table. Of course, if you don't care about concurrency and/or transactions, you can take shortcuts... but then you're moving away from PostgreSQL's main focus, where data integrity and ACID transactions are primary goals.
I hope this sheds some light. I must admit, I've never really felt the need for a faster "count(*)", so to some extent this is simply a "but it works for me" testament rather than a real answer.

While you're really asking more of an application design question than a database question, there is more detail about how counting executes in PostgreSQL, and the alternatives to doing so, at Slow Counting. If you must have a fast count of something, you can maintain one with a trigger, with examples in the references there. That costs you a bit on the insert/update/delete side in return for making the count fast. You have to know in advance what you will eventually want a count of for that to work, though.
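As a rough illustration only (the table, function and trigger names here are invented, not taken from the references), a PostgreSQL trigger-maintained counter might look like this; note that it reintroduces the contention on the counter row discussed above:

CREATE TABLE rowcount (
    table_name text PRIMARY KEY,
    total_rows bigint NOT NULL
);
INSERT INTO rowcount VALUES ('sometable', 0);  -- seed administratively

CREATE OR REPLACE FUNCTION adjust_rowcount() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE rowcount SET total_rows = total_rows + 1 WHERE table_name = TG_TABLE_NAME;
    ELSE  -- DELETE
        UPDATE rowcount SET total_rows = total_rows - 1 WHERE table_name = TG_TABLE_NAME;
    END IF;
    RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sometable_rowcount
AFTER INSERT OR DELETE ON sometable
FOR EACH ROW EXECUTE PROCEDURE adjust_rowcount();

-- The "count" then becomes a single-row lookup:
SELECT total_rows FROM rowcount WHERE table_name = 'sometable';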

Related

Advice on adding index in db2

Good day,
In my Java web application, I have a table with 107 columns. This table is also a parent table with many child tables. Currently it has more than 10 million rows in production.
Since last year, the application keeps hitting slowness issues. After checking and debugging, we found that the slowness happens when updating or selecting data from this table.
Every time this happens, I take the SELECT or UPDATE query and run it through the db2advis command, and every time the result says a >99% improvement is available by applying the recommended indexes. Adding those indexes resolves the slowness issue.
So far, 7-8 indexes have already been applied to this table. Today a slowness issue was reported again; after checking, it is again a slow SELECT from this table joined to another table. As before, I ran db2advis and the result again showed >99% improvement and a few recommended indexes.
However, I am starting to question whether this is really a good solution. If there is another slowness issue in the future, should I apply the same fix again?
Also, every time I get a db2advis result, it includes a section of unused existing indexes with DROP INDEX statements, and those are indexes I created previously. I believe this is because those indexes are not relevant to the query currently being advised on. Can I ignore this, or will these existing indexes affect performance?
As I understand it, indexes also have disadvantages, especially for INSERT and UPDATE statements.
Additionally, the system owner has a policy of keeping the data for at least 7 years, so the owner is not going to do any housekeeping on the database.
I would like to ask for advice: other than adding indexes and rewriting queries, is there any other way to overcome this issue?
This answer contains general advice about levers that may be available to you.
Your situation happens in many companies that are subject to regulatory requirements for multi-year online data retention.
When the physical data model is not designed to exploit range-partitioning for easy roll out of old data (without delete), performance can degrade over time especially when business changes or legal changes impact data distributions.
Your question is not about programming, but instead it is about performance management, and that is a big topic.
Because of that reason, your question may be more suitable for dba.stackexchange.com. This stackoverflow website is intended for more specific programming questions.
Always focus on the whole workload, not only a single query. A "good solution" for one query may be bad for another aspect of functionality.
Adding one index can speed up one query but negatively impact other insert/update/delete activities, as you mention.
Companies that have a non-production environment with the same (or higher) volumes of data and matching distributions can exploit such environments for performance measurement, especially if they have a realistic test workload-generator and instrumentation for profiling.
Separately, keep in mind the importance of designing statistics collection properly: sometimes column-group statistics can have a big impact on index selection even for existing indexes, other times the use of distribution statistics can greatly help dynamic SQL, and statistical views can help with many problems. So before adding new indexes, always consider whether other kinds of techniques can help, especially if the join columns are already indexed correctly and foreign-key indexes are present, but for some reason the Db2 optimiser is ignoring the indexes.
If the Db2 index lastused column (in syscat.indexes) shows that an index is never used or used extremely rarely, then you should investigate why the index was created, and why some queries that might be expected to benefit from that specific index are ignoring the index. Sometimes, it's necessary to reorder the columns in the index to ensure that the highest selectivity columns are at the lowest ordinal position.
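As a rough sketch (schema and table names are placeholders), that lastused information can be pulled with a catalogue query like:

SELECT indschema, indname, lastused
FROM   syscat.indexes
WHERE  tabschema = 'MYSCHEMA'
  AND  tabname   = 'MYTABLE'
ORDER BY lastused;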
There are other levers you can adjust, MQT, MDC, optimisation profiles (hints), registry settings, optimisation-levels, but the start point is a good data model and good measurements.

Sql server Index Size is 100 GB [duplicate]

What are common database development mistakes made by application developers?
1. Not using appropriate indices
This is a relatively easy one but still it happens all the time. Foreign keys should have indexes on them. If you're using a field in a WHERE clause you should (probably) have an index on it. Such indexes should often cover multiple columns based on the queries you need to execute.
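For example, a hypothetical composite index supporting a common lookup (the table and column names are invented for illustration):

-- Supports WHERE customer_id = ? AND order_date >= ? (and WHERE customer_id = ? alone)
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);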
2. Not enforcing referential integrity
Your database may vary here but if your database supports referential integrity--meaning that all foreign keys are guaranteed to point to an entity that exists--you should be using it.
It's quite common to see this failure in MySQL databases. I don't believe MyISAM supports it. InnoDB does. You'll find people who are using MyISAM, and others who are using InnoDB but still aren't using it.
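A hedged illustration of declaring the constraint on InnoDB (table and column names are hypothetical):

CREATE TABLE orders (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,
    CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (id)
) ENGINE=InnoDB;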
More here:
How important are constraints like NOT NULL and FOREIGN KEY if I’ll always control my database input with php?
Are foreign keys really necessary in a database design?
3. Using natural rather than surrogate (technical) primary keys
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What's the best practice for primary keys in tables?
Which format of primary key would you use in this situation.
Surrogate vs. natural/business keys
Should I have a dedicated primary key field?
This is a somewhat controversial topic on which you won't get universal agreement. While you may find some people who think natural keys are OK in some situations, you won't find much criticism of surrogate keys other than that they are arguably unnecessary. That's quite a small downside if you ask me.
Remember, even countries can cease to exist (for example, Yugoslavia).
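As a sketch of the idea (names are hypothetical and the identity syntax varies by engine), the surrogate key identifies the row while the natural key is still enforced, but free to change:

CREATE TABLE country (
    id           INT IDENTITY PRIMARY KEY,   -- surrogate key (AUTO_INCREMENT/sequence elsewhere)
    iso_code     CHAR(2) NOT NULL UNIQUE,    -- natural key, still enforced but allowed to change
    country_name VARCHAR(100) NOT NULL
);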
4. Writing queries that require DISTINCT to work
You often see this in ORM-generated queries. Look at the log output from Hibernate and you'll see all the queries begin with:
SELECT DISTINCT ...
This is a bit of a shortcut to ensuring you don't return duplicate rows and thus get duplicate objects. You'll sometimes see people doing this as well. If you see it too much it's a real red flag. Not that DISTINCT is bad or doesn't have valid applications. It does (on both counts), but it's not a substitute or a stopgap for writing correct queries.
From Why I Hate DISTINCT:
Where things start to go sour in my opinion is when a developer is building a substantial query, joining tables together, and all of a sudden he realizes that it looks like he is getting duplicate (or even more) rows and his immediate response... his "solution" to this "problem" is to throw on the DISTINCT keyword and POOF all his troubles go away.
5. Favouring aggregation over joins
Another common mistake by database application developers is to not realize how much more expensive aggregation (ie the GROUP BY clause) can be compared to joins.
To give you an idea of how widespread this is, I've written on this topic several times here and been downvoted a lot for it. For example:
From SQL statement - “join” vs “group by and having”:
First query:
SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3
Query time: 0.312 s
Second query:
SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
WHERE t1.roleid = 1
Query time: 0.016 s
That's right. The join version I proposed is twenty times faster than the aggregate version.
6. Not simplifying complex queries through views
Not all database vendors support views but for those that do, they can greatly simplify queries if used judiciously. For example, on one project I used a generic Party model for CRM. This is an extremely powerful and flexible modelling technique but can lead to many joins. In this model there were:
Party: people and organisations;
Party Role: things those parties did, for example Employee and Employer;
Party Role Relationship: how those roles related to each other.
Example:
Ted is a Person, being a subtype of Party;
Ted has many roles, one of which is Employee;
Intel is an organisation, being a subtype of a Party;
Intel has many roles, one of which is Employer;
Intel employs Ted, meaning there is a relationship between their respective roles.
So there are five tables joined to link Ted to his employer. You assume all employees are Persons (not organisations) and provide this helper view:
CREATE VIEW vw_employee AS
SELECT p.title, p.given_names, p.surname, p.date_of_birth, p2.party_name employer_name
FROM person p
JOIN party py ON py.id = p.id
JOIN party_role child ON p.id = child.party_id
JOIN party_role_relationship prr ON child.id = prr.child_id AND prr.type = 'EMPLOYMENT'
JOIN party_role parent ON parent.id = prr.parent_id
JOIN party p2 ON parent.party_id = p2.id
And suddenly you have a very simple view of the data you want but on a highly flexible data model.
7. Not sanitizing input
This is a huge one. Now I like PHP but if you don't know what you're doing it's really easy to create sites vulnerable to attack. Nothing sums it up better than the story of little Bobby Tables.
Data provided by the user by way of URLs, form data and cookies should always be treated as hostile and sanitized. Make sure you're getting what you expect.
8. Not using prepared statements
Prepared statements are when you compile a query minus the data used in inserts, updates and WHERE clauses and then supply that later. For example:
SELECT * FROM users WHERE username = 'bob'
vs
SELECT * FROM users WHERE username = ?
or
SELECT * FROM users WHERE username = :username
depending on your platform.
I've seen databases brought to their knees by doing this. Basically, each time any modern database encounters a new query it has to compile it. If it encounters a query it's seen before, you're giving the database the opportunity to cache the compiled query and the execution plan. By doing the query a lot you're giving the database the opportunity to figure that out and optimize accordingly (for example, by pinning the compiled query in memory).
Using prepared statements will also give you meaningful statistics about how often certain queries are used.
Prepared statements will also better protect you against SQL injection attacks.
9. Not normalizing enough
Database normalization is basically the process of optimizing database design or how you organize your data into tables.
Just this week I ran across some code where someone had imploded an array and inserted it into a single field in a database. Normalizing that would be to treat each element of that array as a separate row in a child table (i.e. a one-to-many relationship).
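A minimal sketch of that normalization (table and column names invented):

-- Instead of users.favourite_ids = '3,17,42' ...
CREATE TABLE user_favourite (
    user_id      INT NOT NULL REFERENCES users (id),
    favourite_id INT NOT NULL,
    PRIMARY KEY (user_id, favourite_id)   -- one row per element of the old array
);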
This also came up in Best method for storing a list of user IDs:
I've seen in other systems that the list is stored in a serialized PHP array.
But lack of normalization comes in many forms.
More:
Normalization: How far is far enough?
SQL by Design: Why You Need Database Normalization
10. Normalizing too much
This may seem like a contradiction to the previous point but normalization, like many things, is a tool. It is a means to an end and not an end in and of itself. I think many developers forget this and start treating a "means" as an "end". Unit testing is a prime example of this.
I once worked on a system that had a huge hierarchy for clients that went something like:
Licensee -> Dealer Group -> Company -> Practice -> ...
such that you had to join about 11 tables together before you could get any meaningful data. It was a good example of normalization taken too far.
More to the point, careful and considered denormalization can have huge performance benefits but you have to be really careful when doing this.
More:
Why too much Database Normalization can be a Bad Thing
How far to take normalization in database design?
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
The Mother of All Database Normalization Debates on Coding Horror
11. Using exclusive arcs
An exclusive arc is a common mistake where a table is created with two or more foreign keys where one and only one of them can be non-null. Big mistake. For one thing it becomes that much harder to maintain data integrity. After all, even with referential integrity, nothing is preventing two or more of these foreign keys from being set (complex check constraints notwithstanding).
From A Practical Guide to Relational Database Design:
We have strongly advised against exclusive arc construction wherever possible, for the good reason that they can be awkward to write code for and pose more maintenance difficulties.
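For illustration (hypothetical names), the "exactly one of these foreign keys" rule has to be bolted on with a CHECK constraint, which is precisely the awkwardness being described:

CREATE TABLE payment (
    id         INT PRIMARY KEY,
    invoice_id INT REFERENCES invoice (id),
    refund_id  INT REFERENCES refund (id),
    -- referential integrity alone cannot stop both (or neither) being set:
    CHECK (   (invoice_id IS NOT NULL AND refund_id IS NULL)
           OR (invoice_id IS NULL     AND refund_id IS NOT NULL) )
);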
12. Not doing performance analysis on queries at all
Pragmatism reigns supreme, particularly in the database world. If you're sticking to principles to the point that they've become a dogma then you've quite probably made mistakes. Take the example of the aggregate queries from above. The aggregate version might look "nice" but its performance is woeful. A performance comparison should've ended the debate (but it didn't) but more to the point: spouting such ill-informed views in the first place is ignorant, even dangerous.
13. Over-reliance on UNION ALL and particularly UNION constructs
A UNION in SQL terms merely concatenates congruent data sets, meaning they have the same type and number of columns. The difference between them is that UNION ALL is a simple concatenation and should be preferred wherever possible whereas a UNION will implicitly do a DISTINCT to remove duplicate tuples.
UNIONs, like DISTINCT, have their place. There are valid applications. But if you find yourself doing a lot of them, particularly in subqueries, then you're probably doing something wrong. That might be a case of poor query construction or a poorly designed data model forcing you to do such things.
UNIONs, particularly when used in joins or dependent subqueries, can cripple a database. Try to avoid them whenever possible.
14. Using OR conditions in queries
This might seem harmless. After all, ANDs are OK. OR should be OK too, right? Wrong. Basically an AND condition restricts the data set whereas an OR condition grows it, but not in a way that lends itself to optimisation. Particularly when the different OR conditions might intersect, thus forcing the optimizer to effectively do a DISTINCT operation on the result.
Bad:
... WHERE a = 2 OR a = 5 OR a = 11
Better:
... WHERE a IN (2, 5, 11)
Now your SQL optimizer may effectively turn the first query into the second. But it might not. Just don't do it.
15. Not designing their data model to lend itself to high-performing solutions
This is a hard point to quantify. It is typically observed by its effect. If you find yourself writing gnarly queries for relatively simple tasks or that queries for finding out relatively straightforward information are not efficient, then you probably have a poor data model.
In some ways this point summarizes all the earlier ones but it's more of a cautionary tale that doing things like query optimisation is often done first when it should be done second. First and foremost you should ensure you have a good data model before trying to optimize the performance. As Knuth said:
Premature optimization is the root of all evil
16. Incorrect use of Database Transactions
All data changes for a specific process should be atomic. That is, if the operation succeeds, it does so fully, and if it fails, the data is left unchanged; there should be no possibility of 'half-done' changes.
Ideally, the simplest way to achieve this is that the entire system design should strive to support all data changes through single INSERT/UPDATE/DELETE statements. In this case, no special transaction handling is needed, as your database engine should do so automatically.
However, if any processes do require multiple statements be performed as a unit to keep the data in a consistent state, then appropriate Transaction Control is necessary.
Begin a Transaction before the first statement.
Commit the Transaction after the last statement.
On any error, roll back the transaction. And, importantly: don't forget to skip/abort all statements that follow after the error.
It is also recommended to pay careful attention to the subtleties of how your database connectivity layer and database engine interact in this regard.
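A minimal sketch of that pattern (exact syntax and error handling vary by engine and connectivity layer; the table is hypothetical):

BEGIN TRANSACTION;

UPDATE account SET balance = balance - 100 WHERE id = 1;
UPDATE account SET balance = balance + 100 WHERE id = 2;

COMMIT;      -- only if every statement succeeded
-- ROLLBACK; -- on any error, and skip the remaining statements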
17. Not understanding the 'set-based' paradigm
The SQL language follows a specific paradigm suited to specific kinds of problems. Various vendor-specific extensions notwithstanding, the language struggles to deal with problems that are trivial in languages like Java, C#, Delphi etc.
This lack of understanding manifests itself in a few ways.
Inappropriately imposing too much procedural or imperative logic on the database.
Inappropriate or excessive use of cursors. Especially when a single query would suffice.
Incorrectly assuming that triggers fire once per row affected in multi-row updates.
Determine clear division of responsibility, and strive to use the appropriate tool to solve each problem.
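As a hedged example of the set-based mindset (names invented), a single statement replaces a row-by-row cursor loop:

-- Instead of opening a cursor over cancelled orders and updating lines one at a time:
UPDATE order_line
SET    status = 'CANCELLED'
WHERE  order_id IN (SELECT id FROM orders WHERE cancelled_at IS NOT NULL);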
Key database design and programming mistakes made by developers
Selfish database design and usage. Developers often treat the database as their personal persistent object store without considering the needs of other stakeholders in the data. This also applies to application architects. Poor database design and data integrity makes it hard for third parties working with the data and can substantially increase the system's life cycle costs. Reporting and MIS tends to be a poor cousin in application design and only done as an afterthought.
Abusing denormalised data. Overdoing denormalised data and trying to maintain it within the application is a recipe for data integrity issues. Use denormalisation sparingly. Not wanting to add a join to a query is not an excuse for denormalising.
Scared of writing SQL. SQL isn't rocket science and is actually quite good at doing its job. O/R mapping layers are quite good at doing the 95% of queries that are simple and fit well into that model. Sometimes SQL is the best way to do the job.
Dogmatic 'No Stored Procedures' policies. Regardless of whether you believe stored procedures are evil, this sort of dogmatic attitude has no place on a software project.
Not understanding database design. Normalisation is your friend and it's not rocket science. Joining and cardinality are fairly simple concepts - if you're involved in database application development there's really no excuse for not understanding them.
Not using version control on the database schema
Working directly against a live database
Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
Over-use and/or dependence on stored procedures.
Some application developers see stored procedures as a direct extension of middle tier/front end code. This appears to be a common trait in Microsoft stack developers (I'm one, but I've grown out of it) and produces many stored procedures that perform complex business logic and workflow processing. This is much better done elsewhere.
Stored procedures are useful where it has actually been proven that some real technical factor necessitates their use (for example, performance and security). For example, keeping aggregation/filtering of large data sets "close to the data".
I recently had to help maintain and enhance a large Delphi desktop application of which 70% of the business logic and rules were implemented in 1400 SQL Server stored procedures (the remainder in UI event handlers). This was a nightmare, primarily due to the difficulty of introducing effective unit testing to TSQL, the lack of encapsulation and poor tools (debuggers, editors).
Working with a Java team in the past I quickly found out that often the complete opposite holds in that environment. A Java Architect once told me: "The database is for data, not code.".
These days I think it's a mistake to not consider stored procs at all, but they should be used sparingly (not by default) in situations where they provide useful benefits (see the other answers).
Number one problem? They only test on toy databases. So they have no idea that their SQL will crawl when the database gets big, and someone has to come along and fix it later (that sound you can hear is my teeth grinding).
Not using indexes.
Poor Performance Caused by Correlated Subqueries
Most of the time you want to avoid correlated subqueries. A subquery is correlated if, within the subquery, there is a reference to a column from the outer query. When this happens, the subquery is executed at least once for every row returned and could be executed more times if other conditions are applied after the condition containing the correlated subquery is applied.
Forgive the contrived example and the Oracle syntax, but let's say you wanted to find all the employees that have been hired in any of your stores since the last time the store did less than $10,000 of sales in a day.
select e.first_name, e.last_name
from employee e
where e.start_date >
(select max(ds.transaction_date)
from daily_sales ds
where ds.store_id = e.store_id and
ds.total < 10000)
The subquery in this example is correlated to the outer query by the store_id and would be executed for every employee in your system. One way that this query could be optimized is to move the subquery to an inline-view.
select e.first_name, e.last_name
from employee e,
(select ds.store_id,
max(ds.transaction_date) transaction_date
from daily_sales ds
where ds.total < 10000
group by ds.store_id) dsx
where e.store_id = dsx.store_id and
e.start_date > dsx.transaction_date
In this example, the query in the from clause is now an inline view (again, some Oracle-specific syntax) and is only executed once. Depending on your data model, this query will probably execute much faster. It would perform better than the first query as the number of employees grew. The first query could actually perform better if there were few employees and many stores (and perhaps many of the stores had no employees) and the daily_sales table was indexed on store_id. This is not a likely scenario but shows how a correlated query could possibly perform better than an alternative.
I've seen junior developers correlate subqueries many times and it usually has had a severe impact on performance. However, when removing a correlated subquery be sure to look at the explain plan before and after to make sure you are not making the performance worse.
In my experience:
Not communicating with experienced DBAs.
Using Access instead of a "real" database. There are plenty of great small and even free databases like SQL Express, MySQL, and SQLite that will work and scale much better. Apps often need to scale in unexpected ways.
Forgetting to set up relationships between the tables. I remember having to clean this up when I first started working at my current employer.
Using Excel for storing (huge amounts of) data.
I have seen companies holding thousands of rows and using multiple worksheets (due to the row limit of 65,536 on previous versions of Excel).
Excel is well suited for reports, data presentation and other tasks, but should not be treated as a database.
I'd like to add:
Favoring "Elegant" code over highly performing code. The code that works best against databases is often ugly to the application developer's eye.
Believing that nonsense about premature optimization. Databases must consider performance in the original design and in any subsequent development. Performance is 50% of database design (40% is data integrity and the last 10% is security) in my opinion. Databases which are not built from the bottom up to perform will perform badly once real users and real traffic are placed against the database. Premature optimization doesn't mean no optimization! It doesn't mean you should write code that will almost always perform badly because you find it easier (cursors for example which should never be allowed in a production database unless all else has failed). It means you don't need to look at squeezing out that last little bit of performance until you need to. A lot is known about what will perform better on databases, to ignore this in design and development is short-sighted at best.
Not using parameterized queries. They're pretty handy in stopping SQL Injection.
This is a specific example of not sanitizing input data, mentioned in another answer.
I hate it when developers use nested select statements, or even functions that return the result of a select statement, inside the "SELECT" portion of a query.
I'm actually surprised I don't see this anywhere else here, perhaps I overlooked it, although #adam has a similar issue indicated.
Example:
SELECT
(SELECT TOP 1 SomeValue FROM SomeTable WHERE SomeDate = c.Date ORDER BY SomeValue desc) As FirstVal
,(SELECT OtherValue FROM SomeOtherTable WHERE SomeOtherCriteria = c.Criteria) As SecondVal
FROM
MyTable c
In this scenario, if MyTable returns 10000 rows the result is as if the query just ran 20001 queries, since it had to run the initial query plus query each of the other tables once for each line of result.
Developers can get away with this working in a development environment where they are only returning a few rows of data and the sub tables usually only have a small amount of data, but in a production environment, this kind of query can become exponentially costly as more data is added to the tables.
A better (not necessarily perfect) example would be something like:
SELECT
s.SomeValue As FirstVal
,o.OtherValue As SecondVal
FROM
MyTable c
LEFT JOIN (
SELECT SomeDate, MAX(SomeValue) as SomeValue
FROM SomeTable
GROUP BY SomeDate
) s ON c.Date = s.SomeDate
LEFT JOIN SomeOtherTable o ON c.Criteria = o.SomeOtherCriteria
This allows database optimizers to combine the data together, rather than requerying the subordinate tables for each record from the main table. When I have to fix code where this problem has been created, I usually end up increasing the speed of queries by 100% or more while simultaneously reducing CPU and memory usage.
For SQL-based databases:
Not taking advantage of CLUSTERED INDEXES or choosing the wrong column(s) to CLUSTER.
Not using a SERIAL (autonumber) datatype as a PRIMARY KEY to join to a FOREIGN KEY (INT) in a parent/child table relationship.
Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
Not reorganizing (i.e. unloading, dropping, re-creating, loading and re-indexing) tables when many rows have been inserted or deleted (some engines physically keep deleted rows in a table with a delete flag.)
Not taking advantage of FRAGMENT ON EXPRESSION (if supported) on large tables which have high transaction rates.
Choosing the wrong datatype for a column!
Not choosing a proper column name.
Not adding new columns at the end of the table.
Not creating proper indexes to support frequently used queries.
Creating indexes on columns with few distinct values, and creating unnecessary indexes.
...more to be added.
Not taking a backup before fixing some issue inside production database.
Using DDL commands on stored objects (like tables, views) in stored procedures.
Fear of using stored procedures, or fear of using ORM queries, whichever of the two is more efficient/appropriate for the job.
Ignoring the use of a database profiler, which can tell you exactly what your ORM query is ultimately converted into, and hence lets you verify the logic, or even helps with debugging when not using an ORM.
Not doing the correct level of normalization. You want to make sure that data is not duplicated, and that you are splitting data into different tables as needed. You also need to make sure you are not taking normalization too far, as that will hurt performance.
Treating the database as just a storage mechanism (i.e. glorified collections library) and hence subordinate to their application (ignoring other applications which share the data)
Dismissing an ORM like Hibernate out of hand, for reasons like "it's too magical" or "not on my database".
Relying too heavily on an ORM like Hibernate and trying to shoehorn it in where it isn't appropriate.
1 - Unnecessarily using a function on a value in a where clause with the result of that index not being used.
Example:
where to_char(someDate,'YYYYMMDD') between :fromDate and :toDate
instead of
where someDate >= to_date(:fromDate,'YYYYMMDD') and someDate < to_date(:toDate,'YYYYMMDD')+1
And to a lesser extent: Not adding functional indexes to those values that need them...
2 - Not adding check constraints to ensure the validity of the data. Constraints can be used by the query optimizer, and they REALLY help to ensure that you can trust your invariants. There's just no reason not to use them.
3 - Adding unnormalized columns to tables out of pure laziness or time pressure. Things are usually not designed this way, but evolve into this. The end result, without fail, is a ton of work trying to clean up the mess when you're bitten by the lost data integrity in future evolutions.
Think of this, a table without data is very cheap to redesign. A table with a couple of millions records with no integrity... not so cheap to redesign. Thus, doing the correct design when creating the column or table is amortized in spades.
4 - not so much about the database per se but indeed annoying. Not caring about the code quality of SQL. The fact that your SQL is expressed in text does not make it OK to hide the logic in heaps of string manipulation algorithms. It is perfectly possible to write SQL in text in a manner that is actually readable by your fellow programmer.
This has been said before, but: indexes, indexes, indexes. I've seen so many cases of poorly performing enterprise web apps that were fixed by simply doing a little profiling (to see which tables were being hit a lot), and then adding an index on those tables. This doesn't even require much in the way of SQL writing knowledge, and the payoff is huge.
Avoid data duplication like the plague. Some people advocate that a little duplication won't hurt, and will improve performance. Hey, I'm not saying that you have to torture your schema into Third Normal Form, until it's so abstract that not even the DBA's know what's going on. Just understand that whenever you duplicate a set of names, or zipcodes, or shipping codes, the copies WILL fall out of synch with each other eventually. It WILL happen. And then you'll be kicking yourself as you run the weekly maintenance script.
And lastly: use a clear, consistent, intuitive naming convention. In the same way that a well written piece of code should be readable, a good SQL schema or query should be readable and practically tell you what it's doing, even without comments. You'll thank yourself in six months, when you have to do maintenance on the tables. "SELECT account_number, billing_date FROM national_accounts" is infinitely easier to work with than "SELECT ACCNTNBR, BILLDAT FROM NTNLACCTS".
Not executing a corresponding SELECT query before running the DELETE query (particularly on production databases)!
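A simple sketch of that habit (names hypothetical):

SELECT COUNT(*) FROM orders WHERE created_at < '2015-01-01';  -- sanity-check what will go
-- only once that looks right:
DELETE FROM orders WHERE created_at < '2015-01-01';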
The most common mistake I've seen in twenty years: not planning ahead. Many developers will create a database, and tables, and then continually modify and expand the tables as they build out the applications. The end result is often a mess and inefficient and difficult to clean up or simplify later on.
a) Hardcoding query values in string
b) Putting the database query code in the "OnButtonPress" action in a Windows Forms application
I have seen both.
Not paying enough attention towards managing database connections in your application. Then you find out the application, the computer, the server, and the network is clogged.
Thinking that they are DBAs and data modelers/designers when they have no formal indoctrination of any kind in those areas.
Thinking that their project doesn't require a DBA because that stuff is all easy/trivial.
Failure to properly discern between work that should be done in the database, and work that should be done in the app.
Not validating backups, or not backing up.
Embedding raw SQL in their code.
Here is a link to a video called 'Classic Database Development Mistakes and five ways to overcome them' by Scott Walz.
Not having an understanding of the database's concurrency model and how this affects development. It's easy to add indexes and tweak queries after the fact. However, applications designed without proper consideration for hotspots, resource contention and correct operation (assuming what you just read is still valid!) can require significant changes within the database and application tier to correct later.
Not understanding how a DBMS works under the hood.
You cannot properly drive a stick without understanding how a clutch works. And you cannot understand how to use a Database without understanding that you are really just writing to a file on your hard disk.
Specifically:
Do you know what a Clustered Index is? Did you think about it when you designed your schema?
Do you know how to use indexes properly? How to reuse an index? Do you know what a Covering Index is?
So great, you have indexes. How big is 1 row in your index? How big will the index be when you have a lot of data? Will that fit easily into memory? If it won't it's useless as an index.
Have you ever used EXPLAIN in MySQL? Great. Now be honest with yourself: Did you understand even half of what you saw? No, you probably didn't. Fix that.
Do you understand the Query Cache? Do you know what makes a query un-cachable?
Are you using MyISAM? If you NEED full text search, MyISAM's is crap anyway. Use Sphinx. Then switch to Inno.
Using an ORM to do bulk updates
Selecting more data than needed. Again, typically done when using an ORM
Firing sqls in a loop.
Not having good test data and noticing performance degradation only on live data.

What are the types and inner workings of a query optimizer?

As I understand it, most query optimizers are "cost-based". Others are "rule-based", or I believe they call it "Syntax Based". So, what's the best way to optimize the syntax of SQL statements to help an optimizer produce better results?
Some cost-based optimizers can be influenced by "hints" like FIRST_ROWS(). Others are tailored for OLAP. Is it possible to know more detailed logic about how Informix IDS and SE's optimizers decide what's the best route for processing a query, other than SET EXPLAIN? Is there any documentation which illustrates the ranking of SELECT statements as to what's the fastest way to access rows, assuming it's indexed?
I would imagine that "SELECT col FROM table WHERE ROWID = n" is the fastest (rank 1).
If I'm not mistaken, Informix SE's ROWID is a SERIAL (INT), which allows for a maximum of about 2 billion rows, or maybe it uses INT8 for larger row counts? SE's optimizer is cost-based when it has enough data, but it does not use distributions like the IDS optimizer does.
IDS' ROWID isn't an INT; it is the logical address of the row's page left-shifted 8 bits, plus the slot number on the page that contains the row's data. IDS' optimizer is a cost-based optimizer that uses data about the index depth and width, the number of rows, the number of pages, and the data distributions created by UPDATE STATISTICS MEDIUM and HIGH to decide which query path is the least expensive, but there's no ranking of statements?
I think Oracle uses hex values for ROWID. Too bad ROWID can't be used more often, since a row's ROWID can change. So maybe ROWID could be used by the optimizer as a counter to report a query's progress, an idea I mentioned in my "Begin viewing query results before query completes" question? I feel it wouldn't be that difficult to report a query's progress while it is being processed, perhaps at the expense of some slight overhead, but it would be nice to know ahead of time: a "Google-like" estimate of how many rows meet a query's criteria, display its progress every 100, 200, 500 or 1,000 rows, give users the ability to cancel it at any time, and start displaying the qualifying rows as they are put into the current list while it continues searching... This is just one example; perhaps we could think of other neat/useful features, the ingredients are more or less there.
Perhaps we could fine-tune each query with more granularity than is currently available? OLTP queries tend to be mostly static and pre-defined. The "what-ifs" are more OLAP, so let's try to add more control and intelligence there. What's needed, therefore, is being able to control the optimizer more precisely, not just "hint/influence" it. We could then have more dynamic SELECT statements for specific situations! Maybe even tell IDS to read blocks of index nodes at a time instead of one by one, etc.
I'm not really sure what you are after, but here is some info on the SQL Server query optimizer which I've recently read:
13 Things You Should Know About Statistics and the Query Optimizer
SQL Server Query Execution Plan Analysis
and one for Informix that I just found using google:
Part 1: Tuning Informix SQL
For Oracle, your best resource would be Cost-Based Oracle Fundamentals. It's about 500 pages (and billed as Volume 1, but there haven't been any follow-ups yet).
For a (very) simple full-table scan, progress can sometimes be monitored through v$session_longops. Oracle knows how many blocks it has to scan, how many blocks it has scanned, how many it has to go, and reports on progress.
Indexes are a different matter. If I search for records for a client 'Frank', and use the index, the database will make a guess at how many 'Frank' entries are in the table, but that guess can be massively off. It may be that you have 1000 'Frankenstein' and just 1 'Frank' or vice versa.
It gets even more complicated as you add in other filter and access predicates (eg where multiple indexes can be chosen), and makes another leap as you include table joins. And that's without getting into the complex stuff about remote databases and domain indexes like Oracle Text and Locator.
In short, it is very complicated. It is stuff that can be useful to know if you are responsible for tuning a large application. Even for basic development you need to have some grounding in how the database can physically retrieve the data you are interested in.
But I'd say you are going the wrong way here. The point of an RDBMS is to abstract the details so that, for the most part, they just happen. Oracle employs smart people to write query transformation stuff into the optimizer so us developers can move away from 'syntax fiddling' to get the best plans (not totally, but it is getting better).

Limits on number of Rows in a SQL Server Table

Are there any hard limits on the number of rows in a SQL Server table? I am under the impression the only limit is based on physical storage.
At what point does performance significantly degrade, if at all, on tables with and without an index? Are there any common practices for very large tables?
To give a little domain knowledge, we are considering usage of an Audit table which will log changes to fields for all tables in a database and are wondering what types of walls we might run up against.
You are correct that the number of rows is limited by your available storage.
It is hard to give any numbers as it very much depends on your server hardware, configuration, and how efficient your queries are.
For example, a simple select statement will run faster and show less degradation than a Full Text or Proximity search as the number of rows grows.
BrianV is correct. It's hard to give a rule because it varies drastically based on how you will use the table, how it's indexed, the actual columns in the table, etc.
As to common practices... for very large tables you might consider partitioning. This could be especially useful if you find that for your log you typically only care about changes in the last 1 month (or 1 day, 1 week, 1 year, whatever). You could then archive off older portions of the data so that it's available if absolutely needed, but won't be in the way since you will almost never actually need it.
Another thing to consider is to have a separate change log table for each of your actual tables if you aren't already planning to do that. Using a single log table makes it VERY difficult to work with. You usually have to log the information in a free-form text field which is difficult to query and process against. Also, it's difficult to look at data if you have a row for each column that has been changed because you have to do a lot of joins to look at changes that occur at the same time side by side.
In addition to all the above, which are great recommendations, I thought I would give a bit more context on the index/performance point.
As mentioned above, it is not possible to give a performance number, as the performance will differ depending on the quality and number of your indexes. It is also dependent on what operations you want to optimize. Do you need to optimize inserts? Or are you more concerned about query response?
If you are truly concerned about insert speed, then partitioning, as well as VERY careful index consideration, is going to be key.
The separate table recommendation from Tom H is also a good idea.
With audit tables, another approach is to archive the data once a month (or week, depending on how much data you put in it) or so. That way, if you need to recreate some recent changes with the fresh data, it can be done against smaller tables and thus more quickly (recovering from audit tables is almost always an urgent task, I've found!). But you still have the data available in case you ever need to go back farther in time.

How many database indexes is too many?

I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?
It depends on the operations that occur on the table.
If there's lots of SELECTs and very few changes, index all you like.... these will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs + DELETEs ... these will be very slow with lots of indexes since they all need to be modified each time one of these operations takes place
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan.
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with any optimization, I stop when the requested performance is reached (which obviously implies that step zero is getting specific performance requirements).
Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, the best PLANNED indexing strategy can still end up creating indexes that don't end up getting used. One strategy that lets you find indexes that aren't used is to monitor index usage. You do this as follows:-
alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
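A rough sketch of that check (the index name is a placeholder):

SELECT index_name, table_name, monitoring, used,
       start_monitoring, end_monitoring
FROM   v$object_usage
WHERE  index_name = 'MY_INDEX_NAME';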
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.
In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this it must be understood in the context that in a data warehouse we generally only insert a row once, we never update it, but it can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.
For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes on if the queries against the table would benefit.
In a paraphrase of Einstein about simplicity, add as many indexes as you need and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.
In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.
I made some simple tests on my real project and real MySql database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real project and real MySql database.
My results are: adding average index (1-3 columns in an index) to a table makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it ok to add many indexes? It depends :) I gave you my results - you decide!
Ultimately how many indexes you need depend on the behavior of your applications that ride on top of your database server.
In general the more inserting you do the more painful your indexes become. Each time you do an insert, all the indexes that include that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.
There's no static answer in my opinion, this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the opposite, in that queries are done over unrestricted combinations of fields and any one in particular could be used at any given time.
Beyond just indexing, there's reorganizing your DB to include calculated search fields, splitting tables, etc. - it's really dependent on your load shapes and query parameters, and on how much/what data 'really' needs to be returned by a query.
If your entire DB is fronted by stored-procedure facades, tuning becomes a bit easier, as you don't have to worry about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning Advisor useful - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'official' or third party.
This is really more of a theoretical question than a practical one. The impact of indexes on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard Oracle announced a dedicated storage appliance, made by HP, which is supposed to perform 10 times faster with the 11g database.
As for your case, there can be several solutions:
1. Have a large amount of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that applies to your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to IOT (index organized table), if your data supports this.
Of course there might be many more solutions for such case. My first suggestion to you, would be to clone the DB to a development environment, and run some stress testing against it.
An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to speed up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.
If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
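A minimal sketch of that single index serving all three search combinations (names hypothetical):

CREATE INDEX idx_search ON listing (column1, column2, column3);
-- usable for WHERE column1 = ?,
--            WHERE column1 = ? AND column2 = ?,
--        and WHERE column1 = ? AND column2 = ? AND column3 = ?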
How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the number of columns, IMHO.
What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criterion will cost you more to maintain than it saves you in the odd case it does get used.
Sql server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.
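For example, a hedged sketch along the same lines using the sys.dm_db_index_usage_stats DMV, comparing reads against writes per index:

SELECT OBJECT_NAME(s.object_id)                     AS table_name,
       i.name                                       AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates                               AS writes
FROM   sys.dm_db_index_usage_stats AS s
JOIN   sys.indexes                 AS i
       ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE  s.database_id = DB_ID()
ORDER BY writes DESC;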
It depends entirely on the columns being used in your WHERE clauses.
As a rule of thumb, you should have indexes on foreign key columns to avoid deadlocks.
AWR reports should be analyzed periodically to understand where indexes are needed.
