I am planning to set up a database for some rudimentary aggregations. The plan is to provide queries like SELECT SUM(energy) WHERE... to our users.
energy is a plain numerical field. The WHERE clause is more interesting, because we expose some (limited) customizability to the user, basically just ANDing some fields for equality. Like A=3, A=3 AND B=92, etc.
I'm no DBA, but my performance sense is tingling. As it stands, I'm anticipating a load of O(users × records) on the database if the queries are all fired off at once. Is there a way to optimize this?
If the WHERE condition were fixed, then we could simply provide a view or otherwise precalculate and cache the sum. Unfortunately, in this case, we will be providing limited ability to customize the WHERE expression, basically offering some fields for the users to AND at will.
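For instance, if users could only ever ask for A=3, something like this is the kind of precalculation I mean (assuming a PostgreSQL version with materialized views; readings, energy and a are hypothetical stand-ins for our schema):

    -- precompute the sum for the one fixed condition and refresh it periodically
    CREATE MATERIALIZED VIEW energy_sum_a3 AS
    SELECT SUM(energy) AS total_energy
    FROM readings
    WHERE a = 3;

    -- user queries then become a trivial lookup
    SELECT total_energy FROM energy_sum_a3;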
Looks to me like each of these aggregation queries would traverse basically the whole table, or significant subsections of the table, one traversal per user query. Does that make sense?
What are some ways to optimize for this kind of workload? Should I just shard over many of my fields? I'm also considering how many replicas to use, though I am unsure whether the number of replicas could keep pace with growth in users or data, given that each aggregation query touches most of the data.
In terms of low-level performance, would it make sense to structure these queries as SELECT SUM(energy)>N WHERE..., and hope that PostgreSQL is smart enough to terminate early when the subtotal is found to already exceed the threshold N?
Finally, would a NoSQL or TSDB offer advantages for this workflow, or would their performance be comparable to a SQL database?
Update
Since most of these queries will be run on a schedule, I think I will stagger them to spread the load throughout the day. But I'm still keen on finding better ways to optimize the tables for this load, should a bunch of active users suddenly submit aggregation queries all at once.
I'm looking to distribute some information to different machines for efficient and extremely fast access without any network overhead. The data exists in a relational schema, and it is a requirement to "join" on relations between entities, but it is not a requirement to write to the database at all (it will be generated offline).
I had a lot of confidence that SQLite would deliver on performance, but an RDBMS seems to be unsuitable at a fundamental level: joins are very expensive due to the cost of index lookups, and in my read-only context they are an unnecessary overhead, since entities could store direct references to each other in the form of file offsets. In this way, an index lookup is traded for a file seek.
What are my options here? Database doesn't really seem to describe what I'm looking for. I'm aware of Neo4j, but I can't embed Java in my app.
TIA!
Edit, to answer the comments:
The data will be up to 1 GB in size, and I'm using PHP, so keeping the data in memory is not really an option. I will rely on the OS buffer cache to avoid continually going to disk.
An example would be a Product table with 15 fields of mixed type, and a query to list products of a certain make, joining on a Category table.
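Roughly, the shape of the data and query (column names here are just illustrative):

    CREATE TABLE categories (
      id        INTEGER PRIMARY KEY,
      parent_id INTEGER,
      name      TEXT
    );

    CREATE TABLE products (
      id          INTEGER PRIMARY KEY,
      category_id INTEGER,
      make        TEXT
      -- ... plus another dozen or so fields of mixed type
    );

    SELECT p.*, c.name AS category
    FROM products p
    JOIN categories c ON c.id = p.category_id
    WHERE p.make = 'Acme';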
The solution will have to be some kind of flat file. I'm wondering if there already exists some software that meets my needs.
@Mark Wilkins:
The performance problem is measured. Essentially, it is unacceptable in my situation to replace a 2ms IO-bound query to Memcache with a 5ms CPU-bound call to SQLite... For example, the categories table has 500 records, containing parent and child categories. The following query takes ~8ms, with no disk IO: SELECT 1 FROM categories a INNER JOIN categories b ON b.id = a.parent_id. Some simpler, join-less queries are very fast.
I may not be completely clear on your goals as to the types of queries you are needing. But the part about storing file offsets to other data seems like it would be a very brittle solution that is hard to maintain and debug. There might be some tool that would help with it, but my suspicion is that you would end up writing most of it yourself. If someone else had to come along later and debug and figure out a homegrown file format, it would be more work.
However, my first thought is to wonder if the described performance problem is estimated at this point or actually measured. Have you run the tests with the data in a relational format to see how fast it actually is? It is true that a join will almost always involve more file reads (do the binary search as you mentioned, then get the associated record information, and then look up that record). This could take 4 or 5 or more disk operations ... at first. But the categories table (from the OP) could end up cached if it is commonly hit. This is a complete guess on my part, but in many situations the number of categories is relatively small. If that is the case here, the entire category table and its index may stay cached in memory by the OS and thus result in very fast joins.
If the performance is indeed a real problem, another possibility might be to denormalize the data. In the categories example, just duplicate the category value/name and store it with each product record. The database size will grow as a result, but you could still use an embedded database (there are a number of possibilities). If done judiciously, it could still be maintained reasonably well and provide the ability to read full object with one lookup/seek and one read.
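As a sketch of what I mean (guessing at the OP's table and column names), and since the data is generated offline, you could bake the category name into each product row when the read-only database is built:

    -- one-off step at generation time
    ALTER TABLE products ADD COLUMN category_name TEXT;

    UPDATE products
    SET category_name = (SELECT name
                         FROM categories
                         WHERE categories.id = products.category_id);

    -- reads are then single-table, no join needed
    SELECT * FROM products WHERE make = 'Acme';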
In general, probably the fastest thing you can do at first is to denormalize your data, thus avoiding JOINs and other multi-table lookups.
Using SQLite you can certainly customize all sorts of things and tailor them to your needs. For example, disable all mutexing if you're only accessing via one thread, increase the memory cache size, customize indexes (including getting rid of many of them), do a custom build to strip out unnecessary metadata, debugging, etc.
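For instance, a few illustrative PRAGMAs for a read-only, single-process workload (check each one against your SQLite version; mutexing itself is controlled at compile time via SQLITE_THREADSAFE or through sqlite3_config):

    PRAGMA cache_size = 10000;        -- number of pages to keep in the page cache
    PRAGMA temp_store = MEMORY;       -- keep temporary tables/indices in RAM
    PRAGMA journal_mode = OFF;        -- no rollback journal needed if you never write
    PRAGMA synchronous = OFF;         -- irrelevant for a read-only database
    PRAGMA locking_mode = EXCLUSIVE;  -- skip re-acquiring locks on every transaction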
Take a look at the following:
PRAGMA Statements: http://www.sqlite.org/pragma.html
Custom Builds of SQLite: http://www.sqlite.org/custombuild.html
SQLite Query Planner: http://www.sqlite.org/optoverview.html
SQLite Compile Options: http://www.sqlite.org/compile.html
This is all of course assuming a database is what you need.
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization can hurt performance, so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table), which works like a view except that it's driven by a query and you can schedule how often you want it to refresh, e.g. every 5 minutes. That table then stored the count values. (A sketch of the DDL is at the end of this answer.)
2) In the software package itself I set information like that as a system variable. So in Apache you can set a system-wide variable and refresh it every 5 minutes. Then it's somewhat accurate, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, so it's been a while, but I think in PHP it was as simple as:
    $_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value it saves you from running it with every request.
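For reference, the MQT from option 1 looked roughly like this in DB2 (table and column names invented, and the syntax is from memory, so double-check it against your DB2 version):

    CREATE TABLE answer_counts AS
      (SELECT question_id, COUNT(*) AS answer_count
       FROM answers
       GROUP BY question_id)
      DATA INITIALLY DEFERRED REFRESH DEFERRED;

    -- run on whatever schedule you like, e.g. every 5 minutes
    REFRESH TABLE answer_counts;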
I was looking for improvements to the PostgreSQL/InnoDB MVCC COUNT(*) problem when I found an article about implementing a workaround in PostgreSQL. However, the author made a statement that caught my attention:
MySQL zealots tend to point to PostgreSQL’s slow count() as a weakness, however, in the real world, count() isn’t used very often, and if it’s really needed, most good database systems provide a framework for you to build a workaround.
Are there ways to skip using COUNT(*) in the way you design your applications?
Is it true that most applications are designed so they don't need it? I use COUNT() on most of my pages since they all need pagination. What is this guy talking about? Is that why some sites only have a "next/previous" link?
Carrying this over into the NoSQL world, is this also something that has to be done there since you can't COUNT() records very easily?
I think when the author said "however, in the real world, count() isn’t used very often", they specifically meant an unqualified count(*) isn't used very often, which is the specific case that MyISAM optimises.
My own experience backs this up: apart from some dubious Munin plugins, I can't think of the last time I did a select count(*) from sometable.
For example, anywhere I'm doing pagination, it's usually the output of some search, which implies there will be a WHERE clause to limit the results anyway. So I might be doing something like select count(*) from sometable where conditions followed by select ... from sometable limit n offset m. Neither of these can use the direct how-many-rows-in-this-table shortcut.
Now it's true that if the conditions are purely index conditions, then some databases can merge together the output of covering indices to avoid looking at the table data, which certainly decreases the number of blocks looked at... if it works. It may be that, for example, this is only a win if the query can be satisfied by a single index; it depends on the db implementation.
Still, this is by no means always the case: a lot of our tables have an active flag which isn't indexed but is often filtered on, so they would require a heap check anyway.
If you just need an idea of whether a table has data in it or not, PostgreSQL and many other systems do retain estimated statistics for each table: you can examine the reltuples and relpages columns in the catalogue (pg_class) for an estimate of how many rows the table has and how much space it is taking. Which is fine so long as ~6 significant figures is accurate enough for you, and some lag in the statistics being updated is tolerable. In the use case I can remember (plotting the number of items in a collection) it would have been fine for me...
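For example, in PostgreSQL (substitute your own table name):

    -- approximate row count and on-disk size, as of the last ANALYZE/VACUUM
    SELECT reltuples::bigint AS approx_rows,
           relpages          AS pages
    FROM   pg_class
    WHERE  relname = 'sometable';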
Trying to maintain an accurate row counter is tricky. The article you cited caches the row count in an auxiliary table, which introduces two problems:
a race condition between SELECT and INSERT populating the auxiliary table (minor, you could seed this administratively)
as soon as you add a row to the main table, you have an update lock on the row in the auxiliary table. Now any other process trying to add to the main table has to wait.
The upshot is that concurrent transactions get serialised instead of being able to run in parallel, and you've lost the writers-don't-have-to-block-either benefits of MVCC; you should reasonably expect to be able to insert two independent rows into the same table at the same time.
MyISAM can cache the row count per table because it takes an exclusive lock on the table when someone writes to it (IIRC). InnoDB allows finer-grained locking, but it doesn't try to cache the row count for the table. Of course, if you don't care about concurrency and/or transactions, you can take shortcuts... but then you're moving away from PostgreSQL's main focus, where data integrity and ACID transactions are primary goals.
I hope this sheds some light. I must admit, I've never really felt the need for a faster "count(*)", so to some extent this is simply a "but it works for me" testament rather than a real answer.
While you're really asking an application-design question more than a database question, there is more detail about how counting executes in PostgreSQL, and the alternatives to doing so, at Slow Counting. If you must have a fast count of something, you can maintain one with a trigger, with examples in the references there. That costs you a bit on the insert/update/delete side in return for speeding the count up. You have to know in advance what you will eventually want a count of for that to work, though.
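A minimal sketch of that trigger approach in PostgreSQL, using a hypothetical items table (the Slow Counting references have more complete versions):

    CREATE TABLE row_counts (table_name text PRIMARY KEY, n bigint NOT NULL);
    INSERT INTO row_counts VALUES ('items', (SELECT count(*) FROM items));

    CREATE FUNCTION count_rows() RETURNS trigger AS $$
    BEGIN
      IF TG_OP = 'INSERT' THEN
        UPDATE row_counts SET n = n + 1 WHERE table_name = TG_TABLE_NAME;
      ELSIF TG_OP = 'DELETE' THEN
        UPDATE row_counts SET n = n - 1 WHERE table_name = TG_TABLE_NAME;
      END IF;
      RETURN NULL;  -- AFTER trigger, return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER items_row_count
    AFTER INSERT OR DELETE ON items
    FOR EACH ROW EXECUTE PROCEDURE count_rows();

    -- the "fast count" is then:
    SELECT n FROM row_counts WHERE table_name = 'items';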
I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?
It depends on the operations that occur on the table.
If there are lots of SELECTs and very few changes, index all you like; the indexes will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs and DELETEs, those operations will be very slow with lots of indexes, since all of the indexes need to be modified each time one of them takes place.
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan.
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with any optimization, I stop when the requested performance is reached (this obviously implies that step 0 would be getting specific performance requirements).
Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, even the best PLANNED indexing strategy can still end up creating indexes that don't get used. One strategy that lets you find unused indexes is to monitor index usage. You do this as follows:
    alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
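For example (substitute the name of the index you flagged above; Oracle stores it in upper case unless it was quoted):

    -- after the index has been monitored for a while, see whether it was ever used
    SELECT index_name, table_name, monitoring, used,
           start_monitoring, end_monitoring
    FROM   v$object_usage
    WHERE  index_name = 'MY_INDEX_NAME';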
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.
In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this, it must be understood in the context that in a data warehouse we generally insert a row only once and never update it, but it can then participate in thousands of SELECT queries which might benefit from an index on any of the columns.
For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
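As a sketch in Oracle syntax (the fact table and column names are invented):

    -- low-to-medium cardinality dimension keys: one bitmap index per column
    CREATE BITMAP INDEX sales_fact_cust_bx ON sales_fact (customer_key);
    CREATE BITMAP INDEX sales_fact_date_bx ON sales_fact (date_key);

    -- high cardinality column: a compressed b-tree index instead
    CREATE INDEX sales_fact_txn_ix ON sales_fact (transaction_id) COMPRESS 1;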
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
To address your question more directly: I think it is probably fine to index the obvious candidates at first, but do not be afraid of adding more indexes later if queries against the table would benefit.
To paraphrase Einstein on simplicity: add as many indexes as you need, and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.
In addition to the points everyone else has raised, the Cost-Based Optimizer incurs a higher cost when creating a plan for a SQL statement if there are more indexes, because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache; Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.
I made some simple tests on my real project and real MySQL database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real project and real MySql database.
My results are: adding average index (1-3 columns in an index) to a table - makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
Ultimately, how many indexes you need depends on the behavior of the applications that ride on top of your database server.
In general, the more inserting you do, the more painful your indexes become. Each time you do an insert, all the indexes on that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.
There's no static answer in my opinion; this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the opposite, in that queries are done over unrestricted combinations of fields and any one in particular could be used at any given time.
Beyond just indexing, there's reorganizing your DB to include calculated search fields, splitting tables, etc. It really depends on your load shapes and query parameters, and how much/what data 'really' needs to be returned by a query.
If your entire DB is fronted by stored-procedure facades, tuning becomes a bit easier, as you don't have to worry about every ad-hoc query. Or you may have a deep understanding of the kinds of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning Advisor useful: you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'official' or third party.
This is really more of a theoretical question than a practical one. The impact of indexes on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard that Oracle announced a dedicated storage appliance, made by HP, which is supposed to perform 10 times faster with the 11g database.
As for your case, there can be several solutions:
1. Have a large number of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that fits your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to an IOT (index-organized table), if your data supports this (see the sketch after this list).
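As an illustration of option 4 (Oracle syntax; table and column names are invented), an index-organized table stores the rows in the primary key index itself, so lookups by that key avoid a separate table access:

    CREATE TABLE search_data (
      field_a  NUMBER,
      field_b  NUMBER,
      payload  VARCHAR2(200),
      CONSTRAINT search_data_pk PRIMARY KEY (field_a, field_b)
    )
    ORGANIZATION INDEX;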
Of course there might be many more solutions for such case. My first suggestion to you, would be to clone the DB to a development environment, and run some stress testing against it.
An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to speed up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.
If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
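For example (with illustrative names):

    -- serves filters on (column1), (column1, column2) and (column1, column2, column3);
    -- a query filtering only on column2 or column3 generally cannot use it
    CREATE INDEX ix_t_c1_c2_c3 ON t (column1, column2, column3);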
How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the number of columns, IMHO.
What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criterion will cost you more to maintain than it will save you in the odd case it does get used.
SQL Server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.
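Along the same lines as the queries in that article, something like this (SQL Server 2005 and later) compares how often each index is read versus written:

    SELECT OBJECT_NAME(s.object_id)                     AS table_name,
           i.name                                       AS index_name,
           s.user_seeks + s.user_scans + s.user_lookups AS reads,
           s.user_updates                               AS writes
    FROM   sys.dm_db_index_usage_stats AS s
    JOIN   sys.indexes AS i
           ON i.object_id = s.object_id
          AND i.index_id  = s.index_id
    WHERE  s.database_id = DB_ID();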
It depends mainly on the columns being used in the WHERE clause.
As a rule of thumb, you should have indexes on foreign key columns to avoid deadlocks.
AWR reports should be analyzed periodically to understand which indexes are needed.