I created a table with two columns, a and b. Column a is simply the numbers 1 to 100 million. Column b is a random integer between 0 and 999 inclusive. I wanted to use this table to check how indexes improve calculations. So I checked the following:
select count(*) from my_table where b = 332
select avg(a) from my_table where b = 387
The 332 and 387 are just arbitrary values; I used different ones to make sure nothing was being cached.
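(For completeness, here is a sketch of one way such a table could be generated in Oracle; the dbms_random call and the two 10,000-row generators are just illustrative, not necessarily how it was actually built:)
create table my_table as
select rownum as a,                            -- 1 .. 100 million
       trunc(dbms_random.value(0, 1000)) as b  -- random integer 0..999
from   (select level from dual connect by level <= 10000),
       (select level from dual connect by level <= 10000);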
Then I created an index:
create bitmap index myindx1 on my_table (b);
commit;
This brought the count(*) down from 14 seconds to 75 milliseconds, success!
But the avg(a) didn't fare so well. It actually got worse, going from 8 seconds to 10 seconds. I didn't test this many times, and based on the plans the slowdown looks like a fluke, but at the very least the query doesn't seem to be doing much better, as I expected it to.
The explain plan without the index looks like:
The explain plan with the index looks like:
So it looks like it's helping a bit, but is it really that much more expensive to average numbers than count them? And way more expensive to average numbers than to do a full table scan? I thought this index would cut my query into a fraction of the original cost rather than just shaving off a little bit of time. Is there something else I can do to speed up this query?
Thanks.
The problem is the way you set up your test - it isn't realistic and it is bad for indexes.
First: you have just two integer columns in your table, so each row is VERY small. So, Oracle can fit a lot of rows into each database block -- like a few thousand rows per block.
Second: you created your indexed data randomly, with values between 0 and 999.
Put those two facts together and what can we guess? Answer: just about every single database block is going to have at least one row with any given value of column B.
So, no matter what value of B you look for, you are going to wind up reading every block in your table one at a time (i.e.: "sequential read").
Compare that to the plan using no index -- a full table scan -- where Oracle will still read every single block, but it will read them several blocks at a time (i.e., "scattered read").
No wonder your index didn't help.
If you want a better test, add a column C to your test table that is just a string of 200-300 characters (e.g., "XXXXXXXXX..."). This will reduce the number of rows per block to a more realistic value, and you should see better gains from your index.
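For instance, a padded copy of the test table could be built roughly like this (my_table_wide is just a hypothetical name, and 250 is an arbitrary padding length):
create table my_table_wide as
select a,
       b,
       rpad('X', 250, 'X') as c   -- filler column to make each row realistically wide
from   my_table;

create bitmap index myindx2 on my_table_wide (b);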
LAST NOTE: be very careful about using a BITMAP index. They are all but unusable on tables that have any sort of DML (insert, update, delete) happening on them! Read all about them before using one!
UPDATE
Clarification on this:
So it looks like it's helping a bit, but is it really that much more expensive to average numbers than count them? And way more expensive to average numbers than to do a full table scan?
The reason your index helped your COUNT(*) query is that the index by itself will tell Oracle how many rows meet the condition B=332, so it does not need to read the table blocks and therefore does not suffer from the problem I described above (i.e., reading each table block one-by-one).
It's not that COUNT() is "faster" than AVG(). It's just that, in your test, the COUNT could be computed using only the index, whereas AVG needed information from the table.
Bitmap indexes should not be used in OLTP systems. Their maintenance cost is too high.
IMHO a plain B*tree index will be enough. An INDEX RANGE SCAN traverses from the root to the leftmost leaf having the value "332" and then iterates from left to right, visiting all leaves with the same value of "B". This is all you want.
If you want to speed it up even more, you can create a so-called covering index. Put both columns "B" and "A" (in this order) into the index. Then you will avoid the lookup into the table for the value of "A" when "B" is matched. It is especially helpful if the table contains many columns you do not care about.
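A minimal sketch of such a covering index (the index name is just an example):
create index myidx_b_a on my_table (b, a);

-- With this index the AVG query can be answered from the index alone,
-- without ever visiting the table blocks:
select avg(a) from my_table where b = 387;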
Related
Consider an enum column with roughly 8-12 distinct string values that appears across 100 million rows in a PostgreSQL database. This column is used in a complex search query in conjunction with other conditions.
Objectively speaking, which indexing algorithm (GiST or BTREE) would offer the most performance gain for this specific column?
If the distribution is even, and a WHERE condition on this column will reduce the result set by a factor of 8 to 12, then an index on the column may well make sense.
However, you should never think about creating an index just by looking at the data in the table. The most important thing to consider is the query that should become faster. Once you know the query, an answer can be much more definitive.
If I understand correctly, you have about 8 to 12 different string values in a column of a 100 million row table.
If the distribution of these 8 to 12 values is even, that means a filter on one of these values will return about 10 million rows, which is too much for an index on this column alone to be useful. You have to create indexes that involve all the columns included in:
first, the WHERE predicates with an equality search
second, the WHERE predicates with an inequality search (>, <, ...)
third, the ON clause of the JOIN operator
fourth, the GROUP BY or DISTINCT clauses
fifth, the ORDER BY, if any
In that specific order, as sketched below.
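As a purely hypothetical illustration (every table, column, and index name below is made up), a query and a matching index built in that order might look like:
SELECT t.region, count(*)
FROM   big_table t
JOIN   customers c ON c.id = t.customer_id
WHERE  t.status = 'open'                 -- equality predicate
AND    t.created_at > DATE '2020-01-01'  -- inequality predicate
GROUP  BY t.region
ORDER  BY t.region;

-- equality column first, then inequality, then the join column, then GROUP BY / ORDER BY
CREATE INDEX ix_big_table_search
    ON big_table (status, created_at, customer_id, region);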
I execute a simple query:
SELECT * FROM TABLE1
WHERE ID > 9 AND ID < 11
and the query verbose plan is:
[SPU Sequential Scan table "TABLE1" {(TABLE1."ID")}]
-- Estimated Rows = 1, ...
But after changing the where clause to
WHERE ID = 10
the query verbose plan changes:
[SPU Sequential Scan table "TABLE1" {(TABLE1."ID")}]
-- Estimated Rows = 1000, ...
(where 1000 is the total number of rows in TABLE1).
Why is it so? How does the estimation work?
The optimizer of any cost-based database is always full of surprises, and this one is not unusual across the platforms I'm familiar with.
A couple of questions:
- Have you created statistics on the table? (Otherwise you are flying blind.)
- What is the datatype of that column? (I hope it is an integer of some sort, not a NUMBER(x,y), even if y=0.)
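If statistics have not been created yet, they can be generated explicitly; a minimal example, assuming standard Netezza syntax and the TABLE1 from the question:
GENERATE STATISTICS ON TABLE1;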
Furthermore:
The statistics for a column in Netezza contain no distribution statistics (it won't know whether there are more "solved" than "unsolved" cases in a support-system table with 5 years' worth of data). Instead it relies on two things:
1) for all tables: simple statistics if you create them (number of distinct values, max+min values, number of nulls)
2) for largish tables (I think the configurable minimum is close to 100 million rows) it creates JIT (Just In Time) statistics by scanning a few random data pages on the data slices that satisfy the zone-mappable where clauses, and creating statistics for this one query.
The last feature is actually quite powerful, even though it adds runtime to the planning phase of the query. It significantly increases the likelihood that if there is some correlation between two where clauses on a table, this will be taken into account.
An example: a where clause on (AGE>60 AND Retired=true) against a list of all citizens in a major city. The AGE restriction is most likely more or less redundant, and Netezza will know that.
In general you should not worry about the estimated number of rows being a bit off (as in this case) with Netezza; it will most often get it "right enough" and throw hardware at the problem to compensate for any minor mistakes.
Until recently I worked with SQL Server, which is notorious (it may be better in newer versions) for being overly optimistic about the selectivity of where clauses, and for ending up with access plans containing 5 levels of nested-loop joins with millions of rows in each when joining 6 tables. Changing where clauses much like you did in the question will cause SQL Server to put less emphasis on a specific restriction, and that can cause the 5 joins to switch to a more efficient HASH or other algorithm, resulting in better performance. In my experience that is much too frequent an occurrence on databases that rely too heavily on these estimates - perhaps because the optimizer was not created/tuned for a warehouse-like workload.
I have a table with 1.3 million rows
I had a smallint (indexed) column in this table, and when I was running a very simple query:
select * from table where field = x order by id limit 100
sometimes (when I changed x to different values) the query was very slow (sometimes 10-20 seconds).
Then I altered this column to the int type and also created an index on it.
Now the same queries are almost always much faster than before; they take no more than 1 second.
So smallint takes less space on disk, but reads on the int type perform much better.
Is that right? If so, why?
The reason is probably either data skew or stale index statistics.
The first is the distribution of values. If there are only a few values in the column, then Postgres is smart enough not to use the index. So, it depends on the selectivity of the index.
The same thing can happen if the index statistics need to be updated.
It is highly unlikely that the difference in data types would be driving this. More likely, the fresh index that is created has up-to-date statistics.
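If stale statistics are the suspect, a quick sanity check in PostgreSQL is to refresh them and re-examine the plan (my_table and the literal 5 are just placeholders for the table and value from the question):
ANALYZE my_table;

EXPLAIN ANALYZE
SELECT * FROM my_table WHERE field = 5 ORDER BY id LIMIT 100;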
Does the number of records in a db affect the speed of select queries?
I mean, if a db has 50 records and another one has 5 million records, will the selects from the 2nd one be slower, assuming I have all the indexes in the right place?
Yes, but it doesn't have to be a large penalty.
At the most basic level an index is a b-tree. Performance is somewhat correlated to the number of levels in the tree: if the tree were binary, a 5 record database would have about 2 levels and a 5 million record database about 22 levels, and because the growth is logarithmic a 10 million row database only adds one more level (23). Real database b-trees have a much higher fan-out, so in practice they have far fewer levels. Either way, index access times are typically not the problem in performance tuning - the usual problem is tables that aren't indexed properly.
As noted by odedsh, caching is also a large contributor, and small databases will be cached well. SQLite stores records in primary key sequence, so picking a primary key that allows records that are commonly used together to be stored together can be a big benefit.
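For example, in SQLite a WITHOUT ROWID table is stored as a B-tree ordered by its primary key, so a composite key can keep related records physically together (the table and columns here are hypothetical):
CREATE TABLE events (
    user_id    INTEGER,
    created_at INTEGER,
    payload    TEXT,
    PRIMARY KEY (user_id, created_at)   -- rows are stored in (user_id, created_at) order
) WITHOUT ROWID;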
Yeah it matters for the reasons the others said.
There are other things that can affect the speed of SELECT statements too, such as how many columns you're grabbing data from.
I once did some speed tests in a table with over 150 columns, where I needed to grab only about 40 of the columns, and I needed all 20,000+ records. While the speed differences were very minimal (we're talking 20 to 40 milliseconds), it was actually faster to grab the data from all the columns with a 'SELECT *' than to list the ones I wanted with a 'SELECT Field1, Field2, etc.'.
I assume the more records and columns in your table, the greater the speed difference this example will net you, but I never had a need to test it any farther in more extreme cases like 5 million records in a table.
Yes.
If a table is tiny and the entire db is tiny, then when you select anything from the table it is very likely that all the data is already in memory and the result can be returned immediately.
If the table is huge but you have an index and you are doing a simple select on the indexed columns, then the index can be scanned, the correct blocks can be read from disk, and the result returned.
If there is no index that can be used then the db will do a full table scan reading the table block by block looking for matches.
If there is a partial match between the index columns and the columns in the select query, then the db can try to minimize the number of blocks that need to be read. And a lot of thought can go into properly choosing the index structure and type (BITMAP / REGULAR).
And this is just for the most basic SQL that selects from a single table without any calculations.
I have a sproc that puts 750K records into a temp table through a query as one of its first actions. If I create indexes on the temp table before filling it, the item takes about twice as long to run compared to when I index after filling the table. (The index is an integer in a single column, the table being indexed is just two columns each a single integer.)
This seems a little off to me, but then I don't have the firmest understanding of what goes on under the hood. Does anyone have an answer for this?
If you create a clustered index, it affects the way the data is physically ordered on the disk. It's better to add the index after the fact and let the database engine reorder the rows when it knows how the data is distributed.
For example, let's say you needed to build a brick wall with numbered bricks so that those with the highest number are at the bottom of the wall. It would be a difficult task if you were just handed the bricks in random order, one at a time - you wouldn't know which bricks were going to turn out to be the highest numbered, and you'd have to tear the wall down and rebuild it over and over. It would be a lot easier to handle that task if you had all the bricks lined up in front of you, and could organize your work.
That's how it is for the database engine - if you let it know about the whole job, it can be much more efficient than if you just feed it a row at a time.
It's because the database server has to do calculations each and every time you insert a new row. Basically, you end up reindexing the table each time. It doesn't seem like a very expensive operation, and it's not, but when you do that many of them together, you start to see the impact. That's why you usually want to index after you've populated your rows, since it will just be a one-time cost.
Think of it this way.
Given
unorderedList = {5, 1, 3}
orderedList = {1, 3, 5}
Add 2 to both lists.
unorderedList = {5, 1, 3, 2}
orderedList = {1, 2, 3, 5}
Which list do you think is easier to add to?
By the way, ordering your input before the load will give you a boost.
You should NEVER EVER create an index on an empty table if you are going to massively load it right afterwards.
Indexes have to be maintained as the data in the table changes, so imagine that for every insert into the table the index is being recalculated (which is an expensive operation).
Load the table first and create the index after finishing with the load.
That's where the performance difference is going.
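A minimal sketch of that pattern for the temp table described in the question (the table, column, and index names are placeholders):
CREATE TABLE #work (id INT, val INT);

-- Load first: the 750K-row query goes here
INSERT INTO #work (id, val)
SELECT id, val FROM dbo.source_table;

-- Then build the index once, over the fully loaded table
CREATE INDEX ix_work_val ON #work (val);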
After performing large data manipulation operations, you frequently have to update the statistics on the underlying indexes. You can do that by using the UPDATE STATISTICS [table] statement.
The other option is to drop and recreate the index which, if you are doing large data insertions, will likely perform the inserts much faster. You can even incorporate that into your stored procedure.
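For example (the temp table and index names here are placeholders):
-- Refresh statistics after a large load
UPDATE STATISTICS #work;

-- Or drop the index, do the bulk insert, then recreate the index
DROP INDEX ix_work_val ON #work;
-- ... large INSERT here ...
CREATE INDEX ix_work_val ON #work (val);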
This is because if the data you insert is not in the order of the index, SQL Server will have to split pages to make room for additional rows in order to keep them together logically.
This is due to the fact that when SQL Server indexes a table that already contains data, it is able to produce exact statistics of the values in the indexed column. At certain moments SQL Server will recalculate statistics, but when you perform massive inserts the distribution of values may change after the statistics were last calculated.
The fact that statistics are out of date can be discovered in Query Analyzer, when you see that for a certain table scan the expected number of rows differs too much from the actual number of rows processed.
You should use UPDATE STATISTICS to recalculate the distribution of values after you have inserted all the data. After that, no performance difference should be observed.
If you have an index on a table, then as you add data to the table SQL Server has to re-order the index (and, for a clustered index, the table itself) to make room in the appropriate place for the new records. If you're adding a lot of data, it will have to do that re-ordering over and over again. By creating an index only after the data is loaded, the re-ordering only needs to happen once.
Of course, if you are importing the records in index order it shouldn't matter so much.
In addition to the index overhead, running each insert as its own transaction is a bad idea for the same reason. If you run chunks of inserts (say 100) within one explicit transaction, you should also see a performance increase.
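A rough sketch of that batching idea in T-SQL (the table names, chunk size, and loop bounds are purely illustrative):
DECLARE @i INT = 0;
WHILE @i < 750000
BEGIN
    BEGIN TRANSACTION;

    -- one chunk of ~100 rows per transaction instead of one transaction per row
    INSERT INTO #work (id, val)
    SELECT id, val
    FROM   dbo.source_table
    WHERE  id > @i AND id <= @i + 100;

    COMMIT TRANSACTION;
    SET @i = @i + 100;
END;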