Let there be a kdb table with columns A, B and C, and suppose it is sorted on column A.
I want to understand the complexity of inserting records into this table (given that it has to remain sorted on A).
Would it help (in terms of complexity) if the insertions into this table are guaranteed to arrive in sorted order of A, i.e. for any times t2 > t1, A(t2) > A(t1)?
Is there a way I can exploit the above fact (t2 > t1 => A(t2) > A(t1)) and optimise queries without even applying the sorted attribute on A?
I know there is a way to perform a binary search on a column, but I mainly want to know whether I can tell the query planner to "assume" a column is sorted (without actually having the sorted attribute on it, because I want to avoid the insertion cost associated with it) and execute the query accordingly.
My thoughts (some of which are just opinion as we can't see exactly what kdb does under the covers):
A. To clarify - kdb itself doesn't "keep the table sorted". Kdb will insert the data regardless; it's up to the user to ensure the table remains sorted.
B. I don't think you should be worried about overhead/complexity in kdb inserts - I would estimate that insert is one of the most optimised operations in all of kdb.
C. Whether the column has an attribute or not, kdb will do the insert either way, and possibly only after the insert does it check to see whether the attribute was retained or not. This would be a highly optimised check. s# will be lost on a non-sorted insert. u# will be lost on a non-unique insert. p# will be lost on any insert, as it's generally intended to be used on static/on-disk data.
D. The only case where there would be a non-negligible cost/complexity of insert would be in maintaining a grouped attribute, since g# is always retained on insert and there's the overhead of updating the hidden hash table. But even then, this overhead doesn't impact high-volume RDBs which have billions of inserts in a given day.
None of these are hard numbers or big-O/complexity figures, but in my experience the big-O/complexity is more relevant to the lookup of attributed data than to the insert/maintenance of the attribute/data. The insert has never been a concern.
To answer your actual questions:
As I alluded to in (A), if you want to have a sorted attribute and you want to retain it, then you must ensure the data is inserted in sorted order.
If there's no attribute then kdb treats a column/vector like any other vector - it would scan the entire vector each time since there is no flag/attribute telling it to use an optimisation. The only exception to this is with as-of joins (or window joins) aj/wj where an aj on say `sym`time assumes that time is sorted within sym, without there being an explicit s# attribute on time.
Apart from the aj/wj exceptions above: no, if you want to take advantage of the sorted nature of the data to speed up queries, then you need to have an s# attribute on it. Unless of course you use a different attribute such as p#, which as I mentioned earlier has its own caveats.
Related
Having the following statements:
SELECT * FROM tadir ORDER BY PRIMARY KEY INTO TABLE @DATA(lt_tadir).
DATA lt_tadir TYPE SORTED TABLE OF tadir WITH UNIQUE KEY pgmid object obj_name.
SELECT * FROM tadir INTO TABLE @lt_tadir.
Why is the first one around 4 times slower (verified on multiple systems)? Relevant statements from the documentation:
For performance reasons, a sort should only take place in the database if supported by an index. This is guaranteed only when ORDER BY PRIMARY KEY is specified.
If a sorted resulting set is assigned to a sorted internal table, the internal table is sorted again according to the sorting instructions.
First I thought maybe column storage is an issue, but I tried another column-store table where both statements take about the same time (even though the second one seems to be a bit faster each time).
Same for the row storage I tested.
Also, I would have expected the sort in ABAP (second documentation snippet) to have a performance impact, but it outperforms the primary key index select.
Tables aren't fully buffered, and even if that was the case, ORDER BY PRIMARY KEY does use the buffer.
Any ideas?
I ran the test (several times) and got at least similar execution times for both statements.
The application server sort version was about 25% faster than the HANA sort.
So my mileage was different.
That the HANA sort isn't "faster" is only mildly surprising until you look at the table definition. Sorting the entire inverted hash index is not what it was designed for. :)
Some "rules" are meant to be broken.
Sorting 5 million keys with inverted hashes might be a good example.
And now that you have 5 million records in memory, reading rows quickly by key will favor the internally sorted table anyway. ;)
DATA lt_tadir TYPE SORTED TABLE OF tadir WITH UNIQUE KEY pgmid object obj_name
is more access-friendly than the simple standard table @DATA(lt_tab) anyway.
There are known disadvantages with inverted hash indexes.
With typical access patterns they are not normally noticed.
https://www.stechies.com/hana-inverted-individual-indexes-interview-questionsand-ans/
I'd like to argue that this performance test is senseless and any conclusions drawn from it are incorrect as:
Sorting in ABAP and sorting on the database only yields the same results if the whole table is selected. If the number of results is limited (e.g. get the last 100 invoices) then sorting on the database yields the correct result while sorting in ABAP (after the limit) does not. As one rarely selects a whole table the test case is completely unrealistic.
Sorting by the primary key (pgmid, object, obj_name) does not make any sense. In which scenario would that be useful? You might want to search for a certain object, in which case sorting by obj_name might be useful, or you might want to see recently transported objects and sort by the correction number (korrnum).
Just to demonstrate, run the following example:
REPORT Z_SORT_PERF.
DATA:
start TYPE timestampl,
end TYPE timestampl.
DATA(limit) = 10000.
* ------------- HANA UNCOMMON SORT ----------------------------------------
GET TIME STAMP FIELD start.
SELECT * FROM tadir ORDER BY PRIMARY KEY INTO TABLE @DATA(hana_uncommon) UP TO @limit ROWS.
GET TIME STAMP FIELD end.
WRITE |HANA uncommon sort: { cl_abap_tstmp=>subtract( tstmp1 = end tstmp2 = start ) }s|.
NEW-LINE.
* ------------- HANA COMMON SORT ----------------------------------------
GET TIME STAMP FIELD start.
SELECT * FROM tadir ORDER BY KORRNUM INTO TABLE @DATA(hana_common) UP TO @limit ROWS.
GET TIME STAMP FIELD end.
WRITE |HANA common sort: { cl_abap_tstmp=>subtract( tstmp1 = end tstmp2 = start ) }s|.
NEW-LINE.
On my test system this runs in 1.034s (the uncommon one) vs 0.08s (the common one). That's 10 times faster (though that comparison also makes little sense). Why is that? Because there is a secondary index defined on the KORRNUM field for exactly that purpose (sorting), while the primary key index is intended for enforcing uniqueness constraints and retrieving single records.
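For readers more comfortable with plain SQL than with ABAP Dictionary indexes, here is a rough, hypothetical sketch of the difference (the index name is made up; on a real SAP system the TADIR indexes are maintained in the ABAP Dictionary, not created by hand):
-- Hypothetical sketch: a secondary index that supports sorting/limiting by KORRNUM.
CREATE INDEX tadir_korrnum ON tadir (korrnum);
-- With that index the database can return the first N rows in KORRNUM order cheaply:
SELECT * FROM tadir ORDER BY korrnum LIMIT 10000;
-- Sorting by the full primary key (pgmid, object, obj_name) gets no such help here and may
-- require sorting the whole (inverted-hash-indexed) table:
SELECT * FROM tadir ORDER BY pgmid, object, obj_name LIMIT 10000;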
In general a database is not meant to optimize for one single query, but is optimized for overall system performance (under memory constraints). This is usually achieved by sacrificing the performance of uncommon queries for the optimization of common ones.
For performance reasons, a sort should only take place in the database if supported by an index. This is guaranteed only when ORDER BY PRIMARY KEY is specified.
That's a slightly misleading formulation. There are indices that are not optimal for sorting, and although there is no guarantee that a database creates a secondary index and uses it while sorting, it is very likely that it does so in common cases.
Whilst analyzing the database structure of a legacy application, I discovered in several tables there are 2 unique indices which both have the exact same columns, except in a different order.
Having 2 unique indices covering the same columns is clearly redundant, so my first instinct was to completely drop one of them. But then I thought some of the queries emitted by the application might be making use of the index I might delete, so I thought to convert it instead into a regular index.
To the best of my knowledge, whenever a row is inserted/updated in a table having a unique index, SQL Server spends some milliseconds validating each unique index/constraint still holds true - so by converting one of these indices into a non-unique I hope processing of this table might be sped up a bit, please confirm or dispel.
On the other hand, I don't understand what the benefit is of having two unique indices covering the same columns on a table. Any ideas what this could be done for? Could something get lost if I convert one of them into a regular one?
Check the index usage stats to see if they are both being used:
sys.dm_db_index_usage_stats
If not, delete the unused index.
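A minimal sketch of that check (dbo.MyTable is a placeholder for the table with the duplicate indexes; sys.indexes and sys.dm_db_index_usage_stats are standard SQL Server catalog/DM views):
-- How often has each index on the table been used since the last SQL Server restart?
SELECT  i.name AS index_name,
        s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM    sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON  s.object_id   = i.object_id
       AND s.index_id    = i.index_id
       AND s.database_id = DB_ID()
WHERE   i.object_id = OBJECT_ID('dbo.MyTable');   -- placeholder table name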
Generally speaking, indexes are used for filtering, then ordering. It is possible that you have queries that need to filter on the leading columns of both indexes. If that is the case, you'll reduce how deeply the query can be optimized by getting rid of one. That may not be a big deal, as it may still be able to satisfactorily use the remaining index.
For example, if I have 2 indexes with four columns:
1: Columns A, B, C, D
2: Columns A, B, D, C
Any query that currently prefers #2 could still gain benefits by using #1 if #2 is not available. It would just limit the selectivity to column B rather than all the way down to column D.
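As a sketch (the table and index names are hypothetical), the two shapes above would be:
-- Two unique indexes over the same four columns, differing only in column order.
CREATE UNIQUE INDEX ux_sample_a_b_c_d ON dbo.Sample (A, B, C, D);
CREATE UNIQUE INDEX ux_sample_a_b_d_c ON dbo.Sample (A, B, D, C);
-- A query filtering on A, B and D fits the second index best, but could still seek on
-- the (A, B) prefix of the first one if the second index were removed.
SELECT * FROM dbo.Sample WHERE A = 1 AND B = 2 AND D = 4;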
If you're not sure, try disabling (not deleting) the less used index and see if you notice any problems. If something slows down, it is simple enough to enable it again.
As always, try it in a non-production environment first.
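In SQL Server that experiment looks roughly like this (index and table names continue the hypothetical example above); rebuilding is what re-enables a disabled index:
-- Stop maintaining/using the index without losing its definition.
ALTER INDEX ux_sample_a_b_d_c ON dbo.Sample DISABLE;
-- Note: while a unique index is disabled, its uniqueness is not enforced.
-- ... observe the workload for a while ...
ALTER INDEX ux_sample_a_b_d_c ON dbo.Sample REBUILD;   -- re-enable it if needed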
UPDATE
Yes you can safely remove the uniqueness of one of the indexes. It only needs to be enforced by one of them. The only concern would be if the vendor decided to do the same and chooses the other index.
However, since this is from a vendor, I'd recommend you contact them if there are performance concerns. If you're not running into a performance issue worth a support request to them, then just leave it alone.
For any given single table that has more than one column that you look up against using a WHERE clause, where those columns are int or bigint, at what point does it become worth creating an index on those columns?
I am aware that I should create those indexes anyway; this question is about when the performance advantage of having those indexes there kicks in, in terms of table size.
Instantly, granted you wouldn't notice it on a small table, but it needs to do a table scan if an index isn't in place on these columns.
So in execution time you wouldn't notice it, but if you look at the CPU used by the query, and the reads, you'll notice that with the index the query instantly starts performing better.
The advantage would happen when data must be retrieved and the most restrictive WHERE clause involves the indexed column.
Note that on CUD statements (create/update/delete), indexes add an overhead (which might be compensated if the CUD involves some data retrieval, as explained above).
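A small sketch of the situation the question describes (the table and column names are made up for illustration):
-- Index the integer columns that appear in WHERE clauses.
CREATE INDEX ix_orders_customerid_statusid ON Orders (CustomerId, StatusId);
-- Even on a small table this query can now seek on the index instead of scanning;
-- the difference shows up in reads/CPU long before it shows up in wall-clock time.
SELECT * FROM Orders WHERE CustomerId = 42 AND StatusId = 3;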
Suppose I have a database table with two fields, "foo" and "bar". Neither of them are unique, but each of them are indexed. However, rather than being indexed together, they each have a separate index.
Now suppose I perform a query such as SELECT * FROM sometable WHERE foo='hello' AND bar='world'; My table has a huge number of rows for which foo is 'hello' and a small number of rows for which bar is 'world'.
So the most efficient thing for the database server to do under the hood is use the bar index to find all fields where bar is 'world', then return only those rows for which foo is 'hello'. This is O(n) where n is the number of rows where bar is 'world'.
However, I imagine it's possible that the process would happen in reverse, where the foo index was used and the results searched. This would be O(m) where m is the number of rows where foo is 'hello'.
So is Oracle smart enough to search efficiently here? What about other databases? Or is there some way I can tell it in my query to search in the proper order? Perhaps by putting bar='world' first in the WHERE clause?
Oracle will almost certainly use the most selective index to drive the query, and you can check that with the explain plan.
Furthermore, Oracle can combine the use of both indexes in a couple of ways -- it can convert the btree indexes to bitmaps and perform a bitmap AND operation on them, or it can perform a hash join on the rowids returned by the two indexes.
One important consideration here might be any correlation between the values being queried. If foo='hello' accounts for 80% of values in the table and bar='world' accounts for 10%, then Oracle is going to estimate that the query will return 0.8*0.1 = 8% of the table rows. However, this may not be correct - the query may actually return 10% of the rows or even 0% of the rows, depending on how correlated the values are. Now, depending on the distribution of those rows throughout the table, it may not be efficient to use an index to find them. You may still need to access (say) 70% of the table blocks to retrieve the required rows (google for "clustering factor"), in which case Oracle is going to perform a full table scan if it gets the estimation correct.
In 11g you can collect multicolumn statistics to help with this situation I believe. In 9i and 10g you can use dynamic sampling to get a very good estimation of the number of rows to be retrieved.
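For the 11g multicolumn (extended) statistics mentioned above, a sketch would look roughly like this (the owner SCOTT is a placeholder; DBMS_STATS.CREATE_EXTENDED_STATS is the 11g API, but check your version's documentation for the exact behaviour):
-- Create a column group on (foo, bar) so the optimizer learns their correlation ...
SELECT dbms_stats.create_extended_stats(ownname   => 'SCOTT',
                                        tabname   => 'SOMETABLE',
                                        extension => '(FOO,BAR)')
FROM dual;
-- ... then gather statistics so the column group is populated.
EXEC dbms_stats.gather_table_stats(ownname => 'SCOTT', tabname => 'SOMETABLE');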
To get the execution plan do this:
explain plan for
SELECT *
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Contrast that with:
explain plan for
SELECT /*+ dynamic_sampling(4) */
*
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Eli,
In a comment you wrote:
Unfortunately, I have a table with lots of columns each with their own index. Users can query any combination of fields, so I can't efficiently create indexes on each field combination. But if I did only have two fields needing indexes, I'd completely agree with your suggestion to use two indexes. – Eli Courtwright (Sep 29 at 15:51)
This is actually rather crucial information. Sometimes programmers outsmart themselves when asking questions. They try to distill the question down to the seminal points but quite often over simplify and miss getting the best answer.
This scenario is precisely why bitmap indexes were invented -- to handle the times when unknown groups of columns would be used in a where clause.
Just in case someone says that BMIs are for low-cardinality columns only and may not apply to your case: "low" is probably not as small as you think. The only real issue is concurrency of DML against the table - it must be single-threaded or rare for this to work.
Yes, you can give "hints" with the query to Oracle. These hints are disguised as comments ("/* HINT */") to the database and are mainly vendor specific. So one hint for one database will not work on an other database.
I would use index hints here, the first hint for the small table. See here.
On the other hand, if you often search over these two fields, why not create an index on these two? I do not have the right syntax, but it would be something like
CREATE INDEX IX_BAR_AND_FOO on sometable(bar,foo);
This way data retrieval should be pretty fast. And in case the concatenation is unique, then you simply create a unique index, which should be lightning fast.
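Sketching that last point with the same example name as above (the syntax may vary slightly by database):
-- If the (bar, foo) combination is known to be unique, declare it so:
CREATE UNIQUE INDEX UX_BAR_AND_FOO ON sometable (bar, foo);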
First off, I'll assume that you are talking about nice, normal, standard b*-tree indexes. The answer for bitmap indexes is radically different. And there are lots of options for various types of indexes in Oracle that may or may not change the answer.
At a minimum, if the optimizer is able to determine the selectivity of a particular condition, it will use the more selective index (i.e. the index on bar). But if you have skewed data (there are N values in the column bar but the selectivity of any particular value is substantially more or less than 1/N of the data), you would need to have a histogram on the column in order to tell the optimizer which values are more or less likely. And if you are using bind variables (as all good OLTP developers should), depending on the Oracle version, you may have issues with bind variable peeking.
Potentially, Oracle could even do an on the fly conversion of the two b*-tree indexes to bitmaps and combine the bitmaps in order to use both indexes to find the rows it needs to retrieve. But this is a rather unusual query plan, particularly if there are only two columns where one column is highly selective.
So is Oracle smart enough to search efficiently here?
The simple answer is "probably". There are lots'o' very bright people at each of the database vendors working on optimizing the query optimizer, so it's probably doing things that you haven't even thought of. And if you update the statistics, it'll probably do even more.
I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.
The best approach would be to add foo to bar's index, or add bar to foo's index (or both). If foo's index also contains an index on bar, that additional indexing level will not affect the utility of the foo index in any current uses of that index, nor will it appreciably affect the performance of maintaining that index, but it will give the database additional information to work with in optimizing queries such as in the example.
It's better than that.
Index Seeks are always quicker than full table scans. So behind the scenes Oracle (and SQL server for that matter) will first locate the range of rows on both indices. It will then look at which range is shorter (seeing that it's an inner join), and it will iterate the shorter range to find the matches with the larger of the two.
You can provide hints as to which index to use. I'm not familiar with Oracle, but in MySQL you can use USE INDEX, IGNORE INDEX or FORCE INDEX (see here for more details). For best performance though you should use a combined index.
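For example, in MySQL the hint syntax looks like this (the index name idx_bar is a placeholder for whatever the index on bar is called):
-- Suggest the index on bar for this query.
SELECT * FROM sometable USE INDEX (idx_bar)
WHERE foo = 'hello' AND bar = 'world';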
I have a sproc that puts 750K records into a temp table through a query as one of its first actions. If I create indexes on the temp table before filling it, the procedure takes about twice as long to run compared to when I index after filling the table. (The index is on a single integer column; the table being indexed has just two columns, each a single integer.)
This seems a little off to me, but then I don't have the firmest understanding of what goes on under the hood. Does anyone have an answer for this?
If you create a clustered index, it affects the way the data is physically ordered on the disk. It's better to add the index after the fact and let the database engine reorder the rows when it knows how the data is distributed.
For example, let's say you needed to build a brick wall with numbered bricks so that those with the highest number are at the bottom of the wall. It would be a difficult task if you were just handed the bricks in random order, one at a time - you wouldn't know which bricks were going to turn out to be the highest numbered, and you'd have to tear the wall down and rebuild it over and over. It would be a lot easier to handle that task if you had all the bricks lined up in front of you, and could organize your work.
That's how it is for the database engine - if you let it know about the whole job, it can be much more efficient than if you just feed it a row at a time.
It's because the database server has to do calculations each and every time you insert a new row. Basically, you end up reindexing the table each time. It doesn't seem like a very expensive operation, and it's not, but when you do that many of them together, you start to see the impact. That's why you usually want to index after you've populated your rows, since it will just be a one-time cost.
Think of it this way.
Given
unorderedList = {5, 1, 3}
orderedList = {1, 3, 5}
add 2 to both lists.
unorderedList = {5, 1, 3, 2}
orderedList = {1, 2, 3, 5}
What list do you think is easier to add to?
Btw ordering your input before load will give you a boost.
You should NEVER EVER create an index on an empty table if you are going to massively load it right afterwards.
Indexes have to be maintained as the data on the table changes, so imagine as if for every insert on the table the index was being recalculated (which is an expensive operation).
Load the table first and create the index after finishing with the load.
That's where the performance difference is going.
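A minimal T-SQL sketch of that ordering (the table, column and source names are placeholders):
-- 1. Create the temp table without indexes and load it in one go.
CREATE TABLE #work (id INT, val INT);
INSERT INTO #work (id, val)
SELECT id, val FROM dbo.SourceTable;          -- e.g. the 750K-row query
-- 2. Build the index once, after the data is in place.
CREATE INDEX ix_work_id ON #work (id);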
After performing large data manipulation operations, you frequently have to update the statistics on the underlying indexes. You can do that by using the UPDATE STATISTICS [table] statement.
The other option is to drop and recreate the index which, if you are doing large data insertions, will likely perform the inserts much faster. You can even incorporate that into your stored procedure.
This is because, if the data you insert is not in the order of the index, SQL Server will have to split pages to make room for the additional rows, to keep them together logically.
This is due to the fact that when SQL Server indexes a table that already contains data, it is able to produce exact statistics of the values in the indexed column. At some point SQL Server will recalculate the statistics, but when you perform massive inserts the distribution of values may change after the statistics were last calculated.
The fact that the statistics are out of date can be discovered in Query Analyzer, when you see that for a certain table scan the expected number of rows differs too much from the actual number of rows processed.
You should use UPDATE STATISTICS to recalculate distribution of values after you insert all the data. After that no performance difference should be observed.
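Continuing the placeholder example from the earlier sketch, the two options mentioned above would look like this:
-- Option 1: refresh statistics after the bulk insert.
UPDATE STATISTICS #work;
-- Option 2: drop and recreate the index around a large load instead.
DROP INDEX ix_work_id ON #work;
-- ... large insert here ...
CREATE INDEX ix_work_id ON #work (id);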
If you have an index on a table, as you add data to the table SQL Server will have to re-order the table to make room in the appropriate place for the new records. If you're adding a lot of data, it will have to reorder it over and over again. By creating an index only after the data is loaded, the re-order only needs to happen once.
Of course, if you are importing the records in index order it shouldn't matter so much.
In addition to the index overhead, running each query as a transaction is a bad idea for the same reason. If you run chunks of inserts (say 100) within 1 explicit transaction, you should also see a performance increase.
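A hedged sketch of the batching idea (values and the #work table are placeholders carried over from the earlier example):
-- Group many small inserts into one explicit transaction instead of letting each
-- INSERT commit on its own.
BEGIN TRANSACTION;
    INSERT INTO #work (id, val) VALUES (1, 10);
    INSERT INTO #work (id, val) VALUES (2, 20);
    -- ... roughly 100 rows per batch ...
COMMIT TRANSACTION;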