How should I go about selecting rows with thousands of columns? - database

There are a couple of million records that look like this:
idA(text), idB(int), prop1(boolean), prop2(boolean), ..., prop6000(boolean) (more prop's can be added later on)
And the primary task will be finding records with some combination of prop values.
eg: SELECT idA, idB WHERE prop30=true AND prop1987=false AND ... AND prop5754=true
If the SELECT speed were the main concern, how should I go about this problem?
--
I was thinking about defining the props as a list of int, adding values only where the value is true, and using CONTAINS when SELECTing.
ie: INSERT INTO tbl VALUES('id1', 1, [10, 24, 2977]) -> if prop10, prop24 and prop2977 were true
But then it is said that the secondary index does not scale very well and should not be used heavily.
Does it hold true even for lists? (I'm thinking maybe it's different for lists as they are sorted?)

One of the key things in Cassandra query performance is that you must - MUST - hit a partition before applying an index filter. In addition, when you apply multiple index filters, it only hits one index and filters the rest in memory (i.e. only one index is used). In your query, you're not hitting a partition, so it will be a cluster-wide query that is most likely to time out.
In Cassandra 3.0, the rules will be somewhat relaxed with the introduction of global indexes. Even then, your query won't really work that well.
If all your properties are booleans, you can consider storing them as a bitfield. One 64-bit integer can then hold 64 flags, which might be more efficient. On the querying side, you will still need a partition key by which to hit a partition. With the flags approach, you can simply read in the integers and filter on the client side. All rows in the partition will be loaded, but unless you've got hundreds of thousands of rows in the same partition, it shouldn't be a problem.
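To make the bitfield idea concrete, here is a minimal client-side sketch in Python. The helper names (pack_flags, matches) and the sample rows are made up for illustration, and it assumes the flags are stored as a list of 64-bit words, one bit per prop. It only shows the filtering step; fetching the partition is left out.

```python
WORD_BITS = 64  # assumption: each stored integer carries 64 flags


def pack_flags(true_props, total_props):
    """Pack 1-based prop numbers that are true into a list of 64-bit words."""
    n_words = (total_props + WORD_BITS - 1) // WORD_BITS
    words = [0] * n_words
    for prop in true_props:
        idx = prop - 1
        words[idx // WORD_BITS] |= 1 << (idx % WORD_BITS)
    return words


def matches(words, must_be_true, must_be_false):
    """Client-side filter: check required true and false props against the words."""
    def is_set(prop):
        idx = prop - 1
        return ((words[idx // WORD_BITS] >> (idx % WORD_BITS)) & 1) == 1

    return (all(is_set(p) for p in must_be_true)
            and not any(is_set(p) for p in must_be_false))


# Hypothetical rows already read from one partition: key -> packed flags.
partition_rows = {
    ("id1", 1): pack_flags({10, 24, 2977}, 6000),
    ("id2", 2): pack_flags({10, 30}, 6000),
}

# Equivalent of: WHERE prop10=true AND prop24=true AND prop30=false
hits = [key for key, words in partition_rows.items()
        if matches(words, must_be_true=[10, 24], must_be_false=[30])]
print(hits)
```

Python integers are unbounded, so the top bit of each word is not a problem here. If the words are stored in a signed 64-bit column, the top bit needs some care when converting.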
If you truly don't have a partition key, and all you can look up is props (as in your example above), then you'll need to carry out indexing manually. Built-in indexes won't really work that well, and you can choose to create index tables yourself (which may be quite difficult) or use an indexing service like Lucene, which will allow you to do the search quickly.
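As a rough sketch of what a hand-rolled index table does, the Python below keeps one set of row keys per prop (an inverted index) and intersects them. The names and the in-memory dict are stand-ins of my own; in Cassandra each entry would more likely be its own partition keyed by the prop number.

```python
from collections import defaultdict

# prop number -> set of (idA, idB) keys for which that prop is true
index = defaultdict(set)


def add_row(key, true_props):
    """Maintain the index table on every insert."""
    for prop in true_props:
        index[prop].add(key)


def find(must_be_true, must_be_false=()):
    """Intersect the sets for required-true props, then drop required-false ones."""
    if not must_be_true:
        raise ValueError("need at least one prop=true condition to start from")
    # Start from the smallest set so the intersection stays cheap.
    candidate_sets = sorted((index.get(p, set()) for p in must_be_true), key=len)
    result = set(candidate_sets[0])
    for keys in candidate_sets[1:]:
        result &= keys
    for prop in must_be_false:
        result -= index.get(prop, set())
    return result


add_row(("id1", 1), {10, 24, 2977})
add_row(("id2", 2), {10, 30})
add_row(("id3", 3), {10, 24, 30})

found = find(must_be_true=[10, 24], must_be_false=[30])
print(found)
```

Note that a query made up only of prop=false conditions has nothing to start from in this layout, which is one of the reasons maintaining such tables yourself gets difficult.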

Related

Is there a benefit in eliminating the unique-ness of a redundant unique index on SQL Server?

Whilst analyzing the database structure of a legacy application, I discovered in several tables there are 2 unique indices which both have the exact same columns, except in a different order.
Having 2 unique indices covering the same columns is clearly redundant, so my first instinct was to completely drop one of them. But then I thought some of the queries emitted by the application might be making use of the index I would delete, so I thought to convert it into a regular index instead.
To the best of my knowledge, whenever a row is inserted/updated in a table having a unique index, SQL Server spends some milliseconds validating that each unique index/constraint still holds true. So by converting one of these indices into a non-unique one, I hope processing of this table might be sped up a bit. Please confirm or dispel.
On the other hand, I don't understand what the benefit is of having two unique indices covering the same columns on a table. Any ideas what this could be done for? Could something get lost if I convert one of them into a regular one?
Check the index usage stats (sys.dm_db_index_usage_stats) to see if they are both being used.
If not, delete the unused index.
Generally speaking, indexes are used for filtering, then ordering. It is possible that you may have queries that are needing to filter on the leading columns of both indexes. If that is the case, you'll reduce how deep the query can be optimized by getting rid of one. That may not be a big deal as it may still be able to satisfactorily use the remaining index.
For example, if I have 2 indexes with four columns:
1: Columns A, B, C, D
2: Columns A, B, D, C
Any query that currently prefers #2 could still gain benefits by using #1 if #2 is not available. It would just limit the selectivity to column B rather than all the way down to column D.
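A small Python model of that point, assuming equality-only filters and a plain b-tree that can only seek on a leading run of its columns (real optimizers have more tricks than this, so treat it as a simplification):

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns an equality-only filter can seek on."""
    prefix = []
    for col in index_columns:
        if col not in equality_columns:
            break  # the seek stops at the first column the query does not filter on
        prefix.append(col)
    return prefix


filter_cols = {"A", "B", "D"}  # a query that currently prefers index #2

with_index_1 = usable_prefix(["A", "B", "C", "D"], filter_cols)
with_index_2 = usable_prefix(["A", "B", "D", "C"], filter_cols)
print(with_index_1)  # seeks on A, B only
print(with_index_2)  # seeks on A, B, D
```

With index #2 gone, the query still seeks on A and B through index #1 and has to filter D from the rows that remain.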
If you're not sure, try disabling (not deleting) the less used index and see if you notice any problems. If something slows down, it is simple enough to enable it again.
As always, try it in a non-production environment first.
UPDATE
Yes you can safely remove the uniqueness of one of the indexes. It only needs to be enforced by one of them. The only concern would be if the vendor decided to do the same and chooses the other index.
However, since this is from a vendor, I'd recommend you contact them if there are performance concerns. If you're not running into a performance issue worth a support request to them, then just leave it alone.

SQL Server - what kind of index should I create?

I need to make queries such as
SELECT
Url, COUNT(*) AS requests, AVG(TS) AS avg_timeSpent
FROM
myTable
WHERE
Url LIKE '%/myController/%'
GROUP BY
Url
run as fast as possible.
The columns selected and grouped are almost always the same, the only difference being an occasional extra column (tenantId) in the SELECT and GROUP BY.
What kind of index should I create to help me run this scenario?
Edit 1:
If I change my base query to '/myController/%' (note there's no % at the beginning), would it be better?
This is a query that cannot be sped up with an index. The DBMS cannot know beforehand how many records will match the condition. It may be 100% or 0.001%. There is no clue for the DBMS to guess this. And access via an index only makes sense when a small percentage of rows gets selected.
Moreover, how can such an index be structured and useful? Think of a telephone book and you want to find all names that contain 'a' or 'rs' or 'ems' or whatever. How would you order the names in the book to find all these and all other thinkable letter combinations quickly? It simply cannot be done.
So the DBMS will read the whole table record for record, no matter whether you provide an index or not.
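The telephone book argument can be sketched in Python with a sorted list standing in for the index (the sample URLs are invented). A pattern anchored at the start, like '/myController/%', can binary-search to the first match and stop at the first non-match. A pattern with a leading %, like '%/myController/%', has to look at every entry.

```python
import bisect


def prefix_search(sorted_values, prefix):
    """LIKE 'prefix%': binary search to the start, then read a contiguous run."""
    start = bisect.bisect_left(sorted_values, prefix)
    matches = []
    for i in range(start, len(sorted_values)):
        if not sorted_values[i].startswith(prefix):
            break  # sorted order guarantees no later match
        matches.append(sorted_values[i])
    return matches


def substring_search(values, needle):
    """LIKE '%needle%': sort order does not help, so every value is inspected."""
    return [v for v in values if needle in v]


urls = sorted([
    "/home/index",
    "/myController/a",
    "/myController/b",
    "/x/myController/c",
])

anchored = prefix_search(urls, "/myController/")
unanchored = substring_search(urls, "/myController/")
print(anchored)
print(unanchored)
```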
There may be one exception: With an index on URL and TS, you'd have both columns in the index. So the DBMS might decide to read the whole index rather than the whole table then. This may make sense for instance when the table has hundreds of columns or when the table is very fragmented or whatever. I don't know. A table is usually much easier to read sequentially than an index. You can still just try, of course. It doesn't really hurt to create an index. Either the DBMS uses it or not for a query.
Columnstore indexes can be quite fast at such tasks (aggregates on global scans). But even they will have trouble handling a LIKE '%/myController/%' predicate. I recommend you parse the URL once into an additional computed field that projects the extracted controller of your URL. But the truth is that looking at global time spent on a response URL reveals very little information. It will contain data since the beginning of time, long since obsolete by newer deployments, and not be able to capture recent trends. A filter based on time, say per hour or per day, now that is a very useful analysis. And such a filter can be excellently served by a columnstore, because of natural time order and segment elimination.
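To illustrate the "parse the URL once" suggestion, here is a Python sketch of the extraction and the aggregate it enables. It assumes the controller is the first path segment, which may not hold for your URLs, and the function names and sample rows are invented. In the database the same logic would live in a computed column rather than in client code.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def extract_controller(url):
    """Return the first path segment, assumed here to be the controller name."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    return segments[0] if segments else ""


def summarize(rows):
    """Group by extracted controller: (request count, average time spent)."""
    totals = defaultdict(lambda: [0, 0])  # controller -> [count, sum of TS]
    for url, ts in rows:
        entry = totals[extract_controller(url)]
        entry[0] += 1
        entry[1] += ts
    return {ctrl: (n, total / n) for ctrl, (n, total) in totals.items()}


rows = [
    ("/myController/list", 120),
    ("/myController/edit?id=3", 80),
    ("/other/view", 50),
]
summary = summarize(rows)
print(summary["myController"])
```

Once the controller is its own column, the filter becomes an equality test instead of a LIKE with a leading wildcard.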
Based on your posted query you should have an index on the Url column. In general, columns which are involved in WHERE, HAVING, ORDER BY and JOIN ON conditions should be indexed.
You should get the generated query plan for the said query and see where it's taking more time. Again, based on the datatype of the Url column, you may consider having a FULLTEXT index on that column.

Database indexes - does the table size matter?

What I mean is: Does a table with 20 columns benefit more from indexing a certain field (one that's used in search-ish queries) than a table that has just 4 columns?
Also: What is the harm in adding index to fields that I don't search with much, but might later in the future? Is there a negative to adding indexes? Is it just the size it takes up on disk, or can it make things run slower to add unnecessary indexes?
extracted from a comment
I'm using Postgres (latest version) and I have one table that I'll be doing a lot of LIKE type queries, etc but the values will undoubtedly change often since my clients have access to CRUD. Should I can the idea of indexes? Are they just a headache?
Does a table with 20 columns benefit more from indexing a certain field (one that's used in search-ish queries) than a table that has just 4 columns?
No, number of columns in a table has no bearing on benefits from having an index.
An index is solely on the values in the column(s) specified; it's the frequency of the values that will impact how much benefit your queries will see. For example, a column containing a boolean value is a poor choice for indexing, because it's a 50/50 chance the value will be one or the other value. At a 50/50 split over all the rows, the index doesn't narrow the search for a particular row.
What is the harm in adding index to fields that I don't search with much, but might later in the future?
Indexes only speed up data retrieval when they can be used, but they negatively impact the speed of INSERT/UPDATE/DELETE statements. Indexes also require maintenance to keep their value.
If you are doing LIKE queries you may find that indexes are not much help anyway. While an index might improve this query ...
select * from t23
where whatever like 'SOMETHING%'
/
... it is unlikely that an index will help with either of these queries ...
select * from t23
where whatever like '%SOMETHING%'
/
select * from t23
where whatever like '%SOMETHING'
/
If you have free text fields and your users need fuzzy matching then you should look at Postgres's full text functionality. This employs the @@ match operator (on tsvector and tsquery values) rather than LIKE, and it requires a special index type such as GIN.
There is a gotcha, which is that full text indexes are more complicated than normal ones, and the related design decisions are not simple. Also some implementations require additional maintenance activities.

What are the methods for identifying unnecessary columns within a covering index?

What methods are there for identifying superfluous columns in covering indices: columns which are never searched against, and therefore may be extracted into Includes, or even removed completely without affecting the applicability of the index?
To clarify things
The idea of a covering index is that it also includes columns which may not be searched by (used in the WHERE clause and such) but may be selected (part of the SELECT columns list).
There doesn't seem to be any easy way to assert the existence of unused columns in a covering index. I can only think of the painstaking process below:
For a representative period of time, record all queries being run on the server (or on the table desired)
Filter out (through regular expression) queries not involving the underlying table
For remaining queries, obtain the query plan; discard queries not involving the index in question
For the remaining queries, or rather for each "template" of query (many queries are same but for the search criteria values), make the list of the columns from the index that are either in select or where clause (or in JOIN...)
The columns from the index not found in that list are positively good to go.
Now, there may be a few more columns to remove, because the process above doesn't check in which context the covering index is used. It is possible that it is used for resolving the WHERE clause while the underlying table is still accessed as well, for example to get to columns not in the covering index.
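The last two steps of that process can be sketched in Python. This assumes you have already reduced the trace to query templates whose plans use the index, and that for each template you know which columns appear in the search clauses and which are only selected. The template structure and column names below are invented for the example.

```python
def classify_index_columns(index_columns, templates):
    """Label each index column by how the recorded query templates use it."""
    searched, selected = set(), set()
    for template in templates:
        searched |= set(template["where"])   # WHERE / JOIN columns
        selected |= set(template["select"])  # columns only read from the index
    report = {}
    for col in index_columns:
        if col in searched:
            report[col] = "keep as key column"
        elif col in selected:
            report[col] = "candidate for INCLUDE"
        else:
            report[col] = "candidate for removal"
    return report


# Hypothetical templates that were confirmed to use the index in question.
templates = [
    {"where": ["customer_id"], "select": ["order_date", "total"]},
    {"where": ["customer_id", "order_date"], "select": ["total"]},
]
report = classify_index_columns(
    ["customer_id", "order_date", "total", "status"], templates)
for column, verdict in report.items():
    print(column, "->", verdict)
```

The output is only as good as the recorded period is representative, which is the weak point of the whole approach.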
The above clinical approach is rather unattractive. An analytical approach may be preferable:
Find all queries "templates" that may be used in all the applications using the server. For each of these patterns, find the ones which may be using the covering index. These are (again a few holes...) queries that:
include a reference to the underlying table
do not cite in any way a column from the underlying table that is not a column in the index
do not use a search criterion from the underlying table that is more selective than the columns of the index (in their very order...)
Or... without even going to the applications: think of all the use cases, and whether the queries that would serve these cases would benefit or not from all columns in the index. Doing so would imply that you have a relatively good idea of the selectivity of the index, regarding its first few columns.
If you do audits of your use cases and data points, obviously anything that isn't used or caught in the audit is a candidate for deletion. If the database lacks such a thorough audit, you can save a time-window's worth of queries that hit the database by running a trace and saving it. You can analyze the trace and see what type of queries are hitting the database and from there intuit which columns can be dropped.
Trace analysis is typically used to find candidates for missing indices, but I'm guessing that it could be also used to analyze usage trends.

Can Multiple Indexes Work Together?

Suppose I have a database table with two fields, "foo" and "bar". Neither of them is unique, but each of them is indexed. However, rather than being indexed together, they each have a separate index.
Now suppose I perform a query such as SELECT * FROM sometable WHERE foo='hello' AND bar='world'; My table has a huge number of rows for which foo is 'hello' and a small number of rows for which bar is 'world'.
So the most efficient thing for the database server to do under the hood is use the bar index to find all rows where bar is 'world', then return only those rows for which foo is 'hello'. This is O(n) where n is the number of rows where bar is 'world'.
However, I imagine it's possible that the process would happen in reverse, where the foo index was used and the results searched. This would be O(m) where m is the number of rows where foo is 'hello'.
So is Oracle smart enough to search efficiently here? What about other databases? Or is there some way I can tell it in my query to search in the proper order? Perhaps by putting bar='world' first in the WHERE clause?
Oracle will almost certainly use the most selective index to drive the query, and you can check that with the explain plan.
Furthermore, Oracle can combine the use of both indexes in a couple of ways -- it can convert btree indexes to bitmaps and perform a bitmap AND operation on them, or it can perform a hash join on the rowids returned by the two indexes.
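As a toy illustration of the bitmap AND idea (not Oracle's actual implementation), the Python below uses one integer per predicate, with bit i set when row i matches. The column data is invented.

```python
def build_bitmap(values, wanted):
    """One bit per row: bit i is set when values[i] == wanted."""
    bitmap = 0
    for rownum, value in enumerate(values):
        if value == wanted:
            bitmap |= 1 << rownum
    return bitmap


def rows_from_bitmap(bitmap):
    """Turn the set bits back into row numbers."""
    rows = []
    rownum = 0
    while bitmap:
        if bitmap & 1:
            rows.append(rownum)
        bitmap >>= 1
        rownum += 1
    return rows


foo = ["hello", "hello", "bye", "hello", "hello"]
bar = ["x", "world", "world", "y", "world"]

# WHERE foo='hello' AND bar='world' becomes a single bitwise AND.
both = build_bitmap(foo, "hello") & build_bitmap(bar, "world")
matching_rows = rows_from_bitmap(both)
print(matching_rows)
```

The appeal is that any number of single-column predicates can be combined this way before touching the table at all.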
One important consideration here might be any correlation between the values being queried. If foo='hello' accounts for 80% of values in the table and bar='world' accounts for 10%, then Oracle is going to estimate that the query will return 0.8 * 0.1 = 8% of the table rows. However this may not be correct - the query may actually return 10% of the rows or even 0% of the rows depending on how correlated the values are. Now, depending on the distribution of those rows throughout the table it may not be efficient to use an index to find them. You may still need to access (say) 70% of the table blocks to retrieve the required rows (google for "clustering factor"), in which case Oracle is going to perform a full table scan if it gets the estimation correct.
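The arithmetic is easy to check with made-up data in Python. Both tables below have 100 rows with 80% 'hello' and 10% 'world', so the independence estimate is the same for both, but the true answer differs depending on how the values line up.

```python
def independent_estimate(sel_foo, sel_bar):
    """What an optimizer assumes when it treats the predicates as independent."""
    return sel_foo * sel_bar


def actual_selectivity(rows, foo_value, bar_value):
    """Fraction of rows that really satisfy both predicates."""
    hits = sum(1 for foo, bar in rows if foo == foo_value and bar == bar_value)
    return hits / len(rows)


# Every 'world' row is also a 'hello' row.
correlated = [("hello", "world")] * 10 + [("hello", "x")] * 70 + [("bye", "x")] * 20
# No 'world' row is a 'hello' row.
anti_correlated = [("hello", "x")] * 80 + [("bye", "world")] * 10 + [("bye", "x")] * 10

estimate = independent_estimate(0.8, 0.1)
actual_correlated = actual_selectivity(correlated, "hello", "world")
actual_anti = actual_selectivity(anti_correlated, "hello", "world")
print(f"estimate: {estimate:.0%}")
print(f"correlated: {actual_correlated:.0%}")
print(f"anti-correlated: {actual_anti:.0%}")
```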
In 11g you can collect multicolumn statistics to help with this situation I believe. In 9i and 10g you can use dynamic sampling to get a very good estimation of the number of rows to be retrieved.
To get the execution plan do this:
explain plan for
SELECT *
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Contrast that with:
explain plan for
SELECT /*+ dynamic_sampling(4) */
*
FROM sometable
WHERE foo='hello' AND bar='world'
/
select * from table(dbms_xplan.display)
/
Eli,
In a comment you wrote:
Unfortunately, I have a table with lots of columns each with their own index. Users can query any combination of fields, so I can't efficiently create indexes on each field combination. But if I did only have two fields needing indexes, I'd completely agree with your suggestion to use two indexes. – Eli Courtwright (Sep 29 at 15:51)
This is actually rather crucial information. Sometimes programmers outsmart themselves when asking questions. They try to distill the question down to the seminal points but quite often over simplify and miss getting the best answer.
This scenario is precisely why bitmap indexes were invented -- to handle the times when unknown groups of columns would be used in a where clause.
In case someone says that bitmap indexes (BMIs) are for low-cardinality columns only and may not apply to your case: "low" is probably not as small as you think. The only real issue is concurrency of DML against the table. It must be single-threaded or rare for this to work.
Yes, you can give "hints" with the query to Oracle. These hints are disguised as comments ("/*+ HINT */") to the database and are mainly vendor specific. So a hint for one database will not work on another database.
I would use index hints here, the first hint for the small table. See here.
On the other hand, if you often search over these two fields, why not create an index on these two? I do not have the right syntax, but it would be something like
CREATE INDEX IX_BAR_AND_FOO on sometable(bar,foo);
This way data retrieval should be pretty fast. And in case the concatenation is unique then you simply create a unique index which should be lightning fast.
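If you want to try the combined index without an Oracle instance at hand, here is a sketch using Python's built-in sqlite3 module as a stand-in. SQLite's planner is not Oracle's, so this only shows the general idea of checking the plan after creating the index. The table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (foo TEXT, bar TEXT)")
conn.executemany(
    "INSERT INTO sometable VALUES (?, ?)",
    [("hello", "world"), ("hello", "x"), ("bye", "world")],
)
# Combined index with the more selective column first, as in the answer.
conn.execute("CREATE INDEX ix_bar_and_foo ON sometable (bar, foo)")

query = "SELECT foo, bar FROM sometable WHERE foo = 'hello' AND bar = 'world'"

# The last field of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
matching = conn.execute(query).fetchall()

print(plan_text)
print(matching)
conn.close()
```

The plan text should mention ix_bar_and_foo; the exact wording varies between SQLite versions.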
First off, I'll assume that you are talking about nice, normal, standard b*-tree indexes. The answer for bitmap indexes is radically different. And there are lots of options for various types of indexes in Oracle that may or may not change the answer.
At a minimum, if the optimizer is able to determine the selectivity of a particular condition, it will use the more selective index (i.e. the index on bar). But if you have skewed data (there are N values in the column bar but the selectivity of any particular value is substantially more or less than 1/N of the data), you would need to have a histogram on the column in order to tell the optimizer which values are more or less likely. And if you are using bind variables (as all good OLTP developers should), depending on the Oracle version, you may have issues with bind variable peeking.
Potentially, Oracle could even do an on the fly conversion of the two b*-tree indexes to bitmaps and combine the bitmaps in order to use both indexes to find the rows it needs to retrieve. But this is a rather unusual query plan, particularly if there are only two columns where one column is highly selective.
So is Oracle smart enough to search efficiently here?
The simple answer is "probably". There are lots'o' very bright people at each of the database vendors working on optimizing the query optimizer, so it's probably doing things that you haven't even thought of. And if you update the statistics, it'll probably do even more.
I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.
The best approach would be to add foo to bar's index, or add bar to foo's index (or both). If foo's index also contains an index on bar, that additional indexing level will not affect the utility of the foo index in any current uses of that index, nor will it appreciably affect the performance of maintaining that index, but it will give the database additional information to work with in optimizing queries such as in the example.
It's better than that.
Index seeks are usually quicker than full table scans when the predicate is selective. So behind the scenes Oracle (and SQL Server for that matter) will first locate the range of rows on both indices. It will then look at which range is shorter (seeing that it's an inner join), and it will iterate the shorter range to find the matches with the larger of the two.
You can provide hints as to which index to use. I'm not familiar with Oracle, but in MySQL you can use USE INDEX, IGNORE INDEX or FORCE INDEX (see here for more details). For best performance though you should use a combined index.
