Related
I have a table user with multiple columns, every user has a unique userid.
Because it is unique, I dont have to specify a clustering key unless I want to use the column in queries. Is this bad, because every partition consists of a single row? If it is bad for whatever reason, what is the best practise to do in this case?
Thank you for your help!
Edit: If I have a query that needs to return all usernames, how can I do that with a good performance? Doing it from this table seems not very efficient for me, should I make another table where I simply duplicate all usernames in a Collection? Then they are all in one place and the read doesn't have to jump over multiple nodes.
I just answered the similar question. Short story - it really depends on the access patterns, and table settings. You may need to tune the table parameters to get best performance, but the settings may depend on the amount of data, and other requirements.
There are always two (main) considerations when defining your primary keys in Cassandra:
Data distribution
Query pattern match
From a data distribution standpoint, you can't get much better than using a unique key as the partition key. The more of them, the more evenly they should hash-out and thus be evenly distributed.
However, a key which distributes well but doesn't fit the desired query pattern, is pretty useless.
tl;dr;
If that unique key is all you'll ever query the table by, then it makes a great choice for a partition key.
I am trying to compile a list of non-system-specific database indexes. I've looked at Oracle, DB2, MySQL, Postgres and Sybase, and almost every resource has a different list. So far I have seen:
clustered, multi-dimensional clustered, unclustered, unique,
non-unique, b-tree, hash, GiST, GIN, full-text, bitmap,
partitioned, function-based.
It seems that different systems have different names for the same types of indexes.
Are there standard index types across all systems?
If for whatever reason somebody else comes across this and is wondering the same thing, I ended up finding a good list at:
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems#Indexes
Many of the these concepts are orthogonal. A clustered index means that the rows are arranged in the table in the same order as they appear in the index. Independently, that index can be implemented using a B-tree, a B+ tree, a hash, spatially, etc. And then it may partition the table or not. One aspect may constrain but does not necessarily imply another.
You should scour harder :-) - Wiki gives a good description
http://en.wikipedia.org/wiki/Database_index
http://msdn.microsoft.com/en-us/library/ms175049.aspx
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.dbobj.doc%2Fdoc%2Fc0020180.html
these links might give more clear idea...
In docs for various ORMs they always provide a way to create indexes, etc. They always mention to be sure to create the appropriate indexes for efficiency, as if that is inherent knowledge to a non-hand-written-SQLer who needs to use an ORM. My understanding of indexes (outside of PK) is basically: If you plan to do LIKE queries (ie, search) based on the contents of a column, you should use a full text index for that column. What else should I know regarding indexes (mostly pertaining to efficiency)? I feel like there is a world of knowledge at my door step, but there's a huge folded mouse pad jammed up under it, so I can't get through (I don't know why I felt like I needed to say that, but thanks for providing the couch).
Think of an index very roughly like the index in the back of a book. It's a totally separate area from the content of the book, where if you are seeking some specific value, you can go to the index and look it up (indexes are ordered, so finding things there is much quicker than scanning every page of the book).
The index entry has a page number, so you can then quickly go to the page seeking your topic. A database index is very similar; it is an ordered list of the relevant information in your database (the field(s) included in the index), with information for the database to find the records which match.
So... you would create an index when you have information that you need to search on frequently. Normal indexes don't help you for 'partial' seeks like LIKE queries, but any time you need to get a set of results where field X has certain value(s), they keep the DBMS from needing to 'scan' the whole table, looking for matching values.
They also help when you need to sort on a column.
Another thing to keep in mind; If the DBMS allows you to create single indexes that have multiple fields, be sure to investigate the effects of doing so, specific to your DBMS. An index that includes multiple fields is likely only to be fully (or at all) useful if all those fields are being used in a query. Conversely, having multiple indexes for a single table, with one field per index, may not be of much (or any) help for queries that are filtering/sorting by multiple fields.
You mentioned Full Text indexes and PKs (Primary Keys). These are different than regular indexes, though they often serve similar purposes.
First, note that a Primary Key is usually an index (in MSSQL, a 'Clustered Index', in fact), but this does not need to be the case specifically. As an example, an MSSQL PK is a Clustered Index by default; clustered indexes are special in that they are not a separate bit of data stored elsewhere, but the data itself is arranged in the table in order by the Clustered Index. This is why a popular PK is an int value that is auto-generated with sequential, increasing values. So, a Clustered Index sorts the data in the table specifically by the field's value. Compare this to a traditional dictionary; the entries themselves are ordered by the 'key', which is the word being defined.
But in MSSQL (check your DBMS documentation for your information), you can change the Clustered Index to be a different field, if you like. Sometimes this is done on datetime based fields.
Full Text indexes are different kinds of beasts entirely. They use some of the same principles, but what they are doing isn't exactly the same as normal indexes, which I am describing. Also: in some DBMS's, LIKE queries do not use the full text index; special query operators are required.
These indexes are different because their intent is not to find/sort on the whole value of the column (a number, a date, a short bit of char data), but instead to find individual words/phrases within the text field(s) being indexed.
They can also often enable searching for similar words, different tenses, common misspellings and the like, and typically ignore noise words. The different way in which they work is why they also may need different operators to use them. (again, check your local documentation for your DBMS!)
This answer is Oracle-specific, but the main points in the answers apply to most relational database systems
How to choose and optimize oracle indexes?
As a follow up to "What are indexes and how can I use them to optimise queries in my database?" where I am attempting to learn about indexes, what columns are good index candidates? Specifically for an MS SQL database?
After some googling, everything I have read suggests that columns that are generally increasing and unique make a good index (things like MySQL's auto_increment), I understand this, but I am using MS SQL and I am using GUIDs for primary keys, so it seems that indexes would not benefit GUID columns...
Indexes can play an important role in query optimization and searching the results speedily from tables. The most important step is to select which columns are to be indexed. There are two major places where we can consider indexing: columns referenced in the WHERE clause and columns used in JOIN clauses. In short, such columns should be indexed against which you are required to search particular records. Suppose, we have a table named buyers where the SELECT query uses indexes like below:
SELECT
buyer_id /* no need to index */
FROM buyers
WHERE first_name='Tariq' /* consider indexing */
AND last_name='Iqbal' /* consider indexing */
Since "buyer_id" is referenced in the SELECT portion, MySQL will not use it to limit the chosen rows. Hence, there is no great need to index it. The below is another example little different from the above one:
SELECT
buyers.buyer_id, /* no need to index */
country.name /* no need to index */
FROM buyers LEFT JOIN country
ON buyers.country_id=country.country_id /* consider indexing */
WHERE
first_name='Tariq' /* consider indexing */
AND
last_name='Iqbal' /* consider indexing */
According to the above queries first_name, last_name columns can be indexed as they are located in the WHERE clause. Also an additional field, country_id from country table, can be considered for indexing because it is in a JOIN clause. So indexing can be considered on every field in the WHERE clause or a JOIN clause.
The following list also offers a few tips that you should always keep in mind when intend to create indexes into your tables:
Only index those columns that are required in WHERE and ORDER BY clauses. Indexing columns in abundance will result in some disadvantages.
Try to take benefit of "index prefix" or "multi-columns index" feature of MySQL. If you create an index such as INDEX(first_name, last_name), don’t create INDEX(first_name). However, "index prefix" or "multi-columns index" is not recommended in all search cases.
Use the NOT NULL attribute for those columns in which you consider the indexing, so that NULL values will never be stored.
Use the --log-long-format option to log queries that aren’t using indexes. In this way, you can examine this log file and adjust your queries accordingly.
The EXPLAIN statement helps you to reveal that how MySQL will execute a query. It shows how and in what order tables are joined. This can be much useful for determining how to write optimized queries, and whether the columns are needed to be indexed.
Update (23 Feb'15):
Any index (good/bad) increases insert and update time.
Depending on your indexes (number of indexes and type), result is searched. If your search time is gonna increase because of index then that's bad index.
Likely in any book, "Index Page" could have chapter start page, topic page number starts, also sub topic page starts. Some clarification in Index page helps but more detailed index might confuse you or scare you. Indexes are also having memory.
Index selection should be wise. Keep in mind not all columns would require index.
Some folks answered a similar question here: How do you know what a good index is?
Basically, it really depends on how you will be querying your data. You want an index that quickly identifies a small subset of your dataset that is relevant to a query. If you never query by datestamp, you don't need an index on it, even if it's mostly unique. If all you do is get events that happened in a certain date range, you definitely want one. In most cases, an index on gender is pointless -- but if all you do is get stats about all males, and separately, about all females, it might be worth your while to create one. Figure out what your query patterns will be, and access to which parameter narrows the search space the most, and that's your best index.
Also consider the kind of index you make -- B-trees are good for most things and allow range queries, but hash indexes get you straight to the point (but don't allow ranges). Other types of indexes have other pros and cons.
Good luck!
It all depends on what queries you expect to ask about the tables. If you ask for all rows with a certain value for column X, you will have to do a full table scan if an index can't be used.
Indexes will be useful if:
The column or columns have a high degree of uniqueness
You frequently need to look for a certain value or range of values for
the column.
They will not be useful if:
You are selecting a large % (>10-20%) of the rows in the table
The additional space usage is an issue
You want to maximize insert performance. Every index on a table reduces insert and update performance because they must be updated each time the data changes.
Primary key columns are typically great for indexing because they are unique and are often used to lookup rows.
Any column that is going to be regularly used to extract data from the table should be indexed.
This includes:
foreign keys -
select * from tblOrder where status_id=:v_outstanding
descriptive fields -
select * from tblCust where Surname like "O'Brian%"
The columns do not need to be unique. In fact you can get really good performance from a binary index when searching for exceptions.
select * from tblOrder where paidYN='N'
In general (I don't use mssql so can't comment specifically), primary keys make good indexes. They are unique and must have a value specified. (Also, primary keys make such good indexes that they normally have an index created automatically.)
An index is effectively a copy of the column which has been sorted to allow binary search (which is much faster than linear search). Database systems may use various tricks to speed up search even more, particularly if the data is more complex than a simple number.
My suggestion would be to not use any indexes initially and profile your queries. If a particular query (such as searching for people by surname, for example) is run very often, try creating an index over the relevate attributes and profile again. If there is a noticeable speed-up on queries and a negligible slow-down on insertions and updates, keep the index.
(Apologies if I'm repeating stuff mentioned in your other question, I hadn't come across it previously.)
It really depends on your queries. For example, if you almost only write to a table then it is best not to have any indexes, they just slow down the writes and never get used. Any column you are using to join with another table is a good candidate for an index.
Also, read about the Missing Indexes feature. It monitors the actual queries being used against your database and can tell you what indexes would have improved the performace.
Your primary key should always be an index. (I'd be surprised if it weren't automatically indexed by MS SQL, in fact.) You should also index columns you SELECT or ORDER by frequently; their purpose is both quick lookup of a single value and faster sorting.
The only real danger in indexing too many columns is slowing down changes to rows in large tables, as the indexes all need updating too. If you're really not sure what to index, just time your slowest queries, look at what columns are being used most often, and index them. Then see how much faster they are.
Numeric data types which are ordered in ascending or descending order are good indexes for multiple reasons. First, numbers are generally faster to evaluate than strings (varchar, char, nvarchar, etc). Second, if your values aren't ordered, rows and/or pages may need to be shuffled about to update your index. That's additional overhead.
If you're using SQL Server 2005 and set on using uniqueidentifiers (guids), and do NOT need them to be of a random nature, check out the sequential uniqueidentifier type.
Lastly, if you're talking about clustered indexes, you're talking about the sort of the physical data. If you have a string as your clustered index, that could get ugly.
A GUID column is not the best candidate for indexing. Indexes are best suited to columns with a data type that can be given some meaningful order, ie sorted (integer, date etc).
It does not matter if the data in a column is generally increasing. If you create an index on the column, the index will create it's own data structure that will simply reference the actual items in your table without concern for stored order (a non-clustered index). Then for example a binary search can be performed over your index data structure to provide fast retrieval.
It is also possible to create a "clustered index" that will physically reorder your data. However you can only have one of these per table, whereas you can have multiple non-clustered indexes.
The ol' rule of thumb was columns that are used a lot in WHERE, ORDER BY, and GROUP BY clauses, or any that seemed to be used in joins frequently. Keep in mind I'm referring to indexes, NOT Primary Key
Not to give a 'vanilla-ish' answer, but it truly depends on how you are accessing the data
It should be even faster if you are using a GUID.
Suppose you have the records
100
200
3000
....
If you have an index(binary search, you can find the physical location of the record you are looking for in O( lg n) time, instead of searching sequentially O(n) time. This is because you dont know what records you have in you table.
Best index depends on the contents of the table and what you are trying to accomplish.
Taken an example A member database with a Primary Key of the Members Social Security Numnber. We choose the S.S. because the application priamry referes to the individual in this way but you also want to create a search function that will utilize the members first and last name. I would then suggest creating a index over those two fields.
You should first find out what data you will be querying and then make the determination of which data you need indexed.
How do you determine when to use table clusters? There are two types, index and hash, to use for different cases. In your experience, have the introduction and use of table clusters paid off?
If none of your tables are set up this way, modifying them to use table clusters would add to the complexity of the set up. But would the expected performance benefits outweight the cost of increased complexity in future maintenance work?
Do you have any favorite online references or books that describe table clustering well and give good implementation examples?
//Oracle tips greatly appreciated.
The killer feature of table clusters is that you can store related rows of different tables at the same physical location.
That can improve join performance by an order of magnitude. However, it doesn't pay of so often as it sounds.
The only time I used it was a three-table join, executed by two hash joins. It took too long ;). However, the join was on the same column, so it was possible to use a hash table cluster keyed by the join column. That caused all related rows to be stored alongside (ideally, in the same database block). Knowing that, Oracle can execute the join with a special optimization ("cluster join").
It's more or less pre-joined, but still feeling like normal tables (for INSERT/SELECT/UPDATE/DELETE).
On the other hand, there are "single table clusters" that are mostly used to control the "clustering factor" -- A similar idea like clustered indexes (called Index-Organized-Table in Oracle) but not adding high cost if using a secondary index.
One can speak a lot about clustering, but I found that almost ultimate explanation about Oracle clusters (pros and cons, when to use and how to use) can be found in Tom Kyte's book - Effective Oracle by Design, also you can search asktom for some specific cluster usage examples (1, 2 etc). You should definitely take a look at this book if you haven't yet.
Some info you can also find here.
But the thing you should always do before creating complex schema structures is to try, to test, to benchmark and choose the one solution that best fits your needs :)
Hope this helps.
I haven't used Oracle's table clusters myself, but I understand that its index table clusters are very much like MS SQL Server's clustered indexes. That is, the row data is physically organized by the clustered index's key.
That makes one ideal for a heavily-accessed column that has a reasonably small number of possible values (compared to the total number of rows), where most queries want to retrieve all rows with a particular value. Because all such rows are physically stored together, disk I/O, particularly seek time, is reduced.
"Reasonably small" is not easily defined, but postal or zip codes in an address table seems reasonable if you're often querying for all addresses in a single code's region. Province/state/territory codes are likely too small a selection for a country-wide address table.
So, you don't want to use them on columns with few possible values (e.g., M/F for gender) because then the clustering doesn't buy you anything and likely costs you for insertions. You also never want to use clustering on "autonumber" surrogate key columns (from sequences in Oracle) because that will create a "hot spot" in the last extent of the table as all insertions must physically happen there. You also don't want to apply clustering to a column value that will be updated because the RDBMS will have to physically move the record to maintain the clustered ordering.