To start off, I'm not that great with database strategies, so I don't really know how to approach this.
What I want to do is store some info in a database. Essentially the data is going to look like this:
SensorNumber (int)
Reading (int)
Timestamp (Datetime?)(I just want to track down to the minute, nothing further is needed)
The only thing about this is that over a few months of tracking I'm going to have millions of rows (~5 million rows).
I really only care about searching by Timestamp and/or SensorNumber. The data in here is pretty much going to be never edited (insert once, read many times).
How should I go about building this? Is there anything special I should do other than create the table and create one index on SensorNumber and Timestamp?
Based on your comment, I would put a clustered index on (Sensor, Timestamp).
This will always cover when you want to search for SENSOR alone, but will also cover both fields checked in combination.
If you want to ever search for Timestamp alone, you can add a nonclustered index there as well.
One issue you will have with this design is index fragmentation: since you are going to be inserting rows non-sequentially, the new rows won't always belong at the end of the clustered index, so you will need to rebuild or reorganize it periodically.
Also, please do not name a field timestamp - this is a keyword in SQL Server and can cause you all kinds of issues if you don't delimit it everywhere.
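As a rough sketch (table and column names here are just placeholders, not your actual schema), the DDL might look something like this:
CREATE TABLE dbo.SensorReading (
    SensorNumber int      NOT NULL,
    Reading      int      NOT NULL,
    ReadingTime  datetime NOT NULL  -- named to avoid the timestamp keyword
);
-- Clustered index: covers searches on SensorNumber alone
-- and on SensorNumber plus ReadingTime together
CREATE CLUSTERED INDEX IX_SensorReading_Sensor_Time
    ON dbo.SensorReading (SensorNumber, ReadingTime);
-- Optional nonclustered index if you also search by time alone
CREATE NONCLUSTERED INDEX IX_SensorReading_Time
    ON dbo.SensorReading (ReadingTime);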
You definitely want to use a SQL-Server "clustered index" for the most selective data you're likely to search on.
Here's more info:
http://www.sql-server-performance.com/2007/clustered-indexes/
http://odetocode.com/articles/70.aspx
http://www.sql-server-performance.com/2002/index-not-equal/
ELABORATION:
"Sensor" would be a poor choice - you're likely to have few sensors, many rows. This would not be a discriminating index.
"Time" would be discriminating... but it would also be a poor choice. Because the time itself, independent of sensor, temperature, etc, is probably meaningless to your query.
A clustered index on "sensor,time" might be ideal. Or maybe not - it depends on what you're after.
Please review the above links.
PS:
Please, too, consider using "datetime" instead of "timestamp". They're two completely different types under MSSQL ... and "datetime" is arguably the better, more flexible choice:
http://www.sqlteam.com/article/timestamps-vs-datetime-data-types
I agree with using a clustered index; you are almost certainly going to end up with one anyway, so it's better to define it deliberately.
A clustered index determines the order that the data is stored, adding to the end is cheaper than inserting into the middle.
Think of a deck of cards you are trying to keep in rank order as you add cards. If the highest rank is a 8, adding a 9 is trivial - put it at the top.
If you add a 5, it gets more complex: you have to work out where it belongs and then insert it.
So adding items with a clustered index in order is optimal.
Given that, I would suggest having the clustered index on (Timestamp, Sensor).
Clustering on (Sensor, Timestamp) will cause a LOT of changes to the physical ordering of the data, which is very expensive (even on SSDs).
If the (Timestamp, Sensor) combination is unique then define the index as UNIQUE, otherwise SQL Server will add a hidden uniquifier to the index to resolve duplicates.
Primary keys are automatically unique, almost all tables should have a primary key.
If (Timestamp,Sensor) is not unique, or you want to reference this data from another table, consider using an identity column as the clustered Primary Key.
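A rough sketch of that, with placeholder names, assuming the combination is unique:
CREATE TABLE dbo.SensorReading (
    ReadingTime  datetime NOT NULL,
    SensorNumber int      NOT NULL,
    Reading      int      NOT NULL,
    CONSTRAINT PK_SensorReading
        PRIMARY KEY CLUSTERED (ReadingTime, SensorNumber)
);
-- If the combination is not unique, use an identity column instead, e.g.
-- ReadingID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED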
Good Luck!
Typically, databases are designed as below to allow multiple types for an entity.
Entity Name
Type
Additional info
The entity name can be something like an account number, and the type could be something like savings or current (in a bank database, for example).
Mostly, type will be some kind of string. There could be additional information associated with an entity type.
Normally queries will be posed like this.
Find account numbers of this particular type?
Find account numbers of type X, having balance greater than 1 million?
To answer these queries, the query analyzer will use an index if one is associated with the relevant column. Otherwise, it will do a full scan of all the rows.
I am thinking about the below optimization.
Why not store a hash or integral value of each column's data in the actual table, such that the ordering property is maintained, making comparisons easy?
It has the following advantages.
1. Table size will be a lot less because we will be storing small values for each column's data.
2. We can construct a clustered B+ tree index on the hash values for each column to retrieve the rows matching, greater than or smaller than some value.
3. The corresponding values can be easily retrieved by keeping the B+ tree index in main memory and retrieving the corresponding values.
4. Infrequent values will never need to be retrieved.
I still have more optimizations in mind. I will post those based on the feedback to this question.
I am not sure if this is already implemented in database, this is just a thought.
Thank you for reading this.
-- Bala
Update:
I am not trying to emulate what the database does. Normally indexes are created by the database administrator. I am trying to propose a physical schema with indexes on all the fields in the database, so that the database table size is reduced and it's easy to answer a few queries.
Updates (responding to Joe's answer):
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
In a typical table, all the physical data will be stored. But now, by generating a hash value for each column's data, I am only storing the hash value in the actual table. I agree that it's not reducing the size of the database, but it is reducing the size of the table. It will be useful when you don't need to return all the column values.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
There can be only one clustered index on a table, and all other indexes have to be nonclustered indexes. With my approach I would have a clustered index on all the values of the database. It will improve query performance.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
The basic idea is that instead of creating a separate table for each column for efficient access, we are doing it at the physical level.
Now the table will look like this.
Row1 - OrderedHash(Column1),OrderedHash(Column2),OrderedHash(Column3)
Google for "hash index". For example, in SQL Server such an index is created and queried using the CHECKSUM function.
This is mainly useful when you need to index a column which contains long values, e.g. varchars which are on average more than 100 characters or something like that.
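A minimal sketch of that pattern (the table and column names are made up for illustration): persist the checksum in a computed column, index it, and always re-check the original value because CHECKSUM can collide:
CREATE TABLE dbo.Document (
    DocID        int IDENTITY PRIMARY KEY,
    LongText     varchar(500) NOT NULL,
    LongTextHash AS CHECKSUM(LongText)      -- computed hash column
);
CREATE INDEX IX_Document_LongTextHash ON dbo.Document (LongTextHash);
-- Seek on the small hash index, then confirm against the real value,
-- since different strings can produce the same checksum.
SELECT DocID
FROM dbo.Document
WHERE LongTextHash = CHECKSUM('some long value')
  AND LongText = 'some long value';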
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
I don't think your approach is very helpful.
Hash values only help for equality/inequality comparisons, not for less-than/greater-than comparisons, unlike pretty much every ordinary database index.
Even for (in)equality, hash functions do not offer a 100% guarantee of having given you the right answer, since hash collisions can happen, so you will still have to fetch and compare the original value - and there goes the saving you were hoping for.
You can have the rows in a table ordered only one way at a time. So if you have an application where you have to order rows differently in different queries (e.g. query A needs a list of customers ordered by their name, query B needs a list of customers ordered by their sales volume), one of those queries will have to access the table out-of-order.
If you don't want the database to have to work around columns you do not use in a query, then use indexes with extra data columns - if your query is ordered according to that index, and your query only uses columns that are in the index (columns the index is based on plus columns you have explicitly added into the index), the DBMS will not read the original table.
Etc.
I need to store entries with a schema like (ID:int, description:varchar, updatetime:DateTime). ID is the unique primary key. The usage scenario is: I will frequently insert new entries, frequently query entries by ID, and less frequently remove expired entries (by the updatetime field, using a SQL job run daily to keep the database from growing without bound). Each entry is about 0.5 KB in size.
My question is how to optimize the database schema design (e.g. tricks to add index, transaction/lock levels or other options) in my scenario to improve performance? Currently I plan to store all information in a single table, not sure whether it is the best option.
BTW: I am using SQL Server 2005/2008.
thanks in advance,
George
In addition to your primary key, just add an index on updatetime.
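A minimal sketch of that, assuming SQL Server and a made-up 30-day retention period:
CREATE TABLE dbo.Entries (
    ID          int           NOT NULL PRIMARY KEY,  -- clustered by default
    description varchar(1000) NOT NULL,
    updatetime  datetime      NOT NULL
);
CREATE NONCLUSTERED INDEX IX_Entries_updatetime
    ON dbo.Entries (updatetime);
-- The daily cleanup job can then seek on the index instead of scanning:
DELETE FROM dbo.Entries
WHERE updatetime < DATEADD(day, -30, GETDATE());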
Your decision to store everything in a single table needs to be reviewed. There are very few subject matters that can really be well modeled by just one table.
The problems that arise from using just one table are usually less obvious than the problems that arise from not creating the right indexes and things like that.
I'm interested in the "description" column (field). Do all descriptions describe the same kind of thing? Do you ever retrieve sets of descriptions, rather than just one description at a time? How do you group descriptions into sets?
How do you know the ID for the description you are trying to retrieve? Do you store copies of the ID in some other place, in order to reference which ones you want?
Do you know what a "foreign key" is? Was your choice not to include any foreign keys in this table deliberate?
These are some of the questions that need to be answered before you can know whether a single table design really suits your case.
Your ID is your primary key and automatically has an index.
You can put another index on the expiration date. Indexes are going to help you when searching but decrease performance when inserting, deleting and updating. Anyway, one extra index is not an issue.
It sounds somewhat strange to me - I am not saying that it is an error - that you have ALL the information in one table. Re-think that point and see if you can refactor something.
It sounds as simple as it gets, except for possibly adding an index on updatetime as OMax suggested (I recommend).
If you would also like to fetch items by description, you should also consider a text index or full-text index on that column.
Other than that - you're ready to go :)
Is there a performance gain or best practice when it comes to using unique, numeric ID fields in a database table compared to using character-based ones?
For instance, if I had two tables:
athlete
id ... 17, name ... Rickey Henderson, teamid ... 28
team
teamid ... 28, teamname ... Oakland
The athlete table, with thousands of players, would be easier to read if the teamid was, say, "OAK" or "SD" instead of "28" or "31". Let's take for granted the teamid values would remain unique and consistent in character form.
I know you CAN use characters, but is it a bad idea for indexing, filtering, etc for any reason?
Please ignore the normalization argument as these tables are more complicated than the example.
I find primary keys that are meaningless numbers cause less headaches in the long run.
Text is fine, for all the reasons you mentioned.
If the string is only a few characters, then it will be nearly as small as an integer anyway. The biggest potential drawback to using strings is the size: database performance is related to how many disk accesses are needed. Making the index twice as big, for example, could create disk-cache pressure and increase the number of disk seeks.
I'd stay away from using text as your key - what happens in the future when you want to change the team ID for some team? You'd have to cascade that key change all through your data, which is exactly what a primary key can avoid. Also, though I don't have any empirical evidence, I'd think an INT key would be significantly faster than a text one.
Perhaps you can create views for your data that make it easier to consume, while still using a numeric primary key.
I'm just going to roll with your example. Doug is correct when he says that text is fine. Even for a medium sized (~50gig) database having a 3 letter code be a primary key won't kill the database. If it makes development easier, reduces joins on the other table and it's a field that users would be typing in...I say go for it. Don't do it if it's just an abbreviation that you show on a page or because it makes the athletes table look pretty. I think the key is the question "Is this a code that the user will type in and not just pick from a list?"
Let me give you an example of when I used a text column for a key. I was making software for processing medical claims. After the claim got all digitized a human had to look at the claim and then pick a code for it that designated what kind of claim it was. There were hundreds of codes...and these guys had them all memorized or crib sheets to help them. They'd been using these same codes for years. Using a 3 letter key let them just fly through the claims processing.
I recommend using ints or bigints for primary keys. Benefits include:
Joins are faster.
Having no semantic meaning in your primary key allows you to change the fields that do carry semantic meaning without affecting relationships to other tables.
You can always have another column to hold a team_code or similar for "OAK" and "SD".
The standard answer is to use numbers because they are faster to index; no need to compute a hash or whatever.
If you use a meaningful value as a primary key, you'll have to update it all through your database if the team name changes.
To satisfy the above, but still make the database directly readable,
use a number field as the primary key
immediately create a view Athlete_And_Team that joins the Athlete and Team tables
Then you can use the view when you're going through the data by hand.
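A sketch of such a view, assuming the column names from the example above:
CREATE VIEW Athlete_And_Team AS
SELECT a.id, a.name, t.teamid, t.teamname
FROM athlete AS a
JOIN team AS t ON t.teamid = a.teamid;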
Are you talking about your primary key or your clustered index? Your clustered index should be the column you will most often use to uniquely identify a row. It also defines the logical ordering of the rows in your table. The clustered index will almost always be your primary key, but there are circumstances where they can be different.
As a follow up to "What are indexes and how can I use them to optimise queries in my database?" where I am attempting to learn about indexes, what columns are good index candidates? Specifically for an MS SQL database?
After some googling, everything I have read suggests that columns that are generally increasing and unique make a good index (things like MySQL's auto_increment), I understand this, but I am using MS SQL and I am using GUIDs for primary keys, so it seems that indexes would not benefit GUID columns...
Indexes can play an important role in query optimization and in retrieving results quickly from tables. The most important step is to select which columns are to be indexed. There are two major places to consider indexing: columns referenced in the WHERE clause and columns used in JOIN clauses. In short, you should index the columns you need to search against to find particular records. Suppose we have a table named buyers where a SELECT query uses indexes as below:
SELECT
buyer_id /* no need to index */
FROM buyers
WHERE first_name='Tariq' /* consider indexing */
AND last_name='Iqbal' /* consider indexing */
Since "buyer_id" is referenced in the SELECT portion, MySQL will not use it to limit the chosen rows. Hence, there is no great need to index it. The below is another example little different from the above one:
SELECT
buyers.buyer_id, /* no need to index */
country.name /* no need to index */
FROM buyers LEFT JOIN country
ON buyers.country_id=country.country_id /* consider indexing */
WHERE
first_name='Tariq' /* consider indexing */
AND
last_name='Iqbal' /* consider indexing */
According to the above queries, the first_name and last_name columns can be indexed as they are located in the WHERE clause. Also, an additional field, country_id from the country table, can be considered for indexing because it is in a JOIN clause. So indexing can be considered on every field in a WHERE clause or a JOIN clause.
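For the example above, that might translate into something like this (MySQL syntax; a sketch rather than a tuning recommendation):
-- Composite index for the WHERE clause on buyers
CREATE INDEX idx_buyers_name ON buyers (first_name, last_name);
-- Index for the JOIN condition
CREATE INDEX idx_country_country_id ON country (country_id);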
The following list also offers a few tips that you should always keep in mind when you intend to create indexes on your tables:
Only index those columns that are required in WHERE and ORDER BY clauses. Indexing columns in abundance will result in some disadvantages.
Try to take benefit of "index prefix" or "multi-columns index" feature of MySQL. If you create an index such as INDEX(first_name, last_name), don’t create INDEX(first_name). However, "index prefix" or "multi-columns index" is not recommended in all search cases.
Use the NOT NULL attribute for those columns in which you consider the indexing, so that NULL values will never be stored.
Use the --log-long-format option to log queries that aren’t using indexes. In this way, you can examine this log file and adjust your queries accordingly.
The EXPLAIN statement helps you reveal how MySQL will execute a query. It shows how and in what order tables are joined. This can be very useful for determining how to write optimized queries, and whether columns need to be indexed.
Update (23 Feb'15):
Any index (good/bad) increases insert and update time.
How results are searched depends on your indexes (their number and type). If your search time increases because of an index, then that's a bad index.
Like the index page of a book, which lists where each chapter, topic and sub-topic starts: some detail in the index page helps, but an overly detailed index may confuse or overwhelm you. Indexes also take up space.
Index selection should be done wisely. Keep in mind that not all columns need an index.
Some folks answered a similar question here: How do you know what a good index is?
Basically, it really depends on how you will be querying your data. You want an index that quickly identifies a small subset of your dataset that is relevant to a query. If you never query by datestamp, you don't need an index on it, even if it's mostly unique. If all you do is get events that happened in a certain date range, you definitely want one. In most cases, an index on gender is pointless -- but if all you do is get stats about all males, and separately, about all females, it might be worth your while to create one. Figure out what your query patterns will be, and access to which parameter narrows the search space the most, and that's your best index.
Also consider the kind of index you make -- B-trees are good for most things and allow range queries, but hash indexes get you straight to the point (but don't allow ranges). Other types of indexes have other pros and cons.
Good luck!
It all depends on what queries you expect to ask about the tables. If you ask for all rows with a certain value for column X, you will have to do a full table scan if an index can't be used.
Indexes will be useful if:
The column or columns have a high degree of uniqueness
You frequently need to look for a certain value or range of values for the column.
They will not be useful if:
You are selecting a large % (>10-20%) of the rows in the table
The additional space usage is an issue
You want to maximize insert performance. Every index on a table reduces insert and update performance because they must be updated each time the data changes.
Primary key columns are typically great for indexing because they are unique and are often used to lookup rows.
Any column that is going to be regularly used to extract data from the table should be indexed.
This includes:
foreign keys -
select * from tblOrder where status_id=:v_outstanding
descriptive fields -
select * from tblCust where Surname like 'O''Brian%'
The columns do not need to be unique. In fact, you can get really good performance from an index on a binary (Y/N) column when searching for exceptions.
select * from tblOrder where paidYN='N'
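If most rows are paid and only the exceptions are queried, a filtered index (SQL Server 2008 and later) is one way to keep that lookup cheap - a sketch, reusing the table name from the example:
CREATE NONCLUSTERED INDEX IX_tblOrder_unpaid
    ON tblOrder (paidYN)
    WHERE paidYN = 'N';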
In general (I don't use mssql so can't comment specifically), primary keys make good indexes. They are unique and must have a value specified. (Also, primary keys make such good indexes that they normally have an index created automatically.)
An index is effectively a copy of the column which has been sorted to allow binary search (which is much faster than linear search). Database systems may use various tricks to speed up search even more, particularly if the data is more complex than a simple number.
My suggestion would be to not use any indexes initially and to profile your queries. If a particular query (such as searching for people by surname, for example) is run very often, try creating an index over the relevant attributes and profile again. If there is a noticeable speed-up on queries and a negligible slow-down on insertions and updates, keep the index.
(Apologies if I'm repeating stuff mentioned in your other question, I hadn't come across it previously.)
It really depends on your queries. For example, if you almost only write to a table then it is best not to have any indexes, they just slow down the writes and never get used. Any column you are using to join with another table is a good candidate for an index.
Also, read about the Missing Indexes feature. It monitors the actual queries being run against your database and can tell you which indexes would have improved their performance.
Your primary key should always be an index. (I'd be surprised if it weren't automatically indexed by MS SQL, in fact.) You should also index columns you SELECT or ORDER by frequently; their purpose is both quick lookup of a single value and faster sorting.
The only real danger in indexing too many columns is slowing down changes to rows in large tables, as the indexes all need updating too. If you're really not sure what to index, just time your slowest queries, look at what columns are being used most often, and index them. Then see how much faster they are.
Numeric data types which are ordered in ascending or descending order are good indexes for multiple reasons. First, numbers are generally faster to evaluate than strings (varchar, char, nvarchar, etc). Second, if your values aren't ordered, rows and/or pages may need to be shuffled about to update your index. That's additional overhead.
If you're using SQL Server 2005 and are set on using uniqueidentifiers (GUIDs), and do NOT need them to be of a random nature, check out sequential uniqueidentifiers (generated with NEWSEQUENTIALID()).
Lastly, if you're talking about clustered indexes, you're talking about the sort of the physical data. If you have a string as your clustered index, that could get ugly.
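For example, the sequential GUID mentioned above can be set up like this (a sketch; the table and column names are placeholders):
CREATE TABLE dbo.Example (
    ID uniqueidentifier NOT NULL
        CONSTRAINT DF_Example_ID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Example PRIMARY KEY CLUSTERED,
    Name varchar(100) NOT NULL
);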
A GUID column is not the best candidate for indexing. Indexes are best suited to columns with a data type that can be given some meaningful order, ie sorted (integer, date etc).
It does not matter if the data in a column is generally increasing. If you create an index on the column, the index will build its own data structure that simply references the actual items in your table without concern for stored order (a non-clustered index). Then, for example, a binary search can be performed over your index data structure to provide fast retrieval.
It is also possible to create a "clustered index" that will physically reorder your data. However you can only have one of these per table, whereas you can have multiple non-clustered indexes.
The ol' rule of thumb was to index columns that are used a lot in WHERE, ORDER BY, and GROUP BY clauses, or any that seem to be used in joins frequently. Keep in mind I'm referring to indexes, NOT primary keys.
Not to give a 'vanilla-ish' answer, but it truly depends on how you are accessing the data
It should be even faster if you are using a GUID.
Suppose you have the records
100
200
3000
....
If you have an index (using binary search), you can find the physical location of the record you are looking for in O(lg n) time, instead of searching sequentially in O(n) time. This is because you don't know what records you have in your table.
Best index depends on the contents of the table and what you are trying to accomplish.
Take as an example a member database with a primary key of the member's Social Security Number. We choose the SSN because the application primarily refers to individuals this way, but you also want to create a search function that will use the member's first and last name. I would then suggest creating an index over those two fields.
You should first find out what data you will be querying and then make the determination of which data you need indexed.