How long should a Primary Key delete take? - sql-server

Picture a simple table structure:
Table1                Table2
----------            ----------
ID <------------+     ID
Name            +---- Table1ID
                      Name
Table1 has a few million rows (say 3.5 million for example). I issue a delete by Primary Key:
DELETE FROM Table1 WHERE ID = 100;
There is no row in Table2 that references Table1 with ID = 100, so the delete works without violating any Foreign Key constraints.
How long would you expect the delete to take? On the order of a few milliseconds? A few hundred milliseconds? A second or more? A few seconds? Etc., assuming the machine is not bogged down and readily handles the request.
Now, I have a situation where a delete like this is taking around 700ms. To me, this seems too slow. I'm curious whether I'm off-base or others agree this is too slow, and I'd welcome recommendations to help make it faster!
Here is the actual execution plan:
(XML Execution plan here: http://pastebin.com/q9hSMLi3)
The Clustered Index Delete (81%) hits the Clustered PK, a Non-Clustered Unique Index, and a Non-Clustered Non-Unique Index.

The issue is the clustered index scan to validate the foreign key.
When the delete succeeds and there are no matching records that would cause a violation, all of Table2 needs to be scanned. This table has 1,117,190 rows, so this is an expensive operation that could definitely benefit from an index.
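A minimal sketch of such an index, assuming the referencing column is named Table1ID as in the diagram above (the index name is illustrative):
-- Index the referencing column so the FK check behind the DELETE
-- becomes a seek on Table2 instead of a scan of all 1,117,190 rows.
CREATE NONCLUSTERED INDEX IX_Table2_Table1ID
    ON dbo.Table2 (Table1ID);
With this in place, the anti semi join can probe Table2 with a seek rather than scanning the whole table.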
The 10% figure shown in the execution plan is just an estimate based on certain modelling assumptions.
The entire plan is costed at 0.0369164 with the scan on table 2 costed at 0.0036199 and everything else accounting for the remaining 0.0332965. However notice that for the clustered index scan operator the Estimated CPU Cost is 1.22907 and Estimated IO Cost is 10.7142 (totaling 11.94327 not 0.0369164).
The reason for this discrepancy is that the scan is under an anti semi join operator and the scan can stop as soon as a matching row is found. The estimated subtree cost is scaled down under the modelling assumption that this will happen after only a very small proportion of the table has been scanned.
When there are no FK violations and the delete succeeds, the entire table needs to be scanned, so it would be more informative to use the figure that has not been scaled down.
If the percentages are reworked out using the 11.94327 cost for that operator that represents the full scan that happened in practice then this scan operator shows up as being 99.7% of the plan cost (11.94327 / (11.94327 + 0.0332965)).

If all pages being touched are in cache you can expect about 1ms or less for the CPU cost and the log write. The client library overhead might actually be more in terms of CPU than the server load.
For each page not in cache you can expect a disk seek of 5-10ms on a magnetic disk. Roughly, you can expect one such access per index being touched in Table1 plus one access in Table2 to validate the FK.
The execution plan tells you for sure which physical ops are to be performed.
700ms seems like a lot (70 indexes?!). Please post the actual execution plan. Is the server unloaded, and is there no blocking due to locks?

Using a meaningless ID as my clustered index rather than my primary key

I'm working in SQL Server 2008 R2
As part of a complete schema rebuild, I am creating a table that will be used to store advertising campaign performance by zipcode by day. The table setup I'm thinking of is something like this:
CREATE TABLE [dbo].[Zip_Perf_by_Day] (
[CampaignID] int NOT NULL,
[ZipCode] int NOT NULL,
[ReportDate] date NOT NULL,
[PerformanceMetric1] int NOT NULL,
[PerformanceMetric2] int NOT NULL,
[PerformanceMetric3] int NOT NULL,
and so on... )
Now the combination of CampaignID, ZipCode, and ReportDate is a perfect natural key, they uniquely identify a single entity, and there shouldn't be 2 records for the same combination of values. Also, almost all of my queries to this table are going to be filtered on 1 or more of these 3 columns. However, when thinking about my clustered index for this table, I run into a problem. These 3 columns do not increment over time. ReportDate is OK, but CampaignID and Zipcode are going to be all over the place while inserting rows. I can't even order them ahead of time because results come in from different sources during the day, so data for CampaignID 50000 might be inserted at 10am, and CampaignID 30000 might come in at 2pm. If I use the PK as my clustered index, I'm going to run into fragmentation problems.
So I was thinking that I need an Identity ID column, let's call it PerformanceID. I can see no case where I would ever use PerformanceID in either the select list or where clause of any query. Should I use PerformanceID as my PK and clustered index, and then set up a unique constraint and non-clustered indexes on CampaignID, ZipCode, and ReportDate? Should I keep those 3 columns as my PK and just have my clustered index on PerformanceID? (<- This is the option I'm leaning towards right now) Is it OK to have a slightly fragmented table? Is there another option I haven't considered? I am looking for what would give me the best read performance, while not completely destroying write performance.
Some actual usage information. This table will get written to in batches. Feeds come in at various times during the day, they get processed, and this table gets written to. It's going to get heavily read, as by-day performance is important around here. When I fill this table, it should have about 5 million rows, and will grow at a pace of about 8,000 - 10,000 rows per day.
In my experience, you probably do want to use another INT Identity field as your clustered index key. I would also add a UNIQUE constraint to that one (it helps with execution plans).
A big part of the reason is space - if you use a 3 field key for your clustered index, you will have all 3 fields in every row of every non-clustered index on that table (as your clustered index row identifier). If you only plan to have a couple of indexes that isn't a big deal, but if you have a lot of them it can make a big difference. The more data per row, the more pages needed and the more IO you have.
Fragmentation is a very real issue that can cause major performance problems, especially as the table grows.
Having that additional cluster key will also mean writes will be faster for your inserts. All new rows will go to the end of your table, which means existing rows won't be touched or rearranged.
If you want to use those three fields as a FK in other tables, then by all means have them as your PK.
For the most part it doesn't really matter if you ever directly reference your clustered index key. As long as it is narrow, increasing, and unique you should be in good shape.
EDIT:
As Damien points out in the comments, if you will be filtering on single fields of your PK, you will need to have an index on each one (or always use the first field in the covering index).
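A sketch of that layout under the assumptions above (constraint and index names are illustrative; only the first metric column is shown):
CREATE TABLE [dbo].[Zip_Perf_by_Day] (
    [PerformanceID]      int IDENTITY(1,1) NOT NULL,
    [CampaignID]         int NOT NULL,
    [ZipCode]            int NOT NULL,
    [ReportDate]         date NOT NULL,
    [PerformanceMetric1] int NOT NULL,
    -- ...remaining metric columns...
    CONSTRAINT PK_Zip_Perf_by_Day PRIMARY KEY CLUSTERED ([PerformanceID]),
    -- Enforce the natural key so duplicates cannot sneak in
    CONSTRAINT UQ_Zip_Perf_by_Day UNIQUE ([CampaignID], [ZipCode], [ReportDate])
);
-- Per the edit above: extra nonclustered indexes for single-column filters
CREATE NONCLUSTERED INDEX IX_Zip_Perf_ZipCode    ON [dbo].[Zip_Perf_by_Day] ([ZipCode]);
CREATE NONCLUSTERED INDEX IX_Zip_Perf_ReportDate ON [dbo].[Zip_Perf_by_Day] ([ReportDate]);
The UNIQUE constraint already gives you a nonclustered index leading on CampaignID, so filters on CampaignID alone can seek on it; ZipCode and ReportDate get their own indexes above.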
On the information given (ReportDate, CampaignID, ZipCode) or (ReportDate, ZipCode, CampaignID) seem like better candidates for the clustered index than a surrogate key. Defragmentation would be a potential concern if the time taken to rebuild indexes became prohibitive but given the sizes I would expect for this table (10s or 1000s rather than 1,000,000s of rows per day) that seems unlikely to be an issue.
If I understood all you have written correctly you are opting out of natural clustering due to fragmentation penalties.
For this purpose you consider meaningless IDs which will:
avoid insert penalties on the clustered index when inserting out-of-order batches (great for write performance)
guarantee that data is scattered with respect to the natural key, so reads that filter on the natural key touch more pages (not so good for read performance)
JNK points out that fragmentation can be a real issue; however, you need to establish a baseline against which to measure, and you need to decide whether reading or writing is more important to you (or how important each is, in measurable terms).
There's nothing that will beat a good test case - so finally that is the best recommendation I can give.
With databases it is often relatively easy to build scripts that will create real benchmarks with real workloads and realistic data quantities.

Effects of Clustered Index on DB Performance

I recently became involved with a new software project which uses SQL Server 2000 for its data storage.
In reviewing the project, I discovered that one of the main tables uses a clustered index on its primary key which consists of four columns:
Sequence numeric(18, 0)
Date datetime
Client varchar(9)
Hash tinyint
This table experiences a lot of inserts in the course of normal operation.
Now, I'm a C++ developer, not a DB Admin, but my first impression of this table design was that having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.
In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?
So basically I need some ammunition for when I go to the powers that be to convince them that the table design should be changed.
The clustered index should contain the column(s) you query by most often, to give the greatest chance of seeks or of letting a nonclustered index cover all the columns in the query.
The primary key and the clustered index do not have to be the same. They are both candidate keys, and tables often have more than one such key.
You said
In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?
That's not true. A seek can be had just by using the first column or two of the clustered index. It may be a range seek, but it's still a seek. You don't have to specify all the columns of it in order to get that benefit. But the order of the columns does matter a lot. If you're predominantly querying on Client, then the Sequence column is a bad choice as the first in the clustered index. The choice of the second column should be the item that is most queried in conjunction with the first (not by itself). If you find that a second column is queried by itself almost as often as the first column, then a nonclustered index will help.
As others have said, reducing the number of columns/bytes in the clustered index as much as possible is important.
It's too bad that the Sequence is a random value instead of incrementing, but that may not be able to be helped. The answer isn't to throw in an identity column unless your application can start using it as the primary query condition on this table (unlikely). Now, since you're stuck with this random Sequence column (presuming it IS the most often queried), let's look at another of your statements:
having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.
That's not entirely true.
The physical location on the disk is not really what we're talking about here, but it does come into play in terms of fragmentation, which is a performance implication.
The rows inside each 8k page are not ordered. It's just that all the rows in each page are less than those in the next page and more than those in the previous one. The problem occurs when you insert a row and the page is full: you get a page split. The engine has to move roughly half the page's rows to a newly allocated page, and this can be expensive. With a random key you're going to get a lot of page splits. You can ameliorate the problem by using a lower fillfactor when rebuilding the index. You'd have to play with it to get the right number, but 70% or 60% might serve you better than 90%.
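For example, on SQL Server 2000 the fill factor can be set while rebuilding the index (object names here are placeholders, and 70 is just a starting point to experiment with):
-- Rebuild the clustered index leaving ~30% free space per leaf page,
-- so inserts with random Sequence values split pages less often.
DBCC DBREINDEX ('dbo.YourTable', 'PK_YourTable', 70);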
I believe having datetime as the second CI column could be beneficial. You'd still be dealing with pages needing to be split between two different Sequence values, but it's not nearly as bad as if the second column in the CI were also random, since then you'd be guaranteed to page split on every insert; with an ascending second column you can get lucky, because the row can sometimes be added to a page when the next Sequence number starts on the next page.
Shortening the data types and number of all columns in a table as well as its nonclustered indexes can boost performance too, since more rows per page = fewer page reads per request. Especially if the engine is forced to do a table scan. Moving a bunch of rarely-queried columns to a separate 1-1 table could do wonders for some of your queries.
Last, there are some design tweaks that could help as well (in my opinion):
Change the Sequence column to a bigint to save a byte for every row (8 bytes instead of 9 for the numeric).
Use a lookup table for Client with a 4-byte int identity column instead of a varchar(9). This saves 5 bytes per row. If possible, use a smallint (-32768 to 32767) which is 2 bytes, an even greater savings of 7 bytes per row.
Summary: The CI should start with the column most queried on. Remove any columns from the CI that you can. Shorten columns (bytes) as much as you can. Use a lower fillfactor to mitigate the page splits caused by the random Sequence column (if it has to stay first because of being queried the most).
Oh, and get your online defragging going. If the table can't be changed, at least it can be reorganized frequently to keep it in best possible shape. Don't neglect statistics, either, so the engine can pick appropriate execution plans.
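A hedged sketch of that routine maintenance on SQL Server 2000 (object names are placeholders):
-- Online-friendly defrag: compacts and reorders leaf-level pages
-- without the long locks a full rebuild would take.
DBCC INDEXDEFRAG (0, 'dbo.YourTable', 'PK_YourTable');
-- Keep statistics current so the optimizer keeps picking sensible plans.
UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;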
UPDATE
Another strategy to consider is if the composite key used in the table can be converted to an int, and a lookup table of the values is created. Let's say some combination of less than all 4 columns is repeated in over 100 rows, for example, Sequence + Client + Hash but only with varying Date values. Then an insert to a separate SequenceClientHash table with an identity column could make sense, because then you could look up the artificial key once and use it over and over again. This would also get your CI to add new rows only on the last page (yay) and substantially reduce the size of the CI as repeated in all nonclustered indexes (yippee). But this would only make sense in certain narrow usage patterns.
Now, marc_s suggested just adding an additional int identity column as the clustered index. It is possible that this could help by making all the nonclustered indexes get more rows per page, but it all depends on exactly where you want the performance to be, because this would guarantee that every single query on the table would have to use a bookmark lookup and you could never seek the clustered index directly on your query columns.
About "tons of page splits and bad index fragmentation": as I already said this can be ameliorated somewhat with a lower fill factor. Also, frequent online index reorganization (not the same as rebuilding) can help reduce the effect of this.
Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized. For some systems, having a slower insert isn't bad as long as selects are always fast. For others, having consistent but slightly slower select times is more important than having slightly faster but inconsistent select times. For others, the data isn't really read until it's pushed to a data warehouse anyway so the inserts need to be as fast as possible. And adding into the mix is the fact that performance isn't just about user wait time or even query response time but also about server resources especially in the case of massive parallelism, so that total throughput (say, in client responses per time unit) matters more than any other factor.
Clustered indexes (CI) work best over ever-increasing, narrow, rarely changing values. You'll want your CI to cover the column(s) that get hit the most often in queries with >=, <=, or BETWEEN statements.
I'm not sure how your data normally gets hit. Most often you'll see a CI on an IDENTITY column or another narrow column (because this column will also be returned "tacked on" to all non-clustered indexes, and we don't want a ton of data added on to every fetch if it isn't needed). It's possible the data might be getting queried most often on date, and that may be a good choice, but all four columns is likely not correct (I stress likely, because I don't know the set-up; this may not have anything wrong with it). There are some pointers here: http://msdn.microsoft.com/en-us/library/aa933131%28SQL.80%29.aspx
There are a few things you are misunderstanding about how SQL creates and uses indexes.
Tables with a clustered index aren't necessarily physically ordered on disk by the clustered key, at least not in real time; the clustered index defines a logical ordering.
I wouldn't expect a major performance hit based on this structure and removing the clustered index before you have actually identified a performance issue related to that index is clearly premature optimization.
Also, an index can be useful (especially one with several fields in it) even for searches that don't sort or get queried on all columns included in it.
Obviously, there should be a justification for creating a multi-part clustered index, just like any index, so it makes sense to ask for that if you think it was added capriciously.
Bottom line: Don't optimize the indexes for insert performance until you have actually detected a performance problem with inserts. It usually isn't worth it.
If you have only that single clustered index on your table, that might not be too bad. However, the clustering key is also used for looking up the real data page for any hit in a non-clustered index - therefore, the clustered index (all its columns) is also part of each and every non-clustered index you might have on your table.
So if you have a few nonclustered indices on your table, then you're definitely a) wasting a lot of space (and not just on disk - also in your server's RAM!), and b) hurting your performance.
A good clustered index ought to be:
small (best bet: a 4-byte INT) - yours is pretty bad with up to 28 bytes per entry
unique
stable (never change)
ever-increasing
I would bet your current setup violates at least two if not more of those requirements. Not following these recommendations will lead to waste of space, and as you rightfully say, lots of page and index fragmentation and page splits (having to "rearrange" the data when an insert happens somewhere in the middle of the clustered index).
Quite honestly: just add a surrogate ID INT IDENTITY(1,1) to your table and make that the primary clustered key - you should see quite a nice boost in performance, just from that, if you have lots of INSERT (and UPDATE) operations going on!
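A rough sketch of that change (constraint names are made up; the column names come from the question). Keeping the old four-column key as a unique constraint preserves its uniqueness guarantee:
-- Add the surrogate key
ALTER TABLE dbo.YourTable
    ADD ID int IDENTITY(1,1) NOT NULL;
-- Drop the existing clustered PK (name assumed), then re-create the keys
ALTER TABLE dbo.YourTable
    DROP CONSTRAINT PK_YourTable;
ALTER TABLE dbo.YourTable
    ADD CONSTRAINT PK_YourTable_ID PRIMARY KEY CLUSTERED (ID),
        CONSTRAINT UQ_YourTable_Natural
            UNIQUE NONCLUSTERED ([Sequence], [Date], [Client], [Hash]);
Bear in mind that dropping and re-creating the clustered index rewrites the table and all of its nonclustered indexes, so schedule it for a maintenance window.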
See some more background info on what makes a good clustering key, and what is important about them, here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
I ultimately agree with Erik's last paragraph:
"Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized..."
This is the basic thing I force people to learn: there's no universal solution.
You have to know your data and the actions performed against it. You have to know how frequent the different types of actions are, their impact, and their expected execution times (you don't have to hard-tune some rarely executed query and impact everything else if the end user agrees its execution time is not so important; say, waiting a few minutes for some report once per week is okay). Of course, as Erik said
"performance isn't just about user wait time or even query response time but also about server resources"
If such a query affects overall server performance, it should be considered a serious candidate for optimization, even if its execution time is fine. I've seen some very fast queries that used a huge amount of CPU on multiprocessor servers, while a slightly slower solution was incomparably "lighter" from a resource utilization point of view. In such cases I almost always go for the slower one.
Once you know what is your goal you can decide how many indexes you need and which one should be clustered. Unique constraints, filtered indexes, indexes with included columns are quite powerful tools for tuning. Choosing proper columns is important, but often choosing proper order of columns is even more important. And at the end, don't kill insert/update performance with tons of indexes if the table is frequently modified.

Database maintenance, big binary table optimization

I have a huge database, around 1 TB in size. Most of the space is consumed by a table which stores images; the table currently has almost 800k rows.
Server response time has increased, and I would like to know which techniques you would recommend: partitioning? Or how should I reorganize the table?
Every row is accessed by the image ID column, and the table has its clustered index on that column. Every two days I reorganize the index and every 7 days I rebuild it, but it doesn't seem to be working.
Any suggestions?
If the table is clustered by image_id and you always access it by image_id, then the size of the table is irrelevant, and so is the fragmentation (no need to rebuild).
If you see a performance decrease, then there must be something else at play. Are you doing range scans? Look in sys.dm_db_index_usage_stats: does the user_scans column differ from 0? If so, you have queries that do scans.
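Something along these lines will show it (the table name dbo.Images is a placeholder for the real one):
-- How each index on the images table is actually being used
SELECT i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups
FROM   sys.dm_db_index_usage_stats AS s
JOIN   sys.indexes AS i
       ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE  s.database_id = DB_ID()
  AND  s.object_id   = OBJECT_ID('dbo.Images');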
Unless you measure where the time increase occurs, you'll be shooting in the dark and never solve the problem correctly. Apply a methodical approach, like Waits and Queues, to identify the problem.
One thing I can tell you right now: partitioning is never a performance improvement. It is intended for data maintenance (switch in/switch out) and for spreading the load in a controlled fashion across filegroups. But you should never expect partitioning to improve performance; at best you can hope for performance equal to that of the non-partitioned table.
If the response time is increasing, you must be doing more with this table than just pulling images for ids?
What other data columns are stored in your images table?
If you have a clustered index on an id (probably an identity), that's fine, but adding a nonclustered index that can cover your search criteria will probably help.
Say you also have columns for name or tag or region or whatever in this images table (and assuming you aren't going to vertically partition it into separate tables), then having a nonclustered index such as (tag, id) INCLUDE (name), or whatever else matches your usage patterns, will help a lot.
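For illustration only - the columns here (Tag, Name) are hypothetical stand-ins for whatever the table actually stores:
-- Mirrors the suggestion above: key on (Tag, ID) with Name included,
-- so tag lookups are covered without touching the base table.
CREATE NONCLUSTERED INDEX IX_Images_Tag
    ON dbo.Images (Tag, ID)
    INCLUDE (Name);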
Remember: a clustered index is not a separate index structure; it's the way the data itself is organized. It usually won't help much with searches on other columns - it primarily works well for lookups on the clustering key, when you are reading almost every column, and when streaming data in the order of the clustered index.

Is this a bad indexing strategy for a table?

The table in question is part of a database that a vendor's software uses on our network. The table contains metadata about files. The schema of the table is as follows
Metadata
ResultID (PK, int, not null)
MappedFieldname (char(50), not null)
Fieldname (PK, char(50), not null)
Fieldvalue (text, null)
There is a clustered index on ResultID and Fieldname. This table typically contains millions of rows (in one case, it contains 500 million). The table is populated by 24 workers running 4 threads each when data is being "processed". This results in many non-sequential inserts. Later after processing, more data is inserted into this table by some of our in-house software. The fragmentation for a given table is at least 50%. In the case of the largest table, it is at 90%. We do not have a DBA. I am aware we desperately need a DB maintenance strategy. As far as my background, I'm a college student working part time at this company.
My question is this, is a clustered index the best way to go about this? Should another index be considered? Are there any good references for this type and similar ad-hoc DBA tasks?
The indexing strategy entirely depends on how you query the table and how much performance you need to get out of the respective queries.
A clustered index can force rows to be moved physically (on disk) when out-of-sequence inserts land on a full page (this is called a "page split"). In a large table with no free space on the index pages, this can take some time.
If you are not absolutely required to have a clustered index spanning two fields, then don't. If it is really more of a UNIQUE constraint, then by all means make it a UNIQUE constraint. No re-sorting is required for those.
Determine what the typical query against the table is, and place indexes accordingly. The more indexes you have, the slower data changes (INSERTs/UPDATEs/DELETEs) will go. Don't create too many indexes, e.g. on fields that are unlikely to be filtered/sorted on.
Create combined indexes only on fields that are filtered/sorted on together, typically.
Look hard at your queries - the ones that hit the table for data. Will the index serve? If you have an index on (ResultID, FieldName) in that order, but you are querying for the possible ResultID values for a given Fieldname, it is likely that the DBMS will ignore the index. By contrast, if you have an index on (FieldName, ResultID), it will probably use the index - certainly for simple value lookups (WHERE FieldName = 'abc'). In terms of uniqueness, either index works well; in terms of query optimization, there is (at least potentially) a huge difference.
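For illustration, an index supporting lookups by Fieldname on the Metadata table might look like this (the index name is made up):
-- Leading on Fieldname lets WHERE Fieldname = 'abc' seek instead of scan;
-- ResultID as the second key column keeps it useful for combined lookups.
CREATE NONCLUSTERED INDEX IX_Metadata_Fieldname_ResultID
    ON dbo.Metadata (Fieldname, ResultID);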
Use EXPLAIN (or, in SQL Server, the estimated/actual execution plan) to see how your queries are being handled by your DBMS.
Clustered vs non-clustered indexing is usually a second-order optimization effect in the DBMS. If you have the index correct, there is a small difference between clustered and non-clustered index (with a bigger update penalty for a clustered index as compensation for slightly smaller select times). Make sure everything else is optimized before worrying about the second-order effects.
The clustered index is OK as far as I see. Regarding other indexes you will need to provide typical SQL queries that operate on this table. Just creating an index out of the blue is never a good idea.
You're talking about fragmentation and indexing; does that mean you suspect query execution is slowing down? Or do you simply want to shrink/defragment the database/index?
It is a good idea to have a task to defragment indexes from time to time during off-hours, though you have to consider that with frequent/random inserts it does not hurt to have some spare space in the table to prevent page splits (which do affect performance).
I am aware we desperately need a DB maintenance strategy.
+1 for identifying that need
As far as my background, I'm a college student working part time at this company
Keep studying, gain experience, but get an experienced consultant in in the meantime.
The table is populated by 24 workers running 4 threads each
I presume this is pretty mission critical during the working day, and downtime is bad news? If so don't clutz with it.
There is a clustered index on ResultID and Fieldname
Is ResultID the first column in the PK, as you indicate?
If so I'll bet that it is insufficiently selective and, depending on what the needs are of the queries, the order of the PK fields should be swapped (notwithstanding that this compound key looks to be a poor choice for the clustered PK)
What's the result of:
SELECT COUNT(*), COUNT(DISTINCT ResultID) FROM MyTable
If the first count is, say, 4x as big as the second, or more, you will most likely be getting scans in preference to seeks, because of the low selectivity of ResultID, and some simple changes will give huge performance improvements.
Also, Fieldname is quite wide (50 chars) so any secondary indexes will have 50 + 4 bytes added to every index entry. Are the fields really CHAR rather than VARCHAR?
Personally I would consider increasing the density of the leaf pages. At 90% you will only leave a few gaps - maybe one per page. But with a large table of 500 million rows, the higher packing density may mean fewer levels in the tree, and thus fewer seeks for retrieval. Against that, almost every insert, for a given page, will require a page split. This would favour inserts that are clustered, so may not be appropriate (given that your insert data is probably not clustered). Like many things, you'd need to run a test to establish what index key density works best. SQL Server has tools to help analyse how queries are being parsed, whether they are being cached, how many scans of the table they cause, which queries are "slow running", and so on.
Get a consultant in to take a look and give you some advice. This isn't the kind of question where answers here will give you a safe solution to implement.
You really REALLY need to have some carefully thought-through maintenance policies for tables that have 500 million rows and shed-loads of inserts daily. Sorry, but I have enormous frustration with companies that get into this state.
The table needs defragmenting (your options will become fewer if you don't have a clustered index, so keep that until you decide that there is a better candidate). "Online" defragmentation methods will have modest impact on performance, and can chug away - and can safely be aborted if they overrun time / CPU constraints [although that will most likely take some programming]. If you have a "quiet" slot then use it for table defragmentation and updating the statistics on indexes. Don't wait until the weekend to try to do all tables in one go - do as much/many as you can during any quiet time daily (during the night presumably).
Defragmenting the tables is likely to lead to a huge increase in transaction log usage, so make sure that the TLogs are backed up frequently (we have a 10 minute TLog backup policy, which we increase to every minute during table defragging so that the defragging process doesn't become the definition of required TLog space!)
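To see where you stand before and after, something like this reports fragmentation per index on SQL Server 2005 or later (the table name dbo.Metadata follows the schema above):
-- Average fragmentation per index; use it to decide between
-- REORGANIZE (light, online) and REBUILD (heavier, resets the fill factor).
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Metadata'),
                                      NULL, NULL, 'LIMITED') AS ps
JOIN   sys.indexes AS i
       ON i.object_id = ps.object_id AND i.index_id = ps.index_id;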

What's your approach for optimizing large tables (+1M rows) on SQL Server?

I'm importing Brazilian stock market data to a SQL Server database. Right now I have a table with price information for three kinds of assets: stocks, options and forwards. I'm still on 2006 data and the table has over half a million records. I have 12 more years of data to import, so the table will exceed a million records for sure.
Now, my first approach for optimization was to keep the data to a minimum size, so I reduced the row size to an average of 60 bytes, with the following columns:
[Stock] [int] NOT NULL
[Date] [smalldatetime] NOT NULL
[Open] [smallmoney] NOT NULL
[High] [smallmoney] NOT NULL
[Low] [smallmoney] NOT NULL
[Close] [smallmoney] NOT NULL
[Trades] [int] NOT NULL
[Quantity] [bigint] NOT NULL
[Volume] [money] NOT NULL
Now, my second approach for optimization was to make a clustered index. Actually the primary key is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and are probably ordered by date. Is this second assumption true?
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Now for a third approach, I'm thinking of maybe splitting the table into three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help, or does it not matter much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
Another approach would be using the partition feature of SQL Server. I don't know much about it, but I think it's normally used when the tables are large and you can span them across multiple disks to reduce I/O latency, am I right? Would partitioning be of any help in this case? I believe I can partition the newest values (latest years) and the oldest values into different tables. The probability of seeking the newest data is higher, and with a small partition it will probably be faster, right?
What other good approaches would make this as fast as possible? The main SELECT usage of the table will be seeking a specific range of records for a specific asset, like the latest 3 months of asset X. There will be other usages, but this will be the most common, possibly executed by more than 3k users concurrently.
At 1 million records, I wouldn't consider this a particularly large table needing unusual optimization techniques such as splitting the table up, denormalizing, etc. But those decisions will come when you've tried all the normal means that don't affect your ability to use standard query techniques.
Now, my second approach for optimization was to make a clustered index. Actually the primary key is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and are probably ordered by date. Is this second assumption true?
It's logically true - the clustered index defines the logical ordering of the records on the disk, which is all you should be concerned about. SQL Server may forego the overhead of sorting within a physical block, but it will still behave as if it did, so it's not significant. Querying for one stock will probably be 1 or 2 page reads in any case; and the optimizer doesn't benefit much from unordered data within a page read.
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Not necessarily significantly. There isn't a linear relationship between table size and query speed. There are usually a lot more considerations that are more important. I wouldn't worry about it in the range you describe. Is that the reason you're concerned? 200 ms would seem to me to be great, enough to get you to the point where your tables are loaded and you can start doing realistic testing, and get a much better idea of real-life performance.
Now for a third approach, I'm thinking of maybe splitting the table into three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help, or does it not matter much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
No! This kind of optimization is so premature it's probably stillborn.
Another approach would be using the partition feature of SQL Server.
Same comment. You will be able to stick for a long time to strictly logical, fully normalized schema design.
What other good approaches would make this as fast as possible?
The best first step is clustering on stock. Insertion speed is of no consequence at all until you are looking at multiple records inserted per second - I don't see anything anywhere near that activity here. This should get you close to maximum efficiency because it will efficiently read every record associated with a stock, and that seems to be your most common index. Any further optimization needs to be accomplished based on testing.
A million records really isn't that big. It does sound like it's taking too long to search though - is the column you're searching against indexed?
As ever, the first port of call should be the SQL profiler and query plan evaluator. Ask SQL Server what it's going to do with the queries you're interested in. I believe you can even ask it to suggest changes such as extra indexes.
I wouldn't start getting into partitioning etc just yet - as you say, it should all comfortably sit in memory at the moment, so I suspect your problem is more likely to be a missing index.
Check your execution plan on that query first. Make sure your indexes are being used. I've found that a million records is not a lot. To give some perspective, we had an inventory table with 30 million rows in it, and our entire query, which joined tons of tables and did lots of calculations, could run in under 200 ms. We found that on a quad-proc 64-bit server we could handle significantly more records, so we never bothered partitioning.
You can use SQL Profiler to see the execution plan, or just run the query from SQL Management Studio or Query Analyzer.
Re-evaluate the indexes... that's the most important part. The size of the data doesn't really matter - well, it does, but not entirely for speed purposes.
My recommendation is to rebuild the indexes for that table and make a composite one for the columns you'll need the most. Now that you have only a few records, play with the different indexes; otherwise it'll get quite annoying to try new things once you have all the historical data in the table.
After you do that, review your query, make the query plan evaluator your friend, and check whether the engine is using the right index.
I just read your last post, and there's one thing I don't get: are you querying the table while you insert data, at the same time? What for? By inserting, do you mean one record or hundreds of thousands? How are you inserting - one by one?
But again, the key here is the indexes; don't mess with partitioning and such yet, especially with a million records - that's nothing. I have tables with 150 million records, and returning 40k specific records takes the engine about 1500ms...
I work for a school district and we have to track attendance for each student. It's how we make our money. My table that holds the daily attendance mark for each student is currently 38.9 million records large. I can pull up a single student's attendance very quickly from this. We keep 4 indexes (including the primary key) on this table. Our clustered index is student/date, which keeps all the student's records ordered by that. We take a hit on inserts to this table when an old record for a student is inserted out of order, but it is a worthwhile risk for our purposes.
With regards to select speed, I would certainly take advantage of caching in your circumstance.
You've mentioned that your primary key is a compound on (Stock, Date), and clustered. This means the table is organised by Stock and then by Date. Whenever you insert a new row, it has to insert it into the middle of the table, and this can cause the other rows to be pushed out to other pages (page splits).
I would recommend trying to reverse the primary key to (Date, Stock), and adding a non-clustered index on Stock to facilitate quick lookups for a specific stock. This will allow inserts to always happen at the end of the table (assuming you're inserting in order of date), won't affect the rest of the table, and gives less chance of page splits.
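A hedged sketch of that layout (the table name dbo.Quotes is made up; the columns come from the question, and this assumes the existing primary key has been dropped first):
-- Cluster on ([Date], Stock) so new rows land at the end of the table
ALTER TABLE dbo.Quotes
    ADD CONSTRAINT PK_Quotes PRIMARY KEY CLUSTERED ([Date], Stock);
-- Per-stock lookups; the clustered key ([Date], Stock) is carried along in
-- this index, so it is effectively ordered by (Stock, [Date]).
CREATE NONCLUSTERED INDEX IX_Quotes_Stock
    ON dbo.Quotes (Stock);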
The execution plan shows it's using the clustered index just fine, but I forgot an extremely important fact: I'm still inserting data! The insert is probably locking the table too often. Is there a way we can see this bottleneck?
The execution plan doesn't seem to show anything about lock issues.
Right now this data is only historical; when the importing process is finished, the inserts will stop and become much less frequent. But I will soon have a larger table for real-time data that will suffer from this constant-insert problem and will be bigger than this one. So any approach to optimizing this kind of situation is very welcome.
Another solution would be to create a historical table for each year, put all those tables in a historical database, fill them all in, and then create the appropriate indexes for them. Once you are done with this, you won't have to touch them ever again. Why would you have to keep on inserting data? To query all those tables you just "union all" them :p
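A minimal sketch of that union-all idea (the yearly table and view names are made up; the columns come from the question):
-- One read-only view stitches the per-year tables back together
CREATE VIEW dbo.QuotesHistory
AS
    SELECT Stock, [Date], [Open], High, Low, [Close], Trades, Quantity, Volume
    FROM   dbo.Quotes2006
    UNION ALL
    SELECT Stock, [Date], [Open], High, Low, [Close], Trades, Quantity, Volume
    FROM   dbo.Quotes2007
    -- ...one UNION ALL branch per historical year...
With a CHECK constraint on the year in each table, SQL Server can treat this as a partitioned view and skip the branches that can't match a date filter.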
The current-year table should be very different from these historical tables. From what I understood, you are planning to insert records on the go? I'd plan something different, like doing a bulk insert or something similar every now and then throughout the day. Of course, all this depends on what you want to do.
The problem here seems to be in the design. I'd go for a new design; the one you have now, from what I understand, is not suitable.
Actually the primary key is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and are probably ordered by date. Is this second assumption true?
Indexes in SQL Server are always sorted by column order in index. So an index on [stock,date] will first sort on stock, then within stock on date. An index on [date, stock] will first sort on date, then within date on stock.
When doing a query, you should always include the first column(s) of an index in the WHERE part, else the index cannot be efficiently used.
For your specific problem: if date range queries for stocks are the most common usage, then put the primary key on [date, stock], so the data will be stored sequentially by date on disk and you should get the fastest access. Build up other indexes as needed. Do an index rebuild/statistics update after inserting lots of new data.
