Which Database Design is more effective in this scenario?

DB design 1:
There is 1 table
CREATE TABLE Items (id INT PRIMARY KEY, name VARCHAR(20), description VARCHAR(10000));
DB design 2:
There are 2 tables
CREATE TABLE Table1 (id INT PRIMARY KEY, name VARCHAR(20));
CREATE TABLE Table2 (id INT PRIMARY KEY, description VARCHAR(10000));
Note: each id must have a description associated with it. We don't query the description nearly as often as the name.
With design 1, one simple query gets both name and description, with no join needed. But if we have 1 million records, will it be slow?
With design 2, we need a join, so the database has to search for and match ids. This could be slow, but since we rarely query the description, the cost is only paid occasionally rather than on every query.
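For concreteness, the design-2 lookup being described would be something like this (a sketch using the table names above; the id value is arbitrary):
-- name and description live in separate tables, so fetching both requires a join
SELECT t1.id, t1.name, t2.description
FROM Table1 AS t1
JOIN Table2 AS t2 ON t2.id = t1.id
WHERE t1.id = 42;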
So what is the better design in this scenario?

That's called vertical partitioning or "row splitting" and is no silver bullet (nothing is). You are not getting "better performance" you are just getting "different performance". Whether one set of performance characteristics is better than the other is a matter of engineering tradeoff and varies from one case to another.
In your case, 1 million rows will fit comfortably into DBMS cache on today's hardware, producing excellent performance. So unless some of the other reasons apply, keep it simple, in a single table.
And if it's 1 billion rows (or 1 trillion, or whatever number is too large for the memory standards of the day), keep in mind that if you have indexed your data correctly, the performance will remain excellent long after the data has grown bigger than the cache.
Only in the most extreme of cases will you need to vertically partition the table for performance reasons - in which case you'll have to measure in your own environment with your own access patterns, and determine whether it brings any performance benefit at all, and whether that benefit is large enough to make up for the cost of the extra JOINs.

That's over-optimization for 1 million records, in my opinion. It's really not that much. You could test the actual performance by generating about a million rows of dummy data in a test database and querying it. You'll see how it performs.
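If you want to try that, here is a minimal T-SQL sketch for loading a million dummy rows into the design-1 table above and timing a lookup (the row count and padding are arbitrary):
-- build a 1,000,000-row tally from system views and insert dummy data
;WITH n AS (
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO Items (id, name, description)
SELECT i, 'name' + CAST(i AS VARCHAR(10)), REPLICATE('x', 200)
FROM n;
-- then time a typical lookup
SET STATISTICS TIME ON;
SELECT name, description FROM Items WHERE id = 123456;
SET STATISTICS TIME OFF;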

Related

Optimal Strategy to Resolve Performance in Search Operations - SQL Server 2008

I'm working on a mobile website which is growing in popularity and this is leading to growth in some key database tables - and we're starting to see some performance issues when accessing those tables. Not being database experts (nor having the money to hire any at this stage) we're struggling to understand what is causing the performance problems. Our tables are not that big so SQL Server should be able to handle them fine and we've done everything we know to do in terms of optimising our queries. So here's the (pseudo) table structure:
[user] (approx. 40,000 rows, 37 cols):
id INT (pk)
content_group_id INT (fk)
[username] VARCHAR(20)
...
[content_group] (approx. 200,000 rows, 5 cols):
id INT (pk)
title VARCHAR(20)
...
[content] (approx. 1,000,000 rows, 12 cols):
id INT (pk)
content_group_id INT (fk)
content_type_id INT (fk)
content_sub_type_id INT (fk)
...
[content_type] (2 rows, 3 cols):
id INT (pk)
...
[content_sub_type] (8 rows, 3 cols):
id INT (pk)
content_type_id INT (fk)
...
We're expecting those row counts to grow considerably (in particular the user, content_group, and content tables). Yes, the user table has quite a few columns, and we've identified some that can be moved into other tables. We've also applied a number of indexes to the affected tables, which have helped.
The big performance problems are the stored procedures we're using to search for users (which include joins to the content table on the content_group_id field). We have tried to modify the WHERE and AND clauses using various different approaches and we think we have got them as good as we can but still it's too slow.
One other thing we tried which hasn't helped was to put an indexed view over the user and content tables. There was no noticeable performance gain when we did this so we've abandoned that idea due to the extra level of complexity inherent in having a view layer.
So, what are our options? We can think of a few but all come with pros and cons:
Denormalise the Table Structure
Add multiple direct foreign key constraints between the user and content tables - so there would be a different foreign key to the content table for each content sub type.
Pros:
Joins to the content table will be more optimal because they can use its primary key.
Cons:
There will be a lot of changes to our existing stored procedures and website code.
Maintaining up to 8 additional foreign keys (more realistically we'll only use 2 of these) will not be anywhere near as easy as the current single key.
More Denormalisation of the Table Structure
Just duplicate the fields we need from the content table into the user table directly.
Pros:
No more joins to the content table, which significantly reduces the work SQL Server has to do.
Cons:
Same as above: extra fields to maintain in the user table, changes to SQL and website code.
Create a Mid-Tier Indexing Layer
Using something like Lucene.NET, we'd put an indexing layer above the database. This would in theory improve the performance of all searches and at the same time decrease the load on the database server.
Pros:
This is a good long-term solution. Lucene exists to improve search engine performance.
Cons:
There will be a much larger development cost in the short term - and we need to solve this problem ASAP.
So those are the things we've come up with, and at this stage we're thinking the second option is the best. I'm aware that denormalising has its issues; however, sometimes it's best to sacrifice architectural purity to get performance gains, so we're prepared to pay that cost.
Are there any other approaches which might work for us? Are there any additional pros and/or cons with the approaches I've outlined above which may influence our decisions?
non clustered index seek from the content table using the content_sub_type_id. This is followed by a Hash Match on the content_group_id against the content table
This description would indicate that your expensive query filters the content table based on fields from content_type:
select ...
from content c
join content_type ct on c.content_type_id = ct.id
where ct.<field> = <value>;
This table design, and the resulting problem you're seeing, is actually quite common. The problems arise mainly from the very low selectivity of the lookup tables (content_type has 2 rows, so a given content_type_id value in content probably matches about 50% of the table, which is huge). There are several solutions you can try:
1) Organize the content table on a clustered index with content_type_id as the leading key. This would allow the join to do range scans and also avoid the key/bookmark lookup needed to complete the projection. As a clustered index change, it would have implications for other queries, so it has to be carefully tested. The primary key on content would obviously have to be enforced with a non-clustered constraint.
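A minimal sketch of option 1, assuming the table/column names from the question (the constraint and index names are made up, and any foreign keys referencing the primary key would have to be handled around this):
-- drop the clustered primary key and re-create it as a non-clustered constraint
ALTER TABLE dbo.content DROP CONSTRAINT PK_content;
ALTER TABLE dbo.content ADD CONSTRAINT PK_content PRIMARY KEY NONCLUSTERED (id);
-- cluster the table with the low-selectivity lookup column as the leading key
CREATE CLUSTERED INDEX CIX_content_type ON dbo.content (content_type_id, id);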
2) Pre-read the content_type_id value and then formulate the query without the join between content and content_type:
select ...
from content c
where c.content_type_id = @contentTypeId;
This works only if the selectivity of content_type_id is high (many distinct values with few rows each), which I doubt is your case (you probably have very few content types, with many entries each).
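A quick way to check that selectivity before picking an option (a sketch; names as in the question):
-- a handful of huge buckets means filtering on content_type_id alone cannot seek effectively
SELECT content_type_id, COUNT(*) AS rows_per_type
FROM dbo.content
GROUP BY content_type_id
ORDER BY rows_per_type DESC;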
3) Denormalize content_Type into content. You mention denormalization, but your proposal of denormalizing content into users makes little sense to me. Drop the content_type table, pull in the content_type fields into the content table itself, and live with all the denormalization problems.
4) Pre-join in a materialized view. You say you already tried that, but I doubt that you tried the right materialized view. You also need to understand that only Enterprise Edition uses the materialized view's index automatically; all other editions require the NOEXPAND hint:
create view vwContentType
with schemabinding
as
select c.content_type_id, c.id as content_id
from dbo.content c
join dbo.content_type ct on c.content_type_id = ct.id;
create unique clustered index cdxContentType on vwContentType (content_type_id, content_id);
select ...
from content c
join vwContentType ct with (noexpand)
on ct.content_id = c.id
where ct.content_type_id = @contentTypeId;
Solutions 2), 3) and 4) are mostly academic. Given the very low selectivity of content_type_id, the only solution that stands a chance is to make it the leading key in the clustered index of content. I did not expand the analysis to content_sub_type, but with only 8 rows I'm willing to bet it has the very same problem, which would require pushing it into the clustered index as well (perhaps as the second key).

Approaches to table partitioning in SQL Server

The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect usage statistics (queries run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning that best balances data size and gives the best affinity for the most often-run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index ('table' here really means the heap or clustered index), not per table, so the question is not what to partition on, but which indexes to partition (or not) and what partitioning function to use. Your clustered indexes on the two tables are the most likely candidates, obviously (there isn't much sense in partitioning just a non-clustered index and not the clustered one), so unless you're considering a redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I were to venture a guess, I'd say that for any data that accumulates over time (like 'cases' with a 'year'), the most natural partitioning is a sliding window, as sketched below.
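A minimal sketch of a yearly sliding-window setup for the Case table (the function, scheme and index names, and the boundary years, are illustrative):
-- one partition per year; a new boundary is split in each year and the oldest merged out
CREATE PARTITION FUNCTION pfCaseYear (int)
AS RANGE RIGHT FOR VALUES (2006, 2007, 2008, 2009);
CREATE PARTITION SCHEME psCaseYear
AS PARTITION pfCaseYear ALL TO ([PRIMARY]);
-- rebuilding the clustered index on the scheme partitions the table itself;
-- the partitioning column must be part of the clustered key
CREATE CLUSTERED INDEX CIX_Case_Year ON dbo.[Case] ([Year], [PrimaryKey])
ON psCaseYear ([Year]);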
If you have no other choice, you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on how your UniqueIdentifier values are generated, you might need to start allocating ids manually).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single result.
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected queries, but it's better than hitting the size limit of a table.
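A sketch of what that looks like in practice (the view name and filter values are illustrative):
-- ten identically-structured tables; rows are routed by PrimaryKey % 10 at insert time
CREATE VIEW vwCase AS
SELECT * FROM dbo.Case00 UNION ALL
SELECT * FROM dbo.Case01 UNION ALL
-- ... Case02 through Case08 follow the same pattern ...
SELECT * FROM dbo.Case09;
GO
-- every query has to touch all ten tables through the view
SELECT * FROM vwCase WHERE [Year] = 2008 AND [Status] = 'Open';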
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps which could improve performance by different choices in the normalization/de-/partial-normalization? Are there options to transform the data into a Kimball-style dimensional star model which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.

Is this a bad indexing strategy for a table?

The table in question is part of a database that a vendor's software uses on our network. The table contains metadata about files. The schema of the table is as follows
Metadata
ResultID (PK, int, not null)
MappedFieldname (char(50), not null)
Fieldname (PK, char(50), not null)
Fieldvalue (text, null)
There is a clustered index on ResultID and Fieldname. This table typically contains millions of rows (in one case, it contains 500 million). The table is populated by 24 workers running 4 threads each when data is being "processed". This results in many non-sequential inserts. Later after processing, more data is inserted into this table by some of our in-house software. The fragmentation for a given table is at least 50%. In the case of the largest table, it is at 90%. We do not have a DBA. I am aware we desperately need a DB maintenance strategy. As far as my background, I'm a college student working part time at this company.
My question is this, is a clustered index the best way to go about this? Should another index be considered? Are there any good references for this type and similar ad-hoc DBA tasks?
The indexing strategy entirely depends on how you query the table and how much performance you need to get out of the respective queries.
A clustered index can force rows to be physically moved on disk when out-of-sequence inserts land on full index pages (this is called a "page split"). In a large table with no free space on the index pages, this can take some time.
If you are not absolutely required to have a clustered index spanning two fields, then don't. If it is more like a kind of a UNIQUE constraint, then by all means make it a UNIQUE constraint. No re-sorting is required for those.
Determine what the typical query against the table is, and place indexes accordingly. The more indexes you have, the slower data changes (INSERTs/UPDATEs/DELETEs) will go. Don't create too many indexes, e.g. on fields that are unlikely to be filtered/sorted on.
Create combined indexes only on fields that are filtered/sorted on together, typically.
Look hard at your queries - the ones that hit the table for data. Will the index serve? If you have an index on (ResultID, FieldName) in that order, but you are querying for the possible ResultID values for a given Fieldname, it is likely that the DBMS will ignore the index. By contrast, if you have an index on (FieldName, ResultID), it will probably use the index - certainly for simple value lookups (WHERE FieldName = 'abc'). In terms of uniqueness, either index works well; in terms of query optimization, there is (at least potentially) a huge difference.
Use EXPLAIN to see how your queries are being handled by your DBMS.
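For example, here is a sketch of the two orderings being compared (the index names are made up):
-- helps queries that filter on ResultID (or on ResultID and Fieldname together)
CREATE INDEX IX_Metadata_Result_Field ON Metadata (ResultID, Fieldname);
-- helps queries that filter on Fieldname alone, which the first index does not
CREATE INDEX IX_Metadata_Field_Result ON Metadata (Fieldname, ResultID);
-- this lookup can seek on the second index but not the first
SELECT ResultID FROM Metadata WHERE Fieldname = 'abc';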
Clustered vs non-clustered indexing is usually a second-order optimization effect in the DBMS. If you have the index correct, there is a small difference between clustered and non-clustered index (with a bigger update penalty for a clustered index as compensation for slightly smaller select times). Make sure everything else is optimized before worrying about the second-order effects.
The clustered index is OK as far as I see. Regarding other indexes you will need to provide typical SQL queries that operate on this table. Just creating an index out of the blue is never a good idea.
You're talking about fragmentation and indexing; does that mean you suspect query execution is slowing down? Or do you simply want to shrink/defragment the database/index?
It is a good idea to have a task to defragment indexes from time to time during off-hours, though you have to consider that with frequent/random inserts it does not hurt to have some spare space in the table to prevent page splits (which do affect performance).
I am aware we desperately need a DB maintenance strategy.
+1 for identifying that need
As far as my background, I'm a college student working part time at this company
Keep studying, gain experience, but get an experienced consultant in in the meantime.
The table is populated by 24 workers running 4 threads each
I presume this is pretty mission critical during the working day, and downtime is bad news? If so, don't mess with it.
There is a clustered index on ResultID and Fieldname
Is ResultID the first column in the PK, as you indicate?
If so, I'll bet that it is insufficiently selective and, depending on the needs of the queries, the order of the PK fields should be swapped (notwithstanding that this compound key looks to be a poor choice for the clustered PK).
What's the result of:
SELECT COUNT(*), COUNT(DISTINCT ResultID) FROM MyTable
If the first count is, say, 4x as big as the second, or more, you will most likely be getting scans in preference to seeks because of the low selectivity of ResultID, and some simple changes will give huge performance improvements.
Also, Fieldname is quite wide (50 chars) so any secondary indexes will have 50 + 4 bytes added to every index entry. Are the fields really CHAR rather than VARCHAR?
Personally I would consider increasing the density of the leaf pages. At 90% you will only leave a few gaps - maybe one per page. But with a large table of 500 million rows the higher packing density may mean fewer levels in the tree, and thus fewer seeks for retrieval. Against that, almost every insert for a given page will require a page split. This would favour inserts that are clustered, so it may not be appropriate (given that your insert data is probably not clustered). Like many things, you'd need to test to establish what index key density works best. SQL Server has tools to help analyse how queries are being parsed, whether they are being cached, how many scans of the table they cause, which queries are "slow running", and so on.
Get a consultant in to take a look and give you some advice. This isn't a question where the answers here are going to give you a safe solution to implement.
You really REALLY need to have some carefully thought-through maintenance policies for tables that have 500 million rows and shed-loads of inserts daily. Sorry, but I have enormous frustration with companies that get into this state.
The table needs defragmenting (your options will become fewer if you don't have a clustered index, so keep that until you decide that there is a better candidate). "Online" defragmentation methods will have modest impact on performance, and can chug away - and can safely be aborted if they overrun time / CPU constraints [although that will most likely take some programming]. If you have a "quiet" slot then use it for table defragmentation and updating the statistics on indexes. Don't wait until the weekend to try to do all tables in one go - do as much/many as you can during any quiet time daily (during the night presumably).
Defragmenting the tables is likely to lead to a huge increase in transaction log usage, so make sure any TLogs are backed up frequently (we have a 10-minute TLog backup policy, which we increase to every minute during table defragging so that the defragging process doesn't become the definition of required TLog space!).
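A minimal sketch of that kind of maintenance job, run during a quiet slot (the fill factor ties in with the leaf-density point above; thresholds and names are illustrative):
-- check fragmentation per index on the Metadata table
SELECT i.name, s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Metadata'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i ON i.object_id = s.object_id AND i.index_id = s.index_id;
-- "online" defragmentation: REORGANIZE is interruptible and low-impact
ALTER INDEX ALL ON dbo.Metadata REORGANIZE;
-- heavier option for a long quiet window, leaving some free space per page,
-- followed by a statistics refresh
ALTER INDEX ALL ON dbo.Metadata REBUILD WITH (FILLFACTOR = 90);
UPDATE STATISTICS dbo.Metadata;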

Choice of primary key type

I have a table that will potentially have a high number of inserts per second, and I'm trying to choose the type of primary key I want to use. For illustrative purposes let's say it's a users table. I am trying to choose between GUID and BIGINT as the primary key and ultimately as the UserID across the app. If I use a GUID, I save a trip to the database to generate a new ID, but a GUID is not "user-friendly" and it's not possible to partition the table by this ID (which I'm planning to do). Using BIGINT is much more convenient, but generating it is a problem - I can't use IDENTITY (there is a reason for that), so my only choice is to have some helper table that contains the last used ID, and then I call this stored proc:
create proc GetNewID @ID BIGINT OUTPUT
as
begin
update HelperIDTable set @ID = id, id = id + 1
end
to get the new id. But then this helper table is an obvious bottleneck and I'm concerned with how many updates per second it can do.
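For reference, a caller would use it roughly like this (a sketch; the variable name is illustrative):
DECLARE @NewID bigint;
EXEC GetNewID @ID = @NewID OUTPUT;  -- one extra round trip before every insert
-- @NewID is then used as the id of the row being inserted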
I really like the idea of using BIGINT as the pk, but the bottleneck problem concerns me - is there a way to roughly estimate how many ids it could produce per second? I realize it highly depends on hardware, but are there any physical limitations, and what order of magnitude are we looking at? 100s/sec? 1000s/sec?
Any ideas on how to approach the problem are highly appreciated! This problem hasn't let me sleep for many nights now!
Thanks!
Andrey
GUIDs seem to be a natural choice - and if you really must, you could probably argue to use them for the PRIMARY KEY of the table - the single value that uniquely identifies the row in the database.
What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry in each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT (good for 2+ billion rows) should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
So to sum it up: unless you have a really good reason, I would always recommend an INT IDENTITY field as the primary / clustered key on your table.
Marc
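A sketch of the pattern Marc describes, keeping the GUID as the primary key for identity purposes but clustering on a narrow, ever-increasing INT (all names are illustrative):
CREATE TABLE dbo.users (
    user_guid uniqueidentifier NOT NULL
        CONSTRAINT PK_users PRIMARY KEY NONCLUSTERED
        CONSTRAINT DF_users_guid DEFAULT NEWID(),
    cluster_id int IDENTITY(1,1) NOT NULL,
    username varchar(50) NOT NULL
);
-- narrow, ever-increasing clustering key: no random page splits, and only
-- 4 bytes carried into every non-clustered index
CREATE UNIQUE CLUSTERED INDEX CIX_users ON dbo.users (cluster_id);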
I try to use GUID PKs for all tables except small lookup tables. The GUID concept ensures that the identity of the object can safely be created in memory without a round trip to the database, and saved later without changing the identity.
When you need a "human readable" id, you can use an auto-increment int assigned when the row is saved. For partitioning you could also create the BIGINTs later, via a scheduled database job, for many users in one shot.
Do you want a primary key, for business reasons, or a clustered key, for storage concerns?
See stackoverflow.com/questions/1151625/int-vs-unique-identifier-for-id-field-in-database for a more elaborate post on the topic of PK vs. clustered key.
You really have to elaborate on why you can't use IDENTITY. Generating the IDs manually, and especially on the server with an extra round trip and an update just to generate each ID for the insert, won't scale. You'd be lucky to reach the lower 100s per second. The problem is not just the round trip and update time, but primarily the interaction of the ID-generation update with insert batching: the insert batching transaction will serialize ID generation. The workaround is to separate the ID generation into its own session so it can autocommit, but then the insert batching is pointless because the ID generation is not batched: it has to wait for a log flush after each ID generated in order to commit. Compared to this, a uuid will run circles around your manual ID generation. But uuids are a horrible choice for a clustering key because of fragmentation.
Try to hit your db with a script, perhaps using JMeter to simulate concurrent hits. Perhaps you can then measure for yourself how much load you can handle. Your DB could also be the bottleneck - which one is it? I would prefer PostgreSQL for heavy load, like Yahoo and Skype also do.
An idea that requires serious testing: try creating (inserting) new rows in batches - say 1,000 (10,000? 1M?) at a time. You could have a master (aka bottleneck) table listing the next one to use, or you might have a query that does something like
select min(id) from users where name = '';
Generate a fresh batch of empty rows in the morning, every hour, or whenever you're down to a certain number of free ones. This only addresses the issue of generating new IDs, but if that's the main bottleneck it might help.
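A sketch of the batching idea applied to the question's helper table: reserve a whole block of ids in one round trip instead of one at a time (the proc name and block size are illustrative):
CREATE PROC GetNewIDBlock @BlockSize int, @FirstID bigint OUTPUT
AS
BEGIN
    -- one atomic update hands out @BlockSize ids at once;
    -- @FirstID receives the value before the increment
    UPDATE HelperIDTable SET @FirstID = id, id = id + @BlockSize;
END
-- the caller then assigns @FirstID through @FirstID + @BlockSize - 1 in memory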
A table partitioning option: assuming a BIGINT ID column, how are you defining the partitions? If you are allowing for 1G rows per day, you could set up the new partition in the evening (day 1 = 1,000,000,000 through 1,999,999,999, day 2 = 2,000,000,000 through 2,999,999,999, etc.) and then swap it in when it's ready. You are of course limited to 1,000 partitions, so with BIGINTs you'll run out of partitions before you run out of ids.

What's your approach for optimizing large tables (+1M rows) on SQL Server?

I'm importing Brazilian stock market data into a SQL Server database. Right now I have a table with price information for three kinds of assets: stocks, options and forwards. I'm still on 2006 data and the table has over half a million records. I have 12 more years of data to import, so the table will certainly exceed a million records.
Now, my first approach for optimization was to keep the data to a minimum size, so I reduced the row size to an average of 60 bytes, with the following columns:
[Stock] [int] NOT NULL
[Date] [smalldatetime] NOT NULL
[Open] [smallmoney] NOT NULL
[High] [smallmoney] NOT NULL
[Low] [smallmoney] NOT NULL
[Close] [smallmoney] NOT NULL
[Trades] [int] NOT NULL
[Quantity] [bigint] NOT NULL
[Volume] [money] NOT NULL
Now, my second approach for optimization was to make a clustered index. Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is that second part true?
Right now, with half a million records, it's taking around 200 ms to select 700 quotes for a specific asset. I believe this number will get higher as the table grows.
Now, for a third approach, I'm thinking of maybe splitting the table into three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Will this approach help, or does it not matter much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
Another approach would be to use the partitioning feature of SQL Server. I don't know much about it, but I think it's normally used when tables are large and you can span them across multiple disks to reduce I/O latency - am I right? Would partitioning be any help in this case? I believe I can put the newest values (latest years) and the oldest values in different partitions. The probability of seeking the newest data is higher, and with a small partition it will probably be faster, right?
What would be other good approaches to make this as fast as possible? The main SELECT usage of the table will be seeking a specific range of records for a specific asset, like the latest 3 months of asset X. There will be other usages, but this will be the most common, possibly executed by more than 3k users concurrently.
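For reference, the most common query described above would look roughly like this (a sketch; the table name and variable are assumptions, the columns are the ones listed):
-- latest 3 months of quotes for one asset; a clustered index led by Stock
-- (or by Date, as discussed in the answers below) turns this into a narrow range scan
SELECT [Date], [Open], [High], [Low], [Close], Trades, Quantity, Volume
FROM dbo.Quotes
WHERE Stock = @StockId
AND [Date] >= DATEADD(MONTH, -3, GETDATE())
ORDER BY [Date];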
At 1 million records, I wouldn't consider this a particularly large table needing unusual optimization techniques such as splitting the table up, denormalizing, etc. But those decisions will come when you've tried all the normal means that don't affect your ability to use standard query techniques.
Now, my second approach for optimization was to make a clustered index. Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is that second part true?
It's logically true - the clustered index defines the logical ordering of the records on the disk, which is all you should be concerned about. SQL Server may forego the overhead of sorting within a physical block, but it will still behave as if it did, so it's not significant. Querying for one stock will probably be 1 or 2 page reads in any case; and the optimizer doesn't benefit much from unordered data within a page read.
Right now, with half a million records, it's taking around 200 ms to select 700 quotes for a specific asset. I believe this number will get higher as the table grows.
Not necessarily significantly. There isn't a linear relationship between table size and query speed. There are usually a lot more considerations that are more important. I wouldn't worry about it in the range you describe. Is that the reason you're concerned? 200 ms would seem to me to be great, enough to get you to the point where your tables are loaded and you can start doing realistic testing, and get a much better idea of real-life performance.
Now, for a third approach, I'm thinking of maybe splitting the table into three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Will this approach help, or does it not matter much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
No! This kind of optimization is so premature it's probably stillborn.
Another approach would be to use the partitioning feature of SQL Server.
Same comment. You will be able to stick for a long time to strictly logical, fully normalized schema design.
What would be other good approaches to make this as fast as possible?
The best first step is clustering on stock. Insertion speed is of no consequence at all until you are looking at multiple records inserted per second - I don't see anything anywhere near that activity here. This should get you close to maximum efficiency because it will efficiently read every record associated with a stock, and that seems to be your most common index. Any further optimization needs to be accomplished based on testing.
A million records really isn't that big. It does sound like it's taking too long to search though - is the column you're searching against indexed?
As ever, the first port of call should be the SQL profiler and query plan evaluator. Ask SQL Server what it's going to do with the queries you're interested in. I believe you can even ask it to suggest changes such as extra indexes.
I wouldn't start getting into partitioning etc just yet - as you say, it should all comfortably sit in memory at the moment, so I suspect your problem is more likely to be a missing index.
Check your execution plan on that query first. Make sure your indexes are being used. I've found that a million records is not a lot. To give some perspective, we had an inventory table with 30 million rows in it, and our entire query, which joined tons of tables and did lots of calculations, could run in under 200 ms. We found that on a quad-proc 64-bit server we could have significantly more records, so we never bothered partitioning.
You can use SQL Profiler to see the execution plan, or just run the query from SQL Management Studio or Query Analyzer.
Re-evaluate the indexes... that's the most important part. The size of the data doesn't really matter - well, it does, but not entirely for speed purposes.
My recommendation is to rebuild the indexes for that table and make a composite one for the columns you'll need the most. Now that you have only a few records, play with different indexes; otherwise it'll get quite annoying to try new things once you have all the historical data in the table.
After you do that, review your query, make the query plan evaluator your friend, and check whether the engine is using the right index.
I just read your last post; there's one thing I don't get: you are querying the table while you insert data? At the same time? What for? By inserting, do you mean one record or hundreds of thousands? How are you inserting? One by one?
But again, the key here is the indexes; don't mess with partitioning and such yet... especially with a million records - that's nothing. I have tables with 150 million records, and returning 40k specific records takes the engine about 1500 ms...
I work for a school district and we have to track attendance for each student. It's how we make our money. My table that holds the daily attendance mark for each student is currently 38.9 million records. I can pull up a single student's attendance very quickly from this. We keep 4 indexes (including the primary key) on this table. Our clustered index is student/date, which keeps all of a student's records ordered by that. We've taken a hit on inserts to this table in the event that an old record for a student is inserted, but it is a worthwhile trade-off for our purposes.
With regards to select speed, I would certainly take advantage of caching in your circumstance.
You've mentioned that your primary key is a compound on (Stock, Date), and clustered. This means the table is organised by Stock and then by Date. Whenever you insert a new row, it has to go into the middle of the table, and this can cause other rows to be pushed out to other pages (page splits).
I would recommend trying to reverse the primary key to (Date, Stock) and adding a non-clustered index on Stock to facilitate quick lookups for a specific Stock. This will allow inserts to always happen at the end of the table (assuming you're inserting in date order), won't affect the rest of the table, and gives less chance of page splits.
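A sketch of that change (the table, constraint and index names are assumptions):
ALTER TABLE dbo.Quotes DROP CONSTRAINT PK_Quotes;
-- clustered primary key on (Date, Stock): new rows append at the end of the table
ALTER TABLE dbo.Quotes ADD CONSTRAINT PK_Quotes PRIMARY KEY CLUSTERED ([Date], Stock);
-- non-clustered index so per-stock lookups can still seek efficiently
CREATE NONCLUSTERED INDEX IX_Quotes_Stock ON dbo.Quotes (Stock, [Date]);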
The execution plan shows it's using the clustered index quite fine, but I forgot an extremely important fact: I'm still inserting data! The inserts are probably locking the table too often. Is there a way we can see this bottleneck?
The execution plan doesn't seem to show anything about lock issues.
Right now this data is only historical; when the importing process is finished the inserts will stop and become much less frequent. But I will soon have a larger table for real-time data that will suffer from this constant-insert problem and will be bigger than this table. So any approach to optimizing this kind of situation is very welcome.
Another solution would be to create a historical table for each year, put all these tables in a historical database, fill them in and then create the appropriate indexes for them. Once you are done with this you won't have to touch them ever again. Why would you have to keep inserting data? To query all those tables you just "union all" them :p
The current-year table should be very different from these historical tables. From what I understood, you are planning to insert records on the go? I'd plan something different, like doing a bulk insert or something similar every now and then during the day. Of course, all this depends on what you want to do.
The problem here seems to be in the design. I'd go for a new design. The one you have now, from what I understand, is not suitable.
Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quote records for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is that second part true?
Indexes in SQL Server are always sorted by the column order in the index. So an index on [stock, date] will first sort on stock, then within stock on date. An index on [date, stock] will first sort on date, then within date on stock.
When doing a query, you should always include the first column(s) of an index in the WHERE part, else the index cannot be efficiently used.
For your specific problem: if date-range queries for stocks are the most common usage, then put the primary key on [date, stock], so the data will be stored sequentially by date on disk and you should get the fastest access. Build up other indexes as needed. Do an index rebuild/statistics update after inserting lots of new data.
