I’m working on an HR system and I need to keep a tracking record of all the views on the profile of a user, because each recruiter will have limited views on candidate profiles. My main concern is scalability of my approach, which is the following:
I have created a table with two columns: the ID of the candidate who was viewed and the ID of the recruiter who viewed them. Each view counts only once, so if a recruiter views the same candidate again, no record is inserted.
Based on the number of recruiters and candidates in the database, I can safely say that my table will grow very quickly. To make things worse, I have to query the table on every request, because the UI must show the number of candidates that the recruiter has viewed. Which would be the best approach considering scalability?
I'll explain the case a little bit more:
We have Companies and every Company has many Recruiters.
ViewsAssigner_Identifier Table
Id: int PK
Company_Id: int FK NON-CLUSTERED
Views_Assigned: int NON-CLUSTERED
Date: date NON-CLUSTERED
CandidateViewCounts Table
Id: int PK
Recruiter_id: int FK NON-CLUSTERED ?
Candidate_id: int FK NON-CLUSTERED ?
ViewsAssigner_Identifier_Id: int FK NON-CLUSTERED ?
DateViewed: date NON-CLUSTERED
I will query a Select of all [Candidate_id] by [ViewsAssigner_Identifier_id]
We want to search by Company, not by Recruiter, because all the Recruiters in the same company share the [Views_Assigned] allotted to the Company. In other words, the first Recruiter who views a Candidate is stored in the "CandidateViewCounts" table, and subsequent Recruiters who view the same Candidate are not stored.
Result:
I need to retrieve a list of all the [Candidate_Id] values by [ViewsAssigner_Identifier_id], and then I can count these Candidate IDs.
Query Example:
SELECT [Candidate_Id] FROM [dbo].[CandidateViewCounts] WHERE [ViewsAssigner_Identifier_id] = 1
Any recommendations?
If you think that each recruiter might view each candidate once, you're talking about a max of 60,000 * 2,000,000 rows. That's a large number, but they aren't very wide rows; as ErikE explained you will be able to get many rows on each page, so the total I/O even for a table scan will not be quite as bad as it sounds.
That said, for maintenance reasons, as long as you don't search by CandidateID, you may want to partition this table on RecruiterID. For example, your partition scheme could have one partition for RecruiterID between 1 and 2000, one partition for 2001 -> 4000, etc. This way you max out the number of rows per partition and can plan file space accordingly (you can put each partition on its own filegroup, separating I/O).
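The range partitioning described above could be sketched roughly as follows. This is only an illustration: the object names (pfRecruiterRange, psRecruiterRange, RecruiterCandidateView_Partitioned) are hypothetical, only three boundaries are shown, and it assumes an edition of SQL Server that supports table partitioning (Enterprise in the 2005/2008 releases).

```sql
-- Hypothetical names; boundaries follow the 2,000-recruiter ranges above.
-- RANGE LEFT means each boundary value belongs to the partition on its left:
-- partition 1 = RecruiterID <= 2000, partition 2 = 2001..4000, and so on.
CREATE PARTITION FUNCTION pfRecruiterRange (int)
AS RANGE LEFT FOR VALUES (2000, 4000, 6000);
GO

-- Everything is mapped to PRIMARY here to keep the sketch runnable;
-- in practice each partition could be mapped to its own filegroup.
CREATE PARTITION SCHEME psRecruiterRange
AS PARTITION pfRecruiterRange ALL TO ([PRIMARY]);
GO

-- The partitioning column must be part of the clustered primary key.
CREATE TABLE dbo.RecruiterCandidateView_Partitioned
(
    RecruiterID int NOT NULL,
    CandidateID int NOT NULL,
    CONSTRAINT PK_RecruiterCandidateView_Partitioned
        PRIMARY KEY CLUSTERED (RecruiterID, CandidateID)
) ON psRecruiterRange (RecruiterID);
GO
```

New ranges can later be added with ALTER PARTITION FUNCTION ... SPLIT RANGE as recruiter IDs grow.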
Another point is this: if you are looking to run queries such as "how many views on this candidate (and we don't care which recruiters)?" or "how many candidates has this recruiter viewed (and we don't care which candidates)?" then you may consider indexed views. E.g.
-- Indexed views need a GROUP BY for the aggregate and a name for every column.
CREATE VIEW dbo.RecruiterViewCounts
WITH SCHEMABINDING
AS
SELECT RecruiterID, ViewCount = COUNT_BIG(*)
FROM dbo.tablename
GROUP BY RecruiterID;
GO
CREATE UNIQUE CLUSTERED INDEX pk_rvc ON dbo.RecruiterViewCounts(RecruiterID);
GO
CREATE VIEW dbo.CandidateViewCounts
WITH SCHEMABINDING
AS
SELECT CandidateID, ViewCount = COUNT_BIG(*)
FROM dbo.tablename
GROUP BY CandidateID;
GO
CREATE UNIQUE CLUSTERED INDEX pk_cvc ON dbo.CandidateViewCounts(CandidateID);
GO
Now, these clustered indexes are expensive to maintain, so you'll want to test your write workload against them. But they should make those two queries extremely, extremely fast without having to seek into your large table and potentially read multiple pages for a very busy recruiter or a very popular candidate.
If your table is clustered on the RecruiterID you will have a very fast seek and in my opinion no performance issue at all.
In such a narrow table as you've described, finding the profiles viewed by any one recruiter should require a single read 99+% of the time. (Assume fillfactor = 80 with minimal page splits; row width for two int columns = 8 bytes + row overhead, call that 20 bytes; 8040 or so bytes per page; say they get 4 views and average 2.5 rows per recruiter = ballpark 128 recruiters per data page.) The total number of rows in the table is irrelevant because the query can seek into the clustered index. Yes, it has to traverse the tree, but it is still going to be very fast. There is no better way so long as the views have to be counted once per candidate. If it were simply total views, you could keep a count instead.
I don't think you have much to worry about. If you are concerned that the system could grow to tens of thousands of requests per second and you'll get some kind of limiting hotspot of activity, you will be okay as long as the recruiters visiting at any one point in time do not coincidentally have sequential IDs assigned to them.
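The back-of-envelope estimate above can be reproduced with a few lines of arithmetic. The figures are the ballpark assumptions from the paragraph (usable bytes per page, fill factor, bytes per row, rows per recruiter), not measured values, and the variable names are made up for this sketch.

```sql
-- Ballpark inputs taken from the estimate above; all are assumptions.
DECLARE @page_bytes         decimal(9,2) = 8040;  -- usable bytes per data page (approximate)
DECLARE @fill_factor        decimal(9,2) = 0.80;  -- fillfactor = 80
DECLARE @row_bytes          decimal(9,2) = 20;    -- two ints plus row overhead, rounded up
DECLARE @rows_per_recruiter decimal(9,2) = 2.5;   -- average rows stored per recruiter

SELECT
    FLOOR(@page_bytes * @fill_factor / @row_bytes)                       AS RowsPerPage,       -- 321
    FLOOR(@page_bytes * @fill_factor / @row_bytes / @rows_per_recruiter) AS RecruitersPerPage; -- 128
```

With well over a hundred recruiters per page, one recruiter's rows will almost always sit on a single page, which is where the "single read" claim comes from.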
The big principle here is that you want to avoid anything that would have to scan the table top to bottom. You can avoid that as long as you always search by RecruiterID or RecruiterID, CandidateID. The moment you want to search by CandidateID alone, you will be in trouble without an additional index. Adding a nonclustered index on CandidateID will double the space your table takes (half for the clustered, half for the nonclustered) but that is no big deal. Then searching by CandidateID will be just as fast, because the nonclustered index will properly cover the query and no bookmark lookup will be required.
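As a rough sketch of the layout described above: the table is clustered on (RecruiterID, CandidateID), with one extra nonclustered index for searches by CandidateID alone. The table and index names are hypothetical, and the parameter values are placeholders.

```sql
-- Hypothetical table; clustered on (RecruiterID, CandidateID) as discussed above.
CREATE TABLE dbo.RecruiterCandidateView
(
    RecruiterID int NOT NULL,
    CandidateID int NOT NULL,
    CONSTRAINT PK_RecruiterCandidateView
        PRIMARY KEY CLUSTERED (RecruiterID, CandidateID)
);
GO

-- The nonclustered index stores CandidateID plus the clustering key,
-- so it contains both columns and covers queries by CandidateID.
CREATE NONCLUSTERED INDEX IX_RecruiterCandidateView_CandidateID
    ON dbo.RecruiterCandidateView (CandidateID);
GO

DECLARE @RecruiterID int = 1, @CandidateID int = 42;  -- placeholder values

-- Seeks into the clustered index.
SELECT COUNT(*) AS CandidatesViewed
FROM dbo.RecruiterCandidateView
WHERE RecruiterID = @RecruiterID;

-- Seeks into the nonclustered index; no lookup into the clustered index is needed.
SELECT COUNT(*) AS TimesViewed
FROM dbo.RecruiterCandidateView
WHERE CandidateID = @CandidateID;
```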
Update
This is a response to the substantially new information you provided in the update to your question.
First, your CandidateViewCounts table is named incorrectly. It's something more like CandidateFirstViewedByRecruiterAtCompany. It can only indirectly answer the question you have, which is about the Company, not the Recruiters, so in my opinion the scenario you're describing really calls for a CompanyCandidateViewed table:
CompanyID int FK
CandidateID int FK
PRIMARY KEY CLUSTERED (CompanyID, CandidateID)
Store the CompanyID of the recruiter who viewed the candidate, and the CandidateID. Simple! Now my original answer still works for you, simply swap RecruiterID with CompanyID.
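A minimal sketch of that table, together with a "record the view only once" insert and the per-company count, might look like this. The foreign keys are left out because the parent tables are not shown in the question, and the parameter values and lock hints are illustrative assumptions rather than part of the original answer.

```sql
-- Hypothetical DDL for the table outlined above (FKs omitted: parent tables unknown).
CREATE TABLE dbo.CompanyCandidateViewed
(
    CompanyID   int NOT NULL,
    CandidateID int NOT NULL,
    CONSTRAINT PK_CompanyCandidateViewed
        PRIMARY KEY CLUSTERED (CompanyID, CandidateID)
);
GO

DECLARE @CompanyID int = 1, @CandidateID int = 42;  -- placeholder values

-- Record the view only if this company has not seen the candidate before.
-- UPDLOCK + HOLDLOCK is one way to keep two concurrent recruiters from
-- both passing the NOT EXISTS check; the primary key is the final safeguard.
INSERT INTO dbo.CompanyCandidateViewed (CompanyID, CandidateID)
SELECT @CompanyID, @CandidateID
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.CompanyCandidateViewed WITH (UPDLOCK, HOLDLOCK)
    WHERE CompanyID = @CompanyID
      AND CandidateID = @CandidateID
);

-- Number of distinct candidates the company has used views on:
-- a seek on the leading column of the clustered key.
SELECT COUNT(*) AS CandidatesViewed
FROM dbo.CompanyCandidateViewed
WHERE CompanyID = @CompanyID;
```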
If you really do want to track which recruiters viewed which candidates, then do so in a RecruiterCandidateViewed table (and store all recruiter->candidate views). That can be queried later or in a data warehouse. But your real-time OLTP needs will be met by the table described above.
Also, I would like to mention that it is possible you are putting identity columns in tables that don't need them. You should avoid identity columns unless the column is going to be used as an FK in another table (and not always even then, as sometimes in proper data modeling you must use composite keys in FKs in order to prevent possible denormalization). For example, your ViewsAssigner_Identifier table seems to me to need some help (of course I don't have all the information here and could be off base). If the Company and the Date are what's most important about that table, make them together the clustered PK and get rid of the identity column if at all possible.
We are using SQL Server 2008 Enterprise version. We have a large table FooTable (billions of rows).
FooTable Columns: site:varchar(7), device:varchar(7), time:datetime, value:float
Every day we insert millions of new rows.
We created a clustered index for site, device and time (in order).
As we can see, site and device are relatively constant, but time will keep changing as time goes by.
The queries executed against this table would be:
INSERT INTO FooTable SELECT * FROM #BULK_INSERTED_TEMP_TABLE
SELECT value FROM FooTable WHERE site = 'fooSite' AND device = 'fooDevice' AND time = 'fooTime'
SELECT SUM(value) FROM FooTable WHERE site = 'fooSite' AND device = 'fooDevice' AND time > 'startTime' AND time <= 'endTime'
What is the best clustered index design?
There's no one true answer for the best clustered index design. In general, I look at clustered indexes two ways. First, they store the data, so you need to consider them from the data storage aspect. Are you creating a cluster that is likely to be constantly splitting pages as new data arrives? Second, because they store data, you should consider the queries that are going to be used most frequently to retrieve the data. Will those queries be able to use the clustered index to get at the data?
Knowing next to nothing about your setup, do you have an optimal choice for the clustered index? I would say possibly not. What you've defined is a valid primary key candidate. However, the first two columns group the data, and the ever-increasing time value is then inserted within each (site, device) group, so new rows land all over the table rather than at the end. That suggests you're going to be looking at lots of page splits. That may or may not be an issue, but it's something you'll need to monitor.
I'm working in SQL Server 2008 R2
As part of a complete schema rebuild, I am creating a table that will be used to store advertising campaign performance by zipcode by day. The table setup I'm thinking of is something like this:
CREATE TABLE [dbo].[Zip_Perf_by_Day] (
[CampaignID] int NOT NULL,
[ZipCode] int NOT NULL,
[ReportDate] date NOT NULL,
[PerformanceMetric1] int NOT NULL,
[PerformanceMetric2] int NOT NULL,
[PerformanceMetric3] int NOT NULL,
and so on... )
Now the combination of CampaignID, ZipCode, and ReportDate is a perfect natural key, they uniquely identify a single entity, and there shouldn't be 2 records for the same combination of values. Also, almost all of my queries to this table are going to be filtered on 1 or more of these 3 columns. However, when thinking about my clustered index for this table, I run into a problem. These 3 columns do not increment over time. ReportDate is OK, but CampaignID and Zipcode are going to be all over the place while inserting rows. I can't even order them ahead of time because results come in from different sources during the day, so data for CampaignID 50000 might be inserted at 10am, and CampaignID 30000 might come in at 2pm. If I use the PK as my clustered index, I'm going to run into fragmentation problems.
So I was thinking that I need an Identity ID column, let's call it PerformanceID. I can see no case where I would ever use PerformanceID in either the select list or where clause of any query. Should I use PerformanceID as my PK and clustered index, and then set up a unique constraint and non-clustered indexes on CampaignID, ZipCode, and ReportDate? Should I keep those 3 columns as my PK and just have my clustered index on PerformanceID? (<- This is the option I'm leaning towards right now) Is it OK to have a slightly fragmented table? Is there another option I haven't considered? I am looking for what would give me the best read performance, while not completely destroying write performance.
Some actual usage information. This table will get written to in batches. Feeds come in at various times during the day, they get processed, and this table gets written to. It's going to get heavily read, as by-day performance is important around here. When I fill this table, it should have about 5 million rows, and will grow at a pace of about 8,000 - 10,000 rows per day.
In my experience, you probably do want to use another INT Identity field as your clustered index key. I would also add a UNIQUE constraint to that one (it helps with execution plans).
A big part of the reason is space - if you use a 3 field key for your clustered index, you will have all 3 fields in every row of every non-clustered index on that table (as your clustered index row identifier). If you only plan to have a couple of indexes that isn't a big deal, but if you have a lot of them it can make a big difference. The more data per row, the more pages needed and the more IO you have.
Fragmentation is a very real issue that can cause major performance problems, especially as the table grows.
Having that additional cluster key will also mean writes will be faster for your inserts. All new rows will go to the end of your table, which means existing rows won't be touched or rearranged.
If you want to use those three fields as a FK in other tables, then by all means have them as your PK.
For the most part it doesn't really matter if you ever directly reference your clustered index key. As long as it is narrow, increasing, and unique you should be in good shape.
EDIT:
As Damien points out in the comments, if you will be filtering on single fields of your PK, you will need to have an index on each one (or always use the first field in the covering index).
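One way the surrogate-key layout discussed in this answer could look is sketched below. The constraint and index names are made up, only one metric column is shown, and the two single-column indexes are there only to illustrate Damien's point about filters that skip the leading key column.

```sql
-- Sketch only: narrow, ever-increasing clustering key plus the natural key as PK.
CREATE TABLE dbo.Zip_Perf_by_Day
(
    PerformanceID      int IDENTITY(1,1) NOT NULL,
    CampaignID         int  NOT NULL,
    ZipCode            int  NOT NULL,
    ReportDate         date NOT NULL,
    PerformanceMetric1 int  NOT NULL,
    -- The natural key stays the primary key, enforced by a nonclustered index.
    CONSTRAINT PK_Zip_Perf_by_Day
        PRIMARY KEY NONCLUSTERED (CampaignID, ZipCode, ReportDate)
);
GO

-- New rows are appended at the end of the clustered index.
CREATE UNIQUE CLUSTERED INDEX CIX_Zip_Perf_by_Day_PerformanceID
    ON dbo.Zip_Perf_by_Day (PerformanceID);
GO

-- Filters on ZipCode or ReportDate alone cannot seek on the PK index,
-- whose leading column is CampaignID, so they get their own indexes.
CREATE NONCLUSTERED INDEX IX_Zip_Perf_by_Day_ZipCode
    ON dbo.Zip_Perf_by_Day (ZipCode);
CREATE NONCLUSTERED INDEX IX_Zip_Perf_by_Day_ReportDate
    ON dbo.Zip_Perf_by_Day (ReportDate);
GO
```

Each nonclustered index carries the 4-byte PerformanceID as its row locator instead of the three-column natural key, which is the space argument made above.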
On the information given, (ReportDate, CampaignID, ZipCode) or (ReportDate, ZipCode, CampaignID) seem like better candidates for the clustered index than a surrogate key. Defragmentation would be a potential concern if the time taken to rebuild indexes became prohibitive, but given the sizes I would expect for this table (10s or 1000s rather than 1,000,000s of rows per day) that seems unlikely to be an issue.
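For comparison, the natural-key alternative suggested here could be sketched as follows. The table name is hypothetical (chosen so it does not clash with any other sketch), and the date range in the sample query is a placeholder.

```sql
-- Sketch: cluster directly on the natural key, leading with ReportDate so that
-- each day's batches land near the end of the index rather than throughout it.
CREATE TABLE dbo.Zip_Perf_by_Day_Natural
(
    CampaignID         int  NOT NULL,
    ZipCode            int  NOT NULL,
    ReportDate         date NOT NULL,
    PerformanceMetric1 int  NOT NULL,
    CONSTRAINT PK_Zip_Perf_by_Day_Natural
        PRIMARY KEY CLUSTERED (ReportDate, CampaignID, ZipCode)
);
GO

-- By-day reads become a range seek on the clustered index.
SELECT CampaignID, ZipCode, PerformanceMetric1
FROM dbo.Zip_Perf_by_Day_Natural
WHERE ReportDate >= '20110101'
  AND ReportDate <  '20110108';
```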
If I understood all you have written correctly you are opting out of natural clustering due to fragmentation penalties.
For this purpose you consider meaningless IDs which will:
avoid insert penalties for clustered index when inserting out of order batches (great for write performance)
guarantee that your data is fragmented for reads that put conditions on the natural key (not so good for read performance)
JNK points out that fragmentation can be a real issue. However, you need to establish a baseline against which you will measure, and you need to establish whether reading or writing is more important to you (or how important they are in measurable terms).
There's nothing that will beat a good test case - so finally that is the best recommendation I can give.
With databases it is often relatively easy to build scripts that will create real benchmarks with real workloads and realistic data quantities.
The table in question is part of a database that a vendor's software uses on our network. The table contains metadata about files. The schema of the table is as follows
Metadata
ResultID (PK, int, not null)
MappedFieldname (char(50), not null)
Fieldname (PK, char(50), not null)
Fieldvalue (text, null)
There is a clustered index on ResultID and Fieldname. This table typically contains millions of rows (in one case, it contains 500 million). The table is populated by 24 workers running 4 threads each when data is being "processed". This results in many non-sequential inserts. Later after processing, more data is inserted into this table by some of our in-house software. The fragmentation for a given table is at least 50%. In the case of the largest table, it is at 90%. We do not have a DBA. I am aware we desperately need a DB maintenance strategy. As far as my background, I'm a college student working part time at this company.
My question is this: is a clustered index the best way to go about this? Should another index be considered? Are there any good references for this type of task and similar ad-hoc DBA tasks?
The indexing strategy entirely depends on how you query the table and how much performance you need to get out of the respective queries.
A clustered index can force re-sorting rows physically (on disk) when out-of-sequence inserts are made (this is called "page split"). In a large table with no free space on the index pages, this can take some time.
If you are not absolutely required to have a clustered index spanning two fields, then don't. If it is more like a kind of a UNIQUE constraint, then by all means make it a UNIQUE constraint. No re-sorting is required for those.
Determine what the typical query against the table is, and place indexes accordingly. The more indexes you have, the slower data changes (INSERTs/UPDATEs/DELETEs) will go. Don't create too many indexes, e.g. on fields that are unlikely to be filtered/sorted on.
Create combined indexes only on fields that are filtered/sorted on together, typically.
Look hard at your queries - the ones that hit the table for data. Will the index serve? If you have an index on (ResultID, FieldName) in that order, but you are querying for the possible ResultID values for a given Fieldname, it is likely that the DBMS will ignore the index. By contrast, if you have an index on (FieldName, ResultID), it will probably use the index - certainly for simple value lookups (WHERE FieldName = 'abc'). In terms of uniqueness, either index works well; in terms of query optimization, there is (at least potentially) a huge difference.
Use EXPLAIN (or your DBMS's equivalent, such as an execution plan in SQL Server) to see how your queries are being handled by your DBMS.
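To make the column-order point concrete, here is a small sketch that assumes the DBMS is SQL Server and that the vendor table is dbo.Metadata with the columns listed in the question. The index name is hypothetical, and adding indexes to a vendor's schema may not be supported, so treat this as an experiment for a test copy. SQL Server has no EXPLAIN statement; SET SHOWPLAN_TEXT is one of its equivalents.

```sql
-- Hypothetical index with Fieldname leading, for lookups by field name.
CREATE NONCLUSTERED INDEX IX_Metadata_Fieldname_ResultID
    ON dbo.Metadata (Fieldname, ResultID);
GO

-- While SHOWPLAN_TEXT is ON, statements are compiled but not executed;
-- the estimated plan is returned instead of the rows.
SET SHOWPLAN_TEXT ON;
GO

SELECT ResultID
FROM dbo.Metadata
WHERE Fieldname = 'abc';
GO

SET SHOWPLAN_TEXT OFF;
GO
```

If the plan shows an index seek on the new index rather than a scan of the clustered index, the column order is doing its job for this query.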
Clustered vs non-clustered indexing is usually a second-order optimization effect in the DBMS. If you have the index correct, there is a small difference between clustered and non-clustered index (with a bigger update penalty for a clustered index as compensation for slightly smaller select times). Make sure everything else is optimized before worrying about the second-order effects.
The clustered index is OK as far as I see. Regarding other indexes you will need to provide typical SQL queries that operate on this table. Just creating an index out of the blue is never a good idea.
You're talking about fragmentation and indexing, does it mean that you suspect that query execution slows down? Or do you simply want to shrink/defragment the database/index?
It is a good idea to have a task to defragment indexes from time to time during off-hours, though you have to consider that with frequent/random inserts it does not hurt to have some spare space in the table to prevent page splits (which do affect performance).
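A scheduled maintenance step along these lines might look like the sketch below. It assumes SQL Server and a table named dbo.Metadata; the fill factor of 80 is only an example value that should be tuned by measurement.

```sql
-- Lighter option: compacts and reorders leaf pages in place. It runs online
-- and can be interrupted without losing the work already done.
ALTER INDEX ALL ON dbo.Metadata REORGANIZE;
GO

-- Heavier option for badly fragmented indexes: rebuild and leave 20% free
-- space on each leaf page so random inserts cause fewer page splits.
ALTER INDEX ALL ON dbo.Metadata REBUILD WITH (FILLFACTOR = 80);
GO

-- REORGANIZE does not refresh statistics, so update them separately.
UPDATE STATISTICS dbo.Metadata;
GO
```

Normally only one of the two ALTER INDEX statements would run in a given maintenance window, chosen by how fragmented the index is.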
I am aware we desperately need a DB maintenance strategy.
+1 for identifying that need
As far as my background, I'm a college student working part time at this company
Keep studying, gain experience, but get an experienced consultant in in the meantime.
The table is populated by 24 workers running 4 threads each
I presume this is pretty mission critical during the working day, and downtime is bad news? If so don't clutz with it.
There is a clustered index on ResultID and Fieldname
Is ResultID the first column in the PK, as you indicate?
If so I'll bet that it is insufficiently selective and, depending on what the needs are of the queries, the order of the PK fields should be swapped (notwithstanding that this compound key looks to be a poor choice for the clustered PK)
What's the result of:
SELECT COUNT(*), COUNT(DISTINCT ResultID) FROM MyTable
If the first count is, say, 4 x as big as the second, or more, you will most likely be getting scans in preference to seeks, because of the low selectivity of ResultID, and some simple changes will give huge performance improvements.
Also, Fieldname is quite wide (50 chars) so any secondary indexes will have 50 + 4 bytes added to every index entry. Are the fields really CHAR rather than VARCHAR?
Personally I would consider increasing the density of the leaf pages. At 90% you will only leave a few gaps - maybe one per page. But with a large table of 500 million rows the higher packing density may mean fewer levels in the tree, and thus fewer seeks for retrieval. Against that, almost every insert for a given page will require a page split. This would favour inserts that are clustered, so may not be appropriate (given that your insert data is probably not clustered). Like many things, you'd need to run a test to establish what index key density works best. SQL Server has tools to help analyse how queries are being parsed, whether they are being cached, how many scans of the table they cause, which queries are "slow running", and so on.
Get a consultant in to take a look and give you some advice. This isn't a question where the answers here are going to give you a safe solution to implement.
You really REALLY need to have some carefully thought through maintenance policies for tables that have 500 million rows and shed-loads of inserts daily. Sorry, but I have enormous frustration with companies that get into this state.
The table needs defragmenting (your options will become fewer if you don't have a clustered index, so keep that until you decide that there is a better candidate). "Online" defragmentation methods will have modest impact on performance, and can chug away - and can safely be aborted if they overrun time / CPU constraints [although that will most likely take some programming]. If you have a "quiet" slot then use it for table defragmentation and updating the statistics on indexes. Don't wait until the weekend to try to do all tables in one go - do as much/many as you can during any quiet time daily (during the night presumably).
Defragmenting the tables is likely to lead to a huge increase in transaction log usage, so make sure that any TLogs are backed up frequently (we have a 10 minute TLog backup policy, which we increase to every minute during table defragging so that the defragging process doesn't become the definition of required TLog space!)
I am working on a database that usually uses GUIDs as primary keys.
By default SQL Server places a clustered index on primary key columns. I understand that this is a silly idea for GUID columns, and that non-clustered indexes are better.
What do you think - should I get rid of all the clustered indexes and replace them with non-clustered indexes?
Why wouldn't SQL's performance tuner offer this as a recommendation?
A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.
As to why it's not offered as a recommendation, I'd suggest the tuner is aware of this fact.
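For illustration, switching an existing GUID primary key from clustered to nonclustered could be sketched like this. The table, column, and constraint names are hypothetical, and any foreign keys that reference the primary key would have to be dropped and recreated around the change.

```sql
-- Hypothetical starting point: PRIMARY KEY defaults to a clustered index.
CREATE TABLE dbo.Customer
(
    CustomerGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_CustomerGuid DEFAULT NEWID(),
    Name         nvarchar(100) NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY (CustomerGuid)
);
GO

-- Drop the clustered primary key; the table becomes a heap for the moment.
ALTER TABLE dbo.Customer DROP CONSTRAINT PK_Customer;
GO

-- Recreate the primary key, this time backed by a nonclustered index.
ALTER TABLE dbo.Customer
    ADD CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerGuid);
GO
```

On a large table both steps rewrite a lot of data, so this is something to try on a copy first.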
You almost certainly want to establish a clustered index on every table in your database.
If a table does not have a clustered index it is what is referred to as a "Heap" and performance of most types of common queries is less for a heap than for a clustered index table.
Which fields the clustered index should be established on depends on the table itself and on the expected usage patterns of queries against it. In almost every case you want the clustered index to be on a column or combination of columns that is unique (i.e., an alternate key), because if it isn't, SQL Server will add a unique value to the end of whatever fields you select anyway. Candidates for the clustered index are columns that queries will frequently use to select or filter multiple records. For example: a sales transactions table where your application frequently requests transactions by product ID; or, even better, an invoice details table, where in almost every case you retrieve all the detail records for a specific invoice; or an invoice table where you often retrieve all the invoices for a particular customer. This is true whether you are selecting large numbers of records by a single value or by a range of values.
The order of the columns in the clustered index is critical. The first column defined in the index should be the column that will be selected or filtered on first in expected queries.
The reason for all this is based on understanding the internal structure of a database index. These indices are called balanced-tree (B-Tree) indices. They are kind of like a binary tree, except that each node in the tree can have an arbitrary number of entries (and child nodes) instead of just two. What makes a clustered index different is that the leaf nodes in a clustered index are the actual physical disk data pages of the table itself, whereas the leaf nodes of a non-clustered index just "point" to the table's data pages.
When a table has a clustered index, therefore, the tables data pages are the leaf level of that index, and each one has a pointer to the previous page and the next page in the index order (they form a doubly-linked-list).
So if your query requests a range of rows that is in the same order as the clustered index... the processor only has to traverse the index once (or maybe twice), to find the start page of the data, and then follow the linked list pointers to get to the next page and the next page, until it has read all the data pages it needs.
For a non-clustered index, it has to traverse the index once for every row it retrieves...
NOTE: EDIT
To address the sequential issue for Guid Key columns, be aware that SQL2k5 has NEWSEQUENTIALID() that does in fact generate Guids the "old" sequential way.
Or you can investigate Jimmy Nilsson's COMB GUID algorithm, which is implemented in client-side code:
COMB Guids
The problem with clustered indexes on a GUID field is that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the record into the middle of the table.
However, with integer-based clustered indexes, the integers are normally sequential (like with an IDENTITY spec), so they just get added to the end and no data needs to be moved around.
On the other hand, clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. If you need to be able to SELECT records quickly, then use a clustered index... the INSERT speed will suffer, but the SELECT speed will be improved.
While clustering on a GUID is normally a bad idea, be aware that GUIDs can under some circumstances cause fragmentation even in non-clustered indexes.
Note that if you're using SQL Server 2005, the newsequentialid() function produces sequential GUIDs. This helps to prevent the fragmentation problem.
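A minimal sketch of how NEWSEQUENTIALID() is typically wired up follows. The table and constraint names are hypothetical. The function can only be used in a DEFAULT constraint on a uniqueidentifier column, so the generated value is read back with an OUTPUT clause.

```sql
-- Hypothetical table: the key is generated by a NEWSEQUENTIALID() default.
CREATE TABLE dbo.OrderHeader
(
    OrderGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_OrderHeader_OrderGuid DEFAULT NEWSEQUENTIALID(),
    PlacedAt  datetime NOT NULL,
    CONSTRAINT PK_OrderHeader PRIMARY KEY CLUSTERED (OrderGuid)
);
GO

-- Let the default generate the key and return it to the caller.
INSERT INTO dbo.OrderHeader (PlacedAt)
OUTPUT inserted.OrderGuid
VALUES (GETDATE());
```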
I suggest using a SQL query like the following to measure fragmentation before making any decisions (excuse the non-ANSI syntax):
SELECT OBJECT_NAME (ips.[object_id]) AS 'Object Name',
si.name AS 'Index Name',
ROUND (ips.avg_fragmentation_in_percent, 2) AS 'Fragmentation',
ips.page_count AS 'Pages',
ROUND (ips.avg_page_space_used_in_percent, 2) AS 'Page Density'
FROM sys.dm_db_index_physical_stats
(DB_ID ('MyDatabase'), NULL, NULL, NULL, 'DETAILED') ips
CROSS APPLY sys.indexes si
WHERE si.object_id = ips.object_id
AND si.index_id = ips.index_id
AND ips.index_level = 0;
If you are using NewId(), you could switch to NewSequentialId(). That should help the insert perf.
Yes, there's no point in having a clustered index on a random value.
You probably do want clustered indexes SOMEWHERE in your database. For example, if you have an "Author" table and a "Book" table with a foreign key to "Author", and if you have a query in your application that says, "select ... from Book where AuthorId = ..", then you would be reading a set of books. It will be faster if those books are physically next to each other on the disk, so that the disk head doesn't have to bounce around from sector to sector gathering all the books of that author.
So, you need to think about your application, the ways in which it queries the database.
Make the changes.
And then test, because you never know...
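The Author/Book example could be sketched like this. The column names, types, and constraint names are assumptions made for illustration, and the author ID in the query is a placeholder.

```sql
CREATE TABLE dbo.Author
(
    AuthorId int NOT NULL CONSTRAINT PK_Author PRIMARY KEY CLUSTERED,
    Name     nvarchar(200) NOT NULL
);
GO

-- The GUID primary key is kept, but enforced by a nonclustered index.
CREATE TABLE dbo.Book
(
    BookId   uniqueidentifier NOT NULL
        CONSTRAINT PK_Book PRIMARY KEY NONCLUSTERED,
    AuthorId int NOT NULL
        CONSTRAINT FK_Book_Author FOREIGN KEY REFERENCES dbo.Author (AuthorId),
    Title    nvarchar(400) NOT NULL
);
GO

-- Cluster on AuthorId so that one author's books are stored together.
-- The index is not unique, so SQL Server adds a hidden uniquifier to duplicates.
CREATE CLUSTERED INDEX CIX_Book_AuthorId ON dbo.Book (AuthorId);
GO

-- Reads one contiguous range of the clustered index.
SELECT Title
FROM dbo.Book
WHERE AuthorId = 7;
```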
As most have mentioned, avoid using a random identifier in a clustered index; you will not gain the benefits of clustering. Actually, you will experience an increased delay. Getting rid of all of them is solid advice. Also keep in mind newsequentialid() can be extremely problematic in a multi-master replication scenario. If database A and B both invoke newsequentialid() prior to replication, you will have a conflict.
Yes you should remove the clustered index on GUID primary keys for the reasons Galwegian states above. We have done this on our applications.
It depends if you're doing a lot of inserts, or if you need very quick lookup by PK.